Serbia - AI Workloads Engineer


Description

NextSilicon is reimagining high-performance computing (HPC & AI). Our accelerated compute solutions leverage intelligent adaptive algorithms to vastly accelerate supercomputers, driving them forward into a new generation. We have developed a novel software-defined hardware architecture that delivers significant advances in both the HPC and AI domains.

At NextSilicon, everything we do is guided by three core values:

  • Professionalism: We strive for exceptional results through professionalism and unwavering dedication to quality and performance. 
  • Unity: Collaboration is key to success. That's why we foster a work environment where every employee can feel valued and heard. 
  • Impact: We're passionate about developing technologies that make a meaningful impact on industries, communities, and individuals worldwide.

The AI Workloads team is responsible for modeling and enabling end-to-end AI workflows on NextSilicon’s next-generation hardware platforms. As an AI Workloads Engineer in Belgrade, you’ll build workflow modeling infrastructure, run and adapt open-source AI systems, and use real workloads to drive performance improvements from chip design through production.

Requirements

  • 4+ years of experience in software engineering.
  • Strong Python and PyTorch development experience.
  • Solid understanding of LLMs and modern inference workflows (e.g., KV cache, paged attention, speculative/assisted decoding, batching/scheduling).
  • Experience running, profiling, and instrumenting open-source AI inference systems (e.g., vLLM or similar).
  • Proficiency in C++ for developing software that models or interacts with hardware execution behavior (latency, dataflow, memory access patterns).
  • Experience with distributed inference and collectives (e.g., NCCL) and parallelism strategies (TP/PP/EP) is an advantage.
  • Experience with dynamic batching systems (e.g., vLLM, TensorRT-LLM) is an advantage.
  • Familiarity with MLPerf Inference benchmarks and methodology (Server/Offline, latency constraints, request arrival patterns) is an advantage.
  • Experience programming custom kernels (e.g., CUDA, Triton, or similar) is an advantage.
  • Background in performance analysis, simulation, compiler/runtime profiling, or workload modeling is an advantage.


Responsibilities

  • Model and analyze end-to-end AI workflows (e.g., assisted decoding, dynamic batching, dynamic KV cache, MLPerf-like scenarios) on NextSilicon platforms, from simulation through production.
  • Run and adapt open-source AI workloads, collecting and analyzing metrics such as latency, throughput, and traversal or arrival statistics.
  • Use SDK and framework-integration tools to profile full-stack behavior, identify performance bottlenecks, and drive improvements with compiler, runtime, and hardware design teams.
  • Prototype custom kernels or runtime components when needed to enable or optimize new AI workflows on NextSilicon hardware.


NextSilicon is proud to be an Equal Opportunity Employer. We do not discriminate based upon race, religion, color, age, national origin, gender (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, genetic information, status as a protected veteran, status as an individual with physical or mental disability, or other applicable legally protected characteristics. This policy applies to all employment practices within our organization, including hiring, recruiting, promotion, termination, layoff, recall, leave of absence, compensation, benefits, training, and apprenticeship. NextSilicon makes hiring decisions based solely on qualifications, merit, and business needs at the time.

Send us your CV
jobs@nextsilicon.com
Good luck!

For any questions, please contact us at questions@nextsilicon.com
