Senior Inference Engineer
Onsite (New Delhi)
About the Role
We’re building a high-performance AI inference platform designed for massive scale, low latency, and industry-leading cost efficiency. As a Senior Inference Engineer, you will own critical parts of the inference stack—driving architectural decisions, pushing hardware efficiency limits, and ensuring production-grade reliability across diverse workloads.
This role is for engineers who thrive at the intersection of systems engineering, ML performance, and infrastructure economics.
What You’ll Do
- Design and own end-to-end inference architecture for LLM and multimodal models
- Lead optimization of throughput, latency (including p95/p99 tail latency), and cost per token
- Architect batching, KV-cache management, speculative decoding, and parallelism strategies
- Drive model optimization: quantization (INT8/FP8), pruning, compilation, kernel fusion
- Build and scale inference services across heterogeneous hardware (GPU, CPU, accelerators)
- Evaluate and integrate inference engines (TensorRT-LLM, vLLM, NVIDIA Triton Inference Server, custom runtimes)
- Profile and debug performance across compute, memory, interconnect, and networking
- Establish production best practices: autoscaling, rollout strategies, monitoring, SLOs
- Mentor engineers and review performance-critical code
- Partner with product and business teams to translate requirements into system design
Required Skills
- Strong experience in Python and C++ (or Rust) for performance-critical systems
- Deep understanding of ML inference internals (transformers, attention, KV cache)
- Proven experience optimizing inference for LLMs or large vision models
- Strong knowledge of GPU architecture, memory hierarchies, and parallel programming
- Hands-on experience with CUDA, OpenAI Triton, or similar kernel-level optimization tools
- Experience running production inference at scale (multi-node, multi-GPU systems)
- Solid background in Linux, containers, and cloud or bare-metal infrastructure
Nice to Have
- Experience with custom CUDA kernels or compiler toolchains
- Familiarity with distributed inference (tensor/pipeline parallelism, NCCL)
- Knowledge of inference on alternative hardware (Trainium, Inferentia, TPUs, ASICs)
- Experience optimizing for power efficiency and $/token economics
- Contributions to open-source inference frameworks or performance tooling
What You’ll Gain
- Ownership of core inference systems used in production at scale
- Direct impact on cost structure, margins, and customer experience
- Opportunity to shape technical direction and platform architecture
- Work on some of the most performance-critical systems in applied AI