Senior Inference Engineer

Onsite [New Delhi]

About the Role

We’re building a high-performance AI inference platform designed for massive scale, low latency, and industry-leading cost efficiency. As a Senior Inference Engineer, you will own critical parts of the inference stack—driving architectural decisions, pushing hardware efficiency limits, and ensuring production-grade reliability across diverse workloads.

This role is for engineers who thrive at the intersection of systems engineering, ML performance, and infrastructure economics.


What You’ll Do

  • Design and own end-to-end inference architecture for LLMs and multimodal models
  • Lead optimization of latency, throughput, tail latency (p95/p99), and cost per token
  • Architect batching, KV-cache management, speculative decoding, and parallelism strategies
  • Drive model optimization: quantization (INT8/FP8), pruning, compilation, kernel fusion
  • Build and scale inference services across heterogeneous hardware (GPU, CPU, accelerators)
  • Evaluate and integrate inference engines (TensorRT-LLM, vLLM, Triton Inference Server, custom runtimes)
  • Profile and debug performance across compute, memory, interconnect, and networking
  • Establish production best practices: autoscaling, rollout strategies, monitoring, SLOs
  • Mentor engineers and review performance-critical code
  • Partner with product and business teams to translate requirements into system design

Required Skills

  • Strong experience in Python and C++ (or Rust) for performance-critical systems
  • Deep understanding of ML inference internals (transformers, attention, KV cache)
  • Proven experience optimizing inference for LLMs or large vision models
  • Strong knowledge of GPU architecture, memory hierarchies, and parallel programming
  • Hands-on experience with CUDA, OpenAI Triton, or similar kernel-level optimization tools
  • Experience running production inference at scale (multi-node, multi-GPU systems)
  • Solid background in Linux, containers, and cloud/bare-metal infrastructure

Nice to Have

  • Experience with custom CUDA kernels or compiler toolchains
  • Familiarity with distributed inference (tensor/pipeline parallelism, NCCL)
  • Knowledge of inference on alternative hardware (Trainium, Inferentia, TPUs, ASICs)
  • Experience optimizing for power efficiency and $/token economics
  • Contributions to open-source inference frameworks or performance tooling

What You’ll Gain

  • Ownership of core inference systems used in production at scale
  • Direct impact on cost structure, margins, and customer experience
  • Opportunity to shape technical direction and platform architecture
  • Work on some of the most performance-critical systems in applied AI