Senior Inference Engineer

Onsite [New Delhi]

About the Role

We’re building a high-performance AI inference platform designed for massive scale, low latency, and industry-leading cost efficiency. As a Senior Inference Engineer, you will own critical parts of the inference stack—driving architectural decisions, pushing hardware efficiency limits, and ensuring production-grade reliability across diverse workloads.

This role is for engineers who thrive at the intersection of systems engineering, ML performance, and infrastructure economics.


What You’ll Do

  • Design and own end-to-end inference architecture for LLMs and multimodal models
  • Lead optimization of latency, throughput, tail latency (p95/p99), and cost per token
  • Architect batching, KV-cache management, speculative decoding, and parallelism strategies
  • Drive model optimization: quantization (INT8/FP8), pruning, compilation, kernel fusion
  • Build and scale inference services across heterogeneous hardware (GPU, CPU, accelerators)
  • Evaluate and integrate inference engines (TensorRT-LLM, vLLM, Triton Inference Server, custom runtimes)
  • Profile and debug performance across compute, memory, interconnect, and networking
  • Establish production best practices: autoscaling, rollout strategies, monitoring, SLOs
  • Mentor engineers and review performance-critical code
  • Partner with product and business teams to translate requirements into system design

Required Skills

  • Strong experience in Python and C++ (or Rust) for performance-critical systems
  • Deep understanding of ML inference internals (transformers, attention, KV cache)
  • Proven experience optimizing inference for LLMs or large vision models
  • Strong knowledge of GPU architecture, memory hierarchies, and parallel programming
  • Hands-on experience with CUDA, OpenAI Triton, or similar kernel-level optimization tools
  • Experience running production inference at scale (multi-node, multi-GPU systems)
  • Solid background in Linux, containers, and cloud/bare-metal infrastructure

Nice to Have

  • Experience with custom CUDA kernels or compiler toolchains
  • Familiarity with distributed inference (tensor/pipeline parallelism, NCCL)
  • Knowledge of inference on alternative hardware (Trainium, Inferentia, TPUs, ASICs)
  • Experience optimizing for power efficiency and $/token economics
  • Contributions to open-source inference frameworks or performance tooling

What You’ll Gain

  • Ownership of core inference systems used in production at scale
  • Direct impact on cost structure, margins, and customer experience
  • Opportunity to shape technical direction and platform architecture
  • Work on some of the most performance-critical systems in applied AI