Inference Engineer
Onsite (New Delhi)
About the Role
We’re building a high-performance AI inference platform focused on delivering low-latency, cost-efficient model serving at scale. As an Inference Engineer, you’ll work close to the metal—optimizing model execution, improving throughput, and helping design reliable inference pipelines used in real production workloads.
This role is ideal for an engineer with strong fundamentals who wants deep exposure to model serving, hardware efficiency, and distributed systems.
What You’ll Do
- Implement and optimize model inference pipelines for LLMs and vision models
- Work with inference frameworks and servers (e.g., TensorRT, ONNX Runtime, vLLM, NVIDIA Triton Inference Server)
- Optimize latency, throughput, memory usage, and cost per token
- Assist in deploying and maintaining inference services on GPU and CPU clusters
- Profile model execution (compute, memory, bandwidth) and identify bottlenecks
- Support model quantization, batching, caching, and parallelism strategies
- Collaborate with platform, infra, and product teams to ship production features
- Monitor inference workloads and help improve system reliability
Required Skills
- Strong Python fundamentals; C++ or Rust is a plus
- Solid understanding of how machine learning inference differs from training
- Familiarity with at least one ML framework (PyTorch, TensorFlow, JAX)
- Basic knowledge of GPU/CPU architecture, memory, and parallelism
- Experience with Linux, containers (Docker), and basic cloud workflows
- Ability to read research papers or performance benchmarks and apply the findings
Nice to Have
- Exposure to LLMs (LLaMA, Mistral, Qwen, etc.) or vision models
- Experience with quantization (INT8/FP8) or model compression
- Familiarity with CUDA concepts, OpenAI Triton kernels, or other low-level optimization
- Knowledge of distributed systems or RPC frameworks
- Prior work on inference benchmarks or cost optimization
What You’ll Gain
- Hands-on experience with real-world AI inference at scale
- Deep understanding of cost-per-token economics and system trade-offs
- Opportunity to grow into senior inference, systems, or infra roles
- Work on problems that directly impact product performance and margins