Inference Engineer
Onsite (New Delhi)
About the Role
We’re building a high-performance AI inference platform focused on delivering low-latency, cost-efficient model serving at scale. As an Inference Engineer, you’ll work close to the metal—optimizing model execution, improving throughput, and helping design reliable inference pipelines used in real production workloads.
This role is ideal for an engineer with strong fundamentals who wants deep exposure to model serving, hardware efficiency, and distributed systems.
What You’ll Do
- Implement and optimize model inference pipelines for LLMs and vision models
- Work with inference frameworks and servers (e.g., TensorRT, ONNX Runtime, vLLM, NVIDIA Triton Inference Server)
- Optimize latency, throughput, memory usage, and cost per token
- Assist in deploying and maintaining inference services on GPU and CPU clusters
- Profile model execution (compute, memory, bandwidth) and identify bottlenecks
- Support model quantization, batching, caching, and parallelism strategies
- Collaborate with platform, infra, and product teams to ship production features
- Monitor inference workloads and help improve system reliability
Required Skills
- Strong Python fundamentals; C++ or Rust is a plus
- Solid understanding of how machine learning inference differs from training
- Familiarity with at least one ML framework (PyTorch, TensorFlow, JAX)
- Basic knowledge of GPU/CPU architecture, memory, and parallelism
- Experience with Linux, containers (Docker), and basic cloud workflows
- Ability to read research papers or performance benchmarks and apply the findings
Nice to Have
- Exposure to LLMs (LLaMA, Mistral, Qwen, etc.) or vision models
- Experience with quantization (INT8/FP8) or model compression
- Familiarity with CUDA concepts, OpenAI Triton kernels, or other low-level optimization
- Knowledge of distributed systems or RPC frameworks
- Prior work on inference benchmarks or cost optimization
What You’ll Gain
- Hands-on experience with real-world AI inference at scale
- Deep understanding of cost-per-token economics and system trade-offs
- Opportunity to grow into senior inference, systems, or infra roles
- Work on problems that directly impact product performance and margins