Senior HPC Engineer
Onsite
[
New Delhi
]
About the Role
We’re building compute infrastructure for high-throughput, low-latency workloads that push modern hardware to its limits. As a Senior HPC Engineer, you will design, optimize, and operate large-scale compute systems powering AI inference, simulation, and data-intensive workloads.
This role is for engineers who excel in parallel computing, performance engineering, and systems-level optimization across heterogeneous hardware.
What You’ll Do
- Design and optimize HPC systems and clusters for performance, scalability, and reliability
- Develop and tune parallel applications using MPI, OpenMP, CUDA, or hybrid models
- Optimize workloads across CPU, GPU, and accelerator-based architectures
- Profile and analyze performance bottlenecks (compute, memory, I/O, interconnect)
- Architect high-performance networking (InfiniBand, RDMA) and storage pipelines
- Improve job scheduling, resource utilization, and queue efficiency
- Lead benchmarking, capacity planning, and performance regression testing
- Build automation for cluster provisioning, monitoring, and fault recovery
- Mentor engineers and establish best practices for performance engineering
- Collaborate with ML, inference, and infra teams on cross-domain optimization
Required Skills
- Strong experience in C/C++ (required); Python for orchestration and tooling
- Deep knowledge of parallel computing models (MPI, OpenMP, CUDA)
- Strong understanding of CPU/GPU architecture, memory hierarchy, and NUMA
- Hands-on experience with HPC clusters and workload managers (Slurm, PBS)
- Experience with high-performance networking (InfiniBand, RDMA, NCCL)
- Strong Linux systems knowledge and performance debugging skills
- Proven experience operating or optimizing production-scale HPC environments
Nice to Have
- Experience with AI/ML workloads on HPC systems
- Familiarity with GPU-direct storage, NVMe-oF, or Lustre/GPFS
- Knowledge of power, thermal, and performance tuning at rack or node level
- Experience with containerized HPC (Apptainer/Singularity, Docker)
- Contributions to HPC tooling, benchmarks, or open-source projects
What You’ll Gain
- Ownership of high-performance compute infrastructure at scale
- Direct impact on throughput, efficiency, and infrastructure cost
- Opportunity to influence hardware selection and system architecture
- Work on some of the most demanding compute workloads in production