
DataOps Engineer (AI Platform Engineer)
6.0/10
Exness
Not specified
Office / on-site
mid
27 days ago
aitech, GPU infrastructure, Kubernetes, Python, Go, Linux, CI/CD, observability tools
AI Summary
The vacancy is well-defined but lacks compensation details, affecting overall attractiveness to applicants.
Description
What you'll actually do
- Collaborate closely with infrastructure teams on selecting and configuring GPU servers, high-performance networking, and RDMA-enabled clusters.
- Set up and manage GPU MIG (Multi-Instance GPU) configurations based on workload requirements and model characteristics.
- Ensure reliable and scalable GPU operations in Kubernetes, including runtime integration, device plugins, and GPU scheduling capabilities.
- Design, deploy, and maintain model serving runtimes, including vLLM, ONNX, SGLang, NVIDIA Triton runtimes, and KServe, ensuring high performance, scalability, and efficient GPU utilization.
- Build and maintain CI/CD pipelines and tooling for model packaging, versioning, and deployment, enabling reliable model delivery for internal teams.
- Build and maintain platform tooling for model lifecycle management, including experiment tracking, model versioning, and registry systems (e.g. MLflow).
- Enable infrastructure and workflows for model fine-tuning and adaptation (e.g. LoRA), focusing on scalability, reproducibility, and automation within the platform.
- Develop and support internal tooling for managing model inputs and configurations (e.g. prompt templates), enabling consistent and reusable model usage patterns.
- Conduct performance testing and evaluation of multi-node GPU clusters to identify and resolve bottlenecks.
- Build and maintain observability for GPU clusters and model workloads, including metrics such as GPU utilization, memory usage, throughput, and latency.
- Integrate tracing for model inference workflows to provide end-to-end visibility into requests and model behavior.
- Ensure compliance with security requirements for platform development.
- Evaluate and benchmark model inference performance across different runtimes, hardware setups, and configurations to guide platform optimization.
What we offer
- Full relocation support for you and your family to make your move smooth and worry-free.
Requirements
Who we’re looking for
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- 5+ years of experience in infrastructure, platform engineering, or distributed systems.
- Hands-on experience with GPU infrastructure, including the NVIDIA or AMD stack and multi-GPU environments.
- Strong experience with Kubernetes, including deploying and operating production workloads.
- Experience with Linux-based environments.
- Strong programming skills in Python and/or Go.
- Understanding of distributed systems and multi-node workloads.
- Experience with model serving and inference systems (e.g. vLLM, ONNX, SGLang, NVIDIA Triton runtimes, KServe).
- Experience with CI/CD pipelines and automation for deploying services or models.
- Experience with monitoring and observability tools (metrics, tracing, logging).
- Nice to have: familiarity with networking concepts relevant to distributed systems (e.g. RDMA, high-performance networking).
- Good communication and problem-solving skills.
- Advanced English for a range of work and business purposes.
- Critical thinking and attention to detail.
- Decision-making skills and the ability to adapt to change.