Exness

DataOps Engineer (AI Platform Engineer)

Not specified
Office / on-site
mid
27 days ago
aitech, GPU infrastructure, Kubernetes, Python, Go, Linux, CI/CD, observability tools

Description

What you'll actually do

  • Collaborate closely with infrastructure teams on selecting and configuring GPU servers, high-performance networking, and RDMA-enabled clusters.
  • Perform and manage GPU MIG configurations based on workload requirements and model characteristics.
  • Ensure reliable and scalable GPU operations in Kubernetes, including runtime integration, device plugins, and GPU scheduling capabilities.
  • Design, deploy, and maintain model serving runtimes, including vLLM, ONNX, SGLang, NVIDIA Triton runtimes, and KServe, ensuring high performance, scalability, and efficient GPU utilization.
  • Build and maintain CI/CD pipelines and tooling for model packaging, versioning, and deployment, enabling reliable model delivery for internal teams.
  • Build and maintain platform tooling for model lifecycle management, including experiment tracking, model versioning, and registry systems (e.g. MLflow).
  • Enable infrastructure and workflows for model fine-tuning and adaptation (e.g. LoRA), focusing on scalability, reproducibility, and automation within the platform.
  • Develop and support internal tooling for managing model inputs and configurations (e.g. prompt templates), enabling consistent and reusable model usage patterns.
  • Conduct performance testing and evaluation of multi-node GPU clusters to identify and resolve bottlenecks.
  • Build and maintain observability for GPU clusters and model workloads, including metrics such as GPU utilization, memory usage, throughput, and latency.
  • Integrate tracing for model inference workflows to provide end-to-end visibility into requests and model behavior.
  • Ensure compliance with security requirements for platform development.
  • Evaluate and benchmark model inference performance across different runtimes, hardware setups, and configurations to guide platform optimization.

What we offer

  • Full relocation support for you and your family to make your move smooth and worry-free.

Requirements

Who we’re looking for

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • 5+ years of experience in infrastructure, platform engineering, or distributed systems.
  • Hands-on experience working with GPU infrastructure, including NVIDIA or AMD stack and multi-GPU environments.
  • Strong experience with Kubernetes, including deploying and operating production workloads.
  • Experience with Linux-based environments.
  • Strong programming skills in Python and/or Go.
  • Understanding of distributed systems and multi-node workloads.
  • Experience with model serving and inference systems (e.g. vLLM, ONNX, SGLang, NVIDIA Triton runtimes, KServe).
  • Experience with CI/CD pipelines and automation for deploying services or models.
  • Experience with monitoring and observability tools (metrics, tracing, logging).
  • Nice to have: familiarity with networking concepts relevant to distributed systems (e.g. RDMA, high-performance networking).
  • Good communication and problem-solving skills.
  • Ability to use advanced English for different work and business purposes.
  • Critical thinking and attention to detail.
  • Decision-making skills and the ability to adapt to change.