DataOps Engineer (AI Platform Engineer)

6.0/10

Exness

Not specified

Office / on-site

mid

27 days ago

aitech, GPU infrastructure, Kubernetes, Python, Go, Linux, CI/CD, observability tools

AI Summary

The vacancy is well-defined but lacks compensation details, affecting overall attractiveness to applicants.

Description

What you'll actually do

•Collaborate closely with infrastructure teams on selecting and configuring GPU servers, high-performance networking, and RDMA-enabled clusters.
•Perform and manage GPU MIG configurations based on workload requirements and model characteristics.
•Ensure reliable and scalable GPU operations in Kubernetes, including runtime integration, device plugins, and GPU scheduling capabilities.
•Design, deploy, and maintain model serving runtimes, including vLLM, ONNX, SGLang, NVIDIA Triton Runtimes, and KServe, ensuring high performance, scalability, and efficient GPU utilization.
•Build and maintain CI/CD pipelines and tooling for model packaging, versioning, and deployment, enabling reliable model delivery for internal teams.
•Build and maintain platform tooling for model lifecycle management, including experiment tracking, model versioning, and registry systems (e.g. MLflow).
•Enable infrastructure and workflows for model fine-tuning and adaptation (e.g. LoRA), focusing on scalability, reproducibility, and automation within the platform.
•Develop and support internal tooling for managing model inputs and configurations (e.g. prompt templates), enabling consistent and reusable model usage patterns.
•Conduct performance testing and evaluation of multi-node GPU clusters to identify and resolve bottlenecks.
•Build and maintain observability for GPU clusters and model workloads, including metrics such as GPU utilization, memory usage, throughput, and latency.
•Integrate tracing for model inference workflows to provide end-to-end visibility into requests and model behavior.
•Ensure compliance with security requirements for platform development.
•Evaluate and benchmark model inference performance across different runtimes, hardware setups, and configurations to guide platform optimization.

What we offer

•Full relocation support for you and your family to make your move smooth and worry-free.

Requirements

Who we’re looking for

•Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
•5+ years of experience in infrastructure, platform engineering, or distributed systems.
•Hands-on experience working with GPU infrastructure, including NVIDIA or AMD stack and multi-GPU environments.
•Strong experience with Kubernetes, including deploying and operating production workloads.
•Experience with Linux-based environments.
•Strong programming skills in Python and/or Go.
•Understanding of distributed systems and multi-node workloads.
•Experience with model serving and inference systems (e.g. vLLM, ONNX, SGLang, NVIDIA Triton Runtimes, KServe).
•Experience with CI/CD pipelines and automation for deploying services or models.
•Experience with monitoring and observability tools (metrics, tracing, logging).
•Nice to have: familiarity with networking concepts relevant to distributed systems (e.g. RDMA, high-performance networking).
•Good communication and problem-solving skills.
•Ability to use advanced English for different work and business purposes.
•Critical thinking and attention to detail.
•Decision-making skills and the ability to adapt to change.

Salary not listed

Market range for similar roles

Based on 355 comparable Other openings (annual, USD)

$70k–$160k

Typical midpoint $110k

Company Info

Exness

FinTech

Exness is a global multi-asset retail broker founded in 2008 that provides online trading services to over 1 million clients worldwide. The company processes 40+ million trades daily and operates as an ethical trading platform across 13 countries.

Limassol, Cyprus

1000+ employees

Founded 2008
