DevOps Engineer for NVIDIA AI Cloud (medior/senior) | ID:335 - Neohunter

Remote
Full Time
€2,500 - €4,500 / Month

Join a large, stable technology organization with deep infrastructure and operations experience, on a project that shapes how European companies train and run their own AI.

Job Description

One of the largest AI factories in Europe is being built right now, and this role sits at the operational core of it. The platform runs thousands of NVIDIA Blackwell GPUs (DGX B200 and RTX Pro Servers) as a sovereign, secure cloud for European industry, research, and the public sector.

This is hands-on platform engineering combined with direct contact with the enterprise customers who run their AI workloads on the cloud.

Role overview:

You will design, automate, and run services on the AI cloud platform, and you will work directly with enterprise customers as they bring their workloads onto it. On the platform side, you own the lifecycle of CI/CD pipelines and Infrastructure as Code, build automation, and keep GPU workloads reliable and observable. On the customer side, you guide onboarding, set up proofs of concept, and help teams get real performance out of their GPU clusters and AI toolchains. You work across infrastructure, networking, security, and AI services, and you are the technical point of contact who connects those pieces. The role is open from mid-level to senior level, so the scope and salary will scale with your experience.

Key Responsibilities:

Build and maintain CI/CD pipelines and Infrastructure as Code (Terraform, Ansible, Helm) for the AI cloud platform.
Automate provisioning, deployment, and scaling of GPU workloads across Kubernetes and container-based environments.
Run and tune GPU workload scheduling (Slurm, Run:AI) so customers get the throughput they pay for.
Set up and lead proofs of concept with enterprise customers: environment setup, data pipelines, deployment workflows.
Onboard and train customer teams on how to use their GPU clusters, LLM toolchains, and AI environments.
Troubleshoot performance, fine-tuning, and integration issues across the full customer lifecycle, from first deployment to production.
Set up monitoring, observability, and capacity planning for GPU utilization (Prometheus, Grafana, Alertmanager).
Act as the technical point of contact across infrastructure, networking, security, and AI services teams.
Translate customer business needs into technical specifications and feed them back into platform improvements.
Apply security, reliability, and responsible-AI practices across everything you ship.

Requirements

3+ years (medior) to 5+ years (senior) building and running cloud infrastructure in production (IaaS, PaaS, or SaaS).
Strong Linux administration, ideally Ubuntu.
Solid Kubernetes and containerized workflows.
Hands-on Infrastructure as Code: Terraform, Ansible, or Helm.
CI/CD in a Kubernetes environment, plus Git-based automation (GitHub Actions or GitLab CI/CD).
Scripting in Python and Bash for automation and tooling.
Working knowledge of NVIDIA GPU-accelerated platforms, or clear readiness to specialize in them fast.
Monitoring and observability with Prometheus and Grafana.
English at B2/C1, used actively in customer-facing communication.
Comfortable working independently and coordinating across cross-functional teams.

A strong plus:

German (active), especially for work with German-speaking enterprise customers.
GPU workload schedulers (Slurm, Run:AI) and high-performance network architectures.
Self-hosted LLMs: fine-tuning and inference tuning.
AI/ML frameworks: PyTorch, TensorFlow, Hugging Face, Triton Inference Server.
LLM architectures, embeddings, vector databases, and RAG pipelines.
VMware Tanzu Kubernetes and Software-Defined Networking (SDN).

Required Education

University degree in IT, computer science, or a related field. Strong hands-on experience counts more than a specific diploma, so we look at what you have built, not only at your formal degree.

Employee Benefits

Work on a unique European AI Cloud project, one of the largest builds of its kind on the continent.
A cloud platform developed in-house, without vendor lock-in. You work on real engineering, not on wiring together someone else’s managed services.
Access to hardware at a scale most engineers never get near: around 10,000 NVIDIA GPUs running in production.
A modern technical stack you use every day: Go, Kubernetes, Terraform, GPU compute, and SDN.
Strong focus on security, compliance, and sensitive data, with customers in banking, insurance, and defense. The work is held to a high bar because of who relies on it.
Work alongside top infrastructure engineers, on a team where you learn from the people next to you.
Room to grow technically and to help shape the team as the platform scales.
An internal reskilling program, training, and a learning budget, so building deep GPU cloud and AI skills is part of the job, not something you do on weekends.
Hybrid or full remote within Slovakia, with occasional travel to the data center, customers, or project sites.
A stable employer with a large engineering base, so the project has real backing and the role has runway.
Over 25 company benefits across finance, health and sport, learning and development, and family and work-life balance.

Contact Person Details

Name:

Juraj Lovas

Email:

Juraj.Lovas@neohunter.io
