⚙️ DevOps, SRE & Platform Engineers

AI for DevOps & Platform Engineers

MLOps is DevOps for AI. If you know CI/CD, you're closer than you think.

The AI era creates a wave of new infrastructure work — model deployment, GPU management, drift monitoring, and AI observability. This track teaches DevOps engineers to build and operate production ML infrastructure using patterns they already understand.

Duration: 8 weeks · 6hrs/week
Format: Live online + async labs with HYVE engineer code review
Prerequisites: Docker, CI/CD, and at least one cloud provider

Curriculum

01 · MLOps Foundations for DevOps · 6hrs

How ML workflows differ from traditional software delivery. Experiment tracking, model versioning, and the MLOps lifecycle — each mapped to DevOps concepts you already know.

Lab: Set up MLflow experiment tracking on AWS with proper IAM and versioning.

02 · Containerising & Serving ML Models · 8hrs

Docker for ML, model artifacts, dependency management, FastAPI, Triton Inference Server, and vLLM for LLMs.

Lab: Containerise and deploy a RAG API with FastAPI — targeting <150ms p95 latency.

03 · CI/CD for ML Models · 8hrs

Automated ML testing, model validation gates, canary/blue-green/shadow deployments, GitHub Actions integration.

Lab: Build a CI/CD pipeline that trains, evaluates, and deploys with a quality gate.
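
To give a flavour of what this lab builds, here is a minimal sketch of a deployment quality gate in Python — the kind of check a pipeline runs between evaluation and deploy. The metric names and tolerance are illustrative assumptions, not the course's exact pipeline.

```python
def passes_quality_gate(candidate: dict, production: dict,
                        max_regression: float = 0.01) -> bool:
    """Block deployment if the candidate model regresses on any tracked
    metric by more than `max_regression` (absolute).
    Metric names and the threshold are illustrative, not prescriptive."""
    for metric, prod_value in production.items():
        if candidate.get(metric, float("-inf")) < prod_value - max_regression:
            return False
    return True

# In CI: evaluate both models, then gate the deploy step on this result.
prod = {"accuracy": 0.91, "f1": 0.880}
cand = {"accuracy": 0.92, "f1": 0.885}  # no metric regresses beyond tolerance
print(passes_quality_gate(cand, prod))
```

In a real pipeline this function sits behind a CI job (e.g. a GitHub Actions step) that exits non-zero when the gate fails, which blocks the deploy stage.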

04 · GPU Infrastructure & Cost Management · 6hrs

GPU instance selection, spot instances, inference optimisation, quantisation, and cost dashboards.

Lab: Reduce inference cost of an LLM deployment by 40% using quantisation and batching.
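
The back-of-envelope maths behind this lab: quantising weights from fp16 (16 bits) to int4 (4 bits) cuts weight memory — and often the GPU class you need — by roughly 4×. The sketch below covers weights only; real footprints also include KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(n_params_billions: float, bits_per_weight: int) -> float:
    """Approximate GPU memory for model weights alone.
    Ignores KV cache, activations and framework overhead."""
    bytes_total = n_params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

fp16 = weight_memory_gb(7, 16)  # a 7B model at fp16: 14 GB of weights
int4 = weight_memory_gb(7, 4)   # the same model at int4: 3.5 GB
print(f"fp16: {fp16:.1f} GB, int4: {int4:.1f} GB, saving: {1 - int4/fp16:.0%}")
```

Actual cost savings depend on the quantisation scheme, the serving runtime, and whether the smaller footprint lets you move to a cheaper instance or batch more requests per GPU.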

05 · Model Monitoring & Drift Detection · 8hrs

Data drift, concept drift, performance monitoring, anomaly detection, alerting pipelines.

Lab: Build a drift detection pipeline alerting on significant input distribution shifts.
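
One common drift statistic this kind of pipeline alerts on is the Population Stability Index (PSI), comparing live input distributions against a training-time reference. The sketch below is a minimal pure-Python version; the PSI > 0.2 alert threshold is a widely used rule of thumb, not a universal constant.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) sample and live traffic.
    Rule of thumb: PSI > 0.2 often signals significant drift."""
    lo, hi = min(expected), max(expected)

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # clamp into the reference range so live outliers still count
            i = min(bins - 1, max(0, int((x - lo) / (hi - lo) * bins)))
            counts[i] += 1
        # small epsilon avoids log(0) for empty bins
        return [(c + 1e-6) / (len(sample) + 1e-6 * bins) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

baseline = [i / 100 for i in range(100)]         # uniform reference sample
shifted  = [0.5 + i / 200 for i in range(100)]   # live traffic shifted right
print(population_stability_index(baseline, baseline))  # near zero: no drift
print(population_stability_index(baseline, shifted))   # large: raise an alert
```

In production you would compute this per feature over a sliding window and wire the threshold breach into your existing alerting stack (Prometheus/Alertmanager, PagerDuty, etc.).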

06 · Feature Stores & Data Pipelines · 6hrs

Feature engineering, feature stores (Feast, Tecton), data versioning with DVC.

Lab: Build a feature store pipeline serving real-time features to a fraud detection model.
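
Conceptually, the online half of a feature store is a low-latency lookup of the latest feature values per entity. This toy in-memory version illustrates the interface; names are illustrative, and real stores such as Feast or Tecton back this with Redis/DynamoDB and sync it from an offline store.

```python
import time

class OnlineFeatureStore:
    """Toy online store: latest feature values keyed by entity ID.
    Illustrative only — production stores add TTLs, point-in-time
    correctness, and offline/online sync."""

    def __init__(self):
        self._rows = {}  # entity_id -> {feature_name: value}

    def ingest(self, entity_id: str, features: dict) -> None:
        # Record the write time alongside the feature values
        self._rows.setdefault(entity_id, {}).update(
            features | {"_updated_at": time.time()})

    def get_online_features(self, entity_id: str, names: list) -> dict:
        row = self._rows.get(entity_id, {})
        return {n: row.get(n) for n in names}

store = OnlineFeatureStore()
store.ingest("card_123", {"txn_count_1h": 7, "avg_amount_24h": 41.5})
print(store.get_online_features("card_123", ["txn_count_1h", "avg_amount_24h"]))
```

The fraud-detection model in the lab calls an equivalent lookup at scoring time, so the same feature definitions serve both training (offline) and inference (online).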

07 · AI Observability & Incidents · 6hrs

Token cost dashboards, latency profiling, prompt logging, AI incident runbooks.

Lab: Build a full observability stack for an LLM API with cost and latency SLOs.
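
Two of the core signals this stack tracks — tail latency and per-request token cost — reduce to simple computations over request logs. A minimal sketch, with illustrative (not real) per-token prices:

```python
import math

# Illustrative $/1K-token prices — substitute your provider's actual rates
PRICE_PER_1K = {"input": 0.0025, "output": 0.01}

def p95(latencies_ms):
    """Nearest-rank 95th percentile over a window of request latencies."""
    ranked = sorted(latencies_ms)
    return ranked[max(0, math.ceil(0.95 * len(ranked)) - 1)]

def request_cost(input_tokens, output_tokens):
    """Dollar cost of one LLM call from its token counts."""
    return (input_tokens / 1000 * PRICE_PER_1K["input"]
            + output_tokens / 1000 * PRICE_PER_1K["output"])

latencies = [120, 180, 95, 110, 400, 130, 150, 105, 90, 125]
print(f"p95 latency: {p95(latencies)} ms")   # compare against your latency SLO
print(f"cost: ${request_cost(1200, 350):.4f}")
```

In the lab these computations run continuously over prompt logs, feeding dashboards and SLO burn-rate alerts rather than one-off prints.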

08 · Production AI Architecture · 6hrs

Reference architectures for RAG, agents, batch inference, real-time scoring — from HYVE's UAE deployments.

Lab: Design a production architecture for a UAE bank AI system.

Learning Outcomes

  • Build CI/CD pipelines for ML models
  • Deploy & scale AI APIs in production
  • Monitor drift and trigger automated retraining
  • Manage GPU costs
  • Build complete MLOps infrastructure

FAQs