⚙️ DevOps, SRE & Platform Engineers

AI for DevOps & Platform Engineers

MLOps is DevOps for AI. If you know CI/CD, you're closer than you think.

The AI era creates a wave of new infrastructure work — model deployment, GPU management, drift monitoring, and AI observability. This track teaches DevOps engineers to build and operate production ML infrastructure using patterns they already understand.

Duration: 8 weeks · 6hrs/week
Format: Live online + async labs with HYVE engineer code review
Prerequisites: Docker, CI/CD, and at least one cloud provider

Curriculum

01 · MLOps Foundations for DevOps · 6hrs

How ML workflows differ from traditional software delivery. Experiment tracking, model versioning, and the MLOps lifecycle — each mapped to DevOps concepts you already know.

Lab: Set up MLflow experiment tracking on AWS with proper IAM and versioning.

02 · Containerising & Serving ML Models · 8hrs

Docker for ML, model artifacts, dependency management, FastAPI, Triton Inference Server, and vLLM for LLMs.

Lab: Containerise and deploy a RAG API with FastAPI — targeting <150ms p95 latency.

03 · CI/CD for ML Models · 8hrs

Automated ML testing, model validation gates, canary/blue-green/shadow deployments, GitHub Actions integration.

Lab: Build a CI/CD pipeline that trains, evaluates, and deploys with a quality gate.
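
To give a flavour of what this lab builds, here is a minimal sketch of a deployment quality gate in Python — the kind of check a pipeline runs between evaluation and deploy. The metric names and tolerance are illustrative assumptions, not the course's exact pipeline.

```python
def passes_quality_gate(candidate: dict, production: dict,
                        max_regression: float = 0.01) -> bool:
    """Block deployment if the candidate model regresses on any tracked
    metric by more than `max_regression` (absolute).
    Metric names and the threshold are illustrative, not prescriptive."""
    for metric, prod_value in production.items():
        if candidate.get(metric, float("-inf")) < prod_value - max_regression:
            return False
    return True

# In CI: evaluate both models, then gate the deploy step on this result.
prod = {"accuracy": 0.91, "f1": 0.880}
cand = {"accuracy": 0.92, "f1": 0.885}  # no metric regresses beyond tolerance
print(passes_quality_gate(cand, prod))
```

In a real pipeline this function sits behind a CI job (e.g. a GitHub Actions step) that exits non-zero when the gate fails, which blocks the deploy stage.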

04 · GPU Infrastructure & Cost Management · 6hrs

GPU instance selection, spot instances, inference optimisation, quantisation, and cost dashboards.

Lab: Reduce inference cost of an LLM deployment by 40% using quantisation and batching.
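
The back-of-envelope maths behind this lab: quantising weights from fp16 (16 bits) to int4 (4 bits) cuts weight memory — and often the GPU class you need — by roughly 4×. The sketch below covers weights only; real footprints also include KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(n_params_billions: float, bits_per_weight: int) -> float:
    """Approximate GPU memory for model weights alone.
    Ignores KV cache, activations and framework overhead."""
    bytes_total = n_params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

fp16 = weight_memory_gb(7, 16)  # a 7B model at fp16: 14 GB of weights
int4 = weight_memory_gb(7, 4)   # the same model at int4: 3.5 GB
print(f"fp16: {fp16:.1f} GB, int4: {int4:.1f} GB, saving: {1 - int4/fp16:.0%}")
```

Actual cost savings depend on the quantisation scheme, the serving runtime, and whether the smaller footprint lets you move to a cheaper instance or batch more requests per GPU.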

05 · Model Monitoring & Drift Detection · 8hrs

Data drift, concept drift, performance monitoring, anomaly detection, alerting pipelines.

Lab: Build a drift detection pipeline alerting on significant input distribution shifts.
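
One common drift statistic this kind of pipeline alerts on is the Population Stability Index (PSI), comparing live input distributions against a training-time reference. The sketch below is a minimal pure-Python version; the PSI > 0.2 alert threshold is a widely used rule of thumb, not a universal constant.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) sample and live traffic.
    Rule of thumb: PSI > 0.2 often signals significant drift."""
    lo, hi = min(expected), max(expected)

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # clamp into the reference range so live outliers still count
            i = min(bins - 1, max(0, int((x - lo) / (hi - lo) * bins)))
            counts[i] += 1
        # small epsilon avoids log(0) for empty bins
        return [(c + 1e-6) / (len(sample) + 1e-6 * bins) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

baseline = [i / 100 for i in range(100)]         # uniform reference sample
shifted  = [0.5 + i / 200 for i in range(100)]   # live traffic shifted right
print(population_stability_index(baseline, baseline))  # near zero: no drift
print(population_stability_index(baseline, shifted))   # large: raise an alert
```

In production you would compute this per feature over a sliding window and wire the threshold breach into your existing alerting stack (Prometheus/Alertmanager, PagerDuty, etc.).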

06 · Feature Stores & Data Pipelines · 6hrs

Feature engineering, feature stores (Feast, Tecton), data versioning with DVC.

Lab: Build a feature store pipeline serving real-time features to a fraud detection model.
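
Conceptually, the online half of a feature store is a low-latency lookup of the latest feature values per entity. This toy in-memory version illustrates the interface; names are illustrative, and real stores such as Feast or Tecton back this with Redis/DynamoDB and sync it from an offline store.

```python
import time

class OnlineFeatureStore:
    """Toy online store: latest feature values keyed by entity ID.
    Illustrative only — production stores add TTLs, point-in-time
    correctness, and offline/online sync."""

    def __init__(self):
        self._rows = {}  # entity_id -> {feature_name: value}

    def ingest(self, entity_id: str, features: dict) -> None:
        # Record the write time alongside the feature values
        self._rows.setdefault(entity_id, {}).update(
            features | {"_updated_at": time.time()})

    def get_online_features(self, entity_id: str, names: list) -> dict:
        row = self._rows.get(entity_id, {})
        return {n: row.get(n) for n in names}

store = OnlineFeatureStore()
store.ingest("card_123", {"txn_count_1h": 7, "avg_amount_24h": 41.5})
print(store.get_online_features("card_123", ["txn_count_1h", "avg_amount_24h"]))
```

The fraud-detection model in the lab calls an equivalent lookup at scoring time, so the same feature definitions serve both training (offline) and inference (online).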

07 · AI Observability & Incidents · 6hrs

Token cost dashboards, latency profiling, prompt logging, AI incident runbooks.

Lab: Build a full observability stack for an LLM API with cost and latency SLOs.
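
Two of the core signals this stack tracks — tail latency and per-request token cost — reduce to simple computations over request logs. A minimal sketch, with illustrative (not real) per-token prices:

```python
import math

# Illustrative $/1K-token prices — substitute your provider's actual rates
PRICE_PER_1K = {"input": 0.0025, "output": 0.01}

def p95(latencies_ms):
    """Nearest-rank 95th percentile over a window of request latencies."""
    ranked = sorted(latencies_ms)
    return ranked[max(0, math.ceil(0.95 * len(ranked)) - 1)]

def request_cost(input_tokens, output_tokens):
    """Dollar cost of one LLM call from its token counts."""
    return (input_tokens / 1000 * PRICE_PER_1K["input"]
            + output_tokens / 1000 * PRICE_PER_1K["output"])

latencies = [120, 180, 95, 110, 400, 130, 150, 105, 90, 125]
print(f"p95 latency: {p95(latencies)} ms")   # compare against your latency SLO
print(f"cost: ${request_cost(1200, 350):.4f}")
```

In the lab these computations run continuously over prompt logs, feeding dashboards and SLO burn-rate alerts rather than one-off prints.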

08 · Production AI Architecture · 6hrs

Reference architectures for RAG, agents, batch inference, real-time scoring — from HYVE's UAE deployments.

Lab: Design a production architecture for a UAE bank AI system.

Learning Outcomes

  • Build CI/CD pipelines for ML models
  • Deploy & scale AI APIs in production
  • Monitor drift and trigger automated retraining
  • Manage GPU costs
  • Build complete MLOps infrastructure

FAQs