MLOps in Practice: Deploying AI Models with Docker and FastAPI

By Hamza Boughanim · 2026-07-05 · 13 min read

A complete production guide to containerizing and serving machine learning models — covering FastAPI model servers, multi-stage Docker builds, GPU support, heal

Introduction: The Gap Between Notebook and Production

Most machine learning projects die in notebooks. A model achieves 94% accuracy on a validation set, the data scientist shares a screenshot, and then nothing ships. The gap between a working notebook and a reliable production system is where most ML value is destroyed.

MLOps — Machine Learning Operations — is the discipline that closes this gap. At its core, MLOps asks a simple question: how do we serve a model reliably, repeatedly, and at scale? This guide answers that question with a concrete stack: FastAPI for the serving layer and Docker for packaging and deployment. By the end, you will have a production-ready ML service that can run on any machine, from a local workstation to a cloud VM, with no environment surprises.

Architecture Overview

Before writing a single line of code, it is worth understanding what we are building. The system has three layers:

Model layer — a serialized ML model (scikit-learn, PyTorch, ONNX) loaded into memory once at startup
API layer — a FastAPI application that exposes inference endpoints, validates input, and returns structured predictions
Infrastructure layer — Docker containers that package both layers with their exact dependencies, making the service reproducible everywhere


+--------------------------------------------------+
|                   Docker Host                    |
|                                                  |
|   +-----------------+   +--------------------+  |
|   |   FastAPI App   |   |   Model Storage    |  |
|   |                 |   |   (volume mount)   |  |
|   |  /predict  -----+--->  model.pt          |  |
|   |  /health        |   |  tokenizer/        |  |
|   |  /metrics       |   +--------------------+  |
|   +--------+--------+                           |
|            |  :8000                              |
+------------+------------------------------------+
             |
         Client / Load Balancer / Nginx

1. Building the FastAPI Model Server

FastAPI is the right choice for ML APIs for three reasons: automatic request validation via Pydantic, async support for concurrent requests, and automatic OpenAPI documentation generation. Let us build a text classification server as a concrete example.

Project Structure

ml-service/
├── app/
│   ├── __init__.py
│   ├── main.py          ← FastAPI application
│   ├── model.py         ← Model loading and inference logic
│   ├── schemas.py       ← Pydantic request/response models
│   └── config.py        ← Settings via environment variables
├── models/
│   └── classifier.pkl   ← Serialized model file
├── tests/
│   └── test_api.py
├── Dockerfile
├── docker-compose.yml
└── requirements.txt

Pydantic Schemas — schemas.py

from pydantic import BaseModel, Field
from typing import List

class PredictRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=4096,
                      description="Input text to classify")
    top_k: int = Field(default=1, ge=1, le=5,
                       description="Number of top predictions to return")

class Prediction(BaseModel):
    label: str
    confidence: float
    rank: int

class PredictResponse(BaseModel):
    predictions: List[Prediction]
    model_version: str
    processing_time_ms: float

class HealthResponse(BaseModel):
    status: str
    model_loaded: bool
    uptime_seconds: float

Model Wrapper — model.py

import time
import pickle
import logging
from pathlib import Path
from typing import List, Tuple

logger = logging.getLogger(__name__)

class ClassifierModel:
    """
    Singleton wrapper around a scikit-learn classifier.
    The module-level instance is created once at import time; load() runs once
    at startup, and all requests in a worker process share that instance.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def load(self, model_path: Path) -> None:
        logger.info(f"Loading model from {model_path}")
        # pickle.load can execute arbitrary code: only load model files you trust.
        with open(model_path, "rb") as f:
            self._pipeline = pickle.load(f)
        self._loaded_at = time.time()
        self.version = model_path.stem
        logger.info("Model loaded successfully")

    @property
    def is_loaded(self) -> bool:
        return hasattr(self, "_pipeline")

    def predict(self, text: str, top_k: int = 1) -> List[Tuple[str, float]]:
        if not self.is_loaded:
            raise RuntimeError("Model is not loaded")
        proba = self._pipeline.predict_proba([text])[0]
        classes = self._pipeline.classes_
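        ranked = sorted(zip(classes, proba), key=lambda x: x[1], reverse=True)
        # Cast NumPy scalars to built-in types so Pydantic receives plain str/float.
        return [(str(label), float(p)) for label, p in ranked[:top_k]]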

model = ClassifierModel()
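
To see what `predict` returns without training a real model, the ranking step can be exercised against a stand-in pipeline. The sketch below assumes only that the pipeline exposes `predict_proba` and `classes_`, as scikit-learn classifiers do; `StubPipeline`, `rank_top_k`, and the fixed probabilities are hypothetical and exist only for illustration.

```python
from typing import List, Sequence, Tuple


class StubPipeline:
    """Hypothetical stand-in for a fitted scikit-learn pipeline."""

    classes_ = ["negative", "neutral", "positive"]

    def predict_proba(self, texts: Sequence[str]) -> List[List[float]]:
        # Return the same made-up distribution for every input text.
        return [[0.1, 0.2, 0.7] for _ in texts]


def rank_top_k(pipeline, text: str, top_k: int = 1) -> List[Tuple[str, float]]:
    """Same ranking logic as ClassifierModel.predict, with a cast to built-in types."""
    proba = pipeline.predict_proba([text])[0]
    ranked = sorted(zip(pipeline.classes_, proba), key=lambda x: x[1], reverse=True)
    # Built-in str/float values serialize cleanly to JSON.
    return [(str(label), float(p)) for label, p in ranked[:top_k]]


top_two = rank_top_k(StubPipeline(), "great product", top_k=2)
print(top_two)  # [('positive', 0.7), ('neutral', 0.2)]
```

Asking for a `top_k` larger than the number of classes simply returns every class, since slicing past the end of a list is not an error.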

FastAPI Application — main.py

import time
import logging
from contextlib import asynccontextmanager
from pathlib import Path

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware

from app.model import model
from app.schemas import PredictRequest, PredictResponse, Prediction, HealthResponse
from app.config import settings

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
_start_time = time.time()

@asynccontextmanager
async def lifespan(app: FastAPI):
    model_path = Path(settings.MODEL_PATH)
    if not model_path.exists():
        raise FileNotFoundError(f"Model not found at {model_path}")
    model.load(model_path)
    logger.info("Service ready")
    yield
    logger.info("Shutting down")

app = FastAPI(
    title="ML Classifier API",
    version="1.0.0",
    lifespan=lifespan,
)

# Wide-open CORS is convenient in development; restrict allow_origins in production.
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

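@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    # Plain `def`: FastAPI runs sync handlers in a threadpool, so CPU-bound
    # inference does not block the event loop.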
    if not model.is_loaded:
        raise HTTPException(status_code=503, detail="Model not ready")
    t0 = time.perf_counter()
    try:
        raw = model.predict(request.text, top_k=request.top_k)
    except Exception:
        # The full traceback goes to the logs; do not leak internals to clients.
        logger.exception("Inference error")
        raise HTTPException(status_code=500, detail="Inference failed")
    elapsed_ms = (time.perf_counter() - t0) * 1000
    predictions = [
        Prediction(label=label, confidence=round(conf, 4), rank=i + 1)
        for i, (label, conf) in enumerate(raw)
    ]
    return PredictResponse(
        predictions=predictions,
        model_version=model.version,
        processing_time_ms=round(elapsed_ms, 2),
    )

@app.get("/health", response_model=HealthResponse)
async def health():
    return HealthResponse(
        status="ok" if model.is_loaded else "starting",
        model_loaded=model.is_loaded,
        uptime_seconds=round(time.time() - _start_time, 1),
    )

@app.get("/ready")
async def readiness():
    if not model.is_loaded:
        raise HTTPException(status_code=503, detail="Model not loaded yet")
    return {"status": "ready", "model_version": model.version}
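
main.py imports `settings` from app.config, which appears in the project structure but is not listed in this guide. One minimal way to write it, assuming only the standard library, is sketched below. The `Settings` class name and the default values are assumptions; `MODEL_PATH` and `LOG_LEVEL` are the two variables the Compose file sets.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    """Service settings read from environment variables at construction time."""

    # The defaults are assumptions intended for local runs.
    MODEL_PATH: str = field(
        default_factory=lambda: os.environ.get("MODEL_PATH", "models/classifier.pkl")
    )
    LOG_LEVEL: str = field(
        default_factory=lambda: os.environ.get("LOG_LEVEL", "INFO")
    )


# Module-level instance, imported elsewhere as `from app.config import settings`.
settings = Settings()
```

Reading the environment inside `default_factory` means the values are captured when `Settings()` is constructed, so tests can set variables and build a fresh instance. The `pydantic-settings` package is a common alternative if typed parsing and validation of settings are needed.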

2. Writing the Dockerfile

A naive Dockerfile copies everything into a single fat image. A production Dockerfile uses multi-stage builds to separate the dependency installation from the runtime, resulting in a smaller, more secure final image.

CPU Model — Dockerfile

# Stage 1: dependency builder
FROM python:3.11-slim AS builder

WORKDIR /build

RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

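# Stage 2: runtime image
FROM python:3.11-slim AS runtime

# Give the system user an explicit home so ~/.local resolves to /home/app/.local
RUN addgroup --system app && adduser --system --ingroup app --home /home/app app

WORKDIR /app

COPY --from=builder --chown=app:app /root/.local /home/app/.local
COPY --chown=app:app app/ ./app/
COPY --chown=app:app models/ ./models/

USER app

# Point Python's user site-packages at the copied dependencies
ENV PYTHONUSERBASE=/home/app/.local
ENV PATH="/home/app/.local/bin:$PATH"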
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

EXPOSE 8000

CMD ["gunicorn", "app.main:app", \
     "--worker-class", "uvicorn.workers.UvicornWorker", \
     "--workers", "2", \
     "--bind", "0.0.0.0:8000", \
     "--timeout", "120"]

GPU Model — Dockerfile.gpu

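For PyTorch or TensorFlow models running on GPU, swap the base image for an NVIDIA CUDA image. The host also needs the NVIDIA Container Toolkit installed, and the container must be started with GPU access (for example, docker run --gpus all).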

# GPU base: CUDA 12.1 + cuDNN 8 + Python 3.11
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 AS runtime

RUN apt-get update && apt-get install -y python3.11 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# The rest of the Dockerfile is identical to the CPU version

3. Docker Compose: Orchestrating the Full Stack

Docker Compose lets you define and run multiple containers as a single application. This configuration adds an Nginx reverse proxy in front of FastAPI, a common pattern in production deployments because the proxy can handle rate limiting and TLS termination.

version: "3.9"

services:
  ml-api:
    build:
      context: .
      dockerfile: Dockerfile
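    # Match the registry path pushed by CI so `docker compose pull` fetches that image
    image: ghcr.io/your-username/ml-classifier:latest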
    container_name: ml_api
    restart: unless-stopped
    volumes:
      - ./models:/app/models:ro
    environment:
      - MODEL_PATH=models/classifier.pkl
      - LOG_LEVEL=INFO
    healthcheck:
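      # python:3.11-slim ships without curl, so probe with the standard library
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]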
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s
    networks:
      - ml-network

  nginx:
    image: nginx:1.27-alpine
    container_name: ml_nginx
    restart: unless-stopped
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      ml-api:
        condition: service_healthy
    networks:
      - ml-network

networks:
  ml-network:
    driver: bridge
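
The healthcheck settings in this file describe a polling loop: probe the endpoint, wait `interval`, and mark the container unhealthy after `retries` consecutive failures. A rough Python sketch of the same idea follows, useful for example to gate a smoke test on the service being up. `probe` and `wait_until_healthy` are hypothetical helpers, and the loop simplifies Docker's behavior: it ignores `start_period` and stops probing once it gives up.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a 4xx/5xx response (HTTPError).
        return False


def wait_until_healthy(url: str, interval: float = 30.0,
                       timeout: float = 10.0, retries: int = 3) -> bool:
    """Probe `url` up to `retries` times, sleeping `interval` between attempts."""
    for attempt in range(retries):
        if probe(url, timeout=timeout):
            return True
        if attempt < retries - 1:
            time.sleep(interval)
    return False
```

Pointing the helper at `/ready` rather than `/health` makes it wait for the model itself, since `/ready` returns 503 until the model is loaded while `/health` answers 200 with a "starting" status.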

Nginx Configuration — nginx.conf

upstream ml_api {
    server ml-api:8000;
}

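# Rate limiting: 30 requests per minute per IP.
# limit_req_zone is only valid in the http context; files in conf.d are
# included there, so the directive must sit outside the server block.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/m;

server {
    listen 80;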

    location /predict {
        limit_req zone=api_limit burst=10 nodelay;
        proxy_pass         http://ml_api;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_read_timeout 120s;
    }

    location /health {
        proxy_pass http://ml_api;
        access_log off;
    }

    location / {
        proxy_pass http://ml_api;
    }
}
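
To make the `rate=30r/m` and `burst=10 nodelay` numbers concrete: 30 requests per minute is one request every two seconds, and `burst=10` lets up to ten requests above that rate through immediately before Nginx starts rejecting (with status 503 by default). The following is a simplified Python model of that accounting for a single client, not Nginx's actual implementation; `RateLimiter` and its method names are made up for illustration.

```python
class RateLimiter:
    """Simplified model of limit_req with `nodelay` for a single client IP."""

    def __init__(self, rate_per_minute: float, burst: int) -> None:
        self.rate_per_second = rate_per_minute / 60.0
        self.burst = burst
        self.excess = 0.0      # requests currently "ahead" of the allowed rate
        self.last_seen = None  # timestamp of the last accepted request

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        if self.last_seen is None:
            # First request from this client: always accepted.
            self.last_seen = now
            return True
        elapsed = now - self.last_seen
        # Excess drains at the configured rate and grows by one per request.
        excess = max(self.excess - self.rate_per_second * elapsed + 1.0, 0.0)
        if excess > self.burst:
            return False  # over the burst allowance: the request is rejected
        self.excess = excess
        self.last_seen = now
        return True


limiter = RateLimiter(rate_per_minute=30, burst=10)

# 15 requests arriving at the same instant: 1 on-rate + 10 burst are accepted.
accepted = sum(limiter.allow(now=0.0) for _ in range(15))
print(accepted)  # 11

# Two seconds later one slot has drained, so exactly one more request fits.
print(limiter.allow(now=2.0), limiter.allow(now=2.0))  # True False
```

For an inference endpoint this means a client can send a short burst of eleven requests, after which it is held to the steady rate of one every two seconds.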

4. CI/CD with GitHub Actions

A CI/CD pipeline automates the build-test-deploy cycle. This workflow runs on every push to main: it runs the tests, builds the Docker image and pushes it to the GitHub Container Registry, then deploys to production via SSH.

# .github/workflows/deploy.yml
name: Build and Deploy ML Service

on:
  push:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: your-username/ml-classifier

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip
      - run: pip install -r requirements.txt pytest httpx
      - run: pytest tests/ -v

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Deploy via SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.PROD_HOST }}
          username: ${{ secrets.PROD_USER }}
          key: ${{ secrets.PROD_SSH_KEY }}
          script: |
            cd /opt/ml-service
            docker compose pull
            docker compose up -d --no-build
            docker system prune -f

5. Running the Service

Local development

# Build
docker build -t ml-classifier:dev .

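# Run with live code reload (overrides the image's gunicorn CMD with uvicorn --reload)
docker run --rm -p 8000:8000 \
  -e MODEL_PATH=models/classifier.pkl \
  -v "$(pwd)/app:/app/app" \
  -v "$(pwd)/models:/app/models" \
  ml-classifier:dev \
  uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload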

Production

# Start full stack
docker compose up -d

# Check health
curl http://localhost/health

# Run a prediction
curl -X POST http://localhost/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "This product is excellent!", "top_k": 3}'

# Expected response:
# {
#   "predictions": [
#     {"label": "positive", "confidence": 0.9821, "rank": 1},
#     {"label": "neutral",  "confidence": 0.0143, "rank": 2},
#     {"label": "negative", "confidence": 0.0036, "rank": 3}
#   ],
#   "model_version": "classifier",
#   "processing_time_ms": 4.72
# }
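
The same request can be issued from Python, which is handy for a smoke test after a deploy. This sketch uses only the standard library; `classify` is a hypothetical helper, and the default URL assumes the Compose stack is listening on port 80 of localhost.

```python
import json
import urllib.request


def classify(text: str, top_k: int = 1,
             base_url: str = "http://localhost", timeout: float = 10.0) -> dict:
    """POST a PredictRequest-shaped payload to /predict and return the parsed JSON."""
    payload = json.dumps({"text": text, "top_k": top_k}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    # e.g. 422 for invalid input or 503 when the rate limit is exceeded.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the service to be running):
# result = classify("This product is excellent!", top_k=3)
# for p in result["predictions"]:
#     print(p["rank"], p["label"], p["confidence"])
```

Because errors surface as exceptions, a deploy script can call `classify` once and treat any raised `HTTPError` or connection error as a failed smoke test.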

Key Takeaways

Deploying ML models is primarily a software engineering problem, not a data science one. The patterns covered here — a typed FastAPI layer, multi-stage Docker builds, health probes, and automated CI/CD — apply regardless of whether the model is a simple scikit-learn pipeline, a fine-tuned transformer, or a computer vision model running on GPU.

Start with a CPU build and a single container. Add Nginx when you need rate limiting or TLS. Add GPU support only when your latency numbers demand it. Ship the simplest thing that serves reliably, then scale the bottleneck.

Hamza Boughanim – AI/ML Engineer & Full-Stack Developer