MLOps in Practice: Deploying AI Models with Docker and FastAPI
2026-07-05 · 13 min read
A complete production guide to containerizing and serving machine learning models — covering FastAPI model servers, multi-stage Docker builds, GPU support, heal
Introduction: The Gap Between Notebook and Production
Most machine learning projects die in notebooks. A model achieves 94% accuracy on a validation set, the data scientist shares a screenshot, and then nothing ships. The gap between a working notebook and a reliable production system is where most ML value is destroyed.
MLOps — Machine Learning Operations — is the discipline that closes this gap. At its core, MLOps asks a simple question: how do we serve a model reliably, repeatedly, and at scale? This guide answers that question with a concrete stack: FastAPI for the serving layer and Docker for packaging and deployment. By the end, you will have a production-ready ML service that can run on any machine, from a local workstation to a cloud VM, with no environment surprises.
Architecture Overview
Before writing a single line of code, it is worth understanding what we are building. The system has three layers:
- Model layer — a serialized ML model (scikit-learn, PyTorch, ONNX) loaded into memory once at startup
- API layer — a FastAPI application that exposes inference endpoints, validates input, and returns structured predictions
- Infrastructure layer — Docker containers that package both layers with their exact dependencies, making the service reproducible everywhere
+--------------------------------------------------+
|                   Docker Host                    |
|                                                  |
|  +-----------------+     +--------------------+  |
|  |   FastAPI App   |     |   Model Storage    |  |
|  |                 |     |   (volume mount)   |  |
|  |  /predict  -----+---> |   model.pt         |  |
|  |  /health        |     |   tokenizer/       |  |
|  |  /metrics       |     +--------------------+  |
|  +--------+--------+                             |
|           | :8000                                |
+-----------+--------------------------------------+
            |
   Client / Load Balancer / Nginx
1. Building the FastAPI Model Server
FastAPI is the right choice for ML APIs for three reasons: automatic request validation via Pydantic, async support for concurrent requests, and automatic OpenAPI documentation generation. Let us build a text classification server as a concrete example.
Project Structure
ml-service/
├── app/
│ ├── __init__.py
│ ├── main.py ← FastAPI application
│ ├── model.py ← Model loading and inference logic
│ ├── schemas.py ← Pydantic request/response models
│ └── config.py ← Settings via environment variables
├── models/
│ └── classifier.pkl ← Serialized model file
├── tests/
│ └── test_api.py
├── Dockerfile
├── docker-compose.yml
└── requirements.txt
Pydantic Schemas — schemas.py
from typing import List

from pydantic import BaseModel, Field


class PredictRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=4096,
                      description="Input text to classify")
    top_k: int = Field(default=1, ge=1, le=5,
                       description="Number of top predictions to return")


class Prediction(BaseModel):
    label: str
    confidence: float
    rank: int


class PredictResponse(BaseModel):
    predictions: List[Prediction]
    model_version: str
    processing_time_ms: float


class HealthResponse(BaseModel):
    status: str
    model_loaded: bool
    uptime_seconds: float
Model Wrapper — model.py
import logging
import pickle
import time
from pathlib import Path
from typing import List, Tuple

logger = logging.getLogger(__name__)


class ClassifierModel:
    """
    Singleton wrapper around a scikit-learn classifier.
    Loads once at startup; all requests share the same instance.
    The instance is created at import time, before any request is served.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def load(self, model_path: Path) -> None:
        logger.info(f"Loading model from {model_path}")
        # pickle can execute arbitrary code on load: only load trusted model files
        with open(model_path, "rb") as f:
            self._pipeline = pickle.load(f)
        self._loaded_at = time.time()
        self.version = model_path.stem  # e.g. "classifier" for classifier.pkl
        logger.info("Model loaded successfully")

    @property
    def is_loaded(self) -> bool:
        return hasattr(self, "_pipeline")

    def predict(self, text: str, top_k: int = 1) -> List[Tuple[str, float]]:
        if not self.is_loaded:
            raise RuntimeError("Model is not loaded")
        # Probabilities for the single input, paired with class labels
        proba = self._pipeline.predict_proba([text])[0]
        classes = self._pipeline.classes_
        ranked = sorted(zip(classes, proba), key=lambda x: x[1], reverse=True)
        return ranked[:top_k]


model = ClassifierModel()
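To see what predict returns without a trained model, the ranking step can be exercised against a stand-in pipeline. This is a minimal sketch: StubPipeline and its fixed probabilities are hypothetical, and rank_top_k simply mirrors the sorting logic in predict above.

```python
from typing import List, Sequence, Tuple


class StubPipeline:
    """Hypothetical stand-in exposing the two members predict() relies on."""

    classes_ = ["negative", "neutral", "positive"]

    def predict_proba(self, texts: Sequence[str]) -> List[List[float]]:
        # Fixed probabilities for illustration only; a real pipeline computes these.
        return [[0.1, 0.2, 0.7] for _ in texts]


def rank_top_k(pipeline, text: str, top_k: int = 1) -> List[Tuple[str, float]]:
    """Pair each class with its probability and keep the top_k most likely."""
    proba = pipeline.predict_proba([text])[0]
    ranked = sorted(zip(pipeline.classes_, proba), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]


print(rank_top_k(StubPipeline(), "great product", top_k=2))
# [('positive', 0.7), ('neutral', 0.2)]
```

The result is a list of (label, probability) tuples in descending order, which is the shape main.py turns into Prediction objects.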
FastAPI Application — main.py
import logging
import time
from contextlib import asynccontextmanager
from pathlib import Path

from fastapi import FastAPI, HTTPException, status
from fastapi.middleware.cors import CORSMiddleware

from app.config import settings
from app.model import model
from app.schemas import HealthResponse, Prediction, PredictRequest, PredictResponse

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
_start_time = time.time()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Runs once per worker process: load the model before serving requests
    model_path = Path(settings.MODEL_PATH)
    if not model_path.exists():
        raise FileNotFoundError(f"Model not found at {model_path}")
    model.load(model_path)
    logger.info("Service ready")
    yield
    logger.info("Shutting down")


app = FastAPI(
    title="ML Classifier API",
    version="1.0.0",
    lifespan=lifespan,
)

# Wide-open CORS is convenient for development; restrict allow_origins in production
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])


# Plain `def` (not `async def`): FastAPI runs it in a thread pool, so the
# blocking inference call does not stall the event loop for other requests.
@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    if not model.is_loaded:
        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="Model not ready")
    t0 = time.perf_counter()
    try:
        raw = model.predict(request.text, top_k=request.top_k)
    except Exception as e:
        # Log the full traceback, but do not leak internals to the client
        logger.exception("Inference error")
        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail="Inference failed") from e
    elapsed_ms = (time.perf_counter() - t0) * 1000
    predictions = [
        # Cast NumPy scalars to plain Python types before validation
        Prediction(label=str(label), confidence=round(float(conf), 4), rank=i + 1)
        for i, (label, conf) in enumerate(raw)
    ]
    return PredictResponse(
        predictions=predictions,
        model_version=model.version,
        processing_time_ms=round(elapsed_ms, 2),
    )


# Liveness: always answers 200 once the process is up
@app.get("/health", response_model=HealthResponse)
async def health():
    return HealthResponse(
        status="ok" if model.is_loaded else "starting",
        model_loaded=model.is_loaded,
        uptime_seconds=round(time.time() - _start_time, 1),
    )


# Readiness: 503 until the model is loaded, so traffic can be held back
@app.get("/ready")
async def readiness():
    if not model.is_loaded:
        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="Model not loaded yet")
    return {"status": "ready", "model_version": model.version}
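main.py imports settings from app.config, which is listed in the project structure but not shown. One possible shape for it, as a minimal stdlib-only sketch: the field names MODEL_PATH and LOG_LEVEL are taken from the environment variables set in the Compose file, while the default values and the load_settings helper are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    MODEL_PATH: str = "models/classifier.pkl"  # assumed default
    LOG_LEVEL: str = "INFO"                    # assumed default


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    defaults = Settings()
    return Settings(
        MODEL_PATH=env.get("MODEL_PATH", defaults.MODEL_PATH),
        LOG_LEVEL=env.get("LOG_LEVEL", defaults.LOG_LEVEL).upper(),
    )


# Module-level instance so `from app.config import settings` works
settings = load_settings()
```

Passing the environment in as a mapping keeps the loader easy to test. Projects that already depend on Pydantic often use the separate pydantic-settings package for the same job.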
2. Writing the Dockerfile
A naive Dockerfile copies everything into a single fat image. A production Dockerfile uses multi-stage builds to separate the dependency installation from the runtime, resulting in a smaller, more secure final image.
CPU Model — Dockerfile
# Stage 1: dependency builder
FROM python:3.11-slim AS builder
WORKDIR /build
RUN apt-get update && apt-get install -y --no-install-recommends gcc g++ && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
# requirements.txt must include gunicorn and uvicorn, which the CMD below relies on
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: runtime image
FROM python:3.11-slim AS runtime
# Give the system user a real home directory: Python resolves the user
# site-packages relative to it, and that is where the packages are copied.
RUN addgroup --system app && adduser --system --ingroup app --home /home/app app
WORKDIR /app
COPY --from=builder --chown=app:app /root/.local /home/app/.local
COPY --chown=app:app app/ ./app/
COPY --chown=app:app models/ ./models/
USER app
ENV PATH="/home/app/.local/bin:$PATH"
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
EXPOSE 8000
CMD ["gunicorn", "app.main:app", "--worker-class", "uvicorn.workers.UvicornWorker", "--workers", "2", "--bind", "0.0.0.0:8000", "--timeout", "120"]
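The CMD above hardcodes two workers. Each Gunicorn worker is a separate process that loads its own copy of the model, so a sensible count depends on memory as well as CPU cores. One way to make this tunable is a Gunicorn config file, which is a plain Python module passed with `-c gunicorn.conf.py`. In this sketch the setting names bind, workers, worker_class, and timeout are Gunicorn's own; the file name, the MAX_WORKERS variable, and the cap of 4 are assumptions.

```python
import os


def worker_count(cpu_count: int, cap: int = 4) -> int:
    """Gunicorn's documented rule of thumb is (2 x cores) + 1. Cap it, because
    every worker holds its own copy of the model in memory."""
    return max(1, min(2 * cpu_count + 1, cap))


# Settings read by Gunicorn when this file is passed with -c
bind = "0.0.0.0:8000"
worker_class = "uvicorn.workers.UvicornWorker"
timeout = 120
# MAX_WORKERS is a hypothetical override, not a Gunicorn built-in
workers = worker_count(os.cpu_count() or 1, cap=int(os.environ.get("MAX_WORKERS", "4")))
```

With this file in place the CMD would shrink to `["gunicorn", "-c", "gunicorn.conf.py", "app.main:app"]`. For large models, memory per worker is usually the tighter limit, so a low cap is a reasonable starting point.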
GPU Model — Dockerfile.gpu
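For PyTorch or TensorFlow models running on GPU, swap the base image for an NVIDIA CUDA image. The host also needs the NVIDIA Container Toolkit installed, and the container must be started with GPU access (for example, docker run --gpus all).
# GPU base: CUDA 12.1 + cuDNN 8 + Python 3.11
FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 AS runtime
RUN apt-get update && apt-get install -y --no-install-recommends python3.11 python3-pip && rm -rf /var/lib/apt/lists/*
# The remaining steps (non-root user, COPY, ENV, CMD) follow the CPU version.
# Install the dependencies with this image's interpreter rather than copying
# them from python:3.11-slim, since the interpreter paths differ.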
3. Docker Compose: Orchestrating the Full Stack
Docker Compose lets you define and run multiple containers as a single application. This configuration adds an Nginx reverse proxy in front of FastAPI, a common pattern in production deployments.
version: "3.9"

services:
  ml-api:
    build:
      context: .
      dockerfile: Dockerfile
    # Matches the tag pushed by the CI workflow, so `docker compose pull` can fetch it
    image: ghcr.io/your-username/ml-classifier:latest
    container_name: ml_api
    restart: unless-stopped
    volumes:
      - ./models:/app/models:ro
    environment:
      - MODEL_PATH=models/classifier.pkl
      - LOG_LEVEL=INFO
    healthcheck:
      # python:3.11-slim does not ship curl, so probe with the standard library
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s
    networks:
      - ml-network

  nginx:
    image: nginx:1.27-alpine
    container_name: ml_nginx
    restart: unless-stopped
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      ml-api:
        condition: service_healthy
    networks:
      - ml-network

networks:
  ml-network:
    driver: bridge
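A container health probe can also live in a small standard-library script, which avoids depending on extra tools inside the image. A minimal sketch, assuming a hypothetical file healthcheck.py copied into the image; the function names and the default URL are illustrative.

```python
import urllib.error
import urllib.request
from typing import List


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx response, or malformed URL
        return False


def main(argv: List[str]) -> int:
    """Exit code for the probe: 0 marks the container healthy, 1 unhealthy."""
    target = argv[0] if argv else "http://localhost:8000/health"
    return 0 if is_healthy(target) else 1
```

To use it as a script, end the file with `sys.exit(main(sys.argv[1:]))` and point the healthcheck test at `["CMD", "python", "healthcheck.py"]`. Pointing it at /ready instead of /health would hold back dependent services until the model is actually loaded.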
Nginx Configuration — nginx.conf
# Rate limiting: 30 requests per minute per IP.
# limit_req_zone is only valid in the http context. Files in conf.d are
# included there, so the directive must sit outside the server block.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=30r/m;

upstream ml_api {
    server ml-api:8000;
}

server {
    listen 80;

    location /predict {
        limit_req zone=api_limit burst=10 nodelay;
        proxy_pass http://ml_api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 120s;
    }

    location /health {
        proxy_pass http://ml_api;
        access_log off;
    }

    location / {
        proxy_pass http://ml_api;
    }
}
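The interplay of rate=30r/m and burst=10 nodelay is easier to see in a small simulation. This is a simplified model of the leaky-bucket accounting behind limit_req, not Nginx's actual implementation; the LeakyBucket class and its explicit time argument are illustrative only.

```python
from typing import Optional


class LeakyBucket:
    """Simplified per-client limiter: the bucket drains at `rate_per_sec`
    and tolerates up to `burst` excess requests before rejecting."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.excess = 0.0
        self.last: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.last is None:
            # First request from this client is always accepted
            self.last = now
            return True
        # Drain the bucket for the elapsed time, then count this request
        excess = max(0.0, self.excess - self.rate * (now - self.last)) + 1.0
        if excess > self.burst:
            return False  # rejected; state is left unchanged
        self.excess = excess
        self.last = now
        return True


bucket = LeakyBucket(rate_per_sec=30 / 60, burst=10)
accepted = sum(bucket.allow(now=0.0) for _ in range(15))
print(accepted)  # 11: one request plus a burst of 10; the other 4 are rejected
```

After the burst is used up, capacity returns at the configured rate: one more request is accepted every two seconds. Nginx answers rejected requests with status 503 unless limit_req_status says otherwise.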
4. CI/CD with GitHub Actions
A CI/CD pipeline automates the build-test-deploy cycle. This workflow runs on every push to main: it runs the tests, builds the Docker image, pushes it to the GitHub Container Registry, and deploys to production via SSH.
# .github/workflows/deploy.yml
name: Build and Deploy ML Service

on:
  push:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: your-username/ml-classifier  # registry image names must be lowercase

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip
      - run: pip install -r requirements.txt pytest httpx
      - run: pytest tests/ -v

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      # Buildx is required for the GitHub Actions cache backend (type=gha)
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Deploy via SSH
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.PROD_HOST }}
          username: ${{ secrets.PROD_USER }}
          key: ${{ secrets.PROD_SSH_KEY }}
          script: |
            cd /opt/ml-service
            docker compose pull
            docker compose up -d --no-build
            docker system prune -f
5. Running the Service
Local development
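# Build
docker build -t ml-classifier:dev .
# Run with live code reload: the image's default gunicorn CMD does not watch
# for changes, so override it with uvicorn --reload
docker run --rm -p 8000:8000 -v "$(pwd)/app:/app/app" -v "$(pwd)/models:/app/models" ml-classifier:dev uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload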
Production
# Start full stack
docker compose up -d
# Check health
curl http://localhost/health
# Run a prediction
curl -X POST http://localhost/predict -H "Content-Type: application/json" -d '{"text": "This product is excellent!", "top_k": 3}'
# Expected response:
# {
# "predictions": [
# {"label": "positive", "confidence": 0.9821, "rank": 1},
# {"label": "neutral", "confidence": 0.0143, "rank": 2},
# {"label": "negative", "confidence": 0.0036, "rank": 3}
# ],
# "model_version": "classifier",
# "processing_time_ms": 4.72
# }
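The same call can be made from Python without extra dependencies. A minimal client sketch using only the standard library; the helper names build_predict_request and predict are illustrative, and the base URL assumes the Nginx proxy on port 80 as in the curl example.

```python
import json
import urllib.request


def build_predict_request(base_url: str, text: str, top_k: int = 1) -> urllib.request.Request:
    """Build the POST /predict request matching the PredictRequest schema."""
    body = json.dumps({"text": text, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, text: str, top_k: int = 1, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    req = build_predict_request(base_url, text, top_k)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

With the stack running, `predict("http://localhost", "This product is excellent!", top_k=3)["predictions"]` returns the ranked list shown in the expected response. Validation failures (for example an empty text) and rate-limit rejections surface as urllib.error.HTTPError.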
Key Takeaways
Deploying ML models is primarily a software engineering problem, not a data science one. The patterns covered here — a typed FastAPI layer, multi-stage Docker builds, health probes, and automated CI/CD — apply regardless of whether the model is a simple scikit-learn pipeline, a fine-tuned transformer, or a computer vision model running on GPU.
Start with a CPU build and a single container. Add Nginx when you need rate limiting or TLS. Add GPU support only when your latency numbers demand it. Ship the simplest thing that serves reliably, then scale the bottleneck.