MLOps

ML models in production — versioned, monitored and retrained without heroics.

A model that lives in a notebook is not in production — it is a prototype. Getting it live means an inference API that handles load, a versioning strategy so you can roll back a bad release, monitoring for data drift and degradation, and a retraining pipeline that is not dependent on a single engineer running a script by hand. This is the difference between a model that works at the demo and a model that keeps working six months later.

We build the MLOps layer around your model regardless of the framework it was built in: containerised model serving, experiment tracking with reproducible lineage, deployment via feature flags or shadow mode to validate before full cut-over, and the drift detection that tells you when the world has changed enough that the model needs updating. Everything runs on your infrastructure and is operable by your team without us in the loop.

what's included

What this covers

Model packaging and containerised serving (FastAPI, TorchServe, TF Serving, custom)

Experiment tracking and model registry with MLflow or Weights & Biases

Automated retraining pipelines triggered by schedule or data-drift thresholds

Deployment strategies: blue-green, canary and shadow mode for safe model roll-outs

Data drift and model performance monitoring with alerting on degradation

Feature store setup for consistent feature computation between training and serving

GPU and inference cost optimisation: batching, caching and right-sized compute

what you get

Deliverables

A production inference API serving your model with latency SLA and auto-scaling
Experiment tracking and model registry with full lineage from data to deployment
Automated retraining pipeline with drift detection triggers and approval gate
Monitoring dashboard covering prediction latency, throughput, cost and model performance

tools & stack

What we build with

MLflow, Weights & Biases, Kubeflow, Argo Workflows, FastAPI, TorchServe, Triton Inference Server, Docker, Kubernetes, AWS SageMaker, GCP Vertex AI, Prometheus, Grafana, Evidently AI

what we mean

MLOps

MLOps is the discipline of treating a machine learning model as a software artefact: versioned, testable, deployable, and observable in production. It occupies the boundary between model development (data science) and platform engineering — its job is to eliminate the gap between a model that performs on a held-out test set and a model that keeps performing on real production data six months after deployment.

The scope is bounded: MLOps does not own the model architecture or the training algorithm — those belong to the ML team. It owns the infrastructure that makes training reproducible, the registry that tracks what is deployed, the inference serving layer that meets a latency SLA, and the monitoring that detects when the real-world input distribution has shifted far enough from training data that the model's predictions can no longer be trusted.

how we work

MLOps Delivery Lifecycle

The four stages below cover the path from an existing trained model to a production inference system with automated retraining and drift detection. Each stage is a prerequisite for the next — a model registry is not optional scaffolding, it is the address book that deployment and monitoring both depend on.

Experiment Tracking & Model Registry

Before a model can be deployed or retrained reproducibly, every training run must produce a complete, queryable record: the dataset version, hyperparameters, environment, and evaluation metrics that produced the artefact. The model registry is the single source of truth that maps a registered model version to the run that produced it.

MLflow or Weights & Biases tracking server deployed with a versioned backend store (S3 / GCS for artefacts, Postgres for metadata)
Instrumentation of existing training code to log parameters, metrics, and the trained model artefact to the tracking server with a single decorator or context manager
Model registration workflow: promotion from an experiment run to a named model version in the registry, with stage transitions (Staging, Production, Archived) requiring explicit sign-off
Dataset versioning with DVC or artefact lineage tags so every registered model version is traceable to the exact data snapshot it was trained on
Reproducibility check: a registered model version must be retrainable from its recorded parameters and data reference without any tribal knowledge
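
As a rough illustration of the record-keeping and sign-off rules listed above, the sketch below models a run record and a registry in plain Python. It is not the MLflow or Weights & Biases API: `RunRecord`, `ModelVersion`, `Registry` and `promote` are hypothetical names chosen for this example, and a real registry would persist this state rather than hold it in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Stage names follow the list above; "None" marks a freshly registered version.
STAGES = ("None", "Staging", "Production", "Archived")


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of what a training run needs for a later retrain."""
    run_id: str
    dataset_version: str        # e.g. a DVC revision or snapshot tag
    params: Dict[str, float]
    metrics: Dict[str, float]
    environment: str            # e.g. a lock-file hash or image digest


@dataclass
class ModelVersion:
    version: int
    run: RunRecord
    stage: str = "None"
    approved_by: Optional[str] = None


@dataclass
class Registry:
    """In-memory stand-in for a model registry (illustration only)."""
    versions: Dict[int, ModelVersion] = field(default_factory=dict)

    def register(self, run: RunRecord) -> ModelVersion:
        # Every registered version keeps a reference to the run that produced it.
        model_version = ModelVersion(version=len(self.versions) + 1, run=run)
        self.versions[model_version.version] = model_version
        return model_version

    def promote(self, version: int, stage: str, approved_by: Optional[str]) -> ModelVersion:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if not approved_by:
            # Stage transitions are refused without a named approver.
            raise PermissionError("stage transitions require explicit sign-off")
        model_version = self.versions[version]
        model_version.stage = stage
        model_version.approved_by = approved_by
        return model_version
```

The point of the design is that a version cannot exist without its run record, and a stage cannot change without a name attached to the change.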

Model Packaging & Inference Serving

A model registered in MLflow is a stored artefact, not a running service. Making it serve HTTP traffic under a latency SLA requires a containerised serving layer with a defined interface, resource limits, and autoscaling behaviour. The serving contract is established here and must not change without a versioned API update.

Model packaging into a Docker image using the MLflow model flavour or a custom FastAPI wrapper for models that require pre- or post-processing logic outside the model artefact
Latency SLA definition and profiling: P50/P95/P99 latency measured under representative load before deployment commitment
Batching strategy: synchronous single-sample inference for sub-100 ms SLA; dynamic micro-batching (Triton Inference Server) for GPU-backed models where batching reduces per-sample cost
Horizontal autoscaling configuration: HPA on request rate or GPU utilisation; KEDA ScaledObject if inference requests queue in a message broker
GPU scheduling for deep learning models: NVIDIA device plugin, resource requests with GPU fraction where fractional GPU is available (MIG on A100)
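
The latency profiling step above can be sketched as follows. This is a minimal example that assumes latency samples have already been collected by a load test; `percentile`, `latency_profile` and `meets_sla` are hypothetical helper names, and the nearest-rank method used here is one of several common percentile definitions.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_profile(samples_ms: Sequence[float]) -> Dict[str, float]:
    """P50/P95/P99 of the measured request latencies, in milliseconds."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


def meets_sla(samples_ms: Sequence[float], p99_budget_ms: float) -> bool:
    """True when the measured P99 is within the agreed budget."""
    return latency_profile(samples_ms)["p99"] <= p99_budget_ms
```

The profile is only as representative as the load that produced the samples, which is why it is measured before any SLA is committed to.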

Deployment Strategy & Shadow Mode

Moving a new model version into production without validating it on real traffic first is the MLOps equivalent of deploying code without tests. Shadow deployment and canary rollout provide the validation gate between a good test-set metric and a confirmed production improvement.

Shadow deployment: new model version receives a copy of production traffic and logs predictions without serving them; live predictions still come from the current champion model
Champion-challenger comparison: shadow predictions are compared against champion predictions and against ground truth where labels arrive asynchronously (e.g. conversion events, downstream decisions)
Canary rollout: once shadow validation passes, the challenger serves 5-10% of live traffic, with an analysis gate checking business-critical metrics (click-through rate, fraud precision, recommendation acceptance) before full promotion
Feature store consistency check: training features and serving features are generated by the same transformation logic — any serving/training skew is a blocking defect, detected by comparing feature distributions at serving time against training-time statistics
Rollback protocol: champion model is never decommissioned until challenger completes a full promotion cycle; reverting is a registry pointer update, not a retraining run
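
A minimal sketch of the shadow pattern described above, assuming models can be called as plain functions that map a feature dictionary to a score. `ShadowRouter` and `agreement_rate` are hypothetical names for this example; a production setup would normally mirror traffic asynchronously so the challenger adds no latency to the live path.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical model interface for illustration: features in, score out.
Model = Callable[[Dict[str, Any]], float]


class ShadowRouter:
    """Serve the champion; run the challenger on the same input and only log it."""

    def __init__(self, champion: Model, challenger: Model) -> None:
        self.champion = champion
        self.challenger = challenger
        self.log: List[Dict[str, Any]] = []

    def predict(self, features: Dict[str, Any]) -> float:
        served = self.champion(features)
        shadow: Optional[float] = None
        error: Optional[str] = None
        try:
            shadow = self.challenger(features)
        except Exception as exc:
            # A challenger failure is recorded but never reaches the caller.
            error = repr(exc)
        self.log.append({"served": served, "shadow": shadow, "error": error})
        return served

    def agreement_rate(self, tolerance: float) -> float:
        """Share of logged requests where both models agree within the tolerance."""
        compared = [r for r in self.log if r["shadow"] is not None]
        if not compared:
            return 0.0
        agree = sum(abs(r["served"] - r["shadow"]) <= tolerance for r in compared)
        return agree / len(compared)
```

Agreement with the champion is only a first signal; the comparison against ground truth happens later, once labels arrive.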

Drift Detection & Retraining Automation

A model is a snapshot of a statistical relationship that existed in historical data. As the world changes — user behaviour shifts, upstream data pipelines change schema, seasonal patterns emerge — that relationship degrades. Drift detection operationalises the question: 'is this model still fit for purpose?' and answers it continuously, not on a quarterly review schedule.

Data drift monitoring with Evidently AI: population stability index (PSI) on categorical features, Kolmogorov-Smirnov test on continuous features, computed on a rolling 7-day window against training baseline distribution
Prediction drift monitoring: distribution shift in model output (prediction score histogram) as an early warning before ground-truth labels arrive
Concept drift detection where labels are available with low latency: performance metrics (AUC, F1, precision at K) computed on a rolling evaluation window, alerting when they breach the pre-agreed degradation threshold
Retraining pipeline trigger: Argo Workflow or Kubeflow Pipeline DAG triggered by a drift alert or a cron schedule, executing data pull, preprocessing, training, evaluation, and registry promotion gated on metric pass/fail
Human-in-the-loop gate before production promotion of a retrained model: automated evaluation determines candidacy, a platform engineer approves the registry stage transition to Production
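
To make the two drift statistics named above concrete, here is a stdlib-only sketch of PSI for a categorical feature and the two-sample Kolmogorov-Smirnov statistic for a continuous one. It is not Evidently AI's API; the function names and the `eps` floor for empty categories are assumptions made for this example, and alert thresholds would be agreed per model rather than hard-coded.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def psi(baseline: Sequence[Hashable], current: Sequence[Hashable], eps: float = 1e-6) -> float:
    """Population stability index over the categories seen in either sample."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    base_counts, cur_counts = Counter(baseline), Counter(current)
    total = 0.0
    for category in set(base_counts) | set(cur_counts):
        # eps keeps the logarithm defined when a category is absent from one sample.
        expected = max(base_counts[category] / len(baseline), eps)
        actual = max(cur_counts[category] / len(current), eps)
        total += (actual - expected) * math.log(actual / expected)
    return total


def ks_statistic(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(baseline), sorted(current)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x, then compare the CDF values at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap
```

In the rolling-window setup described above, `baseline` would be the training sample and `current` the most recent window of serving data.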

how we think

MLOps Engineering Principles

These constraints govern every MLOps engagement. They exist to prevent the most common failure mode in ML systems: a model that is impressive at demo time and unreliable or unobservable in production.

Treat training runs as immutable experiments, not scripts

A training run that cannot be reproduced from its recorded parameters and dataset reference is not a run — it is a black box with a metric attached. Every run is logged to the tracking server with the full parameter set, the dataset version, the environment hash, and the resulting artefact digest. If a run cannot be recreated by someone who was not present when it ran, it has not been logged correctly. This is what makes champion-challenger comparisons trustworthy: both models are known quantities.
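
One way to make "known quantity" checkable is to derive a fingerprint from the recorded inputs of a run, so two runs with the same parameters, data and environment can be recognised as the same experiment. The sketch below assumes parameters are JSON-serialisable; `run_fingerprint` and `artefact_digest` are hypothetical helpers, not part of any tracking server's API.

```python
import hashlib
import json
from typing import Any, Dict


def artefact_digest(payload: bytes) -> str:
    """SHA-256 digest of the serialised model artefact."""
    return hashlib.sha256(payload).hexdigest()


def run_fingerprint(params: Dict[str, Any], dataset_version: str, environment_hash: str) -> str:
    """Stable hash of everything that defines a run's inputs."""
    canonical = json.dumps(
        {
            "params": params,
            "dataset_version": dataset_version,
            "environment_hash": environment_hash,
        },
        sort_keys=True,            # key order must not change the fingerprint
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Logging the fingerprint next to the artefact digest lets anyone confirm that a retrained artefact came from the same recorded inputs.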

The feature store is the contract between training and serving

Training-serving skew — the model was trained on features computed one way and is served features computed a different way — is one of the hardest bugs to diagnose in an ML system because it often manifests as gradually degrading performance rather than an error. A feature store enforces a single feature computation path for both training (offline retrieval) and serving (online retrieval). Any divergence between the two is a contract violation caught in the shadow deployment phase, not discovered months later in a drift alert.
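
A minimal sketch of the single-computation-path idea, without a real feature store: both the offline and online paths call the same function, and a simple check compares serving-time feature means against training-time means. The feature names, the raw fields `amount` and `items`, and the relative-tolerance check are all assumptions for illustration; a production check would compare full distributions.

```python
from statistics import mean
from typing import Dict, List, Sequence


def compute_features(raw: Dict[str, float]) -> Dict[str, float]:
    """Single source of feature logic, used by both training and serving."""
    return {
        "amount_per_item": raw["amount"] / max(raw["items"], 1),
        "is_large_order": 1.0 if raw["amount"] > 100 else 0.0,
    }


def offline_features(rows: Sequence[Dict[str, float]]) -> List[Dict[str, float]]:
    """Training path: batch computation over historical rows."""
    return [compute_features(row) for row in rows]


def online_features(request: Dict[str, float]) -> Dict[str, float]:
    """Serving path: the same logic applied to one request."""
    return compute_features(request)


def skewed_features(
    training: Sequence[Dict[str, float]],
    serving: Sequence[Dict[str, float]],
    rel_tolerance: float,
) -> List[str]:
    """Names of features whose serving-time mean departs from the training-time mean."""
    flagged = []
    for name in training[0]:
        train_mean = mean(row[name] for row in training)
        serve_mean = mean(row[name] for row in serving)
        scale = max(abs(train_mean), 1e-9)   # avoid dividing by zero
        if abs(serve_mean - train_mean) / scale > rel_tolerance:
            flagged.append(name)
    return flagged
```

Because both paths import one function, skew can only come from the input data or from a path that bypasses it, which narrows the search considerably.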

Shadow deployment is not optional for consequential models

A new model version that passes evaluation on a held-out test set has proven it performs on historical data. It has not proven it performs on the live distribution, which may have shifted since the training data was collected, may include edge cases not represented in the test set, or may trigger unexpected behaviour from downstream systems. Shadow deployment validates the challenger against current production traffic with zero risk to live users before the first live request is routed. Skipping this step trades a short deployment cycle for an unbounded production incident risk.

Drift alerts must route to a decision, not just a dashboard

A drift alert that fires into a Slack channel and triggers a manual investigation is better than no drift detection — but only marginally. Drift should route to an automated retraining pipeline with a defined evaluation gate: if the retrained model passes the threshold, it enters the registry as a candidate; if it fails, a human is paged with the evaluation report. The goal is that a detected drift event results in a corrective action within a defined SLA, not in someone noting it on the next sprint planning call.
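
The routing rule described above can be sketched as a small decision function. This is an illustration under stated assumptions: `DriftAlert`, `Action` and `route` are hypothetical names, and `retrain_and_evaluate` stands in for whatever pipeline run produces the retrained model's evaluation metric.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    NO_ACTION = "no_action"
    REGISTER_CANDIDATE = "register_candidate"
    PAGE_HUMAN = "page_human"


@dataclass(frozen=True)
class DriftAlert:
    feature: str
    score: float        # drift statistic reported by the monitor
    threshold: float    # pre-agreed alerting threshold


def route(alert: DriftAlert, retrain_and_evaluate: Callable[[], float], min_metric: float) -> Action:
    """Turn a drift alert into one of three outcomes, never just a notification."""
    if alert.score <= alert.threshold:
        return Action.NO_ACTION
    metric = retrain_and_evaluate()   # hypothetical hook into the retraining pipeline
    if metric >= min_metric:
        return Action.REGISTER_CANDIDATE
    return Action.PAGE_HUMAN
```

Every branch ends in a named outcome, which is what allows an SLA to be attached to the time between alert and corrective action.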

decision guides

How we'd choose

There's rarely one right answer — these are the trade-offs we weigh before recommending an approach.

Self-hosted MLflow / Kubeflow vs Managed MLOps (SageMaker / Vertex AI)

The build-vs-buy decision for MLOps infrastructure is frequently made without accounting for the full operational cost of self-hosting. This guide makes the hidden costs visible on both sides before the architecture is committed.

Criterion	Self-hosted (MLflow + Kubeflow / Argo Workflows)	Managed MLOps (AWS SageMaker / GCP Vertex AI)
Infrastructure ownership burden	Full ownership: Kubernetes cluster, tracking server database, artefact store, and pipeline orchestrator all require provisioning, upgrading, and on-call coverage by the platform team	Control plane is fully managed; team owns pipeline definitions and model code, not the underlying orchestration infrastructure; upgrades are vendor-managed with deprecation notices
Vendor lock-in	Low: the MLflow model format is open source and ONNX is an open format for model serialisation; migrating to a different execution layer (Argo to Kubeflow, or to a managed service) requires a pipeline rewrite, but model artefacts are portable	High: SageMaker Training Jobs and Vertex AI custom training containers use provider-specific APIs; feature store schemas and pipeline definitions are not portable; migration cost is significant at scale
GPU scheduling support	Manual configuration: NVIDIA device plugin, node taints for GPU nodes, resource requests per pod; fractional GPU via MIG requires additional setup; Spot GPU instance handling requires eviction-tolerant training code	First-class: SageMaker Training Jobs and Vertex AI Training abstract GPU provisioning; Spot/Preemptible instance handling with automatic checkpoint resume is a managed feature; no manual device plugin required
Feature store availability	Self-built or third-party (Feast on Kubernetes): operational overhead is significant; online store (Redis) and offline store (S3/GCS) require separate provisioning and consistency guarantees are the team's responsibility	Managed feature store included: SageMaker Feature Store and Vertex AI Feature Store provide online and offline storage with point-in-time correct lookups and consistency guarantees as a service
Cost at low experiment volume	Fixed cluster baseline cost regardless of experiment frequency; economical only when the team runs experiments continuously at high utilisation; idle tracking server and pipeline controller waste spend at low volume	Pay-per-use for training jobs and pipeline runs; near-zero cost when no experiments are running; most cost-effective for teams with bursty or low-volume experiment cadence
Operational expertise required	High: requires engineers comfortable with Kubernetes administration, Helm chart management, database backup and recovery, and distributed systems debugging — all for infrastructure that does not ship product features	Medium: requires understanding the managed service API and its constraints, but not the underlying distributed systems; most operational incidents are handled by the provider, not the team

what we avoid

MLOps Failure Modes

Notebook-as-deployment-pipeline

Running a Jupyter notebook manually to retrain a model and then uploading the resulting file to a shared drive is a deployment pipeline only in the sense that it eventually produces a deployed model. It has no versioning, no reproducibility, no parameter logging, no evaluation gate, and no record of what data was used. The model in production is not traceable to the run that produced it. When it degrades, nobody knows which version is running, what it was trained on, or how to produce the same artefact again.

Training code is extracted from notebooks into Python scripts or packages, parameterised via config files or CLI arguments, and executed by an orchestrated pipeline (Argo Workflows, Kubeflow Pipelines, or SageMaker Pipelines). Every invocation logs to the tracking server. The trained artefact is registered in the model registry with the run ID attached. The notebook survives as an exploration and visualisation tool — not as the production training path.

Ignoring cold-start cost on GPU inference

Deploying a large deep learning model on a GPU-backed Kubernetes node with HPA targeting CPU utilisation produces a system that auto-scales correctly under sustained load but accumulates minutes of cold-start latency when a new GPU node must be provisioned to handle a traffic spike. The P99 latency during scale-out is measured in minutes, not milliseconds, because GPU node provisioning is slower than CPU node provisioning and the model loading time adds further delay after the node is ready.

For latency-sensitive GPU inference, maintain a minimum of one warm replica at all times and keep a small buffer of GPU nodes pre-provisioned, for example with low-priority placeholder pods that hold capacity on nodes managed by Karpenter or the cluster autoscaler. For batch or asynchronous inference where latency is not a P99 concern, queue the requests in a broker (SQS, Pub/Sub) and use KEDA to scale replicas on queue depth, so that cold-start delay is absorbed by the queue, not felt by the caller. The serving vs batching decision is made at design time, not discovered under the first production load spike.
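
Scaling on queue depth comes down to simple arithmetic, sketched below. This approximates the calculation a queue-based autoscaler performs and is not KEDA's code; `desired_replicas` and its parameters are hypothetical names, with `min_replicas` playing the role of the warm-replica floor.

```python
import math


def desired_replicas(
    queue_depth: int,
    target_per_replica: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replicas needed so each one handles roughly target_per_replica queued requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    wanted = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured floor (warm replicas) and ceiling (cost cap).
    return max(min_replicas, min(max_replicas, wanted))
```

A floor of one keeps a warm replica for latency-sensitive serving; a floor of zero suits batch workloads where the queue can absorb the cold start.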

Promoting models without a signed-off degradation threshold

Retraining pipelines that promote a model automatically if it beats the previous version on a single metric (AUC, RMSE) optimise for that metric in isolation. A model that improves AUC by 0.3% while increasing P99 inference latency by 40 ms, or that improves overall precision while dramatically worsening recall on a minority class, will be promoted automatically and the regression will be discovered in production, either through a latency SLA breach or through a business metric decline that surfaces several weeks after the technical signal.

The evaluation gate in the retraining pipeline checks a signed-off scorecard: primary business metric, secondary quality metrics (precision/recall breakdown by segment), latency P99 under representative load, and model size. Passing requires all criteria to be met, not just the primary metric. The scorecard is agreed with the ML and product teams before the first retraining run and is checked into source control alongside the pipeline definition — it is not a verbal understanding between engineers.
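
A scorecard of this kind can be expressed as data and checked mechanically. The sketch below is one possible shape, with hypothetical names (`Criterion`, `evaluate`) and illustrative metric names; the actual criteria and thresholds are whatever the ML and product teams sign off.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Criterion:
    metric: str
    threshold: float
    higher_is_better: bool = True

    def passes(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def evaluate(scorecard: List[Criterion], results: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Pass only when every criterion is met; a missing metric counts as a failure."""
    failures = []
    for criterion in scorecard:
        if criterion.metric not in results:
            failures.append(f"{criterion.metric}: missing")
        elif not criterion.passes(results[criterion.metric]):
            failures.append(
                f"{criterion.metric}: {results[criterion.metric]} vs threshold {criterion.threshold}"
            )
    return (not failures, failures)
```

Keeping the scorecard as a plain list makes it easy to store in source control next to the pipeline definition and to review changes to it like any other code.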

good to know

Common questions

We have a working model but no deployment pipeline — where do you start?

We start with a containerised inference API and basic monitoring so the model is live and observable. From there we add the retraining pipeline, drift detection and experiment tracking in priority order based on how much manual effort each saves your team.

How do you handle model versioning and rollbacks?

Every trained model is registered with a version number, the training run that produced it and the dataset snapshot it was trained on. Deployment is a pointer update in the model registry. Rolling back is pointing that pointer at the previous version — a one-minute operation, not a retraining job.
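
The pointer idea can be shown in a few lines. This is a sketch with a hypothetical `DeploymentPointer` class, not a registry API; in practice the pointer would be a registry stage or alias that the serving layer reads.

```python
from typing import List, Optional


class DeploymentPointer:
    """Tracks which registered model version is live, with history for rollback."""

    def __init__(self) -> None:
        self._history: List[int] = []

    @property
    def current(self) -> Optional[int]:
        return self._history[-1] if self._history else None

    def deploy(self, version: int) -> None:
        # Deploying only moves the pointer; the artefact is already in the registry.
        self._history.append(version)

    def rollback(self) -> int:
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]
```

Rollback stays cheap only as long as the previous version's artefact and serving image are retained, which is why the champion is kept until the challenger is fully promoted.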

Have something in mind?

Tell us what you're building or stuck on. The first consultation is free — no obligation, no hard sell.
