04 / cloud

Cloud & DevOps

Scalable infrastructure, automation and reliable delivery.

Building something is half the job. Getting it live, keeping it up at 3 am, and not opening a cloud bill that ruins your month — that is the other half, and it is where small teams get stuck. We treat infrastructure as a first-class part of the build, not an afterthought bolted on before launch.

We set up the cloud architecture most studios skip: reproducible environments provisioned with infrastructure-as-code, automated deployment pipelines, container orchestration that runs identically in development and production, and observability so you hear about problems before your users do. Then we right-size it so you are not paying enterprise prices for a startup's traffic.

We cover the full range, from a single developer's first VPS to multi-region Kubernetes clusters on AWS, GCP or Azure, and we include MLOps as one specialised lane within the broader cloud practice. Everything is deployed on your own accounts, documented, portable, and operable by any competent engineer.

Releases that ship safely, consistently and without downtime drama
Issues surfaced by monitoring — not reported by frustrated users
A cloud bill that scales with your business, not against it

what we do here

Capabilities

Cloud architecture & IaC

AWS, GCP and Azure architecture designed for your workload, provisioned with Terraform so environments are reproducible, auditable and never clicked together by hand.

CI/CD pipelines & automation

Automated test, build and release pipelines in GitHub Actions, GitLab CI or similar — so shipping to production is routine rather than an all-hands event.

Containers & Kubernetes

Docker-based environments and Kubernetes orchestration that run identically in development and production, with zero 'works on my machine' surprises.

Observability & monitoring

Structured logging, metrics, distributed tracing, uptime checks and sensible alerting — you know what is slow or broken before customers report it.

Reliability, SRE & cost optimisation

Right-sized resources, auto-scaling, backup strategies, incident runbooks and a cloud bill that reflects your actual traffic rather than accumulated waste.

Security & compliance posture

IAM least-privilege, secrets management, network segmentation, vulnerability scanning and the audit trails that regulated industries require.

MLOps & model serving

Deploy ML models behind reliable, versioned APIs with retraining hooks, A/B routing and drift monitoring so they keep performing after their initial launch.

what you get

Deliverables

Your application or model live on infrastructure you own and control
Automated deploy pipeline from a merged pull request to production
Infrastructure-as-code repository with full documentation
Monitoring, alerting and backup strategy in place from day one

who it's for

Best suited for

Engineering teams without dedicated platform or DevOps experience
Startups whose infrastructure was clicked together and needs to be made reproducible
Teams whose ML model is stuck in a notebook and needs a production home
Any product that needs to stay reliably online as it grows

tools & stack

What we build with

AWS, GCP, Azure, Docker, Kubernetes, Terraform, GitHub Actions, GitLab CI, Prometheus, Grafana, Datadog, MLflow, Argo CD, Helm, Vault, Vercel

what we mean

DevOps, SRE & Platform Engineering

DevOps is an organisational and technical practice that collapses the wall between software delivery and infrastructure operations. It is not a job title or a toolchain — it is a feedback-loop discipline: shorten the time from code commit to production signal so that teams can learn and correct course faster. Done well, it manifests as measurable DORA metrics (deployment frequency, lead time for change, change failure rate, time to restore), not a Jira board full of automation tickets.
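To make the four DORA metrics concrete, here is a minimal sketch of how they could be computed from a list of deployment records. The `Deployment` record shape and the `dora_metrics` helper are hypothetical names chosen for illustration; a real baseline would pull this data from the CI system and the incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics over an observation window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    # Lead time for changes: commit to production, in seconds.
    lead_times = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]

    failures = [d for d in deployments if d.failed]
    # Time to restore: deployment of the bad change to restoration, in seconds.
    restore_times = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": timedelta(seconds=median(lead_times)),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": (
            timedelta(seconds=median(restore_times)) if restore_times else None
        ),
    }
```

Medians are used rather than means so that one unusually slow change does not dominate the baseline; that is a design choice of this sketch, not a requirement of the metrics themselves.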

Site Reliability Engineering (SRE) picks up where DevOps ends. SRE defines reliability as a feature with a quantified budget: Service Level Objectives and error budgets set an explicit ceiling on how much unreliability is acceptable, which forces a conversation between product and engineering that no amount of 'we care about quality' rhetoric achieves. SRE teams own incident response, postmortem culture, and the toil-elimination mandate: manual, repetitive, automatable work is capped (Google's SRE practice sets the limit at 50 % of each engineer's time), and anything that threatens that cap gets automated.
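The error budget arithmetic is simple enough to sketch. The example below assumes an availability SLO measured over a rolling window in minutes; the function names are hypothetical and chosen only for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO permits over the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9 % SLO over 30 days leaves roughly 43 minutes of budget.
print(round(error_budget_minutes(0.999, 30), 1))
```

Once the remaining fraction approaches zero, the budget gives product and engineering a shared, numeric reason to slow feature work in favour of reliability work.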

Platform Engineering is the newest discipline of the three. It builds and operates the Internal Developer Platform (IDP) that surfaces self-service golden paths to product teams. Where DevOps asks every team to own their pipeline and SRE asks them to write runbooks, Platform Engineering removes those burdens by offering paved roads: templated repositories, curated runtime environments, cost dashboards, and policy guardrails baked into the platform rather than bolted on by audit. The boundary is sharp: Platform Engineering ends at the contract surface (APIs, CLI, portal) it exposes; product teams own what they deploy through it.

how we work

Platform Delivery Lifecycle

Each engagement follows a six-stage progression that mirrors how a platform actually matures — from understanding current pain to handing off an operating model that the team can sustain. Stages are sequential but overlap at their edges; stage outputs feed the next stage as direct inputs.

Foundation Assessment

Audit the current infrastructure posture, delivery pipeline, and operational practice before prescribing anything. This stage exists to prevent importing existing dysfunction into shiny new tooling.

DORA metrics baseline measurement — deployment frequency, lead time, MTTR, change failure rate
Architecture inventory: cloud accounts, regions, VPCs, IAM boundaries, and drift between documented and actual state
Pipeline archaeology: trace a production deployment end-to-end and document every manual step, approval gate, and tribal-knowledge dependency
SLO gap analysis: identify which services have SLIs defined, which have alerts with no SLO backing them, and which have no observability at all
FinOps snapshot: pull 90-day spend by service and team to surface waste and missing cost attribution before architecture decisions are made

Infrastructure Architecture

Design the target-state infrastructure as code, including network topology, identity model, secrets hierarchy, and multi-environment promotion strategy. Every design decision is written as an Architecture Decision Record before implementation begins.

Reference architecture diagram: accounts/projects, VPC layout, transit gateway or VPC peering, egress strategy
IaC module library design: define the Terraform/Pulumi module boundaries and the contract each module exports
Identity and access model: IAM roles per workload, least-privilege boundaries, federation to identity provider
Secrets hierarchy: define which secrets live in Vault, which in cloud-native secret stores, and the rotation schedule for each class
Environment promotion model: trunk-based branching strategy, ephemeral PR environments, staging parity constraints

CI/CD Engineering

Build the delivery pipeline from commit to production, including build standardisation, artefact promotion, and progressive delivery gates. The pipeline is itself versioned and tested — it is not exempt from engineering standards.

Standardised build graphs: Dockerfiles, build caches, SBOM generation at build time
Artefact registry and promotion policy: immutable tags, attestation signing, promotion gates between registries
GitHub Actions / CI platform configuration with reusable workflows and matrix builds
GitOps delivery configuration: Argo CD ApplicationSets or Flux Kustomizations per environment tier
Progressive delivery: canary or blue-green rollout configuration with automated analysis gates (Argo Rollouts + Prometheus metrics)

Observability & SRE

Instrument the platform and applications to produce the signals needed to defend SLOs, not just to produce dashboards. Observability is only complete when an on-call engineer can answer 'what is broken, where, and why' without SSH-ing into a box.

SLI/SLO definition workshops per service: latency, availability, saturation targets with error budget burn-rate alerts
OpenTelemetry collector deployment and SDK instrumentation guide for product teams
Prometheus scrape configs, recording rules, and alerting rules aligned to SLOs (not just thresholds)
Grafana dashboard library: golden signals per service, platform health, cost-per-service overlays
Incident response runbook library and postmortem template embedded in the internal developer portal

DevSecOps Hardening

Embed security controls into the pipeline and the platform so that policy violations are caught before merge, not after a quarterly pen-test. Zero-trust posture means no implicit trust at the network or identity layer.

Supply chain hardening: Trivy image scanning, SBOM attestation with Cosign, dependency vulnerability scanning in CI
Policy-as-code: OPA/Conftest rules for Kubernetes manifests, Terraform plans, and Dockerfile patterns gated at CI
CIS Benchmark baseline for cloud accounts applied as Terraform and enforced via AWS Config / GCP Security Command Center
Zero-trust network policy: Kubernetes NetworkPolicy and cloud VPC firewall rules generated from service mesh or policy engine
Secrets rotation automation: Vault dynamic credentials for databases and cloud APIs, eliminating long-lived static secrets

Day-2 Operations

Transition from delivery to sustained operation — establishing the operating model, toil-elimination backlog, and capacity planning cadence that keeps the platform healthy without heroics.

On-call rotation design and escalation paths documented in the incident management tool
Toil register: catalogue all manual, repetitive operational tasks with estimated hours/week and automation priority
Cost optimisation cadence: monthly FinOps review, rightsizing recommendations, reserved-instance or committed-use planning
Platform changelog and internal developer portal updates to communicate breaking changes and new golden-path offerings
Quarterly reliability review: SLO burn rate trends, error budget consumption, and DORA metrics progression against baseline

how we think

Engineering Principles

These are the opinionated constraints we apply to every Cloud & DevOps engagement. They are not aspirational values — they are guardrails that prevent the most common and expensive failure modes.

Treat infrastructure state as immutable

Servers and managed resources that are modified in-place accumulate undocumented state until no one knows what is actually running. Immutability — replacing rather than patching — eliminates configuration drift as a category of failure. When every change is a replacement provisioned from source-controlled IaC, the blast radius of any mutation is bounded and reversible. This is why we mandate IaC for all persistent resources and prohibit direct console changes outside a documented break-glass procedure.

Define SLOs before instrumenting metrics

Teams that instrument first and define SLOs later end up with hundreds of dashboards and zero actionable alerts. An SLO forces you to answer 'reliable enough for whom, under what conditions, measured how?' before you write a single PromQL query. Starting from the SLO means every metric you collect has a consumer, every alert has a budget, and every on-call page is justified by an error budget burn rate — not just a threshold someone chose arbitrarily.
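As an illustration of paging on burn rate rather than on a raw threshold, the sketch below checks a long and a short window together. The default threshold of 14.4 is the example value the Google SRE Workbook uses for a one-hour window against a 30-day SLO; the function names are hypothetical, and in production this logic would normally live in Prometheus alerting rules rather than application code.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return error_ratio / (1 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert resets quickly after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

With a 99.9 % SLO, a sustained 2 % error ratio in both windows is a burn rate of about 20 and pages; the same long-window figure with a recovered short window does not.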

Security is a property of the pipeline, not a final gate

A security review at the end of a delivery cycle is an approval ritual, not a control. By the time a vulnerability reaches a pre-production scan, it has already been committed, built, tested, and promoted — ripping it out is expensive. Policy-as-code in CI (OPA/Conftest), image scanning at build time (Trivy), and secret detection at pre-commit shift the enforcement point left by weeks. The goal is to make it impossible to merge a non-compliant artefact, not to catch it later.
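To show what a pipeline-enforced policy looks like, here is a small Python stand-in for a 'containers must not run as root' rule. In the setup described above the rule would be written in Rego and evaluated by Conftest; this sketch only mirrors the logic, and `root_violations` is a hypothetical helper operating on a Kubernetes Deployment manifest that has already been parsed into a dict.

```python
def root_violations(manifest: dict) -> list[str]:
    """Return one message per container that is allowed to run as root.

    Simplified check: a container passes only if runAsNonRoot is true
    (set on the container, or inherited from the pod) and it does not
    explicitly request UID 0.
    """
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    pod_non_root = pod_spec.get("securityContext", {}).get("runAsNonRoot", False)

    violations = []
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext", {})
        # A container-level setting overrides the pod-level one.
        non_root = ctx.get("runAsNonRoot", pod_non_root)
        if not non_root or ctx.get("runAsUser") == 0:
            name = container.get("name", "<unnamed>")
            violations.append(f"container '{name}' may run as root")
    return violations


def ci_gate(manifest: dict) -> int:
    """Exit code for a CI step: non-zero blocks the merge."""
    return 1 if root_violations(manifest) else 0
```

Returning a non-zero exit code is what turns the rule from advice into a gate: the pull request cannot merge until the manifest complies.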

Speed without guardrails just ships bad changes faster

High deployment frequency is a DORA research finding about elite teams, not a goal to chase with volume alone. Teams that push to production 50 times a day without automated rollback, feature flags, or canary analysis just discover their mistakes faster and in front of customers. Progressive delivery — canary releases, blue-green deployments, feature flags backed by an error budget burn gate — is what makes high frequency safe rather than reckless.
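A canary analysis gate can be sketched as a small decision function. The version below is a simplified illustration, not the Argo Rollouts implementation: the `Verdict` enum, the `analyse_canary` name and the `min_requests` default are all assumptions made for the example.

```python
from enum import Enum


class Verdict(Enum):
    PROMOTE = "promote"
    ABORT = "abort"
    INCONCLUSIVE = "inconclusive"


def analyse_canary(
    canary_requests: int,
    canary_errors: int,
    slo_target: float,
    min_requests: int = 500,
) -> Verdict:
    """Decide whether a canary may be promoted.

    The canary is promoted only if its observed error ratio stays within
    the error ratio the SLO allows. With too little traffic the result is
    inconclusive rather than a pass.
    """
    if canary_requests < min_requests:
        return Verdict.INCONCLUSIVE
    error_ratio = canary_errors / canary_requests
    allowed_ratio = 1 - slo_target
    return Verdict.PROMOTE if error_ratio <= allowed_ratio else Verdict.ABORT
```

Treating low traffic as inconclusive matters: a canary that served ten requests without an error has not demonstrated anything, so the rollout should wait rather than promote.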

GitOps means the cluster state is a function of the repository

In a true GitOps model, the Kubernetes cluster has no desired state that is not expressed in a Git commit. Operators apply changes by merging pull requests; the GitOps controller (Argo CD or Flux) continuously reconciles actual state to declared state. This eliminates entire classes of 'who changed what and when' incidents, provides a free audit trail, and means rollback is a git revert — not a production kubectl command issued under pressure at 2 AM.
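The reconciliation idea can be illustrated with a toy model in which both the declared state (from Git) and the live state (from the cluster) are plain dictionaries keyed by resource name. This is a conceptual sketch only; `diff_state` and `reconcile` are hypothetical names and bear no relation to the internals of Argo CD or Flux.

```python
def diff_state(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare declared state with live state and list the actions needed."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }


def reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, dict]:
    """Return the live state after one reconciliation pass (inputs are not mutated)."""
    plan = diff_state(desired, actual)
    new_actual = {name: dict(spec) for name, spec in actual.items()}
    for name in plan["create"] + plan["update"]:
        new_actual[name] = dict(desired[name])
    for name in plan["delete"]:
        del new_actual[name]
    return new_actual


# Git declares two workloads; someone scaled "web" by hand and left a debug pod.
desired = {"web": {"image": "web:1.2", "replicas": 3}, "worker": {"image": "worker:0.9", "replicas": 1}}
actual = {"web": {"image": "web:1.2", "replicas": 5}, "debug-pod": {"image": "busybox", "replicas": 1}}
print(diff_state(desired, actual))
```

In this model a rollback is nothing special: reverting the commit changes `desired`, and the next pass converges the live state back to it.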

Cost is an engineering concern, not a finance report

Cloud spend is a direct consequence of architectural choices: instance families, storage tiers, data transfer patterns, idle capacity. Treating it as a finance problem means engineers never see the cost signal until it is too large to ignore. FinOps tooling should be surfaced in the same internal developer portal as deployment status, so the team that kept a gp2 volume where gp3 would have cost less, or left a NAT gateway serving a defunct service, sees the cost attribution before the monthly bill.
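Cost attribution starts with something as plain as grouping spend by an ownership tag. The sketch below assumes billing line items have already been exported into dictionaries with a `monthly_cost` field and a `tags` mapping; that shape, the `team` tag key and the `cost_by_team` helper are illustrative assumptions, not any provider's billing schema.

```python
from collections import defaultdict


def cost_by_team(line_items: list[dict]) -> dict[str, float]:
    """Sum monthly cost per owning team; untagged spend is surfaced, not hidden."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get("team", "unattributed")
        totals[team] += item["monthly_cost"]
    return dict(totals)


def unattributed_share(line_items: list[dict]) -> float:
    """Fraction of total spend that no team owns."""
    totals = cost_by_team(line_items)
    total = sum(totals.values())
    return totals.get("unattributed", 0.0) / total if total else 0.0
```

The unattributed share is often the most useful number in a first review, because spend that nobody owns is spend that nobody will optimise.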

Every secret has a rotation schedule or it is a liability

A static credential that never rotates is a credential that will eventually be found in a git history, a log file, an environment variable dump, or a breach. Vault dynamic secrets, IRSA/Workload Identity, and cloud-native secret managers make rotation automatic and make long-lived static secrets unnecessary for the majority of use cases. Any service that cannot tolerate credential rotation on a defined schedule has an architectural problem, not a security exemption.
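A rotation schedule is only useful if something checks it. The sketch below flags secrets that are past their maximum age, and treats a secret whose class has no schedule as overdue by definition. The policy values, the `Secret` record and the `overdue_secrets` helper are hypothetical and exist only to illustrate the rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical rotation schedule per secret class.
ROTATION_POLICY = {
    "database": timedelta(days=30),
    "api-key": timedelta(days=90),
}


@dataclass
class Secret:
    name: str
    secret_class: str
    last_rotated: datetime


def overdue_secrets(
    secrets: list[Secret],
    now: datetime,
    policy: dict[str, timedelta] = ROTATION_POLICY,
) -> list[str]:
    """Names of secrets that are past their schedule or have no schedule at all."""
    overdue = []
    for secret in secrets:
        max_age = policy.get(secret.secret_class)
        # No schedule for this class: treat the secret as a liability.
        if max_age is None or now - secret.last_rotated > max_age:
            overdue.append(secret.name)
    return overdue
```

Passing `now` in explicitly keeps the check deterministic and easy to test; a scheduled job would call it with `datetime.now(timezone.utc)`.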

The platform must eat its own dog food

An Internal Developer Platform that the platform team itself does not use is a platform that will not evolve to meet real needs. Platform teams must deploy their own tooling through the same golden paths, observe it through the same dashboards, and carry it in the same on-call rotation as product workloads. This is the fastest feedback loop for discovering where the paved road is actually a gravel track.

our stack & why

Reference Stack

Tool choices are justified by engineering constraints, not popularity. Each selection reflects a specific capability gap it closes or failure mode it eliminates.

IaC & Provisioning

Terraform + Terragrunt: Terraform's provider ecosystem is unmatched; virtually every cloud resource and SaaS API has a maintained provider. Terragrunt layers on top to solve Terraform's native weaknesses: DRY remote-state configuration, dependency ordering across modules, and environment-specific variable inheritance without copy-paste. The combination gives declarative state management at scale without forcing a rewrite into a proprietary graph engine.
Pulumi: Pulumi is the right choice when the team's primary language is TypeScript, Python, or Go and the cognitive overhead of HCL is a genuine impediment to adoption. Its real advantage is first-class support for dynamic resource graphs: generating 50 identical microservice stacks from a loop is idiomatic Pulumi and awkward Terraform. It also integrates natively with existing CI test suites, which matters when the IaC itself needs unit tests.
AWS CDK / GCP Deployment Manager: Used selectively for resources that are deeply coupled to the cloud provider's native constructs and where the CDK's higher-level abstractions (L2/L3 constructs) genuinely reduce boilerplate. Not used as a primary IaC layer for multi-cloud workloads, because the provider-specific DSL becomes a liability the moment a workload needs to span clouds.

CI/CD & Delivery

GitHub Actions: Native to the source control system, which eliminates an entire class of webhook and credential synchronisation problems. Reusable workflows and composite actions provide the modularisation needed to avoid copy-pasting pipeline configuration across 20 repositories. The marketplace is broad enough that most integrations exist as maintained actions, reducing the amount of shell glue engineers have to write and maintain.
Argo CD: Argo CD is one of the most widely adopted GitOps implementations for Kubernetes. ApplicationSets allow a single manifest to generate Application objects for every environment, cluster, and service combination, eliminating per-environment pipeline branches. Its reconciliation loop continuously detects and surfaces drift between the live cluster and the declared Git state, which is the core property GitOps needs to deliver on its promise.
Argo Rollouts: Progressive delivery without a dedicated rollout controller collapses to a binary switch: old is gone, new is live. Argo Rollouts adds canary and blue-green strategies with analysis gates backed by Prometheus metrics or Datadog queries. The gate asks 'is the error rate on the canary population within the SLO error budget?' before promoting, which is what makes high deployment frequency safe rather than reckless.

Containers & Orchestration

Docker / BuildKit: BuildKit's parallel stage execution and cache-mount support can cut build times significantly compared to classic Docker builds. More importantly, the --sbom and --provenance flags of docker buildx build produce SBOM and SLSA provenance attestations at build time without a separate pipeline step, which is the prerequisite for supply-chain policy enforcement downstream.
Kubernetes: Kubernetes is chosen when the workload requires fine-grained scheduling, multi-tenancy with namespace isolation, or a service mesh. Its API extensibility (CRDs, admission webhooks, operators) is what makes it the right substrate for a self-service platform: every capability the platform team wants to surface can be exposed as a Kubernetes resource with a controller managing the lifecycle. It is deliberately not the default for every workload; stateless functions and batch jobs often belong in serverless compute.
Helm + Helmfile: Helm packages Kubernetes manifests into versioned, parameterised charts that can be promoted through environments by changing a value file rather than a manifest. Helmfile declaratively composes multiple releases with shared value inheritance, which solves the environment-fan-out problem that makes raw Helm charts painful at scale. Charts are stored in OCI registries and version-pinned, making the exact set of running manifests auditable from the GitOps repository.

Observability

Prometheus + Alertmanager: Prometheus's pull model and its dimensional label model are what make SLO-based alerting tractable. Multi-window, multi-burn-rate alerts (the pattern recommended by the Google SRE Workbook) require a time-series store that can efficiently evaluate recording rules over several time windows simultaneously. The Prometheus Operator (packaged in kube-prometheus-stack) also means the monitoring stack itself is managed as code and can be templated per cluster.
Grafana: Grafana is the visualisation layer because its dashboard-as-code support (Grafonnet / JSON model in Git) allows dashboards to be version-controlled and promoted alongside the service definitions they describe. The ability to mix Prometheus, Loki, and Tempo data sources in a single panel is essential for correlating a latency spike (metrics) to a specific trace and its associated log lines without switching tools.
OpenTelemetry: OpenTelemetry solves vendor lock-in at the instrumentation layer. SDK instrumentation emits to a collector; the collector routes to whichever backend is current (Jaeger, Tempo, commercial APM). The cost of switching observability backends drops from 'reinstrument every service' to 'change a collector config'. The OTel semantic conventions also standardise the attribute names that SLI queries rely on, which reduces the per-service instrumentation work for product teams.
Loki: Loki indexes only log metadata (labels), not the full log body, which typically keeps storage costs well below those of a full-text index such as Elasticsearch at equivalent log volume. Because its label model mirrors Prometheus labels, the same service/namespace/pod selectors that appear in metrics queries work in log queries, reducing the cognitive load for engineers who switch between signals during an incident.

Security & Policy

OPA / Conftest: Open Policy Agent provides a unified policy language (Rego) that can evaluate Terraform plans, Kubernetes admission requests, and Dockerfile instructions against the same policy corpus. Conftest is the CLI wrapper that runs these policies in CI. The result is that a policy written once ('containers must not run as root') is enforced at plan time, at admission, and in CI, rather than being checked in three different places by three different tools.
Trivy: Trivy scans container images, filesystems, IaC files, and SBOMs for CVEs and misconfigurations with a single tool. Its SBOM export (CycloneDX or SPDX) satisfies supply-chain transparency requirements and feeds into admission policies that can block promotion of images with critical unpatched CVEs. The zero-configuration mode makes it trivial to add to any CI pipeline without a dedicated scanning service.
HashiCorp Vault: Vault's dynamic secret engines (database credentials, AWS IAM credentials, PKI certificates) mean that the secret a service receives exists only for the duration of its lease and is revoked automatically. This eliminates the 'leaked static credential' class of incident. Vault's audit log provides a complete record of every secret access, which is a prerequisite for compliance frameworks (SOC 2, ISO 27001) that require demonstrating least-privilege access.
Cosign + SLSA: Cosign signs container image digests and SBOM attestations with a keyless signing workflow backed by Sigstore's transparency log. An admission webhook that verifies signatures before allowing image deployment closes the gap between 'we scanned the image in CI' and 'we are running the image we scanned', which is the gap where supply-chain attacks live. SLSA provenance levels provide a graduated framework for hardening the build pipeline itself.

Cloud & FinOps

AWS / GCP / Azure: Provider selection is driven by workload requirements and existing organisational investment, not familiarity. AWS is chosen for breadth of managed services and maturity of IAM. GCP is chosen for AI/ML workloads (TPU access, Vertex AI) and data platform integration. Azure is chosen when Active Directory integration and Microsoft licensing are architectural constraints. All three are wrapped in the same IaC layer so the infrastructure code is not harder to port than the application.
AWS Cost Explorer / GCP Billing + Infracost: Infracost runs in CI and posts a cost estimate diff on every pull request that changes infrastructure, so engineers see the monthly cost delta of their change before it merges, not after the bill arrives. Cloud-native billing APIs provide the actuals; Infracost provides the pre-merge signal. Together they close the feedback loop that FinOps practice requires: cost must be visible at the point of decision, not at the point of payment.
Karpenter / KEDA: Karpenter replaces the Kubernetes Cluster Autoscaler with a provisioner that can select the cheapest available instance type for a given workload shape, including Spot instances with automatic diversification across availability zones. KEDA scales workloads on arbitrary event sources (queue depth, Prometheus metrics, cron) rather than just CPU/memory, which eliminates over-provisioned headroom kept to absorb traffic spikes that are actually predictable.
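Event-driven scaling of the kind described above reduces to a small calculation. The sketch below shows the general shape for a queue-backed worker: divide the backlog by the number of messages one replica is expected to handle, round up, and clamp to configured bounds. It is a simplified illustration with hypothetical names and defaults, not KEDA's actual scaling code.

```python
import math


def desired_replicas(
    queue_depth: int,
    target_per_replica: int,
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """Replica count needed to keep each worker at or below its target backlog."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# 120 queued messages at 50 per replica needs 3 workers.
print(desired_replicas(120, 50))  # prints 3
```

Allowing a minimum of zero is what removes idle headroom: with an empty queue the workload scales away entirely instead of keeping spare replicas warm.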

decision guides

How we'd choose

There's rarely one right answer — these are the trade-offs we weigh before recommending an approach.

Kubernetes vs Serverless vs VMs

Compute substrate choice has long-term implications for operational model, cost profile, and capability ceiling. Use this guide to match workload characteristics to the right substrate before committing to a platform.

Criterion	Kubernetes	Serverless (FaaS)	Virtual Machines
Cold-start latency tolerance	Sub-second with pre-warmed pods; not suitable for latency-critical sub-10ms P99	100 ms–2 s cold start; unsuitable for synchronous low-latency APIs without provisioned concurrency	No cold start; process is always running; suitable for any latency profile
Operational complexity	High — control plane, node pools, networking CNI, storage CSI all require expertise	Low — zero infrastructure management; vendor manages runtime, scaling, and patching	Medium — OS patching, image baking, and auto-scaling group configuration are team responsibilities
Cost model at low load	Fixed cluster baseline cost regardless of traffic; economical only above a workload density threshold	Pay-per-invocation; near-zero cost at low request rates; optimal for infrequent or bursty workloads	Fixed per-instance cost; idle VMs waste spend; Reserved Instances reduce this at sustained load
Workload duration	Unlimited duration; suitable for long-running daemons, websocket connections, batch jobs via Job resource	Maximum execution timeout (typically 15 min); unsuitable for long-running streaming or stateful processes	Unlimited duration; no platform-imposed timeout; suitable for any workload length
Multi-tenancy & isolation	Namespace-level isolation with NetworkPolicy and RBAC; strong but requires policy discipline	Isolation per function execution environment (environments are typically reused across invocations of the same function); low shared-state risk between tenants when each tenant has its own functions	Hypervisor-level isolation; strongest boundary; suitable for regulated or hostile-tenant workloads
Self-service platform fit	Excellent — CRDs and operators make Kubernetes the right substrate for an IDP; product teams deploy via GitOps	Moderate — golden-path templates per function type work well; harder to standardise cross-cutting concerns	Poor — VM lifecycle management at scale requires significant tooling investment to match Kubernetes self-service

GitOps vs Push-based CI/CD

Both models can achieve continuous delivery, but they differ in security posture, auditability, and how well they scale to many environments. The choice affects your incident response workflow as much as your deployment workflow.

Criterion	GitOps (pull)	Push-based CI/CD
Credential exposure	CI system holds no cluster credentials; the GitOps agent running inside the cluster pulls desired state from Git — no inbound firewall rules or kubeconfig files in CI secrets	CI runner requires cluster credentials (kubeconfig, service account token) to push changes; credentials must be rotated and are a high-value compromise target
Drift detection	Continuous reconciliation loop detects and surfaces (or auto-corrects) manual changes within the reconciliation interval; drift is a first-class observable	No ongoing reconciliation; drift between pipeline-applied state and actual cluster state is invisible until the next deployment or an incident surfaces it
Rollback speed	Rollback is a git revert merged to the target branch; the reconciler applies the previous state within the sync interval — no human touching kubectl	Rollback requires re-triggering the pipeline with a previous artefact tag, or a manual kubectl apply; speed depends on pipeline queue depth
Multi-cluster fan-out	ApplicationSets (Argo CD) or Flux cluster generators scale a single manifest to N clusters; adding a cluster is a label or a row in a config file	Requires explicit per-cluster stages in the pipeline; each new cluster adds pipeline complexity and another set of credentials to manage
Audit trail	Every change to the cluster is a Git commit with author, timestamp, and diff — the Git log is the audit log with no additional tooling	Audit trail lives in CI logs; correlating a production change to a pipeline run requires cross-system query and depends on log retention policy

Terraform vs Pulumi vs CDK

IaC tool selection should be driven by team language fluency, the complexity of the resource graph, and how tightly the workload is coupled to a single cloud provider.

Criterion	Terraform (HCL)	Pulumi (TS/Python/Go)	AWS CDK (TypeScript)
Language & learning curve	HCL is purpose-built and readable, but its loops and conditionals (count, for_each, for expressions, ternaries) are limited compared with a general-purpose language, and workarounds built on them can be brittle	Uses the team's existing language (TypeScript, Python, or Go) with full IDE support, unit testing, and familiar patterns; steeper initial Pulumi SDK learning curve	TypeScript-first, with Python, Java, C#, and Go also supported; constructs are idiomatic but the abstraction levels (L1/L2/L3) require understanding how CloudFormation renders them
Provider ecosystem	Broadest ecosystem: Terraform Registry has thousands of providers including every major cloud, SaaS, and database; most stable and battle-tested	Supports most Terraform providers via the compatibility bridge plus native Pulumi providers; bridge providers occasionally lag on new resources	AWS-only by design; no multi-cloud support; tight integration with CloudFormation means new AWS resources are available as L1 constructs as soon as CloudFormation supports them
Multi-cloud support	First-class — providers for AWS, GCP, Azure, Kubernetes, and hundreds of SaaS platforms; the canonical choice for multi-cloud IaC	First-class — same provider breadth as Terraform via bridge; native providers for major clouds; multi-cloud is idiomatic	None — CDK is AWS-specific; CDK for Terraform (cdktf) and CDK for Kubernetes (cdk8s) exist but are separate tools with different maturity
Dynamic resource graphs	Static graph computed before apply; count and for_each are limited; generating 50 services from a data structure requires complex module composition	Resource graph is evaluated at runtime using real language constructs; a for-loop over a config array is idiomatic and produces clean plan output	L3 constructs (Patterns) encapsulate repeating patterns well; lower-level dynamic generation is possible but requires understanding CloudFormation limits
State management	Explicit state file in remote backend (S3 + DynamoDB, Terraform Cloud); team must manage locking, versioning, and state migration on resource renames (moved blocks or terraform state mv)	State managed by Pulumi Cloud or a self-hosted backend; automatic locking; resource renames handled in code with aliases rather than manual state manipulation	No explicit state file; state is CloudFormation stack state managed by AWS, which simplifies operations but limits visibility and portability

maturity model

Platform Maturity Model

This model describes the progression from undifferentiated manual operations to a self-service internal developer platform. Most organisations starting an engagement sit at Level 1. CypherSage's delivery target is Level 3.

Level 1

Level 1 — Manual / ClickOps

Infrastructure is provisioned through cloud consoles or ad-hoc scripts with no IaC; state is undocumented and drifts from intent within days
Deployments are manual SSH or console operations performed by a small group with shared credentials; no audit trail
Monitoring consists of cloud-native default dashboards and email alerts with no SLOs; on-call is reactive and relies on customer complaints as the primary signal
Security controls are applied manually after the fact; credentials are static and long-lived; there is no secrets rotation process
DORA metrics are unmeasured; deployment frequency is below once per week; MTTR is measured in hours to days

Level 2

Level 2 — Automated

All persistent infrastructure is managed by versioned IaC (Terraform or Pulumi) with remote state; direct console changes are treated as incidents
CI/CD pipelines automate build, test, and deployment; production deployments are triggered by merge to main and require no manual steps for standard releases
SLOs are defined for tier-1 services; error budget burn-rate alerts replace threshold alerts; on-call rotation is documented with escalation paths
Image scanning and dependency vulnerability checks run in CI; secrets are managed in a central store with access auditing; static credentials are eliminated for cloud APIs
DORA metrics are measured continuously; deployment frequency is daily or better; change failure rate is below 15 %; MTTR is below one hour for P1 incidents
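
As a rough illustration of what measuring DORA metrics continuously can look like, the sketch below computes deployment frequency, change failure rate and MTTR from a list of deployment records, then checks them against the Level 2 thresholds quoted above. The `Deployment` record, its field names and the two functions are hypothetical. MTTR is approximated here as the time from deploy to restore, where a real pipeline would more likely measure from incident detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    deployed_at: datetime
    failed: bool = False                    # caused an incident or rollback
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments: list[Deployment], window_days: int) -> dict:
    """Summarise three DORA metrics over a reporting window."""
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    # Assumption: restore time is measured from the failing deploy.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    mttr = sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
    return {
        "deploys_per_day": total / window_days,
        "change_failure_rate": len(failures) / total if total else 0.0,
        "mttr": mttr,
    }


def meets_level_2(summary: dict) -> bool:
    """Daily deploys or better, change failure rate below 15 %, MTTR below one hour."""
    return (
        summary["deploys_per_day"] >= 1.0
        and summary["change_failure_rate"] < 0.15
        and (summary["mttr"] is None or summary["mttr"] < timedelta(hours=1))
    )
```

In practice the deployment records would come from the CI/CD system and the incident tracker rather than being built by hand.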

Where we operate: Level 3

Level 3 — Self-service Platform

Product teams provision new services through a self-service portal or CLI backed by opinionated Terraform/Pulumi modules; the platform team reviews and merges changes rather than provisioning by hand
Ephemeral preview environments are created automatically for every pull request and torn down when it is merged or closed; environment parity between preview and production is enforced by the platform
Progressive delivery is the default release mechanism; canary analysis gates backed by SLO metrics block automatic promotion when error budgets are burning
Policy-as-code (OPA/Conftest) enforces security, cost, and reliability constraints at CI and admission time; non-compliant workloads cannot reach production without an explicit exception
FinOps tooling surfaces cost attribution per service and team in the developer portal; Infracost estimates appear on IaC pull requests; idle resource cleanup is automated
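
To make the policy-as-code idea concrete, here is a minimal sketch of the kind of rule an admission check might enforce. Real policies for OPA/Conftest are written in Rego; this is a plain-Python stand-in, and the three rules, the function names and the `exceptions` set are illustrative assumptions rather than a recommended baseline.

```python
def violations(manifest: dict) -> list[str]:
    """Return human-readable policy violations for a Deployment-style manifest."""
    found = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        # Reliability and cost: every container must declare resource limits.
        if not container.get("resources", {}).get("limits"):
            found.append(f"{name}: missing resources.limits")
        # Security: privileged containers are rejected.
        if container.get("securityContext", {}).get("privileged", False):
            found.append(f"{name}: privileged containers are not allowed")
        # Reproducibility: the image must carry a tag other than 'latest'.
        # Only the last path segment is inspected, so a registry port such as
        # 'registry:5000/app' is not mistaken for a tag (simplified parsing).
        last_segment = image.rsplit("/", 1)[-1]
        tag = last_segment.rsplit(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            found.append(f"{name}: image must be pinned to a non-latest tag")
    return found


def admit(manifest: dict, exceptions: frozenset[str] = frozenset()) -> bool:
    """Admit compliant workloads, or ones with an explicitly recorded exception."""
    name = manifest.get("metadata", {}).get("name", "")
    return name in exceptions or not violations(manifest)
```

The same check can run twice: in CI against rendered manifests, and again at admission time, so a workload that skips the pipeline is still caught.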

what we avoid

Failure Modes to Avoid

These are the four most common and expensive failure modes in Cloud & DevOps engagements. Each one is a pattern that appears reasonable under short-term pressure but compounds into a structural liability.

ClickOps at Scale

Infrastructure provisioned through cloud consoles is undocumented by definition. What was clicked, when, by whom, and why is not recorded. Within weeks, the actual state of the account diverges from any diagram or mental model. Incident response becomes archaeology — tracing what is actually running before you can determine what is broken. Disaster recovery becomes impossible because there is nothing to recover from except the memory of whoever last touched the console.

Enforce IaC from day one by revoking console write permissions on AWS, GCP or Azure for all persistent resource classes and routing all changes through reviewed pull requests. The break-glass procedure for genuine emergencies is documented, requires two-person authorisation, and triggers an automatic IaC reconciliation ticket. Bringing existing ClickOps resources into Terraform state via import happens in the first sprint of any engagement; it is not deferred to a future backlog item.

Configuration Drift

Drift is the gap between what the IaC says should exist and what is actually running. It emerges from three sources: hotfixes applied directly to production, IaC modules that lag behind manual changes, and out-of-band configuration by vendor support or platform teams. Drift is invisible until it causes an incident — a terraform plan that proposes to destroy a resource that was manually added six months ago, or a security group that was manually opened and never closed. In a drifted environment, no one trusts the IaC, which means no one runs it, which accelerates the drift.

Continuous reconciliation is the only durable fix. For Kubernetes workloads, Argo CD's self-heal mode corrects drift within the sync interval. For cloud infrastructure, terraform plan is run as a scheduled CI job and any non-zero diff raises an alert. Manual changes are not forbidden in emergencies but must be immediately followed by an IaC update PR — the alert closes only when the plan output is clean. Drift tolerance is zero; exceptions require a written decision record.
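
A scheduled drift check can be sketched in a few lines. The example relies on `terraform plan -detailed-exitcode`, which exits with 0 when the plan is empty, 1 on error and 2 when changes are pending. The function names and the idea of returning a status string are our own illustration; how the alert is raised depends on the CI system in use.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_CLEAN, PLAN_ERROR, PLAN_DRIFT = 0, 1, 2


def classify_plan(exit_code: int) -> str:
    """Map a plan exit code to a drift status."""
    if exit_code == PLAN_CLEAN:
        return "clean"
    if exit_code == PLAN_DRIFT:
        return "drift"
    # Anything else (including 1) means the plan itself failed to run.
    return "error"


def check_drift(workdir: str) -> str:
    """Run a non-interactive plan in workdir and report its drift status."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan(result.returncode)
```

A scheduled job would call `check_drift` for each workspace and open or keep open an alert while the result is `"drift"`. Treating `"error"` as an alert too avoids a broken job silently reporting nothing.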

Security Bolted On at the End

Security reviews positioned as the last gate before production create a false sense of control. By the time a pen-test or vulnerability scan runs, the vulnerable dependency has been in production for weeks, the insecure Dockerfile pattern has been copied into 12 service repositories, and the IAM role with wildcard permissions has been assumed by a production workload. The cost of remediation scales linearly with the number of places the problem has propagated. A 'pass/fail at the gate' model also creates adversarial dynamics — security says no, delivery asks for exceptions, exceptions become the norm.

Security controls belong at the earliest enforcement point. Pre-commit hooks run secret detection. CI runs Trivy image scanning and OPA/Conftest policy checks. The admission webhook rejects non-compliant Kubernetes manifests. The IaC pipeline runs a CIS Benchmark check against the plan before apply. The security team writes policies, not approvals; the pipeline enforces them. A finding in CI is a build failure, not a review comment, which means it is fixed in minutes, not weeks.
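
As one example of the earliest enforcement point, a pre-commit secret check can be as simple as a pattern scan over staged text. The sketch below is deliberately minimal and assumes only two patterns: the `AKIA` prefix used by long-lived AWS access key IDs, and PEM private key headers. The `PATTERNS` table and `scan_text` function are hypothetical names; dedicated secret scanners cover far more formats and should be preferred in practice.

```python
import re

# Illustrative patterns only; a real scanner ships a much larger rule set.
PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line number, finding label) pairs for suspected secrets."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

A hook would run `scan_text` over each staged file and exit non-zero on any finding, which blocks the commit before the secret ever reaches the repository.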

Alert Fatigue from Missing SLOs

Teams that instrument first and define SLOs later end up with hundreds of alerts set at arbitrary thresholds — CPU over 80 %, latency over 500 ms, error rate over 1 % — with no relationship to whether customers are actually experiencing a degraded service. When everything alerts, nothing is urgent. On-call engineers learn to ignore pages; critical incidents get buried in the noise. MTTR climbs not because the team is slow but because distinguishing real incidents from threshold-breaching background noise requires manual investigation every time.

Define SLOs before writing alerting rules. An SLO requires answering three questions: what is the user-facing behaviour being protected, how is it measured (the SLI), and what is the acceptable failure rate over a rolling window (the objective). Alerts fire on error budget burn rate, the rate at which the budget is being consumed relative to the allowed rate, not on instantaneous metric thresholds. A burn rate of 1x spends exactly the budget over the window, so a burn rate below 1x over one hour is informational; a 14x burn rate over five minutes is a page. This model produces fewer, higher-confidence alerts and eliminates the alert-fatigue spiral.
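
The arithmetic behind burn-rate alerting is short enough to sketch. Burn rate is the observed error rate divided by the error rate the SLO allows, so 1x means the budget is being spent exactly on schedule. The function names, the two-window check and the default thresholds below are assumptions for illustration; the right thresholds depend on the SLO window and on how much budget a team is willing to lose before paging.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate as a multiple of the rate the SLO allows."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9 % SLO
    return (errors / requests) / error_budget


def alert_level(
    long_window_rate: float,
    short_window_rate: float,
    page_threshold: float = 14.0,
    ticket_threshold: float = 1.0,
) -> str:
    """Page only when both windows agree the budget is burning fast."""
    # Requiring the short window as well stops a page from firing long
    # after the problem has already recovered.
    if long_window_rate >= page_threshold and short_window_rate >= page_threshold:
        return "page"
    if long_window_rate >= ticket_threshold:
        return "ticket"
    return "ok"
```

For a 99.9 % SLO, 15 errors in 1,000 requests is a burn rate of roughly 15x. If that rate is sustained across both windows it pages. If only the long window is elevated, the result is a ticket rather than a page.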

specialised services

Go deeper

Cloud & DevOps covers a lot of ground. Here are the specific things we're most often asked to build.

DevOps & Cloud Automation

Cloud architecture, CI/CD pipelines, containerisation, infrastructure-as-code and observability — the engineering layer that keeps everything above it running reliably.

Learn more

MLOps

MLOps infrastructure for teams that have models to ship or models already running that need proper versioning, monitoring, retraining pipelines and cost control.

Learn more

good to know

Common questions

Our app already runs but it is fragile — can you help?

That is a common starting point. We audit what you have, stabilise deployments and add basic monitoring first, then improve reliability and cut costs from a stable foundation.

Will you lock us into your tooling or your accounts?

No. We use standard, portable tooling — Docker, Terraform, mainstream clouds — on your accounts and document everything so any engineer can take it from here.

keep exploring

Other services

Product & Software Engineering · AI & Machine Learning · Data & Analytics · Design & Product Strategy

Have something in mind?

Tell us what you're building or stuck on. The first consultation is free — no obligation, no hard sell.

Start a project Chat on WhatsApp