Cloud & DevOps
Scalable infrastructure, automation and reliable delivery.
Building something is half the job. Getting it live, keeping it up at 3 am, and not opening a cloud bill that ruins your month — that is the other half, and it is where small teams get stuck. We treat infrastructure as a first-class part of the build, not an afterthought bolted on before launch.
We set up the cloud architecture most studios skip: reproducible environments provisioned with infrastructure-as-code, automated deployment pipelines, container orchestration that runs identically in development and production, and observability so you hear about problems before your users do. Then we right-size it so you are not paying enterprise prices for a startup's traffic.
We cover the full range — from a single developer's first VPS to multi-region Kubernetes clusters on AWS, GCP or Azure — and we include MLOps as one specialised lane within the broader cloud practice. Everything is documented on your accounts, portable and operable by any competent engineer.
- Releases that ship safely, consistently and without downtime drama
- Issues surfaced by monitoring — not reported by frustrated users
- A cloud bill that scales with your business, not against it
Capabilities
Cloud architecture & IaC
AWS, GCP and Azure architecture designed for your workload, provisioned with Terraform so environments are reproducible, auditable and never clicked together by hand.
CI/CD pipelines & automation
Automated test, build and release pipelines in GitHub Actions, GitLab CI or similar — so shipping to production is routine rather than an all-hands event.
Containers & Kubernetes
Docker-based environments and Kubernetes orchestration that run identically in development and production, with zero 'works on my machine' surprises.
Observability & monitoring
Structured logging, metrics, distributed tracing, uptime checks and sensible alerting — you know what is slow or broken before customers report it.
Reliability, SRE & cost optimisation
Right-sized resources, auto-scaling, backup strategies, incident runbooks and a cloud bill that reflects your actual traffic rather than accumulated waste.
Security & compliance posture
IAM least-privilege, secrets management, network segmentation, vulnerability scanning and the audit trails that regulated industries require.
MLOps & model serving
Deploy ML models behind reliable, versioned APIs with retraining hooks, A/B routing and drift monitoring so they keep performing after their initial launch.
Deliverables
- Your application or model live on infrastructure you own and control
- Automated deploy pipeline from a merged pull request to production
- Infrastructure-as-code repository with full documentation
- Monitoring, alerting and backup strategy in place from day one
Best suited for
- Engineering teams without dedicated platform or DevOps experience
- Startups whose infrastructure was clicked together and needs to be made reproducible
- Teams whose ML model is stuck in a notebook and needs a production home
- Any product that needs to stay reliably online as it grows
What we build with
DevOps, SRE & Platform Engineering
DevOps is an organisational and technical practice that collapses the wall between software delivery and infrastructure operations. It is not a job title or a toolchain; it is a feedback-loop discipline: shorten the time from code commit to production signal so that teams can learn and correct course faster. Done well, it shows up as measurable DORA metrics (deployment frequency, lead time for changes, change failure rate, time to restore service), not a Jira board full of automation tickets.
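As a rough illustration, the four DORA metrics can be derived from a plain log of deployments. The sketch below assumes a hypothetical `Deployment` record and a fixed observation window; the names are ours for illustration and do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # when the change reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Summarise the four DORA metrics over a fixed observation window."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Hypothetical sample: four deployments over two days, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0, t0 + timedelta(hours=6), failed=True,
               restored_at=t0 + timedelta(hours=6, minutes=30)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
metrics = dora_metrics(sample, window_days=2)
```

In practice the records would come from the CI system and the incident tracker; the point is that each metric is a simple aggregate once those two sources are joined.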
Site Reliability Engineering (SRE) picks up where DevOps ends. SRE defines reliability as a feature with a quantified budget: Service Level Objectives and error budgets set an explicit ceiling on how much unreliability is acceptable, which forces a conversation between product and engineering that no amount of 'we care about quality' rhetoric achieves. SRE teams own incident response, postmortem culture, and the toil-elimination mandate: manual, repetitive, automatable work gets automated so that toil never consumes more than 50 % of an engineer's time.
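The error-budget arithmetic is simple enough to show directly. This is a minimal sketch assuming a request-based availability SLO; the function names and the 30-day window are illustrative choices, not a standard API.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Return how much of the error budget a service has consumed."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    allowed_failures = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "consumed_fraction": consumed,
        "remaining_fraction": 1 - consumed,
    }


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Translate an availability SLO into minutes of tolerable downtime."""
    return window_days * 24 * 60 * (1 - slo_target)


# A 99.9 % SLO over one million requests tolerates roughly 1,000 failures.
budget = error_budget(0.999, 1_000_000, 250)
```

With 250 failures against that SLO, about a quarter of the budget is spent. Expressed as time, the same 99.9 % target allows roughly 43 minutes of downtime in a 30-day window.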
Platform Engineering is the newest discipline of the three. It builds and operates the Internal Developer Platform (IDP) that surfaces self-service golden paths to product teams. Where DevOps asks every team to own their pipeline and SRE asks them to write runbooks, Platform Engineering removes those burdens by offering paved roads: templated repositories, curated runtime environments, cost dashboards, and policy guardrails baked into the platform rather than bolted on by audit. The boundary is sharp: Platform Engineering ends at the contract surface (APIs, CLI, portal) it exposes; product teams own what they deploy through it.
Platform Delivery Lifecycle
Each engagement follows a six-stage progression that mirrors how a platform actually matures — from understanding current pain to handing off an operating model that the team can sustain. Stages are sequential but overlap at their edges; stage outputs feed the next stage as direct inputs.
Foundation Assessment
Audit the current infrastructure posture, delivery pipeline, and operational practice before prescribing anything. This stage exists to prevent importing existing dysfunction into shiny new tooling.
- DORA metrics baseline measurement: deployment frequency, lead time, MTTR, change failure rate
- Architecture inventory: cloud accounts, regions, VPCs, IAM boundaries, and drift between documented and actual state
- Pipeline archaeology: trace a production deployment end-to-end and document every manual step, approval gate, and tribal-knowledge dependency
- SLO gap analysis: identify which services have SLIs defined, which have alerts with no SLO backing them, and which have no observability at all
- FinOps snapshot: pull 90-day spend by service and team to surface waste and missing cost attribution before architecture decisions are made
Infrastructure Architecture
Design the target-state infrastructure as code, including network topology, identity model, secrets hierarchy, and multi-environment promotion strategy. Every design decision is written as an Architecture Decision Record before implementation begins.
- Reference architecture diagram: accounts/projects, VPC layout, transit gateway or VPC peering, egress strategy
- IaC module library design: define the Terraform/Pulumi module boundaries and the contract each module exports
- Identity and access model: IAM roles per workload, least-privilege boundaries, federation to identity provider
- Secrets hierarchy: define which secrets live in Vault, which in cloud-native secret stores, and the rotation schedule for each class
- Environment promotion model: trunk-based branching strategy, ephemeral PR environments, staging parity constraints
CI/CD Engineering
Build the delivery pipeline from commit to production, including build standardisation, artefact promotion, and progressive delivery gates. The pipeline is itself versioned and tested; it is not exempt from engineering standards.
- Standardised build graphs: Dockerfiles, build caches, SBOM generation at build time
- Artefact registry and promotion policy: immutable tags, attestation signing, promotion gates between registries
- GitHub Actions / CI platform configuration with reusable workflows and matrix builds
- GitOps delivery configuration: Argo CD ApplicationSets or Flux Kustomizations per environment tier
- Progressive delivery: canary or blue-green rollout configuration with automated analysis gates (Argo Rollouts + Prometheus metrics)
Observability & SRE
Instrument the platform and applications to produce the signals needed to defend SLOs, not just to produce dashboards. Observability is only complete when an on-call engineer can answer 'what is broken, where, and why' without SSH-ing into a box.
- SLI/SLO definition workshops per service: latency, availability, saturation targets with error budget burn-rate alerts
- OpenTelemetry collector deployment and SDK instrumentation guide for product teams
- Prometheus scrape configs, recording rules, and alerting rules aligned to SLOs (not just thresholds)
- Grafana dashboard library: golden signals per service, platform health, cost-per-service overlays
- Incident response runbook library and postmortem template embedded in the internal developer portal
DevSecOps Hardening
Embed security controls into the pipeline and the platform so that policy violations are caught before merge, not after a quarterly pen-test. Zero-trust posture means no implicit trust at the network or identity layer.
- Supply chain hardening: Trivy image scanning, SBOM attestation with Cosign, dependency vulnerability scanning in CI
- Policy-as-code: OPA/Conftest rules for Kubernetes manifests, Terraform plans, and Dockerfile patterns gated at CI
- CIS Benchmark baseline for cloud accounts applied as Terraform and enforced via AWS Config / GCP Security Command Center
- Zero-trust network policy: Kubernetes NetworkPolicy and cloud VPC firewall rules generated from service mesh or policy engine
- Secrets rotation automation: Vault dynamic credentials for databases and cloud APIs, eliminating long-lived static secrets
Day-2 Operations
Transition from delivery to sustained operation by establishing the operating model, toil-elimination backlog, and capacity planning cadence that keeps the platform healthy without heroics.
- On-call rotation design and escalation paths documented in the incident management tool
- Toil register: catalogue all manual, repetitive operational tasks with estimated hours/week and automation priority
- Cost optimisation cadence: monthly FinOps review, rightsizing recommendations, reserved-instance or committed-use planning
- Platform changelog and internal developer portal updates to communicate breaking changes and new golden-path offerings
- Quarterly reliability review: SLO burn rate trends, error budget consumption, and DORA metrics progression against baseline
Engineering Principles
These are the opinionated constraints we apply to every Cloud & DevOps engagement. They are not aspirational values — they are guardrails that prevent the most common and expensive failure modes.
Treat infrastructure state as immutable
Servers and managed resources that are modified in-place accumulate undocumented state until no one knows what is actually running. Immutability — replacing rather than patching — eliminates configuration drift as a category of failure. When every change is a replacement provisioned from source-controlled IaC, the blast radius of any mutation is bounded and reversible. This is why we mandate IaC for all persistent resources and prohibit direct console changes outside a documented break-glass procedure.
Define SLOs before instrumenting metrics
Teams that instrument first and define SLOs later end up with hundreds of dashboards and zero actionable alerts. An SLO forces you to answer 'reliable enough for whom, under what conditions, measured how?' before you write a single PromQL query. Starting from the SLO means every metric you collect has a consumer, every alert has a budget, and every on-call page is justified by an error budget burn rate — not just a threshold someone chose arbitrarily.
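To make the burn-rate idea concrete, here is a small sketch of the multi-window check. It assumes error ratios for a long and a short window have already been computed (for example by recording rules). The 14.4 default is the fast-burn threshold the Google SRE Workbook pairs with 1-hour and 5-minute windows; the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing.

    A burn rate of 1.0 spends the whole error budget precisely at the end of
    the SLO window; higher values exhaust it proportionally sooner.
    """
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the problem is significant; the short window shows
    it is still happening, which stops alerts firing after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# 2 % errors over the last hour and 3 % over the last five minutes,
# against a 99.9 % SLO: burn rates of roughly 20x and 30x.
page_now = should_page(0.02, 0.03, slo_target=0.999)

# Same hour, but the last five minutes are healthy: the incident is over.
page_after_recovery = should_page(0.02, 0.001, slo_target=0.999)
```

The design choice worth noting is that the threshold is expressed in budget terms, so the same rule works unchanged for a 99 % service and a 99.99 % one.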
Security is a property of the pipeline, not a final gate
A security review at the end of a delivery cycle is an approval ritual, not a control. By the time a vulnerability reaches a pre-production scan, it has already been committed, built, tested, and promoted — ripping it out is expensive. Policy-as-code in CI (OPA/Conftest), image scanning at build time (Trivy), and secret detection at pre-commit shift the enforcement point left by weeks. The goal is to make it impossible to merge a non-compliant artefact, not to catch it later.
Speed without guardrails just ships bad changes faster
High deployment frequency is a DORA research finding about elite teams, not a goal to chase with volume alone. Teams that push to production 50 times a day without automated rollback, feature flags, or canary analysis just discover their mistakes faster and in front of customers. Progressive delivery — canary releases, blue-green deployments, feature flags backed by an error budget burn gate — is what makes high frequency safe rather than reckless.
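A canary gate of the kind described above reduces to a small decision function. The sketch below is a simplified illustration, not the Argo Rollouts analysis algorithm; `canary_verdict`, `min_requests` and `max_burn_rate` are names and defaults we made up for the example.

```python
from enum import Enum


class Verdict(Enum):
    PROMOTE = "promote"
    ABORT = "abort"
    WAIT = "wait"


def canary_verdict(canary_errors: int,
                   canary_requests: int,
                   slo_target: float,
                   min_requests: int = 500,
                   max_burn_rate: float = 1.0) -> Verdict:
    """Decide whether a canary may be promoted.

    The canary must see enough traffic to be meaningful, and its error ratio
    must stay within the SLO's allowed error ratio scaled by max_burn_rate.
    """
    if canary_requests < min_requests:
        return Verdict.WAIT  # not enough signal yet; keep the canary running
    error_ratio = canary_errors / canary_requests
    allowed_ratio = (1.0 - slo_target) * max_burn_rate
    return Verdict.PROMOTE if error_ratio <= allowed_ratio else Verdict.ABORT


too_early = canary_verdict(0, 100, slo_target=0.99)     # only 100 requests seen
healthy = canary_verdict(5, 1000, slo_target=0.99)      # 0.5 % errors vs 1 % allowed
unhealthy = canary_verdict(30, 1000, slo_target=0.99)   # 3 % errors vs 1 % allowed
```

The `WAIT` state matters as much as the other two: promoting on a handful of requests is how a gate gives false confidence.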
GitOps means the cluster state is a function of the repository
In a true GitOps model, the Kubernetes cluster has no desired state that is not expressed in a Git commit. Operators apply changes by merging pull requests; the GitOps controller (Argo CD or Flux) continuously reconciles actual state to declared state. This eliminates entire classes of 'who changed what and when' incidents, provides a free audit trail, and means rollback is a git revert — not a production kubectl command issued under pressure at 2 AM.
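The reconciliation loop at the heart of this model can be sketched in a few lines. This is a conceptual illustration only: real controllers such as Argo CD and Flux compare full Kubernetes objects and handle ordering, health and pruning policy. The `reconcile` function and the resource dictionaries are hypothetical.

```python
def reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Return the actions needed to make the actual state match the desired state.

    `desired` stands in for what is declared in Git, `actual` for what is
    running in the cluster. Both map a resource name to its spec.
    """
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))  # drift: someone changed it out of band
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))  # running but no longer declared
    return sorted(actions)


desired_state = {
    "api": {"image": "api:1.4.0", "replicas": 3},
    "worker": {"image": "worker:2.1.0", "replicas": 2},
}
actual_state = {
    "api": {"image": "api:1.4.0", "replicas": 5},   # scaled by hand
    "legacy-cron": {"image": "cron:0.9", "replicas": 1},
}
plan = reconcile(desired_state, actual_state)
```

Because the function depends only on the two states, a rollback is just a call with an older `desired` value, which is the property that makes a git revert sufficient.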
Cost is an engineering concern, not a finance report
Cloud spend is a direct consequence of architectural choices: instance families, storage tiers, data transfer patterns, idle capacity. Treating it as a finance problem means engineers never see the cost signal until it is too large to ignore. FinOps tooling should be surfaced in the same internal developer portal as deployment status, so the team that kept a gp2 volume where gp3 would be cheaper, or left a NAT gateway serving a defunct service, sees the cost attribution before the monthly bill.
Every secret has a rotation schedule or it is a liability
A static credential that never rotates is a credential that will eventually be found in a git history, a log file, an environment variable dump, or a breach. Vault dynamic secrets, IRSA/Workload Identity, and cloud-native secret managers make rotation automatic and make long-lived static secrets unnecessary for the majority of use cases. Any service that cannot tolerate credential rotation on a defined schedule has an architectural problem, not a security exemption.
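One way to make the principle enforceable is a periodic audit that flags secrets with no schedule or an overdue rotation. The sketch below assumes a hypothetical inventory of `Secret` records; in practice the data would come from the secret store's metadata API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Secret:
    name: str
    last_rotated: date
    max_age_days: Optional[int]  # None means no rotation schedule was ever defined


def rotation_findings(secrets: list[Secret], today: date) -> dict[str, str]:
    """Return a finding for every secret that violates the rotation principle."""
    findings = {}
    for secret in secrets:
        if secret.max_age_days is None:
            findings[secret.name] = "no rotation schedule"
            continue
        overdue = (today - secret.last_rotated).days - secret.max_age_days
        if overdue > 0:
            findings[secret.name] = f"overdue by {overdue} days"
    return findings


inventory = [
    Secret("db-password", date(2024, 5, 20), max_age_days=30),   # within schedule
    Secret("api-key", date(2024, 1, 1), max_age_days=90),        # rotated 152 days ago
    Secret("legacy-token", date(2022, 3, 1), max_age_days=None), # never scheduled
]
findings = rotation_findings(inventory, today=date(2024, 6, 1))
```

Secrets issued dynamically with a short lease never appear in such a report at all, which is the argument for preferring them.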
The platform must eat its own dog food
An Internal Developer Platform that the platform team itself does not use is a platform that will not evolve to meet real needs. Platform teams must deploy their own tooling through the same golden paths, observe it through the same dashboards, and carry it in the same on-call rotation as product workloads. This is the fastest feedback loop for discovering where the paved road is actually a gravel track.
Reference Stack
Tool choices are justified by engineering constraints, not popularity. Each selection reflects a specific capability gap it closes or failure mode it eliminates.
IaC & Provisioning
- Terraform + Terragrunt: Terraform's provider ecosystem is unmatched; virtually every cloud resource and SaaS API has a maintained provider. Terragrunt layers on top to solve Terraform's native weaknesses: DRY remote-state configuration, dependency ordering across modules, and environment-specific variable inheritance without copy-paste. The combination gives declarative state management at scale without forcing a rewrite into a proprietary graph engine.
- Pulumi: Pulumi is the right choice when the team's primary language is TypeScript, Python, or Go and the cognitive overhead of HCL is a genuine impediment to adoption. Its real advantage is first-class support for dynamic resource graphs: generating 50 identical microservice stacks from a loop is idiomatic Pulumi and awkward Terraform. It also integrates natively with existing CI test suites, which matters when the IaC itself needs unit tests.
- AWS CDK / GCP Deployment Manager: Used selectively for resources that are deeply coupled to the cloud provider's native constructs and where the CDK's higher-level abstractions (L2/L3 constructs) genuinely reduce boilerplate. Not used as a primary IaC layer for multi-cloud workloads, because the provider-specific DSL becomes a liability the moment a workload needs to span clouds.
CI/CD & Delivery
- GitHub Actions: Native to the source control system, which eliminates an entire class of webhook and credential synchronisation problems. Reusable workflows and composite actions provide the modularisation needed to avoid copy-pasting pipeline configuration across 20 repositories. The marketplace is broad enough that most integrations exist as maintained actions, reducing the amount of shell glue engineers have to write and maintain.
- Argo CD: Argo CD is the reference implementation of GitOps for Kubernetes. ApplicationSets allow a single manifest to generate Application objects for every environment, cluster, and service combination, eliminating per-environment pipeline branches. Its reconciliation loop continuously detects and surfaces drift between the live cluster and the declared Git state, which is the core property GitOps needs to deliver on its promise.
- Argo Rollouts: Progressive delivery without a dedicated rollout controller collapses to a binary switch: old is gone, new is live. Argo Rollouts adds canary and blue-green strategies with analysis gates backed by Prometheus metrics or Datadog queries. The gate asks 'is the error rate on the canary population within the SLO error budget?' before promoting; this is what makes high deployment frequency safe rather than reckless.
Containers & Orchestration
- Docker / BuildKit: BuildKit's parallel stage execution and cache-mount support cut median build times significantly compared to classic Docker builds. More importantly, the --sbom and --provenance flags of docker buildx build produce SBOM and SLSA provenance attestations at build time without a separate pipeline step, which is the prerequisite for supply-chain policy enforcement downstream.
- Kubernetes: Kubernetes is chosen when the workload requires fine-grained scheduling, multi-tenancy with namespace isolation, or a service mesh. Its API extensibility (CRDs, admission webhooks, operators) is what makes it the right substrate for a self-service platform: every capability the platform team wants to surface can be exposed as a Kubernetes resource with a controller managing the lifecycle. It is deliberately not the default for every workload; stateless functions and batch jobs often belong in serverless compute.
- Helm + Helmfile: Helm packages Kubernetes manifests into versioned, parameterised charts that can be promoted through environments by changing a value file rather than a manifest. Helmfile declaratively composes multiple releases with shared value inheritance, which solves the environment-fan-out problem that makes raw Helm charts painful at scale. Charts are stored in OCI registries and version-pinned, making the exact set of running manifests auditable from the GitOps repository.
Observability
- Prometheus + Alertmanager: Prometheus's pull model and its dimensional label model are what make SLO-based alerting tractable. Multi-window, multi-burn-rate alerts (the pattern recommended by the Google SRE Workbook) require a time-series store that can efficiently evaluate recording rules at different time windows simultaneously. Prometheus's operator pattern (kube-prometheus-stack) also means the monitoring stack itself is managed as code and can be templated per cluster.
- Grafana: Grafana is the visualisation layer because its dashboard-as-code support (Grafonnet / JSON model in Git) allows dashboards to be version-controlled and promoted alongside the service definitions they describe. The ability to mix Prometheus, Loki, and Tempo data sources in a single panel is essential for correlating a latency spike (metrics) to a specific trace and its associated log lines without switching tools.
- OpenTelemetry: OpenTelemetry solves vendor lock-in at the instrumentation layer. SDK instrumentation emits to a collector; the collector routes to whichever backend is current (Jaeger, Tempo, commercial APM). The cost of switching observability backends drops from 'reinstrument every service' to 'change a collector config'. The OTel semantic conventions also standardise the attribute names that SLI queries rely on, which reduces the per-service instrumentation work for product teams.
- Loki: Loki indexes only log metadata (labels), not the full log body, which keeps storage costs an order of magnitude below Elasticsearch at equivalent log volume. Because its label model mirrors Prometheus labels, the same service/namespace/pod selectors that appear in metrics queries work in log queries, reducing the cognitive load for engineers who switch between signals during an incident.
Security & Policy
- OPA / Conftest: Open Policy Agent provides a unified policy language (Rego) that can evaluate Terraform plans, Kubernetes admission requests, and Dockerfile instructions against the same policy corpus. Conftest is the CLI wrapper that runs these policies in CI. The result is that a policy written once ('containers must not run as root') is enforced at plan time, at admission, and in CI, rather than being checked in three different places by three different tools.
- Trivy: Trivy scans container images, filesystem layers, IaC files, and SBOMs for CVEs in a single tool invocation. Its SBOM export (CycloneDX or SPDX) satisfies supply-chain transparency requirements and feeds into admission policies that can block promotion of images with critical unpatched CVEs. The zero-configuration mode makes it trivial to add to any CI pipeline without a dedicated scanning service.
- HashiCorp Vault: Vault's dynamic secret engines (database credentials, AWS IAM credentials, PKI certificates) mean that the secret a service receives exists only for the duration of its lease and is revoked automatically. This eliminates the 'leaked static credential' class of incident. Vault's audit log provides a complete record of every secret access, which is a prerequisite for compliance frameworks (SOC 2, ISO 27001) that require demonstrating least-privilege access.
- Cosign + SLSA: Cosign signs container image digests and SBOM attestations with a keyless signing workflow backed by Sigstore's transparency log. An admission webhook that verifies signatures before allowing image deployment closes the gap between 'we scanned the image in CI' and 'we are running the image we scanned', which is the gap where supply-chain attacks live. SLSA provenance levels provide a graduated framework for hardening the build pipeline itself.
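To show what the 'containers must not run as root' rule actually checks, here is the logic expressed in Python rather than Rego. It is a simplified sketch: it reads the real Kubernetes `securityContext` fields `runAsNonRoot` and `runAsUser`, but ignores init containers, image defaults and other cases a production policy would cover. The `root_violations` helper is hypothetical.

```python
def root_violations(pod: dict) -> list[str]:
    """Return the names of containers in a Pod manifest that may run as root.

    Container-level securityContext overrides the pod-level one, so the two
    are merged with the container's values taking precedence.
    """
    spec = pod.get("spec", {})
    pod_ctx = spec.get("securityContext") or {}
    violations = []
    for container in spec.get("containers", []):
        ctx = {**pod_ctx, **(container.get("securityContext") or {})}
        uid = ctx.get("runAsUser")
        non_root = ctx.get("runAsNonRoot") is True
        # Compliant only if UID 0 is ruled out: either an explicit non-zero
        # UID, or runAsNonRoot: true with no explicit UID of 0.
        compliant = uid != 0 and (non_root or uid is not None)
        if not compliant:
            violations.append(container.get("name", "<unnamed>"))
    return violations


hardened_pod = {
    "spec": {
        "securityContext": {"runAsNonRoot": True},
        "containers": [
            {"name": "app", "image": "example/app:1.0"},
            {"name": "sidecar", "image": "example/sidecar:1.0",
             "securityContext": {"runAsUser": 0}},  # overrides the pod default
        ],
    }
}
default_pod = {"spec": {"containers": [{"name": "web", "image": "example/web:1.0"}]}}
```

The same predicate, written once as policy, is what would run against the manifest in CI and again at admission.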
Cloud & FinOps
- AWS / GCP / Azure: Provider selection is driven by workload requirements and existing organisational investment, not familiarity. AWS is chosen for breadth of managed services and maturity of IAM. GCP is chosen for AI/ML workloads (TPU access, Vertex AI) and data platform integration. Azure is chosen when Active Directory integration and Microsoft licensing are architectural constraints. All three are wrapped in the same IaC layer so the infrastructure code is not harder to port than the application.
- AWS Cost Explorer / GCP Billing + Infracost: Infracost runs in CI and posts a cost estimate diff on every pull request that changes infrastructure, so engineers see the monthly cost delta of their change before it merges, not after the bill arrives. Cloud-native billing APIs provide the actuals; Infracost provides the pre-merge signal. Together they close the feedback loop that FinOps practice requires: cost must be visible at the point of decision, not at the point of payment.
- Karpenter / KEDA: Karpenter replaces the Kubernetes Cluster Autoscaler with a provisioner that can select the cheapest available instance type for a given workload shape, including Spot instances with automatic diversification across availability zones. KEDA scales workloads on arbitrary event sources (queue depth, Prometheus metrics, cron) rather than just CPU/memory, which eliminates over-provisioned headroom kept to absorb traffic spikes that are actually predictable.
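Event-driven scaling on queue depth comes down to a ceiling division with bounds. The sketch below shows the general shape only; the real KEDA and Horizontal Pod Autoscaler behaviour adds tolerances, stabilisation windows and cooldowns that are omitted here. `desired_replicas` and its defaults are illustrative.

```python
import math


def desired_replicas(queue_depth: int,
                     target_per_replica: int,
                     min_replicas: int = 0,
                     max_replicas: int = 10) -> int:
    """Replica count needed so each replica handles about target_per_replica messages."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


idle = desired_replicas(0, target_per_replica=50)          # empty queue: scale to zero
busy = desired_replicas(120, target_per_replica=50)        # 120 messages: three replicas
capped = desired_replicas(5000, target_per_replica=50)     # bounded by max_replicas
```

Scaling on the backlog rather than on CPU is what lets a worker pool sit at zero between bursts instead of holding idle headroom.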
How we'd choose
There's rarely one right answer — these are the trade-offs we weigh before recommending an approach.
Kubernetes vs Serverless vs VMs
Compute substrate choice has long-term implications for operational model, cost profile, and capability ceiling. Use this guide to match workload characteristics to the right substrate before committing to a platform.
| Criterion | Kubernetes | Serverless (FaaS) | Virtual Machines |
|---|---|---|---|
| Cold-start latency tolerance | Sub-second with pre-warmed pods; not suitable for latency-critical sub-10ms P99 | 100 ms–2 s cold start; unsuitable for synchronous low-latency APIs without provisioned concurrency | No cold start; process is always running; suitable for any latency profile |
| Operational complexity | High — control plane, node pools, networking CNI, storage CSI all require expertise | Low — zero infrastructure management; vendor manages runtime, scaling, and patching | Medium — OS patching, image baking, and auto-scaling group configuration are team responsibilities |
| Cost model at low load | Fixed cluster baseline cost regardless of traffic; economical only above a workload density threshold | Pay-per-invocation; near-zero cost at low request rates; optimal for infrequent or bursty workloads | Fixed per-instance cost; idle VMs waste spend; Reserved Instances reduce this at sustained load |
| Workload duration | Unlimited duration; suitable for long-running daemons, WebSocket connections, batch jobs via Job resource | Maximum execution timeout (typically 15 min); unsuitable for long-running streaming or stateful processes | Unlimited duration; no platform-imposed timeout; suitable for any workload length |
| Multi-tenancy & isolation | Namespace-level isolation with NetworkPolicy and RBAC; strong but requires policy discipline | Full process isolation per invocation; no shared-state risk between tenants | Hypervisor-level isolation; strongest boundary; suitable for regulated or hostile-tenant workloads |
| Self-service platform fit | Excellent — CRDs and operators make Kubernetes the right substrate for an IDP; product teams deploy via GitOps | Moderate — golden-path templates per function type work well; harder to standardise cross-cutting concerns | Poor — VM lifecycle management at scale requires significant tooling investment to match Kubernetes self-service |
GitOps vs Push-based CI/CD
Both models can achieve continuous delivery, but they differ in security posture, auditability, and how well they scale to many environments. The choice affects your incident response workflow as much as your deployment workflow.
| Criterion | GitOps (pull) | Push-based CI/CD |
|---|---|---|
| Credential exposure | CI system holds no cluster credentials; the GitOps agent running inside the cluster pulls desired state from Git — no inbound firewall rules or kubeconfig files in CI secrets | CI runner requires cluster credentials (kubeconfig, service account token) to push changes; credentials must be rotated and are a high-value compromise target |
| Drift detection | Continuous reconciliation loop detects and surfaces (or auto-corrects) any manual change within seconds; drift is a first-class observable | No ongoing reconciliation; drift between pipeline-applied state and actual cluster state is invisible until the next deployment or an incident surfaces it |
| Rollback speed | Rollback is a git revert merged to the target branch; the reconciler applies the previous state within the sync interval — no human touching kubectl | Rollback requires re-triggering the pipeline with a previous artefact tag, or a manual kubectl apply; speed depends on pipeline queue depth |
| Multi-cluster fan-out | ApplicationSets (Argo CD) or Flux cluster generators scale a single manifest to N clusters; adding a cluster is a label or a row in a config file | Requires explicit per-cluster stages in the pipeline; each new cluster adds pipeline complexity and another set of credentials to manage |
| Audit trail | Every change to the cluster is a Git commit with author, timestamp, and diff — the Git log is the audit log with no additional tooling | Audit trail lives in CI logs; correlating a production change to a pipeline run requires cross-system query and depends on log retention policy |
Terraform vs Pulumi vs CDK
IaC tool selection should be driven by team language fluency, the complexity of the resource graph, and how tightly the workload is coupled to a single cloud provider.
| Criterion | Terraform (HCL) | Pulumi (TS/Python/Go) | AWS CDK (TypeScript) |
|---|---|---|---|
| Language & learning curve | HCL is purpose-built and readable, but lacks the loops, conditionals, and abstractions of a general-purpose language; count/for_each workarounds are brittle | Uses the team's existing language (TypeScript, Python, or Go) with full IDE support, unit testing, and familiar patterns; steeper initial Pulumi SDK learning curve | TypeScript-first, with Python, Java, C# and Go also supported; constructs are idiomatic but the abstraction levels (L1/L2/L3) require understanding how CloudFormation renders them |
| Provider ecosystem | Broadest ecosystem — Terraform Registry has thousands of providers including every major cloud, SaaS, and database; most stable and battle-tested | Supports all Terraform providers via the compatibility bridge plus native Pulumi providers; bridge providers occasionally lag on new resources | AWS-only by design; no multi-cloud support; tight integration with CloudFormation means AWS-native resources appear on day one |
| Multi-cloud support | First-class — providers for AWS, GCP, Azure, Kubernetes, and hundreds of SaaS platforms; the canonical choice for multi-cloud IaC | First-class — same provider breadth as Terraform via bridge; native providers for major clouds; multi-cloud is idiomatic | None — CDK is AWS-specific; CDK for Terraform (cdktf) and CDK for Kubernetes (cdk8s) exist but are separate tools with different maturity |
| Dynamic resource graphs | Static graph computed before apply; count and for_each are limited; generating 50 services from a data structure requires complex module composition | Resource graph is evaluated at runtime using real language constructs; a for-loop over a config array is idiomatic and produces clean plan output | L3 constructs (Patterns) encapsulate repeating patterns well; lower-level dynamic generation is possible but requires understanding CloudFormation limits |
| State management | Explicit state file in remote backend (S3 + DynamoDB, Terraform Cloud); team must manage locking, versioning, and state migration on resource renames | State managed by Pulumi Cloud or a self-hosted backend; automatic locking; resource renames tracked without manual state manipulation | No explicit state file — state is CloudFormation stack state managed by AWS; simplifies operations but limits visibility and portability |
Platform Maturity Model
This model describes the progression from undifferentiated manual operations to a self-service internal developer platform. Most organisations starting an engagement sit at Level 1. CypherSage's delivery target is Level 3.
Level 1 — Manual / ClickOps
- Infrastructure is provisioned through cloud consoles or ad-hoc scripts with no IaC; state is undocumented and drifts from intent within days
- Deployments are manual SSH or console operations performed by a small group with shared credentials; no audit trail
- Monitoring consists of cloud-native default dashboards and email alerts with no SLOs; on-call is reactive and relies on customer complaints as the primary signal
- Security controls are applied manually after the fact; credentials are static and long-lived; there is no secrets rotation process
- DORA metrics are unmeasured; deployment frequency is below once per week; MTTR is measured in hours to days
Level 2 — Automated
- All persistent infrastructure is managed by versioned IaC (Terraform or Pulumi) with remote state; direct console changes are treated as incidents
- CI/CD pipelines automate build, test, and deployment; production deployments are triggered by merge to main and require no manual steps for standard releases
- SLOs are defined for tier-1 services; error budget burn-rate alerts replace threshold alerts; on-call rotation is documented with escalation paths
- Image scanning and dependency vulnerability checks run in CI; secrets are managed in a central store with access auditing; static credentials are eliminated for cloud APIs
- DORA metrics are measured continuously; deployment frequency is daily or better; change failure rate is below 15 %; MTTR is below one hour for P1 incidents
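As a rough illustration of how the Level 2 DORA targets above could be checked, here is a minimal Python sketch. The `Deployment` record, the `dora_summary` and `meets_level_2` helpers, and the simplification that restore time is measured from the failed deployment are all assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    deployed_at: datetime
    failed: bool = False                    # caused an incident or rollback
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments: list[Deployment], window_days: int) -> dict:
    """Summarise deployment frequency, change failure rate and MTTR."""
    if not deployments:
        raise ValueError("no deployments recorded in the window")
    failures = [d for d in deployments if d.failed]
    # Assumption: time to restore is measured from the failed deployment.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    mttr = sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
    return {
        "deploys_per_day": len(deployments) / window_days,
        "change_failure_rate": len(failures) / len(deployments),
        "mttr": mttr,
    }


def meets_level_2(summary: dict) -> bool:
    """Level 2 targets: daily deploys, failure rate below 15 %, MTTR below 1 h."""
    mttr_ok = summary["mttr"] is None or summary["mttr"] < timedelta(hours=1)
    return (
        summary["deploys_per_day"] >= 1
        and summary["change_failure_rate"] < 0.15
        and mttr_ok
    )
```

In practice the deployment records would come from the CI/CD system and the incident tracker; the sketch only shows the arithmetic behind the thresholds.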
Level 3 — Self-service Platform
- Product teams provision new services through a self-service portal or CLI backed by opinionated Terraform/Pulumi modules; the platform team reviews and merges changes rather than provisioning them
- Ephemeral preview environments are created automatically for every pull request and torn down on merge; environment parity between preview and production is enforced by the platform
- Progressive delivery is the default release mechanism; canary analysis gates backed by SLO metrics block automatic promotion when error budgets are burning
- Policy-as-code (OPA/Conftest) enforces security, cost, and reliability constraints at CI and admission time; non-compliant workloads cannot reach production without an explicit exception
- FinOps tooling surfaces cost attribution per service and team in the developer portal; Infracost estimates appear on IaC pull requests; idle resource cleanup is automated
Failure Modes to Avoid
These are the four most common and expensive failure modes in Cloud & DevOps engagements. Each one is a pattern that appears reasonable under short-term pressure but compounds into a structural liability.
ClickOps at Scale
Infrastructure provisioned through cloud consoles is undocumented by definition. What was clicked, when, by whom, and why is not recorded. Within weeks, the actual state of the account diverges from any diagram or mental model. Incident response becomes archaeology — tracing what is actually running before you can determine what is broken. Disaster recovery becomes impossible because there is nothing to recover from except the memory of whoever last touched the console.
Enforce IaC from day one by revoking console write permissions in AWS, GCP or Azure for all persistent resource classes and routing every change through reviewed pull requests. The break-glass procedure for genuine emergencies is documented, requires two-person authorisation, and triggers an automatic IaC reconciliation ticket. Importing existing ClickOps resources into Terraform state belongs in the first sprint of any engagement, not in a future backlog.
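One way to make the import step reviewable is to generate Terraform `import` blocks (available from Terraform 1.5) from an inventory of console-created resources. The sketch below is illustrative only: the `render_import_blocks` helper and the inventory entries are hypothetical, and it assumes resource IDs are plain strings.

```python
import json


def render_import_blocks(resources: list[tuple[str, str]]) -> str:
    """Render one Terraform `import` block per (address, cloud_id) pair."""
    blocks = []
    for address, cloud_id in resources:
        # json.dumps produces a double-quoted, escaped string literal.
        blocks.append(
            f"import {{\n  to = {address}\n  id = {json.dumps(cloud_id)}\n}}\n"
        )
    return "\n".join(blocks)


# Hypothetical inventory of resources that were created by hand.
inventory = [
    ("aws_s3_bucket.assets", "example-assets-bucket"),
    ("aws_security_group.web", "sg-0123456789abcdef0"),
]

if __name__ == "__main__":
    print(render_import_blocks(inventory))
```

The generated blocks go into a pull request alongside matching resource definitions, so the import itself passes through the same review as any other change.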
Configuration Drift
Drift is the gap between what the IaC says should exist and what is actually running. It emerges from three sources: hotfixes applied directly to production, IaC modules that lag behind manual changes, and out-of-band configuration by vendor support or platform teams. Drift is invisible until it causes an incident: a terraform plan that proposes to revert a manual change made to a managed resource six months ago, or a security group that was manually opened and never closed. In a drifted environment, no one trusts the IaC, which means no one runs it, which accelerates the drift.
Continuous reconciliation is the only durable fix. For Kubernetes workloads, Argo CD's self-heal mode corrects drift within the sync interval. For cloud infrastructure, terraform plan is run as a scheduled CI job and any non-zero diff raises an alert. Manual changes are not forbidden in emergencies but must be immediately followed by an IaC update PR — the alert closes only when the plan output is clean. Drift tolerance is zero; exceptions require a written decision record.
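The scheduled drift check described above can be sketched in Python around `terraform plan -detailed-exitcode`, which exits with 0 when the plan is empty, 1 on error, and 2 when the plan contains changes. The function names and the injectable `run` parameter are assumptions made for this example so the logic can be exercised without Terraform installed.

```python
import subprocess
from typing import Callable


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "clean"   # plan succeeded with an empty diff
    if code == 2:
        return "drift"   # plan succeeded and found changes to apply
    return "error"       # the plan itself failed


def check_drift(workdir: str, run: Callable = subprocess.run) -> str:
    """Run a read-only plan in `workdir` and report clean, drift or error."""
    result = run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduled CI job would call `check_drift` on the main branch and raise an alert on anything other than `"clean"`. Note that a non-empty diff can also mean merged configuration has not been applied yet, so the alert text should say "plan is not clean" rather than assume a manual change.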
Security Bolted On at the End
Security reviews positioned as the last gate before production create a false sense of control. By the time a pen-test or vulnerability scan runs, the vulnerable dependency has been in production for weeks, the insecure Dockerfile pattern has been copied into 12 service repositories, and the IAM role with wildcard permissions has been assumed by a production workload. The cost of remediation scales linearly with the number of places the problem has propagated. A 'pass/fail at the gate' model also creates adversarial dynamics — security says no, delivery asks for exceptions, exceptions become the norm.
Security controls belong at the earliest enforcement point. Pre-commit hooks run secret detection. CI runs Trivy image scanning and OPA/Conftest policy checks. The admission webhook rejects non-compliant Kubernetes manifests. The IaC pipeline runs a CIS Benchmark check against the plan before apply. The security team writes policies, not approvals; the pipeline enforces them. A finding in CI is a build failure, not a review comment, which means it is fixed in minutes, not weeks.
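To show what a plan-time policy check looks like in principle, the sketch below inspects the JSON form of a Terraform plan (as produced by `terraform show -json`) and flags security groups that allow ingress from 0.0.0.0/0. It is a plain-Python stand-in for the kind of rule that would normally be written for OPA/Conftest, not a CIS Benchmark implementation, and the function name is hypothetical.

```python
def open_ingress_violations(plan: dict) -> list[str]:
    """Return addresses of security groups planned with ingress from anywhere."""
    violations = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_security_group":
            continue
        # `after` is null when the resource is being deleted.
        after = (change.get("change") or {}).get("after") or {}
        for rule in after.get("ingress") or []:
            if "0.0.0.0/0" in (rule.get("cidr_blocks") or []):
                violations.append(change["address"])
                break
    return violations


def enforce(plan: dict) -> int:
    """Exit-code style result: non-zero fails the CI job."""
    violations = open_ingress_violations(plan)
    for address in violations:
        print(f"DENY {address}: ingress open to 0.0.0.0/0")
    return 1 if violations else 0
```

Because the check returns a failing status instead of a comment, a violation stops the pipeline before apply, which is the behaviour the paragraph above describes.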
Alert Fatigue from Missing SLOs
Teams that instrument first and define SLOs later end up with hundreds of alerts set at arbitrary thresholds — CPU over 80 %, latency over 500 ms, error rate over 1 % — with no relationship to whether customers are actually experiencing a degraded service. When everything alerts, nothing is urgent. On-call engineers learn to ignore pages; critical incidents get buried in the noise. MTTR climbs not because the team is slow but because distinguishing real incidents from threshold-breaching background noise requires manual investigation every time.
Define SLOs before writing alerting rules. An SLO requires answering three questions: what user-facing behaviour is being protected, how it is measured (the SLI), and what failure rate is acceptable over a rolling window (the objective). Alerts fire on error budget burn rate, the rate at which the budget is being consumed relative to the allowed rate, not on instantaneous metric thresholds. Burn rate is a multiplier: at 1x the budget lasts exactly the SLO window. A slow burn sustained over one hour is informational; a 14x burn rate over five minutes is a page. This model produces fewer, higher-confidence alerts and breaks the alert-fatigue spiral.
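The burn-rate arithmetic can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a value of 1 means the budget would be used up exactly at the end of the SLO window. The function names, the choice of windows, and the informational threshold of 1x are assumptions for illustration; the 14x page threshold is the figure used in the text.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio (the budget)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def alert_severity(
    five_minute_error_ratio: float,
    one_hour_error_ratio: float,
    slo_target: float,
    page_threshold: float = 14.0,  # figure from the text
    info_threshold: float = 1.0,   # assumption: any sustained over-budget burn
) -> str:
    """Classify the current state as page, info or ok."""
    if burn_rate(five_minute_error_ratio, slo_target) >= page_threshold:
        return "page"
    if burn_rate(one_hour_error_ratio, slo_target) > info_threshold:
        return "info"
    return "ok"


# With a 99.9 % SLO the budget is 0.1 % of requests. A 2 % error ratio over
# five minutes is roughly a 20x burn, which is above the page threshold.
print(alert_severity(0.02, 0.002, slo_target=0.999))
```

Compare this with a static "error rate over 1 %" rule: the same 1 % reading is a 10x burn against a 99.9 % SLO but exactly on budget against a 99 % SLO, which is why the threshold has to be derived from the objective.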
Go deeper
Cloud & DevOps covers a lot of ground. Here are the specific things we're most often asked to build.
Common questions
Our app already runs but it is fragile — can you help?
That is a common starting point. We audit what you have, stabilise deployments and add basic monitoring first, then improve reliability and cut costs from a stable foundation.
Will you lock us into your tooling or your accounts?
No. We use standard, portable tooling — Docker, Terraform, mainstream clouds — on your accounts and document everything so any engineer can take it from here.
Have something in mind?
Tell us what you're building or stuck on. The first consultation is free — no obligation, no hard sell.