APIs & Integrations
Connect your stack, eliminate manual handoffs, own the data flow.
Most modern products are as much about what they connect to as what they do themselves. Whether you need a well-designed internal API, a webhook pipeline from Stripe into your database, a bidirectional sync between your CRM and a customer-facing app, or an integration with a carrier, government or payment API — these are repeatable engineering problems that need to be solved properly once.
We design and build APIs that are versioned, documented and built to extend. We instrument integrations with retries, idempotency keys and dead-letter queues so a downstream outage does not lose data. And we write integration tests that catch regressions before they hit production, not after.
What this covers
- RESTful API design following resource-oriented conventions, versioning and consistent error shapes
- GraphQL API development: schema design, resolvers, subscriptions and N+1 prevention
- Webhook infrastructure: signature verification, retry logic, idempotency and dead-letter queues
- Third-party API integrations: payment gateways, CRMs, ERPs, communication platforms and more
- OAuth 2.0 flows, API key management and token refresh handling
- OpenAPI / Swagger documentation auto-generated and kept in sync with the codebase
- Integration testing and contract tests that catch breaking changes before they reach production
Deliverables
- A live, versioned API or set of integrations with full OpenAPI documentation
- Source code in your repository with runbooks for each integration
- A test suite covering the core integration flows and failure scenarios
- Monitoring and alerting on integration health, latency and error rates
API and integration engineering
Integration engineering is the discipline of connecting systems reliably across trust, network, and protocol boundaries. It is distinct from feature development: the correctness criteria are about what happens when the remote system is slow, returns an unexpected shape, or goes down entirely — not just what happens on the happy path.
An integration that works at low traffic and breaks under load, silently drops events when a downstream service returns a 500, or processes the same webhook twice because the delivery guarantee is at-least-once is not a finished integration. It is a future incident. We build integrations that degrade gracefully, recover automatically, and surface failures with enough context to diagnose and replay them.
How we engineer integrations
These principles apply regardless of whether we are designing a public-facing API, a webhook consumer, or a bidirectional sync between two enterprise systems.
Design for at-least-once delivery
Most message-passing systems (webhooks, SQS, Kafka) guarantee at-least-once delivery, not exactly-once. Every consumer must therefore be idempotent: processing the same message twice must produce the same result as processing it once. We implement this with a processed_events table keyed on the inbound message ID, checked atomically before processing. The check and the write occur in the same database transaction as the business logic they protect.
Webhook signatures before any business logic
A webhook endpoint that processes any POST request without verifying the sender lets anyone who discovers the URL forge business events. Most major providers (Stripe, GitHub, Twilio, HubSpot) sign their webhook requests with a shared secret, but the scheme differs by provider: Stripe and GitHub compute an HMAC-SHA256 that covers the raw request body (Stripe prefixes a timestamp to the signed payload), while Twilio signs the request URL and parameters with HMAC-SHA1. We verify the signature according to the provider's documented scheme as the first operation in the handler, before deserialising the body, before logging the payload, and before touching the database. A signature verification failure is a 400 and a security alert, not a log line.
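A minimal sketch of the verification step, assuming a generic scheme in which the provider sends a hex-encoded HMAC-SHA256 of the raw body. The function name and parameters are illustrative; real providers differ in header name, encoding and signed content, so the provider's own SDK or documentation should take precedence.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, signature_hex: str, secret: bytes) -> bool:
    """Check a hex HMAC-SHA256 signature over the raw, unparsed request body."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The check operates on the bytes exactly as received. Parsing the JSON and re-serialising it first can change whitespace or key order and make a valid signature fail.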
Retry with exponential back-off and jitter
Outbound calls to third-party APIs will fail. The correct response to a transient failure (429 Too Many Requests, 503 Service Unavailable) is a retry after a delay, not an immediate second attempt that may hit the same error, and not a silent drop. We implement exponential back-off (delay = base * 2^attempt) with random jitter to prevent retry storms when multiple consumers recover simultaneously. Each retry is logged with the attempt number, the delay, and the full error response for post-incident analysis.
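The formula above can be sketched as follows. This version uses "full jitter" (a uniformly random delay between zero and the exponential ceiling), which is one common variant rather than the only option. TransientError, the 30-second cap and the default base are assumptions made for illustration.

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as a 429 or 503."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,  # must be at least 1
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError as exc:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: let the caller dead-letter it
            delay = backoff_delay(attempt)
            logging.warning(
                "retrying: attempt=%d delay=%.2fs error=%r", attempt + 1, delay, exc
            )
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

The sleep parameter is injectable so that tests can run without real delays.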
Dead-letter queues as the last line of defence
A message that exhausts its retry budget must land somewhere observable, not be silently discarded. We configure dead-letter queues (AWS SQS DLQ, RabbitMQ dead-letter exchange) on every consumer and set up alerting on DLQ depth. A message in the DLQ means a business event that has not been processed; it must be investigated and replayed once the root cause is resolved. The replay path is tested at build time, not discovered during an incident.
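To illustrate the retry budget and the replay path, here is an in-memory sketch. In production the queue service itself (for example an SQS redrive policy) moves the message to the DLQ. The Message and Consumer classes below are hypothetical stand-ins that show the control flow only.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    id: str
    body: dict
    attempts: int = 0
    last_error: str = ""


@dataclass
class Consumer:
    handler: Callable[[Message], None]
    max_attempts: int = 3
    dead_letters: List[Message] = field(default_factory=list)

    def consume(self, msg: Message) -> bool:
        """Try the handler up to max_attempts times, then dead-letter the message."""
        while msg.attempts < self.max_attempts:
            msg.attempts += 1
            try:
                self.handler(msg)
                return True
            except Exception as exc:  # broad catch is acceptable only in a sketch
                msg.last_error = repr(exc)  # keep context for diagnosis
        self.dead_letters.append(msg)  # observable, never silently dropped
        return False

    def replay(self) -> int:
        """Re-run every dead-lettered message. Returns how many succeeded."""
        pending, self.dead_letters = self.dead_letters, []
        succeeded = 0
        for msg in pending:
            msg.attempts = 0
            if self.consume(msg):
                succeeded += 1
        return succeeded
```

Messages that fail again during replay land back in dead_letters, so a replay that runs before the root cause is fixed loses nothing.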
API contracts versioned and never silently broken
Changing a response field name, removing an optional field, or tightening a validation rule on an existing endpoint is a breaking change for any consumer that was not present at the design review. We version APIs with a /v1/, /v2/ path prefix, maintain the previous version for a documented deprecation window (minimum 90 days with written notice to consumers), and enforce contract tests in CI that fail the build if the running API no longer satisfies a previously published OpenAPI spec.
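The idea behind a contract check can be sketched in a few lines. This is a deliberately simplified illustration that compares flat maps of field name to type name. A real CI check would validate live responses against the published OpenAPI document with dedicated tooling, and the function below is hypothetical.

```python
from typing import Dict, List


def breaking_changes(published: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """List the ways `current` breaks consumers built against `published`."""
    problems = []
    for name, type_name in published.items():
        if name not in current:
            # A rename also shows up here: the old name is gone.
            problems.append(f"removed field: {name}")
        elif current[name] != type_name:
            problems.append(f"type changed: {name} {type_name} -> {current[name]}")
    return problems  # fields that exist only in `current` are additive and allowed


v1 = {"id": "string", "amount": "integer", "note": "string"}
candidate = {"id": "string", "amount": "string", "currency": "string"}

for problem in breaking_changes(v1, candidate):
    print(problem)
# prints:
# type changed: amount integer -> string
# removed field: note
```

A build step would fail whenever the returned list is non-empty for any previously published version.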
How we'd choose
There's rarely one right answer — these are the trade-offs we weigh before recommending an approach.
REST vs Webhooks vs Event-driven (queue-based)
The communication pattern you choose for an integration determines its latency characteristics, its failure semantics, and the operational overhead of keeping it running. These are not interchangeable — the right pattern depends on whether the data flow is request-driven or event-driven, whether the consumer can be unavailable, and how much ordering and delivery guarantee you require.
| Criterion | REST (request-response) | Webhooks (provider-pushed HTTP) | Event-driven (queue or stream) |
|---|---|---|---|
| Data flow direction | Consumer initiates — pulls data on demand or on a polling schedule | Provider initiates — pushes an event HTTP callback the moment state changes | Producer publishes to a durable log or queue; consumer reads at its own pace |
| Latency | Latency is the round-trip time of the HTTP call plus any polling interval; real-time updates require polling intervals short enough to be wasteful | Near-real-time from the provider’s perspective; total latency includes the consumer’s HTTP response time and any retry delay | Sub-second for Kafka with appropriate partition and consumer configuration; SQS standard queues add tens to hundreds of milliseconds |
| Consumer availability requirement | Consumer must be available when it initiates the call; if it is down, no data is pulled during the outage | Consumer must be available to receive the push; if it returns a non-2xx, the provider retries (duration and retry count vary by provider and are not under your control) | Consumer can be offline; messages accumulate in the queue or topic and are processed when the consumer recovers, subject to retention period |
| Delivery guarantee | None from the transport layer: a 200 response means the request was received, not that the business logic succeeded | At-least-once from providers that retry (Stripe retries live-mode deliveries for up to three days); some providers, such as GitHub, do not automatically redeliver failed deliveries, so check each provider's policy; exactly-once requires idempotency at the consumer | At-least-once on SQS standard and Kafka; SQS FIFO with deduplication ID provides effective exactly-once for message IDs within a 5-minute deduplication window |
| Ordering | Ordered within a single request chain; no ordering guarantees across concurrent requests | No ordering guarantees — webhook events for the same resource can arrive out of sequence; consumer must handle this | Kafka guarantees ordering within a partition; SQS FIFO guarantees ordering within a message group ID; standard SQS does not guarantee order |
| When to choose | Reading reference data on demand, user-initiated actions that need a synchronous response, or integrations where the consumer controls the pull cadence | Real-time reaction to third-party system events (payment confirmed, order shipped, user created) where the provider offers a webhook API and the consumer can be made highly available | High-volume event processing, fan-out to multiple consumers from one event source, decoupling producer and consumer release cycles, or any integration where guaranteed delivery and independent scaling are requirements |
Anti-patterns we prevent in integration work
Synchronous chaining of third-party calls in the request path
A user action triggers an API handler that calls three third-party services in sequence: a CRM update, a communication platform message, and a data warehouse event. Total response time is the sum of three remote call latencies plus the application’s own logic. A single third-party timeout (which any provider can experience under load) pushes the user-facing response above 10 seconds or results in a 504. Worse, if the second call succeeds and the third fails, the CRM and communication platform have been updated but the data warehouse has not, leaving the three systems inconsistent.
Move side-effect calls out of the request path and into a background job queue. The handler writes an event to a queue (SQS, BullMQ, Inngest) and returns immediately. A worker processes the event asynchronously, retries on failure, and writes the result to the database. The user receives a fast response; the integrations complete reliably in the background. Reserve synchronous third-party calls for operations where the response is required to complete the user’s action (payment authorisation, identity verification).
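The shape of this pattern, sketched with Python's in-process queue as a stand-in for a durable queue such as SQS. An in-memory queue loses its jobs on restart, so this shows the control flow only. The function names and the job payload are illustrative.

```python
import queue
import threading
from typing import Callable


def start_worker(jobs: queue.Queue, side_effect: Callable[[dict], None]) -> threading.Thread:
    """Run side effects in the background, one job at a time."""
    def run() -> None:
        while True:
            job = jobs.get()
            try:
                side_effect(job)
            except Exception as exc:
                # A real worker would retry with back-off, then dead-letter.
                print(f"job failed: {job!r} error={exc!r}")
            finally:
                jobs.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def handle_request(jobs: queue.Queue, user_id: str) -> dict:
    """Request handler: enqueue the side effect and return immediately."""
    jobs.put({"type": "crm_update", "user_id": user_id})
    return {"status": "accepted"}
```

The handler's latency now depends only on the enqueue operation, not on any third-party response time.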
Rate limit responses treated as errors
The integration receives a 429 Too Many Requests from the upstream API and either surfaces this as an application error to the user or retries immediately in a tight loop. The immediate retry hits the rate limit again, burns through the retry budget faster, and — if multiple worker instances are doing the same thing simultaneously — creates a retry storm that keeps the consumer locked out long after the rate limit window would have reset.
Treat 429 responses as signals to back off, not as errors. Read the Retry-After header if the provider sends one and honour it exactly. If no header is present, apply exponential back-off with jitter starting from the expected rate limit window (typically 1 second for per-second limits, 60 seconds for per-minute limits). Instrument rate limit hits as a separate metric category distinct from errors so you can distinguish ‘we are calling too fast’ from ‘the API is broken’. If rate limits are hit regularly, implement client-side throttling with a token bucket before requests leave your service.
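A sketch of the two mechanisms described above: reading Retry-After and throttling on the client side with a token bucket. Assumptions: headers arrive as a plain mapping with the canonical "Retry-After" key, only the delay-in-seconds form is parsed (the HTTP-date form is left unhandled), and the bucket is not thread-safe.

```python
import time
from typing import Callable, Mapping, Optional


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[int]:
    """Return the Retry-After delay in seconds, or None if absent or not an integer."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    try:
        return max(0, int(value))
    except ValueError:
        return None  # probably the HTTP-date form: fall back to back-off


class TokenBucket:
    """Allow up to `capacity` calls in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should wait or queue the request
```

Calling try_acquire before each outbound request keeps the service below the provider's limit, so a 429 becomes the exception rather than the throttle.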
Integration health invisible until a customer reports a problem
The integration runs, webhooks are processed, and outbound calls complete — most of the time. When the downstream service degrades or a schema change silently breaks deserialisation, the first signal is a customer support ticket or a missing record noticed in a weekly report. By then, the event backlog may span hours or days and the replay is a manual operation.
Instrument every integration with four signals: inbound event receipt rate, processing success rate, outbound call error rate, and queue depth (for async consumers). Alert on sustained drops in receipt rate (provider may have stopped delivering), on processing error rate above a threshold, and on DLQ depth above zero. Log every event with enough context to replay it: the raw inbound payload, the processing outcome, and the idempotency key. A dead-letter replay command must be a documented operational runbook step, not an improvised database query.
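The alert conditions above could be expressed roughly as follows. The snapshot fields, the threshold defaults and the alert names are assumptions for illustration. A real setup would define these rules in the monitoring system and evaluate them over a time window, so that only sustained drops fire.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntegrationSnapshot:
    receipt_rate: float           # inbound events per minute, current window
    baseline_receipt_rate: float  # typical events per minute for this period
    processing_error_rate: float  # failed / processed, between 0.0 and 1.0
    outbound_error_rate: float    # recorded for dashboards, not alerted on here
    queue_depth: int              # recorded for dashboards, not alerted on here
    dlq_depth: int


def alerts(
    s: IntegrationSnapshot,
    min_receipt_ratio: float = 0.5,  # assumed threshold
    max_error_rate: float = 0.05,    # assumed threshold
) -> List[str]:
    fired = []
    if s.baseline_receipt_rate > 0 and s.receipt_rate < s.baseline_receipt_rate * min_receipt_ratio:
        fired.append("receipt rate dropped")  # provider may have stopped delivering
    if s.processing_error_rate > max_error_rate:
        fired.append("processing error rate high")
    if s.dlq_depth > 0:
        fired.append("dlq not empty")  # an unprocessed business event exists
    return fired
```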
Common questions
We need to connect two systems that do not have good APIs — is that possible?
Often, yes. Where a direct API does not exist we use scheduled ETL jobs, event-driven file exchange or — where necessary — browser automation. We document the approach and the caveats honestly so you understand exactly how fragile or robust each connection is.
How do you handle failures in integration pipelines?
We use retries with exponential back-off, idempotency keys on write operations, dead-letter queues for messages that exhaust their retries, and alerting when queue depth rises. Every integration we build assumes the downstream service will fail eventually, because it will.
Have something in mind?
Tell us what you're building or stuck on. The first consultation is free — no obligation, no hard sell.