
Generative AI & LLM Apps

LLM-powered products grounded in your data — reliable, observable, yours.

An LLM can generate plausible-sounding nonsense about your customers' data just as easily as accurate answers. The engineering work in a production LLM application is mostly about preventing the former: retrieval that actually finds relevant context, prompt architecture that is explicit about what the model should and should not do, evaluation that measures factual accuracy on your own test cases, and observability that surfaces degradation before a user notices.

We build and ship LLM applications that stay useful after the demo: RAG systems over internal documents or databases, AI assistants embedded in your product, structured extraction pipelines that turn PDFs and emails into clean data, and multi-step agents that handle repetitive decision-making. Every system is observable, cost-bounded and backed by an eval suite tied to your specific quality bar.


what's included

What this covers

Retrieval-Augmented Generation (RAG) over documents, databases and knowledge bases

System prompt architecture, context management and token-budget optimisation

Structured output extraction from PDFs, emails, forms and unstructured text

Conversation memory strategies: windowed history, summary compression, long-term vector store

Hallucination mitigation: grounding, citation generation and confidence scoring

Evaluation harness: automated test cases against your own ground-truth dataset

Production observability: latency, token cost, retrieval recall and output quality monitoring

what you get

Deliverables

A live LLM feature or application, deployed as an API or embedded in your product
An evaluation report against your specific quality criteria — accuracy, latency, cost per call
Prompt configuration and retrieval pipeline in source control, documented and adjustable
Observability dashboard covering cost, latency and quality signals in production

tools & stack

What we build with

OpenAI API, Anthropic API, LangChain, LlamaIndex, pgvector, Pinecone, Weaviate, Chroma, Python, FastAPI, Next.js, MLflow, LangSmith

what we mean

Production LLM Systems

A production LLM system is not a model — it is a pipeline. The model is a commodity component; the engineering challenge is the retrieval layer that grounds it in your data, the eval harness that measures whether it is working, the guardrail layer that keeps it safe, and the observability that surfaces cost and quality signals in real time. Each of these layers has to be built, tested, and versioned like any other service.

The two design decisions that determine most downstream properties are chunking strategy and vector store selection. Chunking determines what units of information the model can see at retrieval time — too coarse and the context is noisy, too fine and the semantic signal is lost. Vector store selection determines the retrieval performance envelope and the operational surface area you take on. Both deserve deliberate engineering, not default settings.

how we work

RAG System Build Lifecycle

Building a RAG pipeline involves five distinct engineering layers, each with its own failure modes. The order matters — eval infrastructure must precede any retrieval experimentation, or you are iterating blind.

Corpus preparation and chunking

Raw documents are cleaned, segmented into retrievable units, and enriched with metadata. Chunking strategy is the first load-bearing decision: the wrong choice cannot be fixed cheaply at retrieval time.

Audit the corpus: document types, length distribution, language, structure (prose vs. tables vs. code)
Select and implement a chunking strategy — fixed-size, recursive character splitting, or semantic chunking via sentence boundary detection
Attach metadata to each chunk: source document ID, page number, section heading, creation date, and any domain-specific fields that should participate in filtered retrieval
Validate chunk quality: measure average chunk length, token count distribution, and identify pathological cases (tables split mid-row, code blocks fragmented)
Implement a deterministic chunk-ID scheme so the index can be incrementally updated without a full re-embed
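As a rough illustration of the last two steps, the sketch below pairs a simplified recursive character splitter with a content-derived chunk ID. It is not LangChain's implementation: the separator hierarchy, the `Chunk` dataclass, the `make_chunks` helper and the 400-character default are all assumptions made for the example, and it omits overlap and metadata enrichment.

```python
import hashlib
from dataclasses import dataclass

# Separators tried in order, coarsest first (illustrative, not a library default).
SEPARATORS = ["\n\n", "\n", ". ", " "]


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_id: str
    text: str


def split_recursive(text: str, max_chars: int, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator, recursing only into oversized parts."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No separator left: fall back to a hard character window.
        pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
        return [p.strip() for p in pieces if p.strip()]
    head, rest = separators[0], list(separators[1:])
    pieces, current = [], ""
    for part in text.split(head):
        candidate = f"{current}{head}{part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small neighbouring parts
            continue
        if current:
            pieces.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            current = ""
            pieces.extend(split_recursive(part, max_chars, rest))
    if current:
        pieces.append(current)
    return [p.strip() for p in pieces if p.strip()]


def make_chunks(doc_id: str, text: str, max_chars: int = 400) -> list[Chunk]:
    chunks = []
    for piece in split_recursive(text, max_chars):
        # The ID depends only on the document ID and the chunk text, so an
        # unchanged chunk keeps its ID across runs and can be upserted.
        # Identical chunks within one document collapse to the same ID.
        digest = hashlib.sha256(f"{doc_id}\x00{piece}".encode("utf-8")).hexdigest()[:16]
        chunks.append(Chunk(chunk_id=f"{doc_id}:{digest}", doc_id=doc_id, text=piece))
    return chunks
```

Because the ID is a hash of the content, editing one paragraph changes only the IDs of the chunks that contain it; the rest of the index can stay as it is.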

Embedding and indexing

Chunks are encoded as dense vectors and written to a vector store with the right index configuration for the expected query volume and corpus size.

Select an embedding model based on the retrieval task: a model trained for asymmetric retrieval (short query against longer passage) for question-answering, and one trained for symmetric similarity for semantic deduplication
Benchmark embedding models on a held-out query set using recall@k (typically k=5 and k=10) before committing to a model
Configure HNSW index parameters: efConstruction and M trade build time and memory against recall, while the query-time ef (efSearch) trades recall against query latency; validate with recall and latency percentile measurements
Implement hybrid search where keyword-heavy corpora demand it: combine BM25 sparse retrieval with dense vector search via reciprocal rank fusion (RRF)
Build incremental indexing pipeline: new documents are chunked, embedded, and upserted without re-indexing the full corpus
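The hybrid search step can be sketched with reciprocal rank fusion, which merges ranked lists using ranks only, so BM25 scores and cosine similarities never need to share a scale. The function name and the example chunk IDs are illustrative; `k = 60` is the constant commonly used with RRF and should be treated as a starting value rather than a tuned one.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one list of (id, score)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); absent chunks contribute 0.
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["c3", "c1", "c7"]   # hypothetical sparse results
dense_ranking = ["c1", "c4", "c3"]  # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Chunks that appear in both lists (`c1` and `c3` here) rise above chunks that appear in only one, which is the behaviour hybrid search relies on.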

Retrieval tuning and reranking

Initial ANN retrieval is tuned for recall, so its ranking is imprecise. A reranking stage uses a cross-encoder to re-score the top-k candidates with full query-document attention, which typically improves precision at the top of the list.

Build a retrieval eval set: 50-200 queries with annotated relevant chunks; compute recall@5, MRR, and NDCG@10 as baseline metrics
Tune top-k for initial retrieval: higher k improves recall but increases reranker latency and context window usage; measure the recall-latency trade-off explicitly
Implement cross-encoder reranking (Cohere Rerank, sentence-transformers cross-encoder, or a fine-tuned model) and measure precision improvement on the eval set
Test metadata-filtered retrieval: filter by document type, date range, or department before vector search to reduce noise on scoped queries
Evaluate contextual retrieval: prepend a short document-level summary to each chunk before embedding to improve retrieval of chunks with low lexical overlap to the query
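The baseline metrics named in the first step are small enough to write out in full. This is a minimal sketch assuming binary relevance labels and a `retrieve` callable that maps a query to a ranked list of chunk IDs without duplicates; both the callable and the shape of `eval_set` are assumptions for the example.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """NDCG with binary gains: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def evaluate(eval_set, retrieve, k_recall: int = 5, k_ndcg: int = 10) -> dict[str, float]:
    """eval_set: list of (query, set of relevant chunk IDs) pairs."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    recall = mrr = ndcg = 0.0
    for query, relevant in eval_set:
        ranked = retrieve(query)
        recall += recall_at_k(ranked, relevant, k_recall)
        mrr += reciprocal_rank(ranked, relevant)
        ndcg += ndcg_at_k(ranked, relevant, k_ndcg)
    n = len(eval_set)
    return {f"recall@{k_recall}": recall / n, "mrr": mrr / n, f"ndcg@{k_ndcg}": ndcg / n}
```

Running `evaluate` once per chunking or embedding configuration, on the same eval set, gives directly comparable numbers for each change.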

Generation, prompt architecture, and output validation

Retrieved chunks become grounded context for the generation model. Prompt architecture determines whether the model uses that context faithfully or ignores it in favour of parametric memory.

Design the system prompt to establish the model's role, its authority boundaries, and an explicit instruction to answer only from the provided context
Implement citation generation: require the model to attribute each factual claim to a specific source chunk ID; validate that cited chunks exist and contain the attributed text
Apply structured output where the response format is predictable: use function calling or response schemas to guarantee parseable JSON rather than relying on prompt instructions
Build a faithfulness evaluation step using an LLM judge (for example, the RAGAS faithfulness metric) to measure whether each answer is entailed by the retrieved context, not just plausible
Integrate output guardrails as a separate validation layer: PII detection, topic restriction, and toxicity scoring run on the generated output before it reaches the caller
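The citation check in the list above can be sketched as a plain validation function. The JSON shape used here (a `claims` list whose items carry `chunk_id` and `quote`) is a hypothetical response schema chosen for the example, not a format any provider prescribes, and the substring match is a deliberately strict stand-in for fuzzier attribution checks.

```python
import json
import re


def _normalise(text: str) -> str:
    """Collapse whitespace and lowercase so formatting does not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def validate_citations(answer_json: str, context: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every claim checks out.

    context maps chunk ID -> chunk text for the chunks shown to the model.
    """
    try:
        payload = json.loads(answer_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    claims = payload.get("claims") if isinstance(payload, dict) else None
    if not isinstance(claims, list) or not claims:
        return ["no claims found"]
    problems = []
    for index, claim in enumerate(claims):
        if not isinstance(claim, dict):
            problems.append(f"claim {index}: not an object")
            continue
        chunk_id = claim.get("chunk_id")
        quote = _normalise(str(claim.get("quote", "")))
        if not isinstance(chunk_id, str) or chunk_id not in context:
            problems.append(f"claim {index}: unknown chunk_id {chunk_id!r}")
        elif not quote or quote not in _normalise(context[chunk_id]):
            problems.append(f"claim {index}: quote not found in {chunk_id}")
    return problems
```

A non-empty result is a signal to retry the generation or fall back to an "I do not know" response rather than return an unverifiable answer.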

Production observability and continuous evaluation

A RAG pipeline degrades as the corpus ages, the query distribution shifts, and the upstream embedding model is updated. Observability closes the loop so degradation is detected and corrected before users report it.

Log every retrieval event: query, top-k chunks with scores, reranker output, final context window, generated answer, and latency breakdown by stage
Instrument token usage and cost per call; set per-endpoint budget alerts so a runaway query pattern is caught before it materialises as a surprise invoice
Run the eval harness against a sample of the production query log on a weekly cadence; track recall@5, faithfulness, and answer relevance as time-series metrics
Monitor retrieval score distribution: a downward drift in mean cosine similarity between query and top-1 chunk often signals corpus staleness or embedding model drift
Implement a human feedback loop: a thumbs-up/down signal on generated answers creates a labelled dataset for reranker fine-tuning and a leading indicator of quality degradation
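The score-distribution monitor can start as a rolling mean compared against a baseline captured at launch. In this sketch the class name, the window size and the `max_drop` threshold are all illustrative assumptions; in practice they would be set from the variance observed in your own logs.

```python
from collections import deque
from statistics import mean


class Top1SimilarityMonitor:
    """Flags a sustained drop in query-to-top-1 cosine similarity."""

    def __init__(self, baseline_mean: float, window: int = 500, max_drop: float = 0.05):
        self.baseline_mean = baseline_mean
        self.max_drop = max_drop
        self.scores: deque[float] = deque(maxlen=window)

    def observe(self, top1_score: float) -> bool:
        """Record one score; return True once the rolling mean has drifted."""
        self.scores.append(top1_score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data for a stable rolling mean yet
        return self.baseline_mean - mean(self.scores) > self.max_drop
```

A `True` result here is a prompt to investigate, not a diagnosis: the same drift can come from a stale corpus, a changed embedding model, or a shift in what users are asking.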

decision guides

How we'd choose

There's rarely one right answer — these are the trade-offs we weigh before recommending an approach.

Chunking strategy: fixed-size vs recursive vs semantic

Chunking is the single most impactful retrieval-layer decision and the one most often left at its default. The right strategy depends on the corpus structure and the nature of the queries the system must answer.

Criterion	Fixed-size (character/token window)	Recursive character splitting	Semantic (sentence-boundary / topic-aware)
Best corpus fit	Homogeneous prose with consistent paragraph length; technical manuals, support articles	Mixed-structure documents where paragraph and section boundaries vary; the default for most corpora	Long narrative documents where topic shifts within a section; research papers, legal contracts
Retrieval precision	Low — chunk boundaries are arbitrary and frequently split mid-sentence or mid-argument	Medium — respects paragraph and sentence boundaries within the window; reduces semantic fragmentation	High — chunks align to semantic units; cross-encoder rerankers perform best on semantically coherent chunks
Context window efficiency	Predictable token count per chunk; easy to fit k chunks into a context budget	Mildly variable chunk size; requires token counting at retrieval time to manage context budget	High variance — some semantic segments are short (a list item), some are long (a multi-sentence argument); requires dynamic context assembly
Implementation complexity	Trivial — one parameter (chunk size) and one optional overlap window	Low — LangChain RecursiveCharacterTextSplitter with a hierarchy of separator tokens	Moderate to high — requires a sentence tokeniser, an optional embedding-based coherence metric, and tuning of the similarity threshold
Metadata preservation	Poor — page and section context must be injected as a prefix string into every chunk	Good — splitter can be configured to preserve paragraph-level context as metadata	Excellent — semantic segments naturally map to document sections; heading metadata attaches cleanly
Recommended starting point	No — only if corpus is extremely homogeneous and you need a zero-configuration baseline	Yes — use as the default; tune separator list and chunk size against recall@5 on your eval set	Use when recursive splitting gives recall@5 below 0.70 on long-form documents; adds latency to the indexing pipeline
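The "dynamic context assembly" that variable-sized chunks require can be as simple as greedy packing against a token budget. This sketch assumes chunks arrive already ranked; `approx_tokens` is a crude stand-in (about four characters per token for English text) that should be replaced by the tokeniser of the model actually in use.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    # Rough estimate only; swap in the real tokeniser for the target model.
    return max(1, len(text) // 4)


def assemble_context(
    ranked_chunks: list[tuple[str, str]],
    budget: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> list[str]:
    """Pick chunk IDs in rank order until the token budget is used up.

    ranked_chunks: (chunk_id, text) pairs, best first.
    """
    selected, used = [], 0
    for chunk_id, text in ranked_chunks:
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # too large for the remaining budget; a later chunk may fit
        selected.append(chunk_id)
        used += cost
    return selected
```

With fixed-size chunks this reduces to "take the top k"; with semantic chunks it lets one long, highly ranked segment be skipped in favour of several shorter ones.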

what we avoid

Anti-Patterns in LLM Product Engineering

These are the three failure modes most commonly found in LLM systems that were built fast and evaluated never.

Retrieval by vibes

The chunking strategy and embedding model are chosen by looking at two or three example retrievals and deciding they look reasonable. There is no eval set, no recall@k measurement, and no latency benchmark. The team discovers that retrieval is failing specific query classes only after users report incorrect answers in production.

Build a retrieval eval set before writing the generation prompt. A set of 50 queries sampled from the actual query distribution, each paired with its relevant chunks, is sufficient to make chunking and embedding model decisions on evidence rather than intuition. Measure recall@5 as the primary metric; a value below 0.65 on this set predicts user-visible retrieval failures.

Prompt-layer guardrails

Safety instructions (do not reveal PII, do not answer off-topic questions, do not impersonate a human) are written into the system prompt. This is necessary but not sufficient. A prompt injection in user input, a jailbreak that prefixes the generation with an override instruction, or a context long enough to dilute the model's attention to the system prompt can all bypass prompt-layer controls entirely.

Implement guardrails as a separate, independently deployable service layer that validates both input and output. Input guardrails run before the generation call and detect prompt injection patterns, PII in the query, and out-of-scope topic classification. Output guardrails run on the completion before it reaches the caller. Both layers are version-controlled, tested against a red-team dataset, and monitored by trigger rate in production.
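The shape of such a layer can be sketched in a few lines. The two regular expressions below are toy detectors chosen to keep the example self-contained; a production layer would use trained classifiers and a proper PII detector. `Verdict`, `guarded_call` and the refusal text are hypothetical names for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Toy detectors for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
INJECTION = re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE)
REFUSAL = "Sorry, I can't help with that request."


@dataclass
class Verdict:
    allowed: bool
    reasons: list[str] = field(default_factory=list)


def check_input(user_text: str) -> Verdict:
    reasons = []
    if INJECTION.search(user_text):
        reasons.append("prompt_injection")
    if EMAIL.search(user_text):
        reasons.append("pii_in_input")
    return Verdict(allowed=not reasons, reasons=reasons)


def check_output(completion: str) -> Verdict:
    reasons = ["pii_in_output"] if EMAIL.search(completion) else []
    return Verdict(allowed=not reasons, reasons=reasons)


def guarded_call(user_text: str, generate: Callable[[str], str]) -> tuple[str, Verdict]:
    """Run input checks, then the model, then output checks."""
    verdict = check_input(user_text)
    if not verdict.allowed:
        return REFUSAL, verdict  # the model is never called
    completion = generate(user_text)
    verdict = check_output(completion)
    if not verdict.allowed:
        return REFUSAL, verdict
    return completion, verdict
```

Because every block carries a reason code, counting `Verdict.reasons` over time gives the trigger-rate signal described above without touching the prompt.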

No faithfulness baseline

The team measures answer quality by reading outputs and judging them qualitatively. There is no automated faithfulness metric — no measurement of whether the generated answer is actually entailed by the retrieved chunks. As a result, the system ships with an unknown hallucination rate, and the first signal of a quality problem comes from an escalation rather than a monitoring alert.

Integrate a faithfulness metric into the eval harness before the first internal release. RAGAS faithfulness, which uses an LLM judge to assess whether each claim in the answer is supported by the retrieved context, provides a quantitative baseline. An NLI-based entailment classifier provides a cheaper, faster alternative suitable for high-frequency CI runs. Neither needs reference answers, only a sample of queries with their retrieved context and generated answers, so the investment is small relative to the cost of shipping a system with unmeasured hallucination behaviour.
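The harness side of this is independent of which judge is used. The sketch below scores an answer as the fraction of its sentences that a pluggable `judge` accepts as supported by the context. The sentence splitter is naive, and `lexical_overlap_judge` is only a placeholder so the example runs without a model: it measures word overlap, not entailment, and would be swapped for an LLM judge or NLI classifier in practice.

```python
import re
from typing import Callable

Judge = Callable[[str, str], bool]  # (sentence, context) -> supported?


def split_sentences(answer: str) -> list[str]:
    """Naive splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def faithfulness(answer: str, context: str, judge: Judge) -> float:
    """Fraction of answer sentences the judge marks as supported."""
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    supported = sum(1 for sentence in sentences if judge(sentence, context))
    return supported / len(sentences)


def lexical_overlap_judge(sentence: str, context: str, threshold: float = 0.8) -> bool:
    """Placeholder judge: share of the sentence's words found in the context."""
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    return bool(words) and len(words & context_words) / len(words) >= threshold


context = "Refunds are issued within 14 days of purchase."
answer = "Refunds are issued within 14 days. Shipping is free worldwide."
score = faithfulness(answer, context, lexical_overlap_judge)
```

In this example the second sentence has no support in the context, so only one of the two sentences counts. Averaging the score over the eval set and failing CI below an agreed threshold turns the metric into a release gate.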

good to know

Common questions

How do you prevent the assistant from making things up?

Through retrieval grounding, where the model is instructed to answer from retrieved context rather than from training memory, combined with explicit instructions to say 'I do not know', citation generation tied to source chunks, and an automated eval suite that flags hallucinations on your test cases before each release.

How much will it cost to run in production?

We model inference cost during the design phase: which model, what context window, what call frequency, whether caching is viable. You get a realistic cost estimate before we build, and we instrument the running cost so you can see the actual spend from day one.
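The design-phase estimate is simple arithmetic once the inputs are agreed. In the sketch below every number is a placeholder: the per-million-token prices are not a quote for any provider, and the token counts, call volume and cache hit rate are assumptions to be replaced with measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    input_price_per_mtok: float   # currency units per million input tokens
    output_price_per_mtok: float  # currency units per million output tokens
    cache_hit_rate: float = 0.0   # fraction of calls served from a response cache

    def cost_per_call(self, input_tokens: int, output_tokens: int) -> float:
        return (
            input_tokens * self.input_price_per_mtok
            + output_tokens * self.output_price_per_mtok
        ) / 1_000_000

    def monthly_cost(self, calls_per_day: int, input_tokens: int,
                     output_tokens: int, days: int = 30) -> float:
        # Cached calls are treated as free, which ignores cache infrastructure cost.
        billable_calls = calls_per_day * days * (1.0 - self.cache_hit_rate)
        return billable_calls * self.cost_per_call(input_tokens, output_tokens)


# Placeholder inputs: 4,000 context tokens in, 500 tokens out, 10,000 calls/day.
model = CostModel(input_price_per_mtok=2.0, output_price_per_mtok=8.0, cache_hit_rate=0.2)
per_call = model.cost_per_call(input_tokens=4_000, output_tokens=500)
monthly = model.monthly_cost(calls_per_day=10_000, input_tokens=4_000, output_tokens=500)
```

With these placeholder values, the retrieved context accounts for two thirds of the per-call cost, which is why top-k and chunk size are cost decisions as well as quality decisions.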


Have something in mind?

Tell us what you're building or stuck on. The first consultation is free — no obligation, no hard sell.
