Generative AI & LLM Apps
LLM-powered products grounded in your data — reliable, observable, yours.
An LLM can generate plausible-sounding nonsense about your customers' data just as easily as accurate answers. The engineering work in a production LLM application is mostly about preventing the former: retrieval that actually finds relevant context, prompt architecture that is explicit about what the model should and should not do, evaluation that measures factual accuracy on your own test cases, and observability that surfaces degradation before a user notices.
We build and ship LLM applications that stay useful after the demo: RAG systems over internal documents or databases, AI assistants embedded in your product, structured extraction pipelines that turn PDFs and emails into clean data, and multi-step agents that handle repetitive decision-making. Every system is observable, cost-bounded and backed by an eval suite tied to your specific quality bar.
What this covers
- Retrieval-Augmented Generation (RAG) over documents, databases and knowledge bases
- System prompt architecture, context management and token-budget optimisation
- Structured output extraction from PDFs, emails, forms and unstructured text
- Conversation memory strategies: windowed history, summary compression, long-term vector store
- Hallucination mitigation: grounding, citation generation and confidence scoring
- Evaluation harness: automated test cases against your own ground-truth dataset
- Production observability: latency, token cost, retrieval recall and output quality monitoring
Deliverables
- A live LLM feature or application, deployed as an API or embedded in your product
- An evaluation report against your specific quality criteria — accuracy, latency, cost per call
- Prompt configuration and retrieval pipeline in source control, documented and adjustable
- Observability dashboard covering cost, latency and quality signals in production
What we build with
Production LLM Systems
A production LLM system is not a model — it is a pipeline. The model is a commodity component; the engineering challenge is the retrieval layer that grounds it in your data, the eval harness that measures whether it is working, the guardrail layer that keeps it safe, and the observability that surfaces cost and quality signals in real time. Each of these layers has to be built, tested, and versioned like any other service.
The two design decisions that determine most downstream properties are chunking strategy and vector store selection. Chunking determines what units of information the model can see at retrieval time — too coarse and the context is noisy, too fine and the semantic signal is lost. Vector store selection determines the retrieval performance envelope and the operational surface area you take on. Both deserve deliberate engineering, not default settings.
RAG System Build Lifecycle
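Building a RAG pipeline involves five distinct engineering layers, each with its own failure modes. The checklist below works through them in order, five steps per layer; the layers themselves are summarised after the list. The order matters: eval infrastructure must precede any retrieval experimentation, or you are iterating blind.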
- Audit the corpus: document types, length distribution, language, structure (prose vs. tables vs. code)
- Select and implement a chunking strategy — fixed-size, recursive character splitting, or semantic chunking via sentence boundary detection
- Attach metadata to each chunk: source document ID, page number, section heading, creation date, and any domain-specific fields that should participate in filtered retrieval
- Validate chunk quality: measure average chunk length, token count distribution, and identify pathological cases (tables split mid-row, code blocks fragmented)
- Implement a deterministic chunk-ID scheme so the index can be incrementally updated without a full re-embed
- Select an embedding model based on the retrieval task: a model trained for asymmetric search (short query against longer passage) for question-answering, a model trained for symmetric similarity for semantic deduplication
- Benchmark embedding models on a held-out query set using recall@k (typically k=5 and k=10) before committing to a model
- Configure HNSW index parameters: efConstruction and M trade build time and memory against recall, and the query-time ef (efSearch) trades recall against query latency; validate with recall and latency percentile measurements
- Implement hybrid search where keyword-heavy corpora demand it: combine BM25 sparse retrieval with dense vector search via reciprocal rank fusion (RRF)
- Build incremental indexing pipeline: new documents are chunked, embedded, and upserted without re-indexing the full corpus
- Build a retrieval eval set: 50-200 queries with annotated relevant chunks; compute recall@5, MRR, and NDCG@10 as baseline metrics
- Tune top-k for initial retrieval: higher k improves recall but increases reranker latency and context window usage; measure the recall-latency trade-off explicitly
- Implement cross-encoder reranking (Cohere Rerank, sentence-transformers cross-encoder, or a fine-tuned model) and measure precision improvement on the eval set
- Test metadata-filtered retrieval: filter by document type, date range, or department before vector search to reduce noise on scoped queries
- Evaluate contextual retrieval: prepend a short document-level summary to each chunk before embedding to improve retrieval of chunks with low lexical overlap to the query
- Design the system prompt to establish the model's role, its authority boundaries, and an explicit instruction to answer only from the provided context
- Implement citation generation: require the model to attribute each factual claim to a specific source chunk ID; validate that cited chunks exist and contain the attributed text
- Apply structured output where the response format is predictable: use function calling or response schemas to guarantee parseable JSON rather than relying on prompt instructions
- Build a faithfulness evaluation step using an LLM judge (for example, the RAGAS faithfulness metric) to measure whether each answer is entailed by the retrieved context, not just plausible
- Integrate output guardrails as a separate validation layer: PII detection, topic restriction, and toxicity scoring run on the generated output before it reaches the caller
- Log every retrieval event: query, top-k chunks with scores, reranker output, final context window, generated answer, and latency breakdown by stage
- Instrument token usage and cost per call; set per-endpoint budget alerts so a runaway query pattern is caught before it materialises as a surprise invoice
- Run the eval harness against the production query log sample on a weekly cadence; track recall@5, faithfulness, and answer relevance as time-series metrics
- Monitor retrieval score distribution: a downward drift in mean cosine similarity between query and top-1 chunk often signals corpus staleness or embedding model drift
- Implement a human feedback loop: a thumbs-up/down signal on generated answers creates a labelled dataset for reranker fine-tuning and a leading indicator of quality degradation
Corpus preparation and chunking
Raw documents are cleaned, segmented into retrievable units, and enriched with metadata. Chunking strategy is the first load-bearing decision: the wrong choice cannot be fixed cheaply at retrieval time.
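As an illustration of this stage, here is a minimal sketch of recursive splitting plus a deterministic chunk-ID scheme, using only the standard library. The separator hierarchy, the `max_chars` limit and the `doc_id:hash` ID format are assumptions chosen for the example, not a prescribed configuration; a production pipeline would more likely count tokens than characters.

```python
import hashlib

# Coarse-to-fine separators; an assumed hierarchy for prose documents.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_recursive(text: str, max_chars: int, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator present, recursing into oversized pieces."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            # Greedily merge neighbouring parts while they fit in one chunk.
            candidate = f"{current}{sep}{part}" if current else part
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        out = []
        for chunk in chunks:
            # Pieces that are still too long are split with finer separators.
            out.extend(split_recursive(chunk, max_chars, separators[i + 1:]))
        return out
    # No separator applies: fall back to a hard character window.
    return [text[j:j + max_chars] for j in range(0, len(text), max_chars)]


def chunk_id(doc_id: str, chunk_text: str) -> str:
    """Same document ID and text always give the same ID, so unchanged
    chunks can be skipped when the index is updated incrementally."""
    digest = hashlib.sha256(f"{doc_id}\x00{chunk_text}".encode("utf-8")).hexdigest()
    return f"{doc_id}:{digest[:16]}"
```

Note that the separator at a chunk boundary is dropped in this sketch, so a boundary on `". "` loses the trailing full stop. Hashing the chunk text means an edited paragraph gets a new ID while its untouched neighbours keep theirs.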
Embedding and indexing
Chunks are encoded as dense vectors and written to a vector store with the right index configuration for the expected query volume and corpus size.
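One step from the checklist for this stage, fusing BM25 and dense results with reciprocal rank fusion, is small enough to sketch in full. The function name and the example chunk IDs are hypothetical; `k=60` is the constant commonly used for RRF and should be treated as a starting value to tune, not a requirement.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    so chunks ranked well by more than one retriever rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties on chunk ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c3", "c1", "c7"]   # hypothetical sparse (keyword) results
dense_hits = ["c1", "c4", "c3"]  # hypothetical dense (vector) results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF uses only ranks, never raw scores, which is why it can combine BM25 scores and cosine similarities without normalising either.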
Retrieval tuning and reranking
Initial ANN retrieval has high recall but imprecise ranking. A reranking stage uses a cross-encoder to re-score the top-k candidates with full query-document attention, substantially improving precision.
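To check whether a reranker earns its latency, the same eval set can be scored before and after reranking. The sketch below computes recall@k and MRR with the standard library; the data layout (a dict of query ID to ranked chunk IDs, and a dict of query ID to the set of relevant chunk IDs) is an assumption made for the example.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(run: dict[str, list[str]], gold: dict[str, set[str]], k: int = 5) -> dict[str, float]:
    """Average both metrics over every query in the annotated eval set."""
    n = len(gold)
    recall = sum(recall_at_k(run.get(q, []), rel, k) for q, rel in gold.items()) / n
    mrr = sum(reciprocal_rank(run.get(q, []), rel) for q, rel in gold.items()) / n
    return {f"recall@{k}": recall, "mrr": mrr}


# Hypothetical eval set and two retrieval runs over it.
gold = {"q1": {"a", "b"}, "q2": {"c"}}
before_rerank = {"q1": ["x", "a", "y", "z", "w", "b"], "q2": ["p", "q", "c"]}
after_rerank = {"q1": ["a", "b", "x"], "q2": ["c", "p"]}

baseline = evaluate(before_rerank, gold)
reranked = evaluate(after_rerank, gold)
```

Queries missing from a run count as zero rather than being skipped, so a retriever cannot improve its score by failing silently.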
Generation, prompt architecture, and output validation
Retrieved chunks become grounded context for the generation model. Prompt architecture determines whether the model uses that context faithfully or ignores it in favour of parametric memory.
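A rough sketch of the grounding and citation check for this stage follows. The prompt wording, the `[chunk-id]` citation format and the function names are all assumptions for illustration. The check shown only verifies that cited IDs exist in the retrieved set; confirming that a cited chunk actually supports the claim needs a separate entailment step.

```python
import re

# Hypothetical system prompt: role, authority boundary, context-only rule.
SYSTEM_PROMPT = (
    "You are a support assistant. Answer only from the context below. "
    "Cite the supporting chunk after each claim as [chunk-id]. "
    "If the context does not contain the answer, say 'I do not know'."
)

CITATION = re.compile(r"\[([A-Za-z0-9_.:-]+)\]")


def build_context(chunks: dict[str, str]) -> str:
    """Render retrieved chunks with their IDs so the model can cite them."""
    return "\n\n".join(f"[{cid}] {text}" for cid, text in chunks.items())


def unknown_citations(answer: str, chunks: dict[str, str]) -> list[str]:
    """Return cited chunk IDs that were never part of the provided context."""
    cited = CITATION.findall(answer)
    return sorted({cid for cid in cited if cid not in chunks})


retrieved = {
    "doc1:0": "Refunds are issued within 14 days.",
    "doc1:1": "Shipping takes 3 days.",
}
prompt = f"{SYSTEM_PROMPT}\n\nContext:\n{build_context(retrieved)}"
```

An answer that cites an ID outside the retrieved set is a strong signal of a fabricated source, and is cheap to reject before the response reaches the caller.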
Production observability and continuous evaluation
A RAG pipeline degrades as the corpus ages, the query distribution shifts, and the upstream embedding model is updated. Observability closes the loop so degradation is detected and corrected before users report it.
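The retrieval-score monitoring from the checklist can start as a comparison of two windows of logged top-1 similarities. This is a minimal sketch: the function name, the window contents and the `max_drop` threshold of 0.05 are assumptions, and a real alert would be tuned against the normal week-to-week variance of your own logs.

```python
from statistics import mean


def similarity_drift(baseline: list[float], recent: list[float], max_drop: float = 0.05) -> dict:
    """Compare mean top-1 cosine similarity between two logging windows.

    A sustained drop suggests that queries are landing further from the
    indexed content than they used to.
    """
    base_mean, recent_mean = mean(baseline), mean(recent)
    drop = base_mean - recent_mean
    return {
        "baseline_mean": base_mean,
        "recent_mean": recent_mean,
        "drop": drop,
        "alert": drop > max_drop,
    }


# Hypothetical top-1 scores taken from the retrieval log.
launch_week = [0.82, 0.80, 0.78]
this_week = [0.70, 0.72, 0.71]
report = similarity_drift(launch_week, this_week)
```

A drop in this number says that something changed, not what changed, so an alert should trigger a rerun of the eval harness rather than an automatic fix.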
How we'd choose
There's rarely one right answer — these are the trade-offs we weigh before recommending an approach.
Chunking strategy: fixed-size vs recursive vs semantic
Chunking is the single most impactful retrieval-layer decision and the one most often left at its default. The right strategy depends on the corpus structure and the nature of the queries the system must answer.
| Criterion | Fixed-size (character/token window) | Recursive character splitting | Semantic (sentence-boundary / topic-aware) |
|---|---|---|---|
| Best corpus fit | Homogeneous prose with consistent paragraph length; technical manuals, support articles | Mixed-structure documents where paragraph and section boundaries vary; the default for most corpora | Long narrative documents where topic shifts within a section; research papers, legal contracts |
| Retrieval precision | Low — chunk boundaries are arbitrary and frequently split mid-sentence or mid-argument | Medium — respects paragraph and sentence boundaries within the window; reduces semantic fragmentation | High — chunks align to semantic units; cross-encoder rerankers perform best on semantically coherent chunks |
| Context window efficiency | Predictable token count per chunk; easy to fit k chunks into a context budget | Mildly variable chunk size; requires token counting at retrieval time to manage context budget | High variance — some semantic segments are short (a list item), some are long (a multi-sentence argument); requires dynamic context assembly |
| Implementation complexity | Trivial — one parameter (chunk size) and one optional overlap window | Low — LangChain RecursiveCharacterTextSplitter with a hierarchy of separator tokens | Moderate to high — requires a sentence tokeniser, an optional embedding-based coherence metric, and tuning of the similarity threshold |
| Metadata preservation | Poor — page and section context must be injected as a prefix string into every chunk | Good — splitter can be configured to preserve paragraph-level context as metadata | Excellent — semantic segments naturally map to document sections; heading metadata attaches cleanly |
| Recommended starting point | No — only if corpus is extremely homogeneous and you need a zero-configuration baseline | Yes — use as the default; tune separator list and chunk size against recall@5 on your eval set | Use when recursive splitting gives recall@5 below 0.70 on long-form documents; adds latency to the indexing pipeline |
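The context-budget row above matters most when chunk sizes vary. A simple way to handle it is to pack ranked chunks into a fixed token budget at retrieval time, sketched below. The whitespace-based `count_tokens` is a deliberately crude stand-in and an assumption of this example; in practice you would pass in the tokenizer of the model you call.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude stand-in: whitespace word count, not a real model tokenizer."""
    return len(text.split())


def assemble_context(
    ranked_chunks: list[str],
    budget: int,
    count: Callable[[str], int] = count_tokens,
) -> list[str]:
    """Walk chunks in rank order and keep each one that still fits the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count(chunk)
        if used + cost > budget:
            continue  # too large for the remaining budget; a later, smaller chunk may fit
        selected.append(chunk)
        used += cost
    return selected


ranked = ["a b c d", "e f g h i j k", "l m"]  # hypothetical chunks, best first
context = assemble_context(ranked, budget=6)
```

Skipping an oversized chunk instead of stopping at it favours filling the budget; stopping at the first miss would favour strict rank order. Which is better depends on how much you trust the ranking.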
Anti-Patterns in LLM Product Engineering
These are the three failure modes most commonly found in LLM systems that were built fast and evaluated never.
Retrieval by vibes
The chunking strategy and embedding model are chosen by looking at two or three example retrievals and deciding they look reasonable. There is no eval set, no recall@k measurement, and no latency benchmark. The team discovers that retrieval is failing specific query classes only after users report incorrect answers in production.
Build a retrieval eval set before writing the generation prompt. A set of 50 queries sampled from the actual query distribution, each annotated with its relevant chunks, is sufficient to make chunking and embedding model decisions on evidence rather than intuition. Measure recall@5 as the primary metric; a value below 0.65 on this set predicts user-visible retrieval failures.
Prompt-layer guardrails
Safety instructions (do not reveal PII, do not answer off-topic questions, do not impersonate a human) are written into the system prompt. This is necessary but not sufficient. A prompt injection in user input, a jailbreak that prefixes the generation with an override instruction, or a context long enough to push the system prompt out of the model's effective attention range can all bypass prompt-layer controls entirely.
Implement guardrails as a separate, independently deployable service layer that validates both input and output. Input guardrails run before the generation call and detect prompt injection patterns, PII in the query, and out-of-scope topic classification. Output guardrails run on the completion before it reaches the caller. Both layers are version-controlled, tested against a red-team dataset, and monitored by trigger rate in production.
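The shape of such a layer can be sketched as two independent checks wrapped around the generation call. Everything here is illustrative: the substring list and the email regex are toy detectors standing in for real injection and PII classifiers, and `generate` is a hypothetical callable representing your model client.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Toy detectors; a production layer would use trained classifiers.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
INJECTION_HINTS = ("ignore previous instructions", "disregard the system prompt")
REFUSAL = "Sorry, I can't help with that request."


@dataclass
class Verdict:
    allowed: bool
    reasons: list[str] = field(default_factory=list)


def check_input(query: str) -> Verdict:
    """Runs before the generation call."""
    reasons = []
    if any(hint in query.lower() for hint in INJECTION_HINTS):
        reasons.append("possible_prompt_injection")
    if EMAIL.search(query):
        reasons.append("pii_in_query")
    return Verdict(not reasons, reasons)


def check_output(completion: str) -> Verdict:
    """Runs on the completion before it reaches the caller."""
    reasons = ["pii_in_output"] if EMAIL.search(completion) else []
    return Verdict(not reasons, reasons)


def guarded_call(query: str, generate: Callable[[str], str]) -> str:
    """Wrap any generation function with input and output validation."""
    if not check_input(query).allowed:
        return REFUSAL
    completion = generate(query)
    if not check_output(completion).allowed:
        return REFUSAL
    return completion
```

Because the checks never see the system prompt, they keep working when a prompt-layer control is bypassed, and the `reasons` codes give you the trigger-rate metric to monitor.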
No faithfulness baseline
The team measures answer quality by reading outputs and judging them qualitatively. There is no automated faithfulness metric — no measurement of whether the generated answer is actually entailed by the retrieved chunks. As a result, the system ships with an unknown hallucination rate, and the first signal of a quality problem comes from an escalation rather than a monitoring alert.
Integrate a faithfulness metric into the eval harness before the first internal release. RAGAS faithfulness, which uses an LLM judge to break the answer into individual claims and check whether each is supported by the retrieved context, provides a quantitative baseline. An NLI-based entailment classifier provides a cheaper, faster alternative suitable for high-frequency CI runs. Neither needs hand-written reference answers, only a sample of queries with their retrieved context and generated answers, so the investment is small relative to the cost of shipping a system with unmeasured hallucination behaviour.
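The structure of a faithfulness score is simple enough to sketch without any particular library. This is not RAGAS itself: it scores whole sentences, where RAGAS extracts claims, and the `overlap_judge` below is a toy token-overlap stand-in for the LLM judge or NLI classifier you would plug in. All names and the 0.8 threshold are assumptions for the example.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def faithfulness(answer: str, context: str, is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of answer sentences that the judge says the context supports."""
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0
    supported = sum(1 for s in sentences if is_supported(s, context))
    return supported / len(sentences)


def overlap_judge(sentence: str, context: str, threshold: float = 0.8) -> bool:
    """Toy judge: share of the sentence's words that also occur in the context.
    Replace with an LLM judge or an NLI entailment model in a real harness."""
    words = set(re.findall(r"\w+", sentence.lower()))
    context_words = set(re.findall(r"\w+", context.lower()))
    return bool(words) and len(words & context_words) / len(words) >= threshold


context = "Refunds are issued within 14 days of purchase. Shipping takes 3 days."
answer = "Refunds are issued within 14 days. Returns are free worldwide."
score = faithfulness(answer, context, overlap_judge)
```

Keeping the judge as a parameter lets the same harness run a cheap classifier on every CI build and a slower LLM judge before each release.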
Common questions
How do you prevent the assistant from making things up?
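Through retrieval grounding (the model is instructed to answer only from retrieved context, not from training memory), combined with explicit instructions to say 'I do not know' when the context does not contain the answer, citation generation tied to source chunks, and an automated eval suite that flags hallucinations on your test cases before each release. No single measure eliminates hallucination; together they reduce it and make what remains measurable.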
How much will it cost to run in production?
We model inference cost during the design phase: which model, what context window, what call frequency, whether caching is viable. You get a realistic cost estimate before we build, and we instrument the running cost so you can see the actual spend from day one.
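The arithmetic behind such an estimate is straightforward. The sketch below is illustrative only: the prices are placeholder numbers, not any provider's rates, and `cache_hit_rate` assumes a response cache in which a hit costs nothing, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # price per million input tokens (placeholder value)
    output_per_mtok: float  # price per million output tokens (placeholder value)


def cost_per_call(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one generation call given its token counts."""
    total = input_tokens * pricing.input_per_mtok + output_tokens * pricing.output_per_mtok
    return total / 1_000_000


def monthly_cost(
    calls_per_day: int,
    input_tokens: int,
    output_tokens: int,
    pricing: Pricing,
    cache_hit_rate: float = 0.0,
    days: int = 30,
) -> float:
    """Projected monthly spend; cached responses are assumed to be free."""
    billable_calls = calls_per_day * days * (1.0 - cache_hit_rate)
    return billable_calls * cost_per_call(input_tokens, output_tokens, pricing)


pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)
per_call = cost_per_call(input_tokens=4000, output_tokens=500, pricing=pricing)
per_month = monthly_cost(10_000, 4000, 500, pricing, cache_hit_rate=0.2)
```

In a RAG system the input side usually dominates, because every call carries the retrieved context, so top-k and chunk size are cost levers as well as quality levers.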
Have something in mind?
Tell us what you're building or stuck on. The first consultation is free — no obligation, no hard sell.