Computer Vision

Production vision systems that see, classify and extract — at the speed your workflow demands.

Computer vision works when the model has seen enough examples of your specific visual problem — not when it has been fine-tuned on a public benchmark that happens to share a few classes. We start with your images: labelling where needed, building a domain-appropriate training set, and choosing the architecture that balances accuracy and inference speed for the environment the model runs in.

Deployment target matters as much as model quality. We ship vision models as cloud APIs for batch workloads, as edge-optimised binaries for on-device inference, and as embedded components in web or mobile apps where the camera feed is the input. The pipeline includes preprocessing, confidence thresholding, failure logging and a feedback loop for ongoing improvement.

what's included

What this covers

Dataset preparation, labelling strategy and augmentation for your specific visual domain

Image classification: single-label, multi-label and hierarchical category structures

Object detection and instance segmentation (YOLO, Detectron2, custom fine-tuning)

OCR and document extraction: printed text, handwriting, tables and structured forms

Video analytics: motion detection, object tracking and frame-level event classification

Edge deployment optimisation: ONNX, TensorRT, CoreML and model quantisation

Confidence scoring, reject thresholds and a human-in-the-loop queue for low-confidence cases

what you get

Deliverables

  • A trained vision model deployed as a callable API or embedded in your application
  • Evaluation report: precision, recall, confusion matrix and examples of known failure cases
  • Training pipeline and labelled dataset in a format your team can extend
  • Inference service with confidence thresholding and failure logging in place at launch
tools & stack

What we build with

Python, PyTorch, TensorFlow, OpenCV, Ultralytics YOLO, Detectron2, Hugging Face Transformers, Tesseract, AWS Rekognition, ONNX, TensorRT, CoreML, Roboflow, Label Studio
what we mean

Applied Computer Vision Systems

Applied computer vision is the discipline of extracting structured information from images and video in a production system. The core tasks — classification, detection, segmentation, OCR, and tracking — each have well-understood architectures and benchmark datasets, but production performance is determined almost entirely by how well the training data represents the actual deployment distribution, not by architecture choice.

The deployment environment is the design constraint that drives every other decision. A model that runs in a cloud API can be large, slow, and memory-hungry; a model deployed on an edge device must fit within a memory budget, run on a CPU or fixed-function accelerator, and tolerate quantisation. These constraints determine the architecture family, the acceptable accuracy-size trade-off, and whether ONNX, TensorRT, or CoreML is the right export format.
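
As a rough illustration of how numeric precision interacts with a memory budget, the sketch below estimates the size of a model's weights at different precisions. The 25-million-parameter count is a hypothetical figure, and the estimate covers weights only: activations, runtime overhead and any layers left unquantised are not included.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(num_params, precision):
    """Approximate weight storage in megabytes (weights only, no activations)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


if __name__ == "__main__":
    params = 25_000_000  # hypothetical model size, for illustration only
    for precision in ("fp32", "fp16", "int8"):
        print(precision, weight_size_mb(params, precision), "MB")
```

An estimate like this is only a first filter; the real check is loading the exported model on the target device and measuring peak memory there.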

how we work

Vision System Development Lifecycle

Vision projects are data-intensive before they are compute-intensive. The quality of the labelled dataset determines the performance ceiling; no architecture choice compensates for a noisy or unrepresentative annotation set.

    01

    Dataset scoping and labelling strategy

    The labelling strategy determines both the cost of the annotation effort and the quality of the resulting model. Choices made here are expensive to reverse once annotation has begun at scale.

    • Audit the raw image corpus: resolution distribution, lighting and background variation, class frequency, and coverage of edge cases that matter in production
    • Define the annotation schema: label taxonomy, bounding box vs. polygon vs. mask vs. keypoint, and the level of label hierarchy required (e.g., vehicle vs. car vs. sedan vs. damaged sedan)
    • Establish inter-annotator agreement protocol: a minimum agreement threshold (Cohen's kappa for class labels, IoU for boxes and masks), a review and adjudication workflow, and a clear definition of ambiguous cases
    • Design the sampling strategy for annotation: prioritise under-represented classes and production-relevant edge cases over random sampling from the raw corpus
    • Set up Label Studio or Roboflow with the annotation schema, assign annotators, and run a pilot batch of 100 images to validate label consistency before scaling
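
The agreement checks mentioned in this step can be sketched in a few lines. This is a minimal illustration, assuming two annotators labelled the same items in the same order and that boxes are stored as `(x_min, y_min, x_max, y_max)`; the function names are ours, not part of Label Studio or Roboflow.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Running both functions on the pilot batch gives a concrete number to compare against whatever threshold the protocol sets before annotation scales up.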
    02

    Data augmentation and training split design

    Augmentation is the primary tool for improving generalisation on vision tasks. The augmentation policy must be designed for the specific deployment environment, not selected from a generic default.

    • Define an augmentation policy that reflects the natural variance in the deployment environment: rotation range, brightness and contrast jitter, blur, scale variation, and domain-specific transforms (e.g., simulated lens distortion for camera calibration tasks)
    • Validate that augmentations do not invalidate the label: a horizontal flip is usually safe for generic object detection but produces invalid samples for text recognition and for chirality-sensitive tasks such as distinguishing left-pointing from right-pointing arrows
    • Create a stratified train-validation-test split that preserves class frequency in each partition; for object detection tasks, split at the image level rather than the bounding-box level, and keep all images from the same scene or video in one partition so that near-duplicate frames do not appear in both train and test
    • Implement a hard-negative mining strategy for detection tasks: explicitly include images with high-confidence false positives from an initial training run as training negatives in subsequent runs
    • Document the augmentation policy in the dataset card alongside the annotation schema and split statistics
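
One way to keep the same scene out of both train and test is to assign whole scenes, rather than individual images, to a partition. The sketch below assumes each image carries a `scene_id` (a hypothetical field identifying the capture session or source video). It does not stratify by class, which would need to be layered on top.

```python
import random
from collections import defaultdict


def split_by_scene(samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign whole scenes to train/val/test so no scene spans two partitions.

    `samples` is a list of (image_id, scene_id) pairs.
    """
    by_scene = defaultdict(list)
    for image_id, scene_id in samples:
        by_scene[scene_id].append(image_id)

    # Sort before shuffling so the result depends only on the seed.
    scenes = sorted(by_scene)
    random.Random(seed).shuffle(scenes)

    n_val = max(1, round(len(scenes) * val_frac))
    n_test = max(1, round(len(scenes) * test_frac))
    parts = {
        "val": scenes[:n_val],
        "test": scenes[n_val:n_val + n_test],
        "train": scenes[n_val + n_test:],
    }
    return {name: [img for s in ids for img in by_scene[s]] for name, ids in parts.items()}
```

Fixing the seed and recording it in the dataset card makes the split reproducible when the dataset is extended later.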
    03

    Model selection and training

    Architecture and backbone selection determine the accuracy-latency operating point. The right starting point is the smallest pretrained model that achieves acceptable accuracy on your eval set, not the model with the highest benchmark score.

    • Select a pretrained backbone appropriate for the task and deployment constraint: EfficientNet or MobileNet for edge-constrained classification; YOLOv8 or RT-DETR for real-time detection; Mask R-CNN or SAM for segmentation tasks that require instance-level masks
    • Fine-tune from the pretrained checkpoint rather than training from scratch; freeze the early backbone layers initially and unfreeze progressively if the domain is visually different from ImageNet
    • Monitor training with validation mAP@0.5:0.95 for detection tasks, per-class precision and recall, and a confusion matrix; detect overfitting by tracking the train-validation mAP gap across epochs
    • Run a second training pass with hard negative mining if the first-pass false positive rate on the validation set is above the production SLO
    • For detection tasks, calibrate the confidence threshold on the validation set to achieve the required precision-recall operating point rather than using the model default
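
Calibrating the confidence threshold can be sketched as a sweep over validation detections. This is an illustrative helper, not a library function: it assumes each detection has already been matched to ground truth and marked as a true or false positive, and it returns the lowest threshold that still meets a required precision, which is the qualifying threshold with the highest recall.

```python
def pick_threshold(scored, total_positives, min_precision):
    """Return the lowest score threshold whose precision meets the target.

    `scored` is a list of (confidence, is_true_positive) pairs for validation
    detections; `total_positives` is the number of ground-truth objects.
    Returns (threshold, precision, recall), or None if no threshold qualifies.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    best = None
    tp = fp = 0
    for i, (conf, is_tp) in enumerate(ranked):
        tp += int(is_tp)
        fp += int(not is_tp)
        # Evaluate only at the last detection of a run of tied confidences.
        if i + 1 < len(ranked) and ranked[i + 1][0] == conf:
            continue
        precision = tp / (tp + fp)
        if precision >= min_precision:
            # Later iterations have lower thresholds, so this keeps the lowest.
            best = (conf, precision, tp / total_positives)
    return best
```

For example, with a required precision of 0.75, the helper walks down the ranked detections and stops updating once precision falls below the target for all remaining thresholds.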
    04

    Edge optimisation and deployment packaging

    A model that achieves target mAP in PyTorch may not meet the latency SLO on the target hardware. Optimisation is an engineering step, not an afterthought, and must be validated against the accuracy benchmark after each transformation.

    • Export the trained model to ONNX as the portable intermediate representation; validate that ONNX output matches PyTorch output within floating-point tolerance on a reference input batch
    • Apply TensorRT optimisation for NVIDIA GPU targets: FP16 precision typically delivers a 1.5-2x throughput gain with less than 1 point of mAP degradation; INT8 calibration with a representative dataset can reach 3-4x at a cost of 1-3 mAP points; treat these figures as rough guides and confirm them on the target hardware
    • Apply post-training quantisation for CPU and mobile targets: CoreML quantisation for iOS deployment; ONNX Runtime with INT8 quantisation for edge Linux targets; measure latency on the target device class, not a development machine
    • Package the model with a preprocessing pipeline that normalises input images identically to the training pipeline; a mismatch in mean/std normalisation or channel order produces a systematic accuracy degradation that is easy to introduce and hard to debug
    • Define the confidence threshold, NMS parameters, and reject-low-confidence routing logic as configuration, not hard-coded values, so they can be tuned in production without redeployment
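
The last point can be sketched as a small postprocessing module whose parameters come from a JSON document rather than from constants in code. The field names, default values and detection format here are assumptions made for illustration; a real service would map them to whatever its config system and model output look like.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PostprocessConfig:
    # Placeholder defaults for illustration; tune on the validation set.
    conf_threshold: float = 0.25
    nms_iou_threshold: float = 0.5
    max_detections: int = 100


def load_config(text):
    """Build the config from a JSON string; unknown keys raise TypeError."""
    return PostprocessConfig(**json.loads(text))


def _iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(detections, cfg):
    """Filter by confidence, then apply greedy per-class NMS.

    Each detection is a dict with 'box' (x1, y1, x2, y2), 'score' and 'label'.
    """
    candidates = sorted(
        (d for d in detections if d["score"] >= cfg.conf_threshold),
        key=lambda d: d["score"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        if len(kept) >= cfg.max_detections:
            break
        suppressed = any(
            k["label"] == det["label"]
            and _iou(k["box"], det["box"]) > cfg.nms_iou_threshold
            for k in kept
        )
        if not suppressed:
            kept.append(det)
    return kept
```

Because the thresholds live in data, changing the operating point means shipping a new config document, not a new build.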
decision guides

How we'd choose

There's rarely one right answer — these are the trade-offs we weigh before recommending an approach.

Cloud vision API vs pretrained fine-tune vs train from scratch

The starting point for a vision system determines the data requirement, the accuracy ceiling, and the operational surface area. Most projects belong in the fine-tune regime; the other two have specific conditions that justify them.

Options compared: Cloud vision API (AWS Rekognition, Google Vision AI); Pretrained backbone + fine-tune (YOLO, EfficientNet, ViT); Train from scratch (custom architecture, random init)

Domain specificity
  • Cloud vision API: General-purpose tasks only: face detection, generic object labels, document OCR, explicit content detection; API classes are fixed and cannot be customised
  • Pretrained backbone + fine-tune: Broad: any visual domain where a pretrained backbone provides useful low-level feature representations; handles specialised industrial, medical, and document tasks with domain-specific fine-tuning
  • Train from scratch: Justified only when the visual domain is so different from natural images that pretrained weights provide no benefit; rare in practice; satellite imagery, microscopy, and specialised sensor modalities are candidate cases

Labelled data requirement
  • Cloud vision API: None; the API handles everything and no training data is required
  • Pretrained backbone + fine-tune: Hundreds to a few thousand labelled examples for classification; thousands for object detection (more images required for rarer classes and fine-grained distinctions)
  • Train from scratch: Tens of thousands to millions of labelled examples; without this volume the model memorises the training set rather than learning generalisable features

Accuracy on custom classes
  • Cloud vision API: Limited by API taxonomy; cannot detect a class the API was not trained on; zero-shot classification via CLIP can extend this but with lower accuracy than a fine-tuned model
  • Pretrained backbone + fine-tune: High on domain-specific classes when fine-tuned on sufficient representative data; mAP@0.5 above 0.85 is achievable for well-defined detection tasks with clean labels
  • Train from scratch: Potentially the highest ceiling if the dataset is large enough, but rarely justified given the training cost and the strong baseline that pretrained fine-tuning provides

Edge / offline deployment
  • Cloud vision API: Impossible; the API requires network access; not viable for offline, latency-sensitive, or data-residency-constrained deployments
  • Pretrained backbone + fine-tune: Fully supported; export to ONNX, TensorRT, CoreML, or TFLite for edge deployment; model size and latency are controllable via backbone choice and quantisation
  • Train from scratch: Fully supported, but the custom architecture must be explicitly designed for edge constraints from the start; retrofitting an arbitrary architecture to a mobile target is expensive

Time to first working system
  • Cloud vision API: Hours; an API key and a few lines of SDK code; the right choice for proof-of-concept or tasks that genuinely match the API taxonomy
  • Pretrained backbone + fine-tune: Days to weeks depending on labelling effort; a pilot model trained on 200 images per class can be running within a week of receiving the raw data
  • Train from scratch: Weeks to months; dataset collection, architecture design, training infrastructure, and debugging from random initialisation all add time before the first useful baseline exists

Recommended default
  • Cloud vision API: Yes for generic tasks; use as a fast proof-of-concept or where the task exactly matches an available API class; not a production path for custom visual domains
  • Pretrained backbone + fine-tune: Yes for most custom vision projects; the correct starting point before either of the other two options is considered
  • Train from scratch: No; only after demonstrating that pretrained fine-tuning cannot reach the required accuracy, which is rare
what we avoid

Anti-Patterns in Computer Vision Projects

Training on a public dataset, deploying on a different domain

A model is trained on a benchmark dataset (COCO, ImageNet, Open Images) and deployed on production images without any domain-specific fine-tuning. The model achieves strong benchmark metrics but poor production accuracy because the production images differ systematically in lighting, resolution, camera angle, object scale, or background complexity. The gap is often invisible in evaluation because the benchmark eval set is used rather than a sample from the production distribution.

Collect a representative sample of production images before model development begins — a minimum of 200 images per class, covering the lighting and background conditions the model will actually encounter. Annotate this sample and use it as the primary validation set throughout development. Fine-tune the pretrained model on production-domain images rather than expecting benchmark performance to transfer. Measure mAP separately on the benchmark validation set and the production domain sample to quantify the domain gap.

Hard-coded confidence threshold

The inference service returns all detections above a fixed threshold of 0.5 without any mechanism to adjust it in production. In the deployment environment the false positive rate is above the acceptable SLO, but changing the threshold requires a code deployment rather than a configuration change. Additionally, the threshold was never evaluated against the precision-recall trade-off on the production domain: 0.5 was inherited from a tutorial and never questioned.

Expose the confidence threshold as a configuration parameter that can be changed without redeployment. Evaluate the full precision-recall curve on the production validation set and select the operating threshold based on the business cost of false positives versus false negatives. For high-stakes decisions, implement a three-zone routing strategy: predictions above the high threshold go to the automated path, predictions below the low threshold go to a human queue, and predictions between the two thresholds go to a sampling review queue.
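
The three-zone routing described above might look like the following minimal sketch. The zone names and the choice to treat a score exactly equal to the high threshold as automated are assumptions; both thresholds are passed in so they can come from configuration.

```python
def route(confidence, low, high):
    """Route one prediction into one of three zones based on its confidence."""
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("expected 0 <= low < high <= 1")
    if confidence >= high:
        return "automated"        # confident enough to act on without review
    if confidence < low:
        return "human_queue"      # every such case is sent to a person
    return "sampling_review"      # middle band, reviewed on a sampled basis


if __name__ == "__main__":
    for score in (0.95, 0.60, 0.20):
        print(score, route(score, low=0.4, high=0.8))
```

The two thresholds would normally be read from the same configuration source as the detection threshold, so all three can be adjusted together after reviewing the precision-recall curve.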

Ignoring inference latency until deployment

Model development and evaluation happen entirely on a GPU workstation. The model achieves the required mAP but inference takes 450ms per image. At deployment time the target environment is a CPU-only edge device with 2GB RAM, and the latency SLO is 100ms. Architectural changes at this stage require retraining, which extends the project by weeks and negates the work already done on the heavier model.

Define the latency SLO and the target inference hardware at the start of the project, before architecture selection. If the target is a CPU-only edge device, the architecture family is constrained to MobileNet, EfficientNet-Lite, or a YOLO variant with an appropriate backbone size. Prototype the full inference pipeline — including preprocessing, model forward pass, and postprocessing — on the target hardware class early in the project, even with an undertrained model, to validate that the latency budget is achievable before investing in full training.
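
An early latency check on the target hardware does not need a trained model, only a callable that runs the whole pipeline. The harness below is a sketch under that assumption: `pipeline` is a hypothetical function wrapping preprocessing, the forward pass and postprocessing, and the warm-up and run counts are arbitrary starting values.

```python
import statistics
import time


def measure_latency_ms(pipeline, sample, warmup=5, runs=50):
    """Time `pipeline(sample)` end to end; return p50, p95 and max in ms."""
    # Warm-up runs let caches and lazy initialisation settle before timing.
    for _ in range(warmup):
        pipeline(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        pipeline(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    # n=20 yields 19 cut points at 5% steps; the last one is the 95th percentile.
    cuts = statistics.quantiles(timings, n=20)
    return {"p50": statistics.median(timings), "p95": cuts[18], "max": max(timings)}


def meets_slo(stats, slo_ms):
    """Compare the measured p95 latency against the SLO in milliseconds."""
    return stats["p95"] <= slo_ms
```

Comparing p95 rather than the mean against the SLO is a design choice here; a project whose SLO is defined on a different percentile would change the comparison accordingly.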

good to know

Common questions

We do not have labelled images — can you still build a vision model for us?

Yes. We scope the labelling effort as part of the engagement: we can run labelling in-house for small datasets or set up Label Studio with a review workflow so your domain experts label and validate. For some problems, transfer learning from a similar public dataset significantly reduces the labelling burden.

The model needs to run on a device, not in the cloud — is that feasible?

Absolutely. We design for the target environment from the start: quantised models for CPU-only devices, CoreML for iOS, TensorRT for NVIDIA edge hardware. Inference latency and model size are part of the acceptance criteria, not an afterthought.

Have something in mind?

Tell us what you're building or stuck on. The first consultation is free — no obligation, no hard sell.