
YOLOv8 for Industrial Quality Control: Architecture Decisions That Moved the Needle

10 min read
Computer Vision, YOLOv8, Deep Learning, Industrial AI, PyTorch, Production ML

## Not a Tutorial — A Post-Mortem

Most YOLO tutorials show you how to train on COCO and reach a respectable mAP in 20 lines of code. That's not industrial quality control.

In a real factory, you have class imbalance (roughly 3:1 GOOD:BAD), image acquisition noise from production lighting, about 1,300 images in total (not COCO's ~118,000), and a 50 ms inference budget. Here's what actually moved the needle at PROFACTOR GmbH.

## What We Used YOLO For (Not Bounding Boxes)

The first unconventional decision: we did not use YOLOv8 as a detector. We used it as a feature extractor.

The 8 feature types in inkjet-printed building components (edge quality, dot density, distance measurements, angular precision) don't naturally decompose into bounding boxes. The whole component image is the input; the output is a single GOOD/BAD classification per feature type.

YOLOv8's backbone was pre-trained on general visual patterns and then fine-tuned on our component geometry. We extracted the bottleneck features (after layer 9 of the backbone, before the detection head) and fed them into a downstream diffusion model. Details in the [companion post on diffusion models](/blog/diffusion-models-anomaly-detection).

## Data Engineering: The Real Work

With 1,327 components across 8 feature types:

Class distribution problem: 70-80% GOOD, 20-30% BAD depending on feature type. The angle feature had only 2–4 BAD samples per fold — we marked it as unreliable and excluded it from performance comparisons.

What worked for augmentation:

- Random brightness/contrast shifts (±20%): simulates production lighting variation
- Horizontal/vertical flips: components are orientation-invariant
- CutMix between GOOD samples: forces feature robustness
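To make the photometric and flip augmentations concrete, here is a minimal pure-Python sketch. It assumes grayscale images stored as nested lists of floats in [0, 1], and it interprets "±20%" as a multiplicative factor drawn from [0.8, 1.2]; the function names are hypothetical and this is not the project's actual pipeline. CutMix is left out for brevity.

```python
import random

def photometric_jitter(image, rng, max_shift=0.2):
    """Scale brightness and contrast by random factors in [1 - max_shift, 1 + max_shift].

    `image` is a list of rows of grayscale floats in [0, 1] (an assumption of this sketch).
    """
    brightness = 1.0 + rng.uniform(-max_shift, max_shift)
    contrast = 1.0 + rng.uniform(-max_shift, max_shift)
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    out = []
    for row in image:
        new_row = []
        for p in row:
            # Contrast stretches values around the mean, brightness scales the result.
            value = ((p - mean) * contrast + mean) * brightness
            new_row.append(min(1.0, max(0.0, value)))  # clamp to the valid range
        out.append(new_row)
    return out

def random_flip(image, rng):
    """Apply a horizontal and a vertical flip, each with probability 0.5."""
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]  # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1]  # vertical flip
    return image

def augment(image, rng):
    """Photometric jitter followed by random flips; geometry is otherwise untouched."""
    return random_flip(photometric_jitter(image, rng), rng)
```

Note that the sketch deliberately contains no rotation or scaling, in line with the point that component geometry is measurement-critical.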

What didn't work:

- Synthetic defect generation (GANs): the synthetic defects didn't match real failure modes
- Heavy geometric augmentation: component geometry is measurement-critical; rotating a distance feature changes its meaning

Cross-validation setup: 5-fold stratified CV (seed=42), stratified on GOOD/BAD ratio per feature type. No data leakage between folds. This is non-negotiable when N=1,327.
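The fold assignment can be sketched in a few lines of standard-library Python. This is an illustrative version, not the code used in the project: `stratified_folds` is a hypothetical helper, and in practice a library routine such as scikit-learn's `StratifiedKFold` does the same job. To stratify per feature type, the label passed in could be a `(feature_type, "GOOD"/"BAD")` tuple.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=42):
    """Return one fold index per sample, spreading each label evenly across folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fold_of = [None] * len(labels)
    offset = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            # The running offset keeps leftover samples from piling into fold 0.
            fold_of[idx] = (pos + offset) % n_folds
        offset += len(indices)
    return fold_of
```

Because every sample receives exactly one fold index, the validation sets are disjoint by construction. If several images show the same physical component, grouping by component before assigning folds would be needed as well, which this sketch does not do.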

## Training Details That Matter

```python
# Key hyperparameters that moved the needle
config = {
    'lr': 1e-4,                    # lower than default: small dataset
    'batch_size': 16,              # constrained by GPU memory on edge device
    'epochs': 150,                 # with early stopping patience=15
    'warmup_epochs': 5,
    'lr_scheduler': 'cosine',
    'weight_decay': 1e-4,
    'freeze_backbone_epochs': 10,  # freeze backbone, train head first
    'unfreeze_lr_factor': 0.1,     # 10x lower LR for backbone after unfreeze
}
```

The freeze-then-unfreeze strategy was critical. With only ~1,300 images, fine-tuning the full backbone from the start causes catastrophic forgetting of the pretrained low-level features. Instead, freeze the backbone for the first 10 epochs while the head trains, then unfreeze it at a 10× lower LR.
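One way to read the schedule is as a function from epoch to a pair of learning rates, one for the head and one for the backbone. The sketch below assumes linear warmup and a cosine decay towards zero over the remaining epochs; the exact warmup shape and final LR are not stated in this post, so treat those details (and the `learning_rates` helper itself) as assumptions.

```python
import math

CONFIG = {
    "lr": 1e-4,
    "epochs": 150,
    "warmup_epochs": 5,
    "freeze_backbone_epochs": 10,
    "unfreeze_lr_factor": 0.1,
}

def learning_rates(epoch, cfg=CONFIG):
    """Return (head_lr, backbone_lr) for a 0-indexed epoch."""
    if epoch < cfg["warmup_epochs"]:
        # Linear warmup up to the base learning rate (assumed shape).
        head_lr = cfg["lr"] * (epoch + 1) / cfg["warmup_epochs"]
    else:
        # Cosine decay from the base learning rate towards zero (assumed floor).
        span = cfg["epochs"] - cfg["warmup_epochs"]
        progress = (epoch - cfg["warmup_epochs"]) / span
        head_lr = 0.5 * cfg["lr"] * (1.0 + math.cos(math.pi * progress))

    if epoch < cfg["freeze_backbone_epochs"]:
        backbone_lr = 0.0  # backbone frozen: no updates
    else:
        backbone_lr = head_lr * cfg["unfreeze_lr_factor"]
    return head_lr, backbone_lr
```

In a PyTorch training loop, these two values would typically be written into two optimizer parameter groups at the start of each epoch, with `requires_grad` switched off on the backbone while it is frozen.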

## Evaluation: What AUROC Tells You vs What It Doesn't

We reported AUROC as the primary metric because it's threshold-independent. The 98.4% accuracy figure is a threshold-dependent metric calibrated at FPR=5% — meaning we accept up to 5% false positives (GOOD called BAD) in exchange for the highest possible defect recall.
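Calibrating a threshold at a fixed false-positive rate can be sketched as follows. The assumption here is that a higher score means "more likely BAD" and that a part is flagged when its score is strictly above the threshold; both function names are hypothetical.

```python
import math

def threshold_at_fpr(good_scores, max_fpr=0.05):
    """Lowest threshold that flags at most `max_fpr` of the GOOD scores.

    A sample is flagged as BAD when its score is strictly above the threshold.
    """
    ranked = sorted(good_scores, reverse=True)
    # Number of GOOD samples we are allowed to flag (small epsilon guards float error).
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    return ranked[allowed]

def recall_and_fpr(good_scores, bad_scores, threshold):
    """Defect recall and false-positive rate at a given threshold."""
    fpr = sum(s > threshold for s in good_scores) / len(good_scores)
    recall = sum(s > threshold for s in bad_scores) / len(bad_scores)
    return recall, fpr
```

The threshold should be chosen on validation GOOD samples only and then applied unchanged to the held-out fold; running the same function once per feature type yields per-feature thresholds.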

Per-feature AUROC was more informative:

- dots: 0.956 (reliable, many samples, clear defect signature)
- dist6: 0.936 (reliable)
- edge3: 0.744 (hard problem, high variance across folds)
- angle: 0.817 ± 0.138 (std!) (unreliable, too few BAD samples)
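For readers who want to reproduce this kind of per-feature summary, here is a small sketch of AUROC computed directly from its pairwise definition, plus a mean and standard deviation across folds. It is quadratic in the number of samples, which is fine at this dataset size; whether the reported ± value was a sample or population standard deviation is not stated, so the use of `statistics.stdev` is an assumption.

```python
from statistics import mean, stdev

def auroc(good_scores, bad_scores):
    """Probability that a random BAD sample outscores a random GOOD one (ties count half)."""
    wins = 0.0
    for b in bad_scores:
        for g in good_scores:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bad_scores) * len(good_scores))

def summarize_folds(fold_aurocs):
    """Mean and sample standard deviation of per-fold AUROC values."""
    return mean(fold_aurocs), stdev(fold_aurocs)
```

With only 2 to 4 BAD samples in a fold, each BAD sample moves the fold's AUROC by a large step, which is one way to see why a feature like angle ends up with such a wide spread.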

Reporting a single overall accuracy hides this heterogeneity. In a real deployment, you'd set different thresholds per feature type.

## Production: From 35ms Research to 35ms Production

The model that ran in research (FP32, batch inference on an A100) and the model we targeted for production (INT8, streaming on a Jetson AGX) had the same architecture but different weight precision.

INT8 quantization with TensorRT calibration:

- Calibration set: 100 GOOD samples, balanced across feature types
- Accuracy loss: <0.3% AUROC (acceptable)
- Latency improvement: 35ms → 18ms (2× speedup)
- Memory reduction: 4× smaller model footprint
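Selecting a balanced calibration set is independent of the quantization toolkit, so it can be illustrated without any TensorRT calls. The sketch below assumes samples are available as `(sample_id, feature_type, label)` tuples, which is a made-up format for illustration, and draws GOOD samples round-robin across feature types.

```python
import random
from collections import defaultdict

def pick_calibration_set(samples, total=100, seed=0):
    """Pick `total` GOOD sample ids, balanced across feature types.

    `samples` is a list of (sample_id, feature_type, label) tuples (hypothetical format).
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for sample_id, feature_type, label in samples:
        if label == "GOOD":
            pools[feature_type].append(sample_id)
    for pool in pools.values():
        rng.shuffle(pool)

    chosen = []
    # Round-robin over feature types so no type contributes more than one extra sample.
    while len(chosen) < total and any(pools.values()):
        for feature_type in sorted(pools):
            if pools[feature_type] and len(chosen) < total:
                chosen.append(pools[feature_type].pop())
    return chosen
```

With 8 feature types and a budget of 100, this yields 12 or 13 samples per type. The chosen images would then be fed to the calibrator of whichever quantization tool is in use.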

We ultimately ran at FP16 in production (not INT8) because INT8 showed occasional instability on specific edge/angle features. FP16 at 35ms was within the 50ms budget with margin.
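Checking a latency budget is more meaningful on a tail percentile than on the mean. The following is a generic sketch under stated assumptions: `measure_ms`, `percentile`, and `within_budget` are hypothetical helpers, the 99th percentile is an arbitrary choice, and `fn` stands in for a single-image inference call.

```python
import math
import time

def measure_ms(fn, n_runs=200, n_warmup=20):
    """Time `fn` in milliseconds after a few warmup calls.

    For GPU inference, `fn` must block until the result is ready
    (for example by synchronizing the device), or the timings will be too low.
    """
    for _ in range(n_warmup):
        fn()
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ranked = sorted(values)
    rank = max(1, math.ceil(q * len(ranked) / 100))
    return ranked[rank - 1]

def within_budget(latencies_ms, budget_ms=50.0, q=99):
    """True if the q-th percentile latency fits inside the budget."""
    return percentile(latencies_ms, q) <= budget_ms
```

A typical call would be `within_budget(measure_ms(run_one_image))`, where `run_one_image` is whatever wraps preprocessing, inference, and postprocessing for one component.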

## What I'd Do Differently

  1. **Collect more BAD samples aggressively from day one** — the angle feature's unreliability was entirely a data problem, not a model problem
  2. **Use Monte Carlo Dropout as a confidence signal** in addition to the diffusion score — gives a second, orthogonal uncertainty estimate
  3. **Log everything from the start** — Weights & Biases was added mid-project; early experiment results were lost in local files
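To illustrate the Monte Carlo Dropout idea from item 2, here is a toy sketch on a single logistic unit in pure Python. It is not a YOLOv8 or diffusion-model implementation: the weights, the dropout rate of 0.2, and the function names are all made up for illustration. The technique itself is simply to keep dropout active at inference, run several stochastic passes, and treat the spread of the predictions as an uncertainty signal.

```python
import math
import random
from statistics import mean, pstdev

def forward_with_dropout(features, weights, bias, rng, p_drop=0.2):
    """One stochastic pass of a toy logistic head with dropout kept active."""
    keep = 1.0 - p_drop
    total = bias
    for x, w in zip(features, weights):
        if rng.random() < keep:
            total += (x / keep) * w  # inverted-dropout scaling of the kept inputs
    return 1.0 / (1.0 + math.exp(-total))

def mc_dropout_predict(features, weights, bias, n_passes=50, seed=0):
    """Mean prediction and its spread over `n_passes` stochastic forward passes."""
    rng = random.Random(seed)
    probs = [forward_with_dropout(features, weights, bias, rng) for _ in range(n_passes)]
    return mean(probs), pstdev(probs)
```

A large spread on a given component would be a reason to route it to manual inspection rather than trust the GOOD/BAD call.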

The full technical report is published: [Diffusion-Based Multi-class Defect Detection PDF](https://ahmed-3m.github.io/Diffusion-Based%20Multi-class%20Defect%20Detection.pdf).

## Frequently Asked Questions

Why use YOLO as a feature extractor instead of a detector?

The 8 feature types in inkjet-printed building components (edge quality, dot density, distance measurements, angular precision) do not naturally decompose into bounding boxes. The whole component image is the input; the output is a single GOOD/BAD classification per feature type.

What inference latency did the production system achieve?

~35ms per component using FP16 precision on an NVIDIA Jetson AGX edge device. INT8 quantization achieved 18ms but showed occasional instability on specific edge/angle features, so FP16 was used in production.

How was the model evaluated with only 1,327 images?

5-fold stratified cross-validation (seed=42), stratified on GOOD/BAD ratio per feature type. No data leakage between folds. Per-feature AUROC was reported to capture heterogeneity across feature types.