# How I Hit 99.03% AUROC on OOD Detection Using Conditional Diffusion Models
## What I Built and Why It Matters
Out-of-distribution (OOD) detection is the ability of a neural network to say "I don't know" when it sees something it was never trained on. It's a critical safety property for any production AI system — from autonomous vehicles to medical imaging to industrial quality control. Without it, models silently produce confident wrong answers.
My Master's thesis at Johannes Kepler University Linz (under Prof. Sepp Hochreiter, inventor of LSTM) achieved 99.03% AUROC on CIFAR-10 OOD detection — a gain of +18.8 percentage points over the baseline — by introducing a class-conditional separation loss into conditional diffusion models used as generative classifiers.
## The Core Problem With Existing OOD Methods
Existing OOD detectors generally fall into two camps:
- **Discriminative methods** (MSP, ODIN, energy score): Train a classifier, then use the output confidence or energy as an OOD score. These are fast but fundamentally limited — a classifier trained to distinguish cats from dogs has no incentive to be uncertain about pictures of cars.
- **Reconstruction-based methods** (VAE, flow models): Train a generative model on in-distribution data; OOD samples should reconstruct poorly. In practice, these models often reconstruct OOD inputs nearly as well as in-distribution ones because they generalize too well.
Diffusion models are the latest generation of generative models, achieving state-of-the-art image synthesis. But could they be repurposed as generative classifiers for OOD detection?
## The Approach: Conditional Diffusion Models as Generative Classifiers
A conditional diffusion model learns the class-conditional distribution p(x | y) for each class y in the training set. Given a test sample x, you can compute the likelihood under each class and pick the highest — that's generative classification.
For OOD detection: if a test sample x has low likelihood under all class conditionals, it's out-of-distribution. The aggregated score across classes becomes the OOD detector.
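The scoring loop above can be sketched in a few lines. This is an illustrative sketch, not the thesis code: the `eps_model(x_t, t, y)` signature, the timestep subset, and the use of the noise-prediction MSE as a likelihood surrogate are my assumptions (the MSE is a standard ELBO-based proxy for -log p(x | y) in DDPMs).

```python
import torch

@torch.no_grad()
def class_conditional_scores(eps_model, x, num_classes, alphas_bar, t_steps):
    """Approximate -log p(x | y) for each class y by the average
    noise-prediction MSE over a subset of diffusion timesteps.
    `eps_model(x_t, t, y)` is a hypothetical conditional noise predictor."""
    scores = torch.zeros(num_classes)
    for y in range(num_classes):
        y_vec = torch.full((x.shape[0],), y, dtype=torch.long)
        mse_sum = 0.0
        for t in t_steps:
            a_bar = alphas_bar[t]
            noise = torch.randn_like(x)
            # forward diffusion q(x_t | x_0): x_t = sqrt(a_bar)*x + sqrt(1-a_bar)*eps
            x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * noise
            t_vec = torch.full((x.shape[0],), t, dtype=torch.long)
            mse_sum += torch.mean((eps_model(x_t, t_vec, y_vec) - noise) ** 2).item()
        scores[y] = mse_sum / len(t_steps)
    return scores  # classify with argmin; aggregate for OOD detection

def ood_score(scores):
    # High error under EVERY class conditional -> likely out-of-distribution
    return scores.min().item()
```

Thresholding `ood_score` then gives the detector: in-distribution samples score low under at least one class, OOD samples score high under all of them.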
### Why This Works
Unlike discriminative classifiers, a generative classifier must model the actual data distribution. An OOD input like a photo of a truck (when trained on cats and dogs) will have genuinely low density under p(x | cat) and p(x | dog) — the model can't fake familiarity.
### The Baseline Problem
The baseline conditional diffusion model (without my contribution) achieved 80.25% AUROC. This is decent but not production-grade. The issue: the diffusion model has no explicit pressure to keep class-conditional distributions separate. The features learned for each class can overlap significantly in embedding space, making the score boundaries ambiguous.
## My Key Contribution: Class-Conditional Separation Loss
I introduced a class-conditional separation loss λ·L_sep that explicitly encourages the model's intermediate representations to cluster by class and push different classes apart, analogous to contrastive learning but applied to the generative process.
The total training objective becomes:
```
loss = L_diffusion + λ * L_separation

# L_diffusion:  standard DDPM noise prediction loss
# L_separation: pushes class embeddings apart in feature space
# λ:            separation loss weight (ablated over [0.0, 0.001, 0.01, 0.02, 0.05, 0.1])
```
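A minimal sketch of one plausible form of such a separation term (the pull/push structure, the `margin` hinge, and all names here are my assumptions; the thesis defines the exact formulation): same-class features are pulled toward their class mean, and class means are pushed at least `margin` apart.

```python
import torch
import torch.nn.functional as F

def separation_loss(features, labels, margin=1.0):
    """Contrastive-style class separation on intermediate features.
    features: (N, D) bottleneck activations; labels: (N,) class indices."""
    classes = labels.unique()
    means = []
    loss_pull = 0.0
    for c in classes:
        fc = features[labels == c]
        mu = fc.mean(dim=0)
        means.append(mu)
        # pull: mean squared distance of class members to their class mean
        loss_pull = loss_pull + ((fc - mu) ** 2).sum(dim=1).mean()
    means = torch.stack(means)
    n = len(classes)
    loss_push = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # push: penalize class means that are closer than `margin`
            d = (means[i] - means[j]).norm()
            loss_push = loss_push + F.relu(margin - d) ** 2
    pairs = max(n * (n - 1) // 2, 1)
    return loss_pull / n + loss_push / pairs

# total objective, following the formula above:
# loss = l_diffusion + lam * separation_loss(bottleneck_feats, y)
```

With well-separated clusters the push term vanishes and only the small pull term remains, so the loss does not fight the diffusion objective once classes are apart.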
## Ablation Results on CIFAR-10
| λ (separation weight) | AUROC  | FPR@95TPR |
|-----------------------|--------|-----------|
| 0.0 (baseline)        | 80.25% | —         |
| 0.001                 | 97.32% | —         |
| 0.01                  | 98.69% | —         |
| 0.02 (best)           | 99.11% | —         |
| 0.05                  | 98.51% | —         |
| 0.1                   | 96.67% | —         |
The sweet spot is λ = 0.02, which peaks at 99.11% AUROC. The thesis-reported 99.03% AUROC is the average over 5 OOD test datasets (SVHN, LSUN, iSUN, Textures, Places365) with CIFAR-10 as the in-distribution set.
The gain is not marginal — going from 80% to 99% AUROC is the difference between "interesting research" and "production deployable."
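Both reported metrics can be computed from raw scores in plain Python, assuming the convention used above: higher score means more OOD, and OOD is the positive class. This is a self-contained sketch, equivalent to the usual library implementations.

```python
def auroc(ood_scores, id_scores):
    """AUROC via its pairwise interpretation: the probability that a random
    OOD sample scores higher than a random in-distribution sample
    (ties count as 0.5)."""
    wins = sum(1.0 if o > i else 0.5 if o == i else 0.0
               for o in ood_scores for i in id_scores)
    return wins / (len(ood_scores) * len(id_scores))

def fpr_at_95_tpr(ood_scores, id_scores):
    """FPR at the threshold that keeps 95% of OOD samples detected:
    the fraction of in-distribution samples falsely flagged as OOD."""
    # pick the threshold at the 5th percentile of OOD scores,
    # so ~95% of OOD samples lie at or above it
    thr = sorted(ood_scores)[int(0.05 * len(ood_scores))]
    return sum(1 for s in id_scores if s >= thr) / len(id_scores)
```

A perfect detector gives AUROC = 1.0 and FPR@95TPR = 0.0; a random one gives roughly 0.5 and 0.95.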
## Applying It to Industrial Quality Control (PROFACTOR GmbH)
After validating on CIFAR-10, I applied the same framework to real industrial data from PROFACTOR GmbH's Zer0P project (funded by the Government of Upper Austria). The task: detect defects in inkjet-printed building components using multi-head conditioning.
Industrial datasets are fundamentally different:
- Much smaller scale (~1,327 images vs 60,000 for CIFAR-10)
- 8 distinct feature types with wildly different defect rates
- 5-fold cross-validation required (no test-set leakage)
The industrial baseline AUROC was 86.73% ± 2.3% (5-fold CV). Separation loss improved it to 87.3% ± 2.1% — a small, statistically non-significant improvement. This is itself an important research finding: the separation loss effect is domain-dependent. CIFAR-10 (large, balanced, well-separated classes) benefits enormously. A small industrial dataset with noisy labels and few BAD samples per class barely moves.
This cross-domain analysis became a key thesis contribution — not just "we improved X" but "we understand when and why the method works."
## Implementation Stack
- **Framework:** PyTorch + PyTorch Lightning
- **Architecture:** DDPM U-Net with class conditioning (concatenation + cross-attention)
- **Experiment management:** Hydra config system + Weights & Biases
- **Training hardware:** JKU GPU cluster (RTX A6000)
- **Evaluation:** AUROC, FPR@95TPR, statistical significance testing (Holm correction)
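The two conditioning paths named in the architecture bullet (channel concatenation plus cross-attention) can be sketched as one block. All names, shapes, and hyperparameters here are illustrative assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn

class ClassConditionedBlock(nn.Module):
    """Hypothetical U-Net sub-block with two conditioning paths:
    (a) the class embedding is concatenated to the feature channels,
    (b) spatial positions cross-attend to the class embedding."""
    def __init__(self, channels, num_classes, emb_dim=64, heads=4):
        super().__init__()
        self.emb = nn.Embedding(num_classes, emb_dim)
        self.proj = nn.Conv2d(channels + emb_dim, channels, kernel_size=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.to_kv = nn.Linear(emb_dim, channels)

    def forward(self, h, y):
        b, c, H, W = h.shape
        e = self.emb(y)                                   # (B, emb_dim)
        # (a) concatenation: broadcast the embedding over spatial dims
        e_map = e[:, :, None, None].expand(b, -1, H, W)
        h = self.proj(torch.cat([h, e_map], dim=1))
        # (b) cross-attention: each pixel attends to the class token
        q = h.flatten(2).transpose(1, 2)                  # (B, H*W, C)
        kv = self.to_kv(e)[:, None, :]                    # (B, 1, C)
        out, _ = self.attn(q, kv, kv)
        return h + out.transpose(1, 2).reshape(b, c, H, W)
```

The residual add at the end keeps the block shape-preserving, so it can drop into any U-Net stage.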
Key engineering decisions:
- Monte Carlo sampling (K=100 trials) for robust score estimation
- Separation loss applied at the bottleneck feature layer, not the output
- Multi-head conditioning for the industrial case: one head per feature type
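The Monte Carlo decision matters because a single-draw denoising error is a noisy estimate (it depends on the sampled noise). A sketch of the averaging, under standard DDPM assumptions (the `eps_model(x_t, t, y)` signature is hypothetical):

```python
import torch

@torch.no_grad()
def mc_score(eps_model, x, y, t, alphas_bar, k=100):
    """Average the stochastic noise-prediction error over k independent
    noise draws to stabilize the per-class score (K=100 in the experiments)."""
    a_bar = alphas_bar[t]
    t_vec = torch.full((x.shape[0],), t, dtype=torch.long)
    total = 0.0
    for _ in range(k):
        noise = torch.randn_like(x)
        # forward diffusion reparameterization q(x_t | x_0)
        x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * noise
        total += torch.mean((eps_model(x_t, t_vec, y) - noise) ** 2).item()
    return total / k
```

Averaging over K draws shrinks the estimator's variance by roughly a factor of K, at the cost of K forward passes per class.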
## What This Means for Production Systems
If you're building a production ML system today, OOD detection is non-negotiable:
- **Silent failure is the worst kind**: A confident wrong answer is far more dangerous than "I don't know"
- **Generative classifiers are interpretable**: You can visualize what the model thinks "normal" looks like per class
- **The separation loss is cheap**: It adds ~5% training overhead while delivering an 18.8 pp AUROC improvement
The framework is open-sourced at [github.com/ahmed-3m/OOD-diffusion-detector](https://github.com/ahmed-3m/OOD-diffusion-detector). Full thesis PDF available [here](https://ahmed-3m.github.io/Mohammed_Ahmed_Thesis_Diffusion_OOD_Detection.pdf).
## Supervisor
This work was completed under the supervision of Prof. Sepp Hochreiter (inventor of LSTM, head of the JKU Institute for Machine Learning) and Claus Hofmann (research assistant, JKU). I'm grateful for the research environment and compute access that made this possible.
Reach me at ahmed.mo.0595@gmail.com or [LinkedIn](https://www.linkedin.com/in/ahmed-3m/) if you're working on OOD detection or want to discuss the approach.
## Frequently Asked Questions
### What AUROC did Ahmed Mohammed achieve on CIFAR-10 OOD detection?
99.03% AUROC on CIFAR-10, a +18.8 percentage point improvement over the 80.25% baseline, using a class-conditional separation loss in conditional diffusion models.
### What is class-conditional separation loss in diffusion models?
A training objective that explicitly pushes class-conditional representations apart in feature space during diffusion model training. It is added as a weighted term to the standard DDPM noise prediction loss. At λ=0.02, it improves OOD AUROC from 80.25% to 99.11% on CIFAR-10.
### Who supervised Ahmed Mohammed's Master's thesis at JKU Linz?
Prof. Sepp Hochreiter (inventor of LSTM, head of the JKU Institute for Machine Learning) and Claus Hofmann (research assistant, JKU Linz).