How I Reached 99.03% AUROC on OOD Detection with Conditional Diffusion Models

My master's thesis at JKU Linz introduced a class-conditional separation loss into conditional diffusion models used as generative classifiers. It reached 99.03% +/- 0.07% AUROC on CIFAR-10, a gain of 6.5 percentage points over the non-separated baseline with far lower variance across seeds.

14 min read
Diffusion Models, OOD Detection, Deep Learning, PyTorch, CIFAR-10, Generative Models

The Problem

Out-of-distribution detection is the part of a system that says, "this input does not belong to what I was trained on." In practice that means catching unusual inputs before a model makes a confident wrong decision. For any safety-relevant AI pipeline, that ability matters as much as raw accuracy.

In my master's thesis at JKU Linz, supervised by Prof. Sepp Hochreiter and Claus Hofmann, I studied whether a conditional diffusion model can act as a generative classifier for OOD detection instead of just generating images.

The Core Idea

The model reconstructs an image under two competing class conditions. If the image is truly in-distribution, the matching condition should reconstruct it better. If the image is unusual, both explanations should struggle and the reconstruction gap becomes the anomaly signal.
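The reconstruction gap described above can be sketched in a few lines. This is a minimal illustration, assuming the two reconstructions are already available as flattened pixel lists. The function names `mse` and `reconstruction_gap` are hypothetical and not taken from the thesis code, and the post does not specify how the gap is turned into the final OOD score (sign, normalization, averaging over timesteps).

```python
from typing import Sequence


def mse(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean squared error between two flattened images of equal length."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def reconstruction_gap(
    image: Sequence[float],
    recon_cond_a: Sequence[float],
    recon_cond_b: Sequence[float],
) -> float:
    """Error under condition B minus error under condition A.

    A large positive value means condition A explains the image much
    better than condition B. A value near zero means neither condition
    is clearly better.
    """
    return mse(image, recon_cond_b) - mse(image, recon_cond_a)


# Toy flattened "images" (hypothetical values, not thesis data).
image = [0.0, 1.0, 0.0, 1.0]
good_recon = [0.0, 0.9, 0.1, 1.0]  # close to the input
poor_recon = [0.5, 0.5, 0.5, 0.5]  # far from the input

# One condition reconstructs well, the other poorly: a clear gap.
gap_clear = reconstruction_gap(image, good_recon, poor_recon)
# Both conditions struggle equally: no gap.
gap_ambiguous = reconstruction_gap(image, poor_recon, poor_recon)
```

In this toy case the clear gap is far larger than the ambiguous one, which is the contrast the detector relies on.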

The baseline already worked, but it had a frustrating weakness: it was highly seed-sensitive. At $\lambda = 0.0$, the average AUROC was 92.52% +/- 11.07%, which means some seeds looked excellent and some collapsed badly.

My Contribution: Separation Loss

I introduced a class-conditional separation loss that pushes the conditional noise predictions apart during training:

loss = L_diffusion + lambda * L_separation

The point is simple: if the two explanations become more distinct, the reconstruction-error difference becomes clearer. That makes the OOD score easier to trust.
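To make the combined objective concrete, here is one possible form of it, written with plain Python floats so it runs without a deep learning framework. The separation term used here (the negative mean squared distance between the two conditional noise predictions) is an assumption for illustration. The post only states that the term pushes the predictions apart, and the thesis may define it differently, for example with a bounded or margin-based form. The names `mean_squared_diff` and `total_loss` are hypothetical.

```python
from typing import Sequence


def mean_squared_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two equally long vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(
    true_noise: Sequence[float],
    pred_matching: Sequence[float],
    pred_other: Sequence[float],
    lam: float = 0.02,
) -> float:
    """loss = L_diffusion + lambda * L_separation (assumed form)."""
    # DDPM-style term: the noise prediction under the correct
    # condition should match the noise that was actually added.
    l_diffusion = mean_squared_diff(pred_matching, true_noise)
    # ASSUMED separation term: negative distance between the two
    # conditional predictions, so minimizing it pushes them apart.
    l_separation = -mean_squared_diff(pred_matching, pred_other)
    return l_diffusion + lam * l_separation


# Toy vectors (hypothetical values, not thesis data).
true_noise = [1.0, -1.0]
pred_matching = [0.8, -0.8]

# Both conditions predict the same noise: no separation reward.
loss_collapsed = total_loss(true_noise, pred_matching, [0.8, -0.8])
# The other condition predicts something different: loss goes down.
loss_separated = total_loss(true_noise, pred_matching, [-0.2, 0.2])
```

With lambda = 0.02 the separation term stays a small correction to the diffusion loss, which fits its role as a regularizer and not the main training signal.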

Results

The best setting was $\lambda = 0.02$. Averaged across three independent seeds, it reached:

  • **99.03% +/- 0.07% AUROC** on CIFAR-10
  • **+6.5 percentage points** over the non-separated baseline
  • much lower variance than the baseline
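For readers less familiar with the metric, AUROC is the probability that a randomly chosen OOD sample receives a higher anomaly score than a randomly chosen in-distribution sample. The sketch below computes it by direct pairwise comparison. It assumes the convention that a higher score means "more OOD", and the scores are made-up toy values, not thesis data. In practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
from typing import Sequence


def auroc(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Fraction of (OOD, ID) pairs where the OOD score is higher.

    Ties count as half a win. Assumes higher score = more anomalous.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for ood in ood_scores:
        for ind in id_scores:
            if ood > ind:
                wins += 1.0
            elif ood == ind:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Toy anomaly scores (hypothetical values).
id_scores = [0.1, 0.2, 0.4]
ood_scores = [0.3, 0.8, 0.9]

# 8 of the 9 (OOD, ID) pairs are ordered correctly.
toy_auroc = auroc(id_scores, ood_scores)
# Perfectly separated scores give an AUROC of exactly 1.0.
perfect_auroc = auroc([0.1, 0.2], [0.8, 0.9])
```

The +6.5 percentage point figure is the difference between the two averaged AUROC values: 99.03 - 92.52 = 6.51.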

For a concrete reproducible run, seed 42 achieved:

  • **98.98% AUROC** on the within-CIFAR split
  • **90.50%-96.97%** zero-shot generalization across CIFAR-100, Places365, FashionMNIST, Textures, and SVHN

Why I Care About This Result

The score going up is only part of the story. What matters more is that the variance collapsed. Moving from a fragile 92.52% +/- 11.07% to a stable 99.03% +/- 0.07% is the difference between "interesting research result" and "plausible building block for a real safety system."

Cross-Domain Reality Check

I also transferred the same idea to industrial print-quality control on the public FTI_Zer0P benchmark. There, the crop-based YOLO + CDM baseline reached 0.8673 +/- 0.0230 AUROC under strict 5-fold cross-validation, while separation loss did not significantly improve performance after Holm correction.
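The Holm correction mentioned above is a step-down procedure for multiple comparisons. A minimal sketch follows. The p-values are hypothetical and chosen only to show how a result can look significant on its own yet fail after correction. They are not the values from the thesis, and the function name `holm_reject` is illustrative.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm-Bonferroni step-down procedure.

    Returns one reject flag per hypothesis, in the input order.
    """
    m = len(p_values)
    # Indices of the hypotheses, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # against alpha / (m - k + 1), i.e. alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


# Hypothetical p-values for three comparisons.
# 0.03 and 0.04 are below 0.05 individually, but the smallest one
# must clear 0.05 / 3, so nothing survives the correction.
none_survive = holm_reject([0.04, 0.03, 0.20])
# Here the smallest p-value clears 0.05 / 3, the next one fails 0.05 / 2.
one_survives = holm_reject([0.01, 0.04, 0.03])
```

The first case mirrors the situation in the text: an apparent improvement that does not hold up once several comparisons are accounted for.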

That was a valuable result too. It showed that the mechanism transfers strongly in a semantic image domain such as CIFAR-10, but not automatically to small, texture-heavy manufacturing data. Knowing where a method stops helping is part of doing honest research.

Stack and Artifacts

  • PyTorch + PyTorch Lightning
  • DDPM U-Net with class conditioning
  • Hydra + Weights & Biases
  • JKU GPU infrastructure

Public artifacts:

  • Thesis PDF: https://ahmed-3m.github.io/Mohammed_Ahmed_Thesis_Diffusion_OOD_Detection.pdf
  • Code: https://github.com/ahmed-3m/DiffusionOOD
  • Industrial transfer: https://github.com/ahmed-3m/InkjetOOD

Frequently Asked Questions

What result did the thesis achieve on CIFAR-10?

The best averaged result was 99.03% +/- 0.07% AUROC across three seeds. Seed 42 reached 98.98% AUROC on the within-CIFAR split and generalized zero-shot to five external OOD benchmarks.

What is the separation loss?

It is an extra training term that pushes the two class-conditional noise predictions apart, making the reconstruction-error gap more discriminative and much more stable across seeds.

Why does this matter?

It turns a seed-sensitive generative OOD detector into a much more reliable one, which matters if the method is meant to become a real safety layer instead of a one-off experiment.