Why '98% Accuracy' Is Meaningless Without Context

Accuracy is the most quoted number in AI and one of the least useful on its own. A vendor or a founder says "98% accuracy" and a room full of smart people nods. The number sounds like a conclusion. It is not. It is a starting point for four questions, and the answers usually change the picture completely.

1. Accurate on what data?

The first question is what dataset produced the number. A model evaluated on its own curated test set will look far better than the same model on messy production data. We assessed a computer vision startup claiming 95 percent accuracy for manufacturing quality control. That figure held on their internal test set. It did not describe what would happen on the factory floor.

2. Which kind of error?

Accuracy collapses two very different mistakes into one number: false positives (good products flagged as defective) and false negatives (defective products passed as good). A quality control model that correctly passes 98 percent of good products but lets defective ones through is dangerous, even at high overall accuracy. When we asked that same startup specifically about false negatives, the rate was closer to 18 percent. For their use case, the false negative rate was the only number that mattered, and it was nowhere on the slide.
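
To make the arithmetic concrete, here is a minimal Python sketch of how one accuracy figure can sit on top of two very different error rates. The counts are hypothetical, chosen only so that they reproduce a 95 percent accuracy and an 18 percent false negative rate; they are not the startup's data, and the class and field names are ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary defect detector ("positive" = defective)."""

    true_pos: int   # defective items correctly flagged
    false_neg: int  # defective items passed as good
    false_pos: int  # good items wrongly flagged
    true_neg: int   # good items correctly passed

    @property
    def total(self) -> int:
        return self.true_pos + self.false_neg + self.false_pos + self.true_neg

    @property
    def accuracy(self) -> float:
        # Share of all items classified correctly, regardless of error type.
        return (self.true_pos + self.true_neg) / self.total

    @property
    def false_negative_rate(self) -> float:
        # Share of defective items that slip through.
        return self.false_neg / (self.true_pos + self.false_neg)

    @property
    def false_positive_rate(self) -> float:
        # Share of good items that are wrongly rejected.
        return self.false_pos / (self.false_pos + self.true_neg)


# Hypothetical evaluation set: 1,000 items, 100 of them defective.
counts = ConfusionCounts(true_pos=82, false_neg=18, false_pos=32, true_neg=868)

print(f"accuracy:            {counts.accuracy:.1%}")
print(f"false negative rate: {counts.false_negative_rate:.1%}")
print(f"false positive rate: {counts.false_positive_rate:.1%}")
```

With these illustrative counts the headline reads 95 percent, yet nearly one in five defective items is passed as good. Asking for the four underlying counts, not the summary figure, is what exposes the gap.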

3. Accurate against what baseline?

A score of 98 percent sounds excellent until you learn that 97 percent of the cases belong to the easy default class. If the data is imbalanced, a model can hit high accuracy by always guessing the majority class and learning nothing useful. In that setting the model's real contribution is one percentage point, not 98. Accuracy without the base rate is theatre.
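
The baseline check takes only a few lines. The sketch below is illustrative: the labels are a made-up dataset with a 97 percent majority class, and the function name is our own. It computes what a model would score by always predicting the most common label, then compares a claimed accuracy against that floor.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical, imbalanced dataset: 970 good items, 30 defective ones.
labels = ["ok"] * 970 + ["defect"] * 30

baseline = majority_baseline_accuracy(labels)
claimed_accuracy = 0.98  # the headline number being evaluated
lift = claimed_accuracy - baseline

print(f"majority-class baseline: {baseline:.1%}")
print(f"claimed accuracy:        {claimed_accuracy:.1%}")
print(f"lift over baseline:      {lift:.1%}")
```

On this made-up data the always-guess-"ok" baseline already scores 97 percent, so the claimed 98 percent is a lift of one point. That lift, not the headline, is the number worth discussing.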

4. Accurate under what conditions?

Models are accurate within the conditions they were trained for and unpredictable outside them. A transcription model trained on clean English will report high accuracy and then fall apart on Arabic-English code-switching or background noise. The headline number describes the lab. Production is not the lab.
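
One way to test a headline figure against conditions is to break the evaluation results down by slice. The following sketch assumes each evaluated example has been tagged with a condition label. The slice names and counts are invented for illustration and do not come from any real transcription model.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Return {condition: accuracy} from (condition, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {condition: correct[condition] / totals[condition] for condition in totals}


def make_records(condition, n_correct, n_wrong):
    """Build illustrative evaluation records for one condition."""
    return [(condition, True)] * n_correct + [(condition, False)] * n_wrong


# Hypothetical evaluation set, dominated by the condition the model was built for.
records = (
    make_records("clean_english", 930, 10)
    + make_records("code_switched", 18, 12)
    + make_records("noisy_audio", 20, 10)
)

overall = sum(is_correct for _, is_correct in records) / len(records)
per_slice = accuracy_by_slice(records)

print(f"overall: {overall:.1%}")
for condition, accuracy in per_slice.items():
    print(f"{condition}: {accuracy:.1%}")
```

In this invented example the overall figure stays near 97 percent only because the hard slices are a small share of the test set. If production traffic is mostly code-switched or noisy audio, the per-slice numbers are the ones that predict what you will see.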

The better questions

Replace "what is your accuracy" with: what dataset was this measured on, what are your false negative and false positive rates, reported separately, what is the base rate in that data, and how does performance hold up on inputs that look like ours. A serious team has these answers ready. A team that only has the headline number is telling you their evaluation stopped at the number that looked best.

The discipline here is not cynicism. It is refusing to let a single figure stand in for the questions it is quietly hiding. In due diligence, the claim is never the finding. The finding is what survives the follow-up questions.

If you are weighing an AI investment, acquisition, vendor selection, or training programme, our team is happy to start with a conversation about scope and approach.


The views and findings in this article are shared for general information only. They are high-level perspectives, not legal, financial, regulatory, or other professional advice, and should not be relied upon for any specific decision or circumstance. For guidance tailored to your situation, please consult a qualified adviser.