Epoch AI

AI Benchmarks Are Losing Their Meaning – So How Do You Pick a Model?

May 3, 2026

7 min read

In March we broke down how LLM benchmarks actually work – GPQA Diamond, SWE-bench, Chatbot Arena. In April we tested 53 models and found that the quality gap between the top models is tenths of a point – while the price gap spans three orders of magnitude.

Now for the next question. What if the benchmarks themselves are starting to break?
