AI Benchmarks Are Losing Their Meaning – So How Do You Pick a Model?

7 min read
Stanislav Belyaev
Stanislav Belyaev Engineering Leader at Microsoft
AI Benchmarks Are Losing Their Meaning – So How Do You Pick a Model?

In March we broke down how LLM benchmarks actually work – GPQA Diamond, SWE-bench, Chatbot Arena. In April we tested 53 models and found that the quality gap between the top models is tenths of a point – while the price gap spans three orders of magnitude.

Now for the next question. What if the benchmarks themselves are starting to break?

On May 1, 2026, Epoch AI – the organization behind several of the industry’s key benchmarks – published a discussion with a telling title: “Are AI Benchmarks Doomed?”. Three researchers – Anson Ho, Greg Burnham, and Tom Adamczewski – walked through why tests now saturate faster than anyone can build them, and what to do about it.

Let’s look at their argument through the lens of what actually matters to a manager.

The problem: benchmarks saturate faster than they can be built

A new benchmark used to last for years. MMLU, created in 2020, stayed relevant until 2024. Today the picture looks very different.

GPQA Diamond – a test where even experts with internet access get it wrong 60% of the time – lasted two years. By 2026 standards, that’s exceptionally long. OpenAI’s GDPVal, which cost millions of dollars to build, is already nearly saturated. When reasoning models (o1) arrived in late 2025, they closed out math benchmarks that were supposed to last for years – in a single leap.

The pattern is clear: the faster models improve, the shorter their tests live.

The gap between scores and business value

This is Epoch AI’s central observation, and it lines up with our own data – as well as with their own research on how shallow AI adoption really is: 62% of users only apply these models to one or two tasks. Saturating GPQA Diamond didn’t translate into a proportional economic payoff. Models scoring 90%+ on expert tests didn’t become twice as useful for everyday management work.

The reason is that benchmarks measure the “self-contained” part of the job. Answering a hard physics question is one thing. Fitting that answer into the context of a project, accounting for political constraints, aligning three stakeholders, and packaging it in a format your finance committee will accept – that’s something else entirely.

In our test of 53 models we saw exactly the same thing: the difference between a $0.17-per-request model and a $0.002 one was 0.24 points on a five-point scale. Benchmarks show a gap. Real-world tasks don’t.

The gap between score and real value is obvious from examples. The harder part is developing a practical feel for what a model can actually do for you. You can’t read that off a table – it comes from running real tasks.

Try 9 management tasks for free. Your results will tell you more about a model than any benchmark.

No payment required • Get notified on launch

Join Waitlist

Three categories of evaluation – and why it matters

Tom Adamczewski offered a useful way to classify how you can evaluate models in the first place:

CategoryHow it worksExampleThe catch
Machine gradingAn algorithm compares the answer against a referenceMMLU, FrontierMathSaturates easily – models learn to “game the test”
LLM-as-judgeAnother model scores the answer against a rubricOur test of 54 modelsDepends on the quality of the judge
Human evaluationPeople rate the outputChatbot Arena, Remote Labor IndexExpensive and slow

The takeaway for a manager: the closer the evaluation gets to real work, the more expensive and slower it becomes – but also the more useful. Automated tests give you a number. Human evaluation gives you understanding.

That’s exactly why Chatbot Arena – a ranking where real people compare answers blind – remains the most credible. It’s the closest thing to how you actually pick a tool: “which answer helped me more?”

What’s replacing the classic benchmarks

Epoch AI highlights several directions worth watching.

Scalable families of tasks

Instead of a fixed set of questions – tasks with adjustable difficulty. One example is MirrorCode, a joint project by Epoch AI and METR. A model has to reproduce a program while seeing only its behavior. The difficulty scales from 100 lines of code to 100,000+. The best models burned billions of tokens reimplementing Apple Pkl (16,000 lines of C) – and still didn’t finish the task completely.

The analogy for a manager is clear: it’s like testing an employee not on their knowledge of theory, but on their ability to deliver a project of growing complexity.

Real work instead of tests

Scale AI’s Remote Labor Index takes ~100 real jobs from Upwork and checks whether the AI’s output would satisfy an actual client. This benchmark isn’t saturated yet – because “satisfying a client” wraps in a thousand things you can’t formalize.

Existing infrastructure

Instead of building new tests, you can use evaluation systems that already work: academic conferences (submit an AI-written paper for peer review), literary competitions, professional certifications. A model that earns a positive review at NeurIPS proves more than any automated test.

Choosing AI by results instead of rankings is a skill. 9 tasks on real models, free, in 30 minutes.

No payment required • Get notified on launch

Join Waitlist

What this means for choosing an AI tool

If benchmarks are losing their predictive power, how should a manager actually decide?

The most direct approach is to test on your team’s specific tasks, not abstract ones. Three to five typical scenarios run through two or three models will tell you more than a table with twenty benchmarks. We described this approach in detail in our study of 54 models. If a budget model seems to be falling short, check first: it may not be the model at all, but the quality of your prompt – structured instructions often make up for a smaller model.

Look at “quality per dollar” rather than the absolute score. Kimi K2.5 delivers 99% of GPT-5.2 Pro’s quality at 1.4% of the price. No benchmark will show you that – only a direct comparison on your own tasks.

It helps to split tasks by difficulty. An 80/20 strategy – routine work on a budget model, critical tasks on premium – cuts costs by 79% while losing 11% of quality. What counts as “routine” in your context is a question only you can answer.

Of all the evaluation systems, Chatbot Arena and the Remote Labor Index sit closest to real-world use. Arena shows human preference; the Remote Labor Index shows client satisfaction.

Benchmarks aren’t dying – they’re growing up

Epoch AI’s conclusion isn’t that benchmarks are useless. They still capture capability transitions – the moment a model “learns” something new. But their role is shifting: from the single criterion for choosing to one signal among many.

For a manager, that means the end of a convenient illusion. You can’t just glance at a table and say, “this model is better, we’ll take it.” You have to understand what exactly you’re measuring, why, and how the result maps to your tasks.

The difference between “the model scored 92% on GPQA” and “the model saved our team 12 hours a week” is the difference between a benchmark and reality. The skill of translating the first into the second is one of the most important a manager can have in 2026.

Specialisation

From rankings to real results

The MySummit course: a Foundation track on critical thinking with AI, plus a manager track. Learn to evaluate models by tasks, not benchmarks.

От pre-mortem до антикризисного плана
Переиспользуемые промпт-шаблоны
Сквозной кейс на реальном проекте
~300 часов экономии в год
Stanislav Belyaev

Stanislav Belyaev

Engineering Leader at Microsoft

18 years leading engineering teams. Founder of mysummit.school. 700+ graduates at Yandex Practicum and Stratoplan.