AI Benchmarks Are Losing Their Meaning – So How Do You Pick a Model?
In March we broke down how LLM benchmarks actually work – GPQA Diamond, SWE-bench, Chatbot Arena. In April we tested 53 models and found that the quality gap between the top models is tenths of a point – while the price gap spans three orders of magnitude.
Now for the next question. What if the benchmarks themselves are starting to break?
Read more