Imagine giving a robot a multiple-choice exam, only to realise it already knows all the answers. When a major vendor such as OpenAI, Anthropic or Google releases a new version of an AI model, the announcement usually comes with a series of claims about how well the model performs on benchmark tests. A common example is the Massive Multitask Language Understanding (MMLU) benchmark, a set of around 15,000 multiple-choice questions in subjects ranging from accounting to physics and beyond. Vendors lean heavily on such benchmarks to demonstrate progress and to show how smart their latest models are. Other well-known AI benchmarks include ARC, BIG-bench and SuperGLUE, and between them the various tests cover everything from language understanding to mathematics, from general knowledge to safety, and from logic puzzles to software coding problems.
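To make concrete what "running a benchmark" actually involves, here is a minimal sketch of how an MMLU-style multiple-choice test is typically scored: each question is formatted into a prompt, the model's single-letter answer is compared with the correct option, and an overall accuracy is reported. The `ask_model` function and the sample item are placeholders invented for illustration, not part of any real benchmark harness.

```python
def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return a single letter A-D."""
    raise NotImplementedError("wire this up to the model API of your choice")

ITEMS = [
    {
        "question": "Which law states that current is proportional to voltage?",
        "choices": ["Ohm's law", "Hooke's law", "Boyle's law", "Faraday's law"],
        "answer": 0,  # index into `choices`
    },
    # ...the real MMLU test set contains roughly 15,000 such items
]

def format_prompt(item: dict) -> str:
    letters = "ABCD"
    lines = [item["question"]]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(item["choices"])]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score(items: list[dict]) -> float:
    """Fraction of items the model answers correctly."""
    correct = 0
    for item in items:
        reply = ask_model(format_prompt(item)).strip().upper()[:1]
        if reply == "ABCD"[item["answer"]]:
            correct += 1
    return correct / len(items)
```

A headline benchmark score is ultimately just this sort of accuracy figure, computed over thousands of such items.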
This is all well and good, but who decides on the benchmarks, and how reliable and up to date are they? Troublingly, some of the most commonly used benchmarks appear to have serious problems. One study found that many benchmarks are several years old, and that some contain incorrect information.
Considerable attention is paid to AI model benchmarks, with regularly updated league tables available on platforms such as HuggingFace, Artificial Analysis, LLM Stats and Vellum, among several other comprehensive leaderboard sites.
In November 2025, a study of some 440 such AI benchmarks was published and reported on in The Guardian newspaper. The study was carried out by the UK government's AI Security Institute, with help from researchers at universities including Oxford, Stanford and Berkeley. Just 16% of the benchmarks examined used statistical tests or uncertainty estimates to indicate how reliable their results are likely to be. The report warned that benchmark scores might be “irrelevant or even misleading”, with weaknesses found across almost all of the benchmarks examined.
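To illustrate why the missing statistics matter, here is a small sketch of the kind of uncertainty estimate the study found lacking: a 95% confidence interval around a benchmark accuracy, computed with the standard normal approximation to the binomial. The model names, scores and the 500-question benchmark are invented purely for illustration.

```python
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% confidence interval for an observed accuracy (normal approximation)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

# Two hypothetical models evaluated on the same invented 500-question benchmark:
for name, n_correct in [("model_a", 435), ("model_b", 428)]:
    low, high = accuracy_ci(n_correct, 500)
    print(f"{name}: {n_correct / 500:.1%}  (95% CI {low:.1%} to {high:.1%})")
```

On a few hundred questions the interval spans several percentage points, so the two intervals here overlap heavily and the apparent 1.4-point lead of the first model may well be statistical noise rather than a real capability difference.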
There are no industry standards for AI benchmarks and little or no regulation, and the sheer pace of AI development makes it difficult to produce credible benchmarks that stand the test of time and can be widely agreed upon. Consequently, we are living in a kind of wild west of benchmark claims, where results may be dubious or outright gamed. Vendors can cherry-pick the benchmarks on which their models perform best, feed the test questions into their models during training to ensure high scores, or even design their own benchmarks that they know their models will shine at. In one case, LLM performance dropped sharply when questions were rephrased or new test questions were devised, a sign that answers had been memorised rather than understood. Unlike university exams, there are no examination halls with supervisors to check that the LLMs are not cheating on the tests.
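One way to probe for this kind of gaming is to compare a model's score on the original questions with its score on paraphrased versions of the same questions; a sharp drop suggests memorisation rather than genuine capability. The sketch below is illustrative only: `evaluate` is a placeholder for a real evaluation harness, and the five-point threshold is an arbitrary choice.

```python
def evaluate(model: str, items: list[dict]) -> float:
    """Placeholder: run `model` on `items` and return its accuracy in [0, 1]."""
    raise NotImplementedError("plug in a real evaluation harness here")

def looks_contaminated(model: str, original: list[dict], paraphrased: list[dict]) -> bool:
    """Flag a sharp accuracy drop when the same questions are merely reworded."""
    drop = evaluate(model, original) - evaluate(model, paraphrased)
    return drop > 0.05  # five-point drop: arbitrary threshold for this sketch
```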
As the technology develops, new benchmarks are needed. Some AI models are now multimodal, meaning that they can understand not just text but also audio and video, and the benchmark tests need to evolve to reflect such changes. It is a similar story as AIs are given “agentic” properties, i.e. linked together and given access to resources so that they can act. As this area develops, new ways to test its effectiveness will need to be found; indeed, some benchmarks for agentic AI are already emerging.
Clearly, which benchmarks matter most for a specific corporate project will depend on the exact scope of that use case. A customer-service project may care a great deal about linguistic skill and very little about benchmarks that measure mathematical ability; for an engineering project, the opposite may be true. When selecting an AI model for a particular project, it is therefore important to understand which benchmarks are most relevant, rather than simply choosing whichever model tops a general leaderboard. Benchmarks are vital, but unless they evolve as fast as the models they measure, they risk becoming more of a marketing tool than a true measure of intelligence.