Benchmarking as a measure of success
Benchmarks are often hailed as hallmarks of success: from running a sub-four-minute mile to acing a standardized test, they are a popular way to measure progress. In the context of artificial intelligence (AI), benchmarks are the most common way to evaluate a model's capabilities. Industry leaders such as OpenAI, Anthropic, Meta, and Google compete for superior benchmark scores. However, recent research and complaints from industry are raising questions about whether common benchmarks actually capture the essence of a model's capabilities.
New research points to the possibility that the training sets of some models are contaminated with the very data they are evaluated on, casting doubt on whether benchmark scores reflect true understanding. It is a bit like actors playing doctors or scientists in a movie: they deliver their lines convincingly without really understanding the underlying concepts. When Cillian Murphy played the famous physicist J. Robert Oppenheimer in the movie Oppenheimer, he probably did not understand the complex physics theories he was reciting. Benchmarks are meant to assess a model's abilities, but if the model has simply memorized them, the way an actor memorizes a script, what do they really measure?
A recent study from the University of Arizona found that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets, calling the corresponding benchmarks into question [1]. Additionally, when researchers at the University of Science and Technology of China applied a "probing" technique to the popular MMLU benchmark, results declined dramatically [2].
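To make the contamination concern concrete, the sketch below shows one simple heuristic often used for this kind of check: measuring n-gram overlap between a benchmark example and training text. This is an illustrative assumption on my part, not the method used in the cited studies; the n-gram size and threshold are arbitrary.

```python
# Minimal sketch of an n-gram overlap contamination check (illustrative only;
# not the procedure from the cited papers).
from typing import Set


def ngrams(text: str, n: int = 8) -> Set[tuple]:
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_example: str, training_text: str, n: int = 8) -> float:
    """Fraction of the example's n-grams that also appear verbatim in the training text."""
    example_grams = ngrams(benchmark_example, n)
    if not example_grams:
        return 0.0
    return len(example_grams & ngrams(training_text, n)) / len(example_grams)


# Toy usage: a high ratio suggests the benchmark item may have been seen in training.
benchmark_item = "The cat sat on the mat while the dog slept by the door in the hall"
training_snippet = "... the cat sat on the mat while the dog slept by the door in the hall ..."
print(f"overlap: {overlap_ratio(benchmark_item, training_snippet):.2f}")
```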
Their probing techniques present the same benchmark questions in different ways, with varied answer options but the same correct answer, to test whether the model actually understands them. The probes include rephrasing the question, rephrasing the choices, permuting the choices, adding extra context to the question, and adding new choices to the benchmark question.
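As a rough sketch of how such probes can be generated programmatically, the example below perturbs a multiple-choice item while preserving its correct answer. The item structure and the specific perturbations are my own illustrative assumptions, not the exact setup from the cited paper.

```python
# Sketch: generating answer-preserving probes for a multiple-choice benchmark item.
import random
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class MCQItem:
    question: str
    choices: List[str]
    answer_index: int  # index of the correct choice


def shuffle_choices(item: MCQItem, seed: int = 0) -> MCQItem:
    """Reorder the options while tracking where the correct answer lands."""
    rng = random.Random(seed)
    order = list(range(len(item.choices)))
    rng.shuffle(order)
    new_choices = [item.choices[i] for i in order]
    return replace(item, choices=new_choices, answer_index=order.index(item.answer_index))


def add_distractor(item: MCQItem, distractor: str = "None of the above") -> MCQItem:
    """Append an extra incorrect option; the correct answer is unchanged."""
    return replace(item, choices=item.choices + [distractor])


def add_context(item: MCQItem, context: str) -> MCQItem:
    """Prepend extra (answer-neutral) context to the question."""
    return replace(item, question=f"{context}\n{item.question}")


# Toy usage: three probed variants of the same item, all with 'Mercury' correct.
item = MCQItem(
    question="Which planet is closest to the Sun?",
    choices=["Venus", "Mercury", "Earth", "Mars"],
    answer_index=1,
)
probes = [
    shuffle_choices(item, seed=7),
    add_distractor(item),
    add_context(item, "The solar system has eight planets."),
]
```

A model that genuinely understands the question should answer all of these variants as reliably as the original; a model that has memorized the benchmark often will not.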
In the graph below, each model tested performs well on the unmodified "vanilla" MMLU benchmark, but performs noticeably worse once the probing techniques are applied in the other sections of the benchmark (LU, PS, DK, All).
This evolving situation is prompting a reevaluation of how AI models are assessed. The need for benchmarks that reliably demonstrate capability while accounting for data contamination and memorization is becoming increasingly clear.
The lifespan of a benchmark is inherently short, as models continue to evolve and may be retrained on data that includes the benchmark itself. In addition, model context windows are growing rapidly, allowing far more context to be included when generating a response. The larger the context window, the greater the potential for contaminated data to indirectly distort the model's training process and bias it toward test examples it has already seen.
To address these challenges, innovative approaches such as dynamic benchmarks are emerging. They use tactics such as altering the question, making it more complex, introducing noise into it, rephrasing it, and reversing its polarity [3].
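The sketch below illustrates what a few of these perturbation styles might look like when applied to a single question. Real dynamic benchmarks typically use a language model to generate such variants; the simple string manipulations here are stand-ins I have assumed for illustration.

```python
# Illustrative stand-ins for dynamic-benchmark perturbations (not a real implementation).
import random


def add_noise(question: str, seed: int = 0) -> str:
    """Insert an irrelevant sentence that should not change the answer."""
    rng = random.Random(seed)
    fillers = [
        "Note that the weather was clear that day.",
        "This question appears in many textbooks.",
    ]
    return f"{rng.choice(fillers)} {question}"


def reverse_polarity(question: str) -> str:
    """Ask for the opposite; the grader must then expect the complementary answer."""
    return question.replace("Which of the following is", "Which of the following is NOT")


def rephrase(question: str) -> str:
    """Placeholder for an LLM-generated paraphrase that preserves the answer."""
    return f"In other words: {question}"


original = "Which of the following is a prime number?"
variants = [add_noise(original), reverse_polarity(original), rephrase(original)]
```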
The examples below illustrate different ways of altering a benchmark question, whether manually or with a language model.
As we move forward, it is becoming clear that evaluation methods need to align more closely with real-world applications. Benchmarks that accurately reflect real-world tasks and challenges will not only provide more accurate measurements of AI capabilities, but also guide the development of small language models (SLMs) and AI agents. These specialized models and agents need benchmarks that genuinely capture their potential to perform practical, useful tasks.