Benchmarking as a measure of success
Benchmarks are often hailed as hallmarks of success: from running a sub-four-minute mile to acing a standardized test, they are a popular way to measure progress. In the context of artificial intelligence (AI), benchmarks are the most common way to evaluate a model's capabilities. Industry leaders such as OpenAI, Anthropic, Meta, and Google compete for superior benchmark scores. However, recent research and complaints from industry are raising questions about whether common benchmarks actually capture the essence of a model's capabilities.
New research points to the possibility that the training sets of some models are contaminated with the very data they are evaluated on, casting doubt on whether benchmark scores reflect true understanding. It is a bit like actors playing doctors or scientists in a movie: they deliver their lines convincingly without really understanding the underlying concepts. When Cillian Murphy played the famous physicist J. Robert Oppenheimer in the movie Oppenheimer, he probably did not understand the complex physics theories he was reciting. Benchmarks are meant to assess a model's abilities, but if the model has simply memorized them, the way an actor memorizes a script, what do they really measure?
A recent study from the University of Arizona found that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets, calling the corresponding benchmarks into question [1]. Additionally, when researchers at the University of Science and Technology of China applied a "probing" technique to the popular MMLU benchmark, results declined dramatically [2].
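To make the contamination concern concrete, the sketch below shows one simple heuristic often used for this kind of check: measuring n-gram overlap between a benchmark example and training text. This is an illustrative assumption on my part, not the method used in the cited studies; the n-gram size and threshold are arbitrary.

```python
# Minimal sketch of an n-gram overlap contamination check (illustrative only;
# not the procedure from the cited papers).
from typing import Set


def ngrams(text: str, n: int = 8) -> Set[tuple]:
    """Return the set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_example: str, training_text: str, n: int = 8) -> float:
    """Fraction of the example's n-grams that also appear verbatim in the training text."""
    example_grams = ngrams(benchmark_example, n)
    if not example_grams:
        return 0.0
    return len(example_grams & ngrams(training_text, n)) / len(example_grams)


# Toy usage: a high ratio suggests the benchmark item may have been seen in training.
benchmark_item = "The cat sat on the mat while the dog slept by the door in the hall"
training_snippet = "... the cat sat on the mat while the dog slept by the door in the hall ..."
print(f"overlap: {overlap_ratio(benchmark_item, training_snippet):.2f}")
```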
Their probing techniques present the same benchmark questions in different ways, with varied answer options but the same correct answer, to test whether the model actually understands them. The probes include rephrasing the question, rephrasing the choices, permuting the choices, adding extra context to the question, and adding new choices to the benchmark question.
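As a rough sketch of how such probes can be generated programmatically, the example below perturbs a multiple-choice item while preserving its correct answer. The item structure and the specific perturbations are my own illustrative assumptions, not the exact setup from the cited paper.

```python
# Sketch: generating answer-preserving probes for a multiple-choice benchmark item.
import random
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class MCQItem:
    question: str
    choices: List[str]
    answer_index: int  # index of the correct choice


def shuffle_choices(item: MCQItem, seed: int = 0) -> MCQItem:
    """Reorder the options while tracking where the correct answer lands."""
    rng = random.Random(seed)
    order = list(range(len(item.choices)))
    rng.shuffle(order)
    new_choices = [item.choices[i] for i in order]
    return replace(item, choices=new_choices, answer_index=order.index(item.answer_index))


def add_distractor(item: MCQItem, distractor: str = "None of the above") -> MCQItem:
    """Append an extra incorrect option; the correct answer is unchanged."""
    return replace(item, choices=item.choices + [distractor])


def add_context(item: MCQItem, context: str) -> MCQItem:
    """Prepend extra (answer-neutral) context to the question."""
    return replace(item, question=f"{context}\n{item.question}")


# Toy usage: three probed variants of the same item, all with 'Mercury' correct.
item = MCQItem(
    question="Which planet is closest to the Sun?",
    choices=["Venus", "Mercury", "Earth", "Mars"],
    answer_index=1,
)
probes = [
    shuffle_choices(item, seed=7),
    add_distractor(item),
    add_context(item, "The solar system has eight planets."),
]
```

A model that genuinely understands the question should answer all of these variants as reliably as the original; a model that has memorized the benchmark often will not.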
In the graph below, each model tested performs well on the unmodified "vanilla" MMLU benchmark, but performs noticeably worse once the probing techniques are applied in the other sections of the benchmark (LU, PS, DK, All).
This evolving situation is prompting a reevaluation of how AI models are assessed. The need for benchmarks that reliably demonstrate capability while accounting for data contamination and memorization is becoming increasingly clear.
The lifespan of a benchmark is inherently short, as models continue to evolve and may be retrained on data that includes the benchmark itself. In addition, model context windows are growing rapidly, allowing far more context to be included when generating a response. The larger the context window, the greater the potential for contaminated data to indirectly distort the model's training process and bias it toward test examples it has already seen.
To address these challenges, innovative approaches such as dynamic benchmarks are emerging. They use tactics such as altering the question, making it more complex, introducing noise into it, rephrasing it, and reversing its polarity [3].
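The sketch below illustrates what a few of these perturbation styles might look like when applied to a single question. Real dynamic benchmarks typically use a language model to generate such variants; the simple string manipulations here are stand-ins I have assumed for illustration.

```python
# Illustrative stand-ins for dynamic-benchmark perturbations (not a real implementation).
import random


def add_noise(question: str, seed: int = 0) -> str:
    """Insert an irrelevant sentence that should not change the answer."""
    rng = random.Random(seed)
    fillers = [
        "Note that the weather was clear that day.",
        "This question appears in many textbooks.",
    ]
    return f"{rng.choice(fillers)} {question}"


def reverse_polarity(question: str) -> str:
    """Ask for the opposite; the grader must then expect the complementary answer."""
    return question.replace("Which of the following is", "Which of the following is NOT")


def rephrase(question: str) -> str:
    """Placeholder for an LLM-generated paraphrase that preserves the answer."""
    return f"In other words: {question}"


original = "Which of the following is a prime number?"
variants = [add_noise(original), reverse_polarity(original), rephrase(original)]
```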
The examples below illustrate different ways of altering a benchmark question, whether manually or with a language model.
As we move forward, it is becoming clear that evaluation methods need to align more closely with real-world applications. Benchmarks that accurately reflect real-world tasks and challenges will not only provide more accurate measurements of AI capabilities, but also guide the development of small language models (SLMs) and AI agents. These specialized models and agents need benchmarks that genuinely capture their potential to perform practical, useful tasks.