Despite their expertise, AI developers don’t always know what their most advanced systems are capable of—at least, not at first. To find out, systems are subjected to a range of tests—often called evaluations, or ‘evals’—designed to tease out their limits. But due to rapid progress in the field, today’s systems regularly achieve top scores on many popular tests, including SATs and the U.S. bar exam, making it harder to judge just how quickly they are improving.
A new set of much more challenging evals has emerged in response, created by companies, nonprofits, and governments. Yet even on the most advanced evals, AI systems are making astonishing progress. In November, the nonprofit research institute Epoch AI announced a set of exceptionally challenging math questions developed in collaboration with leading mathematicians, called FrontierMath, on which currently available models scored only 2%. Just one month later, OpenAI’s newly announced o3 model achieved a score of 25.2%, which Epoch’s director, Jaime Sevilla, describes as “far better than our team expected so soon after release.”
Amid this rapid progress, these new evals could help the world understand just what advanced AI systems can do, and—with many experts worried that future systems may pose serious risks in domains like cybersecurity and bioterrorism—serve as early warning signs, should such threatening capabilities emerge in future.
Harder than it sounds
In the early days of AI, capabilities were measured by evaluating a system’s performance on specific tasks, like classifying images or playing games, with the time between a benchmark’s introduction and an AI matching or exceeding human performance typically measured in years. It took five years, for example, before AI systems surpassed humans on the ImageNet Large Scale Visual Recognition Challenge, established by Professor Fei-Fei Li and her team in 2010. And it was only in 2017 that an AI system (Google DeepMind’s AlphaGo) was able to beat the world’s number one ranked player in Go, an ancient, abstract Chinese board game—almost 50 years after the first program attempting the task was written.
The gap between a benchmark’s introduction and its saturation has decreased significantly in recent years. For instance, the GLUE benchmark, designed to test an AI’s ability to understand natural language by completing tasks like deciding if two sentences are equivalent or determining the correct meaning of a pronoun in context, debuted in 2018. It was considered solved one year later. In response, a harder version, SuperGLUE, was created in 2019—and within two years, AIs were able to match human performance across its tasks.
Evals take many forms, and their complexity has grown alongside model capabilities. Virtually all major AI labs now “red-team” their models before release, systematically testing their ability to produce harmful outputs, bypass safety measures, or otherwise engage in undesirable behavior, such as deception. Last year, companies including OpenAI, Anthropic, Meta, and Google made voluntary commitments to the Biden administration to subject their models to both internal and external red-teaming “in areas including misuse, societal risks, and national security concerns.”
Other tests assess specific capabilities, such as coding, or evaluate models’ capacity and propensity for potentially dangerous behaviors like persuasion, deception, and assisting in large-scale biological attacks.
Perhaps the most popular contemporary benchmark is Measuring Massive Multitask Language Understanding (MMLU), which consists of about 16,000 multiple-choice questions that span academic domains like philosophy, medicine, and law. OpenAI’s GPT-4o, released in May, achieved 88%, while the company’s latest model, o1, scored 92.3%. Because these large test sets sometimes contain problems with incorrectly labelled answers, attaining 100% is often not possible, explains Marius Hobbhahn, director and co-founder of Apollo Research, an AI safety nonprofit focused on reducing dangerous capabilities in advanced AI systems. Past a point, “more capable models will not give you significantly higher scores,” he says.
Designing evals to measure the capabilities of advanced AI systems is “astonishingly hard,” Hobbhahn says—particularly since the goal is to elicit and measure the system’s actual underlying abilities, for which tasks like multiple-choice questions are only a proxy. “You want to design it in a way that is scientifically rigorous, but that often trades off against realism, because the real world is often not like the lab setting,” he says. Another challenge is data contamination, which can occur when the answers to an eval are contained in the AI’s training data, allowing the model to reproduce memorized answers rather than reason from first principles.
Another issue is that evals can be “gamed” when “either the person that has the AI model has an incentive to train on the eval, or the model itself decides to target what is measured by the eval, rather than what is intended,” says Hobbhahn.
A new wave
In response to these challenges, new, more sophisticated evals are being built.
Epoch AI’s FrontierMath benchmark consists of approximately 300 original math problems, spanning most major branches of the subject. It was created in collaboration with over 60 leading mathematicians, including Fields Medal winner Terence Tao. The problems vary in difficulty, with about 25% pitched at the level of the International Mathematical Olympiad, such that an “extremely gifted” high school student could in theory solve them if they had the requisite “creative insight” and “precise computation” abilities, says Tamay Besiroglu, Epoch’s associate director. Half the problems require “graduate level education in math” to solve, while the most challenging 25% of problems come from “the frontier of research of that specific topic,” meaning only today’s top experts could crack them, and even they may need multiple days.
Solutions cannot be derived by simply testing every possible answer, since the correct answers often take the form of 30-digit numbers. To avoid data contamination, Epoch is not publicly releasing the problems (beyond a handful, which are intended to be illustrative and do not form part of the actual benchmark). Even with a peer-review process in place, Besiroglu estimates that around 10% of the problems in the benchmark have incorrect solutions—an error rate comparable to other machine learning benchmarks. “Mathematicians make mistakes,” he says, noting they are working to lower the error rate to 5%.
Evaluating mathematical reasoning could be particularly useful because a system able to solve these problems may also be able to do much more. While careful not to overstate that “math is the fundamental thing,” Besiroglu expects any system able to solve the FrontierMath benchmark will be able to “get close, within a couple of years, to being able to automate many other domains of science and engineering.”
Another benchmark aiming for a longer shelf life is the ominously-named “Humanity’s Last Exam,” created in collaboration between the nonprofit Center for AI Safety and Scale AI, a for-profit company that provides high-quality datasets and evals to frontier AI labs like OpenAI and Anthropic. The exam aims to include between 20 and 50 times as many questions as FrontierMath, while also covering domains like physics, biology, and electrical engineering, says Summer Yue, Scale AI’s director of research. Questions are being crowdsourced from the academic community and beyond. To be included, a question needs to be unanswerable by all existing models. The benchmark is intended to go live in late 2024 or early 2025.
A third benchmark to watch is RE-Bench, designed to simulate real-world machine-learning work. It was created by researchers at METR, a nonprofit that specializes in model evaluations and threat research, and tests humans and cutting-edge AI systems across seven engineering tasks. Both humans and AI agents are given a limited amount of time to complete the tasks; while humans reliably outperform current AI agents on most of them, things look different when considering performance only within the first two hours. Current AI agents do best when given between 30 minutes and 2 hours, depending on the agent, explains Hjalmar Wijk, a member of METR’s technical staff. After this time, they tend to get “stuck in a rut,” he says, as AI agents can make mistakes early on and then “struggle to adjust” in the ways humans would.
“When we started this work, we were expecting to see that AI agents could solve problems only of a certain scale, and beyond that, that they would fail more completely, or that successes would be extremely rare,” says Wijk. It turns out that given enough time and resources, they can often get close to the performance of the median human engineer tested in the benchmark. “AI agents are surprisingly good at this,” he says. In one particular task—which involved optimizing code to run faster on specialized hardware—the AI agents actually outperformed the best humans, although METR’s researchers note that the humans included in their tests may not represent the peak of human performance.
These results don’t mean that current AI systems can automate AI research and development. “Eventually, this is going to have to be superseded by a harder eval,” says Wijk. But given that the possible automation of AI research is increasingly viewed as a national security concern—for example, in the National Security Memorandum on AI, issued by President Biden in October—future models that excel on this benchmark may be able to improve upon themselves, exacerbating human researchers’ lack of control over them.
Even as AI systems ace many existing tests, they continue to ...