Admittedly, this article has a clickbaity title.
Supposedly, there's no reason to think that OpenAI is using the benchmarks to train its models: OpenAI has promised that it doesn't, and Epoch AI says it keeps "holdout" problems that will be used in the real test...
Regarding training usage: We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.
Elliot Glazer (LinkedIn profile/Reddit profile), the lead mathematician at Epoch AI, confirmed that OpenAI has the dataset and was allowed to use it to evaluate o3, OpenAI's next state-of-the-art large language model, billed as a reasoning model. He offered his opinion that the high scores obtained by o3 are "legit," and said that Epoch AI is conducting an independent evaluation to determine whether o3 had access to the FrontierMath dataset for training, which could cast the model's high scores in a different light.
Personally, Sam Altman has always given me Samuel Bankman-Fried/Billy McFarland vibes. And that's probably not fair. GPT can do some empirically incredible things, and even if AI doesn't advance a single step further, the world will still be changed.
The optimistic side of me genuinely hopes that there's nothing to the story in this article, but stuff like this makes it hard not to be cynical.
Either way, I'm just not getting hyped about benchmarks.