To begin, the research team delved deep into the math world, reaching out to some of the brightest minds in the field. They asked them to provide some truly difficult math problems and got back hundreds of them in reply. Such problems, the researchers note, are not only unique (they have not been published before) but they also require a deep level of understanding of mathematics. Some take humans several days to solve.
They also cover a wide range of topics, from number theory to algebraic geometry. Because of that breadth, brute force will not work. Neither will making educated guesses. To score well on the FrontierMath benchmark, an AI system would have to have creativity, insight and what the research team describes as "deep domain expertise."
Testing thus far has demonstrated how difficult FrontierMath is. AI systems that have scored well on traditional benchmarks have managed no higher than 2% on it.
These observations anecdotally match my experience when trying to enlist ChatGPT's help with some of my research... it works well for routine stuff, but still fails miserably at anything more advanced. I still matter ;)
They are performing old computer mathematics, which is why they perform poorly.
What do you mean by "old computer mathematics"?
We're still at the very beginning. I don't believe in hype, but I do believe that in a few years we can achieve fantastic things. As for difficult math problems, maybe we're still a long way off!
I'm not really clear on how the claimed reasoning capabilities of these LLMs are supposed to work. I understand the basics, but those would only work when the model has been trained on similar problems before. Not because it understands them, but because it is good at predicting the likely next step based on its training data. Hard math problems are supposed to require one-of-a-kind solutions that likely do not exist in the available training data.
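The training-data dependence the comment above describes can be sketched with a deliberately toy model. This is a hypothetical minimal bigram predictor, not how any real LLM works, but it makes the point concrete: a pure next-token predictor can only continue patterns it has seen, and has nothing to offer for a context absent from its training data.

```python
from collections import Counter, defaultdict

# Toy training corpus of arithmetic "problems" split into tokens.
training = "2 + 2 = 4 . 3 + 3 = 6 . 2 + 3 = 5 .".split()

# Count which token follows each token in the training data.
follows = defaultdict(Counter)
for prev, nxt in zip(training, training[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Greedily return the most frequent next token, or None if unseen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

print(predict_next("+"))   # "3" - the token that most often followed "+"
print(predict_next("7"))   # None - "7" never appeared, so no prediction
```

Real LLMs generalize far better than a bigram table, but the comment's intuition survives: novel, one-of-a-kind solutions are exactly the cases where statistical continuation of training patterns offers the least help.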