To begin, the research team reached out to some of the brightest minds in mathematics and asked them to provide truly difficult problems; hundreds came back in reply. These problems, the researchers note, are not only unique (they have not been published before) but also demand a deep understanding of mathematics. Some take human experts several days to solve.
They also cover a wide range of topics, from number theory to algebraic geometry. Because of that breadth, brute force will not work, and neither will educated guessing. To score well on the FrontierMath benchmark, an AI system would need creativity, insight and what the research team describes as "deep domain expertise."
Testing thus far has demonstrated FrontierMath's difficulty: AI systems that score well on traditional benchmarks have managed no more than 2% on it.
These observations anecdotally match my experience when trying to enlist ChatGPT's help with some of my research... it works well for routine tasks, but still fails miserably at more advanced ones. I still matter ;)