To begin, the research team delved deep into the math world, reaching out to some of the brightest minds in the field. They asked them to provide some truly difficult math problems and got back hundreds of them in reply. Such problems, the researchers note, are not only unique (they have not been published before) but they also require a deep level of understanding of mathematics. Some take humans several days to solve.
They also cover a wide range of topics, from number theory to algebraic geometry. Because of that breadth, brute force will not work. Neither will making educated guesses. To score well on the FrontierMath benchmark, an AI system would have to have creativity, insight and what the research team describes as "deep domain expertise."
Testing thus far has demonstrated how difficult FrontierMath is. AI systems that have scored well on traditional benchmarks have managed no higher than 2% on it.
These observations anecdotally match my experience when trying to enlist ChatGPT's help with some of my research... it works well for routine stuff, but still fails miserably at anything more advanced. I still matter ;)
They are performing old computer mathematics, which is why they perform poorly.
What do you mean by "old computer mathematics"?
We're still at the very beginning. I don't believe in hype, but I do believe that in a few years we can achieve fantastic things. As for difficult math problems, maybe we're still a long way off!
I'm not really clear on how the claimed reasoning capabilities of these LLMs are supposed to work. I understand the basics, but those would only work when the model has been trained on similar problems before. Not because it understands them, but because it is good at predicting the likely next step based on its training data. Hard math problems are supposed to require one-of-a-kind solutions that likely do not exist in the available training data.
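The training-data dependence the comment above describes can be sketched with a deliberately toy model. This is a hypothetical minimal bigram predictor, not how any real LLM works, but it makes the point concrete: a pure next-token predictor can only continue patterns it has seen, and has nothing to offer for a context absent from its training data.

```python
from collections import Counter, defaultdict

# Toy training corpus of arithmetic "problems" split into tokens.
training = "2 + 2 = 4 . 3 + 3 = 6 . 2 + 3 = 5 .".split()

# Count which token follows each token in the training data.
follows = defaultdict(Counter)
for prev, nxt in zip(training, training[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Greedily return the most frequent next token, or None if unseen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]

print(predict_next("+"))   # "3" - the token that most often followed "+"
print(predict_next("7"))   # None - "7" never appeared, so no prediction
```

Real LLMs generalize far better than a bigram table, but the comment's intuition survives: novel, one-of-a-kind solutions are exactly the cases where statistical continuation of training patterns offers the least help.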