LLMs are better at math than me
Here's an interesting anecdote about LLMs' capabilities from the world of mathematics.
I recently made an incorrect prediction about existing model capabilities. A number of world-class mathematicians initiated a wonderful project they dubbed "First Proof" (a baking pun), aimed at measuring the usefulness of current models for certain math research tasks. The first round, which was essentially a trial balloon, consisted of 10 lemmas taken from the authors' own unpublished work. Most of these are not research projects on their own, but they are natural examples of non-trivial tasks that appear in the daily life of a research mathematician.
At the paper's release I weakly expected that existing models would come up with proofs of 2-3 of these questions (weakly because they are all rather far from my work, and it took some time for me to evaluate their difficulty), but thought that 4-5 solutions was plausible. While there is no official grading for this round, it appears at the time of this writing that somewhere between 6 and 8 have been solved, if one combines all attempts (and an enormous amount of garbage has been produced). Some solutions are arguably semi-autonomous rather than autonomous. Is this meaningfully different from what I had expected? Though I already knew that existing models could sometimes usefully prove lemmas that show up in the day-to-day work of a mathematician, I found it notable. This experiment suggests that, with sufficient scaffolding, they can often prove such lemmas.
Still, the mathematician relating the story has some reservations about the quality of LLM work:
The models (and the humans supervising them) generated an enormous amount of garbage, including some incorrect solutions claiming to be formalized in Lean. Even the best models/scaffolds seem not to be able to reliably detect when they are producing nonsense.
So is a calculator. Are LLMs a different kind of thing than calculators?
The real question that emerges is this:
Every frontier model knows more mathematics than any human who has ever lived; every model can solve exam questions faster and more accurately than any human. One might reasonably object: Google, or the collection of all math textbooks, knows more math than anyone who has ever lived. Mathematica or a TI-84 calculator is better at certain computations than any human. Sure, I agree.
The mystery is this: a human with these capabilities would, almost certainly, be proving amazing theorems constantly. Why haven't we seen this from the models yet? What are they missing? The answer is much more obvious for a calculator than it is for an LLM.
This is a really interesting question. It may point to a fundamental difference between human and model reasoning:
"because the models have an incredibly vast knowledge-base, they often do not have to generalize very far to solve a problem, whereas a human might have to perform an immense amount of reasoning to do so."
When you generate a benchmark, or test a model privately, you try to generate some hard questions and then give them to the model. What you actually generate is a question that you think is hard---that is, one that is hard for you or people you know. What makes a question hard is often that there is some missing prerequisite knowledge, or that one has not been exposed to similar problems; then reasoning ability is necessary to fill in this missing context. But this means that model performance can improve by increasing knowledge, without increasing reasoning abilities.
But does such a difference actually matter?
Possibly:
In the near term, we're in trouble. Models are able to produce both correct, interesting mathematics, as well as incorrect mathematics that is exceedingly labor-intensive to detect. Academic mathematics is simply not prepared to handle this.
One interesting way to put this is: LLMs aren't truth-seeking. But it may not matter if they have a wide enough knowledge base.
All in all, the mathematician thinks things are looking up for AI and math:
I think I was not correctly calibrated as to the capabilities of existing models, let alone near-future models. This was more apparent in the mood of my comments than their content, which was largely cautious.
let me rephrase: it's very likely that this technology is bigger than the computer.
I'm disappointed to read that kind of statement from a mathematician.
"knows"? really?
They have learned the linguistic patterns of mathematical language, and produce bullshit more rapidly than either humans or automatic verifiers can evaluate it. However, as far as I've seen, there is not much research into fundamentally representing mathematical knowledge in some way that could be "exported" from the model's LLM blob and, e.g., plotted as a commutative diagram.
Yes, you could produce an SVG of that diagram by poking an LLM until it barfs one. However, that would be the result of lots of "reasoning" steps performing a breadth-first search through the noisy vocabulary, rather than a direct conversion of a subgraph from some abstract ideomorphic representation into the rendering format.
Truth can be sought via reasoned dialogue and the contest of ideas.
With the V4V P2P sats-denominated social media here on Stacker News, we can all participate in the constructive contest of ideas and seek to approach the truth.
Part of supporting SN's V4V P2P censorship-resistant model is attaching LN wallets; that way we also support the LN and its strength and liquidity.
Attach your LN wallets, both sending and receiving, to SN today and be part of the solution.
Rather than endless hollow virtue signalling and echo-chamber rhetoric, do something: attach your LN wallets today and provide verification that you are not just a talker but also a walker on the road to freedom.