Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further.
Our investigation, conducted through the controlled environment of DataAlchemy, reveals that the apparent reasoning prowess of Chain-of-Thought (CoT) is largely a brittle mirage. The findings across task, length, and format generalization experiments converge on a conclusion: CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces.
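To make the setup concrete: the only difference between the two regimes is the prompt. Below is a minimal sketch against an OpenAI-compatible endpoint; the model name, endpoint, and step-by-step suffix are illustrative placeholders, not the paper's DataAlchemy setup.

# Minimal sketch: direct prompting vs. chain-of-thought prompting.
# Endpoint and model name are placeholders for any OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

question = (
    "A train leaves at 14:05 and arrives at 17:50. "
    "How long is the journey?"
)

# Direct prompting: ask for the answer only.
direct = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": question}],
)

# CoT prompting: the only change is asking for intermediate steps first.
cot = client.chat.completions.create(
    model="local-model",
    messages=[{
        "role": "user",
        "content": question + " Let's think step by step, then give the final answer.",
    }],
)

print("direct:", direct.choices[0].message.content)
print("cot:   ", cot.choices[0].message.content)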
I wonder how many TWh have been wasted on reasoning-token generation since it was introduced late last year.
Personally, when doing light programming tasks (ansible tasks, bash scripts, python one-offs), I deliberately choose non-thinking models, as I find the output to be better.
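FWIW, with some model families you don't even have to switch models; here's a minimal sketch assuming a Qwen3-style chat template that accepts an enable_thinking switch (check the model card, this is not universal).

# Sketch: disabling the "thinking" phase for a Qwen3-style chat model.
# enable_thinking is a Qwen3 chat-template option; other families may
# not support it, so treat this as an assumption to verify.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # placeholder: any Qwen3 instruct/chat model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content":
             "Write a bash one-liner that lists the 10 largest files under /var/log."}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # skip the <think> block entirely
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))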
Lately I've been seeing local models like devstral and qwen3-coder that produce output comparable to previous "premier" models like Claude 3.5.
I really do think we are going to reach the point where local models are going to be "good enough" for 80% of the routine tasks people need....and the AI hype craze will die down, as the remaining 20% that requires high-end datacenters won't be enough to sustain the enormous hype we've seen over the last few years.
The curious thing is that we probably do need these datacenters at the moment, primarily to help us train / finetune / distill the next gen of local open source models being released.
Yes, I was testing qwen3-coder the other day and it is only a little less efficient than claude 3.7, but claude 4 is actually quite good. This morning I tested gpt-oss-120b, which turned out awful at tool calls and almost completely non-comprehending when it came to deciding simple things, like that you can get the source of a file using the tools. This surprised me, but it got annoying to the point where I had to stop using it because it was just running into error loops. So yeah, that's a D- for OpenAI's "open source" model - they probably just nerfed it for the specific purpose of complaints like mine, so that they can continue sucking your moneys. However, instead of giving Scam Altman my hard-earned money - he defo does not deserve it - I'm going to test the mouthful of qwen3-235b-a22b-0725.
I found gpt-oss-20b better than 120b... especially considering the speed difference when trying to run on local hardware. Not sure what they did to 120b, but I didn't get really good results from it. 20b was ok; I even thought the ansible task I gave it came out quite well. However, the "thinking" phase seemed excessive for the final output.
Heard good things about that on the tubes....I'd be interested to hear what your findings are....I may try to download it today and test as well.
I think it's an attention-span issue: if a larger context is provided, for example after inspecting a file or even listing a dir, it "forgets" about the tools, which are (often) provided at the end of the system prompt. It did do a couple of diffs successfully, but as the context window grew it just messed up more and more, even when I soft-capped it at 16k. So yeah... that one needs work.
I'll check out their 20b off some provider later, because I'm getting a bit tired of the massive downloads to my cloud inference node only to find out something sucks. Probably cheaper to spend 2k sats on a paid api.
re: qwen3. It just did the endless "reasoning" loop:
% cat aborted.txt | sed 's/[\.,]/ /g' | tr " " "\n" | sed 's/\.$//g;s/^$/-----/g' | tr '[:upper:]' '[:lower:]' | grep -v -- "-----" | sort | uniq -c | sort -n
   1 also
   1 application
   1 as
   1 current
   1 ensure
   1 expected
   1 implementation
   1 involves
   1 it
   1 requested
   1 running
   1 should
   1 test
   1 this
   1 triggers
   1 update
   1 user
   1 verifying
   1 works
  83 by
  83 category
  83 compilation
  83 correct
  83 error
  83 errors
  83 fixed
  83 from
  83 have
  83 structure
  83 using
  84 app
  84 associated
  84 be
  84 but
  84 can
  84 component
  84 createtoolbar
  84 currently
  84 defined
  84 does
  84 implement
  84 not
  84 now
  84 same
  84 statusicon
  84 used
  85 active
  85 available
  85 completion
  85 correctly
  85 database
  85 enabled
  85 fetches
  85 for
  85 has
  85 icon
  85 if
  85 indicate
  85 loading
  85 message
  85 processes
  85 progress
  85 results
  85 saves
  85 show
  85 summary
  86 button
  86 operation
  86 that
 167 been
 167 file
 168 actual
 168 logic
 168 validation
 169 with
 170 a
 170 all
 170 articles
 170 feeds
 170 of
 170 sets
 251 go
 252 implemented
 253 refreshfeeds
 254 bar
 254 feed
 254 updates
 342 to
 421 and
 421 function
 421 in
 421 is
 425 refresh
 509 status
1861 the
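Roughly the same word-frequency check in Python, in case the sed/tr chain is hard to follow (aborted.txt is just the dumped transcript):

# Count word frequencies in the aborted transcript so loop-repeated
# phrases show up at the bottom of the (ascending) listing.
import re
from collections import Counter

with open("aborted.txt") as f:
    words = re.findall(r"[a-z0-9'-]+", f.read().lower())

for count, word in sorted((c, w) for w, c in Counter(words).items()):
    print(f"{count:6d} {word}")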
Nice when you can link to the "told you so".
I don't have any good experience with thinking models for the stuff I routinely use them for. It just creates super long and general-purpose scripts even when I just need it to make a simple figure using matplotlib. They love using argparse, adding plenty of edge cases, etc. What I would do with 20 lines of code, it does with 100. It works, but it's a waste.
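To be concrete about the scale: the whole ask is usually something like this (file and column names are made up):

# Throwaway plotting script: read a CSV, save one line plot. No argparse,
# no edge-case handling - the entire job in about ten lines.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results.csv")      # placeholder file
plt.plot(df["step"], df["loss"])     # placeholder columns
plt.xlabel("step")
plt.ylabel("loss")
plt.title("training loss")
plt.tight_layout()
plt.savefig("loss.png", dpi=150)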
Yeah, everyone thought r1-type stuff was the way forward, and it looks like it's just a more focused version of what we were doing before, and it can actually hinder some simpler tasks.