Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (a.k.a., CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further.
Our investigation, conducted through the controlled environment of DataAlchemy, reveals that the apparent reasoning prowess of Chain-of-Thought (CoT) is largely a brittle mirage. The findings across task, length, and format generalization experiments converge on a conclusion: CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces.
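To make the setup concrete: the only difference between the two regimes is the prompt. Below is a minimal sketch against an OpenAI-compatible endpoint; the model name, endpoint, and step-by-step suffix are illustrative placeholders, not the paper's DataAlchemy setup.

# Minimal sketch: direct prompting vs. chain-of-thought prompting.
# Endpoint and model name are placeholders for any OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

question = (
    "A train leaves at 14:05 and arrives at 17:50. "
    "How long is the journey?"
)

# Direct prompting: ask for the answer only.
direct = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": question}],
)

# CoT prompting: the only change is asking for intermediate steps first.
cot = client.chat.completions.create(
    model="local-model",
    messages=[{
        "role": "user",
        "content": question + " Let's think step by step, then give the final answer.",
    }],
)

print("direct:", direct.choices[0].message.content)
print("cot:   ", cot.choices[0].message.content)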
I wonder how many TWh have been wasted on reasoning-token generation since it was introduced late last year.
Personally, when doing light programming tasks (ansible tasks, bash scripts, python one-offs), I deliberately choose non-thinking models, as I find the output to be better.
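FWIW, with some model families you don't even have to switch models; here's a minimal sketch assuming a Qwen3-style chat template that accepts an enable_thinking switch (check the model card, this is not universal).

# Sketch: disabling the "thinking" phase for a Qwen3-style chat model.
# enable_thinking is a Qwen3 chat-template option; other families may
# not support it, so treat this as an assumption to verify.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # placeholder: any Qwen3 instruct/chat model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content":
             "Write a bash one-liner that lists the 10 largest files under /var/log."}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # skip the <think> block entirely
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))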
Lately I've been seeing local models like devstral and qwen3-coder that produce output comparable to previous "premier" models like Claude 3.5.
I really do think we are going to reach the point where local models are going to be "good enough" for 80% of the routine tasks people need....and the AI hype craze will die down, as the remaining 20% that requires high-end datacenters won't be enough to sustain the enormous hype we've seen over the last few years.
The curious thing is that we probably do need these datacenters at the moment, primarily to help us train / finetune / distill the next gen of local open source models being released.
Yes, I was testing qwen3-coder the other day and it is only a little less efficient than claude 3.7, but claude 4 is actually quite good. This morning I tested gpt-oss-120b, which turned out awful at tool calls and almost completely non-comprehending when it came to deciding simple things, like that you can get the source of a file using the tools. This surprised me, but it got annoying to the point where I had to stop using it because it was just running into error loops. So yeah, that's a D- for OpenAI's "open source" model - they probably just nerfed it for the specific purpose of complaints like mine, so that they can continue sucking your moneys. However, instead of giving Scam Altman my hard-earned money - he defo does not deserve it - I'm going to test the mouthful of qwen3-235b-a22b-0725.
I found gpt-oss-20b better than 120b... especially considering the speed difference when trying to run on local hardware. Not sure what they did to 120b, but I didn't get really good results from it. 20b was ok; I even thought the ansible task I gave it came out quite well. However, the "thinking" phase seemed excessive for the final output.
Heard good things about that on the tubes....I'd be interested to hear what your findings are....I may try to download it today and test as well.
I think it's an attention-span issue: if a larger context is provided, for example after inspecting a file or even listing a dir, it "forgets" about the tools, which are (often) provided at the end of the system prompt. It did do a couple of diffs successfully, but as the context window grew it just messed up more and more, even when I soft-capped it at 16k. So yeah... that one needs work.
I'll check out their 20b off some provider later, because I'm getting a bit tired of the massive downloads to my cloud inference node only to find out something sucks. Probably cheaper to spend 2k sats on a paid api.
re: qwen3. It just did the endless "reasoning" loop:
% cat aborted.txt | sed 's/[\.,]/ /g' | tr " " "\n" | sed 's/\.$//g;s/^$/-----/g' | tr '[:upper:]' '[:lower:]' | grep -v -- "-----" | sort | uniq -c | sort -n
   1 also
   1 application
   1 as
   1 current
   1 ensure
   1 expected
   1 implementation
   1 involves
   1 it
   1 requested
   1 running
   1 should
   1 test
   1 this
   1 triggers
   1 update
   1 user
   1 verifying
   1 works
  83 by
  83 category
  83 compilation
  83 correct
  83 error
  83 errors
  83 fixed
  83 from
  83 have
  83 structure
  83 using
  84 app
  84 associated
  84 be
  84 but
  84 can
  84 component
  84 createtoolbar
  84 currently
  84 defined
  84 does
  84 implement
  84 not
  84 now
  84 same
  84 statusicon
  84 used
  85 active
  85 available
  85 completion
  85 correctly
  85 database
  85 enabled
  85 fetches
  85 for
  85 has
  85 icon
  85 if
  85 indicate
  85 loading
  85 message
  85 processes
  85 progress
  85 results
  85 saves
  85 show
  85 summary
  86 button
  86 operation
  86 that
 167 been
 167 file
 168 actual
 168 logic
 168 validation
 169 with
 170 a
 170 all
 170 articles
 170 feeds
 170 of
 170 sets
 251 go
 252 implemented
 253 refreshfeeds
 254 bar
 254 feed
 254 updates
 342 to
 421 and
 421 function
 421 in
 421 is
 425 refresh
 509 status
1861 the
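Roughly the same word-frequency check in Python, in case the sed/tr chain is hard to follow (aborted.txt is just the dumped transcript):

# Count word frequencies in the aborted transcript so loop-repeated
# phrases show up at the bottom of the (ascending) listing.
import re
from collections import Counter

with open("aborted.txt") as f:
    words = re.findall(r"[a-z0-9'-]+", f.read().lower())

for count, word in sorted((c, w) for w, c in Counter(words).items()):
    print(f"{count:6d} {word}")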
Nice when you can link to the "told you so".
I don't have any good experience with thinking models for the stuff I routinely use them for. It just creates super long and general-purpose scripts even when I just need it to make a simple figure using matplotlib. They love using argparse, adding plenty of edge cases, etc. What I would do with 20 lines of code, it does with 100. It works, but it's a waste.
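To be concrete about the scale: the whole ask is usually something like this (file and column names are made up):

# Throwaway plotting script: read a CSV, save one line plot. No argparse,
# no edge-case handling - the entire job in about ten lines.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results.csv")      # placeholder file
plt.plot(df["step"], df["loss"])     # placeholder columns
plt.xlabel("step")
plt.ylabel("loss")
plt.title("training loss")
plt.tight_layout()
plt.savefig("loss.png", dpi=150)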
Yeah, everyone thought r1-type stuff was the way forward, and it looks like it's just a more focused version of what we were doing before, and it can actually hinder some simpler tasks.