
The randomizer / temperature is what messes it up, and back then, low temperature caused instruction deviation. The arguing produced results similar to what you see in Qwen3 nowadays: in the "reasoning" step it basically does the same thing in one shot. I've had many a run stuck in a reasoning loop: "Wait, <crap>".
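For anyone wondering which knob I mean: a minimal sketch, assuming an OpenAI-compatible local endpoint (llama.cpp server, Ollama, etc.); the URL and model name are placeholders, not my actual setup.

```python
# Minimal sketch: the "randomizer" is just the sampling temperature.
# Assumes an OpenAI-compatible local endpoint; base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

resp = client.chat.completions.create(
    model="local-model",   # placeholder model name
    temperature=0.2,       # lower = less sampling randomness; back then I saw low values deviate from instructions
    messages=[
        {"role": "system", "content": "Follow the instructions exactly."},
        {"role": "user", "content": "Summarize the bug report in three bullet points."},
    ],
)
print(resp.choices[0].message.content)
```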
I also did a "socratic loop" where you basically let one agent ask questions of the other and then bring it together. You limit looping by capping the rounds, but even then you'll sometimes find looping within a round. I suspect this is also why reasoning mode is not really efficient in production settings outside chatbots, where you need consistent results. I can achieve much higher consistency with a small Llama 3.2 or Mistral 3.1, which is half a year old and non-reasoning. At least that's what works much better in my "production" usage than models with the reasoning feature.
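Rough shape of that socratic loop, as a minimal sketch; `ask_model()` is a stand-in for whatever local completion call you use (e.g. the client above), and the prompts are illustrative, not my actual ones. The point is the hard round cap.

```python
# Minimal sketch of a "socratic loop": one agent questions the other for a
# capped number of rounds, then a final pass brings it together.

def ask_model(system: str, prompt: str) -> str:
    """Placeholder for a call to a local, non-reasoning model."""
    raise NotImplementedError  # wire up llama.cpp / Ollama / etc. here

def socratic_loop(task: str, max_rounds: int = 3) -> str:
    transcript = [f"Task: {task}"]
    for round_no in range(max_rounds):  # hard cap keeps the agents from looping forever
        question = ask_model(
            "You are the questioner. Ask one short, pointed question about the work so far.",
            "\n".join(transcript),
        )
        answer = ask_model(
            "You are the answerer. Answer the question concretely.",
            "\n".join(transcript + [f"Q{round_no + 1}: {question}"]),
        )
        transcript += [f"Q{round_no + 1}: {question}", f"A{round_no + 1}: {answer}"]
    # final pass: "bring it together" into one consistent answer
    return ask_model(
        "Merge the Q&A below into a single consistent answer. Do not ask further questions.",
        "\n".join(transcript),
    )
```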
It also highlights that reasoning, and indirectly slop input, are sidetrack experiments that mostly just reduce LLM efficiency, beyond trying to astonish people with "how smart they are".
111 sats \ 1 reply \ @freetx 19h
I can achieve much higher consistency with a small Llama 3.2 or Mistral 3.1, which is half a year old and non-reasoning. At least that's what works much better in my "production" usage than models with the reasoning feature.
Yes, I tend to shy away from reasoning models myself in self-hosted situations. I have hardly ever found that the "increased intelligence" (debatable) is worth the wasted time of the slop generation, uhh, "reasoning".
small Llama 3.2 or Mistral 3.1
On a side note, have you played with IBM's Granite models? I really think the Granite 3.2 and 3.3 8B models punch far above their weight. I've asked them various legal, tax, and programming questions, and while their results are nowhere near "frontier class", they always seem grounded.
IBM's Granite models
Haven't - will put it on the list! Thanks!