
So, my question to you: have you had bad experiences using AI?
Somewhat similar outcomes using Cursor to write sysadmin code (Ansible, Python, etc).
Generally speaking, I only use GPT models for things like helping write README.md or INSTALL.txt instructions.
For actual code, I consider Claude and Gemini to be the best (fewest hallucinations + greatest ability to handle complex codebases)....perhaps Cursor also tunes some parameters under the hood to help minimize these ill effects (not sure).
Although I've only had a few rare outright hallucinations, far more prevalent are two problems that I encounter regularly:
  1. You give the model a direct, constrained instruction (modify this code to also copy its output to this directory)....and it will spontaneously decide it needs to change 5 things in 5 different files to accomplish that, when in fact it was a simple change to one existing line of code. I find that Claude 3.5 does this the least....so when I'm in "surgical mode" I tend to rely on it.
  2. Between runs, a model will sometimes decide it wants a different architectural approach or code style. You write Function 1, test and debug it, and it works fine. Then when you come to write Function 2, the model chooses a completely different coding style and a different "philosophical" approach to the task. This can be very frustrating: maybe for Function 1 it decided the entire function should be self-contained, but for Function 2 it decides to break the task into 4 different sub-tasks....the result is what's called "AI slop", a mishmash of different coding styles and approaches.
AI has definitely improved my productivity, but I've found it works best when you constrain it as much as reasonably possible. Keep it focused on 1 or 2 changes max per prompt, and those changes should hint at how you want them done.
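Something like this, for instance (the file name and path are made up; it's just the shape of a constrained prompt):

```text
In backup.py, modify write_output() so that the finished file is also
copied to /var/backups/. Change only that function -- no other files,
no refactoring, no style changes.
```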
It's a bit like micromanaging an employee....
> It's a bit like micromanaging an employee....
Spot on. And depending on the employee, this can lead to great outcomes, but often, you're left frustrated and annoyed.
Your example also reminds me of the limitations of its context window. I really should avoid hour-long coding sessions without some reset in the middle.
I also get lazy in terms of model choice. I've had decent experience with ChatGPT, and thus barely followed or tried other models. But I'll keep the Claude rec in mind for coding then. Just don't feel like hopping constantly between models to choose the best one for the task at hand. But then I get the results I'm complaining about, so that's on me.
reply
86 sats \ 6 replies \ @freetx 15h
> Just don't feel like hopping constantly between models to choose the best one for the task at hand.
I completely understand. This is where things like Cursor come into play. It provides you with multiple different models (Perplexity does this as well, but it's not an IDE).
In addition, Cursor gives you "modes", in which you can define individual base prompts and specify a model per mode. So you can say, for instance: "In Ask Mode you should never change code, this is for brainstorming and planning only" (and choose ChatGPT for that). Then you can have "Bug Fix Mode" and choose Claude 3.5 for that, etc.
See pic
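Roughly what the modes in the pic amount to (my wording here is hypothetical; the modes themselves are set up through Cursor's UI):

```text
Ask Mode       (model: a ChatGPT model)
  Never change code. This mode is for brainstorming and planning only.

Bug Fix Mode   (model: Claude 3.5)
  Make the smallest change that fixes the reported bug. Touch only the
  files named in the prompt; no refactoring, no style changes.
```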
reply
209 sats \ 2 replies \ @optimism 14h
I have experimental pipelines with fast-agent running locally - though I may switch frameworks or code something myself - that can do these things too. You simply pre-program the optimal model per agent, the prompting, and so on.
For example, in my current experimental setup I use a large qwen3 on leased compute for analysis or for walking a pre-collected code graph through MCP, then use mistral 32b locally to code a prototype, call pylint through MCP, fix issues, and so on.
It works okay-ish if you define small enough actions and then just loop.
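The loop shape is roughly this (a plain-Python sketch, not fast-agent's actual API; `call_model` is a hypothetical stub for the local model):

```python
import subprocess
import tempfile

MAX_ROUNDS = 5

def call_model(prompt: str) -> str:
    """Hypothetical stub for the local coding model (e.g. mistral 32b
    behind an OpenAI-compatible endpoint)."""
    raise NotImplementedError

def lint(source: str) -> str:
    """Write the generated code to a temp file and return pylint's report."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run(["pylint", path], capture_output=True, text=True)
    return result.stdout

def code_fix_loop(task: str) -> str:
    """Generate a prototype, then loop: lint -> feed issues back -> regenerate."""
    source = call_model(f"Write a Python prototype for: {task}")
    for _ in range(MAX_ROUNDS):
        report = lint(source)
        if "rated at 10" in report:  # crude check for pylint's perfect-score line
            break
        source = call_model(
            f"Fix these pylint issues without changing behavior:\n{report}\n\nCode:\n{source}"
        )
    return source
```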
reply
110 sats \ 1 reply \ @freetx 14h
There is so much to learn.
I actually think we're eventually going to be able to self-host most of this. I think coding models will top out where their incremental usefulness starts slowing down and commodity hardware catches up (I've been watching the AMD AI MAX+ 395 setups).
Sure, the top-end models will keep being impressive, but eventually everything becomes a commodity....I mean, in the early days of smartphones it was practically a necessity to upgrade from iPhone 1 to 2 to 3, as each change was huge. Now a person could reasonably use an iPhone 10 even though it's going on a decade....these eventually become "solved problems".
reply
21 sats \ 0 replies \ @optimism 14h
Agreed.
On principle, I don't use any LLM plan and have no middlemen in my setup. They shall not steal my data, they shall not mess with the output, and they shall definitely not know what I'm coding. Because fuck these guys. They aren't players with your best interest at heart.
So yeah: everything sovereign. I wish there were a larger version of llama3.2 or a distilled version of llama4, because the small models, despite nice, clean instruct tuning, still hallucinate too much to do analysis, and I can't run the big ones on an Apple M4.
reply
Cool! And so with the $20/month plan, it doesn't matter which model I use for each query? Need to think whether I should switch my current ChatGPT subscription to Cursor then. 500 requests per month does not sound like a lot. How is your experience with the "slow pool"? Anyhow, I'll look into it. Tnx
reply
143 sats \ 1 reply \ @freetx 15h
They have recently added "MAX" plans which allow for greater use. They occasionally rate-limit you based on how heavily the models are being used, but in my experience that only happens rarely.
Personally, I've been investigating moving to a completely open-source setup, which would be something like: VS Code + the Cline or RooCode extension + the Requesty service.
This would simulate Cursor. Requesty is an API service (like OpenRouter) that gives you access to different models. So in that case you would load $X on Requesty and use VS Code + the extension you want (note: RooCode broke off from Cline, but they share lots of similar features).
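Under the hood these router services all speak the OpenAI-compatible API, so switching models is just one parameter. A minimal sketch (base URL, key, and model name are illustrative, not Requesty's real values):

```python
from openai import OpenAI

# Any open-router-style service exposes an OpenAI-compatible endpoint;
# check the provider's docs for the real base URL and model names.
client = OpenAI(
    base_url="https://router.example.com/v1",
    api_key="YOUR_ROUTER_KEY",
)

resp = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # pick a different model per task, same key
    messages=[{"role": "user", "content": "Write a one-line docstring for a file-copy helper."}],
)
print(resp.choices[0].message.content)
```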
Outside the IDE, I really, really like Perplexity. It's basically my new search engine. If Perplexity ever releases an IDE plugin for VS Code, I would strongly consider dumping everything and just using them.
The biggest benefit of Perplexity is that it includes top-notch real-time web search, so it's much more useful for day-to-day tasks.
reply
Stop giving me such detailed and useful answers, I must keep rewarding you with sats~~
reply
Yeah, I’ve noticed the same: the model can’t stick to a single mental model across a session. I started saving a style snapshot after the first clean output and just re-prompting with “follow this style, nothing else.” It keeps me from ending up with a Frankenstein codebase.
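Something like this, for instance (hypothetical wording for the snapshot):

```text
STYLE SNAPSHOT (saved after the first clean function):
- one self-contained function per task, no splitting into helpers
- type hints on every signature, short docstrings
- raise exceptions, don't return error codes

Follow this style exactly. Nothing else.
```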
reply