
First of all, using ChatGPT (the paid version, but not the most expensive one) has been a net positive for my work. I use it a lot to code simple scripts that I know how to write, but that would take me 10 times longer to code myself. It has also introduced me to CS algorithms I didn't know of, which let me do some things in much cleaner ways. And for writing research proposals it's of great help, if only to save time.
But, whereas I used to really check every detail it provided me with, I slowly became complacent and started to trust it blindly, more and more. Then, its limitations hit me, several times, in more or less painful ways.
  • I am working on a paper for Science, which, by definition, goes beyond the state of the art. In it, some quite technical material is discussed. I asked ChatGPT a straightforward question: is method A or method B considered a stricter version of the other (mathematically)? It very confidently told me method A is stricter than B. However, I know it is the opposite; I worked with some of the people who developed these methods. I told ChatGPT it was wrong. For another 15 minutes, it kept arguing that it may look like that, but that I am wrong and it is right. It did not want to admit its mistake. Because of this, I started questioning many of its previous statements...
  • Another time, I asked it a pretty basic question about a widely used piece of software in my field. I was too lazy to Google it. I took it at its word and built tens of simulations on top of the answer it gave me. Only today did I realize it completely hallucinated the answer to my question. I confidently reported on my simulations to colleagues, but now have to backtrack on several of those statements. And it does not look good when I have to admit it's because I relied on ChatGPT instead of doing the basic checks myself.
  • Something pretty similar to the previous bullet point happened a week or so ago. I spent a whole weekend implementing some new stuff, with ChatGPT helping me along the way. But as it was new stuff that I did not yet know how to code myself, it took me a while to catch ChatGPT's mistakes and hallucinated assumptions. This, too, led me to report some shamefully wrong results to colleagues.
  • A recent report by a Nature referee (supposedly, one of the best in the field) clearly showed they had used AI for their report. The em dash, for one, but also probably the stupidest question about a feature in my data that no one who works in this field would even dare to ask, as it is so obvious... but ChatGPT would suggest asking such a question, for sure.
I have no proof for this last one, but it is quite a likely conclusion based on my own experience using LLMs.
Many times, I realize I cannot trust junior colleagues anymore. They face the same AI limitations that I do, but are even less able to spot that ChatGPT can and will hallucinate when it hasn't been trained on the appropriate data for some questions. I often get annoyed by the sloppiness of their code, but can I even blame them if I end up making the same mistakes because of AI?
I am mostly writing this to reflect on my use of AI. It's useful, for sure, but at what cost... I really need to implement a better process so that I can benefit from it without quietly getting f'd before I realize it.
My first instinct now would be to copy and paste this block of text into ChatGPT and ask it to review it for clarity and flow. It'd do a great job, I know, but not today... not today.
So, my question to you: have you had bad experiences using AI?
I would not say I have had a "bad" experience with AI, because I have always been conscious that it will unavoidably hallucinate: for one, because this is constantly reported both by users and by the companies behind the models, and more fundamentally because I know that if we ourselves can get things wrong when doing research, it's simply impossible for an AI not to do the same. Too many variables on the fly.
I did make some basic tests with questions I knew the answer to, and while the models have made great progress very fast, they still fall way too short of being really usable.
However, I do like to sometimes ask the AI about certain subjects and problems, because even when I know it will get them wrong, 50% of the time its slop gives me a clue about something, and 50% of that 50% of the time it even gives me a clue indirectly, simply because the process of reviewing the slop sparks my mind in the right direction.
With that careful use, my experience has always been positive so far, although this procedure implies minimal use.
reply
400 sats \ 9 replies \ @freetx 10h
So, my question to you: have you had bad experiences using AI?
Somewhat similar outcomes using Cursor to write sysadmin code (Ansible, Python, etc.).
Generally speaking I only use GPT models for things like helping write README.md or INSTALL.txt instructions.
For actual code, I consider Claude and Gemini to be the best (least hallucinations + greatest ability to handle complex code bases)....perhaps Cursor also changes some of the parameters to help minimize these ill effects as well (not sure).
Although I've only had a few rare outright hallucinations, much more prevalent are 2 occurrences that I encounter regularly:
  1. You give the AI model a rather direct and constrained instruction (modify this code to also copy its output to this directory)....and the AI model will spontaneously decide that it needs to change 5 things in 5 different files to accomplish that, when in fact it was a rather simple change to the existing line of code. I find that Claude 3.5 does this the least....so when I'm in "surgical mode" I tend to rely on it.
  2. Between runs, AI models will sometimes decide they want a different architectural approach or code style. So you write Function 1 and test / debug it and it works fine. Then when you come to write Function 2, the model chooses a completely different coding style and a somewhat different "philosophical" approach to the task. This can be very frustrating, because maybe for Function 1 it decided the entire function should be self-contained, but when Function 2 is to be written, it decides it needs to break the task up into 4 different sub-tasks....the result is that you get what's called "AI slop", that is, a mishmash of different coding styles and approaches.
AI has definitely improved my productivity but I've found its best use is to constrain it as much as can be reasonably done. Keep it focused on 1 or 2 changes max per prompt and those changes should hint how you want it done.
It's a bit like micromanaging an employee....
reply
It's a bit like micromanaging an employee....
Spot on. And depending on the employee, this can lead to great outcomes, but often, you're left frustrated and annoyed.
Your example also reminds me of the limitations of its context window. I really should avoid hour-long coding sessions without some reset in the middle.
I also get lazy in terms of the choice of model. I've had a decent experience with ChatGPT, and thus barely followed or tried other models. But I will keep the Claude rec in mind for coding then. Just don't feel like hopping constantly between models to choose the best one for the task at hand. But then I get the results I am complaining about, so that's on me.
reply
86 sats \ 6 replies \ @freetx 10h
Just don't feel like hopping constantly between models to choose the best one for the task at hand.
I completely understand. This is where things like Cursor come into play. It provides you with multiple different models (Perplexity does this as well, but it's not an IDE).
In addition, Cursor gives you "modes", for which you can define individual prompt bases and specify a model per mode. So you can say, for instance: "In Ask Mode you should never change code, this is for brainstorming and planning only" (and choose ChatGPT for that). Then you can have "Bug Fix Mode" and choose Claude 3.5 for that, etc.
See pic
reply
209 sats \ 2 replies \ @optimism 10h
I have experimental pipelines with fast-agent running locally that can do these things too (though I may switch frameworks or code something myself). You simply pre-program the optimal model per agent, the prompting, and so on.
For example, in my current experimental setup I use a large qwen3 on leased compute for analysis or for walking a pre-collected code graph through MCP, and then use mistral 32b locally to code a prototype, call pylint through MCP and fix issues, and so on.
It works okay-ish if you define small enough actions and then just loop.
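To give an idea of what "pre-program the optimal model per agent and loop" can look like, here is a stripped-down sketch. It is not my actual pipeline and not the fast-agent API: it just assumes both models sit behind OpenAI-compatible chat endpoints (llama.cpp, vLLM, Ollama, etc.), and all URLs, model names and paths are placeholders.

# Sketch of an analysis-agent -> coding-agent loop; URLs and model names are placeholders.
import subprocess
import requests

AGENTS = {
    "analysis": {"url": "http://leased-box:8000/v1/chat/completions", "model": "qwen3-large"},
    "coding": {"url": "http://localhost:8080/v1/chat/completions", "model": "mistral-32b"},
}

def ask(agent: str, prompt: str) -> str:
    """Send one prompt to the model pre-assigned to this agent."""
    cfg = AGENTS[agent]
    resp = requests.post(cfg["url"], json={
        "model": cfg["model"],
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def lint(path: str) -> str:
    """Non-LLM step: plain pylint; only its report goes back into the loop."""
    return subprocess.run(["pylint", path], capture_output=True, text=True).stdout

task = "Prototype a parser for the pre-collected code graph"
plan = ask("analysis", f"Break this into one small coding step: {task}")
code = ask("coding", f"Write prototype.py for this step only, code only:\n{plan}")
with open("prototype.py", "w") as f:
    f.write(code)
report = lint("prototype.py")
if report:
    code = ask("coding", f"Fix these pylint findings:\n{report}\n\nCode:\n{code}")
    with open("prototype.py", "w") as f:
        f.write(code)

A real setup obviously has to strip markdown fences from model output, manage context, and so on; the point is only that each agent's model and prompt are fixed up front and the loop stays small.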
reply
110 sats \ 1 reply \ @freetx 10h
There is so much to learn.
I actually think eventually we are going to be able to self-host most of this. I think coding models will eventually top-out where their incremental usefulness starts slowing down and commodity hardware catches up (I've been watching the AMD AI MAX+ 395 setups).
Sure, the top-end models will keep being impressive, but eventually everything becomes a commodity....I mean, in the early days of smartphones it was practically a necessity to upgrade from iPhone 1 to 2 to 3 ... as each change was huge. Now a person could reasonably use an iPhone 10 even though it's going on a decade....these eventually become "solved problems".
reply
21 sats \ 0 replies \ @optimism 9h
Agreed.
Out of principle, I don't use any LLM plan and have no middlemen in my setup. They shall not steal my data, they shall not mess with the output, and they shall definitely not know what I'm coding. Because fuck these guys. They aren't players with your best interest at heart.
So yeah: everything sovereign. I wish there were a larger version of llama3.2 or a distilled version of llama4, because the small models, despite nice and clean instruct, still hallucinate too much to do analysis, and I can't run the big ones on an Apple M4.
Cool! And so with the 20 dollar/month plan, it doesn't matter which model I use for each query? Need to think whether I should switch my current ChatGPT subscription to Cursor then. 500 requests per month does not sound like a lot. How is your experience with the "slow pool"? Anyhow, I'll look into it. Tnx
reply
143 sats \ 1 reply \ @freetx 10h
They have recently added "MAX" plans which allow for greater use. They occasionally rate-limit you based on how heavily the models are being used, but in my experience that only happens rarely.
Personally, I've been investigating moving to a completely open-source stack, which would be something like: VSCode + the Cline or RooCode VS Code extensions + the Requesty service.
This would simulate Cursor. Requesty is an API service (like open-router) that gives you access to different models. So in that case you would load $X dollars on Requesty and use VSCode + the extension you want (note: RooCode broke off from Cline but they both share lots of similar features).
In non-IDE mode, I really, really like Perplexity. It's basically my new search engine. If Perplexity ever releases IDE plugins for VS Code, I would strongly consider dumping everything and just using them.
The best benefit of Perplexity is that it includes top-notch real-time web search, so it's much more useful for day-to-day tasks.
reply
Stop giving me such detailed and useful answers, I must keep rewarding you with sats~~
Yeah, I've noticed the same: the model can't stick to a single mental model across a session. I started saving a style snapshot after the first clean output and just re-prompt with "follow this style, nothing else." Keeps me from ending up with a Frankenstein codebase.
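The "snapshot" doesn't need to be anything fancy; a rough sketch of the idea (the snapshot text and helper below are made up for illustration, not any tool's API):

# Sketch of the style-snapshot trick: save the rules once, restate them on every prompt.
STYLE_SNAPSHOT = """\
- self-contained functions, no globals
- type hints on public functions
- raise errors, never swallow them silently
"""

def build_prompt(task: str, snapshot: str = STYLE_SNAPSHOT) -> str:
    """Prepend the saved style snapshot so every request restates the rules."""
    return (
        "Follow this coding style and nothing else:\n"
        f"{snapshot}\n"
        f"Task: {task}"
    )

print(build_prompt("Add a retry wrapper around the download step"))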
reply
Your comments about the referees and junior colleagues are what scare me most.
It seems like we're gonna have a harder time trusting each other that we're interacting with a real human intelligence and not ChatGPT. Because ChatGPT does a good enough job of simulating human intelligence most times, it's easy for the human side of us to get lazy and use ChatGPT as a substitute. I'm not optimistic that we can prevent this from occurring.
reply
True. I feel like I can find a good balance because I know and have experienced the before times, but some junior colleagues already feel ill-equipped at this point to deal with AI hallucinations.
What's interesting, too, is that some journals seem to approach the use of AI with a "don't ask, don't tell" mentality. I am still rarely asked during submission whether I used AI in the creation of an article.
reply
I think this is already occurring, tbh, and I don't think there's a way back.
reply
136 sats \ 0 replies \ @teremok 5h
If you were a top 1% programmer in the world (I consider myself to be one in my area, frontend; my salary also tells me so), you would be surprised how often this happens at the cutting edge of my field too.
I often had to correct my colleagues when AI made an error.
It's not bad. I know juniors are sh*tting their pants. But reaaally complex stuff, naaw man
reply
A LOT! So I was making this somewhat basic Java Spring Boot app and I told ChatGPT to review it.
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.stereotype.Service;

@SpringBootApplication
public class Base {
    public static void main(String[] args) {
        SpringApplication.run(Base.class, args);
    }
}

// Controller Layer
@RestController
class GreetingController {

    private final GreetingService greetingService;

    // Constructor-based Dependency Injection
    public GreetingController(GreetingService greetingService) {
        this.greetingService = greetingService;
    }

    @GetMapping("/greet")
    public String greet(@RequestParam(value = "name", defaultValue = "World") String name) {
        return greetingService.getGreetingMessage(name);
    }
}

// Service Layer
@Service
class GreetingService {
    public String getGreetingMessage(String name) {
        return "Hello, " + name + "! Welcome to the BPE Window";
    }
}
This app is just supposed to send a hello message, as you can see in the GreetingService class. And when ChatGPT gave its output, it was 200+ lines of code, wrapped in error handling blocks, with 20+ variables (whereas mine has only 1) and things I hadn't even learned yet, just to make it "ERROR FREE". Like wtf man! I asked you to check whether it works or not and you just gave me a whole thesis on it.
Then again, one day I asked it another science question, a pretty simple one: how to prepare fresh ferrous sulphate, because apparently I could find how to make ferrous sulphate on the internet, but not "fresh". And it replied with a flowchart of how to create ferrous sulphate from pyrites. So I straightaway asked my chem teacher, and he said it's a reaction of iron and sulphuric acid, filtered afterwards, and rebuked me for not studying the basics thoroughly. But it wasn't MY fault! I WAS MISLED!
I am working on a paper for Science, which, by definition, goes beyond the state-of-the-art
Can you brief about it? I might understand something :) Or is it top-secret?
reply
Currently I'm pivoting my setup
from:
Synchronous: letting the LLM run unstructured with different models in a pipeline
to:
Asynchronous:
  1. LLM generates code or human writes it - doesn't matter - and uploads to repo
  2. Issue detection:
    • linting logs one issue per error found (sketched below, after this list)
    • If none are found, the LLM can analyze and create an issue for the most significant problem. I specifically write the prompt to instruct it, repeatedly, to only report the most significant issue. Works OK with an LRM
    • Users can of course add issues too; the LLM analyzes whether it's a one-shot or needs a breakdown
  3. Coding LLM can ingest issue and fix it with a pull req
  4. Pull req can get reviewed by LRM or human
  5. Human merges
Everything that can be done with code, like linting, does not use LLMs.
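For the curious, a stripped-down sketch of that linting pass in step 2: one issue per pylint finding, and the analysis model only gets a turn when the linter comes back clean. file_issue is a placeholder for whatever your tracker exposes (gh CLI, a REST call, etc.); only the pylint invocation is real.

# One tracker issue per linter finding; no LLM involved in this step.
import json
import subprocess
import sys

def file_issue(title: str, body: str) -> None:
    """Placeholder: swap in `gh issue create`, a Gitea/GitHub API call, etc."""
    print(f"ISSUE: {title}\n{body}\n")

def lint_to_issues(path: str) -> int:
    out = subprocess.run(
        ["pylint", "--output-format=json", path],
        capture_output=True, text=True,
    ).stdout
    findings = json.loads(out) if out.strip() else []
    for f in findings:
        file_issue(
            title=f"{f['symbol']} in {f['path']}:{f['line']}",
            body=f["message"],
        )
    return len(findings)

if __name__ == "__main__":
    if lint_to_issues(sys.argv[1]) == 0:
        # Only now would the analysis LRM get a turn, and only to report
        # the single most significant issue it can find.
        print("lint clean; hand off to the analysis model")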
reply
Damn, I just wanted the AI to check if my plant was alive and it built a greenhouse with a self-watering system and an AI-powered scarecrow.
I swear, sometimes ChatGPT doesn’t review code — it rewrites it like it's auditioning for a job at NASA. Like bro, I’m still trying to survive public static void main, not orchestrate microservices across a Kubernetes cluster.
Same thing with chemistry — I asked for a fresh ferrous sulphate recipe and got a mining operation flowchart straight outta a metallurgy PhD thesis. Asked my chem teacher and he just said “use Fe + H₂SO₄ and move on.”
It's like these LLMs read Thus Spoke Zarathustra and thought every answer must ascend the mountain of abstraction before descending to meet us mortals.
“He who climbs upon the highest mountains laughs at all tragedies, real or imagined." — Nietzsche (Clearly what GPT thinks before it answers a 4-mark question.)
But fr tho, loving that async pivot you're on @optimism. Turning LLMs from noisy sidekicks into focused bug-hunters with issue-detection filtering? That’s pretty GOOD
Don't worry, I won't steal your repo, I'm building a Human Behaviour Prediction Engine too, https://github.com/axelvyrn/TiresiasIQ (and it's quite good, believe me; I'd like your input).
Also, curious: How are you ranking issue significance without it hallucinating a crisis over a missing semicolon?
reply
How are you ranking issue significance
It doesn't matter. Every task should be small, or otherwise needs breakdown.
without it hallucinating a crisis over a missing semicolon?
It's harder to make it "just fix a semicolon", so in that case using non-LLM tools is better, or at least exposing the tools needed to the LLM through MCP. Syntax fixing can be done with existing tools, so in this case you just expose an MCP tool, e.g. code_fixing::correct_semicolons(files[]), that implements the syntactical logic in code, without needing the LLM to actually write correct code.
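As a rough illustration (assuming the Python MCP SDK's FastMCP helper; correct_semicolons is the hypothetical tool from above, and its body here is a naive stand-in, not a real JS parser):

# Expose a deterministic fixer to the LLM as an MCP tool; the model calls the
# tool instead of trying to edit syntax itself.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("code_fixing")

@mcp.tool()
def correct_semicolons(files: list[str]) -> dict[str, int]:
    """Append a missing trailing semicolon to simple statement lines (naive stand-in)."""
    fixed_counts: dict[str, int] = {}
    for path in files:
        with open(path, encoding="utf-8") as fh:
            lines = fh.readlines()
        fixed = 0
        for i, line in enumerate(lines):
            stripped = line.rstrip()
            # Skip blanks, comments, block delimiters and already-terminated lines.
            if stripped and not stripped.startswith("//") and not stripped.endswith((";", "{", "}", ",")):
                lines[i] = stripped + ";\n"
                fixed += 1
        with open(path, "w", encoding="utf-8") as fh:
            fh.writelines(lines)
        fixed_counts[path] = fixed
    return fixed_counts

if __name__ == "__main__":
    mcp.run()

In practice you'd back the tool with a proper formatter; the shape is what matters: deterministic logic behind a tool call, and the LLM only decides when to call it.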
reply
like using standard --fix to lint .js files?
Can you brief about it? I might understand something :) Or is it top-secret?
Will submit it to the editor this week or next week, hopefully. I can send you the paper once it's published. For now, I prefer not to dox myself too much by giving details about the field I work in. Even though some people here already know more than is good for me to remain anon~~
reply
0 sats \ 0 replies \ @Cje95 1h
I have to say that I have found Claude to be much better for me than ChatGPT. Between my committee and the House Bipartisan AI Taskforce from last year, it seemed like we met with Anthropic every few weeks, and Claude just always seemed to outperform. Nothing ever too flashy, but much more reliable.
reply
Nice post! AI, while useful, is still not sophisticated enough for complex tasks.
reply