304 sats \ 0 replies \ @SimpleStacker 5h \ on: GDPval: Measuring the performance of our models on real-world tasks - OpenAI AI
This is the key sentence that kinda makes this evaluation not super useful:
Part of the human expert's work is to define the problem and collect the relevant information needed. The AI didn't have to do any of that.
Moreover, the article didn't say whether the AI's work product was actually deployed to a production environment. For example, were the real estate listings actually posted automatically to Redfin/Zillow? Another part of the human's work is to navigate many different tools and platforms, conform inputs and outputs to the expected formats, and interoperate between many technologies. Not sure if the AI can do that autonomously yet.