
"People often speculate about AI’s broader impact on society, but the clearest way to understand its potential is by looking at what models are already capable of doing."

In case you missed it, amidst all the talk about whether AI is plateauing, OpenAI released a new evaluation paper measuring AI progress versus human counterparts.
Previous AI evaluations, like challenging academic tests and competitive coding challenges, have been essential in pushing the boundaries of model reasoning, but they often fall short of reflecting the kinds of tasks many people handle in their everyday work.

"GDPval focuses on tasks based on deliverables that are either an actual piece of work or product that exists today or are a similarly constructed piece of work product. "

I like the idea of evaluating AIs on real-world tasks, rather than made-up tests. And it seems they graded the models blind against human experts in the tasks' respective fields. Here's how they describe the grading:
To evaluate model performance on GDPval tasks, we rely on expert “graders”—a group of experienced professionals from the same occupations represented in the dataset. These graders blindly compare model-generated deliverables with those produced by task writers (not knowing which is AI versus human generated), and offer critiques and rankings.
we ran blind evaluations where industry experts compared deliverables from several leading models—GPT‑4o, o4-mini, OpenAI o3, GPT‑5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4—against human-produced work. Across 220 tasks in the GDPval gold set, we recorded when model outputs were rated as better than (“wins”) or on par with (“ties”) the deliverables from industry experts, as shown in the bar chart below. Claude Opus 4.1 was the best performing model in the set, excelling in particular on aesthetics (e.g., document formatting, slide layout), and GPT‑5 excelled in particular on accuracy (e.g., finding domain-specific knowledge). We also see clear progress over time on these tasks. Performance has more than doubled from GPT‑4o (released spring 2024) to GPT‑5 (released summer 2025), following a clear linear trend.
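To make the scoring concrete, here's a minimal sketch of how a blind pairwise win/tie rate like the one described above could be computed. The record format and field names are my own assumptions for illustration, not OpenAI's actual pipeline:

```python
from collections import Counter

# Hypothetical grader records: for each task, a blinded grader compared a
# model deliverable against the human expert's (order randomized) and
# recorded which was better, or a tie.
judgments = [
    {"task": "legal_brief_017", "model": "gpt-5", "outcome": "model_win"},
    {"task": "cad_drawing_042", "model": "gpt-5", "outcome": "tie"},
    {"task": "care_plan_103", "model": "gpt-5", "outcome": "human_win"},
    # ... one record per (task, model) comparison in the gold set
]

def win_or_tie_rate(judgments, model):
    """Fraction of tasks where the model's deliverable was rated better
    than ("win") or on par with ("tie") the human expert's."""
    counts = Counter(j["outcome"] for j in judgments if j["model"] == model)
    total = sum(counts.values())
    return (counts["model_win"] + counts["tie"]) / total if total else 0.0

print(f"win+tie rate: {win_or_tie_rate(judgments, 'gpt-5'):.0%}")  # 67% on this toy data
```

On this reading, 50% is the point where graders can no longer tell which side did the better work.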
I read these results as saying that Claude Opus is on its way to being as good as a human at real world tasks -- although Zvi Mowshowitz points out: "Crossing 50% does not mean you are better than a human even at the included tasks, since the AI models will have a higher rate of correlated, stupid or catastrophic failure."
Here's an interesting X thread about the project: https://x.com/tejalpatwardhan/status/1971249532588741058
Also, OpenAI is publishing all their evaluation papers at evals.openai.com
"Crossing 50% does not mean you are better than a human even at the included tasks, since the AI models will have a higher rate of correlated, stupid or catastrophic failure."
This is why I feel that this is all a sales pitch. Also, I don't hire in the bottom 50%. I hire in the top 2%. Get 100 resumes, burn 98, invite 2, hire 1.
The other thing is that I'd be Gell-Mann-amnesia-style betraying my own conscience by believing this, as just this morning I got code that didn't work when I tried something. And it wasn't even that hard to do right. So expert level? No. Only if you're a lil yolo bitch with a big mouth on Twitter who calls themselves an expert. In that case, you shall lose your internet credits. Preferably yesterday.
Well, a lot of this really depends on who the "expert" humans in the blind test were.
I read 50% on the chart to mean that it's a coin toss whether graders thought the human or the AI did better work. Less than 50% means graders tended to rank the AI as doing worse work than the humans. Greater than 50% means they tended to rank the AI as doing better work than humans.
So the important factor is: were the humans the AI was graded against "top 2%" kind of people?
Also, the point about AI failure being more likely to be catastrophic is valid.
Finally, I'd say I have no doubt that openAI is pumping their own bags with a sales pitch in every piece of info they put out. But even so, there is something here.
It feels to me like when social media was bursting onto the scene. I mostly dismissed it because I didn't see the utility and I didn't trust the promoters. Yet lately I've come to see that there may be some utility here. It may be an open question whether it is a net benefit, but it certainly is a powerful tool to do something. I see AI in the same light (and perhaps I'm just scared of repeating what I now see as a mistake in my attitude toward social media).
If you're curious about how they do these kinds of tests, there are crowdsourced ones for video and image generation; see, for example, this video "arena". It's basically A/B testing, and it's subjective. (Yes, also with LLM-as-judge; then it's simulated subjectivity.)
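For what it's worth, these arenas typically aggregate the A/B votes into a leaderboard using an Elo-style rating. Whether the linked arena does exactly this is my assumption, but the mechanism itself is simple:

```python
def elo_update(r_a, r_b, winner, k=32):
    """One Elo-style rating update from a single A/B vote between
    models A and B. winner: "a", "b", or "tie"."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))   # predicted score for A
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]  # actual score for A
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins the vote and gains rating.
print(elo_update(1000, 1000, "a"))  # (1016.0, 984.0)
```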
Greater than 50% means they tended to rank ai as doing better work than humans.
I'd challenge that this is better-looking work, not better work in an absolute sense. Since an LLM can generate faster than a human, it will always give a more complete output, but it is much more error-prone.
If you want profound results with LLMs, which is absolutely possible, then you give a human expert access to an LLM. This is because a human expert knows how to approach a problem and ask the right follow-up questions. LLMs are very much garbage-in, garbage-out systems, and most have ingested tons of garbage; the prompt is the filter (with a bit of luck), and this is why "prompt engineering" is a thing, be it a dumb thing.
Now the question is: can we train LLMs to an acceptable level of proficiency in directing other LLMs (train them in management, QA, and so on)? I think the answer is yes. I even think this is a worthy goal, and I'd like to have it. Preferably running on one of those NVMe tensor cards, and then we just build an RPi cluster of workers: build the hive. I think it'd be awesome, but I think concepts like "AI as persona", "AI is smarter than humans", "AGI", "buy muh subscription", and "we be serving you ads" are distractions from creating better tooling.
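For what the "hive" could look like in code, here's a toy sketch. It's entirely my own assumption of the architecture, with `complete()` as a stand-in for whatever local inference endpoint each worker node would expose:

```python
# Toy manager/worker loop: one model plans and reviews, cheaper workers
# execute. complete(role, prompt) is a placeholder for a local LLM call
# (e.g. an HTTP request to an RPi node serving a small model).

def complete(role: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your local inference endpoint")

def run_task(task: str, max_rounds: int = 3) -> str:
    # Manager decomposes the task, a worker executes, QA gates the result.
    plan = complete("manager", f"Break this task into concrete steps:\n{task}")
    draft = complete("worker", f"Execute this plan:\n{plan}")
    for _ in range(max_rounds):
        verdict = complete("qa", f"Task: {task}\nDraft:\n{draft}\n"
                                 "Reply APPROVED, or list concrete fixes.")
        if verdict.strip().startswith("APPROVED"):
            break
        draft = complete("worker", f"Apply these fixes:\n{verdict}\n"
                                   f"to this draft:\n{draft}")
    return draft
```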
It feels to me like when social media was bursting onto the scene. I mostly dismissed it because I didn't see the utility and I didn't trust the promoters. Yet lately I've come to see that there may be some utility here. It may be an open question whether it is a net benefit, but it certainly is a powerful tool to do something.
Can you elaborate on this, specifically:
  • What social media platform?
  • Can you define what you mean by utility?
Can you elaborate on this, specifically
Platform: Let's start when Facebook and then Twitter came out. My impression at the time was that these were ego-stroking tools for people who had nothing better to do. This isn't because I didn't like tech or the internet -- blogging was one of my first loves (probably even before girls). But I dismissed social media for a long time because I generally don't like self-aggrandizing, and I assumed that's what the people who used social media used it for.
My memory of when these things got started (especially headliners like Facebook and Twitter) is that there was more emphasis on social than on media. I regret that I didn't spend time learning how to use them in those early stages, or ask myself how the world would look if social media stuck around and even became integral to daily life, rather than being reluctantly dragged into usage before realising: ah, there is something here.
Utility: In the last decade or so, social media tools (forums like this one, Reddit (less so), things like X or LinkedIn, Telegram -- I've never yet found a use for Facebook) have been hugely useful to me. First, as tools for learning. Second, for helping me build relationships. Third, I've gotten all my jobs and gigs through social media, as has my wife. Even though there is a huge amount of fluff, the connective power of social media, its massive availability of information, and its door-opening access to people is a wonder.

Your points about better-looking work and the distraction of AI personas or AGI faff are good. And, if anything, that's the corrective tone a society that currently understands AI as a sci-fi character rather than a tool certainly needs.
Comparing AI tools to humans to see who is "better" doesn't seem useful, but comparing them to us to figure out what AI might be good at does.
Finally: "buy muh subscription", "we be serving you ads" -- this is where my pessimism comes in. I don't see how we end up anywhere else. General populations, even businesses, have demonstrated they mostly do not want to run their own infrastructure. The only exception to this is routers: people seem willing to plug in a device, but are very unlikely to change any settings or do something like flash their own firmware onto it.

If email can be taken as a model, there is no world where most people run their own models, nor one where they even care to. This is true for Bitcoin as well. Unless we can find an incentive that motivates more than a few crazy individuals to desire control over the tools they use (perhaps if the censorship state comes quick and hard, it would generate enough of a backlash to create a culture of people who care about personal sovereignty in their devices), we'll probably end up with captured AI, like captured email. I suppose the captured-AI future looks even uglier than the captured-email future and doesn't leave us in a very nice state.
I was on FB and LinkedIn rather early, before there were a million users on either: a friend told me it was cool in the former case, and a colleague was moving there in the latter. Both felt novel and cool at the time, and it was definitely more social than networking on FB, but more networking than social on LinkedIn. I am no longer on either. Twitter became useful for me around 2010, I think. I've had great DM conversations there and definitely some social discovery through just reading what people are up to, but the algo screwed things up for me. Left that too.
I do get the gig part -- I scored 3 jobs and found a co-founder through Reddit over a decade ago, before it was shitty. But honestly, I've gotten way more gigs from having drinks at conference afterparties, maybe 5x as many, even though I've spent much more time on Reddit alone than at conferences, including the boring parts where you're not drinking. So I'm rather skeptical about the efficiency of social media and whether it lives up to the promises made. Too much noise, not enough signal.

General populations, even businesses, have demonstrated they mostly do not want to run their own infrastructure. [..]
Mind you, I'm not advocating against services. I'm advocating against closed-source SaaS. There are plenty of third-party providers of open-source software. Even VPSes: you don't need to run your own Xen hypervisor or k8s (or both); there are plenty of providers for this.
[..] perhaps if the censorship state comes quick and hard, it would generate enough of a backlash to create a culture of people who care about personal sovereignty in their devices
This has already started in Europe. I thought it was just small businesses, but I've been amazed to learn of some major businesses working on completely de-SaaS-ing, mostly moving away from US dependencies. My main worry is reserved for those in emerging or dependent economies, as they don't have as many options, and a dependency on an EU corporation is as bad as a dependency on a US one. I still think we can make it work, but it's going to be a tough couple of years ahead.
I forgot to mention: hiring the top 2% is a European thing. In the US my main hiring criterion was "attitude is everything". I'd probably not hire an LLM in the US even if it's smarter than any human, because:
This is the key sentence that kinda makes this evaluation not super useful:
Additionally, in the real world, tasks aren’t always clearly defined with a prompt and reference files; for example, a lawyer might have to navigate ambiguity and talk to their client before deciding that creating a legal brief is the right approach to help them. We plan to expand GDPval to include more occupations, industries, and task types, with increased interactivity, and more tasks involving navigating ambiguity, with the long-term goal of better measuring progress on diverse knowledge work.
Part of the human expert's work is to define the problem and collect the relevant information needed. The AI didn't have to do any of that.
Moreover, the article didn't talk about whether the AI's work product was actually put into a production environment. For example, were the real estate listings actually posted automatically onto Redfin/Zillow? Another part of the human's work is to navigate the many different tools and platforms, conform inputs and outputs to the expected formats, and interoperate between many technologies. Not sure the AI can do that autonomously yet.