I really like that they didn't do the lazy "GPT gief bench" but actually went through the human labor to create a great set.

Maybe this is why it works.

I'm not 100% convinced, because it doesn't show GPT-5's regression in IF, which didn't just come from the bench but was also observed in usage. But maybe that is because it is still just one method; rigorous testing means applying multiple frameworks, as is done for device safety, for example.
