I'm not 100% convinced, because it doesn't show GPT-5's regression in instruction following (IF), which didn't just come from the bench but was also observed in usage. But maybe that's because it's still just one method. Rigorous testing means applying multiple frameworks, as is done for device safety, for example.
I really like that they didn't do the lazy "GPT gief bench" but actually went through the human labor to create a great set.
Maybe this is why it works.