68 sats \ 2 replies \ @optimism 22h
Another one of these and I can train the next frontier model on my RPi2B!
100 sats \ 1 reply \ @carter OP 22h
It seems like they are using existing statistical techniques to filter the data down to the examples that will be most impactful for training, and to pull good examples from each group... Very cool system
200 sats \ 0 replies \ @optimism 21h
I was looking at those kappa charts and wondered whether 0.38 for higher complexity and 0.56 for lower complexity is really a great result, if human experts only reach 0.78 and 0.81 among themselves.
I knew I'd seen a paper about this: https://arxiv.org/abs/2501.08167, but it's kinda stone age:
| Comparison | Percentage Agreement | Cohen's Kappa |
|---|---|---|
| Human vs Claude 2.1 Ratings | 79% | 0.41 |
| Human vs Titan Express Ratings | 78% | 0.35 |
| Human vs Sonnet 3.5 Ratings | 76% | 0.44 |
| Human vs Llama 3.3 70b Ratings | 79% | 0.39 |
| Human vs Nova Pro Ratings | 76% | 0.34 |
Looks awesome if we realize that Google's results were with a 3.25B model, but the evaluation data provided in the paper was "a mockup", so we don't know if this is apples-to-apples. Nevertheless, I'm a big fan of "less junk in".
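For anyone rusty on what these numbers mean: Cohen's kappa corrects raw percentage agreement for agreement expected by chance, which is why 79% raw agreement can collapse to a kappa of only 0.41. A minimal pure-Python sketch with toy labels (hypothetical example, not the paper's data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled the same.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance overlap given each rater's label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (po - pe) / (1 - pe)

# Toy ratings: 4/6 raw agreement (~67%), but kappa is much lower.
a = ["good", "good", "bad", "good", "bad", "good"]
b = ["good", "bad", "bad", "good", "bad", "bad"]
print(round(cohens_kappa(a, b), 2))  # → 0.4
```

So a kappa near 0.4 against human raters, as in the table above, means the model adds only modest signal beyond chance, even when raw agreement looks high.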