0 sats \ 0 replies \ @Scoresby 14h \ on: The Era of Exploration - Yiding Jiang Design
I read through the rest of the paper and I have to admit I still don't understand this line. If pretraining is walking LLMs through huge chunks of human-created text and giving the model feedback on its predictions, I don't see why we can't repeatedly reuse the same data, as long as it's huge (say, all the English-language writing on the internet). Maybe it's that improvements require ever-larger datasets, and human-generated data can't grow quickly enough to keep up; in that sense it has been "consumed." But I'm still struggling with how to think about a model "consuming" data.
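
One back-of-the-envelope way I've tried to make sense of it (very rough sketch; the tokens-per-parameter heuristic is the Chinchilla-style rule of thumb, and the size of the usable web corpus is my own guess, not from the paper):

```python
# Rough sketch of why "consuming" data might make sense:
# bigger models want more unique tokens, but the supply of
# unique human-written text is roughly fixed.

TOKENS_PER_PARAM = 20   # Chinchilla-style rule of thumb (assumption)
WEB_TOKENS = 5e13       # my guess at usable unique human-written tokens

for params in [1e9, 1e10, 1e11, 1e12]:
    tokens_wanted = params * TOKENS_PER_PARAM
    print(f"{params:.0e} params wants ~{tokens_wanted:.0e} tokens "
          f"({tokens_wanted / WEB_TOKENS:.2%} of the unique text)")

# Once tokens_wanted exceeds WEB_TOKENS, the options are to repeat
# the same data (with diminishing returns) or find/generate new data --
# in that sense the unique corpus has been "consumed".
```

So maybe "consumed" doesn't mean the data is used up like fuel, just that re-reading the same corpus adds less and less, and the pool of genuinely new human text isn't growing fast enough.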