The AI community seems to believe that access to unique training data is the most important factor in getting AI models to produce differentiated results.
Following that logic, for an interface like ChatGPT or Perplexity to separate itself from competitors, it would need exclusive access to lots of useful and unique data, right?
If that holds true over time, which company produces the most valuable data for training AI models today?
I also suspect there’s a difference between producing lots of data and producing lots of unique, useful data.
It makes me wonder whether a company like YouTube (which produces a lot of data) has a defensible data moat if its most interesting videos are also posted to TikTok, X, and Instagram.
Or, said differently, how “unique” does someone’s data need to be for it to be worth something? Can you make up for a lack of uniqueness by simply having more data than the next guy?
The one data point I’ve seen on this topic so far is that Reddit is earning $60m this year for non-exclusive access to its data for training AI models.
I’m mystified as to why someone would want non-exclusive access to Reddit’s data, but would be curious to hear other stackers chime in…