It seems like the AI community believes that access to unique training data is the most important variable for AI models to produce differentiated results.
Following that logic, for an interface like ChatGPT or Perplexity to separate itself from competitors, it would need exclusive access to lots of useful and unique data, right?
If that holds true over time, which company produces the most valuable data for training AI models today?
I suspect there’s a difference between producing lots of data and producing lots of unique and useful data too.
It makes me wonder whether a company like YouTube (which produces a lot of data) has a defensible data moat if their most interesting videos are also posted to TikTok, X, and Instagram.
Or said differently, how “unique” does someone’s data need to be for it to be worth something? Can you make up for a lack of uniqueness by simply having more data than the next guy?
The one data point on this topic I’ve seen so far is that Reddit is earning $60m this year for non-exclusive access to train AI models using their data.
I’m mystified as to why someone would want non-exclusive access to Reddit’s data, but would be curious to hear other stackers chime in…
Twitter and Reddit have a ton of interesting niche-based data for training AI models. It also helps that retweets, likes, and upvotes can filter out the noise. There are way more interesting conversations and ideas going on in these platforms vs Facebook or Instagram posts.
I think another interesting one would be Medium or other blog-like platforms. Being able to train AI by triangulating thought leaders' long-form articles and comments offers unique insights beyond Google or Wikipedia.
On that note, podcast transcripts are probably the holy grail for really getting into the weeds. Not sure how licensing works, but since many podcasts are RSS feeds I wonder if it's an open market to download and train on transcripts (similar to YouTube, but I find podcasts to have more signal than videos most of the time).
Books too would be very interesting.
reply
The "most valuable data" depends on the use case the model is trying to solve for. If you want health diagnostics, you need medical records. If you want fingerprint matching, you want fingerprint data; financial models need financial data, and so on. We are seeing a lot of niche AI systems trained on specific types of data.
reply
good point, do you think the market for AI systems will become more and more fragmented over time?
reply
Yes. I would use the term specialized over fragmented.
Another example: weather prediction systems have become much better in recent years. You can be sure that the data being fed into the model does NOT include the Twitter feed or the Reddit catalog. It wants data from weather stations, sea buoys, satellites, commercial airline and shipping, and it wants this data over time informing and training the model. Increases in computation enable many more inputs to be computed.
I think there is some conflation here of LLMs with modeling in general. An LLM is a specialized type of model that ingests largely human-written text, learns patterns and connections in it, and then forms stochastic responses to prompts.
Another interesting point: as LLMs continue generating content that floods the Internet, search engines will of course consume this content, which in turn will be used for training LLMs. This creates a self-reinforcing feedback loop in which AI-generated content trains on itself. As anyone who has worked with ChatGPT for 5 minutes knows, it hallucinates. The human brain frequently detects these anomalies, but if this content is fed into an LLM, no such discernment will exist. This feedback loop will ultimately create decreasingly valuable content and may ultimately destroy the models' utility. The same holds for generative images and video. There is a "wow" factor now, but over time, we may see a decline in quality for the reasons stated.
The curators of LLMs will seek rich, high-quality, human-created content and try to reject the crap their systems regurgitate. The irony.
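For anyone who wants to poke at the feedback loop described above, here is a toy sketch. It is not an LLM and not anyone's real training pipeline: a Gaussian stands in for "the model", each generation is fit only on samples produced by the previous generation, and the function name, sample size, and generation count are all made up for illustration.

```python
import random
import statistics


def simulate_self_training(generations=200, sample_size=5, seed=0):
    """Toy stand-in for a model trained on its own output.

    Assumption: the "model" is just a Gaussian (mean, spread). Each
    generation draws a small sample from the current model and refits
    the next model on that sample alone, with no fresh human data.
    """
    rng = random.Random(seed)
    mean, spread = 0.0, 1.0  # generation 0: the original "human" data
    history = [spread]
    for _ in range(generations):
        # The current model generates the next training set.
        samples = [rng.gauss(mean, spread) for _ in range(sample_size)]
        # Refit the model on its own output only.
        mean = statistics.fmean(samples)
        spread = statistics.pstdev(samples)
        history.append(spread)
    return history


history = simulate_self_training()
print(f"spread at start: {history[0]:.3f}")
print(f"spread after {len(history) - 1} generations: {history[-1]:.3e}")
```

In this toy setup the spread (a rough proxy for diversity of output) tends to shrink toward zero as the generations pile up, because each refit loses a little of the tails and nothing ever puts them back. Real systems are far messier, so treat it as an intuition pump rather than evidence.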
reply
Google, but apparently they didn't use it for their product. Haha
reply
yeah that strikes me as odd - do you expect their AI products to be better than those built by everyone else?
reply
Not sure. I think OpenAI has a big lead currently, but Google can integrate their AI product into billions of devices.
reply
It's definitely data from Stacker News because Stackers put their money where their mouth is ;)
reply
👀
reply
Been thinking a lot about this lately, very interesting points, I agree everyone will need to optimize to become “inference-driven.” That era is here 💯esp for the huge companies.
SN is def still way too early, but the most obvious places to infer on SN would be the saloon or the individual territories, around interactions and usage? Other than that it would be general conversations and unique written posts around Bitcoin. Obviously the more data, charts, alpha, signal etc. the more valuable it would be, I would imagine. Also zaps lol.
When you start optimizing this way for the real world, you increasingly end up optimizing for more surveillance, because the inference from these systems can be packaged in a way that is helpful to everyone, including the state. China is already there: https://www.reuters.com/world/china/china-uses-ai-software-improve-its-surveillance-capabilities-2022-04-08/
reply
Microsoft (Github + LinkedIn)
...plus I predict they will drop the price/switch-to/add a "pay-with-your-data, but still no ads" model for Office/Windows/OneDrive/Azure.
Free Postgres databases are coming, but the host gets a copy.
reply
Tencent, Meta
reply
why those two specifically?
reply
They have the most privacy-invasive and popular apps. So more data points. And the data is high quality and difficult to obtain. Of course it depends on what kind of insight you want from the data. It might be worthless for some use cases.
reply