The "most valuable data" depends on the use case the model is trying solve for. If you want health diagnostics, you need medical records. If you want fingerprint matching, you want fingerprint data, financial models need financial data and so on. We are seeing a lot of niche AI systems trained on specific types of data.
16 sats \ 1 reply \ @kr OP 17 Mar
good point, do you think the market for AI systems will become more and more fragmented over time?
Yes. I would use the term specialized over fragmented.
Another example: weather prediction systems have become much better in recent years. You can be sure that the data being fed into those models does NOT include the Twitter feed or the Reddit catalog. The models want data from weather stations, sea buoys, satellites, and commercial aviation and shipping, and they want that data accumulated over time to inform and train them. Increases in compute make it possible to ingest many more of these inputs.
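To make that concrete, here is a minimal sketch (hypothetical station and buoy readings, using pandas) of the kind of structured, multi-source time series a forecasting model consumes, as opposed to the scraped text an LLM is fed:

```python
# Toy example: stack observations from two hypothetical sensor sources into
# one time-indexed table. A real pipeline would add satellite, aircraft, and
# ship reports, then feed windows of this history into the forecasting model.
import pandas as pd

stations = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=4, freq="6h"),
    "source": "weather_station",
    "temp_c": [3.1, 4.0, 5.2, 4.8],
    "pressure_hpa": [1012, 1010, 1009, 1011],
})

buoys = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=4, freq="6h"),
    "source": "sea_buoy",
    "temp_c": [8.4, 8.3, 8.5, 8.6],
    "pressure_hpa": [1013, 1012, 1012, 1013],
})

observations = (
    pd.concat([stations, buoys])
      .sort_values("time")
      .reset_index(drop=True)
)
print(observations)
```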
I think there is some conflation here between LLMs and modeling in general. LLMs are a specialized type of model: they ingest largely human-written text, learn patterns and associations within it, and can then produce stochastic responses to prompts.
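To illustrate the "stochastic responses" part, here is a toy sketch in Python with made-up next-token probabilities; it is not how any real LLM is implemented, just a picture of temperature-controlled sampling:

```python
import random

# Hypothetical next-token distribution the model might assign after a prompt.
next_token_probs = {"dog": 0.55, "cat": 0.30, "ferret": 0.15}

def sample_next_token(probs, temperature=1.0):
    # Temperature reshapes the distribution: low values make the output
    # nearly deterministic, high values make it more random.
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs.keys()), weights=weights, k=1)[0]

# Same prompt, different completions on each call.
print([sample_next_token(next_token_probs, temperature=0.7) for _ in range(5)])
```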
Another interesting point: as LLMs keep generating content that floods the Internet, search engines will of course index that content, which in turn gets scraped for training the next round of LLMs. That creates a feedback loop in which AI-generated content ends up training the models that produced it. As anyone who has worked with ChatGPT for five minutes knows, it hallucinates. A human reader frequently catches these anomalies, but if that content is fed straight back into an LLM, no such discernment exists. The loop produces decreasingly valuable content and may ultimately destroy the models' utility. The same holds for generative images and video: there is a "wow" factor now, but over time we may see quality decline for the same reasons.
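A toy simulation of that loop (arbitrary numbers, nothing resembling a real training pipeline): each "generation" is fit only to samples produced by the previous one, and the estimated distribution steadily degenerates, a simplified version of what researchers call model collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0           # generation 0: fit to "real" data
n_samples, generations = 30, 100

for g in range(1, generations + 1):
    synthetic = rng.normal(mu, sigma, n_samples)   # content the model emits
    mu, sigma = synthetic.mean(), synthetic.std()  # next model trained on it
    if g % 20 == 0:
        # The mean drifts and the spread shrinks as generations pile up.
        print(f"generation {g:3d}: mean={mu:+.3f}, std={sigma:.3f}")
```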
The curators of LLMs will seek rich, high-quality, human-created content and try to reject the crap their systems regurgitate. The irony.