
Yes. I would use the term "specialized" over "fragmented."
Another example: weather prediction systems have become much better in recent years. You can be sure that the data being fed into the model does NOT include the Twitter feed or the Reddit catalog. It wants data from weather stations, sea buoys, satellites, and commercial airline and shipping traffic, and it wants this data over time, informing and training the model. Increases in computation enable many more inputs to be processed.
I think there is some conflation here of LLMs with modeling in general. LLMs are a specialized type of model that ingests largely human-written text, associates patterns and connections, and then forms stochastic responses to prompts.
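To make the "associates patterns, then forms stochastic responses" idea concrete, here is a toy sketch using a bigram model. This is vastly simpler than an LLM and only illustrates the principle: the model counts which word follows which in its training text, then samples continuations from those observed patterns. The corpus and function names here are invented for illustration.

```python
import random
from collections import defaultdict

def train_bigrams(text):
    """Map each word to the list of words observed following it."""
    words = text.split()
    followers = defaultdict(list)
    for a, b in zip(words, words[1:]):
        followers[a].append(b)
    return followers

def generate(followers, prompt, length=8, seed=1):
    """Sample a stochastic continuation of the prompt, one word at a time."""
    rng = random.Random(seed)
    out = [prompt]
    for _ in range(length):
        choices = followers.get(out[-1])
        if not choices:
            break  # no observed continuation; stop
        out.append(rng.choice(choices))
    return " ".join(out)

corpus = "the model predicts the weather and the model predicts rain"
model = train_bigrams(corpus)
print(generate(model, "the"))
```

Every word the model can emit was seen in training; the "creativity" is just sampling among observed continuations, which is the bigram-scale version of what an LLM does with far richer context.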
Another interesting point: as LLMs continue generating content that floods the Internet, search engines will of course consume this content, which in turn will be used for training LLMs. This creates a self-reinforcing feedback loop where AI-generated content trains itself. As anyone who has worked with ChatGPT for 5 minutes knows, it hallucinates. The human brain frequently detects these anomalies, but if this content is fed into an LLM, no such discernment will exist. This feedback loop will produce decreasingly valuable content and may ultimately destroy the utility. The same holds for generative images and video. There is a "wow" factor now, but over time, we may see a decline in quality for the reasons stated.
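The feedback loop above can be sketched with a toy simulation (an assumption-laden illustration, not a real LLM): each "generation" of a model is fit to samples produced by the previous generation. With finite samples, the fitted distribution drifts and its spread collapses, so diversity shrinks over generations. The parameters here are arbitrary.

```python
import random
import statistics

def train_on_own_output(mu=0.0, sigma=1.0, generations=300, n_samples=10, seed=42):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Each round, the "new model" (mu, sigma) is estimated only from the
    finite output of the "old model" -- the model training on itself.
    Returns the list of (mu, sigma) pairs across generations.
    """
    rng = random.Random(seed)
    history = [(mu, sigma)]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append((mu, sigma))
    return history

history = train_on_own_output()
print(f"initial sigma: {history[0][1]:.4f}, final sigma: {history[-1][1]:.4g}")
```

The spread (sigma) shrinks toward zero across generations: the self-trained model forgets the tails of the original distribution first, which is the toy analogue of "decreasingly valuable content."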
The curators of LLMs will seek rich, high-quality, human-created content and try to reject the crap their systems regurgitate. The irony.