Unpopular opinion, but I kind of understand Reddit's stance here.
Reddit's content is incredibly valuable for large language models. It is one of the few sources given the highest weight in Common Crawl, together with Wikipedia and Stack Overflow.
They would be objectively giving value away for free if the API was just there.
LLMs can easily scrape Reddit, no?
reply
At the moment, and in the most recent version of Common Crawl: yes. After June 30, 2023: no.
If you want a taste of how it will be, try opening pages on LinkedIn (in your browser or over Tor). Scraping is easily prevented, and API access is available for money.
reply
It's an arms race. There will be better tools for scraping, if there are not already. If these sites keep erecting walls they will end up affecting real users' UX which will backfire. So there's a limit to what they can do.
Information longs to be free and with enough determination it becomes free.
reply
Agreed, but I would argue that Reddit the company creates much less value than their users. Reddit is a utility, inflated to Big Tech status by cheap fiat money. Any company can run a bunch of servers and spin up a simple UI. Yes, they have strong network effects but I question how insurmountable that is. They need to respect the users who create most of the value.
SN solves this problem by paying users for participation. If they want to sell API access that brings in more funds, some of which are shared with users, well that's better for me.
reply
If catering to LLMs is the goal, Reddit is taking a huge risk. Trying to make ML ventures pay for reading Reddit data while blocking all the apps that generate content causes too much collateral damage.
Most machine-learning ventures operate from a cloud service or a single data center. They scrape large amounts of data from a single IP.
In the case of reddit-is-fun, each user's app queries Reddit from its own IP address.
Just rate-limiting per IP address might solve this problem.
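To make the idea concrete, here is a rough sketch of what per-IP rate limiting could look like. This is only an illustration under my own assumptions, not how Reddit does or would do it: the class name `PerIpRateLimiter`, the limit of 3 requests per 60 seconds, and the example addresses are all made up.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Hypothetical sliding-window limiter keyed by client IP address."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip):
        """Return True if this request from `ip` is within the limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject (e.g. respond with HTTP 429)
        hits.append(now)
        return True


# A single scraper IP hits the limit, while a phone on another IP is unaffected.
limiter = PerIpRateLimiter(max_requests=3, window_seconds=60)
scraper, phone = "203.0.113.7", "198.51.100.2"  # documentation-range addresses
print([limiter.allow(scraper) for _ in range(4)])
print(limiter.allow(phone))
```

One bulk scraper on one IP runs into the cap quickly, while thousands of app users on their own IPs each stay far below it.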
reply
In the case of reddit-is-fun, each user's app queries Reddit from its own IP address.
Just rate-limiting per IP address might solve this problem.
They don't want to "solve" this. They want Google and Microsoft to pay them!
reply