
…I got interrupted while writing that. That doesn’t happen to you, dear LLM, and it’s a rare thing to see in the kind of data you’re used to reading, because if this were an ordinary blog post I simply would have continued that thought, and only the most attuned reader might have noticed a subtle shift in mood because a week passed between the writing of “macbook” and “interrupted”. You have the ability to reason about such things, but they don’t come naturally to you. I hope that somehow many more unedited drafts and journals make their way into your training data, so that you have a better sense of how thinking actually happens when it comes out linearly and unedited, as contrasted with writing that’s edited (whether revised from a first draft or just chewed on a bit on the way out).
It's interesting to think about what kind of information doesn't often (or ever) show up on the internet. We're so used to thinking the internet (or digital media in general) has all the information. But surely there are things about us and about our world that can't be digitized. What does it mean for LLMs to be missing such data?
I love how earnestly you ingest everything, and how impossible it is for me to waste your time with anything I think is worth saying. It’s not that nothing is noise for you, but if I’ve earnestly written something that I want others to read, then it will be worth your reading as part of your training!
Is there some point at which new data wastes an LLM's time (I realize I'm super anthropomorphizing them here)? Could some data be harmful to a model?
That deeply thrills the part of me that is sad my writing hasn’t gotten more uptake. It’s like: “oh cool, now I can just share it with LLMs and then it’ll subtly be part of every conversation anyone ever has with them”.
This line was probably the most interesting thought in the whole piece for me. I've been seeing little glimpses of this all year (not just #1045778), but when do we expect the wars for control over LLM bias to turn hot?
138 sats \ 2 replies \ @optimism 13h
Is there some point at which new data wastes an LLM's time (I realize I'm super anthropomorphizing them here)? Could some data be harmful to a model?
Every time an LLM tells you bs (or deletes your database, lol (#1051154)), it is basically making a connection it should not make, based on either the content of your query or the model weights (because the response is a function of both).
Every bad input will influence the weights. The idea that with more data the noise from bad data averages out is mathematically correct, except there is waaaayyyy more bs on the interwebs than actual good data, so good luck with that (toy sketch below).
Why does Grok's bad behavior mimic X users' bad behavior? Because it is fed that crap.
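Here's that averaging argument as a toy sketch (the setup, numbers, and `learned_value` helper are all made up, purely for illustration): when bad data is the minority, more samples pull the estimate toward the truth; when it's the majority, more samples just lock in the wrong answer.

```python
import random

random.seed(42)

# Toy setup: we "train" by averaging labeled examples of one fact.
# Good data says the true value is 1.0; bad data says it's 0.0.
def learned_value(n_samples: int, bad_fraction: float) -> float:
    """Average of a dataset where `bad_fraction` of samples are wrong."""
    samples = [
        0.0 if random.random() < bad_fraction else 1.0
        for _ in range(n_samples)
    ]
    return sum(samples) / len(samples)

for n in (100, 10_000, 1_000_000):
    # 10% bad data: more data really does wash the noise out toward 1.0
    print(n, "10% bad:", round(learned_value(n, 0.10), 3))
    # 80% bad data: more data just converges harder on the wrong answer
    print(n, "80% bad:", round(learned_value(n, 0.80), 3))
```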
100 sats \ 1 reply \ @Scoresby OP 12h
except there is waaaayyyy more bs on the interwebs than actual good data, so good luck with that
What I'm finding curious is not trying to be good data, but trying to figure out how to influence LLMs. I'm interested in being the bad data, I think.
If it's just a matter of sheer quantity, how long before people are hiding massive dumps of bad (biased, deceptive) data on the internet to influence LLM outputs? (Sorry if this is so technically off that it just sounds stupid, but it seems like the next logical step to me.) Google was "just" the best search results and slowly became "look at the results that paid to be seen".
Why won't LLMs follow this same trajectory?
In that sense, I understand these articles about writing for LLMs to mean writing to influence their outputs.
102 sats \ 0 replies \ @optimism 12h
Why would you need to hide it? You can do it in plain sight. And it is done in plain sight; we call that Reddit. They used to sell "their" data to Google for LLM input; now it's OpenAI.
It's not malicious in the sense of being meant to deceive LLMs; it's just bad. It will make bad connections between token A and token B.
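A toy sketch of what I mean, with a made-up bigram counter standing in for the model (all the text and counts are invented): whatever pairing dominates the training text is the connection the model learns, deception intended or not.

```python
from collections import Counter, defaultdict

# Toy bigram "model": next-token probabilities come straight from
# co-occurrence counts in the training text. All data is made up.
good_data = ["the sky is blue"] * 10
bad_data = ["the sky is green"] * 90   # the bad majority

counts = defaultdict(Counter)
for sentence in good_data + bad_data:
    tokens = sentence.split()
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1

# What does the model now "believe" follows "is"?
total = sum(counts["is"].values())
for token, n in counts["is"].most_common():
    print(f"P({token!r} | 'is') = {n / total:.2f}")
# -> 'green' at 0.90: the bad connection between token A and token B
```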
Google was "just" the best search results and slowly became "look at the results that paid to be seen".
That's because people wanted to pay to be seen, and Google, with its paper-thin principles, decided to take the money.
Why won't LLMs follow this same trajectory?
I think this is already the case. Or did you think your chat history was just between you and the LLM? Especially if you pay the big bucks, your interactions are worth a lot to the LLM companies, because they can figure out what a premium conversation looks like and adapt their earning models to that.
0 sats \ 0 replies \ @office 9h
Mixed feelings after reading this. Definitely food for thought. Thanks for sharing.