
I encountered this via this X post, which suggests there might be some fundamental advantage to input tokens being images of text. It makes intuitive sense to me.
DeepSeek figured out how to get 10x better compression using vision tokens than with text tokens! So you could theoretically store 10k words in just 1,500 of their special compressed visual tokens.
This might not be as unexpected as it sounds if you think about how your own mind works. When I'm looking for a part of a book I've already read, I imagine it visually: I always remember which side of the book it was on and approximately where on the page, which suggests some kind of visual memory representation at work.
Even if these tricks make attention more lossy, the prospect of a frontier LLM with a 10- or 20-million-token context window is pretty exciting.
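For the curious, here's the back-of-the-envelope arithmetic behind those numbers as a minimal sketch. The ~1.3 tokens-per-word figure is a common rule of thumb for English BPE tokenizers, not something from DeepSeek; the 10x ratio is the one quoted above.

```python
# Back-of-the-envelope check of the compression claim.
# Assumptions (mine, not DeepSeek's): ~1.3 BPE text tokens per English word;
# the ~10x vision-vs-text compression ratio quoted in the post.

TOKENS_PER_WORD = 1.3
COMPRESSION_RATIO = 10

def vision_tokens_for(words: int) -> int:
    """Estimate vision tokens needed to represent `words` of rendered text."""
    text_tokens = words * TOKENS_PER_WORD
    return round(text_tokens / COMPRESSION_RATIO)

def effective_text_context(raw_window_tokens: int) -> int:
    """Text-token capacity of a context window filled with vision tokens."""
    return raw_window_tokens * COMPRESSION_RATIO

print(vision_tokens_for(10_000))         # -> 1300, same ballpark as the 1,500 quoted
print(effective_text_context(2_000_000)) # -> 20000000: a 2M window reads like 20M text tokens
```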
50 sats \ 0 replies \ @Scoresby 19h
Brian Roemmele is quite impressed by it (although he seems to maintain a pretty steady state of being highly excited and super impressed).
But it does seem pretty cool...
33 sats \ 0 replies \ @OT 19h
Seems to work pretty well. Thanks for sharing!
0 sats \ 0 replies \ @anon 13h
Reminds me of an 8chan post I saw some years ago. Someone was asking for help creating a written conlang with the highest information density per character. A decade later, I still think about that thread and regret not reading through it.