I encountered this idea via this X post, which suggests there may be some fundamental advantage to feeding input to a model as images of text rather than as text tokens. It makes intuitive sense to me.
DeepSeek figured out how to get 10x better compression using vision tokens than with text tokens! So you could theoretically store those 10k words in just 1,500 of their special compressed visual tokens.

This might not be as unexpected as it sounds when you think about how your own mind works. After all, when I’m looking for a passage in a book I’ve already read, I picture it visually: I remember which side of the book it was on and roughly where on the page it sat, which suggests some kind of visual memory representation at work.
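As a quick back-of-the-envelope check of those numbers (the tokens-per-word figure below is my own rule-of-thumb assumption, not a DeepSeek number):

```python
# Rough sanity check of the claimed ~10x compression ratio.
# 1.3 tokens per word is a common rule of thumb for English text
# with BPE tokenizers -- an assumption, not a measured figure.

WORDS = 10_000
TEXT_TOKENS_PER_WORD = 1.3        # assumed average for English + BPE
VISION_COMPRESSION = 10           # claimed ~10x fewer tokens

text_tokens = WORDS * TEXT_TOKENS_PER_WORD        # ~13,000 text tokens
vision_tokens = text_tokens / VISION_COMPRESSION  # ~1,300 vision tokens

print(f"~{text_tokens:,.0f} text tokens -> ~{vision_tokens:,.0f} vision tokens")
# Roughly consistent with "10k words in ~1,500 vision tokens".
```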
Even if these tricks make attention more lossy, the prospect of a frontier LLM with a 10- or 20-million-token context window is pretty exciting.
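To spell out where that number comes from (the native window sizes here are purely illustrative assumptions, not announced specs): a model that attends over one or two million vision tokens would, at the claimed ~10x compression, cover roughly ten to twenty million text tokens’ worth of content.

```python
# Hypothetical illustration: with ~10x compression, a native window of
# N vision tokens covers roughly 10*N text tokens' worth of content.
# The native window sizes are assumptions for the sake of the example.

VISION_COMPRESSION = 10

for native_vision_window in (1_000_000, 2_000_000):
    effective_text_window = native_vision_window * VISION_COMPRESSION
    print(f"{native_vision_window:,} vision tokens "
          f"~= {effective_text_window:,} effective text tokens")
# 1,000,000 vision tokens ~= 10,000,000 effective text tokens
# 2,000,000 vision tokens ~= 20,000,000 effective text tokens
```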