
Well, how does it work?

The paper frames the core insight as using a policy network to compress the less-relevant chunks in the RAG pipeline. To us, the core insight is actually this: if the chunk embeddings are already produced by layers inside an LLM, it makes no sense to decode them back into natural language only for another LLM to compress those tokens back into embeddings. Skipping that round trip is why the speedups come without collapsing accuracy.
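Here is a minimal sketch of that idea, not the paper's actual code: precomputed chunk embeddings are projected straight into the decoder's input space so each compressed chunk occupies one position, while only the chunks a (hypothetical) policy scores as highly relevant get expanded back into full token sequences. All names and dimensions (`ChunkProjector`, `d_chunk`, `d_model`, the relevance threshold) are illustrative assumptions.

```python
# Sketch only: feed retrieved chunk embeddings directly into the decoder
# instead of decoding them to text and re-encoding them. Names are invented.
import torch
import torch.nn as nn

d_chunk, d_model = 768, 1024  # retriever embedding dim vs. decoder hidden dim (assumed)

class ChunkProjector(nn.Module):
    """Maps a precomputed chunk embedding into the decoder's token-embedding
    space, so one chunk takes a single input position instead of many tokens."""
    def __init__(self, d_chunk: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_chunk, d_model)

    def forward(self, chunk_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(chunk_emb)

# Toy inputs: 4 retrieved chunks already embedded by the retriever,
# plus 10 query tokens already embedded by the decoder's embedding table.
chunk_embs = torch.randn(4, d_chunk)
query_token_embs = torch.randn(10, d_model)

# A hypothetical policy score per chunk decides which chunks stay compressed
# (one projected vector each) and which are expanded back to full text tokens.
relevance = torch.rand(4)
keep_full = relevance > 0.8  # only the most relevant chunks get expanded
projector = ChunkProjector(d_chunk, d_model)

compressed = projector(chunk_embs[~keep_full])  # (n_compressed, d_model)
# In a real system the kept chunks would be re-tokenized and embedded here;
# random vectors of a plausible length stand in for that step.
expanded = torch.randn(int(keep_full.sum()) * 64, d_model)

# Final decoder input: compressed chunk vectors + expanded chunks + query tokens.
decoder_input = torch.cat([compressed, expanded, query_token_embs], dim=0)
print(decoder_input.shape)  # much shorter than embedding every chunk as raw tokens
```

The point of the sketch is the shape of the input: most chunks contribute one vector rather than hundreds of token embeddings, which is where the latency savings come from.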