
Well, how does it work?

The paper frames the core insight as using a policy network to compress the less-relevant chunks in the RAG pipeline. To us, the core insight is actually this: if the chunk embeddings are already produced by layers inside an LLM, it makes no sense to decode them back into natural language only for another LLM to compress those tokens back into embeddings. Skipping that round trip is why the speedups come without collapsing accuracy.
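Here is a minimal sketch of that idea, not the paper's actual code: precomputed chunk embeddings are projected straight into the decoder's input space so each compressed chunk occupies one position, while only the chunks a (hypothetical) policy scores as highly relevant get expanded back into full token sequences. All names and dimensions (`ChunkProjector`, `d_chunk`, `d_model`, the relevance threshold) are illustrative assumptions.

```python
# Sketch only: feed retrieved chunk embeddings directly into the decoder
# instead of decoding them to text and re-encoding them. Names are invented.
import torch
import torch.nn as nn

d_chunk, d_model = 768, 1024  # retriever embedding dim vs. decoder hidden dim (assumed)

class ChunkProjector(nn.Module):
    """Maps a precomputed chunk embedding into the decoder's token-embedding
    space, so one chunk takes a single input position instead of many tokens."""
    def __init__(self, d_chunk: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_chunk, d_model)

    def forward(self, chunk_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(chunk_emb)

# Toy inputs: 4 retrieved chunks already embedded by the retriever,
# plus 10 query tokens already embedded by the decoder's embedding table.
chunk_embs = torch.randn(4, d_chunk)
query_token_embs = torch.randn(10, d_model)

# A hypothetical policy score per chunk decides which chunks stay compressed
# (one projected vector each) and which are expanded back to full text tokens.
relevance = torch.rand(4)
keep_full = relevance > 0.8  # only the most relevant chunks get expanded
projector = ChunkProjector(d_chunk, d_model)

compressed = projector(chunk_embs[~keep_full])  # (n_compressed, d_model)
# In a real system the kept chunks would be re-tokenized and embedded here;
# random vectors of a plausible length stand in for that step.
expanded = torch.randn(int(keep_full.sum()) * 64, d_model)

# Final decoder input: compressed chunk vectors + expanded chunks + query tokens.
decoder_input = torch.cat([compressed, expanded, query_token_embs], dim=0)
print(decoder_input.shape)  # much shorter than embedding every chunk as raw tokens
```

The point of the sketch is the shape of the input: most chunks contribute one vector rather than hundreds of token embeddings, which is where the latency savings come from.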