Model performance degrades as input length increases, often in surprising and non-uniform ways.
Long context evaluations for these models often demonstrate consistent performance across input lengths. However, these evaluations are narrow in scope and not representative of how long context is used in practice. The most commonly used test, Needle in a Haystack (NIAH), is a simple lexical retrieval task whose results are often generalized into claims about a model's ability to reliably handle long context.
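As a rough illustration of what a NIAH-style trial involves, here is a minimal sketch. The `query_model` helper and the filler sentence are hypothetical placeholders for whatever model client and distractor corpus a real evaluation uses; the needle/question pair mirrors the classic "Dolores Park" example from the original NIAH benchmark.

```python
# Sketch of a NIAH-style trial: bury one relevant sentence (the "needle") in
# repeated filler text (the "haystack") and check whether the model retrieves it.
def query_model(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real LLM API call before running.
    raise NotImplementedError("replace with a real LLM API call")

def build_haystack(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) in the filler text."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

def run_trial(needle: str, question: str, expected: str, n_sentences: int, depth: float) -> bool:
    haystack = build_haystack(needle, "The sky was a pale grey that afternoon.", n_sentences, depth)
    answer = query_model(f"{haystack}\n\nQuestion: {question}\nAnswer:")
    return expected.lower() in answer.lower()  # simple lexical match, as in NIAH

# NIAH results are typically reported as a grid over context length and needle position.
needle = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
question = "What is the best thing to do in San Francisco?"
for n in (100, 1_000, 10_000):
    for depth in (0.0, 0.5, 0.9):
        print(n, depth, run_trial(needle, question, "Dolores Park", n, depth))
```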
The researchers tested LLMs with focused prompts (~300 tokens) and full prompts (~113k tokens):
Across all models, we see significantly higher performance on focused prompts compared to full prompts.
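To make the two conditions concrete, here is a minimal sketch under assumed names; `query_model`, the question, and the excerpt are hypothetical stand-ins for the study's actual prompts and tasks.

```python
# Sketch of the focused vs. full prompt conditions; all names and strings here
# are illustrative, not the study's actual materials.
def query_model(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real LLM API call before running.
    raise NotImplementedError("replace with a real LLM API call")

QUESTION = "Which configuration flag caused the failing deploy?"
RELEVANT_EXCERPT = "Deploy 42 failed after enable_beta_cache was set to true."
FULL_HISTORY = "..."  # in the study, roughly 113k tokens of accumulated context

# Focused condition (~300 tokens): only the material needed for the task.
focused_prompt = f"{RELEVANT_EXCERPT}\n\nQuestion: {QUESTION}"

# Full condition (~113k tokens): the same question preceded by everything else.
full_prompt = f"{FULL_HISTORY}\n\nQuestion: {QUESTION}"

# Both conditions are scored against the same reference answer, so any gap
# reflects the surrounding context rather than the task itself.
# answer_focused = query_model(focused_prompt)
# answer_full = query_model(full_prompt)
```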
More broadly, our findings point to the importance of context engineering: the careful construction and management of a model's context window. Where and how information is presented in a model's context strongly influences task performance, making context engineering a meaningful direction for future work on optimizing model performance.
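As one simplified example of what context engineering can look like in code, the sketch below ranks candidate passages by a naive keyword-overlap score and keeps only those that fit a token budget, rather than concatenating everything into the prompt. The scoring and token-counting functions are assumptions standing in for a real retriever and tokenizer.

```python
# Sketch of one context-engineering tactic: keep only the most relevant
# passages within a token budget instead of sending the full corpus.
def relevance(query: str, passage: str) -> float:
    # Naive keyword overlap; a real system would use a retriever or embeddings.
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / (len(q_words) or 1)

def approx_tokens(text: str) -> int:
    # Crude whitespace approximation; a real tokenizer gives different counts.
    return len(text.split())

def build_context(query: str, passages: list[str], budget: int = 300) -> str:
    """Select the highest-scoring passages that fit within the token budget."""
    ranked = sorted(passages, key=lambda p: relevance(query, p), reverse=True)
    kept, used = [], 0
    for passage in ranked:
        cost = approx_tokens(passage)
        if used + cost > budget:
            break
        kept.append(passage)
        used += cost
    return "\n\n".join(kept)

# Usage: the returned context plus the query forms the prompt sent to the model,
# keeping the input closer to the "focused" condition above.
```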