Reproducing and Analyzing the “Lost in the Middle” and “The Power of Noise” Phenomenon in Retrieval-Augmented Generation

Date

2026-06-16

Publisher

Indian Statistical Institute

Abstract

Retrieval-Augmented Generation (RAG) has become a standard way to improve Large Language Models (LLMs), helping with problems such as knowledge limitations and hallucinations. Recent studies show, however, that RAG systems still have limitations. One major problem is the “Lost in the Middle” phenomenon: models struggle to properly access information located in the middle of their contexts. Another, counterintuitive observation is the “Power of Noise” paradigm, which suggests that adding unrelated documents can actually improve generation. Both effects are known to occur in extractive QA tasks, but it is not known whether they also occur in tasks that require complex reasoning. This dissertation investigates how position and noise affect Long-Form Question Answering. Using the ELI5 dataset, we test three models with varying amounts of context and measure their performance. We also change the location of the correct information and add distracting or random information to observe the effects of these perturbations. Because traditional metrics for evaluating model-generated answers are not very effective for long-form responses, we introduce two new evaluation metrics, Prop Score and Sentence Score. Our experiments yield three findings. First, the “Lost in the Middle” issue still occurs to a certain degree in Long-Form QA. Second, we confirm that noise can actually improve generation. Third, we hypothesize reasons for the persistence of the “Lost in the Middle” phenomenon and the “Power of Noise” paradigm in Long-Form QA.
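
The perturbations described in the abstract (moving the correct document within the context and padding it with distracting or random documents) can be illustrated with a minimal Python sketch. This is not the dissertation's code: the function names `build_context` and `format_prompt`, the prompt layout, and the sampling scheme are all assumptions made for illustration.

```python
import random


def build_context(gold_doc, filler_docs, gold_position, num_docs, seed=0):
    """Return num_docs documents with gold_doc at index gold_position.

    Hypothetical helper: filler_docs stands in for distractor or random
    documents, and the remaining slots are filled by sampling from it.
    """
    if not 0 <= gold_position < num_docs:
        raise ValueError("gold_position must lie in [0, num_docs)")
    if len(filler_docs) < num_docs - 1:
        raise ValueError("not enough filler documents")
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    fillers = rng.sample(filler_docs, num_docs - 1)
    # Insert the gold document at the requested position.
    return fillers[:gold_position] + [gold_doc] + fillers[gold_position:]


def format_prompt(question, docs):
    """Join documents and question into one prompt (assumed layout)."""
    numbered = [f"Document [{i + 1}]: {d}" for i, d in enumerate(docs)]
    return "\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    fillers = [f"unrelated passage {i}" for i in range(10)]
    # Sweep the gold document from the start to the end of a 5-document context.
    for pos in range(5):
        docs = build_context("the relevant passage", fillers, pos, 5)
        print(pos, docs.index("the relevant passage"))
```

Sweeping `gold_position` while holding `num_docs` fixed isolates the positional effect. Holding the position fixed and swapping in a different filler pool (distracting versus random) isolates the effect of noise.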

Description

This dissertation has been completed under the supervision of Prof. Mandar Mitra.

Keywords

Retrieval-Augmented Generation (RAG), Lost in the Middle, Power of Noise, Long-Form Question Answering, ELI5 dataset, Positional Bias, Distractor Documents, Noise Injection, PropScore, SentenceScore, BERTScore, ROUGE-L, BLEU, Sentence-BERT, Large Language Models, Llama-2, Phi-3-mini, Qwen2.5, Context Length, Document-Grounded/Answer-Grounded Evaluation.

Citation

92p.
