Author: Saha, Sreyan
Date Issued: 2025-06
Date Available: 2025-07-21
Extent: 24 p.
URI: http://hdl.handle.net/10263/7587
Description: Dissertation under the supervision of Dr. Debapriyo Majumdar and Dr. Rajkiran Panuganti.
Abstract: Retrieval-Augmented Generation (RAG) has become a popular technique to enhance Large Language Models (LLMs) with access to external information sources. However, the success of RAG systems critically depends on the relevance and quality of the retrieved documents. In particular, supplying irrelevant or noisy context can lead to degraded downstream generation quality. To address this, our project focuses on improving the document filtering stage in a RAG pipeline through binary relevance classification — deciding whether a retrieved document is suitable to include in the final context window based on its usefulness in directly answering the user query. We explore a wide range of approaches to this task, including rule-based retrieval methods (TF-IDF, BM25), classical machine learning classifiers (logistic regression, SVM), deep neural networks, and LLM-based methods, in both zero-shot and few-shot settings. Our final pipeline leverages instruction-tuned LLMs to act as strict binary classifiers, with a focus on maximizing precision over recall, thereby ensuring that only the most relevant and high-quality documents are passed to the generation module. Experiments are conducted on a Reddit-based query-document dataset tailored to subjective and opinion-heavy queries. Our evaluations suggest that LLMs, even without fine-tuning, can outperform traditional methods in this setting, offering a strong foundation for further enhancement through supervised fine-tuning.
Language: en
Keywords: Retrieval-Augmented Generation; Binary Relevance Classification; Document Filtering; Large Language Models; Precision-Oriented Retrieval; Reddit Dataset; Zero-Shot Inference
Title: Binary Document Filtering for Retrieval-Augmented Generation
Type: Other
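The abstract mentions rule-based retrieval methods such as TF-IDF as one family of approaches to binary relevance classification. A minimal sketch of how such a baseline might look is shown below: documents are kept only if their TF-IDF cosine similarity to the query exceeds a threshold. This is an illustrative stand-in, not the dissertation's actual implementation; the whitespace tokenization and the threshold value of 0.1 are assumptions chosen for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {t: math.log(n / df[t]) for t in df}   # inverse document frequency
    vecs = []
    for doc in docs:
        tf = Counter(doc)                        # raw term frequency
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    du = math.sqrt(sum(w * w for w in u.values()))
    dv = math.sqrt(sum(w * w for w in v.values()))
    return num / (du * dv) if du and dv else 0.0

def filter_documents(query, docs, threshold=0.1):
    """Binary relevance decision: keep only documents whose TF-IDF cosine
    similarity to the query meets the (assumed) threshold."""
    tokenized = [query.lower().split()] + [d.lower().split() for d in docs]
    vecs = tfidf_vectors(tokenized)
    qv, dvs = vecs[0], vecs[1:]
    return [d for d, v in zip(docs, dvs) if cosine(qv, v) >= threshold]
```

In the precision-oriented setting the abstract describes, the threshold would be tuned upward so that borderline documents are dropped rather than passed to the generation module.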