Binary Document Filtering for Retrieval-Augmented Generation

Date

2025-06

Publisher

Indian Statistical Institute, Kolkata

Abstract

Retrieval-Augmented Generation (RAG) has become a popular technique to enhance Large Language Models (LLMs) with access to external information sources. However, the success of RAG systems critically depends on the relevance and quality of the retrieved documents. In particular, supplying irrelevant or noisy context can degrade downstream generation quality. To address this, our project focuses on improving the document filtering stage in a RAG pipeline through binary relevance classification: deciding whether a retrieved document is suitable to include in the final context window based on its usefulness in directly answering the user query. We explore a wide range of approaches to this task, including rule-based retrieval methods (TF-IDF, BM25), classical machine learning classifiers (logistic regression, SVM), deep neural networks, and LLM-based methods, in both zero-shot and few-shot settings. Our final pipeline leverages instruction-tuned LLMs to act as strict binary classifiers, with a focus on maximizing precision over recall, thereby ensuring that only the most relevant and high-quality documents are passed to the generation module. Experiments are conducted on a Reddit-based query-document dataset tailored to subjective and opinion-heavy queries. Our evaluations suggest that LLMs, even without fine-tuning, can outperform traditional methods in this setting, offering a strong foundation for further enhancement through supervised fine-tuning.
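
To make the filtering task concrete, the following is a minimal sketch of one of the lexical baselines named above (BM25) used as a precision-oriented binary filter. It is not the dissertation's implementation: the function names, the whitespace tokenizer, the `k1`/`b` defaults, the threshold values, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems use richer ones)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Fall back to 1.0 so that a corpus of empty documents does not divide by zero.
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_documents(query, docs, threshold):
    """Binary decision per document: keep it only if its score clears the threshold."""
    return [score >= threshold for score in bm25_scores(query, docs)]


def precision_recall(predicted, gold):
    """Precision and recall of binary keep/drop decisions against gold labels."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data, invented for this sketch.
    query = "budget laptop for students"
    docs = [
        "best budget laptop for students",
        "my cat likes tuna",
        "laptop recommendations for college students on a budget",
    ]
    gold = [True, False, True]
    for threshold in (0.1, 1.8):
        kept = filter_documents(query, docs, threshold)
        print(threshold, kept, precision_recall(kept, gold))
```

Raising the threshold drops the weaker of the two relevant documents: recall falls while precision stays at 1.0. This is the trade-off a precision-first filter accepts so that less noise reaches the generation module.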

Description

Dissertation under the supervision of Dr. Debapriyo Majumdar and Dr. Rajkiran Panuganti

Keywords

Retrieval-Augmented Generation, Binary Relevance Classification, Document Filtering, Large Language Models, Precision-Oriented Retrieval, Reddit Dataset, Zero-Shot Inference

Citation

24p.
