Explaining Query Expansion Algorithms

Dutta, Aditya

dc.contributor.author	Dutta, Aditya
dc.date.accessioned	2025-07-17T09:29:44Z
dc.date.available	2025-07-17T09:29:44Z
dc.date.issued	2025-06
dc.description	Dissertation under the supervision of Prof. Mandar Mitra.	en_US
dc.description.abstract	Query Expansion (QE) techniques aim to mitigate vocabulary mismatch in Information Retrieval by augmenting user queries with related terms. However, their effectiveness varies across queries. This work investigates the explainability of QE by leveraging the concept of an Ideal Expanded Query (IEQ): a hypothetical query yielding near-perfect retrieval performance, measured via Average Precision (AP). We hypothesize that the closer an Expanded Query (EQ) variant is to the IEQ, the higher its AP. Our approach consists of three major components: (i) generating an IEQ, (ii) measuring the similarity between an EQ and an IEQ, and (iii) computing the correlation between these similarities and the corresponding AP values. For component (i), we use Oracle Rocchio tuning and Logistic Regression to construct IEQs. For component (ii), we use four similarity measures: cosine similarity, a modified cosine similarity normalized by the 𝐿1 norm, the Jaccard index, and a modified nDCG similarity. For component (iii), we use the Pearson, Kendall’s Tau, and Spearman’s Rho correlation coefficients. We generate multiple EQ variants using methods such as RM3, SPL, CEQE, and Log-Logistic, and compare them to the constructed IEQs. The main findings are: (i) Using Logistic Regression to classify relevant and non-relevant documents, and then using the trained model’s coefficients as expanded query weights, resulted in very high MAP. This suggests that the method can potentially be used to generate IEQs. One benefit is that IEQ generation is very fast, taking only a few seconds per query. (ii) There are multiple IEQs for any query. Different IEQs can be constructed via the procedures described in the dissertation, such as the IEQ0 and IEQ1 constructions. Pruning also modifies the input IEQ, producing a different IEQ. (iii) Only moderate correlations were observed between the AP achieved by an EQ and its similarity to an IEQ. The highest correlation was a Pearson correlation coefficient of 0.4858, obtained using 𝐿2 similarity between IEQs (from IEQ0) and EQs. (iv) As a general trend, the AP achieved by an IEQ (generated by any method) decreased as the number of relevant documents for the query increased. (v) Pruning removed unhelpful terms from the IEQ and resulted in much shorter IEQs. There was a strong positive correlation between the number of relevant documents and the number of terms in the pruned IEQ. The codebase for this project is located at: https://github.com/mrishu/py-qe-explain	en_US
dc.identifier.citation	32p.	en_US
dc.identifier.uri	http://hdl.handle.net/10263/7576
dc.language.iso	en	en_US
dc.publisher	Indian Statistical Institute, Kolkata	en_US
dc.relation.ispartofseries	MTech(CS) Dissertation;23-01
dc.subject	Query Expansion	en_US
dc.title	Explaining Query Expansion Algorithms	en_US
dc.type	Other	en_US
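
Components (ii) and (iii) of the abstract can be illustrated with a minimal Python sketch. It is not taken from the dissertation's codebase. It assumes that queries are represented as sparse term-to-weight dictionaries, and the function names, example terms, and AP values are all hypothetical and chosen only for illustration. It shows plain cosine similarity, the Jaccard index over term sets, and a Pearson correlation between similarity-to-IEQ and AP.

```python
import math


def cosine_similarity(q1, q2):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * q2.get(t, 0.0) for t, w in q1.items())
    norm1 = math.sqrt(sum(w * w for w in q1.values()))
    norm2 = math.sqrt(sum(w * w for w in q2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty or all-zero queries
    return dot / (norm1 * norm2)


def jaccard_index(q1, q2):
    """Jaccard index over the term sets of two queries (weights ignored)."""
    terms1, terms2 = set(q1), set(q2)
    union = terms1 | terms2
    if not union:
        return 0.0
    return len(terms1 & terms2) / len(union)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0.0 or sd_y == 0.0:
        return float("nan")  # undefined when either sequence is constant
    return cov / (sd_x * sd_y)


# Toy data (hypothetical, not results from the dissertation).
ieq = {"jaguar": 0.5, "car": 0.3, "engine": 0.2}
eq_variants = [
    {"jaguar": 0.6, "car": 0.4},
    {"jaguar": 0.5, "cat": 0.5},
    {"animal": 1.0},
]
ap_values = [0.8, 0.4, 0.1]  # made-up AP for each EQ variant

similarities = [cosine_similarity(eq, ieq) for eq in eq_variants]
print("similarity to IEQ:", similarities)
print("Pearson(similarity, AP):", pearson(similarities, ap_values))
```

Under the hypothesis stated in the abstract, EQ variants that are more similar to the IEQ would tend to have higher AP, which would show up as a positive correlation. Kendall’s Tau and Spearman’s Rho could be substituted for `pearson` in the same way.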

Files

Original bundle

Name: aditya_dutta-cs2301-dissertation.pdf
Size: 3.35 MB
Format: Adobe Portable Document Format
Description: Dissertations - M Tech (CS)

Name: aditya_dutta-cs2301-dissertation_plagiarism_report.pdf
Size: 3.46 MB
Format: Adobe Portable Document Format
Description: Plagiarism_report

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed to upon submission
Description:

Collections

Dissertations - M Tech (CS)