Explaining Query Expansion Algorithms

dc.contributor.author: Dutta, Aditya
dc.date.accessioned: 2025-07-17T09:29:44Z
dc.date.available: 2025-07-17T09:29:44Z
dc.date.issued: 2025-06
dc.description: Dissertation under the supervision of Prof. Mandar Mitra. [en_US]
dc.description.abstract: Query Expansion (QE) techniques aim to mitigate vocabulary mismatch in Information Retrieval by augmenting user queries with related terms. However, their effectiveness varies across queries. This work investigates the explainability of QE by leveraging the concept of an Ideal Expanded Query (IEQ): a hypothetical query yielding near-perfect retrieval performance, measured via Average Precision (AP). We hypothesize that the closer an Expanded Query (EQ) variant is to the IEQ, the higher its AP. Our approach consists of three major components: (i) generating an IEQ, (ii) measuring the similarity between an EQ and an IEQ, and (iii) computing the correlation between these similarities and the corresponding AP values. For component (i), we use Oracle Rocchio tuning and Logistic Regression to construct IEQs. For component (ii), we use four similarity measures: cosine similarity, a modified cosine similarity normalized by the L1 norm, the Jaccard index, and a modified nDCG similarity. For component (iii), we use the Pearson, Kendall’s Tau, and Spearman’s Rho correlation coefficients. We generate multiple EQ variants using methods such as RM3, SPL, CEQE, and Log-Logistic, and compare them to the constructed IEQs. The main findings are: (i) Training a Logistic Regression model to classify relevant and non-relevant documents, and then using the trained model’s coefficients as expanded query weights, resulted in very high MAP, which suggests that this method can be used to generate IEQs. A further benefit is that IEQ generation is fast, taking only a few seconds per query. (ii) There are multiple IEQs for any query. Different IEQs can be constructed via the procedures described in the dissertation, such as the IEQ0 and IEQ1 constructions; pruning also modifies the input IEQ, producing a different IEQ. (iii) Only moderate correlations were observed between the AP achieved by an EQ and its similarity to an IEQ. The highest was a Pearson correlation coefficient of 0.4858, obtained using L2 similarity between IEQs (from IEQ0) and EQs. (iv) As a general trend, the AP achieved by an IEQ (generated by any method) decreased as the number of relevant documents for the query increased. (v) Pruning removed unhelpful terms from the IEQ and resulted in much shorter IEQs. There was a strong positive correlation between the number of relevant documents and the number of terms in the pruned IEQ. The codebase for this project is located at: https://github.com/mrishu/py-qe-explain [en_US]
dc.identifier.citation: 32p. [en_US]
dc.identifier.uri: http://hdl.handle.net/10263/7576
dc.language.iso: en [en_US]
dc.publisher: Indian Statistical Institute, Kolkata [en_US]
dc.relation.ispartofseries: MTech(CS) Dissertation;23-01
dc.subject: Query Expansion [en_US]
dc.title: Explaining Query Expansion Algorithms [en_US]
dc.type: Other [en_US]
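
Components (ii) and (iii) of the abstract can be illustrated with a minimal sketch. It assumes that an EQ and an IEQ are each represented as a mapping from term to weight, which the abstract does not state explicitly. It covers only the standard cosine similarity, Jaccard index, and Pearson coefficient; the modified cosine and modified nDCG measures are not defined in the abstract and are left out. All function names and the toy values are hypothetical and are not taken from the dissertation or its codebase.

```python
import math


def cosine_similarity(eq, ieq):
    """Cosine similarity between two queries given as {term: weight} dicts."""
    dot = sum(w * ieq.get(t, 0.0) for t, w in eq.items())
    norm_eq = math.sqrt(sum(w * w for w in eq.values()))
    norm_ieq = math.sqrt(sum(w * w for w in ieq.values()))
    if norm_eq == 0.0 or norm_ieq == 0.0:
        return 0.0  # convention chosen here for empty or all-zero queries
    return dot / (norm_eq * norm_ieq)


def jaccard_index(eq, ieq):
    """Jaccard index over the term sets of two queries (weights ignored)."""
    terms_eq, terms_ieq = set(eq), set(ieq)
    union = terms_eq | terms_ieq
    if not union:
        return 0.0
    return len(terms_eq & terms_ieq) / len(union)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy, made-up values for illustration only (not results from the work).
    ieq = {"retrieval": 0.6, "query": 0.3, "expansion": 0.1}
    eq_variants = [
        {"retrieval": 0.5, "query": 0.5},
        {"query": 0.7, "term": 0.3},
        {"weather": 1.0},
    ]
    ap_values = [0.42, 0.25, 0.05]  # hypothetical AP of each EQ variant

    sims = [cosine_similarity(eq, ieq) for eq in eq_variants]
    print("cosine similarities:", sims)
    print("Pearson(similarity, AP):", pearson(sims, ap_values))
```

Representing queries as sparse dictionaries keeps the sketch independent of any retrieval toolkit; rank correlations such as Kendall’s Tau and Spearman’s Rho would typically come from a statistics library rather than being written by hand.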

Files

Original bundle

Name: aditya_dutta-cs2301-dissertation.pdf
Size: 3.35 MB
Format: Adobe Portable Document Format
Description: Dissertations - M Tech (CS)

Name: aditya_dutta-cs2301-dissertation_plagiarism_report.pdf
Size: 3.46 MB
Format: Adobe Portable Document Format
Description: Plagiarism_report

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description: