Explaining Query Expansion Algorithms

Date

2025-06

Publisher

Indian Statistical Institute, Kolkata

Abstract

Query Expansion (QE) techniques aim to mitigate vocabulary mismatch in Information Retrieval by augmenting user queries with related terms. However, their effectiveness varies across queries. This work investigates the explainability of QE by leveraging the concept of an Ideal Expanded Query (IEQ): a hypothetical query yielding near-perfect retrieval performance, measured via Average Precision (AP). We hypothesize that the closer an Expanded Query (EQ) variant is to the IEQ, the higher its AP.

Our approach consists of three major components: (i) generating an IEQ, (ii) measuring the similarity between an EQ and an IEQ, and (iii) computing the correlation between these similarities and the corresponding AP values. For component (i), we use Oracle Rocchio tuning and Logistic Regression to construct IEQs. For component (ii), we use the following similarity measures:

• cosine similarity,
• modified cosine similarity normalized by the 𝐿1 norm,
• Jaccard index, and
• a modified nDCG similarity.

For component (iii), we use the Pearson, Kendall’s Tau and Spearman’s Rho correlation coefficients. We generate multiple EQ variants using methods such as RM3, SPL, CEQE, and Log-Logistic, and compare them to the constructed IEQs.

The main findings are:

(i) Training a Logistic Regression model to classify relevant and non-relevant documents, and then using the trained model’s coefficients as expanded query weights, resulted in very high MAP. This suggests that the method can potentially be used to generate IEQs. One benefit is that IEQ generation is very fast, taking only a few seconds per query.

(ii) There are multiple IEQs for any query. Different IEQs can be constructed via the procedures described in the dissertation, such as the IEQ0 and IEQ1 constructions. Pruning also modifies the input IEQ, producing a different IEQ.

(iii) Only moderate correlations were observed between the AP achieved by an EQ and its similarity to an IEQ. The highest correlation observed was a Pearson correlation coefficient of 0.4858, obtained using 𝐿2 similarity between IEQs (from IEQ0) and EQs.

(iv) As a general trend, the AP achieved by an IEQ (generated by any method) decreased as the number of relevant documents for the query increased.

(v) Pruning removed unhelpful terms from the IEQ and resulted in much shorter IEQs. There was a strong positive correlation between the number of relevant documents and the number of terms in the pruned IEQ.

The codebase for this project is located at: https://github.com/mrishu/py-qe-explain
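
Components (ii) and (iii) of the approach can be illustrated with a minimal sketch. It is not taken from the project codebase: the function names, the representation of a query as a term-to-weight dictionary, and the toy data are all assumptions made for illustration. Only plain cosine similarity, the Jaccard index and the Pearson coefficient are shown; the modified cosine and modified nDCG measures are defined in the dissertation and are not reproduced here.

```python
import math

# Assumption: an expanded query is a dict mapping each term to its weight.

def cosine_similarity(eq, ieq):
    """Cosine of the angle between two term-weight vectors."""
    shared = set(eq) & set(ieq)
    dot = sum(eq[t] * ieq[t] for t in shared)
    norm_eq = math.sqrt(sum(w * w for w in eq.values()))
    norm_ieq = math.sqrt(sum(w * w for w in ieq.values()))
    if norm_eq == 0 or norm_ieq == 0:
        return 0.0
    return dot / (norm_eq * norm_ieq)

def jaccard_index(eq, ieq):
    """Overlap of the two term sets, ignoring weights."""
    a, b = set(eq), set(ieq)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # correlation is undefined for constant input; 0.0 is a convention here
    return cov / (sd_x * sd_y)

# Toy data, invented for illustration only (not results from the dissertation).
ieq = {"retrieval": 0.9, "query": 0.7, "expansion": 0.5}
eq_variants = [
    {"retrieval": 0.8, "query": 0.6},
    {"query": 0.4, "search": 0.9},
    {"engine": 1.0},
]
ap_values = [0.62, 0.35, 0.10]  # hypothetical AP of each EQ variant

similarities = [cosine_similarity(eq, ieq) for eq in eq_variants]
correlation = pearson(similarities, ap_values)
```

Under the hypothesis stated above, `correlation` would be expected to be positive when EQ variants closer to the IEQ achieve higher AP. Library routines for Kendall’s Tau and Spearman’s Rho could be substituted for the hand-written Pearson function.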

Description

Dissertation under the supervision of Prof. Mandar Mitra.

Keywords

Query Expansion

Citation

32p.
