Explaining Query Expansion Algorithms
Date
2025-06
Publisher
Indian Statistical Institute, Kolkata
Abstract
Query Expansion (QE) techniques aim to mitigate vocabulary mismatch in Information Retrieval by
augmenting user queries with related terms. However, their effectiveness varies across queries. This
work investigates the explainability of QE by leveraging the concept of an Ideal Expanded Query
(IEQ): a hypothetical query yielding near-perfect retrieval performance, measured via Average
Precision (AP). We hypothesize that the closer an Expanded Query (EQ) variant is to the IEQ, the
higher its AP.
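Since AP is the yardstick used throughout, a minimal sketch of how it is commonly computed for a single ranked list may help. The function name and the toy document identifiers are illustrative assumptions and are not taken from the dissertation's codebase.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average Precision of one ranked list.

    Precision is taken at every rank that holds a relevant document, and the
    sum is divided by the total number of relevant documents, so relevant
    documents that are never retrieved contribute zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Hypothetical toy ranking: two of the three relevant documents are retrieved.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d3", "d2", "d9"}
ap = average_precision(ranking, relevant)
```

Under this definition, an IEQ with near-perfect AP is one that ranks almost all relevant documents ahead of all non-relevant ones.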
Our approach consists of three major components:
(i) generating an IEQ,
(ii) measuring the similarity between an EQ and an IEQ, and
(iii) computing the correlation between the above similarities and the corresponding AP values.
For component (i), we use Oracle Rocchio tuning and Logistic Regression to construct IEQs.
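The Logistic Regression route can be illustrated with a small, self-contained sketch: fit a classifier that separates relevant from non-relevant documents, then read the learned per-term coefficients as query weights. Everything below is an assumption made for illustration (the toy documents, the plain gradient-descent trainer, the hyperparameters, and the choice to keep only positive coefficients); the dissertation's actual procedure may differ.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(docs, labels, epochs=200, lr=0.5, l2=0.01):
    """Fit one coefficient per term with batch gradient descent.

    docs   -- list of {term: weight} dicts (hypothetical toy representation)
    labels -- 1 for relevant, 0 for non-relevant
    """
    vocab = sorted({term for doc in docs for term in doc})
    weights = {term: 0.0 for term in vocab}
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad_w = {term: 0.0 for term in vocab}
        grad_b = 0.0
        for doc, label in zip(docs, labels):
            z = bias + sum(weights[t] * x for t, x in doc.items())
            err = sigmoid(z) - label
            for t, x in doc.items():
                grad_w[t] += err * x
            grad_b += err
        for t in vocab:
            # L2 penalty keeps the coefficients bounded on separable data.
            weights[t] -= lr * (grad_w[t] / n + l2 * weights[t])
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical judged documents for one query (term frequencies).
docs = [
    {"jaguar": 2, "car": 1},
    {"car": 2, "engine": 1},
    {"jaguar": 1, "cat": 2},
    {"cat": 1, "zoo": 2},
]
labels = [1, 1, 0, 0]

weights, bias = train_logistic_regression(docs, labels)

# Assumption: only positively weighted terms are kept as expanded query terms.
ieq = {term: w for term, w in weights.items() if w > 0}
```

Terms that occur only in relevant documents end up with positive coefficients and terms that occur only in non-relevant documents with negative ones, which is what makes the coefficients usable as query term weights.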
For component (ii), the following similarity measures were used:
• Cosine similarity metric,
• Modified cosine similarity normalized by 𝐿1 norm,
• Jaccard index, and
• a modified nDCG similarity.
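As a rough illustration of the two standard measures in this list, the sketch below treats a query as a mapping from terms to weights. The function names and toy queries are assumptions; the 𝐿1-normalized cosine variant and the modified nDCG similarity are not defined in this abstract, so they are not sketched here.

```python
import math


def cosine_similarity(q1, q2):
    """Cosine of the angle between two {term: weight} vectors."""
    dot = sum(w * q2.get(term, 0.0) for term, w in q1.items())
    norm1 = math.sqrt(sum(w * w for w in q1.values()))
    norm2 = math.sqrt(sum(w * w for w in q2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def jaccard_index(q1, q2):
    """Overlap of the two term sets, ignoring weights."""
    terms1, terms2 = set(q1), set(q2)
    union = terms1 | terms2
    if not union:
        return 0.0
    return len(terms1 & terms2) / len(union)


# Hypothetical weighted queries.
ieq = {"car": 0.8, "engine": 0.6}
eq = {"car": 0.6, "jaguar": 0.8}

cos_sim = cosine_similarity(ieq, eq)
jac_sim = jaccard_index(ieq, eq)
```

Cosine similarity is sensitive to term weights, while the Jaccard index only looks at which terms the two queries share.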
For component (iii), we use Pearson, Kendall’s Tau and Spearman’s Rho correlation coefficients.
We generate multiple EQ variants using methods such as RM3, SPL, CEQE, and Log-Logistic, and
compare them to the constructed IEQs.
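The correlation step in component (iii) can be sketched as follows, given one similarity value and one AP value per EQ variant. The implementations are plain-Python illustrations, the Kendall variant shown is tau-a (it assumes no ties), and the numbers are invented toy values, not results from the dissertation.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs


# Invented toy values: similarity of each EQ variant to an IEQ, and its AP.
similarities = [0.1, 0.2, 0.3, 0.4]
ap_values = [0.2, 0.1, 0.3, 0.4]

r = pearson(similarities, ap_values)
rho = spearman(similarities, ap_values)
tau = kendall_tau_a(similarities, ap_values)
```

Pearson measures linear association between similarity and AP, whereas Spearman and Kendall only depend on how the two orderings of the EQ variants agree.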
The main findings are:
(i) Training a Logistic Regression model to classify relevant and non-relevant documents, and
then using the trained model’s coefficients as expanded query weights, resulted in very high MAP.
This suggests that the method can potentially be used to generate IEQs. A further benefit is
speed: IEQ generation takes only a few seconds per query.
(ii) There are multiple IEQs for any query. Different IEQs can be constructed via the different
procedures described in the dissertation, such as the IEQ0 and IEQ1 constructions. Pruning also
modifies the input IEQ, producing a different IEQ.
(iii) Only moderate correlations were observed between the AP achieved by an EQ and its
similarity to an IEQ. The highest value was a Pearson correlation coefficient of 0.4858,
obtained using 𝐿2 similarity between IEQs (from IEQ0) and EQs.
(iv) As a general trend, the AP achieved by an IEQ (generated by any method) decreased as the
number of relevant documents for the query increased.
(v) Pruning removed unhelpful terms from the IEQ and resulted in much shorter IEQs. A strong
positive correlation was observed between the number of relevant documents and the number of
terms in the pruned IEQ.
The codebase for this project is located at: https://github.com/mrishu/py-qe-explain
Description
Dissertation under the supervision of Prof. Mandar Mitra.
Keywords
Query Expansion
Citation
32p.
