Query Dependent Pseudo-Relevance Feedback based on Wikipedia

Yang Xu
Institute of Computing Technology
Chinese Academy of Sciences
Beijing, 100190, P.R. China
[email protected]

Gareth J. F. Jones
Dublin City University
Glasnevin, Dublin 9
Ireland
[email protected]

Bin Wang
Institute of Computing Technology
Chinese Academy of Sciences
Beijing, 100190, P.R. China
[email protected]
ABSTRACT
Pseudo-relevance feedback (PRF) via query-expansion has been proven to be effective in many information retrieval (IR) tasks. In most existing work, the top-ranked documents from an initial search are assumed to be relevant and used for PRF. One problem with this approach is that one or more of the top retrieved documents may be non-relevant, which can introduce noise into the feedback process. Besides, existing methods generally do not take into account the significantly different types of queries that are often entered into an IR system. Intuitively, Wikipedia can be seen as a large, manually edited document collection which could be exploited to improve document retrieval effectiveness within PRF. It is not obvious how we might best utilize information from Wikipedia in PRF, and to date, the potential of Wikipedia for this task has been largely unexplored. In our work, we present a systematic exploration of the utilization of Wikipedia in PRF for query dependent expansion. Specifically, we classify TREC topics into three categories based on Wikipedia: 1) entity queries, 2) ambiguous queries, and 3) broader queries. We propose and study the effectiveness of three methods for expansion term selection, each modeling the Wikipedia based pseudo-relevance information from a different perspective. We incorporate the expansion terms into the original query and use language modeling IR to evaluate these methods. Experiments on four TREC test collections, including the large web collection GOV2, show that retrieval performance of each type of query can be improved. In addition, we demonstrate that the proposed method outperforms the baseline relevance model in terms of precision and robustness.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search and Retrieval - search process, query formulation

General Terms
Algorithms, Design, Experimentation, Performance

Keywords
Information Retrieval, Entity, Query Expansion, Pseudo-relevance feedback, Wikipedia

SIGIR'09, July 19–23, 2009, Boston, Massachusetts, USA.
1. INTRODUCTION
One of the fundamental problems of information retrieval
(IR) is to search for documents that satisfy a user’s information need. Often a query is too short to describe the
specific information need clearly. For a long time query expansion has been a focus for researchers because it has the
potential to enhance IR effectiveness by adding useful terms
that can help discriminate relevant documents from irrelevant ones. Among query expansion methods, pseudo-relevance feedback (PRF) is attractive because it requires no user input [21][2][31][3]. PRF assumes that the top-ranked
documents in the initial retrieval are relevant. However, this
assumption is often invalid [3] which can result in a negative
impact on PRF performance. Meanwhile, as the volume of data on the web grows, other resources have emerged which can potentially supplement the initial search in PRF, e.g. Wikipedia.
Wikipedia is a free online encyclopedia edited collaboratively by large numbers of volunteers (web users). The
exponential growth and the reliability of Wikipedia make
it a potentially valuable resource for IR [33]. The aim of this study is to explore the possible utility of Wikipedia as a resource for improving IR effectiveness within PRF. The basic entry in
Wikipedia is an entity page, which is an article that contains
information focusing on one single entity. Wikipedia covers
a great many topics, and it might reasonably be assumed to
reflect the diverse interests and information needs of users
of many search engines [28]. With the help of enriched text,
we can expect to bridge the gap between the large volume
of information on the web and the simple queries issued by
users. However, few studies have directly examined whether
Wikipedia, especially the internal structures of Wikipedia
articles, can indeed help in IR systems.
As far as we are aware, little work has been done investigating the impact of Wikipedia on different types of queries. In this paper, we propose a query-dependent method for using PRF for query expansion, on the basis of Wikipedia. Given
a query, we categorize it into one of three types: 1) query
about a specific entity (EQ); 2) ambiguous query (AQ); 3)
broader query (BQ). Pseudo relevant documents are generated in two ways according to the query type: 1) using top
ranked articles from Wikipedia retrieved in response to the
query, and 2) using Wikipedia entity pages corresponding
to queries. In selecting expansion terms, term distributions
and structures of Wikipedia pages are taken into account.
We propose and compare a supervised method and an unsupervised method for this task. Based on these methods, we
evaluate the effect of Wikipedia on PRF for IR. Our experiments show that different methods have different impacts on the three types of queries. Thus query dependent query expansion is necessary to optimally benefit retrieval performance.
The contributions of this work are as follows: 1) we thoroughly evaluate the potential of Wikipedia for IR as a resource for PRF, 2) we explore the use of Wikipedia as an
entity repository as well as its internal structure for retrieval,
and based on these two aspects, different methods for selecting expansion terms are proposed and compared, and 3) our
method is conducted in a query dependent way, which is
more effective and robust than a single method.
The remainder of this paper is organized as follows: Section 2 provides a detailed account of related work, Section
3 introduces query categorization based on Wikipedia, Section 4 describes our proposed methods for using Wikipedia
as pseudo relevant documents, experimental results are reported in Section 5, and we conclude in Section 6.
2. RELATED WORK
Automatic query expansion is a widely used technique in
IR. PRF has been shown to be an effective way of improving
retrieval accuracy by reformulating an original query using
pseudo-relevant documents from the initial retrieval result
[27, 18, 34]. Traditional PRF algorithms such as Okapi [27],
Lavrenko and Croft’s relevance model [18], and Zhai and
Lafferty’s mixture model [34] are based on the assumption
that the top-ranked documents from an initial search are
relevant to the query.
A large amount of research has been reported on attempts
to improve traditional PRF, e.g. using latent concepts [21],
query-regularized estimation [29], additional features other
than term distributions [3], and a clustered-based re-sampling
method for generating pseudo-relevant documents [19].
On the other hand, there has been work on selective query
expansion which aims to improve query expansion with decision mechanisms based on the characteristics of queries
[11, 6, 32, 13]. These methods share a similar idea, that is,
to disable query expansion if the query is predicted to perform poorly. Other relevant work is that of He and Ounis
who proposed selecting an appropriate collection resource
for query expansion [13].
Cui et al. proposed a query expansion approach by mining
user logs [12]. They extracted correlations between query
terms and document terms by analyzing user logs, and then
used these to select expansion terms for new queries.
A concept-based interactive query expansion method was
suggested by Fonseca et al. also using query logs [10]. Association rules are applied to identify concepts related to the
current query from the query context derived from user logs.
These concepts then provide suggestions to refine the vague
(short) query.
Kwok and Chan [17] studied the idea of using an external
resource for query expansion. They found that query expansion failure can be caused by the lack of relevant documents
in the local collection. Therefore, the performance of query
expansion can be improved by using a large external collection. Several external collection enrichment approaches
have been proposed [12, 13, 10, 5]. Our work follows this
strategy of a query expansion approach using an external
collection as a source of query expansion terms.
Recently, Wikipedia [30] has emerged as a potentially important resource for IR, and a number of studies have been reported which adopt Wikipedia to assist query expansion; many of these have appeared in the context of the TREC Blog track [24, 35, 7, 8, 16].
One interesting example of the use of Wikipedia in this
way is the work of Li et al. [20] who proposed using it
for query expansion by utilizing the category assignments of
Wikipedia articles. The base query is run against a Wikipedia
collection and each category is assigned a weight proportional to the number of top-ranked articles assigned to it.
Articles are then re-ranked based on the sum of the weights
of the categories to which each belongs. The method shows
improvement over PRF in measures favoring weak queries.
A thesaurus-based query expansion method using Wikipedia
was proposed by Milne, Witten and Nichols [23]. The thesaurus is derived with criteria such that topics relevant to
the document collection are included. They propose to extract significant topics from a query by checking consecutive
sequences of words in the query against the thesaurus. However, query dependent knowledge is not taken into consideration by the thesaurus. Elsas et al. investigated link-based
query expansion using Wikipedia in [7]. They focused on anchor text and proposed a phrase scoring function. Our work
differs from these methods in that expansion terms are not selected directly from the documents obtained by running the base query on Wikipedia. Instead, Wikipedia entity pages are viewed as a set of pseudo-relevant documents
tailored to the specific query.
Similar to the query log based expansion methods, our
approach can also reflect the preference of the majority of
the users. However, our work differs from these methods
in that we try to narrow the gap between the large volume
of information on the web and the simple queries issued by
users, by mapping queries to Wikipedia entity pages which
reflect knowledge underlying the query, rather than using a
simple bag-of-words query model.
3. QUERY CATEGORIZATION
In this section we briefly summarize the relevant features
of Wikipedia for our study, and then examine the different
categories of queries that are typically encountered in user
queries and how these can be related to the properties of
Wikipedia for effective query expansion.
3.1 Wikipedia Data Set
Wikipedia is a free online encyclopedia which has detailed
guidelines governing its content quality and format. Unlike
traditional encyclopedias, Wikipedia articles are constantly
being updated and new entries are created everyday. These
characteristics make it an ideal repository of background
knowledge for query expansion.
Wikipedia is organized as one article per topic. Each article thus effectively summarizes the most important information on its topic. A topic in Wikipedia has a distinct, separate existence, often referring to a specific entity such as a person, a place, an organization, or a miscellaneous subject. In addition, important information for the topic of a given article may also be found in other Wikipedia articles [9]. An analysis of 200 randomly selected articles describing a single topic (not including redirect pages and disambiguation pages, described below) showed that only 4% failed to adhere to a standard format. Based on our analysis of Wikipedia page structure, in our work we divide Wikipedia articles into six fields as shown in Table 1.

Table 1: Fields of Wikipedia Articles
Field      Description
Title      Unique identifier for the topic
Overview   Lead section, summary of the topic
Content    Grouped by sections
Category   At least one for each topic
Appendix   References, notes, external reading, etc.
Links      Cross-article hyperlinks specified by topics
In addition to topic pages, Wikipedia contains “redirect”
pages which provide alternative ways of referring to a topic:
e.g. the page Einstein redirects to the article Albert Einstein
which contains information about the physicist. Another
useful structure of Wikipedia is its so-called “disambiguation pages” which list the referents of ambiguous words and
phrases that denote two or more topics in Wikipedia.
As an open source project, Wikipedia is obtainable via
download from http://download.wikipedia.org, which is in
the form of database dumps that are released periodically.
We downloaded the English Wikipedia dump of January,
2008. Wikipedia articles are stored in their own markup
language called Wikitext which preserves features, e.g. categories, hyperlinks, subsections, tables, pictures. We index
all the pages using the Indri search engine [14]. Preprocessing includes removing Wikipedia pages that are used for management purposes, e.g. those with "Wikipedia" in the title.
Wikipedia articles are stemmed using Porter stemming.
3.2 Query Categorization
Queries in web search may vary widely in their semantics and the user intentions they represent, in the numbers of relevant
documents they have in the document repository, and in
numbers of entities they involve. In this work, we define
three types of queries according to their relationship with
Wikipedia topics : 1) queries about a specific entity (EQ),
2) ambiguous queries (AQ), and 3) broader queries (BQ).
By EQ, we mean queries that have a specific meaning and
cover a narrow topic, e.g. “Scalable Vector Machine”. By
AQ, we mean queries with terms having more than one potential meaning, e.g. “Apple”. Finally, we denote the rest
of the queries to be BQ, because these queries are neither
ambiguous nor focused on a specific entity. For the first two
types of queries, a corresponding Wikipedia entity page can
be determined for the query. For EQ, the corresponding
entity page is the page with the same title field as the query.
For AQ, a disambiguation process is needed to determine
its sense. For all three types of queries, a set of top ranked
retrieved documents is obtained from an initial search.
Our method automatically categorizes queries, with the
help of titles of Wikipedia articles. Of particular interest are
the titles of entity pages, redirect pages and disambiguation
pages. Queries exactly matching one title of an entity page
or a redirect page will be classified as EQ. Thus EQ can
be mapped directly to the entity page with the same title.
Note that queries with both an entity page and a disambiguation page will be counted as EQ, because the existing entity page indicates that there is consensus on a dominant sense for the word or phrase, e.g. "Piracy" and "Poaching".
To identify AQ, we look for queries with terms in the
ambiguous list of terms/phrases (derived from extracting
all the titles of disambiguation pages). All other queries are
then classified as BQ. Though the categorization process
is simple, we show in later experiments that the results are
helpful for retrieval performance enhancement.
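For illustration, a minimal sketch of this categorization rule is given below. The pre-extracted title sets and the function name are our own constructs, not part of the paper's implementation:

```python
def categorize_query(query, entity_titles, redirect_titles, disambig_titles):
    """Classify a query as EQ, AQ, or BQ using sets of Wikipedia page titles.

    The three title sets are assumed to be pre-extracted and normalized
    (e.g. lower-cased) from entity pages, redirect pages, and
    disambiguation pages respectively.
    """
    q = query.strip().lower()
    # Exact match of the whole query against an entity or redirect title -> EQ,
    # even if a disambiguation page also exists for the same string.
    if q in entity_titles or q in redirect_titles:
        return "EQ"
    # Otherwise, the query (or one of its terms) matching a disambiguation
    # page title marks it as ambiguous -> AQ.
    if q in disambig_titles or any(t in disambig_titles for t in q.split()):
        return "AQ"
    # Everything else is treated as a broader query -> BQ.
    return "BQ"
```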
3.2.1 Strategy for Ambiguous Queries
We now explain our solution to the problem of mapping
ambiguous queries to a specific entity page. In previous
research on query ambiguity, one assumption is that when
an ambiguous query is submitted, the person that initiates
the search knows the intended sense of each ambiguous word
[28]. Our solution is also based on this assumption.
Given an ambiguous query, we first obtain the top ranked
100 documents from the target collection to be searched
using the query likelihood retrieval model. Then K-means
clustering is applied [15] to cluster these documents. Each
document is represented by tf ∗ idf weighting and cosine
normalization. Cosine similarity is used to compute the similarities among the top documents.
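As a sketch of this step, assuming scikit-learn is used for the tf*idf weighting and k-means (the paper does not specify an implementation or the number of clusters):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_feedback_docs(doc_texts, n_clusters=5):
    """Cluster the top-ranked documents using tf*idf vectors and k-means.

    doc_texts: raw text of the top ~100 documents from the initial search.
    n_clusters is illustrative only; the paper does not report the value used.
    """
    # tf*idf weighting with cosine (L2) normalization, as described above.
    vectorizer = TfidfVectorizer(norm="l2", stop_words="english")
    X = vectorizer.fit_transform(doc_texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    clusters = {}
    for doc_id, label in enumerate(labels):
        clusters.setdefault(label, []).append(doc_id)
    return clusters, X, vectorizer
```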
After clustering, we rank these clusters by a cluster-based language model, as proposed by Lee et al. [19], given as follows:

P(Q|Clu) = \prod_{i=1}^{m} P(q_i|Clu)

where P(w|Clu) is specified by the cluster-based language model

P(w|Clu) = \frac{|Clu|}{|Clu| + \lambda} P_{ML}(w|Clu) + \frac{\lambda}{|Clu| + \lambda} P_{ML}(w|Coll)

where P_{ML}(w|Clu) is the maximum likelihood estimate of word w in the cluster and P_{ML}(w|Coll) is the maximum likelihood estimate of word w in the collection. The smoothing parameter \lambda is learned using training topics on each collection in experiments. P_{ML}(w|Clu) and P_{ML}(w|Coll) are specified as follows:

P_{ML}(w|Clu) = \frac{freq(w, Clu)}{|Clu|}, \quad P_{ML}(w|Coll) = \frac{freq(w, Coll)}{|Coll|}

where freq(w, Clu) is the sum of freq(w, D) for the documents D which belong to the cluster Clu, freq(w, D) denotes the frequency of w in D, and freq(w, Coll) is the number of times w occurs in the collection.
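The cluster ranking above can be computed directly in log space; the helper below is a sketch under assumed data structures (term lists per document and collection-level counts), not the paper's code:

```python
import math
from collections import Counter

def cluster_lm_score(query_terms, cluster_docs, collection_tf, collection_len, lam):
    """log P(Q|Clu) under the smoothed cluster language model defined above.

    cluster_docs: list of term lists for the documents in the cluster.
    collection_tf / collection_len: term counts and total length of the collection.
    lam: the smoothing parameter lambda, learned on training topics.
    """
    cluster_tf = Counter(t for doc in cluster_docs for t in doc)
    cluster_len = sum(cluster_tf.values())
    log_p = 0.0
    for q in query_terms:
        p_ml_clu = cluster_tf[q] / cluster_len if cluster_len else 0.0
        p_ml_coll = collection_tf.get(q, 0) / collection_len if collection_len else 0.0
        # |Clu|/(|Clu|+lambda) * P_ML(q|Clu) + lambda/(|Clu|+lambda) * P_ML(q|Coll)
        p = (cluster_len * p_ml_clu + lam * p_ml_coll) / (cluster_len + lam)
        log_p += math.log(p) if p > 0 else float("-inf")
    return log_p
```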
The documents in the top-ranked clusters are used to represent the characteristics of the test collection, in terms of
pseudo relevant documents in response to a specific query.
The top ranked cluster is then compared to all the referents
(entity pages) extracted from the disambiguation page associated with the query. The assumption is that the dominant
sense for the query should have a much greater degree of
matching to the top ranked cluster from the test collection
than other senses. Each document is represented by tf ∗ idf
weighting and the cosine is used to measure the similarity
between one cluster and an entity page. The top matching
entity page is then chosen for the query.
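A sketch of this matching step, reusing the tf*idf vectorizer from the clustering stage and representing the top cluster by its centroid (the centroid representation is our assumption; the paper only states that cosine similarity between the cluster and each entity page is used):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def pick_entity_page(top_cluster_doc_ids, X, vectorizer, referent_texts):
    """Choose the referent entity page most similar to the top-ranked cluster.

    top_cluster_doc_ids: row indices (into X) of the documents in the top cluster.
    X, vectorizer: tf*idf matrix and fitted vectorizer from the clustering step.
    referent_texts: dict mapping candidate entity-page titles (from the query's
        disambiguation page) to their article text.
    """
    # Represent the cluster by the centroid of its documents' tf*idf vectors.
    cluster_vec = np.asarray(X[top_cluster_doc_ids].mean(axis=0))
    titles = list(referent_texts)
    referent_vecs = vectorizer.transform([referent_texts[t] for t in titles])
    sims = cosine_similarity(cluster_vec, referent_vecs)[0]
    return titles[int(sims.argmax())]
```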
3.2.2 Evaluation
In order to evaluate the accuracy of our query categorization process, we used the four sets of TREC topics used in the retrieval experiments reported in Section 5, with five human subjects. These are taken from four different search tasks and comprise a total of 650 queries. Each participant was asked to judge whether a query is ambiguous or not. If it was, the participant was required to determine which referent from the disambiguation page is most likely to be mapped to the query, according to the description of the TREC topic. If it was not, the participant was required to manually search with the query in Wikipedia to identify whether or not it is an entity defined by Wikipedia (EQ). The user study results indicate that for query categorization participants are in general agreement, i.e. 87% in judging whether a query is ambiguous or not. However, when determining which referent a query should be mapped to, there is only 54% agreement.

Table 2 shows the results of the automatic query categorization process. It can be seen from Table 2 that most queries from the TREC topic sets are AQ. That is to say, most of the queries contain ambiguity to some degree. Thus, it is necessary to handle this properly according to query type in query expansion.

Table 2: Numbers of each type of query in the TREC topic sets by automatic query categorization.
Type   AP    Robust   WT10G   GOV2
EQ     21    58       31      41
AQ     108   159      54      98
BQ     21    33       15      11
All    150   250      100     150

To test the effectiveness of the cluster-based disambiguation process, we define that for each query, if there are at least two participants who indicate a referent as the most likely mapping target, this target will be used as the answer. If a query has no answer, it will not be counted in the evaluation process. Experimental results show that our disambiguation process leads to an accuracy of 57% for AQ.

4. QUERY EXPANSION METHODS
In this section, we describe our query expansion methods. Using these methods, we investigate the relationship between the query type and expansion methods. Moreover, we look into how to combine evidence from different fields of Wikipedia articles for query expansion. Essentially these methods differ in their criteria for selecting expansion terms. In each of these methods, query-specific relevance information is considered.

4.1 Relevance Model
A relevance model is a query expansion approach based on the language modeling framework [18]. The relevance model is a multinomial distribution which estimates the likelihood of word w given a query Q. In the model, the query words q_1, ..., q_m and the word w in relevant documents are sampled identically and independently from a distribution R. Thus the probability of a word in the distribution R is estimated as follows:

P(w|R) = \sum_{D \in F} P(D) P(w|D) P(Q|D)

where F is the set of documents that are pseudo-relevant to the query Q. We assume that P(D) is uniform over the set. Based on this estimation, the most likely expansion terms under P(w|R) are chosen for the original query. The final expanded query is combined with the original query using linear interpolation, weighted by a parameter λ:

P(w|Q') = (1 − λ) P(w|Q) + λ P(w|R)

The original relevance model and traditional PRF methods use the top retrieved documents from an initial search as pseudo-relevant documents. The problem is that the top retrieved documents frequently contain non-relevant documents or content, which can introduce noise into the expanded query, resulting in query drift. Our approach introduces Wikipedia articles as pseudo-relevant documents. This may still find non-relevant documents, but we will show that it can enhance the quality of pseudo-relevant documents for query expansion. This method forms the baseline in our experiments.
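For concreteness, the estimation and interpolation above can be sketched as follows. This is a minimal illustration with a uniform document prior; the Dirichlet smoothing inside P(Q|D) and the helper names are our assumptions, not details given in the paper:

```python
from collections import Counter

def relevance_model(query_terms, feedback_docs, collection_tf, collection_len, mu=1500):
    """Estimate P(w|R) = sum_{D in F} P(D) P(w|D) P(Q|D) over pseudo-relevant docs.

    feedback_docs: list of term lists (here, Wikipedia articles used as
    pseudo-relevant documents). P(D) is uniform, following the paper.
    """
    p_w_r = Counter()
    p_d = 1.0 / len(feedback_docs)              # uniform document prior
    for doc in feedback_docs:
        tf, dlen = Counter(doc), len(doc)
        # Query likelihood P(Q|D), here with Dirichlet smoothing (an assumption).
        p_q_d = 1.0
        for q in query_terms:
            p_q_d *= (tf[q] + mu * collection_tf.get(q, 0) / collection_len) / (dlen + mu)
        for w, count in tf.items():
            p_w_r[w] += p_d * (count / dlen) * p_q_d
    return p_w_r

def interpolate(p_w_q, p_w_r, lam=0.6):
    """P(w|Q') = (1 - lambda) P(w|Q) + lambda P(w|R)."""
    vocab = set(p_w_q) | set(p_w_r)
    return {w: (1 - lam) * p_w_q.get(w, 0.0) + lam * p_w_r.get(w, 0.0) for w in vocab}
```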
4.2 Strategy for Entity/Ambiguous Queries
One of the issues for web queries is they are often too short
to clearly express a user’s information need. With the help
of Wikipedia, this issue is expected to be reduced to some
degree. Entity queries are those "good" queries whose sense is given clearly. In contrast, it is harder to find relevant
documents for ambiguous queries. Both EQ and AQ can be
associated with a specific Wikipedia entity page. In this
strategy, instead of considering the top-ranked documents
from the test collection, only the corresponding entity page
from Wikipedia will be used as pseudo-relevant information.
We briefly introduced entity pages in the first part of Section 3.1. An entity page contains the most representative information for the entity, on which most Wikipedia users have reached consensus.
Our strategy first ranks all the terms in the entity page, and then the top K terms are chosen for expansion. The measure
to score each term is defined as: score(t) = tf ∗ idf , where
tf is the term frequency in the entity page. idf is computed
as log(N/df ), where N is the number of documents in the
Wikipedia collection, and df is the number of documents
that contain term t. The measure is simple, yet we will
show in later experiments that it is very effective.
The performance of existing PRF methods is often affected significantly by parameters, e.g. the number of feedback documents used. The proposed method eases this problem by exploiting the fact that Wikipedia contains a single article for each entity, focused on the details of that subject.
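A minimal sketch of this term selection follows; the document-frequency table, the stopword handling, and the value of K are illustrative assumptions:

```python
import math
from collections import Counter

def entity_page_expansion_terms(entity_page_terms, df, n_docs, k=20, stopwords=frozenset()):
    """Rank the terms of the query's entity page by tf*idf and keep the top K.

    entity_page_terms: (stemmed) terms of the Wikipedia entity page.
    df: document frequency of each term in the Wikipedia collection.
    n_docs: number of documents in the Wikipedia collection.
    """
    tf = Counter(t for t in entity_page_terms if t not in stopwords)
    # score(t) = tf * idf, with idf = log(N / df)
    scores = {t: c * math.log(n_docs / df[t]) for t, c in tf.items() if df.get(t)}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```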
4.3 Field Evidence for Query Expansion
Currently, structured documents are becoming increasingly common; in consequence, several studies have been conducted on exploiting field combination for improved IR [13, 26]. Although Wikipedia articles are only semi-structured, our observations show that evidence from the different fields of Wikipedia can be used to improve retrieval, e.g. the importance of a term appearing in the overview may be different from that of its appearance in an appendix.
We examine two methods for utilizing evidence from different fields. The first is similar to that proposed by Robertson et al. for the BM25 model in [26]: we replace the term frequency in a pseudo-relevant document in the original relevance model with a linear combination of weighted field term frequencies. The second method is a supervised learning approach which separates "good" expansion terms from "bad"
ones. Features derived from fields are used by the classifier.
Note that the field based expansion methods are applicable
to all the queries. For EQ and AQ, the pseudo relevant documents can be either a query specific entity page, or just
the same as BQ, i.e. the top ranked entity pages from the
initial search.
Unsupervised Method.
Robertson et al. [26] give a detailed analysis of the disadvantages of linearly combining the scores obtained from scoring every field separately, and recommend a simple weighted-field extension. We use a similar idea in our work to explore the impact of different fields in Wikipedia. By assigning different weights to fields, we modify the P_{ML}(w|D) in the original relevance model to the following form:

P_{ML}(w|D) = \frac{\sum_{f=1}^{K} W_f \cdot TF_f(w, D)}{|D|}

where K (here K = 6) is the number of fields in a document, and \sum_{f=1}^{K} W_f = 1. Parameter tuning is needed for each pair of parameters. We evaluate different weight settings for the fields, shown in the next section.
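A sketch of this field-weighted estimate is given below; the field names and the example weights are illustrative only, since the paper tunes the weights per collection:

```python
def field_weighted_p_ml(term, field_tf, field_weights, doc_len):
    """P_ML(w|D) with field-weighted term frequencies, as in the formula above.

    field_tf: mapping field -> {term: frequency} for one Wikipedia article.
    field_weights: mapping field -> W_f, normalized to sum to 1 over the six fields.
    doc_len: total article length |D|.
    """
    weighted_tf = sum(w * field_tf.get(f, {}).get(term, 0)
                      for f, w in field_weights.items())
    return weighted_tf / doc_len if doc_len else 0.0

# Example (illustrative) weight setting, normalized to sum to 1:
raw = {"title": 1.0, "overview": 1.0, "content": 2.0,
       "category": 1.0, "appendix": 1.0, "links": 2.0}
field_weights = {f: w / sum(raw.values()) for f, w in raw.items()}
```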
Supervised Method (Support Vector Machines).
An alternative way of utilizing field evidence is to transform it into features for supervised learning. The learning process trains a classifier which distinguishes "good" expansion terms from "bad" ones. This method is inspired by the work of Cao et al. [3], who suggest that "good" terms should be identified directly according to their possible impact on retrieval effectiveness.
We use Support Vector Machines (SVMs) [1], a popular supervised learner for tasks such as this, as the classifier. A radial basis function (RBF) kernel SVM with default settings based on LIBSVM [4] is used. Parameters are estimated with 5-fold cross-validation to maximize the classification accuracy on the training data.
In our work, we want to select good expansion terms and re-weight them, so it is important to know the estimated probability of each term belonging to each class. We set LIBSVM to train an SVM model with probability estimates. We compute posterior probabilities from the SVM outputs using the method of [25],

P(+1|x) = \frac{1}{1 + \exp(A \cdot S(x) + B)}

where A and B are parameters and S(x) is the score calculated by the SVM. Each expansion term is represented by a feature vector F(e) = [f_1(e), f_2(e), f_3(e), ..., f_n(e)].
The first group of features is the term distributions in the PRF documents and the collections. Term distributions (TD) have already been used and proven to be useful in traditional approaches. The assumption is that the most useful features for term selection make the largest difference between the feedback documents and the whole collection. The features that we used include: (1) TD in the test collection; (2) TD in PRF (top 10) from the test collection; (3) TD in the Wikipedia collection; and (4) TD in PRF (top 10) from the Wikipedia collection. TD in the PRF documents from Wikipedia is given below; the others can be defined similarly.

f_{prfWiki}(e) = \log \frac{\sum_{D \in F \cap W} tf(e, D)}{\sum_t \sum_{D \in F \cap W} tf(t, D)}

where F is the set of feedback documents and W is the Wikipedia collection.
The second group of features is based on field evidence. As
described in section 3.1, we divide each entity page into six
fields. One feature is defined for each field; this is computed
as follows:

f_{Field_i}(e) = \log( tf_i(e) \cdot idf / fieldLength_i )

where tf_i(e) is the term frequency of e in field i of the entity page, and fieldLength_i is the length of field i.
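Putting the two feature groups together, a feature vector for a candidate term could be assembled as below. The stats container and its keys are our own bookkeeping, not structures defined in the paper; a small epsilon guards the logarithms:

```python
import math

def term_features(term, stats, eps=1e-10):
    """Build the feature vector F(e) for a candidate expansion term (a sketch).

    stats is assumed to hold precomputed counts:
      tf_coll/len_coll, tf_prf/len_prf, tf_wiki/len_wiki, tf_prf_wiki/len_prf_wiki
        - term-frequency tables and total lengths for the four TD features;
      field_tf[i], field_len[i] - per-field counts and lengths of the entity page;
      idf - idf values of terms in the Wikipedia collection.
    """
    td = lambda tf, length: math.log(tf.get(term, 0) / length + eps)
    features = [
        td(stats["tf_coll"], stats["len_coll"]),          # TD in test collection
        td(stats["tf_prf"], stats["len_prf"]),            # TD in PRF (test collection)
        td(stats["tf_wiki"], stats["len_wiki"]),          # TD in Wikipedia
        td(stats["tf_prf_wiki"], stats["len_prf_wiki"]),  # TD in PRF (Wikipedia)
    ]
    for i in range(6):                                    # one feature per field
        tf_i = stats["field_tf"][i].get(term, 0)
        idf = stats["idf"].get(term, 0.0)
        features.append(math.log(tf_i * idf / stats["field_len"][i] + eps))
    return features
```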
Table 3: Term classification results
Collection   AP       Robust   WT10G    GOV2
Accuracy     74.15%   75.30%   72.99%   75.75%
Training and test data are generated using a method similar to Cao et al. [3], that is, to label possible expansion
terms of each query as good terms or non-good terms to see
their impact on retrieval effectiveness. We define good expansion terms as those that improve the effectiveness when
λ_fb = 0.01 and hurt the effectiveness when λ_fb = −0.01.
Terms that do not satisfy these criteria are counted as bad
terms. We now examine the classification quality. We use
four test collections, see Table 4. We divide queries from
the same collection into three equal size groups, and then
perform a leave-one-out cross validation to evaluate classification accuracy, shown in Table 3.
5. EXPERIMENTS
5.1 Experiment Settings
In our experiments, documents are retrieved for a given
query by the query-likelihood language model with Dirichlet
smoothing. We set the Dirichlet prior empirically at 1,500
as recommended in [22]. The experiments were carried out
using the Indri search engine. Indri provides support for
PRF via a modified version of Lavrenko’s relevance model.
We implemented our new methods on top of Indri.
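For reference, the retrieval score used for the initial search can be sketched as the standard Dirichlet-smoothed query likelihood (a textbook formulation, not code from the paper):

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_tf, collection_len, mu=1500):
    """log P(Q|D) with Dirichlet smoothing; mu = 1500 follows the paper's setting."""
    tf, dlen = Counter(doc_terms), len(doc_terms)
    score = 0.0
    for q in query_terms:
        p_coll = collection_tf.get(q, 0) / collection_len
        # P(q|D) = (tf(q, D) + mu * P(q|C)) / (|D| + mu)
        p = (tf[q] + mu * p_coll) / (dlen + mu)
        score += math.log(p) if p > 0 else float("-inf")
    return score
```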
Experiments were conducted using four standard Text Retrieval Conference (TREC) collections: Associated Press, a small homogeneous collection; Robust2004, the test collection for the TREC Robust Track, started in 2003 to focus on poorly performing queries; and two Web collections: the WT10G collection and the large-scale GOV2 collection.
Further details of these collections are given in Table 4.
Table 4: Overview of TREC collections and topics.
Corpus       Size      # of Docs     Topics
AP88-90      0.73GB    242,918       51-200
Robust2004   1.9GB     528,155       301-450 & 601-700
WT10G        11GB      1,692,096     451-550
GOV2         427GB     25,205,179    701-850
Retrieval effectiveness is measured in terms of Mean Average Precision (MAP). Given an initial query Q_orig, the relevance model first retrieves a set of N documents and forms a relevance model from them. It then forms an expanded query Q_RM by adding the top K most likely non-stopword terms from the relevance model. The expanded query is formed with the following structure:

#weight( 1.0 Q_orig  λ_fb Q_RM )
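A sketch of assembling this expanded query as an Indri query string; the inner operators (wrapping the original query in #combine and the expansion terms in a weighted #weight) are our assumption about the structure shown above:

```python
def build_expanded_query(original_terms, expansion_term_weights, lam_fb=0.6):
    """Build an Indri query of the form  #weight( 1.0 Q_orig  lambda_fb Q_RM )."""
    q_orig = "#combine( " + " ".join(original_terms) + " )"
    q_rm = "#weight( " + " ".join(f"{w:.4f} {t}" for t, w in expansion_term_weights) + " )"
    return f"#weight( 1.0 {q_orig} {lam_fb} {q_rm} )"

# Example with illustrative terms and weights:
# build_expanded_query(["poaching"], [("wildlife", 0.12), ("ivory", 0.08)])
```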
5.2 Baselines
The baselines for our experiments are the query-likelihood language model (QL) and the relevance model (RMC). In addition, we also consider the relevance model based on Wikipedia (RMW). Note that our RMW method retrieves the top-ranked N Wikipedia articles that are not "redirect pages". For RMC and RMW we fixed the parameters for the remaining experiments at N = 10, K = 100 and λ = 0.6, for a fair comparison.
Table 5: Performance comparisons using MAP for all the test topics on test collections. * indicate statistically significant improvements over RMC. We use the paired t-test with significance at p<0.05.
Method   AP       Robust    WT10G     GOV2
QL       0.1428   0.2530    0.1831    0.2967
RMC      0.1707   0.2823    0.1969    0.3141
RMW      0.1622   0.2904*   0.2094*   0.3392*
As can be seen from Table 5, Wikipedia is a good resource
for relevance feedback on large collections. Using Wikipedia for generating PRF, RMW delivers performance comparable to that produced by RMC for Robust2004, WT10G and
GOV2. For AP, although RMW does not work as well
as RMC, RMW still improves performance over QL. This
serves as the baseline for our following experiments. We also
believe that the test collections and Wikipedia have their
own advantages. For test collections, the initial search emphasizes characteristics of the collection (e.g. GOV2 consists
of web pages from government web sites), while Wikipedia
appears to be more general in terms of topics.
5.3 Using Entity Pages for Relevance Feedback
We now turn our attention to our proposed method, which utilizes only the entity page corresponding to the query for PRF (RE), as described in Section 4.2. The results are presented in Table 6. Note that in our proposed method, not all the queries
can be mapped to a specific Wikipedia entity page, thus the
method is only applicable to EQ and AQ. Results for EQ
and AQ are shown in Table 6 and Table 7 respectively. Note
that results in Table 7 are based on automatic disambiguation. QL, RMC and RMW in the two tables are the averages
for the same groups of queries.
As can be seen in Table 6 and Table 7, RE outperforms
RMC and RMW on all collections. This indicates that what
entity pages provide is relevant information that is closely
related to the query, but might be missing in the test collection. Thus exploiting an entity page as the sole relevance
feedback source works for both small and large collections.
We also notice that the improvement of RE over RMW is
greater for entity queries than for ambiguous queries. This is
because the accuracy of the automatic disambiguation process is rather low (see Section 3.2.2). If an AQ is not associated with the most relevant entity page, its performance suffers under the RE method.
Table 6: Performance comparisons using MAP for entity queries on test collections. * and ** indicate statistically significant improvements over RMC and RMW, respectively. We use the paired t-test with significance at p<0.05.
Method   AP         Robust     WT10G     GOV2
QL       0.2208     0.3156     0.2749    0.3022
RMC      0.2484     0.3401     0.2458    0.3168
RMW      0.2335     0.3295     0.2821*   0.3453*
RE       0.2494**   0.3580**   0.2897*   0.3889**
Comparing Tables 5, 6 and 7, we can see that entity
queries have better performance than the average, while
ambiguous queries have lower performance than the average. For ambiguous queries, using the top retrieved documents as pseudo relevant documents usually includes more
non-relevant documents than for entity queries. The disambiguation process helps to find the relevant entity page for
ambiguous queries, thus both types of queries can benefit
from the query expansion. Based on these results, refinement of the disambiguation process for AQ could be expected to further improve performance.
Table 7: Performance comparisons using MAP for ambiguous queries on test collections. * and ** indicate statistically significant improvements over RMC and RMW, respectively. We use the paired t-test with significance at p<0.05.
Method   AP         Robust     WT10G      GOV2
QL       0.1391     0.2258     0.1801     0.2892
RMC      0.1520     0.2485     0.1881     0.3101
RMW      0.1588     0.2619*    0.1868     0.3186*
RE       0.1692**   0.2728**   0.2037**   0.3329**
5.4 Field Based Expansion
Inspired by the result that an entity page can indeed help
to improve retrieval performance, we next investigate utilizing field evidence from entity pages. We first present the results of two different term selection methods based on field evidence, and then give a comparative analysis of them.
Note that in our experiments on field based expansion, the
top retrieved documents are considered pseudo relevant documents for EQ and AQ.
For the supervised method, we compare two ways of incorporating expansion terms for retrieval. The first is to add
the top ranked 100 good terms (SL). The second is to add the
top ranked 10 good terms, each given the classification probability as weight (SLW). The relevance model with weighted
TFs is denoted as (RMWTF). From Table 8, we can see
that both methods enhance retrieval performance. Among
the three methods, RMWTF and SLW generate similar results. However, the SLW is subject to the accuracy of term
classification, thus we choose RMWTF for the query dependent method introduced in the next section. Although we do
not give results for BQ for space reasons, our experiments
show that BQ is improved by the field based expansion.
In addition, characteristics of BQ could be investigated so
that pseudo relevant documents could be tailored for BQ,
and the field based expansion still be applied.
Table 8: Supervised Learning vs Linear Combination. * indicates statistically significant improvements over RMC. We use the paired t-test with significance at p<0.05.
Method   AP        Robust    WT10G     GOV2
SL       0.1640    0.2902    0.2107    0.3237
SLW      0.1702    0.2921*   0.2145*   0.3298*
RMWTF    0.1768*   0.2934*   0.2153*   0.3274*

[Figure 1: Performance on different field weights on the GOV2 collection. (MAP on GOV2 vs. field weight, others being 1.0; one curve per field: Title, Overview, Content, Category, Appendix, Links.)]

[Figure 2: Parameter (K) sensitivity of expansion terms over different data sets (AP, Robust2004, WT10G, GOV2), N = 10, λ_fb = 0.6. (MAP vs. number of expansion terms.)]
5.5 Query Dependent Expansion
We now explore a query dependent expansion method. We assign different methods to queries according to their types as follows: RE for EQ and AQ, and RMWTF for BQ. RE is chosen because EQ and AQ benefit more from RE than from RMWTF. For AQ, a disambiguation process is conducted to determine the corresponding entity page. We denote the query dependent method as QD. As can be seen in Table 9, the improvement of QD over RMC is significant across the test collections.

Table 9: Query Dependent vs traditional Relevance Model. IMP is the number of queries improved by the method over QL. * indicates statistically significant improvements over RMC. We use the paired t-test with significance at p<0.05.
            AP       Robust    WT10G     GOV2
RMC  MAP    0.1707   0.2823    0.1969    0.3141
     IMP    119      174       47        88
QD   MAP    0.1777   0.3002*   0.2194*   0.3348*
     IMP    116      191       67        96

Table 9 also presents an analysis of robustness by giving the numbers of queries improved by RMC and QD over QL, respectively. QD shows improved robustness on the Robust2004, WT10G and GOV2 collections.

5.6 Parameter Selection
Figure 1 shows the results of assigning different weights to the fields on GOV2. As can be seen in Figure 1, performance improves as the weights for Links and Content increase. On the other hand, increasing the weight of Overview leads to deterioration of performance. This shows that the position where a term appears has a different impact as an indicator of term relevance.

Both the RMC and RMW methods have parameters N, K and λ_fb. We tested these two methods with 10 different values of N, the number of feedback documents: 10, 20, ..., 100. For λ_fb, we tested 5 different values: 0.4, 0.5, 0.6, 0.7 and 0.8. Due to space limitations, we present only the final results. Results show that setting N = 10, K = 50 and λ_fb = 0.6 works best for the values tested. Figure 2 shows the sensitivity of RMW to K. The results for other methods are similar. Figure 2 shows that retrieval performance varies little as the number of expansion terms increases.

6. CONCLUSION
In this paper, we have explored the utilization of Wikipedia in PRF. In this work TREC topics are categorized into three types based on Wikipedia. We propose and study different methods for term selection using pseudo-relevance information from Wikipedia entity pages. We evaluated these methods on four TREC collections. The impact of Wikipedia on retrieval performance for different types of queries has been evaluated and compared. Our experimental results show that the query dependent approach can improve over a baseline relevance model.

This study suggests several interesting research avenues for our future investigations. More investigation is needed to explore the characteristics and possible technique refinements for broader queries. For ambiguous queries, if the disambiguation process can achieve improved accuracy, the effectiveness of the final retrieval will be improved. For the supervised term selection method, the results obtained are not satisfactory in terms of accuracy. This means that there is still much room for improvement. We are going to explore more features for the learning process. Finally, in this paper, we focused on using Wikipedia as the sole source of PRF information. However, we believe both the initial result from the test collection and Wikipedia have their own advantages for PRF. By combining them, one may be able to develop an expansion strategy which is robust to the query being degraded by either of the resources.
7. ACKNOWLEDGMENTS
This work is supported by the National Science Foundation of China under Grant No. 60603094, the Major State
Basic Research Project of China (973 Program) under Grant
No. 2007CB311103 and the National High Technology Research and Development Program of China (863 Program)
under Grant No. 2006AA010105. The authors are grateful
to the anonymous reviewers for their comments, which have
helped improve the quality of the paper.
8. REFERENCES
[1] C. Bishop. Pattern recognition and machine learning.
Springer, 2006.
[2] C. Buckley, G. Salton, J. Allan, and A. Singhal.
Automatic query expansion using SMART: TREC 3.
In Text REtrieval Conference, 1994.
[3] G. Cao, J.-Y. Nie, J. Gao, and S. Robertson. Selecting
good expansion terms for pseudo-relevance feedback.
In Proceedings of SIGIR 2008, pages 243–250.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for
support vector machines, 2001. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[5] P. A. Chirita, C. S. Firan, and W. Nejdl. Personalized
query expansion for the web. In Proceeding of SIGIR
2007, pages 7–14.
[6] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. A
framework for selective query expansion. In
Proceedings of CIKM 2004, pages 236–237.
[7] J. L. Elsas, J. Arguello, J. Callan, and J. G.
Carbonell. Retrieval and feedback models for blog feed
search. In Proceedings of SIGIR 2008, pages 347–354.
[8] C. Fautsch and J. Savoy. UniNE at TREC 2008: Fact
and Opinion Retrieval in the Blogsphere. In
Proceedings of TREC 2008.
[9] S. Fissaha Adafre, V. Jijkoun, and M. de Rijke. Fact
discovery in Wikipedia. In Proceedings of Web
Intelligence 2007, pages 177–183.
[10] B. M. Fonseca, P. Golgher, B. Pôssas,
B. Ribeiro-Neto, and N. Ziviani. Concept-based
interactive query expansion. In Proceedings of CIKM
2005, pages 696–703.
[11] G. Amati, C. Carpineto, and G. Romano. Query difficulty, robustness and selective application of query expansion. In Proceedings of ECIR 2004, pages 127–137, 2004.
[12] H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma. Query expansion by mining user logs. IEEE Transactions on Knowledge and Data Engineering, 15(4):829–839, 2003.
[13] B. He and I. Ounis. Combining fields for query
expansion and adaptive query expansion. Information
Processing and Management, 2007.
[14] Indri. http://www.lemurproject.org/indri/.
[15] J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. Applied Statistics, 28(1):100–108, 1979.
[16] K. Balog, E. Meij, W. Weerkamp, J. He, and M. de Rijke. The University of Amsterdam at TREC 2008: Blog, Enterprise, and Relevance Feedback. In Proceedings of TREC 2008.
[17] K. L. Kwok and M. Chan. Improving two-stage ad-hoc
retrieval for short queries. In Proceedings of SIGIR
1998, pages 250–256.
[18] V. Lavrenko and W. B. Croft. Relevance based
language models. In Proceedings of SIGIR 2001, pages
120–127.
[19] K. S. Lee, W. B. Croft, and J. Allan. A cluster-based
resampling method for pseudo-relevance feedback. In
Proceedings of SIGIR 2008, pages 235–242.
[20] Y. Li, W. P. R. Luk, K. S. E. Ho, and F. L. K. Chung.
Improving weak ad-hoc queries using Wikipedia as
external corpus. In Proceedings of SIGIR 2007, pages
797–798.
[21] D. Metzler and W. B. Croft. Latent concept expansion
using markov random fields. In Proceedings of SIGIR
2007, pages 311–318.
[22] D. Metzler, T. Strohman, H. Turtle, and W. Croft. Indri at TREC 2005: Terabyte track. In Proceedings of TREC 2005.
[23] D. N. Milne, I. H. Witten, and D. M. Nichols. A
knowledge-based search engine powered by Wikipedia.
In Proceedings of CIKM 2007, pages 445–454.
[24] G. Mishne. Applied Text Analytics for Blogs. PhD
thesis, University of Amsterdam, Amsterdam, 2007.
[25] J. Platt. Probabilities for SV machines. Advances in
large margin classifiers, pages 61–74.
[26] S. Robertson, H. Zaragoza, and M. Taylor. Simple
BM25 extension to multiple weighted fields. In
Proceedings of CIKM 2004, pages 42–49.
[27] S. E. Robertson, S. Walker, M. Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In Proceedings of the 4th Text REtrieval Conference (TREC), 1996.
[28] M. Sanderson. Ambiguous queries: test collections
need more sense. In Proceedings of SIGIR 2008, pages
499–506.
[29] T. Tao and C. Zhai. Regularized estimation of mixture
models for robust pseudo-relevance feedback. In
Proceedings of SIGIR 2006, pages 162–169.
[30] Wikipedia. http://www.wikipedia.org.
[31] J. Xu and W. B. Croft. Improving the effectiveness of
information retrieval with local context analysis. ACM
Trans. Inf. Syst., 18(1):79–112, 2000.
[32] E. Yom-Tov, S. Fine, D. Carmel, and A. Darlow.
Learning to estimate query difficulty. In Proceedings of
SIGIR 2005, pages 512–519.
[33] T. Zesch, I. Gurevych, and M. Mühlhäuser. Analyzing and accessing Wikipedia as a lexical semantic resource. In Biannual Conference of the Society for Computational Linguistics and Language Technology 2007, pages 213–221.
[34] C. Zhai and J. Lafferty. Model-based feedback in the
language modeling approach to information retrieval.
In Proceedings of CIKM 2001, pages 403–410.
[35] W. Zhang and C. Yu. UIC at TREC 2006 Blog Track.
In The Fifteenth Text REtrieval Conference (TREC
2006) Proceedings, 2007.