PubMed related articles: a probabilistic topic-based model for content similarity

Authors: Jimmy Lin (corresponding author) [1,2]; W. John Wilbur [2]

Background

This article describes the retrieval model behind the related article search functionality in PubMed [1]. Whenever the user examines a MEDLINE citation in detail, a panel to the right of the abstract text is automatically populated with titles of articles that may also be of interest (see Figure 1). We describe pmra, the topic-based content similarity model that underlies this feature.

Figure 1: A typical view in the PubMed search interface showing an abstract in detail. The "Related Links" panel on the right is populated with titles of articles that may be of interest. [figure omitted]

There is evidence to suggest that related article search is a useful feature. Based on PubMed query logs gathered during a one-week period in June 2007, we observed approximately 35 million page views across 8 million browser sessions. Of those sessions, 63% consisted of a single page view, representing bots and direct access into MEDLINE (e.g., from an embedded link or another search engine). Of all sessions in our data set, approximately 2 million include at least one PubMed search query and at least one view of an abstract; this figure roughly quantifies actual searches. About 19% of these involve at least one click on a related article. In other words, roughly a fifth of all non-trivial user sessions contain at least one invocation of related article search. In terms of overall frequency, approximately five percent of all page views in these non-trivial sessions were generated from clicks on related article links. More details can be found in [2].
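
To make the log analysis concrete, here is a minimal Python sketch of how such session statistics could be computed. The log format (one row per page view, with a session identifier and an action type) and all field names are assumptions for illustration, not the actual PubMed log schema.

    from collections import defaultdict

    # Hypothetical log rows: (session_id, action), where action is one of
    # "query", "abstract_view", or "related_click". This schema is assumed
    # for illustration only.
    log = [
        ("s1", "query"), ("s1", "abstract_view"), ("s1", "related_click"),
        ("s2", "abstract_view"),                  # single-view session (e.g., a bot)
        ("s3", "query"), ("s3", "abstract_view"),
    ]

    sessions = defaultdict(list)
    for session_id, action in log:
        sessions[session_id].append(action)

    # "Non-trivial" sessions: at least one query and at least one abstract view.
    non_trivial = {sid: acts for sid, acts in sessions.items()
                   if "query" in acts and "abstract_view" in acts}

    with_related = sum(1 for acts in non_trivial.values()
                       if "related_click" in acts)

    print(f"{len(non_trivial)} non-trivial sessions, "
          f"{with_related / len(non_trivial):.0%} with a related-article click")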

We evaluate the pmra retrieval model with the test collection from the TREC 2005 genomics track. A test collection is a standard laboratory tool for evaluating retrieval systems, and it consists of three major components (illustrated in the sketch after this list):

* a corpus: a collection of documents on which retrieval is performed,

* a set of information needs: written statements describing the desired information, which translate into queries to the system, and

* relevance judgments: records specifying the documents that should be retrieved in response to each information need (typically, these are gathered from human assessors in large-scale evaluations [3]).
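
As an illustration, the following Python sketch shows how these three components fit together in a minimal evaluation harness. The data structures and the precision computation are a simplified stand-in, not the actual TREC genomics collection or evaluation tooling.

    # A toy test collection: the three components described above.
    corpus = {
        "d1": "gene regulation in yeast",
        "d2": "protein folding pathways",
        "d3": "yeast transcription factors",
    }
    information_needs = {
        "t1": "Find documents about gene regulation in yeast.",
    }
    relevance_judgments = {          # topic id -> relevant doc ids
        "t1": {"d1", "d3"},
    }

    def precision(ranked_doc_ids, relevant, k):
        """Fraction of the top-k retrieved documents that are relevant."""
        return sum(1 for d in ranked_doc_ids[:k] if d in relevant) / k

    # Suppose some retrieval system returned this ranking for topic t1:
    ranking = ["d1", "d2", "d3"]
    print(precision(ranking, relevance_judgments["t1"], k=2))  # 0.5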

The use of test collections to assess the performance of retrieval algorithms is a well-established methodology in the information retrieval (IR) literature, dating back to the Cranfield experiments in the 1960s [4]. These tools enable rapid, reproducible experiments in a controlled setting without requiring live users.

The pmra model is compared against bm25 [5, 6], a competitive probabilistic model that shares theoretical similarities with pmra. On test data from the TREC 2005 genomics track, we observe a small but statistically significant improvement in terms of precision.
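
For reference, here is a minimal sketch of one standard formulation of the bm25 scoring function. The parameter values (k1 = 1.2, b = 0.75) are common defaults, not necessarily the configuration used in the experiments reported here.

    import math

    def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
                   k1=1.2, b=0.75):
        """Standard bm25: sum of per-term IDF-weighted, length-normalized TF."""
        score = 0.0
        doc_len = len(doc_terms)
        for term in query_terms:
            tf = doc_terms.count(term)
            if tf == 0 or term not in doc_freq:
                continue
            idf = math.log(
                (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        return score

The term-frequency saturation (controlled by k1) and document-length normalization (controlled by b) are what make bm25 a strong baseline for comparisons such as this one.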

Before proceeding, a clarification on terminology: although MEDLINE records contain only abstract text and associated bibliographic information, PubMed provides access to the full text articles (if available). Thus, it is not inaccurate to speak of searching for articles, even though the search itself is only performed on information in MEDLINE. Throughout this work, we use "document" and "article" interchangeably.

1.1 Formal Model

We formalize the related document search problem as follows: given a document in which the user has indicated interest, the system's task is to retrieve other documents that the user may also want to examine. Since this activity generally occurs in the context of broader information-seeking behaviors, relevance can serve as one indicator of interest, i.e., retrieve other relevant documents. However, we think of the problem in broader terms: other documents may be interesting because they discuss similar topics, share the same citations, provide general background, lead to interesting hypotheses, etc.

To constrain this problem, we assume in our theoretical model that documents of interest are similar in terms of the topics or concepts that they are about; in the case of MEDLINE citations, we limit ourselves to the article title and abstract (the deployed algorithm in PubMed also takes advantage of MeSH terms, which we do not discuss here). Following typical assumptions in information retrieval [7], we wish to rank documents (MEDLINE citations, in our case) based on the probability that the user will want to see them. Thus, our pmra retrieval model focuses on estimating P(c | d), the probability that the user will find document c interesting given expressed interest in document d.

Let us begin by decomposing documents into mutually-exclusive and exhaustive "topics" (denoted by the set {s_1, ..., s_N}). Assuming that the relatedness of documents is mediated through topics, we get the following:

P(c \mid d) = \sum_{j=1}^{N} P(c \mid s_j) \, P(s_j \mid d)

Expanding P(s_j | d) by Bayes' Theorem, we get:

P(s_j \mid d) = \frac{P(d \mid s_j) \, P(s_j)}{P(d)}

Since we are only concerned with the ranking of documents, the denominator can be safely ignored, as it is independent of c. Thus, we arrive at the following criterion for ranking documents:

P(c \mid d) \propto \sum_{j=1}^{N} P(c \mid s_j) \, P(d \mid s_j) \, P(s_j)

Rephrased in prose, P(c | s_j) is the probability that a user would want to see c given an interest in topic s_j, and similarly for P(d | s_j). Thus, the degree to which two documents are related can be computed as the product of these two probabilities and the prior probability of the topic, P(s_j), summed across all topics.
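
A small worked example may help. The following Python sketch applies the ranking criterion above to a toy setting with three topics; all probability values are invented for illustration.

    # Toy example of the ranking criterion: three topics with assumed
    # (invented) probabilities. s_prior[j] = P(s_j); p_c[j] = P(c | s_j);
    # p_d[j] = P(d | s_j).
    s_prior = [0.5, 0.3, 0.2]

    def relatedness(p_c, p_d, s_prior):
        """Sum over topics of P(c|s_j) * P(d|s_j) * P(s_j)."""
        return sum(pc * pd * ps for pc, pd, ps in zip(p_c, p_d, s_prior))

    p_d = [0.8, 0.1, 0.1]            # the document the user is reading
    candidate_1 = [0.7, 0.2, 0.1]    # topically similar to d
    candidate_2 = [0.1, 0.2, 0.7]    # topically dissimilar to d

    print(relatedness(candidate_1, p_d, s_prior))  # 0.288
    print(relatedness(candidate_2, p_d, s_prior))  # 0.060

As expected, the candidate whose topic distribution resembles that of d receives the higher score.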

Thus far, we have not addressed the important question of what a topic actually is. For computational tractability, we make the simplifying assumption that each term in a document represents a topic (that is, each term conveys an idea or concept). Thus, the "aboutness" of a document (i.e., what topics the document discusses) is conveyed through the terms in the document. As with most retrieval models, we assume single-word terms, as opposed to potentially complex multi-word concepts. This satisfies our requirement that the set of topics be exhaustive and mutually-exclusive.
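
Under this terms-as-topics assumption, extracting a document's topic set reduces to tokenization. A minimal sketch, using a toy abstract of our own invention:

    # Under the terms-as-topics assumption, a document's topic set is just
    # its set of single-word terms. The abstract below is a toy example.
    abstract = "Poisson models of term frequency for related article retrieval"
    topics = set(abstract.lower().split())
    print(sorted(topics))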

From this starting point, we leverage previous work in probabilistic retrieval models based on Poisson distributions (e.g., [6, 8, 9]). A Poisson distribution characterizes the probability of a specific number of events occurring in a fixed period of time if these events occur with a known average rate. The underlying assumption is a generative model of document content: let us suppose that an author uses a particular term with constant probability, and that documents are generated as a sequence of terms. A Poisson distribution specifies the probability that we would observe the term n times in a document. Obviously, this does not accurately reflect how content is actually produced; nevertheless, this simple model has served as the starting point for many effective retrieval algorithms.
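
Concretely, if a term occurs with an average rate lam per document, the Poisson model assigns the probability below to observing it n times. The rate value in the sketch is an arbitrary illustration, not a parameter from the pmra model.

    import math

    def poisson_pmf(n, lam):
        """P(n occurrences | average rate lam) = exp(-lam) * lam**n / n!"""
        return math.exp(-lam) * lam ** n / math.factorial(n)

    # A term with an assumed average rate of 1.5 occurrences per document:
    # probability of observing it 0, 1, 2, or 3 times.
    for n in range(4):
        print(n, round(poisson_pmf(n, 1.5), 4))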

This content model also assumes that each term occurrence is …
