Information Retrieval: Challenges in Interactive-Time Manipulation of Massive Text Collections


David Karger

Information retrieval (IR) is the study of how we can hook people up to the information they need. With the rapid proliferation of large corpora on the World Wide Web, the problem of finding useful needles in these data haystacks has received new attention.

Perhaps the simplest IR model is one in which a system is responsible for matching a user query to the most relevant documents in its corpus. The oldest systems use Boolean queries, returning the documents that contain all the terms in the query. This kind of system suffers from the existence of synonyms and the fact that concepts can be expressed in many different ways: a relevant document that uses different words from the query is never returned. Much IR research has been devoted to finding improved approaches that deal with the ambiguity of language. We will discuss some of the mathematical models and algorithms that have been developed for IR, including the recently developed "Latent Semantic Indexing" approach based on a singular value decomposition of the term-document occurrence matrix.
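
To make the synonym problem concrete, here is a minimal sketch of Boolean AND retrieval over a toy corpus. The documents, the query strings, and the function name boolean_and are illustrative assumptions, not taken from any particular system; whitespace tokenization stands in for a real tokenizer.

```python
# Toy corpus; the documents below are illustrative assumptions.
DOCS = {
    "d1": "the car dealer sold a used car",
    "d2": "the automobile dealer sold a used vehicle",
    "d3": "a recipe for apple pie",
}


def boolean_and(query, docs):
    """Return the ids of documents that contain every term in the query."""
    terms = set(query.lower().split())
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        # Subset test: every query term must appear in the document.
        if terms <= set(text.lower().split())
    )


# Two queries expressing the same information need with different words.
car_hits = boolean_and("used car", DOCS)
auto_hits = boolean_and("used automobile", DOCS)
```

Both queries ask for the same thing, yet each matches only the document that happens to use the same word: d1 and d2 are about the same concept but never appear in the same result list. This is the gap that approaches such as Latent Semantic Indexing try to close.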

Corpora are often huge, placing their analysis in the domain of supercomputing tasks. Unfortunately, users are not willing to wait a week for their answers. So another crucial question is how to make IR systems efficient enough to answer users' queries interactively. We will discuss some of the methods that have been applied here as well.
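
The abstract does not name the efficiency methods it covers. As one standard illustration (an assumption, not necessarily a technique from the talk), an inverted index maps each term to the set of documents containing it, so a query touches only the postings of its own terms instead of scanning the whole corpus. The names build_index and query_and and the sample corpus are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def query_and(index, query):
    """Answer a Boolean AND query by intersecting posting sets."""
    terms = query.lower().split()
    if not terms:
        return []
    # .get avoids creating empty entries for unseen terms.
    postings = [index.get(term, set()) for term in terms]
    # Start from the rarest term so intermediate results stay small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break  # no document can match; stop early
    return sorted(result)


# Illustrative corpus (an assumption made for this sketch).
corpus = {
    "d1": "information retrieval on the web",
    "d2": "latent semantic indexing for retrieval",
    "d3": "supercomputing for massive text collections",
}
index = build_index(corpus)
hits = query_and(index, "information retrieval")
```

The index is built once, ahead of time; each query then costs time proportional to the posting lists of its terms rather than to the size of the corpus, which is what makes interactive response plausible on large collections.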