DIMACS Working Group on The Mathematics of Web Search and Meta-Search

June 19 - 26, 2004
Bertinoro International Center for Informatics, Bertorino, Italy

Organizers:
Cynthia Dwork, Microsoft, dwork@microsoft.com
Andrew Gelman, Columbia University, gelman@stat.columbia.edu
D. Sivakumar, IBM Almaden, siva@almaden
Presented under the auspices of the Special Focus on Computation and the Socio-Economic Sciences.

Information on BICI can be found at http://www.cs.unibo.it/bici/.


In an election, each of a large number k of voters ranks a small number n of candidates. The rankings are then combined in some fashion to elect either a single or several candidates. The formal analysis of voting began in France in the latter half of the eighteenth century with two seminal, but conflicting, approaches proposed, respectively, by Jean Charles de Borda and Marie J. A. N. Caritat, the Marquis de Condorcet. This laid the groundwork for an extensive literature on the mathematics of voting. Fast-forwarding to a more modern problem: In web meta-search, in response to a given query, each of a small number k of search engines (voters) ranks a (subset of a) large number n of candidates (pages). The results are then combined in some fashion to produce a ranking that is in some sense "better'' than the results produced by any single search engine. The precise definition of "better'' may vary; for example, assuming that different search engines have different databases of pages (resulting from different web crawls), "better'' might mean "resulting from broader coverage.'' Or, if we assume that different search engines are susceptible to different types of search engine "spam'' -- intuitively, arranging for one's page to have unreasonably high rank in response to some query -- rank aggregation can produce results that are less vulnerable to spam [Dwork, Kumar, Naor and Sivakumar (2001)]. (Search engine spam is, in fact, rampant; see, e.g, [Fetterly, Manasse, Najork and Wiener (2003)].) As phrased here, the connection between voting, analysis of rank data, and the mathematics of (web) search and meta-search is patent. Indeed, the voting literature inspired several of the meta-search results of [Dwork, Kumar, Naor, and Sivakumar (2001)].

Statistically, this problem of ranking latent parameters given unbalanced discrete data arises in many contexts, most notably in psychometrics, for ranking abilities of students [Lord and Novick (1968), Johnson (1996), Johnson (1997)] and also animals [Johnson (2002)] and sports teams [Glickman and Stern (1998)]. Much recent work has gone into generalizing these models to apply to more flexible data structures [Boomsma, Duijn and Snijders (2001)]. Statistical models of Web connections (which are relevant to understanding the workings of the individual search engines) can be found in [Sen and Hansen (2003), Baldi, Frasconi and Smyth (2003)]. Connections between meta-search and learning have been established as well [Lebanon and Lafferty (2002), Freund, Iyer, Schapire and Singer (1998), Joachims (2003)] (the first paper uses rank data as input). Finally, the database community has been exploring usage of rank-like information in the context of search: personal preference data for improving search outcomes [Jeh and Widom (2002)] and page importance for guiding crawls [Abiteboul, Preda and Cobena (2003)]. The working group will bring together researchers from the voting theory, statistics, psychometrics, learning theory, database, and web search communities, with the goal of obtaining new algorithms for search and meta-search.

References:

S. Abiteboul, M. Preda, and G. Cobena, "Adaptive on-line page importance computation,'' Proc. WWW2003.

P. Baldi, P. Frasconi, and P. Smyth; Modeling the Internet and the Web: Probabilistic Methods and Algorithms, John Wiley and Sons, 2003.

Boomsma, A, Van Duijn, M.A.J., and Snijders, T.A.B. (Eds), Essays on Item Response Theory, New York, Springer Verlag, 2001.

Dwork, C., Kumar, R., Naor, M., and Sivakumar, D., "Rank aggregation methods for the Web,'' Proc. WWW10, 2001.

Fetterly, D., Manasse, M., Najork, M., and Wiener, J.L., ''A large-scale Study of the evolution of web pages,'' WWW2003, pp. 669?678.

Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer, "An efficient boosting algorithm for combining preferences,'' Machine Learning: Proceedings of the Fifteenth International Conference, 1998.

Glickman, M.E., and Stern, H.S., "A state-space model for National Football League scores,'' Journal of the American Statistical Association, 93, 1998, 25--35.

Jeh, G., and Widom, J., "Scaling personalized Web search,'' Stanford University Technical Report, 2002.

Joachims, T.,"Optimizing search engines using clickthrough data,'' Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.

Johnson, V.E., "On Bayesian analysis of multirater ordinal data: an application to automated essay grading,'' Journal of the American Statistical Association, 91, 1996, 42--51.

Johnson, V.E., "An alternative to traditional GPA for evaluating student performance (with discussion), " Statistical Science, 12, 1997, 251--278.

Johnson, V. "Bayesian analysis of rank data with applications to primate intelligence experiments,'' Journal of the American Statistical Association, 2002.

Lebanon, G., and Lafferty, J., "Cranking: Combining rankings using conditional probability models on permutations," Machine Learning: Proc. Nineteenth International Conference, 2002.

Lord, F.M., and Novick, M.R., Statistical Theories of Mental Test Scores, Reading, Mass.: Addisom-Wesley, 1968.

Sen, R., and Hansen, M.H., "Predicting a Web user's next request based on log data,'' Journal of Computational and Graphical Statistics, 12, 2003.


Next: Call for Participation
Workshop Index
DIMACS Homepage
Contacting the Center
Document last modified on October 31, 2003.