Title: A Data Mining Case for Online Information Sharing and Information Acquisition
Speaker: Balaji Padmanabhan, The Wharton School, University of Pennsylvania
Date: April 25, 2003, 12:00-1:00
Location: DIMACS Center, CoRE Bldg. Lecture Hall (1st Floor), Rutgers University, Busch Campus, Piscataway, NJ
Abstract: Prior work in academia and industry describes various approaches to web site personalization based on models and profiles derived from clickstream data collected at the site. We argue that a fundamental property of such data is that it is incomplete: it does not capture user behavior across sites in a given session. Given this inherent incompleteness, it stands to reason that models learned from such data may be subject to limitations, the nature of which has not been studied or quantified in prior work. In this research we compare models built on site-centric data with models built on user-centric data in the context of three different prediction problems. We present results from a comprehensive experiment on user-level clickstream data capturing the browsing behavior of 20,000 users over a period of six months. The main result is that models built on user-centric data significantly outperform models built on site-centric data across these tasks. Moreover, a comparison of qualitative inferences reveals that potentially erroneous conclusions may be drawn from an incomplete view of the world. We argue that the results are significant given the widespread use of site-centric methods in the online world, and we discuss the implications of these findings for information sharing. In follow-up research, we present a solution to this 'incomplete data problem' based on an active data acquisition method developed from the principles of active learning.
Sponsored by the Rutgers Graduate Student Association (GSA) and DIMACS.
Please refer to http://gsa.rutgers.edu/~reca/ for further information.