March 6, 2000
Morning
Breakfast and Registration: 8:15 - 8:45
Welcome and Greeting: 8:50 - 9:00
Fred Roberts, Director of DIMACS
Representing Web Data
The web graph: structure and interpretation: 9:00 - 9:45
Dr. Sridhar Rajagopalan, IBM Almaden Research Center
Abstract & Speaker Bio
Economics of Information
A Proposal for Valuing Information and Instrumental Goods: 9:45 - 10:30
Dr. Marshall Van Alstyne, University of Michigan
Abstract & Speaker Bio
Break: 10:30 - 11:00
More on Representing Web Data
WHIRL: A Formalism for Representing Web Data: 11:00 - 11:45
Dr. William Cohen, AT&T Labs - Research
Abstract & Speaker Bio
Musings on the Extraction of Structure from the Web: 11:45 - 12:30
Dr. Rajeev Motwani, Stanford University
Abstract & Speaker Bio
Lunch: 12:30 - 2:00
Afternoon
Privacy
Privacy Implications of Online Data Collection: 2:00 - 2:45
Dr. Lorrie Cranor, AT&T Labs - Research
Abstract & Speaker Bio
Revolution, not Evolution
Online music - The next big revolution: 2:45 - 3:30
Dr. Narayanan Shivakumar, Gigabeat, Inc.
Abstract & Speaker Bio
Break: 3:30 - 4:00
A Petabyte in Your Pocket: Directions for Net Data Management: 4:00 - 4:45
Dr. David Maier, Oregon Graduate Institute
Abstract & Speaker Bio
March 7, 2000
Morning
Breakfast and Registration: 8:15 - 9:00
Data on Data
Information Access and Data Processing on the Web: Current Limitations, New Techniques, and Future Directions: 9:00 - 9:45
Dr. Steven Lawrence, NEC Research Institute
Abstract & Speaker Bio
On collecting and using Web data: 9:45 - 10:30
Dr. Balachander Krishnamurthy, AT&T Labs - Research
Abstract & Speaker Bio
Break: 10:30 - 11:00
XML
XML + Databases = ?: 11:00 - 11:45
Dr. Michael Carey, IBM Almaden Research Center
Abstract & Speaker Bio
The Next 700 Markup Languages: 11:45 - 12:30
Dr. Philip Wadler, Lucent Technologies - Bell Labs
Abstract & Speaker Bio
Lunch: 12:30 - 2:00
Afternoon
Ubiquity
ObjectGlobe: Ubiquitous Query Processing on the Internet: 2:00 - 2:45
Dr. Alfons Kemper, Universität Passau
Abstract & Speaker Bio
Data Management for Ubiquitous Computing: 2:45 - 3:30
Dr. Alon Levy, University of Washington
Abstract & Speaker Bio
Break: 3:30 - 4:00
Open Research/Future Problems Session: 4:00 - 5:00
The web graph: structure and interpretation
Dr. Sridhar Rajagopalan, IBM Almaden Research Center
The study of the web as a graph is not only fascinating in its own right, but also yields valuable insight into web algorithms for crawling, searching and community discovery, and the sociological phenomena which characterize its evolution. We report on experiments on local and global properties of the web graph using two crawls, each with over 200M pages and 1.5 billion links. Our study indicates that the macroscopic structure of the web is considerably more intricate than suggested by earlier experiments on a smaller scale.
Collaborators: Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Raymie Stata, Andrew Tomkins, Eli Upfal and Janet Wiener.
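The "macroscopic structure" referred to above is the pattern of strongly connected pieces in the link graph. As a rough, hedged illustration of the kind of global analysis involved (not the authors' actual experiments; the toy graph and page names below are invented), here is a sketch that finds the strongly connected components of a small link graph with Kosaraju's two-pass depth-first search:

```python
from collections import defaultdict

# Toy web graph: page -> set of pages it links to.
links = {
    "a": {"b"}, "b": {"c"}, "c": {"a", "d"},   # a, b, c form a cycle (one SCC)
    "d": {"e"}, "e": {"d"},                    # d, e form another SCC
    "f": {"a"},                                # f links in; nothing links back
}

def reverse(graph):
    rev = defaultdict(set)
    for u, outs in graph.items():
        rev[u]                      # make sure every page appears as a key
        for v in outs:
            rev[v].add(u)
    return rev

def dfs_finish_order(graph):
    """Nodes in order of completed depth-first search (iterative)."""
    seen, order = set(), []
    for root in graph:              # assumes every page appears as a key
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, children = stack[-1]
            child = next((c for c in children if c not in seen), None)
            if child is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(child)
                stack.append((child, iter(graph.get(child, ()))))
    return order

def sccs(graph):
    """Kosaraju: DFS on G for finish order, then DFS on G reversed."""
    rev, seen, components = reverse(graph), set(), []
    for root in reversed(dfs_finish_order(graph)):
        if root in seen:
            continue
        comp, stack = [], [root]
        seen.add(root)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in rev.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(comp)
    return components

print(sccs(links))   # e.g. [['f'], ['a', 'c', 'b'], ['d', 'e']]
```

On a real crawl such an analysis is run out-of-core over billions of links, but the underlying idea is this same linear-time algorithm.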
Sridhar Rajagopalan received a B.Tech. from the Indian Institute of Technology, Delhi, in 1989 and a Ph.D. from the University of California, Berkeley, in 1994. He was a DIMACS postdoctoral fellow from 1994 to 1996. He is now a Research Staff Member at the IBM Almaden Research Center. His research interests are algorithms and algorithm engineering, randomization, information and coding theory, and information retrieval issues on the web.
WHIRL: A Formalism for Representing Web Data
Dr. William Cohen, AT&T Shannon Laboratory
Data on the Web is hard to represent with conventional knowledge-base
and database formalisms, due to problems like terminological
differences across sites, and the frequent interleaving of textual
information with structured, data-like information. Over the last few
years, I have developed a new "information representation language"
called WHIRL that addresses these problems by incorporating ideas from
both AI knowledge representation systems and statistical information
retrieval. Specifically, WHIRL is a subset of Prolog that has been
extended by adding special features for reasoning about the similarity
of fragments of text. WHIRL's combination of features greatly
facilitates the construction of Web-based information integration
systems; in more recent work, WHIRL has also been useful for
collecting Web data for collaborative filtering and machine learning
systems.
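WHIRL itself is a Prolog-like language; its core primitive is a "soft join" that pairs records whose text fields are similar under a TF-IDF cosine measure, rather than exactly equal. Here is a minimal sketch of that primitive only (the tables, threshold, and scoring details are invented for illustration and are not WHIRL's actual semantics):

```python
import math
from collections import Counter

# Two small "tables" whose name fields use different terminology.
movies  = ["Star Wars Episode IV", "The Matrix", "Blade Runner"]
reviews = ["matrix, the", "blade runner (director's cut)", "star wars"]

def tokens(text):
    return "".join(c if c.isalnum() else " " for c in text.lower()).split()

def tfidf(docs):
    """Unit-length TF-IDF vector (a dict) for each document."""
    df = Counter(t for d in docs for t in set(tokens(d)))
    n = len(docs)
    vecs = []
    for d in docs:
        tf = Counter(tokens(d))
        v = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def similarity_join(xs, ys, threshold=0.3):
    """Pair rows whose text fields are similar: a WHIRL-style soft join."""
    vx, vy = tfidf(xs), tfidf(ys)   # note: IDF computed per table, for brevity
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            sim = sum(w * vy[j].get(t, 0.0) for t, w in vx[i].items())
            if sim >= threshold:
                yield x, y, round(sim, 2)

for pair in similarity_join(movies, reviews):
    print(pair)   # matches "The Matrix" with "matrix, the", and so on
```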
William Cohen received his bachelor's degree
in Computer Science from
Duke University, and a Ph.D. in Computer Science from Rutgers
University. Since 1990, Dr. Cohen has been employed at AT&T Labs.
His main research area is machine learning, and his present research
interests include information integration, text categorization, and
learning from large datasets.
Musings on the Extraction of Structure from the Web
Dr. Rajeev Motwani, Stanford University
One of the major challenges in dealing with the web is the unstructured or
semistructured nature of the data. There are major benefits in extracting
the structure implicit in the web or extracting a subset of the data that
can be structured easily. I will give a personal view of some attempts in
this direction. The focus will be on these problems and on proposals for
attacking them.
Rajeev Motwani is an associate
professor of computer science at Stanford University, where he also serves
as the director of graduate studies. He obtained his Ph.D. in computer
science from the University of California, Berkeley, in 1988, and his
B.Tech. in computer science from the Indian Institute of Technology,
Kanpur, in 1983. His
research interests include: databases and data mining, web search and
information retrieval, robotics, and theoretical computer science. He is a
co-author of the book, Randomized Algorithms, published by Cambridge
University Press in 1995. Motwani has received the Alfred P. Sloan Research
Fellowship, the National Young Investigator Award from the National Science
Foundation, the Bergmann Memorial Award from the US-Israel Binational
Science Foundation, and an IBM Faculty Award.
Privacy Implications of Online Data Collection
Dr. Lorrie Faith Cranor, AT&T Shannon Laboratory
New Web applications are enhancing businesses' ability to gather data
about their online customers, helping them provide customized services
and targeted advertising. By learning more about their customers, online
businesses can develop more personal relationships with them, allowing
them to better anticipate and meet their customers' needs. However, many
of the Web-based data collection systems being deployed raise serious
privacy concerns. First, most of
these systems are being deployed silently, without notifying Web site
visitors or giving them an opportunity to choose whether or not they
wish to have their data collected. Second, online data is increasingly
being combined with data from a variety of sources, allowing for the
development of detailed individual profiles. Third, data gathered for
business purposes is increasingly being subpoenaed for use in criminal
investigations and civil proceedings. Even when data is stored without
traditional identifiers such as name or social security number,
profiles often contain enough information to uniquely identify
individuals. The privacy concerns raised by online data collection are
gaining increased attention from the news media, the public, and
policy makers. As more advanced data collection and processing
applications are deployed without addressing privacy concerns,
individual privacy is slowly eroding. Individually, most of these
applications do not pose major threats, but taken together they are
bringing us closer to a surveillance society.
Lorrie Faith Cranor
is a Senior Technical Staff Member in the Secure
Systems Research Department at AT&T Labs-Research Shannon Laboratory
in Florham Park, New Jersey. She is chair of the Platform for Privacy
Preferences Project (P3P) Specification Working Group at the World
Wide Web Consortium. Her research has focused on a variety of areas
where technology and policy issues interact, including online privacy,
electronic voting, and spam. For more information, please see her home
page at http://www.research.att.com/~lorrie/.
A Proposal for Valuing Information and Instrumental Goods
Dr. Marshall Van Alstyne, University of Michigan
How should a firm value information capital? This essay offers one framework that combines ideas
from economics and computer science. Drawing a distinction between data and procedures, it
augments the traditional Bayesian model, which treats information as a change in uncertainty, with
instruments that treat information as instructions to be executed and reused. It then applies the
standard hedonic methods - used in marketing to value tangible goods - to information goods. This
leads to a generalized method for ascribing value that can be applied both to procedural information
such as software, blueprints, and production know-how, and to arbitrary resources that have
instrumental qualities, that is, resources that represent tools for effecting outcomes. This approach has
the added advantage of supporting efficient information transfers, since consumers need not always
see the information they are about to buy.
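The hedonic step can be made concrete with a toy regression: observed prices of information goods are regressed on their attributes, and the fitted coefficients act as implicit per-attribute prices. A hedged sketch, with all attribute names and numbers invented (this is not the paper's model):

```python
import numpy as np

# Hypothetical data: information goods described by attribute vectors
# (say, coverage, timeliness, reusability) and observed market prices.
attributes = np.array([
    [1.0, 0.2, 0.9],
    [0.4, 0.8, 0.1],
    [0.7, 0.5, 0.5],
    [0.9, 0.9, 0.8],
])
prices = np.array([120.0, 80.0, 90.0, 160.0])

# Hedonic method: regress price on attributes; the fitted coefficients
# are implicit per-attribute ("shadow") prices.
X = np.column_stack([np.ones(len(prices)), attributes])   # add an intercept
coef, *_ = np.linalg.lstsq(X, prices, rcond=None)
intercept, shadow_prices = coef[0], coef[1:]
print("implicit attribute prices:", shadow_prices.round(1))

# A new good can then be valued from its attributes alone, which is how
# the approach supports transfers without revealing the information itself.
new_good = np.array([0.6, 0.7, 0.3])
print("estimated value:", round(intercept + shadow_prices @ new_good, 1))
```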
Marshall Van Alstyne is an assistant professor at the University of Michigan, where he teaches
information economics, electronic commerce, and computer simulation. He holds a bachelor's in
computer science from Yale and MS and Ph.D. degrees in information technology from MIT. In 1999,
he received an NSF CAREER Award and an Intel Young Investigator Fellowship to pursue research on the
economics of information. Before returning to academia, he worked as a technology management
consultant and co-founded a software venture. Past clients include Fortune 500 companies as well
as US state and federal government agencies. He has published in several journals including Science
and Sloan Management Review, and his research has been the subject of radio broadcasts in the US
and Canada.
Online music - The next big revolution
Dr. Narayanan Shivakumar, Gigabeat, Inc.
Text is passé. The growing popularity of the MP3 and RealAudio formats
is leading to a new revolution on the Internet - online music. In this
talk, I will discuss some challenging problems that arise in building
multi-media web crawlers, and in mining the web for nuggets of text and
audio information.
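As a hedged sketch of the kind of crawler the talk concerns, the following follows page links breadth-first and sets aside links whose file extensions suggest audio content. The extension list, limits, and seed URL are invented; a production multimedia crawler would also inspect content types, obey robots.txt, and rate-limit its requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

AUDIO_EXTENSIONS = (".mp3", ".ra", ".ram", ".wav")   # illustrative list

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_for_audio(seed_url, max_pages=50):
    """Breadth-first crawl that separates audio links from page links."""
    frontier, seen, audio = [seed_url], {seed_url}, []
    while frontier and len(seen) <= max_pages:
        url = frontier.pop(0)
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                    # unreachable host, bad response, etc.
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:
            target = urljoin(url, link)
            if target.lower().endswith(AUDIO_EXTENSIONS):
                audio.append(target)    # found a music file
            elif target not in seen:
                seen.add(target)
                frontier.append(target) # follow ordinary page links
    return audio

# print(crawl_for_audio("http://example.com/"))
```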
Narayanan Shivakumar is currently the Chief Scientist of Gigabeat.com,
an online music startup in Palo Alto, CA. His current research interests
include data mining, multi-media, databases, and digital libraries. He
has been a summer visitor at Microsoft Corp., Bell Labs, and Xerox PARC.
He is a member of ACM and Tau Beta Pi. He received his MS (1997) and
PhD (1999) in computer science from Stanford, and his BS (1994) from UCLA.
A Petabyte in Your Pocket: Directions for Net Data Management
Dr. David Maier, Department of Computer Science and Engineering, Oregon Graduate Institute
In 2015, for a few hundred dollars a year, you can have a personal petabyte
database (PetDB) that you can access from any point of connection, with any
device. It stores and organizes any kind of digital data you want to have,
without losing structure or information. All this data is queryable, and it
is arranged by type, content, structure, association, and multiple
categorizations and groupings.
You can also locate items by when or how you encountered them, what you
have done with them, and where you were when you accessed them.
What could you fit in a personal petabyte store?
Your PetDB doesn't appear to reside on any particular computer; you are
never on the "wrong" machine to access it. More importantly, you don't have
to take any explicit action to insert data into your PetDB; your PetDB
doesn't appear to have an "outside" where data is concerned. Thus your
PetDB is also your personal Internet portal: your evolving and customized
view of all on-line digital data.
The PetDB isn't predicated upon some massive improvement in holographic
memory technology or DNA-based storage units. Rather, it is an example of
what could be done with a new generation of software infrastructure we term
Net Data Managers (NDMs). NDMs are a radical departure from the capabilities
and structure of current database management systems. They focus on data
movement rather than data storage, working as well with live streams of data
as with files in secondary storage. They will be capable of storing data of
arbitrary types, without a matching database schema having been defined
previously. They will efficiently execute queries over thousands or tens of
thousands of information sites. They will locate and select data items by
both internal content and a variety of external contexts. NDMs will also
support monitoring rapidly changing information sources in a way that scales
to thousands or even millions of triggers.
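The trigger-scaling requirement hints at the implementation strategy: rather than evaluating each trigger separately, group triggers with the same shape so that one pass over an arriving item fires all of them at once. A toy sketch of that idea (trigger names and fields are invented; this is not the NIAGARA design):

```python
import bisect
from collections import defaultdict

class TriggerIndex:
    """Evaluate many 'field > constant' triggers against a data stream.

    Triggers on the same field share one sorted list of thresholds, so a
    single binary search finds every trigger an arriving item fires.
    """
    def __init__(self):
        self.thresholds = defaultdict(list)   # field -> sorted (threshold, id)

    def register(self, trigger_id, field, threshold):
        bisect.insort(self.thresholds[field], (threshold, trigger_id))

    def match(self, item):
        fired = []
        for field, value in item.items():
            entries = self.thresholds.get(field, [])
            # every threshold strictly below the value fires
            pos = bisect.bisect_left(entries, (value, ""))
            fired.extend(tid for _, tid in entries[:pos])
        return fired

index = TriggerIndex()
index.register("alert-quote", "price", 100.0)
index.register("alert-quote-2", "price", 250.0)
index.register("alert-volume", "volume", 1_000_000)
print(index.match({"price": 120.0, "volume": 5_000}))   # ['alert-quote']
```

With triggers grouped this way, the per-item cost grows with the number of fields and matches, not with the total number of registered triggers.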
This talk lays out the requirements for Net Data Management, and reports
on the research directions being pursued by the NIAGARA project, a joint
undertaking with David DeWitt and Jeffrey Naughton at the University of
Wisconsin.
David Maier is a professor of Computer Science and Engineering at
Oregon Graduate Institute. His current research interests include
object-oriented databases, query processing, superimposed information
systems, XML and related standards, information assurance, scientific
databases and net data management. He has consulted with most of
the major database vendors, including Oracle, Informix, IBM and Microsoft.
Maier is an ACM Fellow and a holder of the SIGMOD Innovations Award.
He received his PhD from Princeton University in 1978.
Information Access and Data Processing on the Web: Current Limitations, New Techniques, and Future Directions
Dr. Steve Lawrence, NEC Research Institute
This talk describes current limitations, new techniques, and future
directions for information access and data processing on the web. We
describe recent studies analyzing the accessibility, distribution, and
structure of information on the web, which highlight substantial room
for improvement and for new methods. We then present new techniques for
information access and data processing on the web, including two projects
at NEC Research Institute: Inquirus, a content-based metasearch engine,
and CiteSeer, the largest free full-text index of scientific literature
in the world.
Joint work with Lee Giles, Kurt Bollacker, and Eric Glover.
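Inquirus's distinguishing idea is to download the result pages themselves and rank by their actual content, rather than by the orderings the source engines report. A hedged caricature of that reranking step (the URLs, texts, and scoring formula below are invented and are not the Inquirus algorithm):

```python
import math
from collections import Counter

def tokenize(text):
    return [t for t in text.lower().split() if t.isalnum()]

def rerank(query, pages):
    """Order result pages by how well their full text matches the query.

    Ordinary metasearch merges engines' rank lists; a content-based
    engine scores the downloaded text of each candidate page instead.
    """
    qterms = tokenize(query)
    scored = []
    for url, text in pages:
        tf = Counter(tokenize(text))
        total = sum(tf.values()) or 1
        score = sum(math.log(1 + tf[t]) for t in qterms) / math.log(1 + total)
        scored.append((score, url))
    return [url for score, url in sorted(scored, reverse=True)]

# Hypothetical downloaded pages (in practice these come from fetching
# the result URLs returned by several search engines).
pages = [
    ("http://example.org/a", "metasearch engines combine results from many engines"),
    ("http://example.org/b", "a page about gardening with no relevant terms"),
    ("http://example.org/c", "metasearch metasearch ranking of engines by page content"),
]
print(rerank("metasearch engines", pages))
```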
Steve Lawrence is a Research Scientist at NEC Research Institute in
Princeton, NJ. Dr. Lawrence has published over 50 articles in areas
including information retrieval, web analysis, digital libraries, and
machine learning. Dr. Lawrence has done over 100 interviews with news
organizations including the New York Times, Wall Street Journal,
Washington Post, Reuters, Associated Press, UPI, CNN, BBC, MSNBC, and
NPR. Hundreds of articles about his research have appeared worldwide
in over 10 different languages.
On collecting and using Web data
Dr. Balachander Krishnamurthy, AT&T Labs - Research
Web-related data has been gathered since the inception of the Web, often
without the knowledge of the vast majority of Web users. By Web data, I
mean client, proxy, and server logs, and HTTP packet traces. Apart from
obvious privacy issues, Web data presents problems relating to gathering,
storing, cleaning, and validation. I have been involved in several aspects
of collecting Web-related data from a wide variety of sources (both inside
and outside AT&T) and in creating a repository in conjunction with the
World Wide Web Consortium's (W3C) Web Characterization group. The data has
been used in several applications, ranging from Web caching, improving the
HTTP/1.1 protocol, testing Web software components for compliance with the
protocol, and reducing validation traffic, to predicting future accesses.
I will cover the basics of collecting Web data, software issues in
cleaning and validating it, related protocol issues, and its use in a few
applications.
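To make the cleaning and validation problem concrete, here is a minimal sketch that checks server log lines against the Common Log Format and sets malformed entries aside for inspection; the sample lines are invented:

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def clean(lines):
    """Keep well-formed entries; report the rest for inspection."""
    good, bad = [], []
    for line in lines:
        m = CLF.match(line)
        if m:
            rec = m.groupdict()
            rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
            good.append(rec)
        else:
            bad.append(line)   # truncated writes, corrupted lines, etc.
    return good, bad

sample = [
    '10.0.0.1 - - [06/Mar/2000:09:00:00 -0500] "GET /index.html HTTP/1.0" 200 2326',
    'garbled partial line',
]
good, bad = clean(sample)
print(len(good), "clean records,", len(bad), "rejected")
```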
Balachander Krishnamurthy has been with AT&T Labs-Research since receiving
his PhD in Computer Science from Purdue University. He has written and
edited a book called 'Practical Reusable UNIX Software' (John Wiley, 1995)
and was the series editor of 'Trends in Software' (John Wiley) consisting
of 8 books published over a period of five years. He holds several patents,
has published over thirty-five technical papers, given invited lectures in over
twenty countries, and presented tutorials on aspects related to the Web.
He is the area editor for Web related issues for ACM SIGCOMM's 'Computer
Communications Review' and is currently working on a book that will provide
a technical overview of the World Wide Web.
XML + Databases = ?
Dr. Michael Carey, IBM Almaden Research Center
In the first half of this talk, I will share some of my thoughts on
semistructured databases, object-relational databases, XML, web querying,
and how they are all related (or not). Using one of my favorite queries
("find U.S.-made Fender Jazz Bass or Precision Bass guitars available for
under $700 within 50 miles of my home in San Jose, California"), I'll talk
about what one can and can't do on the web today and how the database
community can hopefully help change that. I'll also talk about the pros
and cons of the aforementioned technologies, in terms of making my query
answerable, and I'll propose a possible XML-based research agenda that
might help us get there from here.
In the second half of this talk, I will discuss a new project - called
Xperanto (Xml Publishing of Entities, Relationships, ANd Typed Objects) -
that we have initiated at the IBM Almaden Research Center. The goal of
this project is to provide facilities to enable "XML people" (as opposed to
"SQL people") to conveniently publish content from relational and
object-relational databases on the web in queryable XML form. I will
outline the approach that we're taking, including the architecture of the
system, the roles of the various Xperanto components, and some of the
technical issues and challenges involved in the project.
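The flavor of publishing relational content as queryable XML can be sketched in a few lines: each row becomes an element whose children are named after the table's columns. This only illustrates the mapping, not Xperanto's architecture (the table is invented, in honor of the favorite query above):

```python
import sqlite3
import xml.etree.ElementTree as ET

# A toy relational table, stood up in memory for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE guitars(model TEXT, origin TEXT, price INTEGER)")
db.executemany("INSERT INTO guitars VALUES (?, ?, ?)",
               [("Jazz Bass", "USA", 689), ("Precision Bass", "USA", 659)])

def publish(connection, table):
    """Render each row of a table as an XML element (table name trusted)."""
    root = ET.Element(table)
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    for row in cursor:
        elem = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            ET.SubElement(elem, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

print(publish(db, "guitars"))
# <guitars><row><model>Jazz Bass</model>...</row>...</guitars>
```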
Michael J. Carey received the Ph.D. degree from UC Berkeley in 1983. He
spent 13 years on the faculty at the University of Wisconsin-Madison, where
he conducted research on DBMS performance, transaction processing,
distributed and parallel database systems, extensibility, and
object-oriented (O-O) databases. He co-directed the EXODUS and SHORE
projects while at Wisconsin. In mid-1995, Carey joined the staff of the IBM
Almaden Research Center, where he has worked on the Garlic heterogeneous
information system project and more recently on object-relational (O-R)
database system technology for DB2. Inspired by a semester spent as
Stonebraker Visiting Fellow at UC Berkeley in 1999, he has also begun to
explore the intersection of XML and object-relational database system
technology. His current interests include O-R DBMS implementation
techniques, the use of XML to publish databases' contents on the web, and
the ongoing evolution of the SQL standard.
The Next 700 Markup Languages
Dr. Philip Wadler, Lucent Technologies - Bell Labs
XML (eXtensible Markup Language) is a magnet for hype: the successor
to HTML for Web publishing, electronic data interchange, and
e-commerce. In fact, XML is little more than a notation for trees and
for tree grammars, a verbose variant of Lisp S-expressions coupled
with a poor man's BNF (Backus-Naur form). Yet this simple basis has
spawned scores of specialized sublanguages: for airlines, banks, and
cell phones; for astronomy, biology, and chemistry; for the DOD and
the IRS. Domain-specific languages indeed! There is much for the
language designer to contribute here. In particular, as all this is
based on a sort of S-expression, is there a role for a sort of Lisp?
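The observation that XML is "a verbose variant of Lisp S-expressions" is easy to make concrete: any element tree prints directly as an S-expression. A toy converter (attributes are omitted for brevity, and the sample document is invented):

```python
import xml.etree.ElementTree as ET

def to_sexpr(element):
    """Render an XML element as a Lisp-style S-expression string."""
    children = "".join(" " + to_sexpr(child) for child in element)
    text = (element.text or "").strip()
    body = f' "{text}"' if text else ""
    return f"({element.tag}{body}{children})"

doc = ET.fromstring(
    "<book><title>The Next 700 Markup Languages</title>"
    "<author>Wadler</author></book>"
)
print(to_sexpr(doc))
# (book (title "The Next 700 Markup Languages") (author "Wadler"))
```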
Philip Wadler is a researcher at Bell Labs, Lucent Technologies, and
codesigner of the languages Haskell and GJ. He spends his time on the
border between theory and practice, seeking ways one may inform the
other. He helped turn monads from a concept in algebraic topology
into a way to structure programs in Haskell, and his work on GJ may
help turn quantifiers in second-order logic into a feature of the Java
programming language. He edits the Journal of Functional Programming
for Cambridge University Press, and writes a column for SIGPLAN
Notices. He was an ACM distinguished lecturer from 1989 to 1993, and has
been an invited speaker in Amsterdam, Austin, Boulder, Brest, Gdansk,
London, Montreal, New Haven, Portland, Santa Fe, Sydney, and Victoria.
ObjectGlobe: Ubiquitous Query Processing on the Internet
Dr. Alfons Kemper, Universität Passau
We present the design of ObjectGlobe, a distributed and open query
processor. Today, data is published on the Internet via Web servers
which have, if any, only very localized query processing capabilities.
The ObjectGlobe project aims to establish an open marketplace in which
data and query processing capabilities can be distributed and used by
any kind of Internet application. Its goal is twofold. First, we would
like to create an infrastructure that makes it as easy to distribute
query processing capabilities (i.e., query operators) as it is to
publish data and documents on the Web today. Second, we would like to
enable clients to execute complex queries which involve the execution
of operators from multiple providers at different sites and the
retrieval of data and documents from multiple data sources. All query
operators should be able to interact in a distributed query plan, and it
should be possible to move query operators to arbitrary sites, including
sites near the data. The only requirement we impose is that all query
operators be written in Java and conform to the secure interfaces of
ObjectGlobe. One of the main challenges in the design of such an open
system is to ensure security. We discuss the ObjectGlobe security
requirements and show how basic components such as the optimizer and the
runtime system need to be extended. Finally, we present the results of
performance experiments that assess the benefits of placing query
operators close to the Internet data sources and the additional cost of
ensuring security in such an open system.
This is joint work with R. Braumandl, M. Keidl, D. Kossmann,
A. Kreutz, S. Proels, S. Seltzsam, and K. Stocker (all at the
University of Passau).
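ObjectGlobe operators are Java classes conforming to secure interfaces; the composability that lets operators from different providers interact in one plan comes from the classic iterator model (open/next/close). A hedged sketch of that model, in Python rather than Java for brevity (the class and method names are illustrative, not ObjectGlobe's actual API):

```python
class Operator:
    """Iterator-model interface: open / next / close."""
    def open(self): pass
    def next(self):   # returns a tuple, or None when exhausted
        raise NotImplementedError
    def close(self): pass

class Scan(Operator):
    """Leaf operator reading from some local or remote source."""
    def __init__(self, rows): self.rows = rows
    def open(self): self.pos = 0
    def next(self):
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

class Select(Operator):
    """Filter operator; could run at the data source or near the client."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self): self.child.open()
    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None
    def close(self): self.child.close()

# Any operator can sit above any other, which is what makes it possible
# to ship individual operators to arbitrary sites in a distributed plan.
plan = Select(Scan([("a", 1), ("b", 7), ("c", 3)]), lambda r: r[1] > 2)
plan.open()
while (row := plan.next()) is not None:
    print(row)
plan.close()
```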
Alfons Kemper received his Bachelor degree in Computer Science from
the University of Dortmund (Germany) in 1979. He then moved to the
University of Southern California, where he obtained his Master's degree
and his Ph.D. in Computer Science in 1981 and 1984,
respectively. From 1984 until 1991 he was an Assistant Professor of
Computer Science at the University of Karlsruhe, Germany. He spent
two years (from 1991 until 1993) as an Associate Professor at the
Technical University (RWTH) of Aachen, Germany. He is currently a Full
Professor of Computer Science at the University of Passau, Germany.
His research interests center around the design and realization of
advanced database technology. His main research focus was on indexing and query
processing techniques for object-oriented and object-relational
database systems and performance issues related to complex database
application systems (such as decision support systems and SAP R/3). In
his recent work he concentrates on distributed database implementation
techniques and distributed query processing over Internet data
sources.
Data Management for Ubiquitous Computing
Dr. Alon Levy, University of Washington
In the not too distant future, many devices (e.g., common household
appliances, PDAs, cellphones, cars) will contain computer chips that
will enable them to exhibit more sophisticated behavior and interact
with other devices. For example, refrigerators will be able to
monitor their contents and automatically order supplies. The heating
system of a house will monitor the alarm clocks and calendars of its
owners to set the temperature in an optimal fashion. In the context
of ubiquitous computing, data exchange and computation occur in the
background in response to cues from users. Computing is centered
around the network and data, rather than around the computing devices
or disks as it is today. Devices are added and removed from the
network on a regular basis, and they must be able to interoperate with
little human intervention.
I will describe the Sagres Project being conducted at the University
of Washington, which addresses the data management issues that arise
in the context of ubiquitous computing. In particular, in Sagres we
consider the issues of modeling a wide variety of devices, describing
the complex interactions between devices, and supporting the easy
addition and removal of devices. This is joint work with Qiong Chen,
Zack Ives, Jayant Madhavan, Rachel Pottinger, Stefan Saroiu, and Igor Tatarinov.
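One way to picture the data management problem is a mediator over transient sources: devices register self-describing data when they join the network and vanish without ceremony, while queries run over whatever happens to be attached at the moment. A toy sketch (device names and schemas are invented; this is not the Sagres design):

```python
class Mediator:
    """Query hub for devices that come and go at any time."""
    def __init__(self):
        self.devices = {}           # name -> callable returning records

    def attach(self, name, source):
        self.devices[name] = source

    def detach(self, name):
        self.devices.pop(name, None)

    def query(self, predicate):
        """Gather matching records from every currently attached device."""
        results = []
        for name, source in self.devices.items():
            for record in source():
                if predicate(record):
                    results.append((name, record))
        return results

mediator = Mediator()
mediator.attach("fridge", lambda: [{"item": "milk", "quantity": 0}])
mediator.attach("alarm", lambda: [{"wake": "06:30"}])
print(mediator.query(lambda r: r.get("quantity") == 0))   # items to reorder
mediator.detach("alarm")    # devices leave with no other reconfiguration
```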
Alon Levy joined the faculty of the Computer Science and Engineering
Department of the University of Washington in January, 1998. Before
joining the U. of Washington, he was a principal member of technical
staff at AT&T (previously Bell) Laboratories. He received his Ph.D. in
Computer Science from Stanford University in 1993. Alon's interests
are in data integration, web-site management, semi-structured data,
database aspects of ubiquitous computing, query optimization, and
interactions between Databases and Artificial Intelligence. In June,
1999 he co-founded Nimble.com, a company that builds tools for query
processing and data integration for XML.