March 6, 2000
Morning
Breakfast and Registration: 8:15 - 8:45
Welcome and Greeting: 8:50 - 9:00
Fred Roberts, Director of DIMACS
Representing Web Data
The web graph: structure and interpretation: 9:00 - 9:45
Dr. Sridhar Rajagopalan, IBM Almaden Research Center
Abstract & Speaker Bio
Economics of Information
A Proposal for Valuing Information and Instrumental Goods: 9:45 - 10:30
Dr. Marshall Van Alstyne, University of Michigan
Abstract & Speaker Bio
Break: 10:30 - 11:00
More on Representing Web Data
WHIRL: A Formalism for Representing Web Data: 11:00 - 11:45
Dr. William Cohen, AT&T Labs - Research
Abstract & Speaker Bio
Musings on the Extraction of Structure from the Web: 11:45 - 12:30
Dr. Rajeev Motwani, Stanford University
Abstract & Speaker Bio
Lunch: 12:30 - 2:00
Afternoon
Privacy
Privacy Implications of Online Data Collection: 2:00 - 2:45
Dr. Lorrie Cranor, AT&T Labs - Research
Abstract & Speaker Bio
Revolution, not Evolution
Online music - The next big revolution: 2:45 - 3:30
Dr. Narayanan Shivakumar, Gigabeat, Inc.
Abstract & Speaker Bio
Break: 3:30 - 4:00
A Petabyte in Your Pocket: Directions for Net Data Management: 4:00 - 4:45
Dr. David Maier, Oregon Graduate Institute
Abstract & Speaker Bio
March 7, 2000
Morning
Breakfast and Registration: 8:15 - 9:00
Data on Data
Information Access and Data Processing on the Web: Current Limitations, New Techniques, and Future Directions: 9:00 - 9:45
Dr. Steven Lawrence, NEC Research Institute
Abstract & Speaker Bio
On collecting and using Web data: 9:45 - 10:30
Dr. Balachander Krishnamurthy, AT&T Labs - Research
Abstract & Speaker Bio
Break: 10:30 - 11:00
XML
XML + Databases = ?: 11:00 - 11:45
Dr. Michael Carey, IBM Almaden Research Center
Abstract & Speaker Bio
The Next 700 Markup Languages: 11:45 - 12:30
Dr. Philip Wadler, Lucent Technologies - Bell Labs
Abstract & Speaker Bio
Lunch: 12:30 - 2:00
Afternoon
Ubiquity
ObjectGlobe: Ubiquitous Query Processing on the Internet: 2:00 - 2:45
Dr. Alfons Kemper, Universität Passau
Abstract & Speaker Bio
Data Management for Ubiquitous Computing: 2:45 - 3:30
Dr. Alon Levy, University of Washington
Abstract & Speaker Bio
Break: 3:30 - 4:00
Open Research/Future Problems Session: 4:00 - 5:00
The web graph: structure and interpretation
Dr. Sridhar Rajagopalan, IBM Almaden Research Center
The study of the web as a graph is not only fascinating in its own right, but also yields valuable insight into web algorithms for crawling, searching and community discovery, and the sociological phenomena which characterize its evolution. We report on experiments on local and global properties of the web graph using two crawls, each with over 200M pages and 1.5 billion links. Our study indicates that the macroscopic structure of the web is considerably more intricate than suggested by earlier experiments on a smaller scale.
Collaborators: Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Raymie Stata, Andrew Tomkins, Eli Upfal and Janet Wiener.
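The "macroscopic structure" referred to above is the pattern of strongly connected pieces in the link graph. As a rough, hedged illustration of the kind of global analysis involved (not the authors' actual experiments; the toy graph and page names below are invented), here is a sketch that finds the strongly connected components of a small link graph with Kosaraju's two-pass depth-first search:

```python
from collections import defaultdict

# Toy web graph: page -> set of pages it links to.
links = {
    "a": {"b"}, "b": {"c"}, "c": {"a", "d"},   # a, b, c form a cycle (one SCC)
    "d": {"e"}, "e": {"d"},                    # d, e form another SCC
    "f": {"a"},                                # f links in; nothing links back
}

def reverse(graph):
    rev = defaultdict(set)
    for u, outs in graph.items():
        rev[u]                      # make sure every page appears as a key
        for v in outs:
            rev[v].add(u)
    return rev

def dfs_finish_order(graph):
    """Nodes in order of completed depth-first search (iterative)."""
    seen, order = set(), []
    for root in graph:              # assumes every page appears as a key
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, children = stack[-1]
            child = next((c for c in children if c not in seen), None)
            if child is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(child)
                stack.append((child, iter(graph.get(child, ()))))
    return order

def sccs(graph):
    """Kosaraju: DFS on G for finish order, then DFS on G reversed."""
    rev, seen, components = reverse(graph), set(), []
    for root in reversed(dfs_finish_order(graph)):
        if root in seen:
            continue
        comp, stack = [], [root]
        seen.add(root)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in rev.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(comp)
    return components

print(sccs(links))   # e.g. [['f'], ['a', 'c', 'b'], ['d', 'e']]
```

On a real crawl such an analysis is run out-of-core over billions of links, but the underlying idea is this same linear-time algorithm.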
Sridhar Rajagopalan received a B.Tech. from the Indian Institute of Technology, Delhi, in 1989 and a Ph.D. from the University of California, Berkeley, in 1994. He was a DIMACS postdoctoral fellow from 1994 to 1996. He is now a Research Staff Member at the IBM Almaden Research Center. His research interests are algorithms and algorithm engineering, randomization, information and coding theory, and information retrieval issues on the web.
WHIRL: A Formalism for Representing Web Data
Dr. William Cohen, AT&T Shannon Laboratory
Data on the Web is hard to represent with conventional knowledge-base
and database formalisms, due to problems like terminological
differences across sites, and the frequent interleaving of textual
information with structured, data-like information. Over the last few
years, I have developed a new "information representation language"
called WHIRL that addresses these problems by incorporating ideas from
both AI knowledge representation systems and statistical information
retrieval. Specifically, WHIRL is a subset of Prolog that has been
extended by adding special features for reasoning about the similarity
of fragments of text. WHIRL's combination of features greatly
facilitates the construction of Web-based information integration
systems; in more recent work, WHIRL has also been useful for
collecting Web data for collaborative filtering and machine learning
systems.
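WHIRL itself is a Prolog-like language; its core primitive is a "soft join" that pairs records whose text fields are similar under a TF-IDF cosine measure, rather than exactly equal. Here is a minimal sketch of that primitive only (the tables, threshold, and scoring details are invented for illustration and are not WHIRL's actual semantics):

```python
import math
from collections import Counter

# Two small "tables" whose name fields use different terminology.
movies  = ["Star Wars Episode IV", "The Matrix", "Blade Runner"]
reviews = ["matrix, the", "blade runner (director's cut)", "star wars"]

def tokens(text):
    return "".join(c if c.isalnum() else " " for c in text.lower()).split()

def tfidf(docs):
    """Unit-length TF-IDF vector (a dict) for each document."""
    df = Counter(t for d in docs for t in set(tokens(d)))
    n = len(docs)
    vecs = []
    for d in docs:
        tf = Counter(tokens(d))
        v = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def similarity_join(xs, ys, threshold=0.3):
    """Pair rows whose text fields are similar: a WHIRL-style soft join."""
    vx, vy = tfidf(xs), tfidf(ys)   # note: IDF computed per table, for brevity
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            sim = sum(w * vy[j].get(t, 0.0) for t, w in vx[i].items())
            if sim >= threshold:
                yield x, y, round(sim, 2)

for pair in similarity_join(movies, reviews):
    print(pair)   # matches "The Matrix" with "matrix, the", and so on
```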
William Cohen received his bachelor's degree
in Computer Science from
Duke University, and a Ph.D. in Computer Science from Rutgers
University. Since 1990, Dr. Cohen has been employed at AT&T Labs.
His main research area is machine learning, and his present research
interests include information integration, text categorization, and
learning from large datasets.
Musings on the Extraction of Structure from the Web
Dr. Rajeev Motwani, Stanford University
One of the major challenges in dealing with the web is the unstructured or
semistructured nature of the data. There are major benefits in extracting
the structure implicit in the web or extracting a subset of the data that
can be structured easily. I will give a personal view of some attempts in
this direction. The focus will be on these problems and on proposals for
attacking them.
Rajeev Motwani is an associate
professor of computer science at Stanford University, where he also serves
as the director of graduate studies. He obtained his Ph.D. in computer
science from the University of California, Berkeley, in 1988, and his
B.Tech. in computer science from the Indian Institute of Technology,
Kanpur, in 1983. His
research interests include: databases and data mining, web search and
information retrieval, robotics, and theoretical computer science. He is a
co-author of the book, Randomized Algorithms, published by Cambridge
University Press in 1995. Motwani has received the Alfred P. Sloan Research
Fellowship, the National Young Investigator Award from the National Science
Foundation, the Bergmann Memorial Award from the US-Israel Binational
Science Foundation, and an IBM Faculty Award.
Privacy Implications of Online Data Collection
Dr. Lorrie Faith Cranor, AT&T Shannon Laboratory
New Web applications are enhancing businesses' ability to gather data
about their online customers, helping them provide customized services
and targeted advertising. By learning more about their customers, online
businesses can develop more personal relationships with them, allowing
them to better anticipate and meet their customers' needs. However, many
of the Web-based data collection systems being deployed raise serious
privacy concerns. First, most of
these systems are being deployed silently, without notifying Web site
visitors or giving them an opportunity to choose whether or not they
wish to have their data collected. Second, online data is increasingly
being combined with data from a variety of sources, allowing for the
development of detailed individual profiles. Third, data gathered for
business purposes is increasingly being subpoenaed for use in criminal
investigations and civil proceedings. Even when data is stored without
traditional identifiers such as name or social security number,
profiles often contain enough information to uniquely identify
individuals. The privacy concerns raised by online data collection are
gaining increased attention from the news media, the public, and
policy makers. As more advanced data collection and processing
applications are deployed without addressing privacy concerns,
individual privacy is slowly eroding. Individually, most of these
applications do not pose major threats, but taken together they are
bringing us closer to a surveillance society.
Lorrie Faith Cranor
is a Senior Technical Staff Member in the Secure
Systems Research Department at AT&T Labs-Research Shannon Laboratory
in Florham Park, New Jersey. She is chair of the Platform for Privacy
Preferences Project (P3P) Specification Working Group at the World
Wide Web Consortium. Her research has focused on a variety of areas
where technology and policy issues interact, including online privacy,
electronic voting, and spam. For more information, please see her home
page at http://www.research.att.com/~lorrie/.
A Proposal for Valuing Information and Instrumental Goods
Dr. Marshall Van Alstyne, University of Michigan
How should a firm value information capital? This essay offers one framework that combines ideas
from economics and computer science. Drawing a distinction between data and procedures, it
augments the traditional Bayesian model, which treats information as a change in uncertainty, with
instruments that treat information as instructions to be executed and reused. It then applies the
standard hedonic methods - used in marketing to value tangible goods - to information goods. This
leads to a generalized method for ascribing value that can be applied both to procedural information
such as software, blueprints, and production know-how, and to arbitrary resources that have
instrumental qualities, that is, resources that represent tools for effecting outcomes. This approach has
the added advantage of supporting efficient information transfers, since consumers need not always
see the information they are about to buy.
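The hedonic step can be made concrete with a toy regression: observed prices of information goods are regressed on their attributes, and the fitted coefficients act as implicit per-attribute prices. A hedged sketch, with all attribute names and numbers invented (this is not the paper's model):

```python
import numpy as np

# Hypothetical data: information goods described by attribute vectors
# (say, coverage, timeliness, reusability) and observed market prices.
attributes = np.array([
    [1.0, 0.2, 0.9],
    [0.4, 0.8, 0.1],
    [0.7, 0.5, 0.5],
    [0.9, 0.9, 0.8],
])
prices = np.array([120.0, 80.0, 90.0, 160.0])

# Hedonic method: regress price on attributes; the fitted coefficients
# are implicit per-attribute ("shadow") prices.
X = np.column_stack([np.ones(len(prices)), attributes])   # add an intercept
coef, *_ = np.linalg.lstsq(X, prices, rcond=None)
intercept, shadow_prices = coef[0], coef[1:]
print("implicit attribute prices:", shadow_prices.round(1))

# A new good can then be valued from its attributes alone, which is how
# the approach supports transfers without revealing the information itself.
new_good = np.array([0.6, 0.7, 0.3])
print("estimated value:", round(intercept + shadow_prices @ new_good, 1))
```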
Marshall Van Alstyne is an assistant professor at the University of Michigan, where he teaches
information economics, electronic commerce, and computer simulation. He holds a bachelor's in
computer science from Yale and MS and Ph.D. degrees in information technology from MIT. In 1999,
he received an NSF CAREER Award and an Intel Young Investigator Fellowship to pursue research on the
economics of information. Before returning to academia, he worked as a technology management
consultant and co-founded a software venture. Past clients include Fortune 500 companies as well
as US state and federal government agencies. He has published in several journals including Science
and Sloan Management Review, and his research has been the subject of radio broadcasts in the US
and Canada.
Online music - The next big revolution
Dr. Narayanan Shivakumar, Gigabeat, Inc.
Text is passé. The growing popularity of the MP3 and RealAudio formats
is leading to a new revolution on the Internet - online music. In this
talk, I will discuss some challenging problems that arise in building
multi-media web crawlers, and in mining the web for nuggets of text and
audio information.
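As a hedged sketch of the kind of crawler the talk concerns, the following follows page links breadth-first and sets aside links whose file extensions suggest audio content. The extension list, limits, and seed URL are invented; a production multimedia crawler would also inspect content types, obey robots.txt, and rate-limit its requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

AUDIO_EXTENSIONS = (".mp3", ".ra", ".ram", ".wav")   # illustrative list

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_for_audio(seed_url, max_pages=50):
    """Breadth-first crawl that separates audio links from page links."""
    frontier, seen, audio = [seed_url], {seed_url}, []
    while frontier and len(seen) <= max_pages:
        url = frontier.pop(0)
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                    # unreachable host, bad response, etc.
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:
            target = urljoin(url, link)
            if target.lower().endswith(AUDIO_EXTENSIONS):
                audio.append(target)    # found a music file
            elif target not in seen:
                seen.add(target)
                frontier.append(target) # follow ordinary page links
    return audio

# print(crawl_for_audio("http://example.com/"))
```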
Narayanan Shivakumar is currently the Chief Scientist of Gigabeat.com,
an online music startup in Palo Alto, CA. His current research interests
include data mining, multi-media, databases, and digital libraries. He
has been a summer visitor at Microsoft Corp., Bell Labs, and Xerox PARC.
He is a member of ACM and Tau Beta Pi. He received his MS (1997) and
PhD (1999) in computer science from Stanford, and his BS (1994) from UCLA.
A Petabyte in Your Pocket: Directions for Net Data Management
Dr. David Maier, Department of Computer Science and Engineering, Oregon Graduate Institute
In 2015, for a few hundred dollars a year, you can have a personal petabyte
database (PetDB) that you can access from any point of connection, with any
device. It stores and organizes any kind of digital data you want to have,
without losing structure or information. All this data is queryable, and it
is arranged by type, content, structure, association, and multiple
categorizations and groupings.
You can also locate items by when or how you encountered them, what you
have done with them, and where you were when you accessed them.
What could you fit in a personal petabyte store?
Your PetDB doesn't appear to reside on any particular computer; you are
never on the "wrong" machine to access it. More importantly, you don't have
to take any explicit action to insert data into your PetDB; your PetDB
doesn't appear to have an "outside" where data is concerned. Thus your
PetDB is also your personal Internet portal: your evolving and customized
view of all on-line digital data.
The PetDB isn't predicated upon some massive improvement in holographic
memory technology or DNA-based storage units. Rather, it is an example of
what could be done with a new generation of software infrastructure we term
Net Data Managers (NDMs). NDMs are a radical departure from the capabilities
and structure of current database management systems. They focus on data
movement rather than data storage, working as well with live streams of data
as with files in secondary storage. They will be capable of storing data of
arbitrary types, without a matching database schema having been defined
previously. They will efficiently execute queries over thousands or tens of
thousands of information sites. They will locate and select data items by
both internal content and a variety of external contexts. NDMs will also
support monitoring rapidly changing information sources in a way that scales
to thousands or even millions of triggers.
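The trigger-scaling requirement hints at the implementation strategy: rather than evaluating each trigger separately, group triggers with the same shape so that one pass over an arriving item fires all of them at once. A toy sketch of that idea (trigger names and fields are invented; this is not the NIAGARA design):

```python
import bisect
from collections import defaultdict

class TriggerIndex:
    """Evaluate many 'field > constant' triggers against a data stream.

    Triggers on the same field share one sorted list of thresholds, so a
    single binary search finds every trigger an arriving item fires.
    """
    def __init__(self):
        self.thresholds = defaultdict(list)   # field -> sorted (threshold, id)

    def register(self, trigger_id, field, threshold):
        bisect.insort(self.thresholds[field], (threshold, trigger_id))

    def match(self, item):
        fired = []
        for field, value in item.items():
            entries = self.thresholds.get(field, [])
            # every threshold strictly below the value fires
            pos = bisect.bisect_left(entries, (value, ""))
            fired.extend(tid for _, tid in entries[:pos])
        return fired

index = TriggerIndex()
index.register("alert-quote", "price", 100.0)
index.register("alert-quote-2", "price", 250.0)
index.register("alert-volume", "volume", 1_000_000)
print(index.match({"price": 120.0, "volume": 5_000}))   # ['alert-quote']
```

With triggers grouped this way, the per-item cost grows with the number of fields and matches, not with the total number of registered triggers.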
This talk lays out the requirements for Net Data Management, and reports
on the research directions being pursued by the NIAGARA project, a joint
undertaking with David DeWitt and Jeffrey Naughton at the University of
Wisconsin.
David Maier is a professor of Computer Science and Engineering at
Oregon Graduate Institute. His current research interests include
object-oriented databases, query processing, superimposed information
systems, XML and related standards, information assurance, scientific
databases and net data management. He has consulted with most of
the major database vendors, including Oracle, Informix, IBM and Microsoft.
Maier is an ACM Fellow and a holder of the SIGMOD Innovations Award.
He received his PhD from Princeton University in 1978.
Information Access and Data Processing on the Web: Current Limitations, New Techniques, and Future Directions
Dr. Steve Lawrence, NEC Research Institute
This talk describes current limitations, new techniques, and future
directions for information access and data processing on the web. We
describe recent studies analyzing the accessibility, distribution, and
structure of information on the web, which highlight substantial room
for improvement and for new methods. We then present new techniques for
information access and data processing on the web, including two projects
at NEC Research Institute: Inquirus, a content-based metasearch engine,
and CiteSeer, the largest free full-text index of scientific literature
in the world.
Joint work with Lee Giles, Kurt Bollacker, and Eric Glover.
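Inquirus's distinguishing idea is to download the result pages themselves and rank by their actual content, rather than by the orderings the source engines report. A hedged caricature of that reranking step (the URLs, texts, and scoring formula below are invented and are not the Inquirus algorithm):

```python
import math
from collections import Counter

def tokenize(text):
    return [t for t in text.lower().split() if t.isalnum()]

def rerank(query, pages):
    """Order result pages by how well their full text matches the query.

    Ordinary metasearch merges engines' rank lists; a content-based
    engine scores the downloaded text of each candidate page instead.
    """
    qterms = tokenize(query)
    scored = []
    for url, text in pages:
        tf = Counter(tokenize(text))
        total = sum(tf.values()) or 1
        score = sum(math.log(1 + tf[t]) for t in qterms) / math.log(1 + total)
        scored.append((score, url))
    return [url for score, url in sorted(scored, reverse=True)]

# Hypothetical downloaded pages (in practice these come from fetching
# the result URLs returned by several search engines).
pages = [
    ("http://example.org/a", "metasearch engines combine results from many engines"),
    ("http://example.org/b", "a page about gardening with no relevant terms"),
    ("http://example.org/c", "metasearch metasearch ranking of engines by page content"),
]
print(rerank("metasearch engines", pages))
```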
Steve Lawrence is a Research Scientist at NEC Research Institute in
Princeton, NJ. Dr. Lawrence has published over 50 articles in areas
including information retrieval, web analysis, digital libraries, and
machine learning. Dr. Lawrence has done over 100 interviews with news
organizations including the New York Times, Wall Street Journal,
Washington Post, Reuters, Associated Press, UPI, CNN, BBC, MSNBC, and
NPR. Hundreds of articles about his research have appeared worldwide
in over 10 different languages.
On collecting and using Web data
Dr. Balachander Krishnamurthy, AT&T Labs - Research
Web-related data has been gathered since the inception of the Web, often
without the knowledge of the vast majority of Web users. By Web data, I
mean client, proxy, and server logs, and HTTP packet traces. Apart from
obvious privacy issues, Web data presents problems relating to gathering,
storing, cleaning, and validation. I have been involved in several aspects
of collecting Web-related data from a wide variety of sources (both inside
and outside AT&T) and in creating a repository in conjunction with the
World Wide Web Consortium's (W3C) Web Characterization group. The data has
been used in several applications, ranging from Web caching, improving the
HTTP/1.1 protocol, testing Web software components for compliance with the
protocol, and reducing validation traffic, to predicting future accesses.
I will cover the basics of collecting Web data, software issues in
cleaning and validating it, related protocol issues, and its use in a few
applications.
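To make the cleaning and validation problem concrete, here is a minimal sketch that checks server log lines against the Common Log Format and sets malformed entries aside for inspection; the sample lines are invented:

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def clean(lines):
    """Keep well-formed entries; report the rest for inspection."""
    good, bad = [], []
    for line in lines:
        m = CLF.match(line)
        if m:
            rec = m.groupdict()
            rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
            good.append(rec)
        else:
            bad.append(line)   # truncated writes, corrupted lines, etc.
    return good, bad

sample = [
    '10.0.0.1 - - [06/Mar/2000:09:00:00 -0500] "GET /index.html HTTP/1.0" 200 2326',
    'garbled partial line',
]
good, bad = clean(sample)
print(len(good), "clean records,", len(bad), "rejected")
```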
Balachander Krishnamurthy has been with AT&T Labs-Research since receiving
his PhD in Computer Science from Purdue University. He has written and
edited a book called 'Practical Reusable UNIX Software' (John Wiley, 1995)
and was the series editor of 'Trends in Software' (John Wiley) consisting
of 8 books published over a period of five years. He holds several patents,
has published over thirty-five technical papers, given invited lectures in over
twenty countries, and presented tutorials on aspects related to the Web.
He is the area editor for Web related issues for ACM SIGCOMM's 'Computer
Communications Review' and is currently working on a book that will provide
a technical overview of the World Wide Web.
XML + Databases = ?
Dr. Michael Carey, IBM Almaden Research Center
In the first half of this talk, I will share some of my thoughts on
semistructured databases, object-relational databases, XML, web querying,
and how they are all related (or not). Using one of my favorite queries
("find U.S.-made Fender Jazz Bass or Precision Bass guitars available for
under $700 within 50 miles of my home in San Jose, California"), I'll talk
about what one can and can't do on the web today and how the database
community can hopefully help change that. I'll also talk about the pros
and cons of the aforementioned technologies, in terms of making my query
answerable, and I'll propose a possible XML-based research agenda that
might help us get there from here.
In the second half of this talk, I will discuss a new project - called
Xperanto (Xml Publishing of Entities, Relationships, ANd Typed Objects) -
that we have initiated at the IBM Almaden Research Center. The goal of
this project is to provide facilities to enable "XML people" (as opposed to
"SQL people") to conveniently publish content from relational and
object-relational databases on the web in queryable XML form. I will
outline the approach that we're taking, including the architecture of the
system, the roles of the various Xperanto components, and some of the
technical issues and challenges involved in the project.
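The flavor of publishing relational content as queryable XML can be sketched in a few lines: each row becomes an element whose children are named after the table's columns. This only illustrates the mapping, not Xperanto's architecture (the table is invented, in honor of the favorite query above):

```python
import sqlite3
import xml.etree.ElementTree as ET

# A toy relational table, stood up in memory for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE guitars(model TEXT, origin TEXT, price INTEGER)")
db.executemany("INSERT INTO guitars VALUES (?, ?, ?)",
               [("Jazz Bass", "USA", 689), ("Precision Bass", "USA", 659)])

def publish(connection, table):
    """Render each row of a table as an XML element (table name trusted)."""
    root = ET.Element(table)
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    for row in cursor:
        elem = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            ET.SubElement(elem, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

print(publish(db, "guitars"))
# <guitars><row><model>Jazz Bass</model>...</row>...</guitars>
```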
Michael J. Carey received the Ph.D. degree from UC Berkeley in 1983. He
spent 13 years on the faculty at the University of Wisconsin-Madison, where
he conducted research on DBMS performance, transaction processing,
distributed and parallel database systems, extensibility, and
object-oriented (O-O) databases. He co-directed the EXODUS and SHORE
projects while at Wisconsin. In mid-1995, Carey joined the staff of the IBM
Almaden Research Center, where he has worked on the Garlic heterogeneous
information system project and more recently on object-relational (O-R)
database system technology for DB2. Inspired by a semester spent as
Stonebraker Visiting Fellow at UC Berkeley in 1999, he has also begun to
explore the intersection of XML and object-relational database system
technology. His current interests include O-R DBMS implementation
techniques, the use of XML to publish databases' contents on the web, and
the ongoing evolution of the SQL standard.
The Next 700 Markup Languages
Dr. Philip Wadler, Lucent Technologies - Bell Labs
XML (eXtensible Markup Language) is a magnet for hype: the successor
to HTML for Web publishing, electronic data interchange, and
e-commerce. In fact, XML is little more than a notation for trees and
for tree grammars, a verbose variant of Lisp S-expressions coupled
with a poor man's BNF (Backus-Naur form). Yet this simple basis has
spawned scores of specialized sublanguages: for airlines, banks, and
cell phones; for astronomy, biology, and chemistry; for the DOD and
the IRS. Domain-specific languages indeed! There is much for the
language designer to contribute here. In particular, as all this is
based on a sort of S-expression, is there a role for a sort of Lisp?
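The observation that XML is "a verbose variant of Lisp S-expressions" is easy to make concrete: any element tree prints directly as an S-expression. A toy converter (attributes are omitted for brevity, and the sample document is invented):

```python
import xml.etree.ElementTree as ET

def to_sexpr(element):
    """Render an XML element as a Lisp-style S-expression string."""
    children = "".join(" " + to_sexpr(child) for child in element)
    text = (element.text or "").strip()
    body = f' "{text}"' if text else ""
    return f"({element.tag}{body}{children})"

doc = ET.fromstring(
    "<book><title>The Next 700 Markup Languages</title>"
    "<author>Wadler</author></book>"
)
print(to_sexpr(doc))
# (book (title "The Next 700 Markup Languages") (author "Wadler"))
```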
Philip Wadler is a researcher at Bell Labs, Lucent Technologies, and
codesigner of the languages Haskell and GJ. He spends his time on the
border between theory and practice, seeking ways one may inform the
other. He helped turn monads from a concept in algebraic topology
into a way to structure programs in Haskell, and his work on GJ may
help turn quantifiers in second-order logic into a feature of the Java
programming language. He edits the Journal of Functional Programming
for Cambridge University Press, and writes a column for SIGPLAN
Notices. He was an ACM distinguished lecturer from 1989 to 1993, and has
been an invited speaker in Amsterdam, Austin, Boulder, Brest, Gdansk,
London, Montreal, New Haven, Portland, Santa Fe, Sydney, and Victoria.
ObjectGlobe: Ubiquitous Query Processing on the Internet
Dr. Alfons Kemper, Universität Passau
We present the design of ObjectGlobe, a distributed and open query
processor. Today, data is published on the Internet via Web servers
which have, if any, only very localized query processing capabilities.
The ObjectGlobe project aims to establish an open marketplace in which
data and query processing capabilities can be distributed and used by
any kind of Internet application. Its goal is twofold. First, we would
like to create an infrastructure that makes it as easy to distribute
query processing capabilities (i.e., query operators) as it is to
publish data and documents on the Web today. Second, we would like to
enable clients to execute complex queries which involve the execution
of operators from multiple providers at different sites and the
retrieval of data and documents from multiple data sources. All query
operators should be able to interact in a distributed query plan, and it
should be possible to move query operators to arbitrary sites, including
sites near the data. The only requirement we impose is that all query
operators be written in Java and conform to the secure interfaces of
ObjectGlobe. One of the main challenges in the design of such an open
system is to ensure security. We discuss the ObjectGlobe security
requirements and show how basic components such as the optimizer and the
runtime system need to be extended. Finally, we present the results of
performance experiments that assess the benefits of placing query
operators close to the Internet data sources and the additional cost of
ensuring security in such an open system.
This is joint work with R. Braumandl, M. Keidl, D. Kossmann,
A. Kreutz, S. Proels, S. Seltzsam, and K. Stocker (all at the
University of Passau).
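ObjectGlobe operators are Java classes conforming to secure interfaces; the composability that lets operators from different providers interact in one plan comes from the classic iterator model (open/next/close). A hedged sketch of that model, in Python rather than Java for brevity (the class and method names are illustrative, not ObjectGlobe's actual API):

```python
class Operator:
    """Iterator-model interface: open / next / close."""
    def open(self): pass
    def next(self):   # returns a tuple, or None when exhausted
        raise NotImplementedError
    def close(self): pass

class Scan(Operator):
    """Leaf operator reading from some local or remote source."""
    def __init__(self, rows): self.rows = rows
    def open(self): self.pos = 0
    def next(self):
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

class Select(Operator):
    """Filter operator; could run at the data source or near the client."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self): self.child.open()
    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None
    def close(self): self.child.close()

# Any operator can sit above any other, which is what makes it possible
# to ship individual operators to arbitrary sites in a distributed plan.
plan = Select(Scan([("a", 1), ("b", 7), ("c", 3)]), lambda r: r[1] > 2)
plan.open()
while (row := plan.next()) is not None:
    print(row)
plan.close()
```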
Alfons Kemper received his Bachelor degree in Computer Science from
the University of Dortmund (Germany) in 1979. He then moved to the
University of Southern California, where he obtained his Master's degree
and his Ph.D. in Computer Science in 1981 and 1984,
respectively. From 1984 until 1991 he was an Assistant Professor of
Computer Science at the University of Karlsruhe, Germany. He spent
two years (from 1991 until 1993) as an Associate Professor at the
Technical University (RWTH) of Aachen, Germany. He is currently a Full
Professor of Computer Science at the University of Passau, Germany.
His research interests center around the design and realization of
advanced database technology. His main research focus was on indexing and query
processing techniques for object-oriented and object-relational
database systems and performance issues related to complex database
application systems (such as decision support systems and SAP R/3). In
his recent work he concentrates on distributed database implementation
techniques and distributed query processing over Internet data
sources.
Data Management for Ubiquitous Computing
Dr. Alon Levy, University of Washington
In the not too distant future, many devices (e.g., common household
appliances, PDAs, cellphones, cars) will contain computer chips that
will enable them to exhibit more sophisticated behavior and interact
with other devices. For example, refrigerators will be able to
monitor their contents and automatically order supplies. The heating
system of a house will monitor the alarm clocks and calendars of its
owners to set the temperature in an optimal fashion. In the context
of ubiquitous computing, data exchange and computation occur in the
background in response to cues from users. Computing is centered
around the network and data, rather than around the computing devices
or disks as it is today. Devices are added and removed from the
network on a regular basis, and they must be able to interoperate with
little human intervention.
I will describe the Sagres Project being conducted at the University
of Washington, which addresses the data management issues that arise
in the context of ubiquitous computing. In particular, in Sagres we
consider the issues of modeling a wide variety of devices, describing
the complex interactions between devices, and supporting the easy
addition and removal of devices. This is joint work with Qiong Chen,
Zack Ives, Jayant Madhavan, Rachel Pottinger, Stefan Saroiu, and Igor Tatarinov.
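One way to picture the data management problem is a mediator over transient sources: devices register self-describing data when they join the network and vanish without ceremony, while queries run over whatever happens to be attached at the moment. A toy sketch (device names and schemas are invented; this is not the Sagres design):

```python
class Mediator:
    """Query hub for devices that come and go at any time."""
    def __init__(self):
        self.devices = {}           # name -> callable returning records

    def attach(self, name, source):
        self.devices[name] = source

    def detach(self, name):
        self.devices.pop(name, None)

    def query(self, predicate):
        """Gather matching records from every currently attached device."""
        results = []
        for name, source in self.devices.items():
            for record in source():
                if predicate(record):
                    results.append((name, record))
        return results

mediator = Mediator()
mediator.attach("fridge", lambda: [{"item": "milk", "quantity": 0}])
mediator.attach("alarm", lambda: [{"wake": "06:30"}])
print(mediator.query(lambda r: r.get("quantity") == 0))   # items to reorder
mediator.detach("alarm")    # devices leave with no other reconfiguration
```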
Alon Levy joined the faculty of the Computer Science and Engineering
Department of the University of Washington in January, 1998. Before
joining the U. of Washington, he was a principal member of technical
staff at AT&T (previously Bell) Laboratories. He received his Ph.D. in
Computer Science from Stanford University in 1993. Alon's interests
are in data integration, web-site management, semi-structured data,
database aspects of ubiquitous computing, query optimization, and
interactions between Databases and Artificial Intelligence. In June,
1999 he co-founded Nimble.com, a company that builds tools for query
processing and data integration for XML.