DIMACS Working Group on Data Analysis
May 17, 2001
DIMACS Center, Rutgers University, Piscataway, NJ
Organizers:
- Paul Kantor, Rutgers University, kantorp@cs.rutgers.edu
- David Hull, WhizBang, dhull@whizbang.com
- Michael Berry, University of Tennessee, berry@cs.utk.edu
- Endre Boros, Rutgers University, boros@rutcor.rutgers.edu
- Warren Greiff, MITRE Corporation, greiff@mitre.org
- Liz Liddy, Syracuse University, liddy@mailbox.syr.edu
Presented under the auspices of the Special Focus on Next Generation Networks Technologies and Applications and the Special Focus on Data Analysis and Mining
Abstracts:
1. Statistical Modeling for the Segmentation of Archived News Broadcasts
Warren Greiff, MITRE
Not all information streams come with individual items delimited. For example,
broadcast news captured from the air waves may not be accompanied by indications
of where one story ends and another begins. In this talk we present recent work
carried out at The MITRE Corporation on the automatic segmentation of textual data.
The approach taken is based on a fine-grained Hidden Markov Model. Because of
the very large number of parameters in the model, robust parameter estimation is
a major concern. Since the model's parameter values can be expected to vary
smoothly as a function of time, we have based the determination of conditional
probabilities on non-parametric density estimation. In this presentation,
we will discuss the design of the Hidden Markov Model; reduction of the input
stream to the feature vectors which are treated as observables in the model;
and the non-parametric estimation techniques we employed.
2. Integrating Background Information in Text Classification
Haym Hirsh, Rutgers University
One popular class of supervised learning problems concerns the task of
classifying items that are composed primarily of text. Such problems
can occur when assessing the relevance of a Web page to a given user,
assigning topic labels to technical papers, and assessing the importance
of a user's email. Most commonly this task is performed by extrapolating,
from a corpus of labeled textual training documents, procedures for
classifying further unlabeled documents. However, the widespread
availability of enormous amounts of information in online form opens
up this task to methods that can exploit additional sources of information
during classification. This talk will describe two techniques for doing so.
One of them incorporates collections of background text into the vocabulary
re-expression process performed by latent semantic indexing. We show that
these two methods for integrating background information into text
classification improve overall classification accuracy, particularly in
cases where there are only modest amounts of data, as is the case when data
must be obtained by hand-labeling.
3. PIRCS - An Effective Document Detection Tool
Kui Lam Kwok, Queens College
PIRCS is an in-house developed IR engine and stands for Probabilistic
Indexing and Retrieval - Components - System. It has been used in
various forms to participate in all nine past TREC blind retrieval
experiments, with consistent and surprisingly good results. This talk
will discuss the main ideas behind its approach and its relationship to
other models, summarize its technology, and present sample TREC
experimental results for illustration.
4. Data-Mining, MetaData and Digital Libraries
Elizabeth Liddy, Syracuse University
A Digital Library combines a collection of digital objects (the Repository)
and descriptions of these objects (Metadata) to provide search, browsing,
and retrieval (Services) to a distributed set of users (Community).
As such, Digital Libraries can be seen as an anachronistic
phenomenon that attempts to combine the most traditional aspects of
libraries with the most advanced of computer capabilities. This
potential will only be adequately fulfilled, however, if digital
libraries can replace their age-old reliance on human catalogers, who
have traditionally provided the searchable descriptions of library
materials, with an automated means for determining
and assigning substantive descriptions of digital objects as they
are added to a Digital Library. NLP-based Data-Mining software can
accomplish the automated metatagging of digital materials by interpreting
the textual documents (or their abstracts or introductions) utilizing
algorithms which extract meaning at all the levels of language at
which human catalogers interpret what a document is about and describe
it for effective access by Digital Library users. When Data-Mining
is used to produce MetaData automatically, there is really no
limitation to the number and size of Digital Libraries.
5. Monitoring and Triage of Business News
Foster Provost, New York University
We are interested in systems that monitor information feeds and perform
triage on the information in order to improve the performance of
knowledge workers. For example, financial analysts, attorneys,
business-school professors, market makers, portfolio managers,
reporters, and many others would benefit from timely attention
to certain business news stories. Bloomberg, Reuters, Bridge,
and several other companies have profited greatly from selling a variety
of instant-access business information services. In this project,
we are investigating the use of inductive algorithms to build triage
models. News-story importance can be assessed in many ways; here we
concentrate on the use of prospective, objective criteria that can
be used to build massive training sets
automatically from historical data. Examples include the volume
of follow-up stories and the stock-market reaction to a story.
In both cases, for timely attention, it would be useful to be able
to rank any given set of stories by expected importance. However,
in order for the models to be used, it is also important to be
able to understand what knowledge the models capture. Finally,
the knowledge captured may be interesting in its own right;
for example, business-school researchers are interested in
firm-specific news that affects stock price. In this talk,
I will discuss these problems and will present some (preliminary)
results.
Document last modified on May 17, 2001.