DIMACS Working Group on Data Analysis

May 17, 2001
DIMACS Center, Rutgers University, Piscataway, NJ

Paul Kantor, Rutgers University, kantorp@cs.rutgers.edu
David Hull, WhizBang, dhull@whizbang.com
Michael Berry, University of Tennessee, berry@cs.utk.edu
Endre Boros, Rutgers University, boros@rutcor.rutgers.edu
Warren Greiff, Mitre Corporation, greiff@mitre.org
Liz Liddy, Syracuse University, liddy@mailbox.syr.edu
Presented under the auspices of the Special Focus on Next Generation Networks Technologies and Applications and the Special Focus on Data Analysis and Mining



1. Statistical Modeling for the Segmentation of Archived News Broadcasts

Warren Greiff, MITRE

Not all information streams come with individual items delimited.  For example,
broadcast news captured from the airwaves may not be accompanied by indications
of where one story ends and another begins.  In this talk we present recent work
carried out at The MITRE Corporation on the automatic segmentation of textual data.
The approach taken is based on a fine-grained Hidden Markov Model.  Because of 
the very large number of parameters in the model, robust parameter estimation is
a major concern.  Since the model's parameter values can be expected to vary
smoothly as a function of time, we have based the determination of conditional
probabilities on non-parametric density estimation.  In this presentation,
we will discuss the design of the Hidden Markov Model; reduction of the input
stream to the feature vectors which are treated as observables in the model;
and the non-parametric estimation techniques we employed.
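
As a rough illustration of HMM-based segmentation, the sketch below decodes
a toy two-state model ("story" and "boundary") with the Viterbi algorithm.
It is not the fine-grained model described in the talk: the states, the
two-symbol feature alphabet ("plain", "cue"), and all probabilities are
hypothetical values chosen for the example, and the non-parametric
estimation of the parameters is not shown.

```python
import math


def viterbi_segment(observations, start_p, trans_p, emit_p):
    """Return the most likely state sequence of a discrete HMM.

    Works in log space; assumes every probability used is strictly positive.
    """
    states = list(start_p)
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def story_boundaries(path, boundary_state="boundary"):
    """Positions in the stream that were decoded as story boundaries."""
    return [i for i, s in enumerate(path) if s == boundary_state]


# Hypothetical parameters: a "cue" stands for any feature that tends to
# occur at story changes (for example a long pause or a hand-off phrase).
start_p = {"story": 0.9, "boundary": 0.1}
trans_p = {
    "story": {"story": 0.9, "boundary": 0.1},
    "boundary": {"story": 0.95, "boundary": 0.05},
}
emit_p = {
    "story": {"plain": 0.95, "cue": 0.05},
    "boundary": {"plain": 0.1, "cue": 0.9},
}

observations = ["plain", "plain", "cue", "plain",
                "plain", "plain", "cue", "plain"]
path = viterbi_segment(observations, start_p, trans_p, emit_p)
print(story_boundaries(path))
```

In a model with many more states and features, the fixed tables above would
be replaced by estimated conditional probabilities.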

2. Integrating Background Information in Text Classification

Haym Hirsh, Rutgers University

One popular class of supervised learning problems concerns the task of
classifying items that consist primarily of text.  Such problems arise when
assessing the relevance of a Web page to a given user, assigning topic labels
to technical papers, and assessing the importance of a user's email.  Most
commonly this task is performed by extrapolating, from a corpus of labeled
textual training documents, procedures for classifying further unlabeled
documents.  However, the widespread availability of enormous amounts of
information in online form opens up this task to methods that can exploit
additional sources of information during classification.  This talk will
describe two techniques for doing so.  One of them integrates collections of
background text into the vocabulary re-expression process performed by latent
semantic indexing.  We show that these two methods for integrating background
information into text classification improve overall classification accuracy,
particularly in cases where there are only modest amounts of data, as is the
case when data must be obtained by hand-labeling.

3. PIRCS - An Effective Document Detection Tool

Kui Lam Kwok, Queens College

PIRCS, which stands for Probabilistic Indexing and Retrieval - Components -
System, is an IR engine developed in-house.  It has been used in various forms
to participate in all nine past TREC blind retrieval experiments, with
consistent and surprisingly good results.  This talk will discuss the main
ideas behind its approach and its relationship to other models, summarize its
technology, and illustrate with samples of TREC experimental results.

4. Data-Mining, MetaData and Digital Libraries

Elizabeth Liddy, Syracuse University

A Digital Library is a collection of digital objects (the Repository) and
descriptions of these objects (Metadata), together with search, browsing, and
retrieval facilities (Services) provided to a distributed set of users
(Community).  As such, Digital Libraries are an anachronistic phenomenon that
attempts to combine the most traditional aspects of libraries with the most
advanced of computer capabilities.  This potential will only be adequately
fulfilled, however, if digital libraries can replace their age-old reliance on
human catalogers, who have traditionally provided the searchable descriptions
of library materials, with an automated means for determining and assigning
substantive descriptions of digital objects as they are added to a Digital
Library.  NLP-based Data-Mining software can accomplish the automated
metatagging of digital materials by interpreting the textual documents (or
their abstracts or introductions), using algorithms that extract meaning at
all the levels of language at which human catalogers interpret what a document
is about and describe it for effective access by Digital Library users.  With
Data-Mining used to produce MetaData automatically, there is really no
limitation on the number and size of Digital Libraries.

5. Monitoring and Triage of Business News

Foster Provost, New York University

We are interested in systems that monitor information feeds and perform triage
on the information in order to improve the performance of knowledge workers.
For example, financial analysts, attorneys, business-school professors, market
makers, portfolio managers, reporters, and many others would benefit from
timely attention to certain business news stories.  Bloomberg, Reuters, Bridge,
and several other companies have profited greatly from selling a variety of
instant-access business information services.  In this project, we are
investigating the use of inductive algorithms to build triage models.
News-story importance can be assessed in many ways; here we concentrate on the
use of prospective, objective criteria that can be used to build massive
training sets automatically from historical data.  Examples include the volume
of follow-up stories and the stock-market reaction to a story.  In both cases,
for timely attention, it would be useful to be able to rank any given set of
stories by expected importance.  However, in order for the models to be used,
it is also important to be able to understand what knowledge the models
capture.  Finally, the knowledge captured may be interesting in its own right;
for example, business-school researchers are interested in firm-specific news
that affects stock price.  In this talk, I will discuss these problems and
will present some (preliminary) results.
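
To make the idea of building training labels automatically from historical
data concrete, the sketch below labels each archived story with its volume of
follow-up stories.  It is only an illustration under assumptions of our own:
the (timestamp, company) record layout, the three-day window, the company
names, and the function name follow_up_counts are hypothetical and are not
taken from the project described in the talk.

```python
from datetime import datetime, timedelta


def follow_up_counts(stories, window_days=3):
    """Label each story with the number of later stories about the same
    company that appear within `window_days` of it.

    `stories` is a list of (timestamp, company) pairs; this record layout
    is an assumption made for the example.
    """
    window = timedelta(days=window_days)
    counts = []
    for i, (when, company) in enumerate(stories):
        n = sum(
            1
            for j, (other_when, other_company) in enumerate(stories)
            if j != i
            and other_company == company
            and when < other_when <= when + window
        )
        counts.append(n)
    return counts


# Made-up archive: three ACME stories close together, one unrelated story,
# and one ACME story that falls outside the follow-up window.
stories = [
    (datetime(2001, 5, 1, 9), "ACME"),
    (datetime(2001, 5, 1, 15), "ACME"),
    (datetime(2001, 5, 2, 10), "ACME"),
    (datetime(2001, 5, 1, 12), "GLOBEX"),
    (datetime(2001, 5, 9, 9), "ACME"),
]
labels = follow_up_counts(stories)
print(labels)
```

Labels of this kind are prospective (they depend only on what happened after
a story appeared) and need no human annotation, so they could serve as
training targets for a model that ranks incoming stories by expected
importance.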

Document last modified on May 17, 2001.