DIMACS Workshop on Data Mining and Scalable Algorithms
August 22 - 24, 2001
DIMACS Center, Rutgers University, Piscataway, NJ
- Organizers:
- Alex Smola, Australian National University, Alex.Smola@anu.edu.au
- Paul Bradley, Digimine Inc., paulb@digimine.com
- Nello Cristianini, Royal Holloway College, University of London, N.Cristianini@dcs.rhbnc.ac.uk
- Olvi Mangasarian, University of Wisconsin, olvi@cs.wisc.edu
Presented under the auspices of the Special Focus on Data Analysis and Mining.
With the availability of very large collections of data, the areas of
machine learning, statistics, optimization, and databases face the
challenge of making efficient use of this information. Data mining
targets the problem of finding useful, interesting, and understandable
structure or models derived from the data. While advanced techniques
exist for handling nonparametric estimators efficiently when only
limited data is available, algorithms for large amounts of data often
resort to a rather limited class of estimates, such as linear models
or the assumption that the data can be represented by a small number
of clusters. This restriction is imposed mainly by implementation
constraints.
Yet this situation is paradoxical, since complex models could be more
easily justified from a statistical point of view, especially when
data is abundant. This raises the question of whether statistical
methods exist that strike a better balance between complexity and
performance.
Aims and Topics
- Practical Limits of Nonparametric Methods: Runtime, storage, relation to nearest neighbor methods.
- Practical Limits of Parametric Models: Is data really nonlinear or is a simple model good enough?
- Handling Categorical Data: Kernels for categorical data, data with mixed numeric and categorical attributes.
- Novelty Detection and Discovering Patterns: Fraud detection, modeling temporal/cyclic data.
- Missing or Censored Data
- Efficiency: Integration with database systems, efficient model building, efficient model deployment, large datasets.
- Data and Feature Selection: Reduced dataset and feature methods.
- Small Training Set - Large Test Set: Can we gain anything by transduction or EM?
- Understandability and Visualization: Prediction explanation, data visualization/navigation.
- Applications: Collaborative filtering, text classification (e.g. email classification), mining of massive document repositories (with hypertextual, multilingual, multimedia features).
Document last modified on July 17, 2001.