DIMACS :: Details

Sensitivity Sampling for Coreset-Based Data Selection

March 05, 2025, 11:00 AM - 12:00 PM

Location:

Conference Room 301

Rutgers University

CoRE Building

96 Frelinghuysen Road

Piscataway, NJ 08854

Vincent Cohen-Addad, Google

Abstract

The scale of modern machine learning models and data has made data selection a central problem. In this talk, we focus on the problem of finding the best representative subset of a dataset to train a machine learning model. We provide a new data selection approach based on 𝑘-means clustering and sensitivity sampling.

Assuming access to an embedding representation of the data, and that the model loss is Hölder continuous with respect to these embeddings, we prove that our new approach selects a set of 1/𝜖² "typical" elements whose average loss matches the average loss of the whole dataset, up to a multiplicative (1±𝜖) factor and an additive 𝜖𝜆Φ_𝑘 term, where Φ_𝑘 is the 𝑘-means cost of the input data and 𝜆 is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods.
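To make the idea of combining 𝑘-means clustering with sensitivity sampling concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the speaker's exact method: it uses the classical sensitivity upper bound for 𝑘-means (a point's share of the clustering cost plus the inverse of its cluster size), a plain Lloyd's iteration, and the hypothetical function names `kmeans` and `sensitivity_sample`. The scores used in the papers behind the talk may differ, for example by also involving the model loss.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm (illustrative only); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment against the final centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

def sensitivity_sample(X, k, m, seed=0):
    """Sample m indices with probability proportional to an assumed
    sensitivity score, and return importance weights 1 / (m * p_i)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers, labels = kmeans(X, k, seed=seed)

    # Per-point k-means cost and total cost (Phi_k in the abstract).
    dist2 = ((X - centers[labels]) ** 2).sum(axis=1)
    phi_k = dist2.sum()

    # Assumed score: cost share + 1 / (size of the point's cluster).
    cluster_sizes = np.bincount(labels, minlength=k)
    scores = 1.0 / cluster_sizes[labels]
    if phi_k > 0:
        scores = scores + dist2 / phi_k

    probs = scores / scores.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```

With per-example losses stored in an array `loss`, the quantity `(weights * loss[idx]).sum() / len(X)` is an unbiased estimate of the average loss over the whole dataset. Points far from their cluster center, and points in small clusters, are sampled more often and down-weighted accordingly.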

We also show that our sampling strategy can be used to define new sampling scores for regression, leading to a new active learning strategy that is simpler and faster than previous ones such as leverage score sampling.

Based on several papers that appeared at FOCS'24 and ICML'24. Joint work with Kyriakos Axiotis, Nikhil Bansal, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, Milind Prabhu, David Saulpic, Chris Schwiegelshohn, David Woodruff, and Michael Wunder.

See: https://theory.cs.rutgers.edu/theory_seminar