The problem of Set Cover-to find the smallest subcollection of sets that covers some universe-is at the heart of many data and analysis tasks. It arises in a wide range of settings, including operations research, machine learning, planning, data quality and data mining. Although finding an optimal solution is NP-hard, the greedy algorithm is widely used, and typically finds solutions that are close to optimal. However, a direct implementation of the greedy approach, which picks the set with the largest number of uncovered items at each step, does not behave well when the input is very large and disk resident. The greedy algorithm must make many random accesses to disk, which are unpredictable and costly in comparison to linear scans. In order to scale Set Cover to large datasets, we provide a new algorithm which finds a solution that is provably close to that of greedy, but which is much more efficient to implement using modern disk technology. Our experiments show a ten-fold improvement in speed on moderately-sized datasets, and an even greater improvement on larger datasets.
[ bib | slides | .pdf ] Back
This file was generated by bibtex2html 1.92.