DIMACS TR: 2012-01

A Split-and-Conquer Approach for Analysis of Extraordinary Large Data



Authors: Xueying Chen and Minge Xie

ABSTRACT

If there are extraordinarily large data, too large to fit into a single computer or too expensive to perform a computationally intensive data analysis, what should we do? To deal with this problem, we propose in this paper a split-and-conquer approach and illustrate it using a computationally intensive penalized regression method, along with a theoretical support.  Consider a regression setting of generalized linear models with n observations and p covariates, in which n is extraordinarily large and p is either bounded or goes to infinity at a certain rate of n. We propose to split the data of size n into K subsets of size O(n/K). For each subset of data, we perform a penalized regression analysis and the results from each of the K subsets are then combined to obtain an overall result. We show that the combined overall result still retains the model selection consistency and asymptotic normality under mild conditions. When K is less than O(n^{1/5}), we also show that the combined result is asymptotically equivalent to the corresponding analysis result of using the entire data all together, assuming that there were a super computer that could carry out such an analysis. In addition, when a computational intensive algorithm is used in the sense that its computing expense is at the order of O(n^a), a > 1, we show that the split-and-conquer approach can reduce computing time and computer memory requirement. Furthermore, the split-and-conquer approach involves a random splitting and a systemic combining. We establish an upper bound for the expected number of falsely selected variables and a lower bound for the expected number for truly selected variables. We also demonstrate that, from the splitting and combining, the approach has an inherent advantage of being more resistant to false model selections caused by spurious correlations. The proposed methodology is demonstrated numerically using both simulation and real data examples.

The research is supported in part by the National Science Foundation Grants DMS0915139, SES0851521 and DMS1107012 and by the Department of Homeland Security Grant 2008-DN-077-ARI012-02.

Paper Available at: http://dimacs.rutgers.edu/TechnicalReports/TechReports/2012/2012-01.pdf
DIMACS Home Page