DIMACS TR: 2012-01
A Split-and-Conquer Approach for Analysis of Extraordinary Large
Data
Authors: Xueying Chen and Minge Xie
ABSTRACT
If there are extraordinarily large data, too large to fit into a single
computer or too expensive to perform a computationally intensive data
analysis, what should we do? To deal with this problem, we propose in
this paper a split-and-conquer approach and illustrate it using a
computationally intensive penalized regression method, along with a
theoretical support. Consider a regression setting of generalized
linear models with n observations and p covariates, in which n is
extraordinarily large and p is either bounded or goes to infinity at a
certain rate of n. We propose to split the data of size n into K
subsets of size O(n/K). For each subset of data, we perform a penalized
regression analysis and the results from each of the K subsets are then
combined to obtain an overall result. We show that the combined overall
result still retains the model selection consistency and asymptotic
normality under mild conditions. When K is less than O(n^{1/5}), we
also show that the combined result is asymptotically equivalent to the
corresponding analysis result of using the entire data all together,
assuming that there were a super computer that could carry out such an
analysis. In addition, when a computational intensive algorithm is used
in the sense that its computing expense is at the order of O(n^a), a
> 1, we show that the split-and-conquer approach can reduce
computing time and computer memory requirement. Furthermore, the
split-and-conquer approach involves a random splitting and a systemic
combining. We establish an upper bound for the expected number of
falsely selected variables and a lower bound for the expected number
for truly selected variables. We also demonstrate that, from the
splitting and combining, the approach has an inherent advantage of
being more resistant to false model selections caused by spurious
correlations. The proposed methodology is demonstrated numerically
using both simulation and real data examples.
The research is supported in part by the National Science
Foundation Grants DMS0915139, SES0851521 and DMS1107012 and by the
Department of Homeland Security Grant 2008-DN-077-ARI012-02.
Paper Available at:
http://dimacs.rutgers.edu/TechnicalReports/TechReports/2012/2012-01.pdf
DIMACS Home Page