DIMACS TR: 2008-01

Variance-based Criteria for Clustering and Their Application to the Analysis of Management Styles of Mutual Funds Based on Time Series of Daily Returns

Authors: Nikita I. Lytkin, Casimir A. Kulikowski and Ilya B. Muchnik


The problem of clustering is formulated as the problem of minimization of a certain objective function over the set of all possible clusterings. The objective function measures mathematically the quality of a clustering. According to a previously published theoretical result, if the objective function being minimized is strictly convex, then the corresponding clustering surface is strictly convex. As a direct implication of this result followed the construction of a basic gradient algorithm of search for locally optimal solutions (i.e., clusterings). This gradient procedure constitutes the core of the clustering algorithms proposed in this work for minimization of two novel objective functions. An important direction in statistical sampling theory deals with construction of optimal stratified samples from a population. One of the problems addressed by stratified sampling is the construction of a sample for estimation of the mean value of a particular scalar parameter, such that the variance of the estimate is minimized. For this purpose, a criterion for optimal partitioning of the population into a certain number of groups (strata) was derived. This criterion is known as Neyman's criterion for optimal stratified sampling. Neyman's criterion was originally developed for one-dimensional data. This criterion is generalized to $n$-dimensional space, and is used as an objective function for clustering. A proof of strict concavity of the generalized objective function is given. Strict concavity provided the basis for a K-means-like gradient-based clustering algorithm proposed in this work. The other objective function investigated in this work, also originated in statistical sampling theory, where it was derived for optimal selection of representative types for estimation of the mean value of a certain scalar parameter. Selection of representative types differs from stratified sampling in that, under the former sampling scheme, a single representative is sampled from each group. A proof of non-convexity of this objective function is given. An efficient stochastic procedure for minimization of this objective function is proposed. The procedure uses, as an elementary operation, a modification of the basic gradient algorithm mentioned above, and has the same computational complexity as the well-known K-means algorithm. The proposed clustering methods and the K-means method are similar, both intuitively and mathematically. For this reason, a systematic comparative analysis of these methods was performed. Experimental results obtained on synthetic data illustrate the differences in the forms of the discriminant surfaces constructed by each of the three clustering algorithms. In contrast to K-means, which produces linear discriminant surfaces, the other two criteria produce quadratic discriminant surfaces. These results also help in the interpretation of differences between clusterings produced by the three algorithms. An application of all three methods to clustering real-world time series data is demonstrated. In this work, time series were treated as static points in $n$-dimensional space. The dataset consisted of time series of daily returns of 6671 mutual funds from May 2005 until May 2006. The results obtained closely corresponded to the outcomes of an informal financial analysis of hidden information on the funds' management styles (dynamics of the funds' portfolios). By applying three different clustering methods, three different results were obtained. The consistent part of these clusterings was interpreted as the most robust and objectively consistent component in the existing classification of mutual funds by their management styles. Preliminary experimental results obtained in this work suggest that in practice it is useful to apply all three clustering methods together in order to aid in the discovery of consistent cores within the data. Further plans for applications of clustering methods on time series data in the domain of cyber security are outlined.

Paper Available at: ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2008/2008-01.pdf
DIMACS Home Page