DIMACS TR: 2008-01
Variance-based Criteria for Clustering and Their Application to the Analysis of Management Styles of Mutual Funds Based on Time Series of Daily Returns
Authors: Nikita I. Lytkin, Casimir A. Kulikowski and Ilya B. Muchnik
ABSTRACT
The problem of clustering is formulated as the problem of
minimization of a certain objective function over the set of all
possible clusterings. The objective function measures
mathematically the quality of a clustering. According to a
previously published theoretical result, if the objective
function being minimized is strictly convex, then the
corresponding clustering surface is strictly convex. As a direct
implication of this result followed the construction of a basic
gradient algorithm of search for locally optimal solutions (i.e.,
clusterings). This gradient procedure constitutes the core of the
clustering algorithms proposed in this work for minimization of
two novel objective functions.
An important direction in statistical sampling theory deals with
construction of optimal stratified samples from a population. One
of the problems addressed by stratified sampling is the
construction of a sample for estimation of the mean value of a
particular scalar parameter, such that the variance of the
estimate is minimized. For this purpose, a criterion for optimal
partitioning of the population into a certain number of groups
(strata) was derived. This criterion is known as Neyman's
criterion for optimal stratified sampling.
Neyman's criterion was originally developed for one-dimensional
data. This criterion is generalized to $n$-dimensional space, and
is used as an objective function for clustering. A proof of
strict concavity of the generalized objective function is
given. Strict concavity provided the basis for a K-means-like
gradient-based clustering algorithm proposed in this work.
The other objective function investigated in this work, also
originated in statistical sampling theory, where it was derived
for optimal selection of representative types for estimation of
the mean value of a certain scalar parameter. Selection of
representative types differs from stratified sampling in that,
under the former sampling scheme, a single representative is
sampled from each group. A proof of non-convexity of this
objective function is given. An efficient stochastic procedure
for minimization of this objective function is proposed. The
procedure uses, as an elementary operation, a modification of the
basic gradient algorithm mentioned above, and has the same
computational complexity as the well-known K-means algorithm.
The proposed clustering methods and the K-means method are
similar, both intuitively and mathematically. For this reason, a
systematic comparative analysis of these methods was
performed. Experimental results obtained on synthetic data
illustrate the differences in the forms of the discriminant
surfaces constructed by each of the three clustering
algorithms. In contrast to K-means, which produces linear
discriminant surfaces, the other two criteria produce quadratic
discriminant surfaces. These results also help in the
interpretation of differences between clusterings produced by the
three algorithms.
An application of all three methods to clustering real-world time
series data is demonstrated. In this work, time series were
treated as static points in $n$-dimensional space. The dataset
consisted of time series of daily returns of 6671 mutual funds
from May 2005 until May 2006. The results obtained closely
corresponded to the outcomes of an informal financial analysis of
hidden information on the funds' management styles (dynamics of
the funds' portfolios). By applying three different clustering
methods, three different results were obtained. The consistent
part of these clusterings was interpreted as the most robust and
objectively consistent component in the existing classification
of mutual funds by their management styles.
Preliminary experimental results obtained in this work suggest
that in practice it is useful to apply all three clustering
methods together in order to aid in the discovery of consistent
cores within the data.
Further plans for applications of clustering methods on time
series data in the domain of cyber security are outlined.
Paper Available at:
ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/2008/2008-01.pdf
DIMACS Home Page