DIMACS Center, CoRE Building, Rutgers University

**Organizers:****Jie Wang**, University of Massachusetts, wang@cs.uml.edu**Weili Wu**, University of Texas, weiliwu@utdallas.edu

Title: Learning Embeddings for Similarity-Based Retrieval

Similarity-based retrieval is the task of identifying database patterns that are the most similar to a query pattern. Retrieving similar patterns is a necessary component of many practical applications, in fields as diverse as computer vision, speech recognition, and bioinformatics. This talk presents BoostMap, a method for efficient similarity-based retrieval in spaces with computationally expensive distance measures. Our method constructs embeddings that map database and query patterns into a vector space with a computationally efficient distance measure. Using such a mapping, similar patterns can be retrieved efficiently - often orders of magnitude faster compared to retrieval using the original distance measure. In the BoostMap method, embedding construction is treated as a machine learning problem, and embedding quality is optimized using information from training data. A key property of the learning-based formulation is that the optimization criterion does not depend on geometric properties and is equally valid in both metric and non-metric spaces. In experiments with several datasets, our method compares favorably to alternative methods for efficient retrieval, and provides highly competitive results for applications such as handwritten character recognition and time series indexing.

Title: The Inferelator: A Method for Learning Dynamical Regulatory Networks from Data

Our system for network inference and modeling consists of two major components: cMonkey (a method for learning co-regulated biclusters and pathways), and the Inferelator (regulatory network inference). These components have been described individually, we have also show their integration into a functioning integrated system by applying them to Halobacterium and several other model organisms. This effort represents one of the first coordinated functional genomics effort in archaea carried out in collaberation with the Baliga lab. cMonkey: We have developed an integrative biclustering algorithm, cMonkey, which groups genes and conditions into biclusters on the basis of 1) coherence in expression data across subsets of experimental conditions, 2) co-occurrence of putative cis-acting regulatory motifs in the regulatory regions of bicluster members and 3) the presence of highly connected sub-graphs in metabolic and functional association networks. We describe the algorithm and the results of extensive tests of several previously described methods, showing that cMonkey has several advantages in the context of regulatory network inference. The Inferelator: We have described a network inference algorithm, the Inferelator, which infers regulatory influences for genes and/or gene clusters from mRNA and/or protein expression levels. The procedure can simultaneously model equilibrium and time-course expression levels, such that both kinetic and equilibrium expression levels may be predicted by the resulting models. Through the explicit inclusion of time, and gene-knockout information, the method is capable of learning causal relationships. It also includes a novel solution to the problem of encoding interactions between predictors. We discuss the results from an initial application of this method to the halophilic archaeon, Halobacterium NRC-1. We have found the network to be predictive of 130 newly collected microarray datasets and have also validated parts of the network using ChIP-chip. This network offers a means of deciphering how this organism maintains homeostasis and responds to wide varieties of metabolic, genetic and environmental states. For Background on our integrated system see: http://genomebiology.com/2006/7/5/R36 http://www.biomedcentral.com/1471-2105/7/280 http://www.biomedcentral.com/1471-2105/7/176 http://www.nature.com/msb/journal/v3/n1/full/msb4100118.html.

Title: Optimization in Spatio-Temporal Classification for Early Disease Diagnosis

In this study, novel optimization approaches for multidimensional time series classification are proposed. The first approach, namely Support Feature Machine (SFM), is inspired by the optimization model of support vector machine and the nearest neighbor rule to incorporate both spatial and temporal of the multi-dimensional time series data. The second approach, namely connectivity support vector machine (C-SVM), is based on the integration of statistical graph theoretic modeling and support vector machine. This paper also describes an application of these approaches for early detection of disease diagnosis. Epilepsy, the second most common brain disorder, is a case point in this talk. Typically neurologists study the brain function through neurophysiological signals like Electroencephalograms (EEGs), in which spatial and temporal properties are encrypted. Analyzing these massive data from the brain to identify the abnormalities or anomalies is extremely challenging. I will discuss the challenges in developing a quick and accurate epilepsy screening framework for initial brain diagnosis as well as a prediction/warning framework used to identify the brain's susceptibility (pre-seizure) periods (e.g., normal vs pre-seizure).

Title: Spatio-Temporal Database Application in Mobile Telemedicine

A spatio-temporal database captures simultaneously spatial and temporal aspects of data. Spatio-temporal query languages are the query languages used to query spatio-temporal databases. Spatio-temporal databases supports spatial and temporal data types in addition to traditional data types. Temporal data types include time instants and time intervals. Spatial data types include points, line segments and polygons. Other spatial objects can be approximated by polygons. Mobile telemedicine uses wireless sensor network technology to deliver clinical care to a patient at a distance. Sensors used in mobile telemedicine will collect large volumes of temporal-locality-aware data. How to handle these data becomes an important issue. Spatio-temporal database will help process these data efficiently and store these data effectively. In this paper, we will present an overview of spatio-temporal database, and how to apply spatio-temporal query techniques to mobile telemedicine.

Title: A Novel Method for Signal Transduction Network Inference

We introduce a new method of combined synthesis and inference of biological signal transduction networks. A main idea of our method lies in representing observed causal relationships as network paths and using techniques from combinatorial optimization to find the sparsest graph consistent with all experimental observations. Our contributions are twofold: (a) we formalize our approach, study its computational complexity and prove new results for exact and approximate solutions of the computationally hard transitive reduction substep of the approach. (b) we validate the biological usability of our approach by successfully applying it to a previously published signal transduction network by Li et al. and show that our algorithm for the transitive reduction substep performs well on graphs with a structure similar to those observed in transcriptional regulatory and signal transduction networks. (Joint work with R. Albert, R. Dondi, S. Kachalo, E. Sontag, A. Zelikovsky and K. Westbrooks)

Title: A clustering approach to analyze large high-dimensional cancer datasets

The TNM (Tumor, Lymph Node, Metastasis) is the most widely used staging system for patients with cancer. However, the TNM is not accurate in prediction, partially due to the fact that only three variables are used in the system. Based on the availability of large cancer patient datasets, there is need for the expanded system that can incorporate additional prognostic factors. In this talk, we show how the idea of clustering can be applied to large high-dimensional cancer datasets in order to incorporate new prognostic factors while preserving the TNM staging system. We first quantify the contribution to survival of potential prognostic factors individually and in various combinations, and then evaluate the association among factors through statistical tests. A demonstration is given for breast cancer.

Title: Protein Interaction Module Detection Using A New Clique/Biclique Finding Algorithms

Proteins carry out most cellular processes as protein functional modules. Systematic identification of protein functional modules provide essential knowledge linking proteome dynamics to cellular functions. We detect protein functional modules as cliques in the protein interaction network. We propose new matrix-based algorithms to compute maximal cliques and bicliques. These algorithms are extremely simple to implement and very effective in practice. Using constrained optimization theory, we rigorously prove the correctness and convergence of these algorithms. Applying these algorithms to Yeast, Pyrococcus, Sulfolobus, Halobacterium interaction networks, we obtain a large number of protein interaction modules, many of them conserved across several species, and some of them have been experimentally verified by our collaborators. We discuss their biological significance. Many uncharacterized proteins are found to be new members of important protein complexes.

Brief Bio: Georges Grinstein is Professor of Computer Science at the University of Massachusetts Lowell, Head of its Bioinformatics and Cheminformatics Program, Co-director of its Institute for Visualization and Perception Research, and of its Center for Biomolecular and Medical Informatics. His research interests are broad and include computer graphics, visualization, data mining, virtual environments, and user interfaces with the emphasis on the modeling, visualization, and analysis of complex information systems, most often biomedical in nature. He received his Ph.D. in Mathematics from the University of Rochester. He has over 30 years in academia with extensive private consulting, over 100 research grants, products in use nationally and internationally, several patents, numerous publications in journals and conferences; founded several companies, and has been the organizer or chair of national and international conferences and workshops in Computer Graphics, in Visualization, and in Data Mining. He is on the editorial boards of several journals in Computer Graphics and Data Mining, has been a member of ANSI and ISO, a NATO Expert, and a technology consultant for various government agencies.

Title: Investigating Support Vector Machines with Unsupervised Learning

Support vector machines (SVM) are the machine learning models of choice for many classification problems in the biological sciences due to their superior performance. However, support vector machine models are not transparent and therefore difficult to interpret beyond raw performance metrics. On the other hand, the models themselves consist of points in high-dimensional spaces that define a decision surface and visualizing these points together with the input data provides substantial insight into the model. In this talk we show that an approach to unsupervised learning using emergent self-organizing maps (ESOM) can be successfully applied to the visualization of these models.

Title: Singular perturbations in Mendelian Systems

We consider as a so-called "Mendelian System" three flux-functions generating population-strengths in the inbreeding of three separate population-branches. Of these are two homogeneous and one heterogeneous with respect to a marking at a given genetic locus. The system is the singularly perturbed solution of non-linear autonomous differential equations from which both the Weinberg-Hardy relations are asymptotically derived as well as the long-time duration of inbreeding on an absolute time-scale.

In the case of the Rhesus blood-groups these calculations might shed light on the age of Modern Man, given recent medical data from "erythroblastosis fetalis". Seen from control-theoretical point of view the analysis of this solution shows the significance of concepts like "biological time" and "entropy". This phenomenological approach may find other uses in mathematical biomedicine.

Title: The Freezing-out Clustering Method for Identifying Biological Functions of Unknown Transcripts in the Yeast Transcriptome

We have developed a new method for clustering time-dependent genome-wide RNA data that is based on an analogy between liquid-solid (disorder-order) phase transitions occurring at critical temperatures in material media and the tendency of functional clusters of RNAs to increase their density in a topological space here called "principal grid" as the number, k, of clusters is increased. For the details of the clustering algorithm used, see A. Zinovyev at http://bioinfo-out.curie.fr/projects/vidaexpert/. So k is analogous to 1/T, and the order parameter in our case is what is here called the transcript density (T) defined as: T = k(% of the total transcripts with a given function)/(the number of clusters over which the transcripts are distributed).

Title: An Efficient Clustering Method for High Dimensional Data

Clustering algorithms are useful in data analysis and information retriever. Clustering finds clusters of points in the given data set. Due to the "curse of dimension", finding clusters in large amount of high dimensional data has always been a challenge. In this paper, we propose a clustering method that constructs a consensus clustering from multiple projected clusters for high dimensional data. The method compares a set of projected partitions generated by any low dimensional clustering method, determines the similarities between each pair of individual partitions, and returns an approximate consensus of these partitions, which renders the final clustering of the original data set. We provide a running time and space requirement analysis of our method and also evaluate our method's scalability in terms of number of dimensions and number of objects, sensitivity to order of input, and clustering quality through several experiments.

Title: Large Scale Biological Network Analysis using Evolutionary Computing and Neural Network

Recent advances in high throughput experiments and annotation via published literatures have provided interaction maps of several biomolecular networks, such as metabolic, protein-protein, and protein-DNA interaction networks. Molecules are the vertices in the network, and the interactions between them form the edges of the network. The architecture of molecular networks can reveal important principles of cellular organization and molecular functions. Analyzing such networks, i.e. discovering the dense regions in the network, is an important way to identify protein complexes and functional modules. This task can be formulated as a general problem of finding the heavy subgraphs. However, this is an NP-hard problem. Researchers have proposed a series of approximation algorithms and exact algorithms based on the conventional problem formulation, the Heaviest k-Subgraph Problem (k-HSP). However, any methods based on the k-HSP requires the input parameter k. Moreover, there is a fact that even an exact solution of k-HSP may be a "spurious" heavy subgraph. Therefore, the methods based on the formulation k-HSP is not practical in analyzing the large-scale biological networks. In this paper, we proposed a new formulation rank-HSP and two dynamical systems derived from the evolutionary computing and neural networks to approximate the rank-HSP. For attacking the problem of the "spurious" heavy subgraph and determining a reasonable heavy subgraph, a novel metric, SM-Ratio, considering the homogeneous edge weight distribution of the subgraph was proposed. This metric can also automate the discovery of the heavy subgraph by setting a fixed threshold. A novel algorithm, FindHSs, for finding the heavy subgraphs in the large-scale networks was proposed based on the rank-HSP, two dynamical systems and the metric SM-Ratio. Empirical results on both the simulation graphs and biological networks have demonstrated its efficiency and effectiveness

Title: Difference Networks: Inferring Subgroup Differences between Variable Interactions

Genes regulate the function of the cells in our body by interacting with each other through proteins, forming genetic pathways. High-throughput gene-expression micro-arrays allow the simultaneous measurement of the activity or inactivity of thousands of genes by measuring their expression, and they hold great promise for elucidating genetic pathways [1]. While Bayesian network inference is a commonly used method for inferring interactions between genes from gene-expression micro-array data, they do not directly infer differences in gene interactions between subsets of micro-array datasets (e.g., gene A's expression is positively correlated with gene B's expression in normal tissue, but the expression levels of the two genes are negatively correlated in cancer tissue). We present a novel computational method derived from Bayesian network inference for inferring differences in conditional relationships between subpopulations and its application to breast cancer data, resulting in new biological hypotheses about differences between two types of breast cancer.

Title: Data Mining Approach to Study of Protein Disorder

Data mining challenges considered in this presentation include: (i) mining of heterogeneous distributions; (ii) learning from biased labeled samples; and (iii) enhancing predictive modeling by exploiting unlabeled data. The aim of the talk is to describe our fairly general methodology for addressing these challenges and to demonstrate its applicability in bioinformatics. The applications will be illustrated on the problem of determining the commonness, type and functions carried by intrinsic disordered proteins that fail to fold to a fixed 3D structure on their own. The results will be reported providing strong evidence that: (1) disorder is a very common element of protein structure; (2) eucaryotes may have a higher fraction of intrinsic protein disorder than eubacteria or archaebacteria; (3) at least three different types of protein disorder exist in nature; (4) functions of disordered proteins can be studied more efficiently by reranking PubMed retrieved articles based on their relevance to this topic that is difficult to express as a query; and (5) a cost-effective enlargement of a disordered proteins database is possible by using unlabeled data to improve classification and outliers detection as compared to learning from small and possibly biased labeled samples alone. The reported data mining methods and bioinformatics results were obtained through a collaboration with C.J. Brown, A.K. Dunker, B. Han, K. Peng and S. Vucetic funded by NSF-CSE-II-9711532, NSF-CSE-IIS-0196237, NIH-R01-LM06916 and NIH-R01-LM007688-01A1 research grants.

Brief Bio:

Zoran Obradovic is Director of the Center for Information Science and Technology and a Professor of Computer and Information Sciences at Temple University in Philadelphia. His research interests focus on developing data mining and statistical learning methods for knowledge discovery at large databases. He contributed to about 190 refereed articles including bioinformatics, medical informatics, geophysical and social science applications. He is team leader for the best rated predictor of intrinsically disordered protein regions at three consecutive worldwide biannual competitions organized by protein structure prediction assessment community (CASP 5-7). His research is supported by grants from National Institute of Health, National Institute of Justice and National Science Foundation.

Title: What is special about mining spatial datasets?

The importance of spatial data mining is growing with the increasing incidence and importance of large geo-spatial datasets such as maps, repositories of remote-sensing images, and the decennial census. Applications include M(obile)-commerce industry (location-based services), NASA (studying the climatological effects of El Nino, land-use classification and global change using satellite imagery), National Institute of Health (predicting the spread of disease), National Geo-spatial Intelligence Agency (creating high resolution three-dimensional maps from satellite imagery), National Institute of Justice (finding crime hot spots), and transportation agencies (detecting local instability in traffic).

Classical data mining techniques often perform poorly when applied to spatial data sets because of the following reasons. First, spatial data is embedded in a continuous space, whereas classical datasets are often discrete. Second, spatial patterns are often local where as classical data mining techniques often focus on global patterns. Finally, one of the common assumptions in classical statistical analysis is that data samples are independently generated. When it comes to the analysis of spatial data, however, the assumption about the independence of samples is generally false because spatial data tends to be highly self correlated. For example, people with similar characteristics, occupation and background tend to cluster together in the same neighborhoods. In spatial statistics this tendency is called spatial autocorrelation. Ignoring spatial autocorrelation when analyzing data with spatial characteristics may produce hypotheses or models that are inaccurate or inconsistent with the data set.

Thus new methods are needed to analyze spatial data to detect spatial patterns. This talk surveys some of the new methods including those for discovering spatial co-locations, detecting spatial outliers and location prediction.

Note: Some of the results discussed in this talk appeared in the following publications:

- S. Shekhar and S. Chawla, Spatial Databases: A Tour (Chapter 7), Prentice Hall 2003, ISBN 0-13-017480-7.
- S. Shekhar, P. Zhang, V. R. Raju and Y. Huang, Trends in Spatial Data Mining ( pdf ) , Proc. NSF Workshop on Future Directions in Data Mining (Ed. H. Kargupta et al.), MIT Press, 2004 (isbn 0-262-61203-8).

Slides:

- Spatial data mining : pdf (3/2005)
- Spatial and spatio-temporal data mining : pdf (2/2007)

Brief Bio: Shashi Shekhar is a McKnight Distinguished University Professor and the Director of the Army High Performance Computing Research Center at the University of Minnesota. He was elected an IEEE fellow and received the IEEE Technical Achievement Award for contributions to spatial database storage methods, data mining, and geographic information systems (GIS). He co-authored a textbook on Spatial Databases and served as a member of the Mapping Science Committee of the National Research Council (National Academy of Sciences). He is serving as a co-Editor-in-Chief of Geo-Informatica: An International Journal on Advances in Computer Sc. for GIS; and a member of the steering committee of the ACM Intl. Conference on GIS.

Title: Surface-based Morphometry using SPHARM for Computational Neuroanatomy

Statistical morphometric analysis is used in biomedical imaging to study various structures of interest, and aims to identify morphometric abnormalities associated with a particular condition in order to aid diagnosis and treatment. We present computational techniques for surface-based morphometry, where two key issues are (1) how to model 3D surfaces and (2) how to perform statistical analysis with these surface models. We employ the spherical harmonic (SPHARM) method for surface modeling, and then perform high-dimensional pattern classification and statistical inference based on random field theory to localize regionally specific shape changes between groups of 3D objects. Although SPHARM has been shown to be promising, algorithmic problems related to its robustness and scalability need to be solved before it can be of broad use. To address these issues, we present new methods usable in three crucial steps for modeling SPHARM surfaces: (1) spherical parameterization, (2) 3D shape registration, and (3) spherical harmonic expansion. We demonstrate these techniques by applying them to several studies in computational neuroscience and imaging genetics.

Title: Efficient Non-unique Probes Selection in Microarray Design

A probe is a short oligonucleotide, used for identifying the presence of targets in a biological sample through hybridizations of DNA microarrays. When each probe hybridizes to a unique target, the identification is straightforward. However, finding unique probes is often not so easy, especially when targets have a high degree of similarity. In this talk, we study how to select a minimum set of non-unique probes to identify the presence of at most d targets in a sample where each non-unique probe can hybridize to a set of targets. We present an Integer Linear Programming method to select a minimum number of non-unique probes by constructing a d-disjunct matrix. We also analyze the time complexity of how to decode the test results obtained from the microarrays experiments.

Title: Protein Functional Module Extraction in Protein Complexes

Proteins usually do not act isolated in a cell but function within complicated cellular pathways, interacting with other proteins either in pairs or as components of larger complexes. While many protein complexes have been identified by large-scale experimental studies, due to a large number of false-positive interactions existing in current protein complexes, it is still difficult to obtain an accurate understanding of functional modules, which encompass groups of proteins involved in common elementary biological function. In this talk, we introduce a hyperclique pattern discovery approach for extracting functional modules (hyperclique patterns) from protein complexes. A hyperclique pattern is a type of association pattern containing proteins that are highly affiliated with each other. The analysis of hyperclique patterns shows that proteins within the same pattern tend to present in the protein complex together. Also, statistically significant annotations of proteins in a pattern using the Gene Ontology suggest that proteins within the same hyperclique pattern more likely perform the same function and participate in the same biological process. More interestingly, the 3-D structural view of proteins within a hyperclique pattern reveals that these proteins physically interact with each other. Finally, we show that several hyperclique patterns corresponding to different functions can participate in the same protein complex as independent modules.

Title: Transport of Bio-Polymer in Microarrays

Bio-polymers, such as DNA or actin filaments, are often modeled as a slender elastic fiber immersed in a Stokesian fluid. In this work we report that a slender elastic fiber moving in a Stokesian fluid can be susceptible to a buckling instability -- termed the ``stretch-coil'' instability -- when moving in the neighborhood of a hyperbolic stagnation point of the flow. When the stagnation point is embedded in an extended cellular flow, it is found that immersed fibers can move as random walkers across time-independent closed-stremaline flow. It is also found that the flow is segregated into transport regions around hyperbolic stagnation points and their manifolds, and closed entrapment regions around elliptic points. These results can be applied to designs of microfluidic arrays for transporting the bio-polymers across space. We will also address the effects of finite extensibility on the transport properties.

Title: Discrete Optimization Approach to Disease Association Search and Susceptibility Prediction Problems

Accessibility of high-throughput genotyping technology makes possible genome-wide association studies for common complex diseases. When dealing with common diseases, it is critical to search and analyze multiple causes each representing interactions of multiple genes.

This talk is devoted to analysis of genotype case/control data with discrete combinatorial methods. We will first give optimization formulations for searching disease-associated and disease-resistant multi-SNP combinations (MSCs) as well as predicting disease susceptibility for given case-control study. Then we describe several discrete methods for disease association search exploiting greedy strategy and topological properties of case-control studies. Finally, we apply disease association search methods to disease susceptibility prediction.

Our methods have been validated on case-control studies for the following diseases: Crohn's and Alzheimer diseases, autoimmune disorder, tick-borne encephalitis, rheumatoid arthritis, and lung cancer. For all studies, we found MSCs significantly associated with the disease even after multiple-testing adjustment while single SNP association is not significant. Our susceptibility prediction methods have higher sensitivity, and specificity than previously known methods such as SVM and random forest.

Previous: Program

Workshop Index

DIMACS Homepage

Contacting the Center

Document last modified on March 12, 2007.