This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).
Title: Studying Disease Dynamics Using a Network Model
Globalization increases the likelihood that disease agents spread more quickly and reach a larger number of hosts. Recent research suggests that small-world networks are a more realistic representation of human social interactions than random or regular networks. We create small-world networks by starting with a circulant and rewiring each edge randomly with a probability swnP (a parameter setting of our model). Our model explores vaccinating random nodes, nodes with the highest degrees (hubs), nodes with the lowest clustering coefficients, nodes with the highest clustering coefficients and nodes contained in cross-cut edges on regular, small-world, and random networks. We also investigate how k (the number of neighbors that each node has in the original circulant) and NCR (the probability that influenza will spread from one neighbor to another each day of the infectious period) change the dynamics of disease spread. We have found that an increase in k or an increase in NCR results in a higher percentage of the population becoming infected.
Title: Group testing to annihilate added hyperedges in a hypergraph
We describe what group testing for critical vertices is and give some applications to network analysis, network security and (biological) experimental design. We describe a random approach to this problem and show how to optimize the random method. We will demonstrate a Maple application that simulates the performance of a random approach to this problem.
Title: Direct identification of network modules from gene-expression measurement
Out-degrees, modules and motifs in a gene-to-gene regulatory network (Mika Gustafsson, Michael Hornquist, Anna Lombardi and Jesper Tegnér)
We analyze a directed gene-to-gene network, with signed interactions, with respect to out-degree distribution, detected modules, and motifs. Among the genes with high out-degrees, there is an excess of genes involved in transcription and of lethal genes in the cell-cycle. These genes also turn out to be more repressed than others.
The detected modules often correspond to known biological processes (according to the Gene Ontology database), but turn out to have no relation to hierarchical clustering. Among the detected motifs, the sign information provides additional insights compared with earlier studies. For example, the bi-parallel and bi-fan motifs have clearly over-represented symmetric versions compared with the non-symmetric ones.
The studied network is formed from the extended Spellman data, consisting of 6178 mRNA levels measured by microarrays at 73 instances in four time-series during the cell-cycle of the yeast Saccharomyces cerevisiae. The reverse engineering is performed on the full data set. By assuming a linear model of a gene-to-gene regulatory network, and imposing an extra constraint (called the Lasso), we obtained the directed regulatory network with positive and negative weights on the edges.
Title: An evolution-based clustering method to separate orthologous genes from out-paralogs
Two genes from two different species are said to be orthologs if they evolved directly from a single gene in the last common ancestor. Genes that evolved from a single gene that was duplicated within a genome are called paralogs. Typically, orthologs perform the same function, whereas a paralog in a genome evolves to perform a new function. Paralogs are further classified into two types: in-paralogs and out-paralogs. With respect to a given speciation event E, paralogs that evolved by gene duplications that happened after E are called in-paralogs, whereas paralogs that evolved by gene duplications that happened before E are called out-paralogs. The Clusters of Orthologous Groups of proteins (COGs) resource at NCBI classifies genes across multiple complete genomes based on their homologous relationship. A COG comprises individual orthologous genes or orthologous groups of paralogs (in-paralogs and out-paralogs) from three or more lineages. COGs do not have a mechanism to distinguish between in-paralogs and out-paralogs. We developed a new hierarchical clustering method (COCO-CL) based on phylogenetic relationships, which can be used to distinguish in-paralogs and out-paralogs in a given set of homologous proteins, and thus help refine COGs to contain only orthologs and in-paralogs. Since the idea of COGs present a frame-work using which gene functions for newly sequenced or poorly characterized genomes can be predicted, it is very important that any orthologous group of proteins contain only orthologs and in-paralogs and not out-paralogs. Application of our clustering mechanism on COG0568 (a complex COG with multiple paralogs used as an example in the original COG paper that appeared in Science) containing 103 genes from 50 organisms revealed an early duplication event, the main driving force behind the inclusion of 25 out-paralogs into the COG. Removal of these out-paralogs resulted in a refined COG with 78 genes from 50 organisms.
Title: Modelling plant-herbivore evolutionary rates using a graph theoretical approach
An individual-based, evolutionary model of plant-herbivore dynamics was used to determine how network structural relationships among individuals within species, and across trophic levels change over evolutionary time. Each individual carries a pseudo-genetic sequence (interaction code) whose hamming distance (the number of changes required a conversion from one bit string to another) determines the ability to mate and feed. Mutations and recombination of the interaction code occur during mating. Each individual is treated as a vertex in a network graph and individuals that can mate are connected by edges. Assuming that a connected component within a trophic level is a species, we analyzed the rate of evolution (the movement of the center of mass in sequence space over time) of the dominant species for plants and herbivores. We found that increased specialization in feeding resulted in more species. We also found that individuals within the dominant species were spatially clustered.
+National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
#National Center for Supercomputing Applications, Beckman Institute for Advance Science and Technology, University of Illinois at Urbana-Champaign, 405 N. Mathews Avenue, Urbana, IL 61801, USA
*Corresponding Author
Title: A Comprehensive Phylogeny of ART-LGIC Superfamily
Acetylcholine receptor type ligand-gated ion channels (ART-LGIC; also known as Cys-loop receptors) are a superfamily of proteins that include the receptor for major neurotransmitters like acetylcholine, serotonin, glycine, GABA, glutamate and histamine, and Zn++ ions. They play a major role in fast synaptic signaling in the animal nervous systems and have thus far not been reported outside of metazoa. Using sensitive sequence profile searches we have identified homologues of ART-LGICs in several bacteria and a single archaeal genus Methanosarcina. The homology between the animal receptors and the prokaryotic homologues spans the entire length of the former, including both the ligand-binding and channel forming transmembrane domains. Furthermore, a detailed analysis of the ligand-binding box based on the structure of the Lymnaea stagnalis acetylcholine-binding protein and the newly detected prokaryotic versions indicates the presence of at least one aromatic residue in the ligand binding boxes of almost all representatives of the superfamily. Analysis of the domain architectures of the bacterial forms shows that they may often show fusions with other small molecule binding domains, such as the periplasmic binding protein superfamily I (PBP-I), CACHE and MCP-N domains. Some of the bacterial forms also often occur in predicted operons with the genes of the PBP-I superfamily. Analysis of phyletic patterns suggests that other than animals, the ART-LGICs are currently absent in all other eukaryotic lineages with available genomic information. Moreover, phylogenetic analysis and conserved sequence motifs also suggest that a subset of the bacterial forms is closer to the metazoan forms.
The conservation pattern in the channel forming TM domains also suggests similar channel-gating mechanisms in the prokaryotic versions. Based on the distribution of charged residues in the prokaryotic M2 TM segments, we expect that there will be examples of both cation and anion selectivity within the prokaryotic members. The domain architectures and gene neighborhoods suggest that the prokaryotic forms may function as chemotatic receptors for amino acids or other low molecular weight solutes. The phyletic patterns and phylogenetic relationships suggest the possibility the metazoan receptors emerged through an early lateral transfer from a prokaryotic source, prior to the divergence of extant metazoan lineages.
This study is recently published in Genome Biology. PMID: 15642096
Title: Two applications of Delaunay contact graphs in the analysis of protein structures
Protein structures have been analyzed with a technique from computational geometry known as Delaunay tessellation. In summary, each amino acid is abstracted to a point. These points are then joined by edges to form a set of non-overlapping, irregular, space-filling tetrahedra each having the property that the sphere on the surface of which all four vertices reside does not contain a vertex from any other tetrahedron ("the empty sphere property"). The tetrahedra are called Delaunay simplices and the process by which they are generated is called tessellation. The Delaunay contact graph is obtained from the tessellated structure by assigning one vertex to each residue and joining vertices with an unweighted, undirected edge when their corresponding residues are joined by a simplex edge shorter than some cutoff in the tessellation of the protein structure.
The first application of contact graph analysis, called DP for 'Delaunay-Potts', is a variation of the Ising-lattice like method of W. R. Taylor (DOMS) for assigning protein structural domains. Like the original, DP requires only carbon alpha coordinates, but differs from DOMS in that it uses very different convergence criteria, asynchronous updating, does not require inverse distance weighting and, instead of simple proximity, contacts are defined with the Delaunay contact graph. The method was tested on a set of 100 structures compiled from several other test sets detailed in the domain assignment literature and included a large number of cases shown previously to be difficult. DP is even simpler than Taylor's original method, more robust to variations in cutoff, and in very good agreement with other existing assignment methods.
Second, using metric multi-dimensional scaling/principal component analysis, the effective dimensionality D of the space in which a residue contact graph lives can be computed from a matrix of all inter-residue graph distances. Studying how D varies with simplex edge length cutoff immediately gives a set of natural distance scales for proteins. Additionally, analysis of the graph properties of real and model structures shows that D determines how the characteristic path length of the contact graph (the average inter-residue graph distance between all pairs of residues) scales with the number of residues.
Title: Graph theoretic properties of contact networks formed by the Delaunay tessellation of protein structures
Here we model a protein structure as a network where nodes represent residues and unweighted edges the presence of interactions between residues. We call such networks residue contact networks and have analyzed a large number of them derived from real and model protein structures. Contact network analysis is not new. Previously, contacts were defined by simple proximity, but in this work residues are defined to be in contact when a simplex edge joins them in the Delaunay tessellation of the structure. Several quantities that have been used to characterize contact networks previously include: the average minimum path length L between all pairs of residues in the network, also called the characteristic path length; the average clustering coefficient C measuring the tendency of residue triples to be mutual neighbors; betweenness Bv measuring the tendency of a residue to be on minimum paths between other residue pairs.
We have analyzed contact networks from one set of real protein chains and three sets of simplified model structures. First L, C, and Bv were computed for these data sets and fitted as functions of N. Next, Fourier spectra were computed on the betweeness profiles to look for periodicity, and a small test set was studied in more detail.
The data show that contact networks for real proteins have small L like random graphs and large C like regular graphs and unlike random graphs. However, detailed analysis shows that L scales as the fourth root of N, not log(N)/log(k) as it would for small world networks or random graphs (where N is the number of residues and k is the average degree). There is little difference between how L and C scale with N for real structures and model random self-avoiding polymer chains. Long, isolated alpha helices, both real and artificial, behave very differently. Their contact graphs are regular 1-lattices under a 10Å edge cutoff, and true small world graphs with no cutoff.
Fourier power spectra of graph betweenness were computed on the resulting betweenness profiles for all the structures. The structures in each data set were divided into six size ranges and all the spectra from the structures in a size range were then binned, summed, and smoothed. The resulting profiles for real structures show a persistent peak at ~30 residues for all six size ranges, unlike random self-avoiding polymers. Berezovsky et al. have reported a peak at ~22-32 residues in the histogram of main chain separation for residues separated by less than 10Å in space. Based on results from polymer physics, they have suggested that this is the closed loop size that forms most readily in polypeptides and such loops form first creating a folding nucleus. High betweenness residues tend to occur at the ends of such closed loops (hence the peak in betweenness spectra) , and also tend to have high scores under a Delaunay tessellation derived 4-body contact potential.
Title: A new method for protein secondary structure assignment based on a simple topological descriptor
Protein structures have been analyzed with a technique from computational geometry known as Delaunay tessellation. In summary, each amino acid is abstracted to a point. These points are then joined by edges to form a set of non-overlapping, irregular, space-filling tetrahedra each having the property that the sphere on the surface of which all four vertices reside does not contain a vertex from any other tetrahedron ("the empty sphere property"). The tetrahedra are called Delaunay simplices and the process by which they are generated is called tessellation.
Simplices resulting from the Delaunay tessellation of protein structures can be divided into 5 classes based on the way the main chain threads through them. In type 0 simplices, none of the four residues at the vertices of the simplex are consecutive in primary sequence. In type 1 simplices, exactly one pair are consecutive in primary sequence. In type 2, two pairs are consecutive, but the pairs are separated from each other in primary sequence. In type 3, exactly three residues are consecutive. In type 4, all four residues are consecutive. One can tally the number of simplices of each type that a given residue belongs to and we call these sums t-numbers. A descriptor of five t-numbers can therefore be assigned to each residue. The total number of simplices of all types in which a residue can participate varies greatly, ranging from 1 to 72 in the set of 996 nonredundant protein structures examined here. We have devised rules which accurately map t-numbers to the DSSP secondary structure assignment.
The degree of agreement between the t-number based methods and DSSP is in line with other secondary structure assignment methods as is the degree of self-consistency in assignment. It is surprising that t-number assignment agrees well with previous methods since, unlike previously published techniques, it is based solely on main chain topology/connectivity--the descriptor does not explicitly depend on any angles, lengths, areas, or putative hydrogen bonds. It might therefore be possible to formulate a simple, elegant definition of secondary structure in terms of t-numbers.
Title: A Faster Algorithm For Network Motif Detection
Network motifs are small connected subgraphs occuring in significantly higher frequencies than in random graphs. Recently, they have recently gathered much attention as a useful concept to uncover structural design principles of complex networks. Kashtan et al. proposed a sampling algorithm for efficiently performing the computationally challenging task of detecting network motifs. Among other drawbacks, this algorithm suffers from sampling bias and is only efficient when the motifs are small (3~or~4 nodes). Based on detailed analysis of this algorithm, we present a new algorithm which overcomes these drawbacks. Experiments on a testbed of biological networks (E. Coli and S. Cerevisiae transcriptional network, C. Elegans neuronal network, and Ythan estuary food web) show our algorithm to be many orders of magnitude faster than previous approaches. This allows to detect larger motifs in bigger networks than was previously possible, facilitating deeper insight into the field of network motifs.