Co-sponsored by DIMACS and the Program in Mathematics and Molecular Biology (PMMB) of the Florida State University.
1. Barry Honig, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY "Domain Definition and Superposition" In the next few years the increasingly availability of three-dimensional structural information will have major impact on the ability to exploit genomic information. An integrated sequence/structural-analysis/homology model building program will be introduced. The program consists of a variety of linked modules which include the facility to carry out sequence analysis, structure-based sequence alignment, fast structure-structure superposition using a unique structural domain database, multiple structure alignment, threading and homology model building. Some of the features of the program will be described and a number of applications will be presented. 2. Temple Smith, Biomolecular Engineering Research Center Boston University, Boston, Mass. "The functional domain dissection of large complex proteins" It was shown, using rather simple ideas from graph theory and measures of local sequence similarity, that many functional regions could be distinguished among protein sequences composed of heterologous sets of domains (Adams, Das, and Smith, Protein Science 5, 1996). Minimally connected local similarity graphs were analyzed for complete subgraphs sets colored by the degree of overlap of pairwise sequence similarities. By replacing simple direct regional measures of sequence similarity with a prior-based profile procedure, the method has now been shown to identify clearly many functional domains that can be recognized by their occurrence in very different sequence contexts. One of the results has been the identification of a number of yeast multidomain proteins that have been assembled from genes found linked in bacteria by operon structures. 3. Slava Brover, RUTCOR, Rutgers University, NJ "Recognition of Protein Domains by Amino-Acid Sequences" The objective of this work is to find domains in a protein given its amino-acid sequence. The idea is to find significant alignments of the target protein and other proteins, collect the boundaries of these alignments as a set of numbers, and make clustering of these numbers. The centers of the obtained clusters are the estimates of the boundaries of the domains in the target protein. 4. Israel Gelfand and Alexander Kister Math Department, Rutgers University, NJ "The classification of the human heavy chains defines patterns, lengths of sequences and structures of the antigen-binding regions (H1 and H2)" We have suggested a new approach for the analysis of immunoglobulin sequences. Central of this approach is the discovering a very few crucial positions. Knowledge of residues at these class-determining positions allows us to deduce almost all residues in a sequence as well its secondary and three-dimensional structure. For each class we determine the lengths (number of residues ) of two H1 and H2 antigen-binding regions in the human heavy chains. 5. Cassandra L. Smith and Niels Storm, Center for Advanced Biotechnology and Departments of Biomedical Engineering, Biology and Pharmacology, Boston University, Mass. "Targeted Expression and Genomic Analysis of Gene Families" There are a number of comparative experimental methods that allow differences in gene expression (or genomic sequence) to be characterized. One approach is to use random PCR to amplify a genome subset whose complexity is amenable to analysis. Differential display compares cDNAs from two cells by comparing the distribution of size fractionated products. The robustness of differential display has been enhanced by targeting the analysis to genes that share a common sequence. The common sequence is used to capture and PCR amplify a target set of genes. Unlike conventional differential display this method allows genes that are expressed at low levels to be analyzed. The samples that are generated by this approach can be analyzed in a variety of ways, including input for array technology. Although, there are a number of efforts that identify shared amino acid sequences in gene families, there are few efforts that identify consensus DNA sequences than can be used for targeting. 7. Martin Farach-Colton, DIMACS, Rutgers Univ., NJ "On the approximability of numerical taxonomy" We consider the problem of fitting an $n\times n$ distance matrix $D$ by a tree metric $T$. Let $\eps$ be the distance to the closest tree metric under the $L_{\infty}$ norm, that is, $\eps=\min_T\{\norm{T-D}{\infty}\}$. First we present an $O(n^2)$ algorithm for finding a tree metric $T$ such that $\norm{T-D}{\infty}\leq 3\eps$. Second we show that it is ${\cal NP}$-hard to find a tree metric $T$ such that $\norm{T-D}{\infty}<\frac{9}{8}\eps$. 8. Lev Goldfarb and John Abela, Faculty of Computer Science, University of New Brunswick, Canada "Inductive model for "multi-domain" sequences: a deterministic combinatorial model" How do we model multi-domain protein sequences taking into consideration their evolutionary and non-evolutionary history? In this paper we suggest a deterministic combinatorial inductive model based on the recently proposed inductive learning model--evolving transformation system (ETS) model. One of the central concepts of the ETS model (in the present context) is the concept of an inductive representation of the class of sequences, where "class" stand for all possible, including prospective, examples of a chosen domain. In the ETS model, the inductive class representation (ICR) is constructed during the learning process, based on a small finite number of given examples, and takes the form of a set of symbolic operations with non-negative weights assigned to each of them. All possible sequences from the class can then be constructed and therefore predicted based on such ICR: they are those sequences whose symbolic operation based distance from a fixed given sequence is less than some constant. We then offer a model for domain segmentation of a given unknown "multi-domain" sequence based on a dictionary of domains' ICRs previously constructed. 9. Poe Xing, Casimir Kulikowski and Ilya Muchnik, DIMACS and CS Department, Rutgers Univ., NJ "Analysis of ribosomal RNA sequences by combinatorial clustering" We present a clustering approach for sequence analysis based on three stages: sequence segmentation, separate clustering on each segment and intersection of clustering results from all segments. The approach is illustrated by clustering on ribosomal RNA sequences. Ribosome is a complex subcellular machinery that catalyzes protein synthesis. It is composed of one small subunit which is responsible for mRNA and tRNA binding, and one large subunit catalyzing peptide bond formation. More than half of the weight of a ribosome is RNA, and increasing evidence shows that the ribosomal RNA (rRNA) molecules play a central role in protein synthesis. Despite size difference, the complicated folded structure of the rRNA molecule in the small subunit is highly conserved across numerous organisms; there are also close homologies between the large subunit rRNA in different organisms. The predominant role of rRNA in ribosome structure and function is likely to reflect the ancient origin of protein synthesis, which is thought to have evolved in an environment dominated by RNA-mediated catalysis. Due to its conserved nature, rRNA sequences is an ideal object for phylogenetic analysis of genetic evolution. In addition, since rRNA is also the skeleton where many ribosome proteins are attached and excises their catalytic or regulatory functions, detailed analysis of rRNA sequence structure is also crucial for the understanding of the nature of RNA-protein interaction during ribosome assemblage and the regulation of protein synthesis. In this study, 409 eukaryotic small subunit rRNA sequences, each from a different organism, were analyzed using the three-stage combinatorial clustering in an attempt to extract subsets of sequences that are similar in a specific fragment of sequences. Conventional procedures for nucleotide or protein sequence clustering are usually based upon a global multi-alignment of the full context of the sequences. Although straightforward and easy to implement, it is sometimes [START of message] error-prone due to the interference of the non-informative sequence fragments which constituent functionally nonessential structure and are either poor conserved or randomly spread in the gene of different species. On the other hand, a procedure based only on the essential sequence information (i.e. domain or functional motif) are expected to lead to more accurate clustering. We assume that, for a sequence fragment to be regarded as functionally essential, it must be present in the sequence from the majority of the organism and with a certain degree of sequence conservation and uniformity of nucleotide composition. This feature can be revealed on a multiple-sequence-alignment as segments with rare gaps and lack of significant preference of nucleotide composition. To extract such segments, the entire collection of the 409 multi-aligned sequences were first transformed into one frequency sequence in which each nucleotide site corresponding to the site in the original alignment sits a 5-D vector with each dimension represents the frequency of each NT type at that site. Using dynamic programming method, this frequency sequence was partitioned into pre-specified number of segments such that the sum of variance of each segment is minimized. The 3982 nucleotides long multiple alignments composed of 409 small subunit rRNA sequences were divided into 25 segments with their length varied from 34 to 367 nucleotides. Intermediate to high frequencies of gap occurrence were observed in 18 of the segments which suggest poor conservation of these sequence fragment among organisms. However, there were 7 segments in which the frequency of gap occurrence is lower than the average frequency of the nucleotide, suggesting that these segment are present in most of the organisms and potentially correspond essential functional units. Through the above mentioned segmentation of the frequency sequence, the informative segments along the multialignment, characterized by their low probability of gap occurrence, were delimitated. These mini alignment blocks, further condensed by removing the nucleotide sites where gap occurrence is higher than 30%, were independently used for sequence clustering. Clustering of the 409 rRNA sequences on each segment of the alignment is performed with a seriation procedure that leads to the global maximum of a corresponding "minimum slit" set function, in this case, the dissimilarity between a sequence of the set and the consensus sequence of the set. The problem is formulated as a combinatorial optimization problem whose criterion has a particular property called quasi-convexity, which facilitates fast polynomial procedures to solve the problem. Although the clustering was carried out independently on 7 different segments along the multi-aligned sequences, the resulting partitioning of the sequences showed significant similarity. The sizes of the resulting clusters determined from seven different segments ranged from 284 to 361. When each sequence was labeled by a pattern determined by its presence or absence in any one of the seven clustering results, we observed that the 249 of the sequences have the pattern that corresponds to presence in all seven clusters although there are all together 27 possible patterns existed. The predominance of this intersection over all other combinations of sequence distribution pattern in the multiple clustering results indicated a significant consistency of clustering using informative sequence segments. It is also an effective procedure to improve clustering on each individual segment. Even when all the minor distribution patterns were accounted, all 409 sequences falls into 33 patterns (with 4 patterns have greater than 10 sequences). This is a great reduction of randomness compared with the 128 patterns that could be resulted from seven independent clustering. This reduction, together with the dominance of intersection during combination of all separate clusters leads to a clear interpretation: the introduction of segmentation prior to clustering greatly reduce the interference of random information from non-conserved or nonessential sequences fragments interspersed in the gene sequence. 10. Philip E. Bourne, San Diego Supercomputer Center, Dept. of Pharmacology, Univ. of CA, CA "Clustering of Protein Structures using a Combinatorial Extension Algorithm" Finding and aligning the 3-dimensional structures of proteins with low sequence homology yet similar folds is a non-trivial problem. The issues are what constitutes a "good" alignment and the computational tractability of the problem when faced with approximately 1000 unique protein chains within the current corpus. A new approach - Combinatorial Extension (CE) of the optimum path will be introduced and compared to other methods. Using CE and 24,000 processor hours on a Cray T3E a database has been constructed comparing the 3D structures of all protein chains. The database provides a good opportunity to explore protein fold space. Current work is refining the algorithm to take into account the functional characteristics of particular protein families in an effort to refine the alignments by protein family. The complete database of alignments is available at http://cl.sdsc.edu/all-to-all/all-to-all.html 11. Leonid I. Perlovsky, Nichols Research Corp., MA "Clustering With Model-Based Neural Network" An internal model is considered an essential ingredient of an intelligent system and of intelligence in general. In the past, however, attempts to combine clustering with adaptive model estimation led to combinatorial explosion of computational complexity. The reason for this was that the model estimation problem was formulated for a particular clustering result, while clustering depends on the model. Therefore multiple combinations of models and clustering had to be evaluated in the solution process. I will describe a method of performing both, model estimation and clustering concurrently. This mathematical technique is implemented in a parallel architecture and is called model-based neural network. Neural networks with complicated internal models can serve at intermediate processing levels between pure "connectionist" and "symbolic" approaches. They can serve as intermediaries between signal samples and concepts or symbols. This is achieved by having each concept or symbol modeled with a sub-model of the internal model. Adaptive models lead to a dynamical adaptive mathematical description of symbols. The internal model is thus a compositional model composed of submodels-symbols. These dynamic symbols-submodels better correspond to the notion of symbol developed in psychology, semiotics, and neurophysiology than static symbols of classical "symbolic AI". Several application results will be presented indicating significant improvements over classical techniques. 12. Vladimir Grishin, View Trends Int., OH "Combinatorial visual clustering of multidimentional data" A multivariate data clusterization usually bases on a search of commonalities and differences of data vectors allowing to find these vectors groups making sense for problem solving. A human vision has unique possibilities to do that for reality pictures. Many years the author is developing "pictorial" methods to use this human vision power for a clustering of multivariate data by means of different transformations of initial data to pictures. They have to represent these data as wholistic images which can be visually analysed and described with hierarchical system of hundreds of local and integral features of shapes, colors, textures,etc. To get such representations and effectively use them for a data clustering it is needed to solve some combinatorial problems. First of them is an algorithm synthesis and its effectiveness estimation for permutation rows, columns and cells on matrix representations of data. In the simplest case it appears when each number of some data matrix is represented by a certain combination of color, intensity, texture,etc. These color picture can be very inconvinient for visual analysis because of many small separate spots which can not be connected visualy in some shapes allowing to effectively apply the hierarchy of visual features. A simple smothing is not effective enough in this case and main tools to get good pictures are permutations of columns, rows and cells, whatever problem allows. They can create much more connective image than initial one. Man-machine algorithms based on computer analysis of special graph representation of a data matrix and a visual analysis of intermediate steps of permutations to choose a right next step are considered in this report. Shown that for real data set an acceptable value of a picture connectivity criteria can be provided by this algorithms for reasonable amount of steps. Second combinatorial problem considered is about an estimation of possible lengthes of disjanctive-conjunctive data vectors dichotomies visually detected by polar and reactangular developments of these vectors. It is shown that binary projections of many multidimensional data have pretty short dichotomical solutions. 13. Nickolai Alexandrov, Amgen, Inc., CA "The number of domain families" Protein domains are grouped into families by sequence similarity and 3D structural resemblance. Analysis of the domain clustering allows us to make predictions on the total number of sequence families and different structural folds. Presented results are based on PFAM database of protein families and SCOP classification of protein folds. 14. Sandor Vajda, Department of Biomedical Engineering, Boston University, MA "Analysis and design of receptor-ligand interactions" Docking and design are the major computational steps toward understanding and affecting receptor-ligand interactions, and thus are important both for the analysis of cellular processes and for the development of drugs. Docking can be reduced to the problem of finding the global minimum (and possibly some local minima) of an appropriate target function over a space that includes rotational, translational, and possibly conformational degrees of freedom. Design expands the search to a space that also includes some description of the chemical structure of the potential ligand. We will discuss how to select target functions and search algorithms in frequently occurring applications. In particular, we describe an extension of the dynamic programming algorithm for docking and design of flexible molecules. 15. Simon Streltsov, Alphatech Inc., MA "Making heuristic clustering algorithms globally optimal" Clustering algorithms can be viewed as a class of non-convex optimization problems characterized by very large computational complexity and a special structure of the objective function. Many algorithms find an approximate solution by combining an initial heuristic approximation with subsequent local optimization. We consider a global optimization problem over the parameterized initial approximation, and describe a methodology that finds a trade-off between quality of the solution and the respective costs of initial approximation and local optimization algorithms. This global optimization methodology is based on Brownian motion approximation of the non-convex objective function combined with optimal on average index policy of selection between competing search processes. As an example of the proposed approach, we consider feature-based divisive hierarchical clustering algorithm with initial approximation computed using 1-D clusters along each feature. 16. Saveli Goldberg, MediSpectra Inc., MA "Definition of anti-syndrome and empirical data classification problems" Minimal anti-syndrome of A-class of the objects is a collection of signs. This collection does not occur with any object of class A and any sub-collection of this collection occurs with same objects of class A. Set of minimal anti-syndromes of class A is considered as a pattern of class A. Metrics functions and pseudometricses on a set of classes are obtained on the base of sets of minimal anti-syndromes. We present a procedure of modeling of the universe of the class A with the help of minimal anti-syndrome set manipulations. The solutions are illustrated on some applied examples. 17. Peter Hammer, RUTCOR, Rutgers Univ., NJ "Logical analysis of data in application" 18. Iosif I. Vaisman, Alexander Tropsha, and Weifan Zheng, Univ. of North Carolina, NC "Compositional Preferences in Quadruplets of Nearest Neighbor Residues in Protein Structures" Three-dimensional structure and amino acid sequence of proteins are related by an unknown set of rules that is often referred to as the folding code. This code is believed to be significantly influenced by nonlocal interactions between the residues. A quantitative description of nonlocal contacts requires the identification of neighboring residues. We applied statistical geometry approach to analyze the patterns of spatial proximity of residues in known protein structures. Structures from a dataset of well resolved nonhomologous proteins with a single point representation of residues by Ca atoms were tessellated using Delaunay algorithm. The Delaunay tessellation generates an aggregate of space-filling irregular tetrahedra, or Delaunay simplices. The vertices of each simplex objectively define four nearest neighbor Ca atoms and therefore four nearest neighbor residues. Compositional analysis of Delaunay simplices reveals highly nonrandom clustering of amino acid residues in protein structures. Relative abundance or deficiency of residue quadruplets with certain compositions reflects propensities of different types of amino acids to be associated or disassociated in folded proteins. The likelihood of occurrence of four residues in one simplex displays strong nonrandom signal also in the case of a reduced amino acid alphabet. We used several different three-letter alphabets based on the residue chemical and structural properties and on the complementarity of the corresponding codons. In all cases the clustering of residues correlates with their properties or genetic origin. The results of this analysis are being implemented in algorithms for protein structure classification and prediction. 19. Ying Xu, Computational Biosciences Section, Life Sciences Div., ORNL, TN "Protein threading by combinatorial optimization methods" Computational recognition of native-like folds of an anonymous amino acid sequence from a protein fold database is considered to be a promising approach to the three-dimensional (3D) fold prediction of the amino acid sequence. We present a new method for protein fold recognition through optimally aligning an amino acid sequence and a protein fold template ({\it protein threading}). The fitness of aligning an amino acid sequence with a fold template is measured by (1) the singleton fitness, representing the compatibility of substituting one amino acid by another and the combined preference of secondary structure and solvent accessibility for a particular amino acid, (2) the pairwise interaction, representing the contact preference between a pair of amino acids, and (3) alignment gap penalties. Though a protein threading problem so defined is known to be NP-hard in the most general sense, our algorithm runs efficiently if we place a cutoff distance on the pairwise interactions, as many of the existing threading programs do. For an amino acid sequence of size $n$ and a fold template of size $m$ with $M$ core secondary structures, the algorithm finds an optimal alignment in $O(Mn^{1.5C+1} + m n^{C+1})$ time and $O(Mn^{C+1})$ space, where $C$ is a (small) nonnegative integer, determined by a particular mathematical property of the pairwise interactions. As a case study, we have demonstrated that $C$ is less than or equal to 4 for about 75\% of the 293 unique folds in our protein database, when pairwise interactions are restricted to amino acids $\le$ 7\AA\ apart (measured between their $beta$ carbon atoms). An approximation scheme is developed for fold templates with $C > 4$, when threading requires too much memory and time to be practical on a typical workstation. 20. Leonard J. Schulman Georgia Inst. Technology Quadratic Euclidean Clustering We present an algorithm which partitions a set of points in R^d into two blocks, so as to minimize the sum of the squares of the Euclidean distances between pairs of points in the blocks. The algorithm runs in time $O(n^{d+1} \log n)$. An extension of the method gives a polynomial time solution of the analogous problem for any constant number of blocks. 21. Franco Armando Bignone, Dep. of Preclinical Oncology, Lab. of Experimental Oncology, National Cancer Institute, Italy "A case study in genetic networks dynamics: gastrulation in Caenorhabditis elegans as a model system" The main problem facing biology in the near future, with the end of several sequencing projects approaching or already accomplished, and a vast amount of information on biochemical interactions becoming available, will be an understanding of the context in which information stored in the DNA is expressed and regulated. This detailed description is going to be necessary in order to understand how the molecular level is influenced, and in turn give rise, to the observed higher dimensional cellular structures. While decoding of the physical structure of Nucleic Acids is now, in several instances, routine, the study of how the interaction between the DNA and its microenvironment give rise to the life forms we observe has still a long way to go, both in terms of pure description and in the study of general physico-chemical mechanisms involved. This is particularly true if we aim at intervention on the scale of organisms -- i.e. modifications of organisms with predictive capabilities --. The reasons behind this difficulty are partially due to the complexity of genetic networks and metabolic networks which give rise to the observed behavior, and partially to the spatial distribution and organization of proteins and other molecules involved. In Molecular Biology and in Chemistry, during the last thirty years, there has been a thin but steady line of attempts to deal with the problem of developing an analytical description for complex chemical systems in which multiple chemical species are involved. Moreover the study of this problem has been mirrored, on a much wider base, in Physics, Chemistry, and to a lesser extent Biology, with the study of spatially extended structures which emerge spontaneously in systems away from equilibrium. Thus in dynamical systems theory several tools have been introduced recently, for modeling complex dynamics emerging from the spatial coupling of many degrees of freedom. These include: Coupled Maps Lattices, Cellular Automata, and Lattice Gas Cellular Automata. These results have fostered a new development of this area. Taken together, and applied depending on the situations, these concepts bring new possibilities for the study of biological systems both for pure andapplied research. In order to study some general aspects in the distribution of regulatory signals among cells we have studied recently the main dynamical features of a Coupled Maps Lattice that mimics the evolution in time of simple genetic networks in regular cellular clusters. Each cell has been represented by the same network of genes whose protein products determine mutual interactions both in space and time. The global dynamics -- quantified by the mean concentrations of the products on the overall space --, and the symbolic dynamics -- activity-inactivity patterns of the genes --, have been analyzed in the different regimes shown by the system in certain regions of the parameter space. This kind of formalizations of Biological Systems as lattices of fixed size can be important to study their basic dynamical properties, however, for a more realistic study it should be taken into account that local space-time dynamics set lattice's size and shape -- i.e. number of sites and topological distribution of signal sources -- while biochemical signals travel through it driving growth, death and/or cell movement. This phenomenon results in a continuous coupling between the studied dynamics and lattice structure. Biological phenomena that shows this kind of plasticity are known to be quite general and occurs at early and late stages of development, in several pathologies -- i.e. cancer progression -- and during homeostasis. In order to undertake a detailed study of this class of phenomena we have used the Nematode Caenorhabditis elegans which displays a stereotype cell lineage during embryogenesis. An important aspect of Biological Systems, viewed here from a dynamical point of view, is that they are a complex example of symbolic encoding. The obvious case of this behavior is DNA but this is not the only one. In fact at an higher dimensional scale, this concept can be applied again. Organisms such as C. elegans, or other organisms such as Gastropods or Anellids, whith their stereotyped cell lineage during development, give rise to a strict symbolic encoding of cells generated from the zigote. This fact can be quite useful in order to understand necessary parameter reduction, in order to achieve a correct model building in this complex situation. The work done until now has been a detailed description of the path of development through time lapse cinematography. We have traced the path of all the cells generated from the zigote in normal embryos and mutants, from the first cell division up to the end of gastrulation -- a total of 370 cells --. The results of this study indicate that early stages of development and tissue organization from a conceptual point of view can be studied as a problem similar to a dynamic protein folding. In this case the study is complicated by the fact that the behavior of every element of the "folded chain" - cells - has possibility of movement and characteristics that increase the complications already present in classical problems of protein folding. At the same time several conceptual tools applied to proteins, such as for example contact maps, can be adapted to the study of this problem from a dynamical point of view. Keywords: Gene expression, Genetic Networks, Caenorhabditis elegans Coupled maps lattices, Cellular automata, Grammars, Graphs. References F.A. BIGNONE, R. LIVI, M. PROPATO, Long transients dynamics in biochemical networks, to appear, {\sl Il Nuovo Cimento D}, 1988. F.A. BIGNONE, Coupled maps lattice dynamics on a variable space for the study of development: a general discussion on Caenorhabditis elegans, to appear, {\em Theoretical Computer Science}, 1988. F.A. BIGNONE, R. LIVI, M. PROPATO, Complex evolution in genetic networks, {\em Europhys. Lett.}, {\bf 40}(5), 497-502, 1977. F.A. BIGNONE, R. LIVI, M. PROPATO, Symbolic and global dynamics in coupled genetic networks: evolution of artificial cell populations at the edge and within chaotic transient behaviour, in: P. Husbands and I. Harvey, editors, {\em Fourth European Conference on Artificial Life, ECAL97}, MIT Press, Cambridge, Massachusetts, 1997. F.A. BIGNONE, Complex dynamics in coupled genetic networks. Proceedings meeting, " La matematica e i problemi dell'ambiente, della biologia e della medicina: aspetti modellistici, analitici e computazionali", Urbino 29-31, 1996, {\sl Studi Urbinati}, {\bf 1}, 159-168, 1997. F.A. BIGNONE, R. LIVI, M. PROPATO, Dynamical stability and finite amplitude perturbations in coupled genetic-networks, {\em PHYSICA D}, {\bf 108}, 379-396, 1997. F.A. BIGNONE, Cells-Gene interactions simulation on a coupled map lattice, {\it Journal of Theoretical Biology}, {\bf 161}(2), 231-249, 1993. 22. Teresa Przytycka, Rajeev Aurora, and George D. Rose przytyck@grserv.med.jhmi.edu "Protein Classification and Secondary Structure Sequence" We investigate the question whether secondary structure sequences embody enough information to allow for fold recognition and classification. Here by a secondary structure sequence of a protein we mean the sequence of secondary structure types and their lengths. To compute similarity score between two proteins we align their secondary structure sequences using a dynamic programming algorithm. The alignment algorithm is similar to the standard sequence alignment algorithm with the modification required by the fact that an individual element of a secondary structure sequence corresponds to a whole secondary structure not to a single residuum. By computing similarity score for each pair of proteins in a test set we obtain a distance matrix. (We use here the term ``distance" in an informal meaning - we do not require that the triangular inequality is satisfied). Given a distance matrix we apply a clustering algorithm to construct a "similarity tree". We compare the resulting tree with the SCOPE classification and VAST similarity search. Our computational experiments suggests that alignment of secondary structure sequence can be used as a tool for protein classification. The resulting hierarchy would be related to but not identical to SCOP classification. In particular, proteins with the same folds (including the number of strands and helices) are clustered together. However, since our method treats all secondary structures equally, the proteins with significantly different number of strands/helices may not cluster together even if they are considered to belong to the same fold family as defined by SCOPE. By comparing our similarity with the one computed from VAST data we observe that most of the proteins whose sequence distance is below 0.2 are also considered to have same fold by VAST and vice versa. However there are some proteins form the same SCOPE family which we identify as having the same fold but VAST does not as well as proteins that VAST cluster together while our method mixes with proteins form other SCOPE families that have similar number of strands. This is still remarkable, since unlike VAST our method uses no information about a fold other than the secondary structure sequence. 23. Anders Wallqvist, Yoshifumi Fukunishi, Addi Fadel and Ronald M. Levy. Department of Chemistry, Rutgers University Protein Fold Recognition based on Secondary Structure Similarities. Sequence alignment techniques are very powerful tools for identifying the fold of protein sequences. Partial structural information of protein sequences can be built into a general alignment technique to augment existing alignments procedures. In this work we investigate the possibility of identifying protein families by aligning secondary structure sequences. We have carried out an extensive analysis of how secondary structure alignments can be used for fold recognition when the amino acid sequence identities (as determined by structurally aligned protein pairs) within the folding families is low. We identify a "twilight zone" for fold recognition based on secondary structure alignments at ~65% secondary structure identity as compared with ~21% amino acid identity using the standard amino acid sequence alignment approach. The success rate for identifying protein folds over a small database of 455 proteins divided into 86 families is the same (~86%) for either secondary structure or amino acid alignments. Thus the known secondary structure sequence carries enough information to discriminate folds, i.e. the sequential arrangement of secondary structure elements is a strong fold determinant. This work explicitly provides the rationale for alignment protocols which optimize amino acid and secondary structure identities simultaneously. If time permits we will briefly discuss the results of our collaborative efforts with Prof. Kulikowski and co-workers to construct RASSP, a Representative and Aligned Secondary Protein Database. 24. Steven E. Brenner, Stanford University, CA "Recognizing Distant Evolutionary Relationships from Protein Sequence and Structure" Because evolutionarily related proteins have a shared history, they will have the same structure and are likely to share similar functions. Consequently, methods of detecting homology may be used to characterize newly discovered proteins, such as those from genome projects. Indeed, sequence comparison is the principal computational tool underlying structural and functional genomics. Because structure provides the most powerful means of detecting homology, the scop: Structural Classification of Proteins database organizes all known protein structures according to their evolutionary and structural relationships. The scop data have been employed to assess methods of identifying homology from sequence. Analysis of the results has allowed us to optimize automated sequence comparison methods to detect far more relationships, while retaining extremely high accuracy. 25. Richard Fine and Boris Klebansky, Paradygm Technologies Inc., NJ "Topographical and Biochemical Maps of Protein Surfaces: Recognizing Similarity and Complementarity" One of the more astounding accomplishments of living cells is their ability to orchestrate and control specific chemical interactions between tens of thousands of proteins in the same reaction mixture. The specificity of its interactions is imparted by a protein^Òs three-dimensional structure, which provides a rich and complex set of topographical and chemical features on protein surfaces whose match dictate cognate specificity. The biochemical nature of these features is only poorly sampled at the primary sequence level, while their topographical nature is not sampled at all. The current accumulation of structural information in the Protein Databank (PDB) as organized into families and super-families in SCOP provides a rich set of structural information which can be used to examine similarities and differences between biologically interesting surface features of proteins. We are implementing a general suite of tools and an underlying database for the characterization, comparison, and query of known biochemical and topographical features on protein surfaces. The surface is represented by an hierarchical tree of tokens which characterizes features on the surface at different levels of resolution. This representation is coupled to a search engine for identifying both similar and complementary patches of surface in the collection of existing proteins in the PDB. The structure of the database has been designed to support learning systems to identify key features such as enzyme active site similarities and differences within protein families. Coupled to family sequence alignments this database suggests a plethora of practical applications including the design and interpretation of site directed mutagenesis experiments, the generation of sequence motifs corresponding to conserved and differentiated surface features within families of proteins, and the extension of existing small-molecule database screening tools to differentially target one among several related proteins. Methods will be discussed and examples given of the application of these concepts. 26. Donald Jacobs, Department of Phyics and Astronomy Michigan State University, MI "Real-time Protein Domain Evaluator and Design Tool" In proteins it is possible to separate hard covalent forces involving bond lengths and bond angles from other weak forces. The microstructure of a protein is modeled as a generic bar-joint truss framework, where the hard covalent forces and strong hydrogen bonds are regarded as rigid bar constraints. The mechanical stability of a given protein structure is then analyized using a combinatorial constraint counting algorithm to identify the protein's Floppy Inclusion and Rigid Substructure Topography (FIRST). The FIRST algorithm is based on a generalization of Laman's theorem [1] on graph rigidity in the plane and the extension of the 2D pebble game algorithm [2,3] to give a complete characterization of generic rigidity in three dimensional bond-bending networks [4]. Unlike many methods for parsing protein folds, the FIRST algorithm gives exact mechanical properties of a protein structure (and other macromolecules) subject to a certain set of force-constraints. These properties include counting the number of independent degrees of freedom, identifying rigid clusters, overconstrained regions and low energy collective motions. The FIRST algorithm can be used in real time; for example, it can analize a protein consisting of 1000 residues in less than a second on a Pentium PC. To achieve this computational speed, all imposed constraints are treated equally as infinitely rigid. The selection of the imposed set of constraints is determined by introducing a cut-off criteria on the interaction strength of hydrogen bonding. As the cut-off criteria is relaxed, a hierarchical characterization of the mechanical properties can be obtained. The approach used here has a high potential to be developed into a 3D-structural analysis and design tool for evaluating protein domains and identifying conformational flexibility. [1] G. Laman, "On graphs and rigidity of plane skeletal structures, J. Eng. Math 4, 331 (1970) [2] D. J. Jacobs and M. F. Thorpe, "Generic Rigidity: The Pebble Game", Phys. Rev. Lett. 75, 4051-4054 (1995) [3] D.J. Jacobs and B. Hendrickson, "An Algorithm for Two Dimensional Rigidity Percolation: The Pebble Game", J. Comp. Phys. 137, 346-365, (1997) [4] D. J. Jacobs, "Generic Rigidity in Three Dimensional Bond-bending Networks", Preprint Aug (1997), submitted to J. Phys. A. (Math. and Gen.) 27. Manfred Zohn. Protein Domain Prediction in the context of Genome Annotation Manfred Zorn, Lawrence Berkeley National Laboratory, MDZorn@lbl.gov Genome sequencing is gearing up to finish the sequencing the human genome while tools to understand sequence are lagging behind. The Genome Annotation Consortium is developing a framework for high-throughput genome annotation that combines traditional sequence analysis with machine learning techniques, data mining, and visualization to facilitate extracting knowledge from the sequence information. In the talk I will focus on the protein classification work and the framework developments being done at LBNL as well as related ongoing and future collaborations.