DIMACS Workshop on Combinatorial Clustering and Multi-Domain Protein Structure Analysis

June 26-27, 1998
DIMACS Center, CoRE Building (Room 431), Rutgers University, Busch Campus, Piscataway, NJ

Principal Organizers:

Sylvia Spengler, Lawrence Berkeley Nat.Lab, Life Sciences Div.
Manfred D. Zorn, Lawrence Berkeley Nat.Lab, National Energy Research Scientific Computing Division
Inna Dubchak, Lawrence Berkeley Nat.Lab, National Energy Research Scientific Computing Division
Fred Roberts, Rutgers University, DIMACS
Ilya Muchnik, Rutgers University, DIMACS
Casimir A. Kulikowski, Rutgers University, CS Department

Presented under the auspices of the Special Year on Massive Data Sets and the Special Year in Mathematical Support for Molecular Biology.

Co-sponsored by DIMACS and the Program in Mathematics and Molecular Biology (PMMB) of the Florida State University.

ABSTRACTS

Barry Honig, Department of Biochemistry and Molecular
Biophysics, Columbia University, New York, NY

"Domain Definition and Superposition"

In the next few years the increasingly availability of
three-dimensional structural information will have major impact on the
ability to exploit genomic information. An integrated
sequence/structural-analysis/homology model building program will be
introduced. The program consists of a variety of linked modules which
include the facility to carry out sequence analysis, structure-based
sequence alignment, fast structure-structure superposition using a
unique structural domain database, multiple structure alignment,
threading and homology model building. Some of the features of the
program will be described and a number of applications will be
presented.

Temple Smith, Biomolecular Engineering Research Center
Boston University, Boston, Mass.

"The functional domain dissection of large complex proteins"

It was shown, using rather simple ideas from graph theory and
measures of local sequence similarity, that many functional regions could
be distinguished among protein sequences composed of heterologous sets of
domains (Adams, Das, and Smith, Protein Science 5, 1996). Minimally
connected local similarity graphs were analyzed for complete subgraphs sets
colored by the degree of overlap of pairwise sequence similarities. By
replacing simple direct regional measures of sequence similarity with a
prior-based profile procedure, the method has now been shown to identify
clearly many functional domains that can be recognized by their occurrence
in very different sequence contexts. One of the results has been the
identification of a number of yeast multidomain proteins that have been
assembled from genes found linked in bacteria by operon structures.
3.

Slava Brover, RUTCOR, Rutgers University, NJ

"Recognition of Protein Domains by Amino-Acid Sequences"

The objective of this work is to find domains in a protein given its
amino-acid sequence.
The idea is to find significant alignments of the target protein and other
proteins, collect the boundaries of these alignments as a set of numbers,
and make clustering of these numbers.
The centers of the obtained clusters are the estimates of the boundaries
of the domains in the target protein.

Israel Gelfand and Alexander Kister
Math Department, Rutgers University, NJ

"The classification of the human heavy chains defines
patterns, lengths of sequences and structures of the
antigen-binding regions (H1 and H2)"

We have suggested a new approach for the analysis of immunoglobulin
sequences. Central of this approach is the discovering a very few
crucial positions. Knowledge of residues at these class-determining
positions allows us to deduce almost all residues in a sequence as
well its secondary and three-dimensional structure. For each class we
determine the lengths (number of residues ) of two H1 and H2
antigen-binding regions in the human heavy chains.

Cassandra L. Smith and Niels Storm, Center for Advanced
Biotechnology and Departments of Biomedical Engineering,
Biology and Pharmacology, Boston University, Mass.

"Targeted Expression and Genomic Analysis of Gene Families"

There are a number of comparative experimental methods that allow
differences in gene expression (or genomic sequence) to be
characterized. One approach is to use random PCR to amplify a genome
subset whose complexity is amenable to analysis. Differential display
compares cDNAs from two cells by comparing the distribution of size
fractionated products. The robustness of differential display has been

enhanced by targeting the analysis to genes that share a common
sequence. The common sequence is used to capture and PCR amplify a
target set of genes. Unlike conventional differential display this
method allows genes that are expressed at low levels to be
analyzed. The samples that are generated by this approach can be
analyzed in a variety of ways, including input for array technology.
Although, there are a number of efforts that identify shared amino
acid sequences in gene families, there are few efforts that identify
consensus DNA sequences than can be used for targeting.

Martin Farach-Colton, DIMACS, Rutgers Univ., NJ

"On the approximability of numerical taxonomy"

We consider the problem of fitting an $n\times n$ distance matrix $D$
by a tree metric $T$. Let $\eps$ be the distance to the closest tree
metric under the $L_{\infty}$ norm, that is,
$\eps=\min_T\{\norm{T-D}{\infty}\}$. First we present an $O(n^2)$
algorithm for finding a tree metric $T$ such that
$\norm{T-D}{\infty}\leq 3\eps$. Second we show that it is ${\cal
NP}$-hard to find a tree metric $T$ such that
$\norm{T-D}{\infty}<\frac{9}{8}\eps$.

Lev Goldfarb and John Abela, Faculty of Computer Science,
University of New Brunswick, Canada

"Inductive model for "multi-domain" sequences:
a deterministic combinatorial model"

How do we model multi-domain protein sequences taking into consideration
their evolutionary and non-evolutionary history? In this paper we suggest
a deterministic combinatorial inductive model based on the recently
proposed inductive learning model--evolving transformation system (ETS)
model.

One of the central concepts of the ETS model (in the present context) is
the concept of an inductive representation of the class of sequences,
where "class" stand for all possible, including prospective, examples of a
chosen domain. In the ETS model, the inductive class representation (ICR)
is constructed during the learning process, based on a small finite number
of given examples, and takes the form of a set of symbolic operations with
non-negative weights assigned to each of them. All possible sequences from
the class can then be constructed and therefore predicted based on such
ICR: they are those sequences whose symbolic operation based distance from
a fixed given sequence is less than some constant.

We then offer a model for domain segmentation of a given unknown
"multi-domain" sequence based on a dictionary of domains' ICRs
previously constructed.

Poe Xing, Casimir Kulikowski and Ilya Muchnik,
DIMACS and CS Department, Rutgers Univ., NJ

"Analysis of ribosomal RNA sequences by
combinatorial clustering"

We present a clustering approach for sequence analysis based on three
stages: sequence segmentation, separate clustering on each segment and
intersection of clustering results from all segments. The approach is
illustrated by clustering on ribosomal RNA sequences.

Ribosome is a complex subcellular machinery that catalyzes protein
synthesis. It is composed of one small subunit which is responsible for
mRNA and tRNA binding, and one large subunit catalyzing peptide bond
formation. More than half of the weight of a ribosome is RNA, and
increasing evidence shows that the ribosomal RNA (rRNA) molecules play a
central role in protein synthesis. Despite size difference, the
complicated folded structure of the rRNA molecule in the small subunit is
highly conserved across numerous organisms; there are also close
homologies between the large subunit rRNA in different organisms. The
predominant role of rRNA in ribosome structure and function is likely to
reflect the ancient origin of protein synthesis, which is thought to have
evolved in an environment dominated by RNA-mediated catalysis. Due to its
conserved nature, rRNA sequences is an ideal object for phylogenetic
analysis of genetic evolution. In addition, since rRNA is also the
skeleton where many ribosome proteins are attached and excises their
catalytic or regulatory functions, detailed analysis of rRNA sequence
structure is also crucial for the understanding of the nature of
RNA-protein interaction during ribosome assemblage and the regulation of
protein synthesis.

In this study, 409 eukaryotic small subunit rRNA sequences, each from a
different organism, were analyzed using the three-stage combinatorial
clustering in an attempt to extract subsets of sequences that are similar
in a specific fragment of sequences.

Conventional procedures for nucleotide or protein sequence clustering are
usually based upon a global multi-alignment of the full context of the
sequences. Although straightforward and easy to implement, it is sometimes
[START of message]
error-prone due to the interference of the non-informative sequence
fragments which constituent functionally nonessential structure and are
either poor conserved or randomly spread in the gene of different species.
On the other hand, a procedure based only on the essential sequence
information (i.e. domain or functional motif) are expected to lead to more
accurate clustering.

We assume that, for a sequence fragment to be regarded as functionally
essential, it must be present in the sequence from the majority of the
organism and with a certain degree of sequence conservation and uniformity
of nucleotide composition. This feature can be revealed on a
multiple-sequence-alignment as segments with rare gaps and lack of
significant preference of nucleotide composition. To extract such
segments, the entire collection of the 409 multi-aligned sequences were
first transformed into one frequency sequence in which each nucleotide
site corresponding to the site in the original alignment sits a 5-D vector
with each dimension represents the frequency of each NT type at that site.
Using dynamic programming method, this frequency sequence was partitioned
into pre-specified number of segments such that the sum of variance of
each segment is minimized.

The 3982 nucleotides long multiple alignments composed of 409 small
subunit rRNA sequences were divided into 25 segments with their length
varied from 34 to 367 nucleotides. Intermediate to high frequencies of gap
occurrence were observed in 18 of the segments which suggest poor
conservation of these sequence fragment among organisms. However, there
were 7 segments in which the frequency of gap occurrence is lower than the
average frequency of the nucleotide, suggesting that these segment are
present in most of the organisms and potentially correspond essential
functional units.

Through the above mentioned segmentation of the frequency sequence, the
informative segments along the multialignment, characterized by their low
probability of gap occurrence, were delimitated. These mini alignment
blocks, further condensed by removing the nucleotide sites where gap
occurrence is higher than 30%, were independently used for sequence
clustering.

Clustering of the 409 rRNA sequences on each segment of the alignment is
performed with a seriation procedure that leads to the global maximum of a
corresponding "minimum slit" set function, in this case, the dissimilarity
between a sequence of the set and the consensus sequence of the set. The
problem is formulated as a combinatorial optimization problem whose
criterion has a particular property called quasi-convexity, which
facilitates fast polynomial procedures to solve the problem.

Although the clustering was carried out independently on 7 different
segments along the multi-aligned sequences, the resulting partitioning of
the sequences showed significant similarity. The sizes of the resulting
clusters determined from seven different segments ranged from 284 to 361.
When each sequence was labeled by a pattern determined by its presence or
absence in any one of the seven clustering results, we observed that the
249 of the sequences have the pattern that corresponds to presence in all
seven clusters although there are all together 27 possible patterns
existed. The predominance of this intersection over all other combinations
of sequence distribution pattern in the multiple clustering results
indicated a significant consistency of clustering using informative
sequence segments. It is also an effective procedure to improve clustering
on each individual segment.

Even when all the minor distribution patterns were accounted, all 409
sequences falls into 33 patterns (with 4 patterns have greater than 10
sequences). This is a great reduction of randomness compared with the 128
patterns that could be resulted from seven independent clustering. This
reduction, together with the dominance of intersection during combination
of all separate clusters leads to a clear interpretation: the introduction
of segmentation prior to clustering greatly reduce the interference of
random information from non-conserved or nonessential sequences fragments
interspersed in the gene sequence.

10.

Philip E. Bourne, San Diego Supercomputer Center,
Dept. of Pharmacology, Univ. of CA, CA

"Clustering of Protein Structures using a Combinatorial
Extension Algorithm"

Finding and aligning the 3-dimensional structures of proteins with low
sequence homology yet similar folds is a non-trivial problem. The
issues are what constitutes a "good" alignment and the computational
tractability of the problem when faced with approximately 1000 unique
protein chains within the current corpus. A new approach - Combinatorial
Extension (CE) of the optimum path will be introduced and compared to
other methods. Using CE and 24,000 processor hours on a Cray T3E a
database has been constructed comparing the 3D structures of all protein
chains. The database provides a good opportunity to explore protein fold
space. Current work is refining the algorithm to take into account the
functional characteristics of particular protein families in an effort to
refine the alignments by protein family. The complete database of
alignments is available at http://cl.sdsc.edu/all-to-all/all-to-all.html

11.
Leonid I. Perlovsky, Nichols Research Corp., MA

"Clustering With Model-Based Neural Network"

An internal model is considered an essential ingredient of an
intelligent system and of intelligence in general. In the past,
however, attempts to combine clustering with adaptive model estimation
led to combinatorial explosion of computational complexity. The reason
for this was that the model estimation problem was formulated for a
particular clustering result, while clustering depends on the model.
Therefore multiple combinations of models and clustering had to be
evaluated in the solution process. I will describe a method of
performing both, model estimation and clustering concurrently. This
mathematical technique is implemented in a parallel architecture and is
called model-based neural network.
Neural networks with complicated internal models can serve at
intermediate processing levels between pure "connectionist" and
"symbolic" approaches. They can serve as intermediaries between signal
samples and concepts or symbols. This is achieved by having each
concept or symbol modeled with a sub-model of the internal model.
Adaptive models lead to a dynamical adaptive mathematical description of
symbols. The internal model is thus a compositional model composed of
submodels-symbols. These dynamic symbols-submodels better correspond to
the notion of symbol developed in psychology, semiotics, and
neurophysiology than static symbols of classical "symbolic AI".
Several application results will be presented indicating significant
improvements over classical techniques.

12.

Vladimir Grishin, View Trends Int., OH

"Combinatorial visual clustering of multidimentional data"

A multivariate data clusterization usually bases on a search of
commonalities and differences of data vectors allowing to find these vectors
groups making sense for problem solving. A human vision has unique
possibilities to do that for reality pictures. Many years the author
is developing "pictorial" methods to use this human vision power for a
clustering of multivariate data by means of different transformations of
initial data to pictures. They have to represent these data as wholistic
images which can be visually analysed and described with hierarchical system
of hundreds of local and integral features of shapes, colors, textures,etc. To
get such representations and effectively use them for a data clustering it is
needed to solve some combinatorial problems.
First of them is an algorithm synthesis and its effectiveness
estimation for permutation rows, columns and cells on matrix representations
of data. In the simplest case it appears when each number of some data matrix
is represented by a certain combination of color, intensity, texture,etc.
These color picture can be very inconvinient for visual analysis because of
many small separate spots which can not be connected visualy in some shapes
allowing to effectively apply the hierarchy of visual features. A simple
smothing is not effective enough in this case and main tools to get good
pictures are permutations of columns, rows and cells, whatever problem allows.
They can create much more connective image than initial one. Man-machine
algorithms based on computer analysis of special graph representation of a
data matrix and a visual analysis of intermediate steps of permutations
to choose a right next step are considered in this report. Shown that for real
data set an acceptable value of a picture connectivity criteria can be
provided by this algorithms for reasonable amount of steps.
Second combinatorial problem considered is about an estimation of
possible lengthes of disjanctive-conjunctive data vectors dichotomies visually
detected by polar and reactangular developments of these vectors. It is shown
that binary projections of many multidimensional data have pretty short
dichotomical solutions.

13.
Nickolai Alexandrov, Amgen, Inc., CA

"The number of domain families"

Protein domains are grouped into families by
sequence similarity and 3D structural resemblance.
Analysis of the domain clustering allows us to make
predictions on the total number of sequence families
and different structural folds. Presented results are
based on PFAM database of protein families and SCOP
classification of protein folds.

14.
Sandor Vajda, Department of Biomedical Engineering,
Boston University, MA

"Analysis and design of receptor-ligand interactions"

Docking and design are the major computational steps toward understanding
and affecting receptor-ligand interactions, and thus are important both
for the analysis of cellular processes and for the development of drugs.
Docking can be reduced to the problem of finding the global
minimum (and possibly some local minima) of an appropriate target function
over a space that includes rotational, translational, and possibly
conformational degrees of freedom. Design expands the search to a space
that also includes some description of the chemical structure of
the potential ligand. We will discuss how to select target functions
and search algorithms in frequently occurring applications. In particular,
we describe an extension of the dynamic programming algorithm
for docking and design of flexible molecules.

15.

Simon Streltsov, Alphatech Inc., MA

"Making heuristic clustering algorithms globally optimal"

Clustering algorithms can be viewed as a class of non-convex
optimization problems characterized by very large computational
complexity and a special structure of the objective function. Many
algorithms find an approximate
solution by combining an initial heuristic approximation with subsequent
local optimization. We consider a global optimization problem over the
parameterized initial approximation, and describe a methodology that
finds a trade-off between quality of the solution and the respective
costs of initial approximation and local optimization algorithms. This
global optimization methodology is based on Brownian motion
approximation of the non-convex objective function combined with optimal
on average index policy of selection between competing search processes.
As an example of the proposed approach, we consider feature-based
divisive hierarchical clustering algorithm with initial approximation
computed using 1-D clusters along each feature.
16.

Saveli Goldberg, MediSpectra Inc., MA

"Definition of anti-syndrome and empirical data
classification problems"

Minimal anti-syndrome of A-class of the objects is a collection of signs. This
collection does not occur with any object of class A and any sub-collection of
this collection occurs with same objects of class A. Set of minimal
anti-syndromes of class A is considered as a pattern of class A.
Metrics functions and pseudometricses on a set of classes are obtained on the
base of sets of minimal anti-syndromes. We present a procedure of modeling
of the universe of the class A with the help of minimal anti-syndrome set
manipulations. The solutions are illustrated on some applied examples.

17.

Peter Hammer, RUTCOR, Rutgers Univ., NJ

"Logical analysis of data in application"

18.

Iosif I. Vaisman, Alexander Tropsha, and Weifan Zheng,
Univ. of North Carolina, NC

"Compositional Preferences in Quadruplets of Nearest
Neighbor Residues in Protein Structures"

Three-dimensional structure and amino acid sequence of proteins are
related by an unknown set of rules that is often referred to as the
folding code. This code is believed to be significantly influenced by
nonlocal interactions between the residues. A quantitative description
of nonlocal contacts requires the identification of neighboring
residues. We applied statistical geometry approach to analyze the
patterns of spatial proximity of residues in known protein structures.
Structures from a dataset of well resolved nonhomologous proteins with a
single point representation of residues by Ca atoms were tessellated
using Delaunay algorithm. The Delaunay tessellation generates an
aggregate of space-filling irregular tetrahedra, or Delaunay simplices.
The vertices of each simplex objectively define four nearest neighbor Ca
atoms and therefore four nearest neighbor residues. Compositional
analysis of Delaunay simplices reveals highly nonrandom clustering of
amino acid residues in protein structures. Relative abundance or
deficiency of residue quadruplets with certain compositions reflects
propensities of different types of amino acids to be associated or
disassociated in folded proteins. The likelihood of occurrence of four
residues in one simplex displays strong nonrandom signal also in the
case of a reduced amino acid alphabet. We used several different
three-letter alphabets based on the residue chemical and structural
properties and on the complementarity of the corresponding codons. In
all cases the clustering of residues correlates with their properties or
genetic origin. The results of this analysis are being implemented in
algorithms for protein structure classification and prediction.

19.

Ying Xu, Computational Biosciences Section, Life Sciences
Div., ORNL, TN

"Protein threading by combinatorial optimization methods"

Computational recognition of native-like folds of an anonymous amino acid
sequence from a protein fold database is considered to be a promising approach
to the three-dimensional (3D) fold prediction of the amino acid sequence. We
present a new method for protein fold recognition through optimally aligning an
amino acid sequence and a protein fold template ({\it protein threading}). The
fitness of aligning an amino acid sequence with a fold template is measured by
(1) the singleton fitness, representing the compatibility of substituting one
amino acid by another and the combined preference of secondary structure and
solvent accessibility for a particular amino acid, (2) the pairwise
interaction,
representing the contact preference between a pair of amino acids, and (3)
alignment gap penalties. Though a protein threading problem so defined is known
to be NP-hard in the most general sense, our algorithm runs efficiently if we
place a cutoff distance on the pairwise interactions, as many of the existing
threading programs do. For an amino acid sequence of size $n$ and a fold
template
of size $m$ with $M$ core secondary structures, the algorithm finds an optimal
alignment in $O(Mn^{1.5C+1} + m n^{C+1})$ time and $O(Mn^{C+1})$ space, where
$C$
is a (small) nonnegative integer, determined by a particular mathematical
property
of the pairwise interactions. As a case study, we have demonstrated that $C$ is
less than or equal to 4 for about 75\% of the 293 unique folds in our protein
database, when pairwise interactions are restricted to amino acids $\le$ 7\AA\
apart (measured between their $beta$ carbon atoms). An approximation scheme is
developed for fold templates with $C > 4$, when threading requires too much
memory and time to be practical on a typical workstation.

20.

Leonard J. Schulman
Georgia Inst. Technology

Quadratic Euclidean Clustering

We present an algorithm which partitions a set of points in R^d into
two blocks, so as to minimize the sum of the squares of the Euclidean
distances between pairs of points in the blocks. The algorithm runs in
time $O(n^{d+1} \log n)$. An extension of the method gives a
polynomial time solution of the analogous problem for any constant
number of blocks.

21.

Franco Armando Bignone, Dep. of Preclinical Oncology,
Lab. of Experimental Oncology, National Cancer Institute,
Italy

"A case study in genetic networks dynamics: gastrulation in
Caenorhabditis elegans as a model system"

The main problem facing biology in the near future, with the end of
several sequencing projects approaching or already accomplished, and a
vast amount of information on biochemical interactions becoming available,
will be an understanding of the context in which information stored in
the DNA is expressed and regulated. This detailed description is going
to be necessary in order to understand how the molecular level is
influenced, and in turn give rise, to the observed higher dimensional
cellular structures.

While decoding of the physical structure of Nucleic Acids is now, in several
instances, routine, the study of how the interaction between the DNA and
its microenvironment give rise to the life forms we observe has still a long
way to go, both in terms of pure description and in the study of general
physico-chemical mechanisms involved.
This is particularly true if we aim at intervention on the scale of
organisms -- i.e. modifications of organisms with predictive capabilities --.
The reasons behind this difficulty are partially due to the complexity of
genetic networks and metabolic networks which give rise to the observed
behavior, and partially to the spatial distribution and organization
of proteins and other molecules involved.

In Molecular Biology and in Chemistry, during the last thirty years, there
has been a thin but steady line of attempts to deal with the problem of
developing an analytical description for complex chemical systems in which
multiple chemical species are involved. Moreover the study of this problem
has been mirrored, on a much wider base, in Physics, Chemistry, and to a
lesser extent Biology, with the study of spatially extended structures
which emerge spontaneously in systems away from equilibrium.

Thus in dynamical systems theory several tools have been introduced recently,
for modeling complex dynamics emerging from the spatial coupling of many
degrees
of freedom. These include: Coupled Maps Lattices, Cellular Automata, and
Lattice
Gas Cellular Automata. These results have fostered a new development of this
area. Taken together, and applied depending on the situations, these concepts
bring new possibilities for the study of biological systems both for pure andapplied research.

In order to study some general aspects in the distribution of regulatory
signals
among cells we have studied recently the main dynamical features of a Coupled
Maps Lattice that mimics the evolution in time of simple genetic networks in
regular cellular clusters. Each cell has been represented by the same network
of
genes whose protein products determine mutual interactions both in space and
time.
The global dynamics -- quantified by the mean concentrations of the products on
the
overall space --, and the symbolic dynamics -- activity-inactivity patterns of
the genes --, have been analyzed in the different regimes shown by the system
in certain regions of the parameter space.

This kind of formalizations of Biological Systems as lattices of fixed size
can be important to study their basic dynamical properties, however, for a
more realistic study it should be taken into account that local
space-time dynamics set lattice's size and shape -- i.e. number of sites and
topological distribution of signal sources -- while biochemical signals
travel through it driving growth, death and/or cell movement.
This phenomenon results in a continuous coupling between the studied
dynamics and lattice structure. Biological phenomena that shows this kind
of plasticity are known to be quite general and occurs at early and late
stages of development, in several pathologies -- i.e. cancer progression --
and during homeostasis.

In order to undertake a detailed study of this class of phenomena we have
used the Nematode Caenorhabditis elegans which displays a stereotype cell
lineage during embryogenesis.
An important aspect of Biological Systems, viewed here from a dynamical
point of view, is that they are a complex example of symbolic encoding.
The obvious case of this behavior is DNA but this is not the only one.
In fact at an higher dimensional scale, this concept can be applied again.
Organisms such as C. elegans, or other organisms
such as Gastropods or Anellids, whith their stereotyped cell lineage
during development, give rise to a strict symbolic encoding of cells
generated from the zigote. This fact can be quite useful in order to
understand necessary parameter reduction, in order to achieve a correct
model building in this complex situation.

The work done until now has been a detailed description of the path of
development through time lapse cinematography. We have traced the path of
all the cells generated from the zigote in normal embryos and mutants,
from the first cell division up to the end of gastrulation -- a total
of 370 cells --.
The results of this study indicate that early stages of development and
tissue organization from a conceptual point of view can be studied as a
problem similar to a dynamic protein folding. In this case the study is
complicated by the fact that the behavior of every element of the "folded
chain" - cells - has possibility of movement and characteristics that
increase the complications already present in classical problems of protein
folding. At the same time several conceptual tools applied to proteins,
such as for example contact maps, can be adapted to the study of this
problem from a dynamical point of view.

Keywords: Gene expression, Genetic Networks, Caenorhabditis elegans
Coupled maps lattices, Cellular automata, Grammars, Graphs.

References

F.A. BIGNONE, R. LIVI, M. PROPATO, Long transients dynamics in biochemical
networks, to appear, {\sl Il Nuovo Cimento D}, 1988.

F.A. BIGNONE, Coupled maps lattice dynamics on a variable space for
the study of development: a general discussion on Caenorhabditis elegans,
to appear, {\em Theoretical Computer Science}, 1988.

F.A. BIGNONE, R. LIVI, M. PROPATO, Complex evolution in genetic networks,
{\em Europhys. Lett.}, {\bf 40}(5), 497-502, 1977.

F.A. BIGNONE, R. LIVI, M. PROPATO, Symbolic and global dynamics in
coupled genetic networks: evolution of artificial cell populations at the
edge and within chaotic transient behaviour, in: P. Husbands and I. Harvey,
editors, {\em Fourth European Conference on Artificial Life, ECAL97},
MIT Press, Cambridge, Massachusetts, 1997.

F.A. BIGNONE, Complex dynamics in coupled genetic networks.
Proceedings meeting, " La matematica e i problemi dell'ambiente,
della biologia e della medicina: aspetti modellistici, analitici e
computazionali", Urbino 29-31, 1996, {\sl Studi Urbinati}, {\bf 1},
159-168, 1997.

F.A. BIGNONE, R. LIVI, M. PROPATO, Dynamical stability and finite
amplitude perturbations in coupled genetic-networks,
{\em PHYSICA D}, {\bf 108}, 379-396, 1997.

F.A. BIGNONE, Cells-Gene interactions simulation on a coupled map
lattice, {\it Journal of Theoretical Biology}, {\bf 161}(2), 231-249,
1993.

22.

Teresa Przytycka, Rajeev Aurora, and George D. Rose
przytyck@grserv.med.jhmi.edu

"Protein Classification and Secondary Structure Sequence"

We investigate the question whether secondary structure sequences
embody enough information to allow for fold recognition and
classification. Here by a secondary structure sequence of a protein we mean
the
sequence of secondary structure types and their lengths.

To compute similarity score between two proteins we align their
secondary structure sequences using a dynamic programming algorithm.
The alignment algorithm is similar to the standard sequence alignment
algorithm with the modification required by the fact that an
individual element of a secondary structure sequence corresponds to
a whole secondary structure not to a single residuum.

By computing similarity score for each pair of proteins in a test set
we obtain a distance matrix. (We use here the term ``distance" in an informal
meaning - we do not require that the triangular inequality is satisfied).
Given a distance matrix we apply a clustering algorithm to construct
a "similarity tree".

We compare the resulting tree with the SCOPE classification and VAST similarity
search.
Our computational experiments suggests that alignment of secondary structure
sequence can be used as a tool for protein classification. The resulting
hierarchy would be related to but not identical to SCOP classification. In
particular, proteins with the same folds (including the number of strands and
helices)
are clustered together. However, since our method treats all secondary
structures equally, the proteins with significantly different number of
strands/helices may not cluster together even if they are considered to belong
to the same fold family as defined by SCOPE.

By comparing our similarity with the one computed from VAST data we observe
that most of the proteins whose sequence distance is below 0.2 are also
considered to have same fold by VAST and vice versa. However there are some
proteins form the same SCOPE family which we identify as having the same fold
but VAST does not as well as proteins that VAST cluster together while our
method mixes with proteins form other SCOPE families that have similar number
of
strands. This is still remarkable, since unlike VAST our method uses no
information about a fold other than the secondary structure sequence.

23.
Anders Wallqvist, Yoshifumi Fukunishi, Addi Fadel
and Ronald M. Levy.
Department of Chemistry, Rutgers University

Protein Fold Recognition based on Secondary Structure Similarities.

Sequence alignment techniques are very powerful tools for identifying
the fold of protein sequences. Partial structural information of protein
sequences can be built into a general alignment technique to augment
existing alignments procedures. In this work we investigate the
possibility of identifying protein families by aligning
secondary structure sequences. We have carried out an extensive analysis
of how secondary structure alignments can be used for fold recognition
when the amino acid sequence identities (as determined by structurally
aligned protein pairs) within the folding families is low.
We identify a "twilight zone" for fold recognition based on secondary
structure alignments at ~65% secondary structure identity as compared
with ~21% amino acid identity using the standard amino acid sequence
alignment approach. The success rate for identifying protein folds
over a small database of 455 proteins divided into 86 families is the
same (~86%) for either secondary structure or amino acid alignments.
Thus the known secondary structure sequence carries enough information
to discriminate folds, i.e. the sequential arrangement of secondary
structure elements is a strong fold determinant. This work explicitly
provides the rationale for alignment protocols which optimize amino acid
and secondary structure identities simultaneously. If time permits we
will briefly discuss the results of our collaborative efforts with
Prof. Kulikowski and co-workers to construct RASSP, a Representative and
Aligned Secondary Protein Database.

24.
Steven E. Brenner, Stanford University, CA

"Recognizing Distant Evolutionary Relationships
from Protein Sequence and Structure"

Because evolutionarily related proteins have a shared history, they will
have the same structure and are likely to share similar functions.
Consequently, methods of detecting homology may be used to characterize
newly discovered proteins, such as those from genome projects. Indeed,
sequence comparison is the principal computational tool underlying
structural and functional genomics. Because structure provides the most
powerful means of detecting homology, the scop: Structural Classification
of Proteins database organizes all known protein structures according to
their evolutionary and structural relationships. The scop data have been
employed to assess methods of identifying homology from sequence.
Analysis of the results has allowed us to optimize automated sequence
comparison methods to detect far more relationships, while retaining
extremely high accuracy.

25.
Richard Fine and Boris Klebansky, Paradygm
Technologies Inc., NJ

"Topographical and Biochemical Maps of Protein Surfaces:
Recognizing Similarity and Complementarity"

One of the more astounding accomplishments of living cells is
their ability to orchestrate and control specific chemical interactions
between tens of thousands of proteins in the same reaction mixture. The
specificity of its interactions is imparted by a protein^�s
three-dimensional structure, which provides a rich and complex set of
topographical and chemical features on protein surfaces whose match
dictate cognate specificity. The biochemical nature of these features
is only poorly sampled at the primary sequence level, while their
topographical nature is not sampled at all.

The current accumulation of structural information in the Protein
Databank (PDB) as organized into families and super-families in SCOP
provides a rich set of structural information which can be used to
examine similarities and differences between biologically interesting
surface features of proteins. We are implementing a general suite of
tools and an underlying database for the characterization, comparison,
and query of known biochemical and topographical features on protein
surfaces. The surface is represented by an hierarchical tree of tokens
which characterizes features on the surface at different levels of
resolution. This representation is coupled to a search engine for
identifying both similar and complementary patches of surface in the
collection of existing proteins in the PDB. The structure of the
database has been designed to support learning systems to identify key
features such as enzyme active site similarities and differences within
protein families. Coupled to family sequence alignments this database
suggests a plethora of practical applications including the design and
interpretation of site directed mutagenesis experiments, the generation
of sequence motifs corresponding to conserved and differentiated surface
features within families of proteins, and the extension of existing
small-molecule database screening tools to differentially target one
among several related proteins. Methods will be discussed and examples
given of the application of these concepts.

26.
Donald Jacobs, Department of Phyics and Astronomy
Michigan State University, MI

"Real-time Protein Domain Evaluator and Design Tool"

In proteins it is possible to separate hard covalent forces involving
bond lengths and bond angles from other weak forces. The microstructure
of a protein is modeled as a generic bar-joint truss framework, where
the hard covalent forces and strong hydrogen bonds are regarded as
rigid bar constraints. The mechanical stability of a given protein
structure is then analyized using a combinatorial constraint counting
algorithm to identify the protein's Floppy Inclusion and Rigid
Substructure Topography (FIRST). The FIRST algorithm is based on a
generalization of Laman's theorem [1] on graph rigidity in the plane
and the extension of the 2D pebble game algorithm [2,3] to give a
complete characterization of generic rigidity in three dimensional
bond-bending networks [4]. Unlike many methods for parsing protein
folds, the FIRST algorithm gives exact mechanical properties of a
protein structure (and other macromolecules) subject to a certain set
of force-constraints. These properties include counting the number
of independent degrees of freedom, identifying rigid clusters,
overconstrained regions and low energy collective motions. The FIRST
algorithm can be used in real time; for example, it can analize a
protein consisting of 1000 residues in less than a second on a
Pentium PC. To achieve this computational speed, all imposed
constraints are treated equally as infinitely rigid. The selection
of the imposed set of constraints is determined by introducing
a cut-off criteria on the interaction strength of hydrogen bonding.
As the cut-off criteria is relaxed, a hierarchical characterization
of the mechanical properties can be obtained. The approach used here
has a high potential to be developed into a 3D-structural analysis
and design tool for evaluating protein domains and identifying
conformational flexibility.

[1] G. Laman, "On graphs and rigidity of plane skeletal structures,
J. Eng. Math 4, 331 (1970)

[2] D. J. Jacobs and M. F. Thorpe, "Generic Rigidity: The Pebble Game",
Phys. Rev. Lett. 75, 4051-4054 (1995)

[3] D.J. Jacobs and B. Hendrickson, "An Algorithm for Two Dimensional
Rigidity
Percolation: The Pebble Game", J. Comp. Phys. 137, 346-365, (1997)

[4] D. J. Jacobs, "Generic Rigidity in Three Dimensional Bond-bending
Networks", Preprint Aug (1997), submitted to J. Phys. A. (Math. and
Gen.)

27. Manfred Zohn.

Protein Domain Prediction in the context of Genome Annotation

Manfred Zorn, Lawrence Berkeley National Laboratory, MDZorn@lbl.gov

Genome sequencing is gearing up to finish the sequencing the human genome
while tools to understand sequence are lagging behind. The Genome Annotation
Consortium is developing a framework for high-throughput genome annotation
that combines traditional sequence analysis with machine learning
techniques, data mining, and visualization to facilitate extracting
knowledge from the sequence information. In the talk I will focus on the
protein classification work and the framework developments being done at
LBNL as well as related ongoing and future collaborations.

Other Workshops

DIMACS Homepage

Contacting the Center
Document last modified on April 13, 1998.