The accelerating accumulation of DNA and protein sequences poses challenges and provides opportunities in analyzing genomic organization and evolution.
Methods and concepts described in this talk provide means for assessing and interpreting heterogeneities within and between DNA sequences. We will focus on the following data: (1) patterns and anomalies of di-, tri-, and tetranucleotides; (2) phylogenetic reconstructions based on distance measures of dinucleotide relative abundances; (3) identification of exceptional peptides and oligonucleotides (e.g., rare and frequent words) in protein and genomic sequences; and (4) counts and spacings of various marker arrays such as specific words, purine tracts, regulatory motifs, nucleosome placements, and restriction targets.
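As an illustration of item (2), the sketch below computes dinucleotide relative abundances as the ratio of an observed dinucleotide frequency to the product of its component nucleotide frequencies, and then an average absolute difference between two sequences. It is a minimal sketch under stated assumptions: the function names are hypothetical, the input is assumed to contain only A, C, G, and T, and only a single strand is counted (a strand-symmetrized version would pool the sequence with its reverse complement).

```python
from collections import Counter
from itertools import product


def dinucleotide_relative_abundance(seq):
    """Return {XY: f_XY / (f_X * f_Y)} for all 16 dinucleotides.

    Assumption: seq contains only A, C, G, T and every base occurs at least once.
    """
    seq = seq.upper()
    n = len(seq)
    mono = Counter(seq)
    if any(mono[b] == 0 for b in "ACGT"):
        raise ValueError("every base must occur at least once")
    # Overlapping dinucleotide counts: positions 0..n-2.
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    rho = {}
    for x, y in product("ACGT", repeat=2):
        f_x = mono[x] / n
        f_y = mono[y] / n
        f_xy = di[x + y] / (n - 1)
        rho[x + y] = f_xy / (f_x * f_y)
    return rho


def abundance_distance(seq_a, seq_b):
    """Mean absolute difference of relative abundances over the 16 dinucleotides."""
    rho_a = dinucleotide_relative_abundance(seq_a)
    rho_b = dinucleotide_relative_abundance(seq_b)
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / 16
```

A matrix of such pairwise distances could then be passed to any standard tree-building procedure; that step is not shown here.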
Three classes of statistical functionals can aid in identifying and evaluating distinctive sequence features: (a) r-scan analysis, used in discerning anomalies (clustering, overdispersion, evenness) in the spacings of a specified marker along the sequence; (b) segmental quantile distributions, compared across genomic data sets or with appropriate reference distributions; (c) score-based sequence analysis, as a means of characterizing anomalies in sequence text and as applied in multiple sequence comparisons, in sensitivity measures of nucleotide distributions, and in gene predictions.
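To make method (a) concrete, the sketch below treats an r-scan as the distance spanned by r consecutive spacings between marker occurrences: an unusually small minimum suggests clustering, and an unusually large maximum suggests overdispersion. It is a rough illustration only. The function names are hypothetical, and the Monte Carlo comparison against uniformly placed markers is a stand-in chosen for simplicity, not the distributional theory the talk refers to.

```python
import random


def find_marker(seq, word):
    """Return start positions of all (possibly overlapping) occurrences of word."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions


def r_scans(positions, r):
    """Distances spanned by r consecutive spacings between sorted marker positions."""
    pos = sorted(positions)
    return [pos[i + r] - pos[i] for i in range(len(pos) - r)]


def min_r_scan_pvalue(positions, length, r, trials=1000, seed=0):
    """Estimate P(minimum r-scan <= observed) under uniform random placement.

    Assumption: markers are placed at distinct positions chosen uniformly
    at random in a sequence of the given length. Requires len(positions) > r.
    """
    observed = min(r_scans(positions, r))
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        simulated = rng.sample(range(length), len(positions))
        if min(r_scans(simulated, r)) <= observed:
            hits += 1
    return hits / trials
```

A small estimated probability for the minimum r-scan would point to clustering of the marker; the same scheme applied to the maximum would address overdispersion.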
In the first talk we will describe and apply methods (a) and (b). The second talk focuses on applications of method (c).
This presentation reviews the method of score-based sequence analysis, with the objectives of discerning distinctive segments in single sequences and identifying significant common segments in sequence comparisons. We will describe methods and results for both the theory and its applications. These include distributional theory for several high-scoring segments in single sequences, useful in identifying transmembrane segments; distribution formulas for general scoring regimes, useful in multiple sequence comparisons; and applications to predicting exons and genes in DNA sequences and to identifying distinguished charge patterns in protein sequences.
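The core computation behind score-based analysis of a single sequence can be sketched as follows: each letter is assigned a score, and the contiguous segment with the largest total score is located in one pass. This is a minimal sketch, assuming a scoring scheme whose expected score per letter is negative. The function name and the example scores are hypothetical, and the assessment of statistical significance of the resulting score is not shown.

```python
def max_scoring_segment(seq, scores, default=-1.0):
    """Return (best_score, start, end) of the highest-scoring segment seq[start:end].

    scores maps letters to scores; letters not listed receive `default`.
    Returns (0.0, 0, 0) if no segment has a positive score.
    """
    best_score, best_start, best_end = 0.0, 0, 0
    running, start = 0.0, 0
    for i, letter in enumerate(seq):
        running += scores.get(letter, default)
        if running <= 0:
            # A non-positive prefix can never help; restart after this position.
            running, start = 0.0, i + 1
        elif running > best_score:
            best_score, best_start, best_end = running, start, i + 1
    return best_score, best_start, best_end


# Hypothetical scores that reward positively charged residues.
charge_scores = {"K": 2.0, "R": 2.0}
score, start, end = max_scoring_segment("GGKKRKGG", charge_scores)
```

In this example the segment found is `"KKRK"` (positions 2 to 6, end exclusive) with score 8.0. Finding several high-scoring segments would require repeating the search on the remaining sequence or tracking successive excursions.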