This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).
Title: Recombinations-based Population Genomics
We will summarize a set of hot problems in (human) population genetics, a field that underlies much of the current development in genomic sciences, biomedicine and biology at large. These problems range from using population genetics to redraw population history to reconstructing genome sequences, with many implications for disease genetics. One key problem is the inference of recombinations from SNP data (or, in the future, sequence data) in a set of chromosomes. This inference has been implemented in a system, IRiS, and the results yield a new kind of information for population genetics.
In particular, we will discuss our study of human population diversity using evidence of past recombinations (termed recotypes) as genetic markers. Our inferred recombinations indicate strong agreement with past in vitro and in silico recombination rate estimates. The correlation between traditional allele-frequency-based distances and recombinational distances lends further credence to the study of population structure using recotypes. Furthermore, our results indicate that recotypes are more representative of the underlying population structure than the haplotypes they are derived from.
Title: Dimensionality reduction in the analysis of human genetics data
Dimensionality reduction algorithms have been widely used for data analysis in numerous application domains including the study of human genetics. For instance, linear dimensionality reduction techniques (such as Principal Components Analysis) have been extensively applied in population genetics. In this talk we will discuss such applications and their implications for human genetics, as well as the potential of applying non-linear dimensionality reduction techniques in this area.
Title: More powerful genome-wide association methods for case-control data
In case-control Single Nucleotide Polymorphism (SNP) data, there are three distinct sources of information about genetic association, and correspondingly three different tests: the Allele frequency, Hardy-Weinberg Disequilibrium (HWD) and Linkage Disequilibrium (LD) contrast tests. While all three tests are typically developed in a retrospective context, we show that prospective logistic regression models may be developed that correspond conceptually to the retrospective tests. This approach provides a flexible framework for conducting a systematic series of association analyses using unphased genotype data and any number of covariates. For a single stage study, two single-marker tests and four two-marker tests are discussed. The true association models are derived and they allow us to understand why a model with only a linear term will generally fit well for a SNP in weak LD with a causal SNP, whatever the disease model, but not for a SNP in high LD with a non-additive disease SNP. We investigate the power of the association tests using real LD parameters from chromosome 11 in the HapMap CEU population data. Among the single-marker tests, the allelic test has on average the most power in the case of an additive disease; but, for dominant, recessive and heterozygote disadvantage diseases, the genotypic test has the most power. Among the six two-marker tests, the Allelic-LD contrast test, which incorporates linear terms for two markers and their interaction term, provides the most reliable power overall for the cases studied. Therefore, our result supports incorporating an interaction term as well as linear terms in multi-marker tests.
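As a point of reference for the single-marker allelic test mentioned above, here is a minimal sketch of the textbook retrospective version: a 1-degree-of-freedom chi-square test on the 2x2 table of allele counts in cases and controls. It is not the prospective logistic regression framework of the talk. The function name and the genotype-count input format are assumptions made for illustration.

```python
import math


def allelic_chi_square(case_genotypes, control_genotypes):
    """Allele-frequency contrast test on a 2x2 table of allele counts.

    Each argument is a tuple (n_AA, n_Aa, n_aa) of genotype counts
    (hypothetical input format). Returns (chi_square, p_value), 1 d.f.
    """
    def allele_counts(genotypes):
        n_AA, n_Aa, n_aa = genotypes
        # Each individual carries two alleles; heterozygotes carry one of each.
        return 2 * n_AA + n_Aa, 2 * n_aa + n_Aa

    a, b = allele_counts(case_genotypes)      # cases: A alleles, a alleles
    c, d = allele_counts(control_genotypes)   # controls: A alleles, a alleles
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("degenerate table: a row or column total is zero")

    # Pearson chi-square for a 2x2 table, without continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / margins
    # Upper tail of a chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p = allelic_chi_square((30, 50, 20), (20, 50, 30))
    print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```

The test treats the two alleles of an individual as independent draws, which is one reason genotype-based or regression-based tests can behave differently for non-additive disease models.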
Title: Two (population genetics and phylogenetics) Solutions in Search of Killer Apps
Title: Statistical Alignment, Footprinting and Transfer of Knowledge
Methods of sequence analysis based on stochastic models of the insertion-deletion process (Statistical Alignment) have been a major and surprising success story of the last decade, after three decades of dominance of non-statistical optimization (similarity maximization) approaches. There are still many challenges to statistical alignment, both in terms of biological realism and in terms of the computational demands of increasingly large data sets. Combining statistical alignment with annotation techniques will have clear advantages and is only recently being explored. This talk will present recent advances in combining statistical alignment with the search for regulatory signals, an approach that unambiguously outperforms competing approaches.
Title: Efficient algorithms for ascertaining markers for controlling for population substructure
Human population substructure has traditionally been studied in human population genetics to make inferences about past evolutionary events, related either to demographic factors or to selective pressures. In recent years, however, it has acquired additional relevance because of its role as a putative confounding factor in epidemiological case-control studies, as well as its putative importance in the forensic field. So far, different statistical approaches have been proposed for controlling for this confounding effect, including the use of markers ascertained specifically for their informativeness in detecting population substructure (ancestry informative markers, or AIMs, also called ancestry sensitive markers or ASMs). This has been quite an active field of research, and several algorithms have been developed for ascertaining sets of AIMs, differing mainly in whether pre-defined clusters of individuals are considered and in the metric used to estimate the amount of population substructure, among other factors. Nevertheless, it has been shown that population substructure is a confounding factor only in quite particular scenarios, namely when the genetic variation covaries with the phenotypic variation. Therefore, it seems reasonable that markers ascertained for further use in case-control studies should incorporate this information. We have implemented an algorithm that ascertains markers that are associated (at the population level) with particular phenotypes, either discrete or continuous; results suggest that markers that covary geographically with the phenotype at the population level could indeed be associated with it.
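To make the notion of a substructure metric concrete, the sketch below ranks markers by the absolute allele-frequency difference (often called delta) between two pre-defined populations. This is one of the simplest informativeness measures and is used here only as an assumed stand-in: it is not the phenotype-aware algorithm described in the abstract. The marker names and frequencies are hypothetical.

```python
def rank_markers_by_delta(freqs_pop1, freqs_pop2):
    """Rank markers by |p1 - p2|, the allele-frequency difference between
    two pre-defined populations (larger delta = more ancestry informative).

    Both arguments map a marker name to the frequency of the same
    reference allele in the corresponding population.
    """
    if freqs_pop1.keys() != freqs_pop2.keys():
        raise ValueError("both populations must be typed at the same markers")
    deltas = {m: abs(freqs_pop1[m] - freqs_pop2[m]) for m in freqs_pop1}
    # Most informative markers first.
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical marker names and allele frequencies.
    pop1 = {"snp_a": 0.90, "snp_b": 0.50, "snp_c": 0.30}
    pop2 = {"snp_a": 0.10, "snp_b": 0.45, "snp_c": 0.60}
    for marker, delta in rank_markers_by_delta(pop1, pop2):
        print(marker, round(delta, 2))
```

A phenotype-aware ascertainment would additionally weigh how each marker's frequency covaries with the phenotype across populations, which this sketch does not attempt.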
Title: Imputation-based local ancestry inference in admixed populations
This is joint work with Bogdan Pasaniuc (ICSI Berkeley) and Justin Kennedy (UCONN).
Title: High-dimensional data-sets and the problems they cause
Perhaps more than any other scientific discipline, the biological
sciences are currently in the midst of a golden era of technological
advances. These advances are allowing us to collect data that are a
quantum leap better (i.e. more detailed) and bigger (i.e. genomewide)
than has hitherto been available. These data represent a gold-mine in
our efforts to understand the relationship between our genetic and
phenotypic make-ups, but they also introduce problems. Models that
were tractable for smaller data-sets become intractable in the new
era, explicit calculation often becomes impossible, and many analysis
methods begin to break down. We discuss these problems and illustrate
proposed solutions using examples drawn from applications in
population genetics.
Title: Human Population Genomics: Man, Woman, Birth, Death, Infinity, Plus Altruism, Cheap Talks, Bad Behavior, Money, God and Diversity on Steroids
Our ancestors became almost extinct twice, most recently about 40,000 to 60,000 years ago. At one point, the population had shrunk to as few as 4,000 individuals, but it expanded rapidly as humans migrated to other parts of the world and learned to farm and domesticate animals. The genomes of the current human population record this history as it has been molded by mutations (polymorphisms), migration, genetic drift and selection. The statistical distributions of genes and other genomic elements are hard to decipher since they mix a huge amount of diversity fueled by genetic drift, resulting from small populations and non-random mating, with significant differences that contribute to each individual's overall traits.
However, as we prepare to usher in the age of individualized medicine, we have to attack the underlying statistical analysis problem on several fronts: (1) Technology, (2) Systems Biology and Genetics, (3) Statistical Algorithms, and (4) Large-Scale System Building. My group has been engaged in developing a single-molecule sequencing technology (SMASH) and sequence assembly algorithms (SUTTA) to collect very high-quality haplotypic sequencing data from a large number of individuals. Using this data, we aim to catalog and understand how different polymorphisms (SNP, CNV, segmental rearrangements and possibly many others) originate and diffuse through the population. This will then lead to various novel non-parametric algorithms to model the stochastic processes that are modulated by population sizes, migration and mating patterns. This integrated technology can then be used to discover and exploit groups of genetic markers to drive the core recommender engine of individualized medicine. I will discuss various open problems related to this strategy and their possible solutions.
Title: Population genetic analyses of next-generation sequencing data
Low-coverage next-generation sequencing data pose special problems for population genetic analyses because of missing data and sequencing errors. We present some new methods for addressing these problems and show applications to the estimation of inbreeding coefficients, population-scaled mutation rates, frequency spectra and other statistics of interest to population geneticists. We illustrate with a number of applications in humans and other organisms.
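One common way to handle low coverage and sequencing errors is to work with genotype likelihoods rather than hard genotype calls. The sketch below uses a simple binomial read model with a symmetric per-read error rate and a Hardy-Weinberg prior. The model, the function names and the default error rate are assumptions for illustration, not the methods presented in the talk.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Likelihood of the observed read counts for genotypes carrying
    0, 1 or 2 copies of the alternative allele (binomial read model)."""
    n = ref_reads + alt_reads
    likelihoods = []
    for alt_copies in (0, 1, 2):
        frac = alt_copies / 2.0
        # Chance that a single read shows the alt allele: a true alt read
        # read correctly, or a true ref read flipped by a sequencing error.
        p_alt = frac * (1.0 - error_rate) + (1.0 - frac) * error_rate
        likelihoods.append(
            comb(n, alt_reads) * p_alt ** alt_reads * (1.0 - p_alt) ** ref_reads
        )
    return likelihoods


def genotype_posteriors(ref_reads, alt_reads, alt_freq, error_rate=0.01):
    """Posterior genotype probabilities under a Hardy-Weinberg prior
    with alternative-allele frequency alt_freq (assumed known here)."""
    q = alt_freq
    prior = [(1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2]
    likelihoods = genotype_likelihoods(ref_reads, alt_reads, error_rate)
    weighted = [l * p for l, p in zip(likelihoods, prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


if __name__ == "__main__":
    # One reference read and one alternative read: heterozygote favoured.
    print(genotype_posteriors(1, 1, alt_freq=0.2))
    # No reads at all: the posterior falls back to the prior.
    print(genotype_posteriors(0, 0, alt_freq=0.2))
```

With no reads the posterior equals the prior, which shows how missing data are absorbed naturally in a likelihood-based treatment instead of being discarded.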
Title: RECOMBINOMICS: Myth or Reality?
The talk is in two parts. In the first part we explore the general problem of reconstructability of pedigree history using a random graphs framework. How plausible is it to unravel the history of a complete unit (chromosome) of inheritance?
In the second part of the talk we discuss our approach to reconstructing the recombinational history of a sample of individuals. I will describe the underlying algorithms in a system called IRiS that we have used in studying population diversity.
Title: Haplotype clusters and imputed genotypes in diverse human populations
Shared descent of similar haplotypes from a common ancestor enables the
inference of haplotype phase from diploid genotypes and the imputation of
unmeasured alleles. This talk will examine a series of problems that
arise in human population genetics from the consideration of phasing and
imputation. Topics that will be discussed include (1) the development of
an encoding of haplotypes pointwise along the genome for use in
population-genetic analysis, (2) the measurement of the accuracy of
genotype imputation in diverse human populations, and (3) the evaluation
of the connection between imputation error and the power of
association-mapping studies.
Title: Genome shrinkage by elimination of duplicates
Over evolutionary time scales, genomes may expand and shrink considerably. A variety of environmental and functional selective forces have been adduced to account for these changes. One well-known mechanism for sudden expansion is whole genome doubling (WGD). Following an episode of WGD, gene duplicates are lost at a high rate through processes such as pseudogenization and deletion of chromosomal segments containing one or more genes, while intra- and interchromosomal rearrangement mechanisms redistribute chromosomal segments both large and small across the genome. The genome of the present-day descendant can be largely decomposed into a set of duplicated DNA segments dispersed among the chromosomes, with all the duplicate pairs exhibiting a similar degree of sequence divergence, and with single-copy segments interspersed among them. In this paper, we introduce approaches to analyzing the evolution of doubled genomes, based entirely on gene order evidence, in order to explain aspects of the gene loss process and to reconstruct the rearrangement steps leading from the doubled ancestral genome to the present-day descendant. This is based on the recently developed "Guided halving" algorithm and statistical analysis of "Conserved frames". We apply our methods to yeast, cereal and poplar genomes.
Title: Forensic DNA analysis and multi-locus match probability in finite populations: a fundamental difference between the Moran and Wright-Fisher models
A classical problem in population genetics, which is also of
importance to forensic science, is to compute the match probability
(MP) that two individuals randomly chosen from a population have
identical alleles at a collection of loci. At present, 11 to 13
unlinked autosomal microsatellite loci are typed for forensic use. In
a finite population, the genealogical relationships of individuals can
create statistical non-independence of alleles at unlinked loci.
However, the so-called product rule, which is used in courts in the
US, computes the MP for multiple unlinked loci by assuming statistical
independence, multiplying the one-locus MPs at those loci.
Analytically testing the accuracy of the product rule for more than 5
loci has hitherto remained an open problem.
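The product rule described above can be sketched as follows. Here a "match" is taken to mean that two unrelated individuals share the same diploid genotype at a locus, with genotype frequencies given by Hardy-Weinberg proportions. Both choices are assumptions of this sketch, and the allele frequencies are hypothetical.

```python
from itertools import combinations


def one_locus_match_probability(allele_freqs):
    """Probability that two unrelated individuals share the same diploid
    genotype at one locus, assuming Hardy-Weinberg proportions."""
    # Homozygote ii has frequency p_i^2, so two individuals match with p_i^4.
    homozygous = sum(p ** 4 for p in allele_freqs)
    # Heterozygote ij has frequency 2 p_i p_j.
    heterozygous = sum((2.0 * p * q) ** 2
                       for p, q in combinations(allele_freqs, 2))
    return homozygous + heterozygous


def product_rule_match_probability(loci):
    """Multi-locus match probability under the product rule: one-locus
    match probabilities are multiplied as if loci were independent."""
    result = 1.0
    for allele_freqs in loci:
        result *= one_locus_match_probability(allele_freqs)
    return result


if __name__ == "__main__":
    # Hypothetical allele frequencies at three unlinked loci.
    loci = [[0.5, 0.5], [0.2, 0.3, 0.5], [0.25, 0.25, 0.25, 0.25]]
    print(product_rule_match_probability(loci))
```

The multiplication step is exactly where the independence assumption enters; in a finite population, shared genealogy can make the true multi-locus value differ from this product.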
In this talk, I will describe how a flexible graphical framework can
be employed to compute multi-locus MPs analytically. I will consider
two standard models of random mating, namely the Wright-Fisher and
Moran models, and describe the computation of MPs for up to 10 loci in
the Wright-Fisher model and up to 13 loci in the Moran model. For a
finite population, I will show that the MPs for a large number of loci
predicted by the product rule are highly sensitive to mutation rates
in the range of interest, while the true multi-locus MPs are not.
Furthermore, I will show that the Wright-Fisher and Moran models may
produce drastically different MPs for a finite population, and that
this difference grows with the number of loci and mutation rates.
Although the two models converge to the same coalescent or diffusion
limit, in which the population size approaches infinity, I will
demonstrate that, when multiple loci are considered, the rate of
convergence in the Moran model is significantly slower than that in
the Wright-Fisher model. Hence, our work reveals a striking
fundamental difference between the two standard models of random
mating.
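For readers less familiar with the two models, the following sketch contrasts their update rules for a single biallelic locus in a haploid population without mutation. It only illustrates the basic dynamics and is not the multi-locus graphical framework of the talk. The function names and the haploid, mutation-free setting are assumptions.

```python
import random


def wright_fisher_step(count_a, pop_size, rng):
    """One Wright-Fisher generation: the whole population is replaced,
    each offspring picking its parent uniformly at random."""
    freq = count_a / pop_size
    return sum(1 for _ in range(pop_size) if rng.random() < freq)


def moran_step(count_a, pop_size, rng):
    """One Moran event: one uniformly chosen individual reproduces and
    one uniformly chosen individual (possibly the same) dies."""
    freq = count_a / pop_size
    parent_is_a = rng.random() < freq
    dead_is_a = rng.random() < freq
    return count_a + int(parent_is_a) - int(dead_is_a)


if __name__ == "__main__":
    rng = random.Random(1)
    pop_size = 100
    wf_count = moran_count = 50
    for _ in range(20):
        wf_count = wright_fisher_step(wf_count, pop_size, rng)
        # pop_size Moran events give one generation's worth of births.
        for _ in range(pop_size):
            moran_count = moran_step(moran_count, pop_size, rng)
    print("Wright-Fisher count:", wf_count, "Moran count:", moran_count)
```

In the Wright-Fisher model generations do not overlap, while in the Moran model the allele count changes by at most one per event.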
Title: Estimating human demographic parameters from DNA sequence data
We present a composite-likelihood framework for estimating
demographic parameters from DNA resequencing data. We then apply this
method to estimate split times and migration rates between different
populations. Our results suggest that population structure within
Africa is quite old, and likely predates the expansion of modern
humans to other continents. We also outline how this method can be
used to detect ancient admixture events (e.g., between modern humans
and 'archaic' human groups such as Neandertals), and speculate that
ancient admixture may be a common feature in the history of many
extant populations.
Title: A fatgraph model of protein structure
The CATH database is a hierarchical classification of protein domain
structures with four main levels. Classification on the top level is
relatively easy but already at the second level, manual work is
needed. We present a novel method for describing domain structures
based on concepts from algebraic topology. Using the locations of the
backbone atoms and the hydrogen bonds we create a combinatorial object
-- a so-called fatgraph -- which is then transformed into a
topological object. The topological object of our method does not
depend on any particular embedding in a Euclidean space, and this
leads to defining intrinsic quantities -- topological invariants -- of
protein domains. We have implemented algorithms to calculate these
quantities and other quantities of interest. We show some results for
classification of domain structures using topological invariants; even
simple classification schemes perform remarkably well. Apart from the
model's use in protein classification, it might eventually be used to
guide structure prediction and structural annotation of proteins.
Paul Marjoram, Keck School of Medicine, USC, USA
Bud Mishra, NYU, USA
Rasmus Nielsen, UC Berkeley, USA
Laxmi Parida, IBM T J Watson Research
Noah Rosenberg, University of Michigan, USA
David Sankoff, Department of Mathematics and Statistics, University of Ottawa
Yun S Song, UC Berkeley, USA
Jeff Wall, UC San Francisco, USA
Carsten Wiuf, Aarhus University, Denmark
Document last modified on April 14, 2009.