Vineet Bafna, The Center for the Advancement of Genomics
Title: Haplotyping as perfect phylogeny
We consider the following problem: Given n genotypes, does there exist a set H of haplotypes such that each genotype is generated by a pair from this set, and this set can be derived on a perfect phylogeny? Recently, Gusfield (2002) presented a polynomial-time algorithm to solve this problem that uses established results from matroid and graph theory. In this work, we present an O(nm^2) algorithm for this problem using elementary techniques. We also describe a linear-space representation of all possible solutions, and provide a formula for counting the number of possible solutions.
(Joint work with Dan Gusfield, Giuseppe Lancia, and Shibu Yooseph)
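To make the phrase "each genotype is generated by a pair from this set" concrete, here is a minimal Python sketch. The encoding (0 or 1 for a homozygous site, 2 for a heterozygous site) and the function names are assumptions chosen for illustration; they are not taken from the talk, and the sketch does not implement the O(nm^2) algorithm itself.

```python
from itertools import product

def explains(h1, h2, genotype):
    """Return True if the haplotype pair (h1, h2) generates the genotype.

    Assumed encoding: 0 or 1 marks a homozygous site, 2 a heterozygous site.
    """
    for a, b, g in zip(h1, h2, genotype):
        if g == 2:
            if a == b:          # heterozygous site needs two different alleles
                return False
        elif a != g or b != g:  # homozygous site needs both alleles equal to g
            return False
    return True

def candidate_pairs(genotype):
    """Enumerate every unordered haplotype pair that generates the genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs
```

A genotype with k heterozygous sites has 2^(k-1) candidate pairs for k >= 1, which is why exhaustive search becomes impractical and a structural constraint such as a perfect phylogeny is useful.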
Andrew Clark, Cornell University and Celera/Applied Biosystems
Title: "Exhaustive enumeration and Bayesian phase inference"
A strategy for haplotype inference will be described using
graph-theoretic approaches to enumeration of admissible haplotype
phases, coupled with Bayesian statistical inference for assessing
likelihoods of alternative phasings. An application to the special
problem of inference of phase from trisomic data (Down syndrome) will
be presented.
Nancy Cox, University of Chicago
Title: How does choice of polymorphism influence estimation of LD and
mapping?
Fine-mapping and positional cloning studies may focus on only a subset of
polymorphisms identified within a region. In part this may reflect the
trade-off between information and economics. For example, for many of the
genetic analytic approaches commonly used in fine-mapping and positional
cloning, there is little value in typing polymorphisms that are in
near-perfect linkage disequilibrium (LD) with each other. Consequently,
investigators may focus initially on only those polymorphisms that have
unique patterns in some small subset of individuals used in screening
studies. While this strategy is economical and scientifically defensible
for many common analytic approaches, it precludes unbiased estimation of
the extent of LD in a region. Moreover, consequences for more
sophisticated multipoint LD mapping approaches considering extended
haplotypes are difficult to predict. We report here on studies of the
NIDDM1 region of 2qter and a region on chromosome 15 including CYP19 in
which we have made a systematic examination of the effects of this type of
polymorphism choice. While our conclusions are clearly limited because we
have studied only these two regions, our results suggest that multipoint
linkage disequilibrium mapping methods can be sensitive to choice of
polymorphism. In contrast, studies on the extent of LD in a region may be
less sensitive to this type of polymorphism choice, as long as all common
polymorphisms with unique patterns are included in analyses.
David Cutler, Johns Hopkins University School of Medicine
Title: Haplotype Inference in Random Population Samples
The application of statistical methods to infer and reconstruct linkage
phase in samples of diploid sequences is a potentially time- and
labor-saving method. The Stephens-Smith-Donnelly (SSD) algorithm is one
such method, which incorporates concepts from population genetics theory
in a Markov chain Monte Carlo technique. We applied a modified SSD method,
as well as the expectation-maximization and partition-ligation algorithms,
to sequence data from eight loci spanning >1.5 Mb on the human X
chromosome. We demonstrate that the accuracy of the modified SSD method is
better than that of the other algorithms and is superior in terms of the
number of sites that may be processed. Also, we find phase reconstructions
by the modified SSD method to be highly accurate over regions with high
linkage disequilibrium (LD). If only polymorphisms with a minor allele
frequency >0.2 are analyzed and scored according to the fraction of
neighbor relations correctly called, reconstructions are 95.2% accurate
over entire 100-kb stretches and are 98.6% accurate within blocks of high
LD.
Peter Donnelly, University of Oxford
Title: Bayesian methods for statistical reconstruction of haplotypes
The talk will describe the general issues in Bayesian approaches to
haplotype reconstruction, including the distinction between choice of
prior distribution, and the computational methods used to approximate
posterior distributions. We describe recent improvements to the software
PHASE, including handling missing data and improved computational methods,
and compare its behaviour to some other published methods.
Dan Gusfield, University of California, Davis
Title: Combinatorial Approaches to Haplotype Inference
We have developed several distinct combinatorial approaches to the
haplotype inference problem. I will talk about a few of the most recent of
these approaches. One approach, the "pure parsimony" approach, is to find N
pairs of haplotypes, one for each genotype, that explain the N genotypes
and MINIMIZE the number of distinct haplotypes used. Solving this problem
is NP-hard; however, for data of reasonable size (larger than in general
use today), the "pure-parsimony" solution can be efficiently found in
practice. I will also talk about an approach that mixes pure-parsimony
with Clark's subtraction method for haplotyping. Simulations show that the
efficiency of both methods depends positively on the level of
recombination (the more recombination, the greater the efficiency), but the
accuracy depends inversely on the level of recombination. I will also
discuss practical ways to greatly boost the accuracy of Clark's
subtraction method, and identify haplotype pairs with high
confidence. This approach has been tested on molecularly determined data,
which will be published along with the method. Comparisons are made with
PHASE and HAPLOTYPER. I will also mention some recent developments in
haplotype inference that are based on viewing the problem in the context
of the perfect phylogeny problem. This builds on a near-linear-time
algorithm to determine whether genotype (unphased) SNP data is consistent
with the no-recombination, infinite sites coalescent model of haplotype
evolution. Stated differently, it determines whether there are haplotype
pairs for the genotypes that satisfy the 4-gamete condition for tree-form
evolution. The algorithm finds in linear time an implicit representation
of the set of all solutions to the problem. A detailed treatment of a
simple alternative algorithm for that problem will be given in the talk by
V. Bafna.
Parts of this work are joint with different collaborators, including
R.H. Chung, V. Bafna, G. Lancia, S. Orzack, V. Stanton and S. Yooseph.
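The 4-gamete condition mentioned above can be illustrated on haplotypes that are already phased. The following Python sketch is a naive check over all pairs of sites, not the near-linear-time algorithm for genotype data described in the abstract; the function names and the 0/1 encoding are assumptions made for this example.

```python
from itertools import combinations

def four_gamete_violations(haplotypes):
    """Return the site pairs (i, j) at which all four gametes 00, 01, 10, 11 occur.

    haplotypes: list of equal-length sequences of 0/1 alleles (assumed encoding).
    """
    n_sites = len(haplotypes[0])
    violations = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations

def satisfies_four_gamete_condition(haplotypes):
    """True if no pair of sites shows all four gametes."""
    return not four_gamete_violations(haplotypes)
```

For binary sites, a set of haplotypes passes this test exactly when the sites are pairwise compatible with a tree without recombination or recurrent mutation; the haplotyping problem asks whether the genotypes can be phased so that the test passes.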
Eran Halperin, ICSI and UC Berkeley
Title: Large Scale Recovery of Haplotypes from Genotype Data using Imperfect Phylogeny
Critical to the understanding of the genetic basis for complex diseases is
the modeling of human variation. Most of this variation can be
characterized by single nucleotide polymorphisms (SNPs) which are
mutations at a single nucleotide position. To characterize an
individual's variation, we must determine an individual's haplotype or
which nucleotide base occurs at each position of these common SNPs for
each chromosome. In this paper, we present results for a highly accurate
method for haplotype resolution from genotype data. Our method leverages
a new insight into the underlying structure of haplotypes which shows that
SNPs are organized in highly correlated "blocks". The majority of
individuals have one of about four common haplotypes in each block. Our
method partitions the SNPs into blocks and for each block, we predict the
common haplotypes and each individual's haplotype. We evaluate our method
over biological data. Our method predicts the common haplotypes perfectly
and has a very low error rate (0.47%) when taking into account the
predictions for the uncommon haplotypes.
Jun Liu, Harvard University
Title: Haplotype Inference and Haplotype Information
Haplotypes have become increasingly popular because of the abundance of
single nucleotide polymorphisms (SNPs) and the limited power of the
single-locus analyses. To contend with some weaknesses of the existing
haplotype inference methods, we propose new algorithms based on the
partition-ligation idea. In particular, we first partition the whole
haplotype into smaller segments. Then, we use either the Gibbs sampler or
the EM algorithm to construct the partial haplotypes of each segment and
to assemble all the segments together. Our algorithm can infer haplotype
frequencies rapidly and accurately for a large number of linked SNPs and
provides a robust estimate of their standard deviations. The algorithms
are robust to the violation of Hardy-Weinberg equilibrium and can handle
missing marker data easily. As a follow-up study, we also investigated two
related questions: how much the haplotype information contributes to
linkage disequilibrium (LD) mapping and whether an in silico haplotype
construction preceding the LD analysis can help. For simple disease gene
mapping our conclusions are as follows: (a) if a proper statistical model
is employed, the loss of haplotype information for either control or
disease data does not have a great impact on LD fine mapping, and
(b) haplotype inference should be carried out jointly with LD analysis to
achieve the most accurate location estimation.
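As a rough illustration of the EM step used within a single segment, the Python sketch below estimates haplotype frequencies from unphased genotypes. It is plain EM on a small number of sites, not the partition-ligation algorithm of the talk; it assumes Hardy-Weinberg proportions in the E-step, and the encoding (0/1 homozygous, 2 heterozygous) and function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def phasings(genotype):
    """All unordered haplotype pairs compatible with a genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    seen = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        seen.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(seen)

def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM, starting from uniform frequencies."""
    options = [phasings(g) for g in genotypes]
    haps = {h for opts in options for pair in opts for h in pair}
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for opts in options:
            # E-step: posterior weight of each phasing (Hardy-Weinberg assumed).
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in opts]
            total = sum(weights)
            for (a, b), w in zip(opts, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        n_chrom = 2 * len(genotypes)
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq
```

Because the number of phasings doubles with each heterozygous site, this direct approach is only workable for short segments, which is the motivation for partitioning the haplotype first.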
Dahlia Nielsen, North Carolina State University
Title: Multi-locus linkage disequilibrium and haplotype-based tests of association
The hope behind association mapping is to use linkage disequilibrium
(LD) as an indicator of proximity of a marker to a susceptibility
locus. This follows from the expectation that marker-phenotype
association is proportional to linkage disequilibrium, which is inversely
related to recombination. If there are more than two alleles at a locus
affecting risk, the association statistic is instead a weighted sum of
linkage disequilibria and genotypic susceptibilities. There is no longer
a simple relationship, even in expectation, with recombination. These
results extend to marker haplotypes. In addition to the pairwise
association terms of the single marker tests, marker haplotype
associations depend on the weighted sum of multi-locus disequilibria and
genotypic susceptibilities. Several tests of haplotype association are
presented here, along with a comparison of these tests within different LD
contexts.
Magnus Nordborg, University of Southern California
Title: The Pattern of Polymorphism on Human Chromosome 21
Polymorphism data from 20 partially resequenced copies of human chromosome
21---more than 20,000 polymorphic sites---are analyzed. The
allele-frequency distribution shows no deviation from the simplest
population genetic model with a constant population size (although we show
that our analysis has no power to detect population growth). The average
rate of recombination per site is estimated to be roughly one half of the
rate of mutation per site, again in agreement with simple model
predictions. However, sliding-window analyses of the amount of
polymorphism and the extent of linkage disequilibrium (LD) show
significant deviations from standard models. This could be due to the
history of selection or demographic change, but it is impossible to draw
strong conclusions without much better knowledge of variation in the
relationship between genetic and physical distance along the chromosome.
Jonathan Pritchard, University of Chicago
Title: Use of a local approximation to the ancestral recombination graph
for fine mapping disease genes
We describe a novel coalescent-based method for estimating the location of
a disease susceptibility locus. This is designed for the situation where
we have genotype data from a sample of cases and controls, in a region
that is believed to contain a disease mutation. For a given position on
the marker-map, we use the marker information of both cases and controls
to reconstruct local approximations of the ancestral recombination graph
using Markov Chain Monte Carlo. From this, we can compute the likelihood
of the phenotype data assuming a susceptibility gene at this position; the
procedure is repeated at a series of locations across the region to
estimate the posterior density.
Co-authors: Sebastian Zöllner and Jonathan Pritchard
Molly Przeworski, Max Planck Institute for Evolutionary Anthropology
Title: Insights into recombination from patterns of linkage disequilibrium
Recent studies of linkage disequilibrium (LD) have suggested that
(1) recombination rates vary tremendously across the genome, such that
large scale estimates of the recombination rate based on a comparison of
physical and genetic maps may not be informative about local patterns of
LD; (2) models of recombination that include gene conversion as well as
crossing-over better predict levels of LD at short scales; and (3) levels
of LD are lower in samples from sub-Saharan African populations than in
other population samples. These observations have important implications
for linkage-disequilibrium-based association studies; they suggest that large
areas of the genome can be tagged with few markers, and that genome-wide
studies and fine scale mapping efforts might best be conducted in
different populations. To examine the generality of these observations, we
analyzed over 80 data sets sequenced in 24 African-Americans and 23
individuals of European descent (data from
http://pga.mbt.washington.edu/). As an index of LD, we estimated the
population rate of crossing-over ρ in the two "population" samples. We
compared estimates of ρ (with and without gene conversion) to those
obtained from a comparison of genetic and physical maps. To gain a sense
of recombination rate variation at a small scale, we also considered how
much estimates of ρ vary along sequences in actual data compared to
simulated data.
Bruce Rannala, University of Alberta
Title: Joint Bayesian estimation of mutation location and age using
linkage disequilibrium
A non-random association of disease and marker alleles on chromosomes in
populations can arise as a consequence of historical forces such as
mutation, selection and genetic drift, and is referred to as "linkage
disequilibrium" (LD). LD can be used to estimate the map position of a
disease mutation relative to a set of linked markers, as well as to
estimate other parameters of interest, such as mutation age. Parametric
methods for estimating the location of a disease mutation using marker
linkage disequilibrium in a sample of normal and affected individuals
require a detailed knowledge of population demography, and in particular
require users to specify the postulated age of a mutation and past
population growth rates. A new Bayesian method is presented for jointly
estimating the position of a disease mutation and its age. The method is
illustrated using haplotype data for the cystic fibrosis Delta F508
mutation in Europe and the DTD mutation in Finland. It is shown that, for
these datasets, the posterior probability distribution of disease mutation
location is insensitive to the population growth rate when the model is
averaged over possible mutation ages (using a prior for age based on the
population frequency of the disease mutation). Fewer assumptions are
therefore needed for parametric LD mapping.
Kathryn Roeder, Carnegie Mellon University
Title: Evolutionary-based Association Analysis Using Haplotype Data
Association studies, both family-based and population-based, can be
powerful means of detecting disease-liability alleles. To increase the
information of the test, various researchers have proposed targeting
haplotypes. The larger number of haplotypes, however, relative to alleles
at individual loci, could decrease power because of the additional degrees
of freedom required for the test. An optimal strategy would focus the
test on particular haplotypes or groups of haplotypes, much as is done
with cladistic-based association analysis. First suggested by Templeton
and colleagues, such analyses use the evolutionary relationships among
haplotypes to produce a limited set of hypothesis tests and to increase
the interpretability of these tests. To more fully utilize the information
contained in the evolutionary relationships among haplotypes and in the
sample, we propose generalized linear models (GLM) for the analysis of
data from family-based and population-based studies. These models fully
account for haplotype phase ambiguity and allow for covariates. The
models are encoded into a software package, EHAP (for Evolutionary-based
Haplotype Analysis Package), which also provides for various kinds of
exploratory data analysis. The exploratory analyses, such as error
checking, estimation of haplotype frequencies, and tools for building
cladograms, should facilitate the implementation of cladistic-based
association analysis with haplotypes.
Russell Schwartz, Carnegie Mellon University
Title: Inferring Piecewise Ancestral History from Haploid Sequences
The determination of complete human genome sequences, subsequent work on
mapping human genetic variations, and advances in laboratory technology
for screening for these variations in large populations are together
generating tremendous interest in genetic association studies as a means
for characterizing the genetic basis of common human
diseases. Considerable recent work has focused on using haplotypes to
reduce redundancy in the datasets, improving our ability to detect
significant correlations between genotype and phenotype while
simultaneously reducing the cost of performing assays. A key step in
applying haplotypes to human association studies is determining regions of
the human genome that have been inherited intact by large portions of the
human population from ancient ancestors. This talk describes computational
methods for the problem of predicting segments of shared ancestry within a
genetic region among a set of individuals. Our approach is based on what
we call the haplotype coloring problem: coloring segments of a set of
sequences such that like colors indicate likely descent from a common
ancestor. I will present two methods for this problem. The first uses the
notion of "haplotype blocks" to develop a two-stage coloring
algorithm. The second is based on a block-free probabilistic model of
sequence generation that can be optimized to yield a likely coloring. I
will describe both methods and illustrate their performance using real and
contrived data sets.
Montgomery Slatkin, University of California, Berkeley
Title: Testing for differences in haplotype frequencies in case-control studies
The problem of testing for significant differences in haplotype
frequencies between a random sample of individuals (randoms) and a sample
of individuals with a genetic disease (cases) is considered. The
questions are (1) What is the statistical power in testing for differences
in haplotype frequencies? and (2) How much statistical power is lost when
haplotype phase cannot be resolved but instead must be inferred using a
maximum likelihood method? A likelihood ratio test of differences in
haplotype frequencies in randoms and cases is used, and the theory is
developed in terms of the non-centrality parameter of the non-central
chi-square distribution. If a causative allele has a multiplicative effect
on penetrance, and if sample sizes are large and the effect of the
causative allele is small, analytic expressions for the non-centrality
parameter are obtained when haplotypes can and cannot be resolved. The
loss in power is independent of the frequency of the causative allele and
can in general be compensated for by increasing the sample sizes by a
factor of two or less. For a dominant causative allele, numerical results
are obtained that show that the important features of the results for
multiplicative penetrance still are valid. We conclude that maximum
likelihood inference of haplotype frequencies and a likelihood ratio test
of differences in inferred frequencies can be useful in a case-control
setting.
Matthew Stephens, University of Washington
Title: Haplotypes, hotspots, and a multilocus model for Linkage Disequilibrium
Current methods for understanding the relationship between LD and
the underlying recombination rate are limited. The most common approach is
to compute a measure of LD between every pair of sites in the region, and
to form a graphical display of the results. However, it is typically
difficult to assess the significance of observed patterns. More
sophisticated coalescent-based statistical methods for estimating local
recombination rate from patterns of LD are either computationally
impractical for moderate-sized regions, or suffer from loss of information
by using only a summary of the data. Furthermore, they all assume constant
recombination rate, making them poor tools for studying local recombination
rates. Here we propose a novel computationally-tractable model for LD
across multiple loci. We apply this model to the problem of inferring
recombination rates from population data, and in particular to identifying
variation in the local recombination rate ("hotspots" and "coldspots") along
chromosomes. We outline how this model might be used to develop more
powerful methods for LD mapping.
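For reference, the "most common approach" described above, a measure of LD for every pair of sites, can be sketched as follows in Python. The sketch computes the standard statistics D, |D'| and r^2 from phased 0/1 haplotypes; it is not the multilocus model proposed in the talk, and the function name and encoding are assumptions for this example.

```python
def pairwise_ld(haplotypes, i, j):
    """Return (D, |D'|, r^2) between sites i and j of phased 0/1 haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at site i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at site j
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b
    # D' normalises D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

Evaluating this for every pair (i, j) gives the matrix that is usually shown as a graphical display; as the abstract notes, such a display by itself gives little indication of statistical significance.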
Fengzhu Sun, University of Southern California
Title: Dynamic programming algorithms for haplotype block partition and
applications to association studies
We develop a dynamic programming algorithm for haplotype block
partitioning to minimize the number of representative single nucleotide
polymorphisms (SNPs) required to account for most of the haplotype quality
in each block. The block quality is a function of the haplotypes defined
by the SNPs in the block. Any measure of haplotype quality can be used in
the algorithm and of course the measure should depend on the specific
application. The dynamic programming algorithm is applied to analyze the
haplotype data on chromosome 21 of Patil et al. Using the same criteria
as in Patil et al. (6), we identify a total of 3,582 representative SNPs
and 2,575 blocks, which are 21.5% and 37.7% fewer, respectively, than
those identified using the greedy algorithm of Patil et al. We also compare
the power of association studies using all SNPs, tag SNPs and the same
number of randomly chosen SNPs.
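The general shape of such a dynamic program can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation: block_cost is a hypothetical caller-supplied function standing in for "number of representative SNPs needed for this block", and toy_cost is an invented quality rule used only to make the example runnable.

```python
import math

def partition_blocks(n_snps, block_cost):
    """Split SNPs 0..n_snps-1 into consecutive blocks minimising total cost.

    block_cost(i, j) returns the cost of the block covering SNPs i..j-1,
    or None when that block fails the chosen quality criterion.
    """
    best = [math.inf] * (n_snps + 1)   # best[j]: minimum cost for SNPs 0..j-1
    prev = [None] * (n_snps + 1)       # prev[j]: start index of the last block
    best[0] = 0
    for j in range(1, n_snps + 1):
        for i in range(j):
            cost = block_cost(i, j)
            if cost is None or best[i] == math.inf:
                continue
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                prev[j] = i
    if best[n_snps] == math.inf:
        return None                    # no admissible partition exists
    blocks, j = [], n_snps
    while j > 0:                       # walk back through the stored block starts
        blocks.append((prev[j], j))
        j = prev[j]
    return best[n_snps], blocks[::-1]

def toy_cost(haplotypes, max_distinct=2):
    """Hypothetical quality rule: a block may contain at most max_distinct
    distinct haplotypes, and needs ceil(log2(distinct)) representative SNPs."""
    def cost(i, j):
        distinct = len({h[i:j] for h in haplotypes})
        if distinct > max_distinct:
            return None
        return math.ceil(math.log2(distinct)) if distinct > 1 else 0
    return cost
```

For example, partition_blocks(4, toy_cost(["0011", "0011", "1100", "1101"])) splits the four SNPs into the blocks (0, 3) and (3, 4). Any other quality measure can be substituted for toy_cost without changing the recurrence, which matches the abstract's remark that the measure should depend on the application.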
Elizabeth Thompson, University of Washington
Title: Genome sharing in small populations
The haplotype structure of a population derives from the recombination
events in meioses that are ancestral to current population members. In
very small populations of conservation importance, the extent of genome
survival depends also on recombination which distributes founder genome
across the population. Relative to a founder population, a junction is a
recombination point between genomes of different founder origins. In
comparing two extant genomes, the segments shared IBD will be bounded by
external junctions and may include internal junctions shared by both
genomes. Thus IBD tracts are made up of a random number of segments
bounded by junctions. A study of the process of internal and external
junction types along pairs of chromosomes sampled from a population leads
to new results on the variance of lengths of genome shared between
relatives. This research is based on work with Dr. N. Chapman.
Francisco de la Vega, Applied Biosystems
Title: Patterns of linkage disequilibrium across human chromosomes 6, 21, and 22
With the aim of developing a linkage disequilibrium (LD) SNP map to serve
as a resource for candidate-gene, candidate-region and whole-genome
association studies, we have genotyped >250,000 SNPs on 90 DNA samples (45
African-American, 45 Caucasian, unrelated) selected from the Coriell Human
variation collection. The individual genotypes thus generated have enabled
us to survey the patterns of LD and haplotype diversity across all gene
regions of the human genome. Here I describe the empirical results of the
first comparative study of the patterns of LD across three entire human
autosomes: Chromosomes 6, 21, and 22. We selected for the study a total of
17,966 SNPs covering more than 209 Mb of chromosomal segments, and
overlapping 2,266 predicted gene regions, with a minor allele frequency
greater than 10% in either population, and that were in Hardy-Weinberg
equilibrium (p>0.01). Several methods to define "haplotype blocks" were
applied to this dataset, including several forms of the D′ method and the
4-gamete rule. Haplotypes were then computationally inferred for the
markers within each block by the EM algorithm to assess haplotype
diversity. In addition, a subset of 277 SNPs spanning 4 Mb across the HLA
region on chromosome 6 was genotyped on 550 DNA samples of unrelated
individuals of European ancestry from north Germany, 93 samples from
Norway, and 77 samples from UK. We analyze the robustness of the different
haplotype block definitions, the differences between the population
samples, and the effect of sample size on the generalization of haplotype
blocks defined in one given population sample. Finally, I present the
preliminary results of haplotype-based power calculations for case-control
studies across the gene regions of these three chromosomes.
Jinghui Zhang, National Cancer Institute, NIH
Title: A Software System for Automated and Visual Analysis of Functionally
Annotated Haplotypes
We have developed a software analysis package, HapScope, which includes a
comprehensive analysis pipeline and a sophisticated visualization tool for
analyzing functionally annotated haplotypes. The HapScope analysis
pipeline supports: a) computational haplotype construction with an EM or
Bayesian statistical algorithm; b) SNP classification by protein coding
change, homology to model organisms or putative regulatory
regions; c) minimum SNP subset selection by either a Brute Force Algorithm
or a Greedy Partition Algorithm. The HapScope viewer displays genomic
structure with haplotype information in an integrated environment,
providing eight alternative views for assessing genetic and functional
correlation. It has a user-friendly interface for: a) haplotype block
visualization; b) SNP subset selection; c) haplotype consolidation with
subset SNP markers; d) incorporation of both experimentally determined
haplotypes and computational results; e) data export for additional
analysis. Comparison of haplotypes constructed by the statistical
algorithms with those determined experimentally shows variation in
haplotype prediction accuracies in genomic regions with different levels
of nucleotide diversity. We have applied HapScope in analyzing haplotypes
for candidate genes and genomic regions with extensive SNP and genotype
data. We envision that the systematic approach of integrating functional
genomic analysis with population haplotypes, supported by HapScope, will
greatly facilitate current genetic disease research.
Maoxia Zheng, University of Chicago
Title: Assessment of goodness of fit of models for block haplotype structure
Our aim is to formalize models for high-resolution haplotype structure in
such a way that they can be useful in statistical methods for LD
mapping. Some steps in that direction have been taken by Daly et
al. '01, who outline a hidden Markov model (HMM) that allows for common
haplotypes in each block. We propose somewhat different models that also
use HMM. In this talk, we address the problem of assessing goodness of
fit of particular models, where each model involves choices such as number
and positions of blocks and common haplotypes in each block. Our models
also allow for haplotypes in a block that are not one of the common
types. We discuss choice of goodness-of-fit statistic, parametrization,
and computational issues involved in assessing the fit of background LD
models to data.
This is joint work with Mary Sara McPeek (U. of Chicago).