DIMACS/RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotype Inference

November 21 - 22, 2002
DIMACS Center, Rutgers University, Piscataway, NJ

Organizers: Andrew G. Clark, Cornell University and Celera, Andy.Clark@celera.com; Sorin Istrail, Celera, Sorin.Istrail@celera.com; Michael Waterman, University of Southern California and Celera, msw@hto.usc.edu

Presented under the auspices of the Special Focus on Computational Molecular Biology.

Abstracts:

Vineet Bafna, The Center for the Advancement of Genomics

Title: Haplotyping as perfect phylogeny

We consider the following problem: given n genotypes, does there exist a set H of haplotypes such that each genotype is generated by a pair from this set, and such that this set can be derived on a perfect phylogeny? Recently, Gusfield (2002) presented a polynomial-time algorithm to solve this problem that uses established results from matroid and graph theory. In this work, we present an O(nm^2) algorithm for this problem using elementary techniques. We also describe a linear-space representation of all possible solutions, and provide a formula for counting the number of possible solutions.

(Joint work with Dan Gusfield, Giuseppe Lancia, and Shibu Yooseph)
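
As an illustration of the problem statement only, the following minimal sketch checks the two conditions a candidate haplotype set must meet: that a pair of haplotypes generates a given genotype, and that the set passes the four-gamete test, which for binary sites is the standard condition for fitting a perfect phylogeny. The 0/1/2 genotype encoding (2 = heterozygous) and the function names are assumptions made for this example; it is not the O(nm^2) algorithm of the talk.

```python
from itertools import combinations

def explains(h1, h2, genotype):
    """True if haplotypes h1 and h2 together generate the genotype.

    Assumed encoding: haplotype sites are 0 or 1; genotype sites are
    0 or 1 (homozygous) or 2 (heterozygous).
    """
    if not (len(h1) == len(h2) == len(genotype)):
        return False
    for a, b, g in zip(h1, h2, genotype):
        expected = a if a == b else 2
        if expected != g:
            return False
    return True

def passes_four_gamete_test(haplotypes):
    """True if no pair of sites shows all four gametes 00, 01, 10, 11."""
    if not haplotypes:
        return True
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True

# Toy data (hypothetical): three haplotypes over three sites.
haps = [(0, 0, 1), (1, 0, 0), (1, 1, 0)]
ok_pair = explains(haps[0], haps[1], (2, 0, 2))
ok_tree = passes_four_gamete_test(haps)
```

A full solver has to search over the haplotype pairs for every genotype at once; the check above only verifies a proposed answer.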

Andrew Clark, Cornell University and Celera/Applied Biosystems

Title: Exhaustive enumeration and Bayesian phase inference

A strategy for haplotype inference will be described using graph-theoretic approaches to enumeration of admissible haplotype phases, coupled with Bayesian statistical inference for assessing likelihoods of alternative phasings. An application to the special problem of inference of phase from trisomic data (Down syndrome) will be presented.

Nancy Cox, University of Chicago

Title: How does choice of polymorphism influence estimation of LD and mapping?

Fine-mapping and positional cloning studies may focus on only a subset of polymorphisms identified within a region. In part this may reflect the trade-off between information and economics. For example, for many of the genetic analytic approaches commonly used in fine-mapping and positional cloning, there is little value in typing polymorphisms that are in near-perfect linkage disequilibrium (LD) with each other. Consequently, investigators may focus initially on only those polymorphisms that have unique patterns in some small subset of individuals used in screening studies. While this strategy is economical and scientifically defensible for many common analytic approaches, it precludes unbiased estimation of the extent of LD in a region. Moreover, consequences for more sophisticated multipoint LD mapping approaches considering extended haplotypes are difficult to predict. We report here on studies of the NIDDM1 region of 2qter and a region on chromosome 15 including CYP19 in which we have made a systematic examination of the effects of this type of polymorphism choice. While our conclusions are clearly limited because we have studied only these two regions, our results suggest that multipoint linkage disequilibrium mapping methods can be sensitive to choice of polymorphism. In contrast, studies on the extent of LD in a region may be less sensitive to this type of polymorphism choice, as long as all common polymorphisms with unique patterns are included in analyses.

David Cutler, Johns Hopkins University School of Medicine

Title: Haplotype Inference in Random Population Samples

The application of statistical methods to infer and reconstruct linkage phase in samples of diploid sequences is a potentially time- and labor-saving method. The Stephens-Smith-Donnelly (SSD) algorithm is one such method, which incorporates concepts from population genetics theory in a Markov chain Monte Carlo technique. We applied a modified SSD method, as well as the expectation-maximization and partition-ligation algorithms, to sequence data from eight loci spanning >1.5 Mb on the human X chromosome. We demonstrate that the accuracy of the modified SSD method is better than that of the other algorithms and is superior in terms of the number of sites that may be processed. Also, we find phase reconstructions by the modified SSD method to be highly accurate over regions with high linkage disequilibrium (LD). If only polymorphisms with a minor allele frequency >0.2 are analyzed and scored according to the fraction of neighbor relations correctly called, reconstructions are 95.2% accurate over entire 100-kb stretches and are 98.6% accurate within blocks of high LD.
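
The scoring rule mentioned above can be illustrated with a short sketch. It assumes one plausible reading of "neighbor relations": for each pair of consecutive heterozygous sites in an individual, the inferred relative phase either matches the true relative phase or it does not. The function name and the 0/1 allele coding are assumptions for this example, and the abstract does not spell out the exact definition used in the study.

```python
def neighbor_relation_accuracy(true_pair, inferred_pair):
    """Fraction of consecutive heterozygous site pairs phased correctly.

    Each argument is a pair of equal-length 0/1 haplotypes for one
    individual. Both pairs are assumed to encode the same genotype.
    Returns None when fewer than two sites are heterozygous.
    """
    t1, t2 = true_pair
    i1, _ = inferred_pair
    het = [k for k in range(len(t1)) if t1[k] != t2[k]]
    relations = list(zip(het, het[1:]))
    if not relations:
        return None
    correct = 0
    for a, b in relations:
        # Relative phase: do the two sites carry the same allele on one strand?
        true_same = t1[a] == t1[b]
        inferred_same = i1[a] == i1[b]
        if true_same == inferred_same:
            correct += 1
    return correct / len(relations)

# Toy example (hypothetical): one switch error in the middle of four sites.
truth = ((0, 0, 0, 0), (1, 1, 1, 1))
guess = ((0, 0, 1, 1), (1, 1, 0, 0))
score = neighbor_relation_accuracy(truth, guess)
```

Because only relative phase between neighbors is compared, swapping the order of the two inferred haplotypes does not change the score.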

Peter Donnelly, University of Oxford

Title: Bayesian methods for statistical reconstruction of haplotypes

The talk will describe the general issues in Bayesian approaches to haplotype reconstruction, including the distinction between choice of prior distribution, and the computational methods used to approximate posterior distributions. We describe recent improvements to the software PHASE, including handling missing data and improved computational methods, and compare its behaviour to some other published methods.

Dan Gusfield, University of California, Davis

Title: Combinatorial Approaches to Haplotype Inference

We have developed several distinct combinatorial approaches to the haplotype inference problem. I will talk about a few of the most recent of these approaches. One approach, the "pure parsimony" approach, is to find N pairs of haplotypes, one for each genotype, that explain the N genotypes and MINIMIZE the number of distinct haplotypes used. Solving this problem is NP-hard; however, for reasonable-size data (larger than in general use today), the "pure-parsimony" solution can be efficiently found in practice. I will also talk about an approach that mixes pure parsimony with Clark's subtraction method for haplotyping. Simulations show that the efficiency of both methods depends positively on the level of recombination (the more recombination, the more efficiency), but the accuracy depends inversely on the level of recombination. I will also discuss practical ways to greatly boost the accuracy of Clark's subtraction method, and to identify haplotype pairs with high confidence. This approach has been tested on molecularly determined data, which will be published along with the method. Comparisons are made with PHASE and HAPLOTYPER. I will also mention some recent developments in haplotype inference that are based on viewing the problem in the context of the perfect phylogeny problem. This builds on a near-linear-time algorithm to determine whether genotype (unphased) SNP data is consistent with the no-recombination, infinite-sites coalescent model of haplotype evolution. Stated differently, the question is whether there are haplotype pairs for the genotypes that satisfy the 4-gamete condition for tree-form evolution. The algorithm finds in linear time an implicit representation of the set of all solutions to the problem. A detailed treatment of a simple alternative algorithm for that problem will be given in the talk by V. Bafna.

Parts of this work are joint with different collaborators, including R.H. Chung, V. Bafna, G. Lancia, S. Orzack, V. Stanton and S. Yooseph
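
To make the pure-parsimony objective concrete, here is a brute-force sketch that enumerates every explaining pair for each genotype and keeps the combination using the fewest distinct haplotypes. It assumes a 0/1/2 genotype encoding (2 = heterozygous), uses hypothetical function names, and runs in exponential time, so it is only an illustration of the objective on tiny inputs and not the practical method described in the talk.

```python
from itertools import product

def explaining_pairs(genotype):
    """All unordered haplotype pairs that generate a 0/1/2-coded genotype."""
    het = [k for k, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for k, b in zip(het, bits):
            h1[k], h2[k] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def pure_parsimony(genotypes):
    """Return (number of distinct haplotypes, one optimal choice of pairs)."""
    options = [explaining_pairs(g) for g in genotypes]
    best_count, best_choice = None, None
    for choice in product(*options):
        distinct = {h for pair in choice for h in pair}
        if best_count is None or len(distinct) < best_count:
            best_count, best_choice = len(distinct), choice
    return best_count, best_choice

# Toy input (hypothetical): one ambiguous genotype and two homozygotes.
count, choice = pure_parsimony([(2, 2), (0, 0), (1, 1)])
```

In this toy input the ambiguous genotype (2, 2) is resolved as 00/11 rather than 01/10, because 00 and 11 are already required by the two homozygous genotypes.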

Eran Halperin, ICSI and UC Berkeley

Title: Large Scale Recovery of Haplotypes from Genotype Data using Imperfect Phylogeny

Critical to the understanding of the genetic basis for complex diseases is the modeling of human variation. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs) which are mutations at a single nucleotide position. To characterize an individual's variation, we must determine an individual's haplotype or which nucleotide base occurs at each position of these common SNPs for each chromosome. In this paper, we present results for a highly accurate method for haplotype resolution from genotype data. Our method leverages a new insight into the underlying structure of haplotypes which shows that SNPs are organized in highly correlated "blocks". The majority of individuals have one of about four common haplotypes in each block. Our method partitions the SNPs into blocks and for each block, we predict the common haplotypes and each individual's haplotype. We evaluate our method over biological data. Our method predicts the common haplotypes perfectly and has a very low error rate (0.47%) when taking into account the predictions for the uncommon haplotypes.

Jun Liu, Harvard University

Title: Haplotype Inference and Haplotype Information

Haplotypes have become increasingly popular because of the abundance of single nucleotide polymorphisms (SNPs) and the limited power of single-locus analyses. To contend with some weaknesses of the existing haplotype inference methods, we propose new algorithms based on the partition-ligation idea. In particular, we first partition the whole haplotype into smaller segments. Then, we use either the Gibbs sampler or the EM algorithm to construct the partial haplotypes of each segment and to assemble all the segments together. Our algorithm can infer haplotype frequencies rapidly and accurately for a large number of linked SNPs and provides a robust estimate of their standard deviations. The algorithms are robust to the violation of Hardy-Weinberg equilibrium and can handle missing marker data easily. As a follow-up study, we also investigated two related questions: how much the haplotype information contributes to linkage disequilibrium (LD) mapping and whether an in silico haplotype construction preceding the LD analysis can help. For simple disease gene mapping our conclusions are as follows: (a) if a proper statistical model is employed, the loss of haplotype information for either control or disease data does not have a great impact on LD fine mapping, and (b) haplotype inference should be carried out jointly with LD analysis to achieve the most accurate location estimation.
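
The EM step applied within a single segment can be sketched as follows. This is a textbook EM for haplotype frequencies under Hardy-Weinberg proportions, written under the assumption of a 0/1/2 genotype encoding (2 = heterozygous) and with hypothetical function names; it does not include the partitioning or ligation stages and is not the authors' implementation.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [k for k, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for k, b in zip(het, bits):
            h1[k], h2[k] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    options = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in options for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in options:
            # E-step: posterior weight of each compatible pair.
            weights = [freq[a] * freq[b] for a, b in pairs]
            total = sum(weights)
            if total == 0.0:  # guard against numerical underflow
                weights, total = [1.0] * len(pairs), float(len(pairs))
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by 2N chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

# Toy data (hypothetical): the double heterozygote is the only ambiguous case.
freq = em_haplotype_frequencies([(0, 0), (0, 0), (1, 1), (2, 2)])
```

On this toy input the estimates settle on frequencies 0.625 for 00 and 0.375 for 11, with 01 and 10 driven toward zero, because the unambiguous individuals already support 00 and 11.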

Dahlia Nielsen, North Carolina State University

Title: Multi-locus linkage disequilibrium and haplotype-based tests of association

The hope behind association mapping is to use linkage disequilibrium (LD) as an indicator of proximity of a marker to a susceptibility locus. This follows from the expectation that marker-phenotype association is proportional to linkage disequilibrium, which is inversely related to recombination. If there are more than two alleles at a locus affecting risk, the association statistic is instead a weighted sum of linkage disequilibria and genotypic susceptibilities. There is no longer a simple relationship, even in expectation, with recombination. These results extend to marker haplotypes. In addition to the pairwise association terms of the single marker tests, marker haplotype associations depend on the weighted sum of multi-locus disequilibria and genotypic susceptibilities. Several tests of haplotype association are presented here, along with a comparison of these tests within different LD contexts.

Magnus Nordborg, University of Southern California

Title: The Pattern of Polymorphism on Human Chromosome 21

Polymorphism data from 20 partially resequenced copies of human chromosome 21 (more than 20,000 polymorphic sites) are analyzed. The allele-frequency distribution shows no deviation from the simplest population genetic model with a constant population size (although we show that our analysis has no power to detect population growth). The average rate of recombination per site is estimated to be roughly one half of the rate of mutation per site, again in agreement with simple model predictions. However, sliding-window analyses of the amount of polymorphism and the extent of linkage disequilibrium (LD) show significant deviations from standard models. This could be due to the history of selection or demographic change, but it is impossible to draw strong conclusions without much better knowledge of variation in the relationship between genetic and physical distance along the chromosome.

Jonathan Pritchard, University of Chicago

Title: Use of a local approximation to the ancestral recombination graph for fine mapping disease genes

We describe a novel coalescent-based method for estimating the location of a disease susceptibility locus. This is designed for the situation where we have genotype data from a sample of cases and controls, in a region that is believed to contain a disease mutation. For a given position on the marker-map, we use the marker information of both cases and controls to reconstruct local approximations of the ancestral recombination graph using Markov Chain Monte Carlo. From this, we can compute the likelihood of the phenotype data assuming a susceptibility gene at this position; the procedure is repeated at a series of locations across the region to estimate the posterior density.

Co-authors: Sebastian Zöllner and Jonathan Pritchard

Molly Przeworski, Max Planck Institute for Evolutionary Anthropology

Title: Insights into recombination from patterns of linkage disequilibrium

Recent studies of linkage disequilibrium (LD) have suggested that (1) recombination rates vary tremendously across the genome, such that large-scale estimates of the recombination rate based on a comparison of physical and genetic maps may not be informative about local patterns of LD; (2) models of recombination that include gene conversion as well as crossing-over better predict levels of LD at short scales; and (3) levels of LD are lower in samples from sub-Saharan African populations than in other population samples. These observations have important implications for linkage disequilibrium-based association studies; they suggest that large areas of the genome can be tagged with few markers, and that genome-wide studies and fine-scale mapping efforts might best be conducted in different populations. To examine the generality of these observations, we analyzed over 80 data sets sequenced in 24 African-Americans and 23 individuals of European descent (data from http://pga.mbt.washington.edu/). As an index of LD, we estimated the population rate of crossing-over ρ in the two "population" samples. We compared estimates of ρ (with and without gene conversion) to those obtained from a comparison of genetic and physical maps. To gain a sense of recombination rate variation at a small scale, we also considered how much estimates of ρ vary along sequences in actual data compared to simulated data.

Bruce Rannala, University of Alberta

Title: Joint Bayesian estimation of mutation location and age using linkage disequilibrium

A non-random association of disease and marker alleles on chromosomes in populations can arise as a consequence of historical forces such as mutation, selection and genetic drift, and is referred to as "linkage disequilibrium" (LD). LD can be used to estimate the map position of a disease mutation relative to a set of linked markers, as well as to estimate other parameters of interest, such as mutation age. Parametric methods for estimating the location of a disease mutation using marker linkage disequilibrium in a sample of normal and affected individuals require a detailed knowledge of population demography, and in particular require users to specify the postulated age of a mutation and past population growth rates. A new Bayesian method is presented for jointly estimating the position of a disease mutation and its age. The method is illustrated using haplotype data for the cystic fibrosis Delta F508 mutation in Europe and the DTD mutation in Finland. It is shown that, for these datasets, the posterior probability distribution of disease mutation location is insensitive to the population growth rate when the model is averaged over possible mutation ages (using a prior for age based on the population frequency of the disease mutation). Fewer assumptions are therefore needed for parametric LD mapping.

Kathryn Roeder, Carnegie Mellon University

Title: Evolutionary-based Association Analysis Using Haplotype Data

Association studies, both family-based and population-based, can be powerful means of detecting disease-liability alleles. To increase the information of the test, various researchers have proposed targeting haplotypes. The larger number of haplotypes, however, relative to alleles at individual loci, could decrease power because of the additional degrees of freedom required for the test. An optimal strategy would focus the test on particular haplotypes or groups of haplotypes, much as is done with cladistic-based association analysis. First suggested by Templeton and colleagues, such analyses use the evolutionary relationships among haplotypes to produce a limited set of hypothesis tests and to increase the interpretability of these tests. To more fully utilize the information contained in the evolutionary relationships among haplotypes and in the sample, we propose generalized linear models (GLM) for the analysis of data from family-based and population-based studies. These models fully account for haplotype phase ambiguity and allow for covariates. The models are encoded into a software package, EHAP (for Evolutionary-based Haplotype Analysis Package), which also provides for various kinds of exploratory data analysis. The exploratory analyses, such as error checking, estimation of haplotype frequencies, and tools for building cladograms, should facilitate the implementation of cladistic-based association analysis with haplotypes.

Russell Schwartz, Carnegie Mellon University

Title: Inferring Piecewise Ancestral History from Haploid Sequences

The determination of complete human genome sequences, subsequent work on mapping human genetic variations, and advances in laboratory technology for screening for these variations in large populations are together generating tremendous interest in genetic association studies as a means for characterizing the genetic basis of common human diseases. Considerable recent work has focused on using haplotypes to reduce redundancy in the datasets, improving our ability to detect significant correlations between genotype and phenotype while simultaneously reducing the cost of performing assays. A key step in applying haplotypes to human association studies is determining regions of the human genome that have been inherited intact by large portions of the human population from ancient ancestors. This talk describes computational methods for the problem of predicting segments of shared ancestry within a genetic region among a set of individuals. Our approach is based on what we call the haplotype coloring problem: coloring segments of a set of sequences such that like colors indicate likely descent from a common ancestor. I will present two methods for this problem. The first uses the notion of "haplotype blocks" to develop a two-stage coloring algorithm. The second is based on a block-free probabilistic model of sequence generation that can be optimized to yield a likely coloring. I will describe both methods and illustrate their performance using real and contrived data sets.

Montgomery Slatkin, University of California, Berkeley

Title: Testing for differences in haplotype frequencies in case-control studies

The problem of testing for significant differences in haplotype frequencies between a random sample of individuals (randoms) and a sample of individuals with a genetic disease (cases) is considered. The questions are (1) What is the statistical power in testing for differences in haplotype frequencies? and (2) How much statistical power is lost when haplotype phase cannot be resolved but instead must be inferred using a maximum likelihood method? A likelihood ratio test of differences in haplotype frequencies in randoms and cases is used, and the theory is developed in terms of the non-centrality parameter of the non-central chi-square distribution. If a causative allele has a multiplicative effect on penetrance, and if sample sizes are large and the effect of the causative allele is small, analytic expressions for the non-centrality parameter are obtained when haplotypes can and cannot be resolved. The loss in power is independent of the frequency of the causative allele and can in general be compensated for by increasing the sample sizes by a factor of two or less. For a dominant causative allele, numerical results are obtained that show the important features of the results for multiplicative penetrance still are valid. We conclude that maximum likelihood inference of haplotype frequencies and a likelihood ratio test of differences in inferred frequencies can be useful in a case-control setting.

Matthew Stephens, University of Washington

Title: Haplotypes, hotspots, and a multilocus model for Linkage Disequilibrium

Current methods for understanding the relationship between LD and the underlying recombination rate are limited. The most common approach is to compute a measure of LD between every pair of sites in the region, and to form a graphical display of the results. However, it is typically difficult to assess the significance of observed patterns. More sophisticated coalescent-based statistical methods for estimating local recombination rate from patterns of LD are either computationally impractical for moderate-sized regions, or suffer from loss of information by using only a summary of the data. Furthermore, they all assume constant recombination rate, making them poor tools for studying local recombination rates. Here we propose a novel computationally tractable model for LD across multiple loci. We apply this model to the problem of inferring recombination rates from population data, and in particular to identifying variation in the local recombination rate ("hotspots" and "coldspots") along chromosomes. We outline how this model might be used to develop more powerful methods for LD mapping.

Fengzhu Sun, University of Southern California

Title: Dynamic programming algorithms for haplotype block partition and applications to association studies

We develop a dynamic programming algorithm for haplotype block partitioning to minimize the number of representative single nucleotide polymorphisms (SNPs) required to account for most of the haplotype quality in each block. The block quality is a function of the haplotypes defined by the SNPs in the block. Any measure of haplotype quality can be used in the algorithm and of course the measure should depend on the specific application. The dynamic programming algorithm is applied to analyze the haplotype data on chromosome 21 of Patil et al. Using the same criteria as in Patil et al. (6), we identify a total of 3,582 representative SNPs and 2,575 blocks which are 21.5% and 37.7%, respectively, smaller than those identified using a greedy algorithm of Patil et al. We also compare the power of association studies using all SNPs, tag SNPs and the same number of randomly chosen SNPs.
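
The recurrence behind such a partition can be sketched as a one-dimensional dynamic program over SNP positions. In the sketch below, `tag_count(i, j)` is a hypothetical callable that returns the number of representative SNPs needed for a block covering SNPs i through j, or None when that range does not qualify as a block; the toy cost function is invented purely to exercise the code and is not the haplotype quality measure used in the talk.

```python
def partition_blocks(n_snps, tag_count):
    """Split SNPs 0..n_snps-1 into consecutive blocks minimizing total tag SNPs.

    Returns (total, blocks) with blocks as inclusive (first, last) index
    pairs, or None when no valid partition exists.
    """
    inf = float("inf")
    best = [0] + [inf] * n_snps   # best[j]: minimum cost for the first j SNPs
    start = [0] * (n_snps + 1)    # start[j]: first SNP of the final block
    for j in range(1, n_snps + 1):
        for i in range(j):
            cost = tag_count(i, j - 1)
            if cost is not None and best[i] + cost < best[j]:
                best[j] = best[i] + cost
                start[j] = i
    if best[n_snps] == inf:
        return None
    # Walk back through the stored block starts to recover the partition.
    blocks, j = [], n_snps
    while j > 0:
        blocks.append((start[j], j - 1))
        j = start[j]
    return best[n_snps], blocks[::-1]

def toy_cost(i, j):
    """Hypothetical cost: any run of at most three SNPs needs one tag SNP."""
    return 1 if (j - i + 1) <= 3 else None

total, blocks = partition_blocks(7, toy_cost)
```

The double loop evaluates every candidate block once, so the running time is quadratic in the number of SNPs times the cost of one `tag_count` call.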

Elizabeth Thompson, University of Washington

Title: Genome sharing in small populations

The haplotype structure of a population derives from the recombination events in meioses that are ancestral to current population members. In very small populations of conservation importance, the extent of genome survival depends also on recombination which distributes founder genome across the population. Relative to a founder population, a junction is a recombination point between genomes of different founder origins. In comparing two extant genomes, the segments shared IBD will be bounded by external junctions and may include internal junctions shared by both genomes. Thus IBD tracts are made up of a random number of segments bounded by junctions. A study of the process of internal and external junction types along pairs of chromosomes sampled from a population leads to new results on the variance of lengths of genome shared between relatives. This research is based on work with Dr. N. Chapman.

Francisco de la Vega, Applied Biosystems

Title: Patterns of linkage disequilibrium across human chromosomes 6, 21, and 22

With the aim of developing a linkage disequilibrium (LD) SNP map to serve as a resource for candidate-gene, candidate-region and whole-genome association studies, we have genotyped >250,000 SNPs on 90 DNA samples (45 African-American, 45 Caucasian, unrelated) selected from the Coriell Human variation collection. The individual genotypes thus generated have enabled us to survey the patterns of LD and haplotype diversity across all gene regions of the human genome. Here I describe the empirical results of the first comparative study of the patterns of LD across three entire human autosomes: Chromosomes 6, 21, and 22. We selected for the study a total of 17,966 SNPs covering more than 209 Mb of chromosomal segments, and overlapping 2,266 predicted gene regions, with a minor allele frequency greater than 10% in either population, and that were in Hardy-Weinberg equilibrium (p>0.01). Several methods to define "haplotype blocks" were applied to this dataset, including several forms of the D' method and the 4-gamete rule. Haplotypes were then computationally inferred for the markers within each block by the EM algorithm to assess haplotype diversity. In addition, a subset of 277 SNPs spanning 4 Mb across the HLA region on chromosome 6 was genotyped on 550 DNA samples of unrelated individuals of European ancestry from north Germany, 93 samples from Norway, and 77 samples from UK. We analyze the robustness of the different haplotype block definitions, the differences between the population samples, and the effect of sample size on the generalization of haplotype blocks defined in one given population sample. Finally, I present the preliminary results of haplotype-based power calculations for case-control studies across the gene regions of these three chromosomes.
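
For reference, the pairwise statistic underlying the block definitions based on normalized LD can be sketched as below. This computes the standard normalized disequilibrium coefficient |D'| between two biallelic sites from phased 0/1 haplotypes; the function name and input layout are assumptions for this example, and the block-calling thresholds used in the study are not given in the abstract.

```python
def d_prime(haplotypes, i, j):
    """Absolute normalized LD |D'| between sites i and j.

    `haplotypes` is a list of equal-length 0/1 tuples (phased data).
    Returns None when either site is monomorphic in the sample.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return None
    return abs(d) / d_max

# Toy samples (hypothetical): three gametes present versus all four.
three_gametes = [(0, 0), (0, 0), (1, 0), (1, 1)]
four_gametes = [(0, 0), (0, 1), (1, 0), (1, 1)]
complete_ld = d_prime(three_gametes, 0, 1)
no_ld = d_prime(four_gametes, 0, 1)
```

The toy samples show the link between the two block criteria named above: |D'| reaches 1 exactly when at most three of the four possible gametes are observed, which is the situation the 4-gamete rule tests for.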

Jinghui Zhang, National Cancer Institute, NIH

Title: A Software System for Automated and Visual Analysis of Functionally Annotated Haplotypes

We have developed a software analysis package, HapScope, which includes a comprehensive analysis pipeline and a sophisticated visualization tool for analyzing functionally annotated haplotypes. The HapScope analysis pipeline supports: a) computational haplotype construction with an EM or Bayesian statistical algorithm; b) SNP classification by protein coding change, homology to model organisms or putative regulatory regions; c) minimum SNP subset selection by either a Brute Force Algorithm or a Greedy Partition Algorithm. The HapScope viewer displays genomic structure with haplotype information in an integrated environment, providing eight alternative views for assessing genetic and functional correlation. It has a user-friendly interface for: a) haplotype block visualization; b) SNP subset selection; c) haplotype consolidation with subset SNP markers; d) incorporation of both experimentally determined haplotypes and computational results; e) data export for additional analysis. Comparison of haplotypes constructed by the statistical algorithms with those determined experimentally shows variation in haplotype prediction accuracies in genomic regions with different levels of nucleotide diversity. We have applied HapScope in analyzing haplotypes for candidate genes and genomic regions with extensive SNP and genotype data. We envision that the systematic approach of integrating functional genomic analysis with population haplotypes, supported by HapScope, will greatly facilitate current genetic disease research.

Maoxia Zheng, University of Chicago

Title: Assessment of goodness of fit of models for block haplotype structure

Our aim is to formalize models for high-resolution haplotype structure in such a way that they can be useful in statistical methods for LD mapping. Some steps in that direction have been taken by Daly et al. (2001), who outline a hidden Markov model (HMM) that allows for common haplotypes in each block. We propose somewhat different models that also use HMMs. In this talk, we address the problem of assessing goodness of fit of particular models, where each model involves choices such as number and positions of blocks and common haplotypes in each block. Our models also allow for haplotypes in a block that are not one of the common types. We discuss choice of goodness-of-fit statistic, parametrization, and computational issues involved in assessing the fit of background LD models to data.

This is joint work with Mary Sara McPeek (U. of Chicago).

Document last modified on November 18, 2002.