DIMACS Workshop on Computational Issues in Genetic Epidemiology

August 21 - 22, 2008
DIMACS Center, CoRE Building, Rutgers University

Andrew Scott Allen, Duke University, andrew.s.allen at duke.edu
Ion Mandoiu, University of Connecticut, ion at engr.uconn.edu
Dan Nicolae, University of Chicago, nicolae at galton.uchicago.edu
Yi Pan, Georgia State University, pan at cs.gsu.edu
Alex Zelikovsky, Georgia State University, alexz at cs.gsu.edu
Presented under the auspices of the DIMACS/BioMaPS/MB Center Special Focus on Information Processing in Biology.

This special focus is jointly sponsored by the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), the Biological, Mathematical, and Physical Sciences Interfaces Institute for Quantitative Biology (BioMaPS), and the Rutgers Center for Molecular Biophysics and Biophysical Chemistry (MB Center).


Sivan Bercovici, Technion - Israel Institute of Technology

Title: Inferring Ancestry Efficiently in Admixed Populations

The ability to determine the mosaic of ancestral origins in admixed individuals is central to many applications, ranging from the study of population history, to the assessment of population structure for adequate adjustment of association studies, to mapping by admixture linkage disequilibrium (MALD). We present a novel framework for the inference of ancestry at each chromosomal location. The uniqueness of our method stems from its ability to efficiently incorporate complex probability models that account for linkage disequilibrium in the ancestral populations. The validity of our model is demonstrated through simulations showing that the framework can provide higher accuracy in inferring ancestral origin.

Tanya Berger-Wolf, University of Illinois at Chicago

Title: Reconstructing Sibling Relationships from Microsatellite Data

Kinship analysis using genetic data is important for many applications, including many in forensic analysis and population biology. Wide availability of microsatellites has boosted studies in wild populations that rely on the knowledge of kinship, particularly sibship. While there exist many methods for reconstructing sibling relationships, most rely on extensive a priori knowledge about population parameters and do not account for errors and mutations in microsatellite data, which are prevalent and affect the quality of reconstruction. We present an error-tolerant method for reconstructing sibling relationships that uses Mendelian rules of inheritance and combinatorial optimization. We test our approach on both real and simulated data, with both pre-existing and introduced errors. Our method is highly accurate on almost all simulations, giving over 90% accuracy in most cases.
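One Mendelian constraint that a sibship reconstruction can draw on is that full siblings share two diploid parents, so a sibling group can carry at most four distinct alleles at any locus. The sketch below checks only this necessary condition. The function name and data layout are assumptions made for illustration, and the code is not the error-tolerant combinatorial optimization method described in the abstract.

```python
def satisfies_four_allele_rule(genotypes):
    """Return True if a candidate sibling group carries at most four
    distinct alleles at every locus (a necessary condition only).

    genotypes: list of individuals; each individual is a list of
    (allele_a, allele_b) tuples, one per locus (hypothetical layout).
    """
    if not genotypes:
        return True
    n_loci = len(genotypes[0])
    for locus in range(n_loci):
        alleles = set()
        for individual in genotypes:
            alleles.update(individual[locus])
        # Two diploid parents contribute at most four alleles per locus.
        if len(alleles) > 4:
            return False
    return True


candidate_group = [
    [(1, 2), (5, 5)],
    [(3, 4), (5, 6)],
    [(1, 3), (6, 6)],
]
print(satisfies_four_allele_rule(candidate_group))
```

A group that passes this check is not guaranteed to be a true sibship; the check only rules out groups that cannot be one under error-free data.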


Daniel Brown, University of Waterloo

Title: Mathematical structure and optimization approaches to haplotyping problems

There has been striking development of algorithms for haplotyping-related problems over the past few years. We discuss some interesting algebraic structure such problems have, and give math programming approaches to identifying haplotypes in the presence of noisy input data, or on networks of constrained structure.

Joint work with Ian Harrower (David R. Cheriton School of Computer Science, University of Waterloo) and Dan Gusfield (Department of Computer Science, UC Davis).

Mariza de Andrade, Mayo Clinic

Title: Things to know when using Affymetrix 6.0 SNP Array for GW analysis

In an era of genome-wide association analyses, researchers are facing the challenge not only of analyzing a large volume of data but also of processing the genotype data and creating appropriate workflows. We present our experiences in working with the Affymetrix 6.0 Genome-Wide Human SNP Array, including the workflows that we have created. In particular, we focus on the issues prior to genotype extraction, assessment of genotype accuracy, and automation of the workflows, recognizing that the Affy 6.0 data will eventually be used for analysis using a wide range of study designs. We will share our experience working with Birdseed1 and 2 using individual plates and all plates to generate genotype calls, examining replicate samples, and the challenges with quality control measures in sibships. We will present our workflow and initial results using 900 samples of hypertensive sibships from Rochester, MN.

Michael Epstein, Emory University

Title: Fast and Robust Association Tests for Untyped SNPs in Case-Control Studies

Genomewide association studies of complex diseases typically genotype and analyze a set of tagSNPs that effectively capture genetic variation across the genome. Nevertheless, many such studies have substantial interest in testing SNPs that are not genotyped formally in the test sample. Such analyses of untyped SNPs can assist in signal localization and permit cross-platform comparison of results from different studies. While such untyped analyses might initially appear intractable, a study can extrapolate information on an untyped SNP in a sample using the observed tagSNP data coupled with external haplotype-based information on all SNPs (both typed and untyped) from an appropriate reference catalogue of human genetic variation (such as one of the samples from the International HapMap Project). Using this logic, we propose a novel statistical approach for testing untyped SNPs in case-control genomewide association studies. We base our approach on an efficient-score function derived from a prospective likelihood of data, which facilitates easy modeling and testing of untyped SNPs and covariates. In addition, we show both theoretically and empirically that our approach is robust to an inappropriate choice of reference sample for inference of untyped SNPs and, further, does not require adjusting for the additional variability in estimating haplotypes from genotype data. As a result, our efficient-score test is computationally much faster than existing approaches for untyped analysis and, in many situations, has a closed form that allows easy programming in existing software packages. Regarding this former strength of our method, we can analyze ~1.6 million untyped SNPs in a case-control dataset of 1000 subjects in ~90 minutes on a single Windows processor. At the same time, we show using simulated data that our approach has near-equivalent performance compared to the popular hidden-Markov methods of untyped analysis. 
We illustrate our approach with an application to a first-stage genomewide association study of Parkinson's Disease.

Joint work with Glen A Satten (National Center for Chronic Disease Prevention and Health, Centers for Disease Control and Prevention) and Andrew S. Allen (Department of Biostatistics and Bioinformatics, Duke University).

Eleazar Eskin, UCLA

Title: Increasing Power in Association Studies by using Linkage Disequilibrium Structure and Molecular Function as Prior Information

The availability of various types of genomic data provides an opportunity to incorporate this data as prior information in genetic association studies. This information includes knowledge of linkage disequilibrium structure as well as which regions are likely to be involved in disease. In this paper, we present an approach for incorporating this information by revisiting how we perform multiple-hypothesis correction. In a traditional association study, in order to correct for multiple-hypothesis testing, the significance threshold at each marker, t, is set to control the total false-positive rate. In our framework, we vary the threshold at each marker t_i and use these thresholds to incorporate prior information.

We present a numerical procedure for solving for thresholds that maximizes association study power using prior information. We also present the results of benchmark simulation experiments using the HapMap data, which demonstrate a significant increase in association study power under this framework. We provide a Web server for performing association studies using our method and provide thresholds optimized for the Affymetrix 500k and Illumina HumanHap 550 chips and demonstrate the application of our framework to the analysis of the Wellcome Trust Case Control Consortium data.
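The idea of per-marker thresholds t_i can be illustrated with a weighted Bonferroni rule, in which the thresholds are proportional to prior weights and sum to the overall level alpha. This is a minimal sketch under that assumption; it does not implement the power-maximizing numerical procedure of the abstract, and the function names and weights are hypothetical.

```python
def weighted_thresholds(prior_weights, alpha=0.05):
    """Split the overall level alpha across markers in proportion to
    non-negative prior weights, so the thresholds t_i sum to alpha."""
    total = sum(prior_weights)
    if total <= 0:
        raise ValueError("prior weights must have a positive sum")
    return [alpha * w / total for w in prior_weights]


def significant_markers(p_values, thresholds):
    """Return indices of markers whose p-value is at or below its own threshold."""
    return [i for i, (p, t) in enumerate(zip(p_values, thresholds)) if p <= t]


# Hypothetical example: the third marker has twice the prior weight.
thresholds = weighted_thresholds([1.0, 1.0, 2.0], alpha=0.04)
print(significant_markers([0.015, 0.005, 0.015], thresholds))
```

Because the thresholds sum to alpha, the union bound still limits the probability of any false positive to alpha, while markers with more prior support are tested at a more lenient level.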

Sridhar Hannenhalli, University of Pennsylvania

Title: Computational investigation of gene regulation

Biological processes are controlled at various levels in the cell. While these mechanisms are poorly understood, transcriptional control is widely recognized as an important component, and a better understanding of it will provide an efficient means for therapeutic intervention in disease processes. I will present a brief overview of the computational approaches to the study of gene transcription as well as a number of applications.

Eran Halperin, International Computer Science Institute

Title: Estimating Local Ancestry in Admixed Populations

Large-scale genotyping of SNPs has shown great promise in identifying markers that could be linked to diseases. One of the major obstacles involved in performing these studies is that the underlying population sub-structure could produce spurious associations. Population sub-structure can be caused by the presence of two distinct sub-populations or a single pool of admixed individuals. In this talk, I will focus on the latter, which is significantly harder to detect in practice. New advances in this research direction are expected to play a key role in identifying loci which are different among different populations and are still associated with a disease. Furthermore, the detection of an individual's ancestry has important medical implications. I will describe two methods that we have recently developed to detect admixture, or the locus-specific ancestry in an admixed population. We have run extensive experiments to characterize the important parameters that have to be optimized when considering this problem. I will describe the results of these experiments in the context of existing tools such as SABER and STRUCTURE.

Yongtao Guan, University of Chicago

Title: Multi-SNP Association Mapping using Bayesian Regression and Shrinkage Priors

Whole genome association studies are currently genotyping thousands of individuals at hundreds of thousands of markers, in an effort to identify variants affecting phenotypes related to human health. Conventional analyses test each marker, one at a time, for association with phenotype. Here we describe a fully Bayesian approach that analyses all markers jointly, and uses a novel prior to avoid over-fitting. Despite the size of the problem, a simple MCMC scheme provides improved inference compared with standard single-SNP analyses. We will present some results from re-analyzing WTCCC data. Methods described here are implemented in a software package, BIMBAM, available from the Stephens Lab website.

Joint work with Matthew Stephens (Departments of Statistics and Human Genetics, University of Chicago).

Iuliana Ionita, Harvard University

Title: On a method to estimate the number of unseen variants in the human genome

The different genetic variation discovery projects (The HapMap, The 1000 Genomes Project etc.) aim to identify as much as possible of the underlying human genetic variation. The question we address in this paper is how many new variants are yet to be found. This problem is similar to the one studied in Efron and Thisted (1976), where their goal was to estimate the number of words Shakespeare knew but did not use. We use a parametric beta-binomial model that allows us to calculate the expected number of new variants with a specified minimum frequency to be discovered in a new sample of individuals of a certain size. We apply the method to three datasets, the ENCODE dataset, the SeattleSNPs dataset and the NIEHS SNPs dataset.
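A beta-binomial calculation of this kind can be sketched as follows, under assumptions that are not taken from the abstract: each variant's frequency p is drawn from a Beta(a, b) distribution, and each sampled chromosome carries the variant independently with probability p. The probability that a variant is unseen in n chromosomes is then B(a, b + n) / B(a, b). The function names and the parameter values are hypothetical, and the minimum-frequency restriction mentioned in the abstract is not modeled.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def expected_new_variants(n_observed, n, m, a, b):
    """Expected number of variants first seen in m additional chromosomes,
    given n_observed variants already seen in n chromosomes.

    Assumes variant frequencies follow Beta(a, b) with a, b > 0.
    """
    base = log_beta(a, b)
    unseen_in_n = exp(log_beta(a, b + n) - base)
    unseen_in_n_plus_m = exp(log_beta(a, b + n + m) - base)
    # P(unseen in n, seen in the next m) divided by P(seen in n)
    # rescales the observed count to the expected number of new variants.
    seen_in_n = 1.0 - unseen_in_n
    return n_observed * (unseen_in_n - unseen_in_n_plus_m) / seen_in_n


print(expected_new_variants(n_observed=2100, n=10, m=10, a=1.0, b=1.0))
```

In practice a and b would have to be estimated from the observed frequency spectrum; here they are simply supplied by the caller.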

Justin Kennedy, University of Connecticut

Title: Linkage Disequilibrium Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads

The new generation of sequencing technologies has made shotgun sequencing of individual genomes feasible. However, most medical applications require accurate identification of both alleles at each variable locus, and the sensitivity of detecting heterozygous genotypes from shotgun sequencing data is limited by coverage depth. Indeed, coverage depths similar to those used for the Watson and Venter genomes (around 7.5x) are able to detect only ~75% of the heterozygous SNPs, and sensitivity drops rapidly at lower coverage depths.

We demonstrate that highly accurate SNP genotypes can be inferred from low-coverage shotgun sequencing reads by using a multilocus inference model that exploits linkage disequilibrium (LD) information from a reference population panel such as HapMap. Experiments on a publicly available subset of the 454 reads from the Watson genome show that our method achieves significantly improved accuracy compared with single-SNP methods that ignore LD. For example, the accuracy achieved by the binomial test of Wheeler et al. for 5.6-fold mapped read coverage is achieved by our methods using only 1/4 of the reads.
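Why coverage depth limits heterozygote detection can be seen in a simple idealized model, sketched below. The assumptions are not from the abstract: the number of reads covering a site is Poisson with mean equal to the coverage, each read samples either allele with probability 1/2, reads are error-free, and a heterozygote is called only when each allele appears in at least min_reads reads. The function names are hypothetical, and the model ignores LD, which is what the multilocus method exploits.

```python
from math import exp, factorial


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(exp(-mean) * mean ** i / factorial(i) for i in range(k + 1))


def het_detection_probability(coverage, min_reads=2):
    """Probability that both alleles of a heterozygous site are each
    observed in at least min_reads reads, under the idealized model."""
    # Splitting Poisson(coverage) reads evenly between two alleles gives
    # two independent Poisson(coverage / 2) counts.
    p_allele_seen = 1.0 - poisson_cdf(min_reads - 1, coverage / 2.0)
    return p_allele_seen ** 2


for depth in (2.0, 4.0, 8.0, 16.0):
    print(depth, round(het_detection_probability(depth), 3))
```

The detection probability rises with depth but approaches 1 only at coverage well above the low-coverage regime the abstract targets.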

Joint work with Sanjiv Dinakar, Ion Mandoiu, Yufeng Wu (Computer Science & Engineering Dept., University of Connecticut) and Yözen Hernández (Department of Computer Science, Hunter College).

Peter Kraft, Harvard University

Title: I'm not sure I'm agnostic: incorporating "known" biology into analysis and interpretation of genome-wide association studies

Genome-wide association scans (GWAS) have proven very effective at identifying small chromosomal regions harboring germline variants associated with complex traits: over 100 such regions have been identified in the last two years. However, empirical and theoretical considerations suggest that individual, modestly-sized GWAS (ca. 2,000 subjects) have low power to detect most trait-associated markers--so there are more regions to be discovered. The power for smaller studies of rare or less-well-studied traits is even smaller. Moreover, almost all successful GWAS have only identified markers, not the causal variants themselves. This suggests that using external meta-data (e.g. on the putative function of individual SNPs or genes) may help identify regions of interest or pinpoint the causal variant once a region has been identified. I present two examples of this approach (applied to a GWAS of smoking behavior and a GWAS of breast cancer) that upweight markers in candidate gene regions and downweight others, both by aggregating evidence for association across candidate markers and increasing p-value significance thresholds for candidate relative to non-candidate markers. I discuss the prospects and limitations for meta-data in GWAS. In particular, I argue that as larger and larger scans are conducted for many traits, the "data will swamp the prior" when it comes to identifying trait-associated regions. On the other hand, the amount of data useful for distinguishing markers in an associated region from each other is inherently limited by tight linkage disequilibrium. Diverse meta-data may be very useful when it comes to identifying the causal variant or prioritizing even smaller regions for intense scrutiny (resequencing, functional assays, etc.).

Itsik Pe'er, Columbia University

Title: Whole Population, Genomewide Mapping of Hidden Relatedness

The ability to identify and quantify genealogical relationships between individuals in a complex population is an important step in accurately using such data for disease analysis and improving our understanding of demography. However, exhaustive pair-wise analysis, which has been successful in small cohorts, cannot keep up with the current torrent of genotype data. We present GERMLINE, a robust algorithm for identifying pairwise segmental sharing which scales linearly with the number of input individuals. Our approach is based on a dictionary of haplotypes to efficiently discover short exact matches between individuals and then expands these matches to identify long nearly-identical segmental sharing that is indicative of relatedness. We comprehensively survey hidden relatedness both in the HapMap and in a densely typed island population of 3,000 individuals. We show GERMLINE agrees with other methods when they can process the data, and facilitates analysis of larger scale studies. We demonstrate the novel application of precise analysis of hidden relatedness. We show shared segment discovery can identify haplotype phasing errors and potentially resolve them. Finally, we use detected identity of genomic segments for exposing polymorphic deletions that are otherwise challenging to detect, with 8/14 deletions in the HapMap samples and 149/200 deletions in the island data having independent experimental validation.
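The seed-and-extend idea described above (a dictionary of short haplotype windows, followed by expansion into longer nearly-identical segments) can be sketched as follows. This is a simplified illustration, not the GERMLINE implementation: the function names, the fixed window size, and the mismatch budget are assumptions, and pairs inside a bucket are enumerated naively.

```python
from collections import defaultdict
from itertools import combinations


def seed_matches(haplotypes, window):
    """Find haplotype pairs that are identical within a fixed-size window.

    haplotypes: dict mapping a haplotype id to an allele string; all
    strings have equal length. Returns {(id_a, id_b): [window starts]}.
    """
    n_sites = len(next(iter(haplotypes.values())))
    seeds = defaultdict(list)
    for start in range(0, n_sites - window + 1, window):
        # Dictionary step: bucket haplotypes by their alleles in this window.
        buckets = defaultdict(list)
        for hap_id, alleles in haplotypes.items():
            buckets[alleles[start:start + window]].append(hap_id)
        for ids in buckets.values():
            for pair in combinations(sorted(ids), 2):
                seeds[pair].append(start)
    return dict(seeds)


def extend_match(hap_a, hap_b, start, end, max_mismatches=1):
    """Grow the half-open interval [start, end) in both directions while
    the number of mismatching sites stays within max_mismatches."""
    mismatches = 0
    left = start
    while left > 0:
        if hap_a[left - 1] != hap_b[left - 1]:
            if mismatches == max_mismatches:
                break
            mismatches += 1
        left -= 1
    right = end
    while right < len(hap_a):
        if hap_a[right] != hap_b[right]:
            if mismatches == max_mismatches:
                break
            mismatches += 1
        right += 1
    return left, right


haps = {"h1": "0101100110", "h2": "0101100010", "h3": "1110011101"}
for (a, b), starts in seed_matches(haps, window=4).items():
    for s in starts:
        print(a, b, extend_match(haps[a], haps[b], s, s + 4))
```

Hashing each window avoids comparing every pair of individuals directly; only haplotypes that land in the same bucket are considered for extension.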

Chiara Sabatti, UCLA

Title: The genetics of quantitative traits: what has changed since R.A. Fisher?

One of the scientific problems that interested Galton, Pearson, and Fisher was the heritability of quantitative traits. How do you reconcile Mendel's laws with heritable continuous traits? The concepts of correlation, variance, and multivariate regression, for example, were introduced in connection with this problem. Thanks to biotechnology advancements, we can now genotype thousands of individuals at tens of thousands of loci. Using this empirical data, researchers are trying to identify loci that influence height, cholesterol levels, etc. I will present one example of these studies, carried out on a Finnish population. After a brief summary of the initial analysis, I will discuss the challenges presented by these data sets.

Glen A. Satten, National Center for Chronic Disease Prevention and Health, Centers for Disease Control and Prevention

Title: New Haplotype Sharing Method for Genome-Wide Case-Control Association Studies Implicates Gene for Parkinson's Disease

The large number of markers considered in a genome-wide association study (GWAS) has resulted in a simplification of analyses conducted. Most studies are analyzed one marker at a time using simple tests like the trend test. Methods that account for the special features of genetic association studies, yet remain computationally feasible for genome-wide analysis, are desirable as they may lead to increased power to detect associations.

Haplotype sharing attempts to translate between population genetics and genetic epidemiology. Near a recent disease-causing mutation, case haplotypes should be more similar to each other than control haplotypes. We give computationally simple association tests based on haplotype sharing that can be easily applied to GWASs while allowing use of fast (but not likelihood-based) haplotyping algorithms and properly accounting for the uncertainty introduced by using inferred haplotypes. We also give haplotype sharing analyses that adjust for population stratification.

Applying our methods to a GWAS of Parkinson's disease, we find a genome-wide significant signal in a biologically plausible gene that is not found by single-SNP methods. Further, a missing-data artifact that causes a spurious single-SNP association on chromosome 9 does not impact our test.

Joint work with Andrew S. Allen (Department of Biostatistics and Bioinformatics, Duke University).

John Storey, Princeton University

Title: Calibrating the Performance of SNP Arrays for Whole-Genome Association Studies

Advances in SNP genotyping array technologies have made whole-genome association studies (WGAS) a readily available approach. Genetic coverage and statistical power are two key properties to evaluate for these arrays. In this study, 359 newly sampled individuals were genotyped using Affymetrix 500K and Illumina 650Y SNP arrays. From these data, we obtained new estimates of genetic coverage by constructing a test set from among these genotypes and individuals that is independent from the SNPs and individuals used to construct the arrays. These estimates are notably smaller than previous ones, which we argue is due to an overfitting bias in previous studies. We also collected liver tissue RNA from the participants and profiled these samples on a comprehensive gene expression microarray. The RNA levels were used as a large-scale set of quantitative traits to calibrate the relative statistical power of the commercial arrays. Through this dataset and simulations, we find that the SNP arrays provide adequate power to detect quantitative trait loci when the causal SNP's minor allele frequency is greater than 20%, but low power when it is less than 10%. Importantly, we provide evidence that sample size has a greater impact on the power of WGAS than SNP density or genetic coverage. This is joint work with Ke Hao and Eric Schadt of Rosetta Inpharmatics.

Daniel Stram, University of Southern California

Title: Use of Empirical Kinship Matrices in Whole Genome Case-Control Studies of Disease in Stratified Populations

See StramAbstract.pdf

Jung-Ying Tzeng, North Carolina State University

Title: A Constrained Regression Approach for Studying Haplotype-Specific Effects

Understanding the effects of specific haplotypes can help to identify etiological variants and infer biological explanations. Ideally, a thorough haplotype-specific analysis should call for pairwise comparisons among all haplotypes, similar to the post-hoc analysis of ANOVA. However, such comparisons may suffer from power loss due to multiple testing adjustment, and often yield contradictory conclusions on which haplotypes share the same level of effects. To overcome this concern, we propose a constrained regression approach that performs the ANOVA-type post-hoc analysis for haplotype-specific analysis. The method uses constraints that encourage haplotypes with similar effect sizes to be estimated with exact equality, and transfers the post-hoc analysis from a multiple-comparison procedure to a variable-selection framework. Through simulation we evaluate the performance of the proposed method and illustrate how the output can be used to characterize the haplotype-specific associations.

Yufeng Wu, University of Connecticut

Title: Inference of Complex Genealogical Histories In Populations and Its Application in Mapping Complex Traits

The genealogical history of SNP sequences in a population may be very useful for many biological problems, including association mapping of complex traits. However, inference of genealogy is not easy, especially when recombination occurs. The genealogical history with recombination takes the form of a network (not a single tree). This network is called an "ancestral recombination graph (ARG)". Reconstructing ARGs for given SNP sequences has been actively studied recently. In this talk, I will introduce the approaches to reconstructing ARGs and explain how the inferred ARGs can be useful in mapping complex traits. I will present practical results in mapping simulated and biological data to show that our methods can be competitive in mapping accuracy and can be applied to large biological datasets. I will also present new results on inference of local tree topologies with recombination.

Eric Xing, Carnegie Mellon University

Title: Genome-Phenome Association: Computational Challenges and New Algorithms

Many complex disease syndromes consist of a large number of highly related, rather than independent, clinical phenotypes. Differences between these syndromes involve the complex interplay of a large number of genomic variations that perturb the function of disease-related genes in the context of a regulatory network, rather than individually. Thus unraveling the causal genetic variations and understanding the mechanisms of consequent cell and tissue transformation requires an analysis that jointly considers the epistatic, pleiotropic, and plastic interactions of elements and modules within and between the genome and the phenome. In this talk, we discuss the computational challenges we face in such analysis, and present a block regression algorithm that can capture structures within the genome, and a graph-guided lasso algorithm that can capture modules and regularities in the network of molecular and clinical phenotypes, during genome-phenome association mapping.

Document last modified on August 5, 2008.