The DNA Barcode Data Analysis Initiative (DBDAI) Overview
The DNA Barcode Data Analysis Initiative (DBDAI):
Analyzing and Interpreting a New Generation of Biological Data
Prospectus for a 24-month international initiative
Background. In the past two years, a series of studies has been published in which "DNA barcoding" was proposed as a tool for differentiating species. Barcoding is based on the assumption that short gene regions evolve at a rate that produces clear interspecific sequence divergence while retaining low intraspecific sequence variability. The cytochrome c oxidase subunit 1 mitochondrial region ("COI") has emerged as a suitable barcode region for most animals. Taxonomists are in the process of identifying appropriate gene regions for barcoding other major groups of eukaryotes. Taxonomic studies of a growing number of taxa have shown that discontinuities in the levels of barcode sequence divergence (both phenetic and diagnostic) match the species boundaries delineated by morphological and ecological characters. These studies set the stage for a more in-depth analysis of the relationship between DNA barcode patterns and our understanding of speciation processes and mitochondrial evolution.
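The barcoding assumption described above (low divergence within species, clear divergence between species) can be illustrated with a minimal sketch. It is not a method prescribed by CBOL or DAWG. It assumes pre-aligned sequences of equal length, uses the uncorrected p-distance as the divergence measure, and runs on invented 10-base toy sequences; the function names `p_distance` and `barcode_gap` are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(specimens):
    """Return (max intraspecific, min interspecific) p-distance.

    specimens: list of (species_label, aligned_sequence) pairs.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(specimens, 2):
        # Same label: within-species pair; different label: between-species pair.
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    return max(intra, default=0.0), min(inter, default=0.0)


# Toy data: invented sequences, far shorter than a real COI barcode.
specimens = [
    ("species_A", "ACGTACGTAC"),
    ("species_A", "ACGTACGTAT"),
    ("species_B", "ACGAACTTGC"),
    ("species_B", "ACGAACTTGT"),
]

max_intra, min_inter = barcode_gap(specimens)
print(f"max intraspecific: {max_intra:.2f}, min interspecific: {min_inter:.2f}")
print("gap present" if min_inter > max_intra else "no gap")
```

In this toy case the smallest between-species distance exceeds the largest within-species distance, which is the pattern the barcoding assumption predicts. Real data would call for larger samples and possibly a model-corrected distance.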
The Consortium for the Barcode of Life (CBOL; see www.barcoding.si.edu) is an international consortium of about 70 Member Organizations from six continents and more than 35 nations. These include natural history museums, herbaria, biodiversity and conservation organizations, university departments and other research organizations, government agencies and private sector companies. CBOL is devoted to exploring and developing the potential of DNA barcoding as a tool for taxonomic research and for bringing species-level data to bear on practical problems such as conservation, crop protection and sustainable development. Four Working Groups have been formed by CBOL, including the Data Analysis Working Group (DAWG) chaired by Dr. Michel Veuille, Director of the Department of Systematics and Evolution in the National Museum of Natural History, Paris.
Plans. CBOL and DAWG propose the DNA Barcode Data Analysis Initiative (DBDAI), a 24-month international interdisciplinary program of work that will bring together taxonomists, population geneticists, statisticians, applied mathematicians and computer scientists. The overarching goals of this program of work will be to better understand the relationship between DNA barcode data and population-level genetic processes, and to develop the analytical tools needed to interpret, analyze and archive DNA barcode data. A further goal will be to explore both the potential and limitations of barcoding in the study of natural populations, especially populations of pests and endangered species.
The initiative's goals are:
- To create a small research community of population geneticists, statisticians, applied mathematicians, computer scientists and taxonomists that concentrates on DNA barcode data for two years;
- To promote collaboration and exchange of data and ideas within this research community;
- To explore the nature of DNA barcode data gathered to date and how variation patterns in barcode data comport with models from population genetics;
- To identify the types of analytical, interpretive and display tools needed for the optimal treatment of DNA barcode data;
- To engage interest in the analysis of barcode data, especially among doctoral students in statistics, applied mathematics and computer science; and
- To develop and disseminate the most effective analytical procedures and display tools for barcode data.
Some specific questions that will be addressed during the initiative are:
- How should the choice of analytical techniques take our knowledge of population biology into account?
- What sample sizes are adequate, and how will population structure affect the sample sizes needed for reliable results?
- How are the patterns of similarity seen in barcode data affected by the choice of clustering algorithms?
- What other statistical techniques should be considered for assigning unidentified specimens to known species and for identifying new species? Do these tasks require several analytical techniques?
- Besides segregating specimens into species, how reliably can barcode data assign unknown specimens to known species? What pitfalls limit this practice?
- What sample sizes and geographic sampling schemes are required to distinguish species reliably using barcode data?
- How do we recognize when hybridization, introgression, or other factors affect the reliability of species identifications based on barcode data?
- What are the character-based patterns of nucleotide variation within the sequenced region, and can these patterns be used instead of percent sequence similarity?
- Going beyond phenograms, how can the results of barcode data analysis be visualized and displayed to convey information about similarity, character diagnosis, geographic distribution, and other variables?
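Two of the questions above contrast character-based patterns with percent sequence similarity as ways of handling unidentified specimens. The sketch below shows one simple reading of each idea, under stated assumptions: sequences are pre-aligned and of equal length, a "diagnostic site" is taken to be an alignment column that is fixed within every species but differs between species, and assignment uses the nearest reference by uncorrected p-distance. The function names and the toy sequences are hypothetical and are not drawn from the initiative.

```python
def diagnostic_sites(groups):
    """Return {position: {species: base}} for alignment columns that are
    fixed within every species and differ between at least two species."""
    length = len(next(iter(groups.values()))[0])
    sites = {}
    for i in range(length):
        # Set of bases observed at column i, per species.
        states = {sp: {seq[i] for seq in seqs} for sp, seqs in groups.items()}
        if all(len(s) == 1 for s in states.values()):
            fixed = {sp: next(iter(s)) for sp, s in states.items()}
            if len(set(fixed.values())) > 1:
                sites[i] = fixed
    return sites


def assign_by_similarity(query, groups):
    """Assign a query to the species holding its nearest reference sequence.

    Returns (species, p_distance). Ties are broken alphabetically by species.
    """
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b)) / len(a)

    best_dist, best_species = min(
        (dist(query, seq), sp) for sp, seqs in groups.items() for seq in seqs
    )
    return best_species, best_dist


# Toy reference library: invented sequences, 0-based column positions.
groups = {
    "species_A": ["ACGTACGTAC", "ACGTACGTAT"],
    "species_B": ["ACGAACTTGC", "ACGAACTTGT"],
}

sites = diagnostic_sites(groups)
species, distance = assign_by_similarity("ACGAACTTGA", groups)
print("diagnostic columns:", sorted(sites))
print("assigned to:", species, "at p-distance", round(distance, 2))
```

A nearest-reference rule always returns some species, even for a specimen from a species missing from the library, which is one of the pitfalls the questions above raise. A distance threshold or a check against the diagnostic sites would be one way to flag such cases.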
Data interpretation is likely to require different kinds of tools at different steps of analytic protocols. For instance, phylogenomics, multivariate statistics, coalescent theory, learning machines and assignment statistics may all be combined to achieve a satisfactory result. A challenge for the DAWG is to bring together theoreticians from several fields who do not usually work together, and who do not usually work with taxonomists.
Program of Work. Two planning meetings were held (DIMACS, Rutgers University, 26 September 2005; National Museum of Natural History, Paris, 15 October 2005) with the support of the Consortium for the Barcode of Life. Participants agreed that the goal of the DAWG will be the development of a package of protocols and software needed to analyze, interpret and visualize barcode data. A Steering Committee of [five?] individuals was formed at that time. The Steering Committee has been charged with supporting and involving researchers from two main fields: population genetics and statistics/mathematics. The participants in the planning meetings, who included all members of the Steering Committee, agreed that:
- The DAWG will communicate through a web-based portal, which will facilitate the exchange of information;
- The Steering Committee will communicate regularly through the web-based portal and conference calls, and will meet as needed during the initiative;
- Work Teams will be able to form freely, following a principle of "competition" that will help develop innovative ideas for data analysis; however, the Steering Committee will stay in contact with these groups and organize work meetings to build working relationships among them and keep them focused on the goals of the barcode initiative;
- The outcomes of the Work Teams will be presented in one or several specific sessions at the Second International Barcode Conference organized by CBOL; and
- Work Teams will be encouraged to engage members of their respective research fields by presenting their results at relevant scientific meetings.
The two planning meetings produced the following six-phase Program of Work:
- Period up to July 6, 2006: Formulation of Work Teams. The two planning meetings defined a set of research questions and analytical challenges associated with the analysis, interpretation and visualization of barcode data. Participants in the planning meetings and others will self-organize into Work Teams, each of which will focus on specific questions and challenges.
- Period up to July 6, 2006: Preparation of Preliminary Results. During this phase, Work Teams will develop their approaches to their questions/challenges and will conduct pilot projects with the goal of producing preliminary results.
- July 6 - 8, 2006 Workshop: Presentation of Preliminary Results. Work Teams will submit abstracts of their preliminary results to the Steering Committee, which will select the workshop participants from these submissions. Presenters will receive feedback from the Steering Committee and other workshop participants.
- July 8, 2006 - February 2007: Preparation of Final Results. Work Teams will continue their efforts in preparation for a DAWG-sponsored session at the Second International Barcode Conference. Software packages will be tested using databases provided by the Steering Committee. During summer 2006, the Steering Committee will apply to the Conference Program Committee for a session, and Work Teams will submit abstracts in September 2006.
- February 2007: Second International Barcode Conference. DAWG will sponsor a session in which recommendations will be presented for data analysis protocols for barcode data. Work Teams will present their final results.
- March - October 2007: Dissemination of Results. Presenters in the DAWG session at the Second International Barcode Conference will prepare manuscripts for the conference proceedings volume. Their software will be posted on the CBOL website and a User's Guide will be prepared. Work Teams will also present their results at scientific meetings in their respective fields of research.
Document last modified on March 21, 2006.