Call for Participation: Research Challenges Related to Data Analysis and the DNA Barcode of Life Initiative


We are pleased to announce research challenges from the Data Analysis Working Group of the Consortium for the Barcode of Life. For more information about the activities of this working group, see http://dimacs.rutgers.edu/Workshops/DNAInitiative/

The DNA Barcode Data Analysis Initiative: Developing Tools for a New Generation of Biodiversity Data

The Data Analysis Working Group (DAWG) of the Consortium for the Barcode of Life (CBOL) will hold the Second International Barcode of Life Conference at Academia Sinica in Taipei, Taiwan, during the week of 17-21 September 2007. See the Second Conference Announcement. The Conference website is at www.dnabarcodes2007.org.

Researchers and teams of researchers are invited to read the Research Challenges presented below, and to submit an abstract for presentation at the Taipei workshop using the Abstract Submission Form. (Please note registration for the workshop is required to hold your place.)

What are DNA Barcode Data? In the past two years, a series of studies have been published in which "DNA barcoding" was proposed as a tool for differentiating biological species. Barcoding is based on the assumption that short gene regions evolve at a rate that produces clear interspecific sequence divergence while retaining low intraspecific sequence variability. The cytochrome c oxidase subunit 1 mitochondrial region ("COI", 648 base pairs long) has emerged as a suitable barcode region for most taxonomic groups of animals.
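
The barcoding assumption above can be made concrete with a small sketch. It compares the largest within-species distance to the smallest between-species distance (often called the "barcode gap"). The function names, the toy sequences, and the use of the uncorrected proportion of differing sites as a distance are illustrative assumptions, not part of the DAWG materials.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(specimens):
    """Return (max intraspecific distance, min interspecific distance).

    specimens is a list of (species_name, sequence) pairs.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(specimens, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    # Defaults cover the cases of no same-species or no cross-species pairs.
    return max(intra, default=0.0), min(inter, default=None)


# Toy 10-site sequences (hypothetical; real COI barcodes are about 648 sites).
toy = [
    ("sp_A", "ACGTACGTAC"),
    ("sp_A", "ACGTACGTAT"),
    ("sp_B", "TCGAACGGAC"),
    ("sp_B", "TCGAACGGAT"),
]
print(barcode_gap(toy))  # (0.1, 0.3)
```

When the second value clearly exceeds the first, as in this toy case, the species are separable by distance alone; overlapping values are where the challenges below become difficult.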

Why are these data interesting? In the two years since barcoding was proposed, more than 60,000 barcode records representing 10,000 species have been collected. The next two years will see an explosion of these standardized data in public repositories, creating an unparalleled opportunity to explore biological variability and its distribution within and among species.

Who is sponsoring this activity? The Consortium for the Barcode of Life (CBOL) is an international initiative supported by the Alfred P. Sloan Foundation and hosted by the Smithsonian Institution in Washington. CBOL has created a Data Analysis Working Group (DAWG) that has organized and will oversee this initiative. Funding for the activity is being provided by CBOL and the European Science Foundation, and additional funds are being requested from the U.S. National Science Foundation.

What are the goals of this initiative? DNA barcodes are emerging as a global standard for assigning biological specimens to their proper species, and new and more reliable methods will be needed to analyze, interpret, and visualize these data. The volume of barcode data is expanding rapidly. More than 60,000 barcode records representing more than 10,000 species have already been determined. In the next two years, these figures will probably increase to half a million barcode records for 50,000 species. The use of barcodes is increasing both in basic biological research and in applied settings such as border control of agricultural pests and enforcement of laws that protect endangered species. CBOL is constructing a Data Portal that will provide users with access to these data and the new analytical tools needed to handle the data.

What are some of the scientific and technical challenges associated with barcode data?

How can I find out more about the barcode initiative? Two brochures provide a general overview of DNA barcoding: "Barcoding Life: Ten Reasons" (available at http://phe.rockefeller.edu/barcode/docs/TenReasonsBarcoding.pdf) and "Barcoding Life, Illustrated" (http://phe.rockefeller.edu/PDF_FILES/BLIllustrated26jan04print%20v1-3.pdf).

DAWG I Challenge Problems:

I. The Dataset:

The data tables given at http://dimacs.rutgers.edu/Workshops/BarcodeDataSets/ are based on real barcode data, with some small modifications and disguises to make it hard to identify the species.

The dataset is organized in two data tables:

  1. training.csv: contains 1623 barcodes from 1623 specimens with their corresponding species names. The species names are computer generated by us in order to disguise the true names. They uniquely identify all the samples of each of the species.
  2. testblind.csv: contains 346 barcodes of specimens without species information.
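
As a starting point for working with the two tables, the sketch below reads barcode records from a CSV source. The column layout is not described here, so the header names "species" and "sequence" are assumptions that should be adjusted to match the actual files; the function name is also hypothetical.

```python
import csv
import io


def read_barcodes(handle, seq_col="sequence", species_col=None):
    """Read (species, sequence) pairs from a CSV handle that has a header row.

    seq_col and species_col are ASSUMED column names, not taken from the
    actual files. Leave species_col as None for a table without species
    labels, in which case species is returned as None.
    """
    records = []
    for row in csv.DictReader(handle):
        sequence = row[seq_col].strip().upper()
        species = row[species_col].strip() if species_col else None
        records.append((species, sequence))
    return records


# In-memory stand-in for a labelled table (toy data, not real barcodes).
sample = io.StringIO("species,sequence\nsp_A,acgtacgtac\nsp_B,TCGAACGGAC\n")
print(read_barcodes(sample, species_col="species"))

# With the real files, something like the following would apply,
# assuming the headers match:
#   with open("training.csv", newline="") as fh:
#       training = read_barcodes(fh, species_col="species")
```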

Example Datasets:

This is a small selection of barcode datasets extracted from BOLD. Although we encourage data analysis researchers to try as many datasets as they can, we think it is useful to have a small list of test datasets for comparing methods. This list is of course expandable and changeable; please send any comments or suggestions.

  1. Bats of Guyana: (BatGuyana.csv)
    DNA barcoding of Neotropical bats: species identification and discovery within Guyana, E. L. Clare, B. K. Lim, M. D. Engstrom, J. L. Eger, P. D. N. Hebert, doi:10.1111/j.1471-8286.2006.01657.x (PDF)

  2. Birds of North America - Phase II: (Bird2.csv)
    Comprehensive DNA barcode coverage of North American birds, K. C. R. Kerr, M. Y. Stoeckle, C. J. Dove, L. A. Weigt, C. M. Francis, P. D. N. Hebert, doi:10.1111/j.1471-8286.2006.01670.x (PDF)

  3. Hesperiidae of the ACG 1: (ACG.csv)
    Hajibabaei M., Janzen D. H., Burns J. M., Hallwachs W., and Hebert P. D. N. (2006) DNA barcodes distinguish species of tropical Lepidoptera. PNAS 103, 968-971 (PDF)

  4. Fishes of Australia Container Part: (FishAustralia.csv)
    DNA barcoding Australia's fish species, Ward et al., 2005. Phil. Trans. R. Soc. B (doi:10.1098/rstb.2005.1716, published online) (PDF)

II. List of challenges: (Select one or more - or invent your own)

  1. Assignment to known species. Construct a classification rule using the training set that assigns specimens in the testing set to the species contained in the training set. Your classification rule should both maximize correct assignments and minimize incorrect assignments.
  2. Data visualization. Most barcode studies display results as phenetic cluster diagrams, which can easily be confused with phylogenetic trees. An important challenge is to develop new visualization methods to display and analyze barcode data. These new methods should improve our ability to see and explore the degree and structure of variability within species and divergence among species. (Illustrate the new methods on the datasets provided.)
  3. Character-based approaches to barcode data. All barcoding studies to date have measured variation within species and divergence among species with phenetic measures - measures of distance based on overall similarity (or dissimilarity) among barcode sequences. Barcode sequences can also be compared using the nucleotides at equivalent sites in the sequence. This approach parallels the use of homologous characters in phylogenetic analysis, and there may be significant synergies with those analytical methods. Character-based approaches may also drastically reduce the barcode sequence lengths needed as diagnostics. The challenge is to develop protocols and software that analyze character-based barcode data to assign specimens to known species and to identify barcode clusters that might be new species.
  4. Detection of possible new species. A good portion of the specimens in the testing set do not belong to the species included in the training set. In addition to assigning specimens to their correct species, your classification rule should be able to do two things: (1) determine which specimens should not be assigned to known species, and (2) determine how these unclassified specimens in the testing set should be partitioned among potentially new species. Your classification rule should therefore identify clusters among unclassified specimens in the testing set that might be new species.
  5. Confidence measures. For some specimens the classification into known species might be unclear or borderline. One way to think more formally about these situations is to assign a measure of confidence to the assignments of specimens to species. Another is assigning measures of confidence to new clusters representing new species. The challenge is to propose ways to measure confidence of assignments to species and separation among clusters.
  6. Clusters within species. All species include some level of variation among individuals, and in some cases this variation takes the form of splits among local populations and even subspecies. The challenge is to develop a classification rule that will identify clusters within individual species that rise above background variation and therefore might represent subspecies or other significant biological units. This is similar to Challenge 4, but instead of clustering one big dataset you will cluster many smaller datasets, since you need to find clusters separately within each individual species.
  7. Sample size. The first four challenges are doable as long as there is a sufficient number of specimens per species. "Sufficient" will be a relative term, varying with a number of biological variables (population size, intraspecific variability, and gene flow are three important ones). The challenge is to provide guidelines for sample size - guidelines that will allow your clustering method and/or classification rule to produce decisions with a determined level of confidence. You should also explore how robust your method is to small absolute and relative sample sizes.
  8. Metric for barcode data. One important question is related to the clustering and classification methods needed for the first five challenges: What kind of distance metrics should be used to measure the difference (similarity) between barcodes? Barcode data is high-dimensional and categorical, and very little is known about how to analyze this kind of data. Many "off-the-shelf" methods can be applied to this data. One approach to improving the results obtained with "off-the-shelf" methods is by tapping into the specific structure of the data. One might start by analyzing the variability structure within species and between species and understanding the "correlation structure" of the data in high dimensions. How can one exploit the complexity of this correlation structure to obtain better results than the standard methods? How can one model this structure? Finally, you should address how new methods you propose compare in performance to "off-the-shelf" methods.
  9. New clustering methods. Clustering methods are challenged by small datasets and are often not robust under dynamically changing datasets. Will your approaches lead to some new clustering methods? More generally, will barcoding lead to some new clustering methods? How about new Bayesian clustering methods? The challenge is to describe such new methods and illustrate them on the datasets given.
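
To make the assignment, new-species, and confidence challenges concrete, here is a deliberately simple baseline sketch: nearest-neighbor assignment with a distance cutoff. It is not a method prescribed by the DAWG. The function names, the 0.15 cutoff, the toy sequences, and the use of the uncorrected proportion of differing sites are all illustrative assumptions. Specimens farther than the cutoff from every training barcode are left unassigned (the first part of the new-species challenge), and the gap between the nearest and second-nearest species serves as a crude confidence margin. Partitioning the unassigned specimens into candidate new species is not attempted here.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def assign(query, training, threshold):
    """Assign a query barcode to the nearest training species.

    training is a list of (species_name, sequence) pairs.
    Returns (label, nearest_distance, margin), where label is None if the
    nearest training barcode is farther away than threshold, and margin is
    the distance to the second-nearest species minus the nearest distance
    (None if the training set holds only one species).
    """
    if not training:
        raise ValueError("training set must not be empty")
    best = {}  # species -> smallest distance from query to that species
    for species, seq in training:
        d = p_distance(query, seq)
        if species not in best or d < best[species]:
            best[species] = d
    ranked = sorted(best.items(), key=lambda item: item[1])
    nearest_species, nearest_d = ranked[0]
    margin = ranked[1][1] - nearest_d if len(ranked) > 1 else None
    label = nearest_species if nearest_d <= threshold else None
    return label, nearest_d, margin


# Toy 10-site data (hypothetical; the cutoff of 0.15 is arbitrary).
training = [
    ("sp_A", "ACGTACGTAC"),
    ("sp_A", "ACGTACGTAT"),
    ("sp_B", "TCGAACGGAC"),
]
print(assign("ACGTACGTAC", training, threshold=0.15))  # ('sp_A', 0.0, 0.3)
print(assign("GGGGGGGGGG", training, threshold=0.15)[0])  # None
```

A baseline of this kind is mainly useful as an "off-the-shelf" reference point against which more refined classification rules, distance metrics, and confidence measures can be compared.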

To learn more about DNA barcoding, please visit the websites for the Consortium for the Barcode of Life (CBOL; www.barcoding.si.edu) and the Guelph Centre for DNA Barcoding (www.barcodeoflife.org). These sites provide links to scientific publications on barcoding as well as guidance to barcoding projects.

Are barcode data available to work with? DIMACS has assembled sets of barcode data, and the Barcode of Life Database (BoLD) at the University of Guelph's Centre for DNA Barcoding has many datasets that are available to the public.


How can I get involved? The first steps are to:


Document last modified on May 31, 2007.