Research Challenges Related to Data Analysis and the DNA Barcode of Life Initiative: Data Sets


I. The Dataset:

The data tables provided here are based on real barcode data, with small modifications and disguises that make the species hard to identify.

The dataset is organized in two data tables:

  1. training.csv: contains 1623 barcodes from 1623 specimens with their corresponding species names. The species names were computer-generated by us to disguise the true names; each generated name uniquely identifies all the samples of one species.
  2. testblind.csv: contains 346 barcodes of specimens without species information.
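The column layout of the two files is not described here, so the following is only a minimal sketch of how the training table might be read. It ASSUMES that each row holds a species name followed by a barcode sequence and that there is no header row; the function names and the sample rows are hypothetical and should be adapted to the actual files.

```python
import csv
import io
from collections import Counter


def read_training(handle):
    """Read (species, barcode) pairs from a CSV handle.

    ASSUMPTION: column 1 is the species name and column 2 is the sequence.
    """
    rows = []
    for record in csv.reader(handle):
        if len(record) < 2:
            continue  # skip blank or malformed rows
        species = record[0].strip()
        sequence = record[1].strip().upper()
        rows.append((species, sequence))
    return rows


def specimens_per_species(rows):
    """Count how many specimens carry each species name."""
    return Counter(species for species, _ in rows)


# Made-up rows standing in for training.csv (not real barcode data).
sample = io.StringIO("sp_A,ACGTAC\nsp_A,ACGTAT\nsp_B,TTGTAC\n")
training_rows = read_training(sample)
counts = specimens_per_species(training_rows)
```

With the real file, the in-memory sample would be replaced by something like open("training.csv", newline=""). The per-species counts give a first look at how many specimens are available for each species, which matters for the sample-size question.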

Example Datasets:

This is a small selection of barcode datasets extracted from BOLD. Although we encourage data analysis researchers to try as many datasets as they can, we think it is useful to have a small list of test datasets for making comparisons between methods. This list is of course expandable and changeable; please send any comments or suggestions.

  1. Bats of Guyana: (BatGuyana.csv)
    DNA barcoding of Neotropical bats: species identification and discovery within Guyana, E. L. Clare, B. K. Lim, M. D. Engstrom, J. L. Eger, P. D. N. Hebert, doi:10.1111/j.1471-8286.2006.01657.x

  2. Birds of North America - Phase II: (Bird2.csv)
    Comprehensive DNA barcode coverage of North American birds, K. C. R. Kerr, M. Y. Stoeckle, C. J. Dove, L. A. Weigt, C. M. Francis, P. D. N. Hebert, doi:10.1111/j.1471-8286.2006.01670.x

  3. Hesperiidae of the ACG 1: (ACG.csv)
    Hajibabaei M., Janzen D. H., Burns J. M., Hallwachs W., and Hebert P. D. N. (2006) DNA barcodes distinguish species of tropical Lepidoptera. PNAS 103, 968-971

  4. Fishes of Australia Container Part: (FishAustralia.csv)
    DNA barcoding Australia's fish species, Ward et al., 2005. Phil. Trans. R. Soc. B (doi:10.1098/rstb.2005.1716, published online)

The data sets are provided in conjunction with the DAWG I Challenge Problems (see: http://dimacs.rutgers.edu/Workshops/BarcodeResearchChallenges2007/) provided by the Data Analysis Working Group of the Consortium for the Barcode of Life.

II. List of challenges:

  1. Construct a classification rule using the training set that assigns some of the specimens in the testing set to the species contained in the training set.
  2. A good portion of the specimens in the testing set do not belong to the species included in the training set. Your classification rule should also be able to determine which specimens of the testing set belong to new species. Are there clusters among the testing samples that permit one to identify new species?
  3. For some specimens the classification into known species might be unclear or borderline. One way to think more formally about these situations is to assign a measure of confidence to the assignments of specimens to species. Another is assigning measures of confidence to new clusters representing new species. What are some good ways to measure confidence?
  4. Identification of clusters within individual species to identify subspecies: This is similar to Challenge 2 but instead of clustering a big dataset you will cluster many smaller datasets since you need to find clusters for each individual species separately.
  5. Sample size. The first four challenges are doable as long as there is sufficient data. We want you to address the issue of sample size necessary for your clustering method or classification rule to identify species, or subspecies. The challenge is to provide guidelines for sample size.
  6. One important question related to the clustering and classification methods needed for the first five challenges is what kind of metric should be used. Barcode data is high-dimensional and categorical, and very little is known about how to analyze this kind of data. One approach to improving on the results obtained with "off-the-shelf" methods is to tap into the specific structure of the data. We might start by analyzing the variability structure within species and between species and understanding the "correlation structure" of the data in high dimensions. How can one exploit the complexity of this correlation structure to obtain better results than the standard methods? How can one model this structure? Will this lead to some new clustering methods, perhaps some new form of Bayesian clustering?
  7. Another important challenge is to develop new visualization methods to display and analyze barcode data.
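
As a starting point for Challenges 1 to 3, here is a minimal baseline sketch, not a recommended solution. It assigns each test barcode to the species of its nearest training barcode under the p-distance (the proportion of compared sites that differ), flags it as a possible new species when even the nearest training barcode is farther than a threshold, and reports the gap between the nearest and second-nearest species as a crude confidence score. The sketch ASSUMES aligned sequences of equal length; the 0.02 threshold, the "NEW" label, the function names, and the toy sequences are all hypothetical choices made for illustration.

```python
def p_distance(a, b):
    """Proportion of compared sites at which two aligned sequences differ.

    Sites where either sequence has a gap ('-') or an 'N' are skipped.
    """
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    # With no comparable sites, treat the pair as maximally distant.
    return differing / compared if compared else 1.0


def classify(query, training, threshold=0.02):
    """Return (label, nearest_distance, margin) for one query barcode.

    training is a list of (species, sequence) pairs.
    threshold is a HYPOTHETICAL cut-off for flagging a possible new species.
    """
    best = {}  # smallest distance from the query to each species
    for species, seq in training:
        d = p_distance(query, seq)
        if species not in best or d < best[species]:
            best[species] = d
    ranked = sorted(best.items(), key=lambda item: item[1])
    nearest_species, nearest = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 1.0
    margin = runner_up - nearest  # larger margin = less ambiguous assignment
    label = nearest_species if nearest <= threshold else "NEW"
    return label, nearest, margin


# Toy data, not real barcodes.
toy_training = [
    ("sp_A", "ACGTACGTAC"),
    ("sp_A", "ACGTACGTAT"),
    ("sp_B", "TTGTACGGAC"),
]
known = classify("ACGTACGTAC", toy_training)
novel = classify("GGGTTTGTAC", toy_training)
```

A single global threshold ignores the fact that within-species variability differs from species to species, which is exactly the kind of structure Challenge 6 asks about; a per-species threshold or a model-based confidence measure would be natural refinements.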

Document last modified on May 24, 2007.