December 10, 2013
Making sense --- or not --- of deep DNA sequencing data
DNA sequencing from populations of non-identical cells is increasingly used to probe their diversity. Typically, the DNA is pooled, one or more chosen genomic regions are amplified with PCR, and all the resulting amplicons are sequenced together, yielding as many as 10^8 similar sequencing reads. The most common application is sequencing 16S ribosomal RNA from natural bacterial populations to classify and quantify which "species" are present. But disentangling the true diversity from errors caused by PCR and sequencing is a major challenge. A new algorithm will be presented that develops a model of the error processes from the data set itself (without training) and identifies the true sequences and their abundances. The algorithm is used to probe extensive fine-scale diversity within a single species of cyanobacteria from a Yellowstone hot spring. Quantitative analysis of the spectrum of abundances of the observed sequences and the relationships between them, in terms of population genetics models, yields puzzling results about the population dynamics and micro-ecology.
The use of deep DNA sequencing to follow the real-time evolutionary dynamics of microbial populations in the laboratory, made possible by "barcoding" cells, and the quantitative challenges this presents, will also be discussed.