
Daniel Fisher

Stanford University

December 10, 2013

Making sense --- or not --- of deep DNA sequencing data

DNA sequencing of populations of non-identical cells is increasingly used to probe their diversity. Typically, the DNA is pooled, one or more chosen genomic regions are amplified with PCR, and all the resulting amplicons are sequenced together, yielding as many as 10^8 similar sequencing reads. The most common application is sequencing 16S ribosomal RNA from natural bacterial populations to classify and quantify what "species" are present. But disentangling the true diversity from errors caused by PCR and sequencing is a major challenge. A new algorithm will be presented that develops a model of the error processes from the data set itself (without training) and identifies the true sequences and their abundances. This algorithm is used to probe extensive fine-scale diversity within a single species of cyanobacteria from a Yellowstone hot spring. Quantitative analysis, in terms of population genetics models, of the spectrum of abundances and of the relationships between the observed sequences yields puzzling results about the population dynamics and micro-ecology.

The use of deep DNA sequencing to follow the real-time evolutionary dynamics of microbial populations in the laboratory by "barcoding" cells, and the quantitative challenges this presents, will also be discussed.