Latino Studies at New York University

Alexander Alekseyenko

NYU School of Medicine

October 4, 2011

Estimating multiple haplotypes and their frequencies from next-generation sequencing data

Next-generation sequencing techniques allow for an in-depth examination of the variability of viral, bacterial and eukaryotic nucleic acid sequences in heterogeneous samples. One of the drawbacks of these methods, however, is the shortness of the sequencing reads they produce, which limits the ability to assign individual variants to (nearly) full-length haplotypes. Moreover, it is hard to establish the total number of haplotypes and the frequencies at which they are present in the analyzed sample. To tackle these problems we present a statistical technique (OMNIPLOID) that bridges the short reads to produce a minimal set of haplotypes that explains the data. The method estimates the frequencies of the haplotypes in this set, which together completely explain the observed variants, and effectively removes haplotypes with little support from the sequencing data. OMNIPLOID operates in the regularized maximum likelihood framework and is fitted using a majorization-minimization (MM) algorithm, which provides fast convergence, while exploiting algorithmic heuristics to handle large amounts of data efficiently.
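To make the idea of frequency estimation with removal of weakly supported haplotypes concrete, here is a minimal sketch in Python. It is not OMNIPLOID itself: the abstract does not specify the penalty or the read model, so this example assumes a 0/1 read-to-haplotype compatibility matrix, uses a plain EM update (EM is one instance of the MM family), and stands in for the regularization with a simple pruning threshold. The function name `estimate_haplotype_frequencies`, the parameter `prune_below`, and the toy data are all hypothetical.

```python
import numpy as np

def estimate_haplotype_frequencies(compat, n_iter=500, prune_below=1e-3, tol=1e-10):
    """Illustrative sketch only, not the OMNIPLOID algorithm.

    compat[i, j] is 1 if read i is consistent with candidate haplotype j,
    else 0 (an assumed read model). Every read must be consistent with at
    least one candidate haplotype.
    """
    compat = np.asarray(compat, dtype=float)
    n_haps = compat.shape[1]
    freqs = np.full(n_haps, 1.0 / n_haps)  # start from uniform frequencies

    for _ in range(n_iter):
        # Share each read among the haplotypes it is consistent with,
        # in proportion to the current frequency estimates.
        weighted = compat * freqs
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        new_freqs = resp.mean(axis=0)

        # Stand-in for regularization: drop haplotypes with little support,
        # but only if every read is still explained by a surviving haplotype.
        keep = new_freqs >= prune_below
        if (compat[:, keep].sum(axis=1) > 0).all():
            new_freqs = np.where(keep, new_freqs, 0.0)
        new_freqs = new_freqs / new_freqs.sum()

        converged = np.abs(new_freqs - freqs).max() < tol
        freqs = new_freqs
        if converged:
            break
    return freqs

# Toy data: 6 reads match only haplotype 0, 3 reads match only haplotype 1,
# and 1 read matches both haplotype 0 and haplotype 2.
compat = np.array([[1, 0, 0]] * 6 + [[0, 1, 0]] * 3 + [[1, 0, 1]])
freqs = estimate_haplotype_frequencies(compat)
print(freqs)  # haplotype 2 is dropped; the others come out near 0.7 and 0.3
```

In the toy data, haplotype 2 is never needed to explain a read that haplotype 0 cannot already explain, so its estimated frequency shrinks at each iteration until it falls below the threshold and is removed. A penalized likelihood would achieve a similar effect through the objective function rather than through a hard cutoff.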