Computational Biology Program, NYU
December 13, 2011
Integrative Biclustering of DNase hypersensitivity and binding site information for learning conserved co-regulated groups active during hematopoietic stem cell differentiation
Data integration is particularly relevant given the recent and rapid expansion of publicly available biological data. We work on T-cell and B-cell differentiation and function, where recent datasets include, but are not limited to: ENCODE data (such as DNase hypersensitivity and ChIP-seq data), RNA-seq following genetic and environmental perturbation of key cell types, measurements of the phosphorylation states of key proteins, ChIP-seq of several chromatin marks and histone modifications, and DNA cross-linking experiments. One way to integrate and query these massive datasets is to group genomic elements into modules, which dramatically reduces the effective complexity of a given dataset. A natural first step towards this is learning co-regulated clusters. Early efforts often assumed that genes cluster across all observed cell states (or genetic backgrounds) and that each gene belongs to only a single cluster. Current approaches account for genes that participate in multiple clusters and for clusters that are condition-specific. Biclustering (condition- or cell-state-specific clustering) allows genes that are not expressed over a significant portion of conditions to be incorporated. Here we present a method for Integrative Biclustering, multi-species cMonkey, that has been specifically expanded to integrate chromatin state, DNase hypersensitivity data, and binding site models derived from ChIP-seq and protein-DNA binding arrays. We demonstrate the added value of this expanded bicluster model in the context of white blood cell differentiation, with the aim of broadly characterizing immune system development. We will also demonstrate our system for exploring and visualizing these biclusters.
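To make the idea of a bicluster concrete, the sketch below scores a subset of genes over a subset of conditions using the mean squared residue, a generic coherence measure from the biclustering literature (Cheng and Church). This is an illustration only and is not the scoring used by multi-species cMonkey; the function name `mean_squared_residue` and the toy expression matrix are assumptions made for the example.

```python
def mean_squared_residue(expr, rows, cols):
    """Mean squared residue of the submatrix expr[rows][cols].

    A score of 0 means the selected genes follow the same additive
    pattern across the selected conditions; larger values mean the
    submatrix is less coherent.
    """
    sub = [[expr[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])

    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows

    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical toy matrix: 4 genes (rows) x 5 conditions (columns).
# Genes 0-2 share an additive pattern only in conditions 0-2.
expr = [
    [1.0, 2.0, 3.0, 9.0, 0.0],
    [2.0, 3.0, 4.0, 1.0, 7.0],
    [3.0, 4.0, 5.0, 5.0, 2.0],
    [8.0, 0.0, 6.0, 2.0, 4.0],
]

# Score for the condition-specific gene subset (the bicluster).
bicluster_score = mean_squared_residue(expr, [0, 1, 2], [0, 1, 2])
# Score when all genes are forced to cluster across all conditions.
full_score = mean_squared_residue(expr, [0, 1, 2, 3], [0, 1, 2, 3, 4])

print(bicluster_score, full_score)
```

In this toy case the three genes are perfectly coherent over the first three conditions but not over the full matrix, which is the situation that clustering across all conditions would miss and that condition-specific clustering is designed to capture.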