Computational Methods for ChIP-seq Data Analysis and Applications
Overview
Abstract
The development of chromatin immunoprecipitation followed by sequencing (ChIP-seq) technology has enabled the construction of genome-wide maps of protein-DNA interactions. Such maps provide information about transcriptional regulation at the epigenetic level (histone modifications and histone variants) and at the level of transcription factor (TF) activity. This dissertation presents novel computational methods for ChIP-seq data analysis and applications, and addresses four main challenges. First, I address the problem of detecting histone modifications in ChIP-seq data from cancer samples. The presence of copy number variations (CNVs) in cancer samples introduces statistical biases that lead to inaccurate predictions when standard methods are used. To overcome this issue, I developed HMCan, an algorithm designed specifically to handle ChIP-seq cancer data by accounting for the presence of CNVs. On ChIP-seq data from cancer cells, HMCan produces unbiased and accurate predictions, in contrast to standard state-of-the-art methods. Second, I address the problem of identifying changes in histone modifications between two ChIP-seq samples with different genetic backgrounds (for example, cancer vs. normal). In addition to CNVs, differences in antibody efficiency between samples and the presence of sample replicates make this problem challenging. To overcome these issues, I developed the HMCan-diff algorithm as an extension of HMCan. HMCan-diff implements robust normalization methods to address the challenges listed above, and it significantly outperforms other state-of-the-art methods on data containing cancer samples. Third, I investigate and analyze the predictions of different ChIP-seq-based enhancer prediction methods. The analysis shows that predictions generated by different methods overlap poorly. To overcome this issue, I developed DENdb, a database that integrates enhancer predictions from different methods.
DENdb also integrates several types of experimental data, including ChIP-seq data for TF binding sites. Finally, I present an extensive computational comparison of ab initio motif identification methods based on TF ChIP-seq data. The comparison covered 10 methods across 159 TF datasets. Its results indicate that simple methods outperform high-order models.
Brief Biography
Haitham Ashoor is a Ph.D. candidate in the Computer Science program at King Abdullah University of Science and Technology (KAUST), working under the supervision of Prof. Vladimir Bajic. He obtained his BSc degree in Computer Engineering from the University of Jordan, Amman, Jordan, in 2008. He joined KAUST as an MSc student in 2009 and completed his master's studies in 2011, also under the supervision of Prof. Vladimir Bajic. His main research interests are the development of computational methods for next-generation sequencing data analysis and applications of machine learning in bioinformatics. His research has resulted in several peer-reviewed publications in high-quality journals.