Abstract
Genome annotation is important because it provides the foundation for downstream genomic and biological research. It can be viewed as a way of summarizing part of the existing knowledge about the genomic characteristics of an organism. Annotating the different regions of a genome sequence is known as structural annotation, while identifying the functions of these regions is known as functional annotation. In silico approaches can facilitate both tasks, which would otherwise be difficult and time-consuming. This study contributes to genome annotation by introducing several novel bioinformatics methods, some of which are based on machine learning (ML) approaches.
First, we present Dragon PolyA Spotter (DPS), a method for accurate identification of polyadenylation signals (PAS) within human genomic DNA sequences. For this, we derived a novel feature set that characterizes properties of the genomic region surrounding the PAS, enabling the development of optimized, highly accurate ML predictive models. DPS considerably improved on state-of-the-art results.
The second contribution concerns the development of generic models for structural annotation, i.e., the recognition of different genomic signals and regions (GSR) within eukaryotic DNA. We developed DeepGSR, a systematic framework that facilitates the generation of ML models that predict GSR with high accuracy. To the best of our knowledge, no generic, automated method is available for this task, even though such a method could facilitate studies of newly sequenced organisms. The prediction module of DeepGSR uses deep learning algorithms to derive highly abstract features, whose quality depends mainly on proper data representation and hyperparameter calibration. DeepGSR was evaluated on the recognition of PAS and translation initiation sites (TIS) in different organisms. Compared to some hand-tailored models, it yields a simpler and more precise representation of the problem under study while producing highly accurate predictions.
Finally, we focus on deriving a model capable of facilitating the functional annotation of prokaryotes. As far as we know, there is no fully automated system for detailed comparison of functional annotations generated by different methods. Hence, we developed BEACON, a method and supporting system that compares gene annotations from various methods to produce a more reliable and comprehensive annotation. Overall, our results contribute to different aspects of genome annotation.
Brief Biography
Manal Kalkatawi received her B.S. in Computer Science from King Abdulaziz University (KAU) in 2008 and her M.S. in Computer Science from King Abdullah University of Science and Technology (KAUST) in 2011. She is currently a Ph.D. candidate in the CBRC group at KAUST, focusing on bioinformatics, data mining, and machine/deep learning. Manal’s work has appeared in top venues, including Bioinformatics and BMC Genomics. Details can be found at https://scholar.google.com/citations?user=YJgNij8AAAAJ&hl=en