Machine-learning (ML) techniques have been widely applied to problems in biology. However, biological data are large and complex, which often leads to extremely intricate ML models that perform poorly or are computationally infeasible. This dissertation presents a set of novel computational methods centered on genetic algorithms (GAs) for simplifying and optimizing ML models, and applies them to biological problems. It addresses three challenges.
The first challenge is to develop a generalizable classification methodology that systematically derives competitive models regardless of the complexity and nature of the data. Although several algorithms for inducing classification models have been proposed, their performance is data dependent. We therefore developed OmniGA, a novel and generalizable framework that combines different classification models in a decision-tree-like structure and uses a parallel GA to optimize that structure. Results show that OmniGA consistently outperformed commonly used classification models.
The second challenge is the prediction of translation initiation sites in plant genomic DNA. We performed a statistical analysis of the genomic DNA and proposed a new set of discriminant features for this problem. We developed a GA-based wrapper method for selecting an optimal feature subset, which, in conjunction with a classification model, produced the most accurate framework for recognizing translation initiation sites in plants. The results further demonstrate that, despite the evolutionary distance between different plants, our approach identified conserved genomic elements that may serve as a starting point for developing a model for eukaryotic organisms in general.
The third challenge is the accurate prediction of polyadenylation signals in human genomic DNA. To this end, we analyzed genomic DNA sequences for the 12 most frequent polyadenylation signal variants and proposed a new set of features that may contribute to the understanding of the polyadenylation process. We derived Omni-PolyA, a model and tool based on OmniGA for predicting polyadenylation signals. Results show that Omni-PolyA significantly reduced the average classification error rate compared with state-of-the-art results.
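As a rough illustration of the GA-based wrapper idea mentioned for the second challenge, the sketch below evolves binary feature masks and scores each mask by the validation accuracy of a classifier trained on the selected features. It is not the dissertation's implementation: the nearest-centroid classifier, the per-feature penalty, tournament selection, one-point crossover, bit-flip mutation, elitism, the toy data generator, and every function and parameter name are assumptions chosen only to keep the example short and self-contained.

```python
import random


def accuracy(mask, train, valid):
    """Validation accuracy of a nearest-centroid classifier on the masked features."""
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # a mask that selects nothing cannot classify
    # Accumulate per-class sums over the selected columns, then average.
    sums, counts = {}, {}
    for x, label in train:
        acc = sums.setdefault(label, [0.0] * len(cols))
        for j, c in enumerate(cols):
            acc[j] += x[c]
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def sq_dist(x, centroid):
        return sum((x[c] - m) ** 2 for c, m in zip(cols, centroid))

    hits = 0
    for x, label in valid:
        pred = min(centroids, key=lambda lab: sq_dist(x, centroids[lab]))
        hits += pred == label
    return hits / len(valid)


def fitness(mask, train, valid, penalty=0.01):
    """Wrapper fitness: accuracy minus an assumed small cost per selected feature."""
    return accuracy(mask, train, valid) - penalty * sum(mask)


def ga_select(train, valid, n_features, pop_size=20, generations=30,
              mutation_rate=0.1, seed=0):
    """Evolve feature masks; assumes n_features >= 2 and pop_size >= 3."""
    rng = random.Random(seed)

    def score(mask):
        return fitness(mask, train, valid)

    # Seed the population with the all-features baseline plus random masks.
    population = [[1] * n_features]
    population += [[rng.randint(0, 1) for _ in range(n_features)]
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        # Elitism: carry the current best mask over unchanged.
        next_gen = [max(population, key=score)[:]]
        while len(next_gen) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_features)
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=score)


def make_data(n, n_features, seed):
    """Toy two-class data in which only feature 0 carries signal."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0, 1) for _ in range(n_features)]
        x[0] += 3.0 * label
        rows.append((x, label))
    return rows


if __name__ == "__main__":
    train = make_data(60, 6, seed=1)
    valid = make_data(60, 6, seed=2)
    best = ga_select(train, valid, n_features=6)
    print("selected mask:", best, "fitness:", fitness(best, train, valid))
```

Because the all-features mask is placed in the initial population and the best individual is always carried over, the returned mask never scores worse than using every feature under this fitness. In practice the single held-out split would usually be replaced by cross-validation, and the classifier and features would be problem specific.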
Arturo Magana-Mora is a Ph.D. candidate at King Abdullah University of Science and Technology (KAUST) under the supervision of Prof. Vladimir B. Bajic. He received his B.S. degree in Computer Engineering from Universidad Autonoma de San Luis Potosi (UASLP), Mexico, in 2009, along with the award for the most outstanding academic trajectory. He then joined KAUST and received his M.Sc. degree in Computer Science in 2011, also under the supervision of Prof. Bajic. His research interests include the development of novel machine-learning and data-mining techniques to address complex problems in biology. His research has resulted in peer-reviewed publications in high-quality journals.