The number of available protein sequences is rapidly increasing, mainly as a consequence of the development and application of high-throughput sequencing technologies in the life sciences. A key question in the life sciences is to identify the functions of proteins, and furthermore to identify the phenotypes that may be associated with a loss (or gain) of function in these proteins. Protein functions are generally determined experimentally, and experimental determination will not scale to the current, and rapidly increasing, number of available protein sequences (over 300 million). Furthermore, identifying phenotypes resulting from loss of function is even more challenging, as the phenotype is modified by whole-organism interactions and environmental variables. Accurate computational prediction of protein functions and loss-of-function phenotypes would therefore be of significant value both to academic research and to the biotechnology industry.
We developed and expanded novel methods for representation learning and for predicting protein functions and their loss-of-function phenotypes. We use deep neural network algorithms and combine them with symbolic inference into neural-symbolic algorithms. Our work significantly improves on previously developed methods for predicting protein functions through methodological advances in machine learning, incorporation of broader data types that may be predictive of functions, and improved systems for neural-symbolic integration.
The methods we developed are generic and can be applied to other domains in which similar types of structured and unstructured information exist.
In the future, our methods can be applied to the prediction of protein functions for metagenomic samples in order to evaluate the potential for discovery of novel proteins of industrial value. Our methods can also be applied to the prediction of loss-of-function phenotypes in human genetics, and the results can be incorporated into a variant prioritization tool for diagnosing patients with Mendelian disorders.
Maxat Kulmanov is a PhD candidate in Computer Science at the Computational Bioscience Research Center under the supervision of Prof. Robert Hoehndorf. He obtained his bachelor's and master's degrees in Information Systems from Kazakh-British Technical University in 2010. He then worked as a software developer and a lecturer for several years. He joined the KAUST PhD program in 2015.
His research interests are bioinformatics, knowledge representation and reasoning, machine learning, neural networks, the Semantic Web, and algorithms.