Abstract:
Machine learning models such as AlphaFold can predict the 3D conformation of a protein from its primary sequence with near-experimental accuracy, which has prompted a number of studies that predict protein function from 3D structure. Almost all of these studies use graph neural networks (GNNs) to learn the 3D structures of proteins from 2D contact maps/graphs. Most of them also use rich 1D features, such as ESM and LSTM embeddings, in addition to the contact graph.
These rich 1D features essentially obscure the learning capability of GNNs. In this thesis, we evaluate the learning capabilities of GCNs from contact map graphs in the existing framework, where we attempt to incorporate distance information for better predictive performance. We found that GCNs fall far short of 1D-CNNs without language models, even with distance information. Consequently, we further investigate the capability of GCNs to distinguish subgraph patterns corresponding to InterPro domains. We found that GCNs perform better than an MLP on highly rich sequence embeddings in recognizing these structural patterns. Finally, we investigate the capability of GCNs to predict GO terms (functions) individually. We found that GCNs perform almost on par in identifying GO terms in the presence of only hard positive and hard negative examples. We also identified some GO terms that are indistinguishable by both GCNs and the ESM2-based MLP. These findings raise new research questions for future work.
Bio:
Nurul Muttakin is an M.S. candidate in the Bio-Ontology Research Group (BORG), under the supervision of Associate Professor Robert Hoehndorf. He completed his B.Sc. in Computer Science and Engineering at Bangladesh University of Engineering and Technology, where he developed a strong foundation in computer science and engineering.
Nurul's research interests are in the application of Artificial Intelligence to bioinformatics tasks such as protein function prediction, as well as in understanding the theory behind it.