Protein-protein interactions drive many important biological events, such as infection, replication, and recognition. To control or engineer such events, we need access to the molecular details of the interaction, which experimental 3D structures provide. However, such experiments are slow and expensive; moreover, current technology cannot keep up with the high discovery rate of new interactions. Computational modeling, such as protein-protein docking, can help fill this gap by generating docking poses. Protein-protein docking generally consists of two parts: sampling and scoring. Sampling is an exhaustive search of three-dimensional space. The caveat is that sampling produces a large number of incorrect poses, resulting in a highly unbalanced dataset, which limits the utility of the data for training machine learning classifiers. Using weak supervision, we developed a data augmentation method that we named hAIkal. Using hAIkal, we increased the amount of labeled training data and trained several algorithms. The best of the resulting classifiers achieves 81% accuracy and a Matthews correlation coefficient (MCC) of 0.51 on the test set, surpassing state-of-the-art scoring functions.
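The abstract reports MCC alongside accuracy. The following minimal sketch illustrates why the two can differ on unbalanced data such as docking poses. The toy labels (10 correct poses out of 100) and the two hypothetical scorers are assumptions made for illustration only; they are not data or results from this work.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = correct pose)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are right."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Toy, made-up labels: 10 correct poses followed by 90 incorrect ones.
y_true = [1] * 10 + [0] * 90

# Hypothetical scorer A: labels every pose as incorrect.
always_incorrect = [0] * 100

# Hypothetical scorer B: recovers 6 of the 10 correct poses, with 4 false positives.
some_skill = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

for name, y_pred in [("always incorrect", always_incorrect), ("some skill", some_skill)]:
    counts = confusion_counts(y_true, y_pred)
    print(f"{name}: accuracy={accuracy(*counts):.2f}, MCC={mcc(*counts):.2f}")
# always incorrect: accuracy=0.90, MCC=0.00
# some skill: accuracy=0.92, MCC=0.56
```

In this toy setting, the scorer that never identifies a correct pose still reaches 90% accuracy, while its MCC is zero. This is one reason a class-balance-aware metric is useful next to accuracy when most sampled poses are incorrect.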
Didier Barradas-Bautista received his Ph.D. degree from Barcelona University in 2017. His research interests include applying machine learning and artificial intelligence to protein modeling, biochemistry, bioinformatics, and network biology. He is currently a Postdoctoral Fellow in the Catalysis Center department at the King Abdullah University of Science and Technology.