An iterative machine learning approach has identified elusive 800-million-year-old amino acid patterns that facilitate protein interactions.
Leucine-aspartic acid (LD) motifs are short amino acid sequences embedded within some proteins that link them to cellular molecules controlling cell adhesion, motility and survival. They are also known to play a role in the spread of cancer cells and in cardiovascular and infectious diseases. LD motifs were first revealed in 1996 in a family of proteins called paxillin. Only three other LD-motif-containing proteins have been discovered since then, and scientists do not know the importance of LD motifs or how many other types of proteins contain them.
KAUST structural biologist Stefan Arold and computational bioscientists Xin Gao and Vladimir Bajic combined the efforts of their teams to develop a machine learning tool that they called LD Motif Finder (LDMF) to scan through the human proteome and identify LD motif patterns. This was no small task given the tiny number of known LD-motif-containing proteins that could be used to train the tool.
The team "taught" their computational tool using biophysical and structural data from known LD motifs and their proteins. To improve the accuracy of their algorithm, they included a round of experimental testing of its initial predictions and trained the tool to learn from these results.
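The predict, test and retrain cycle described above can be illustrated with a minimal Python sketch. This is not the LDMF implementation: the position-frequency scorer, the eight-residue window, the threshold, the example sequences and the `assay` function (a stand-in for a laboratory experiment) are all assumptions made purely for illustration.

```python
from collections import Counter

WINDOW = 8  # hypothetical motif length, chosen only for this toy example


def train(positives):
    """Build per-position residue frequencies from confirmed motifs."""
    counts = [Counter() for _ in range(WINDOW)]
    for motif in positives:
        for i, residue in enumerate(motif):
            counts[i][residue] += 1
    total = len(positives)
    return [{res: n / total for res, n in c.items()} for c in counts]


def score(model, window):
    """Mean per-position frequency of the window's residues (0 to 1)."""
    return sum(model[i].get(res, 0.0) for i, res in enumerate(window)) / WINDOW


def scan(model, sequence, tested, threshold):
    """Return untested windows that score at or above the threshold."""
    hits = []
    for start in range(len(sequence) - WINDOW + 1):
        window = sequence[start:start + WINDOW]
        if window not in tested and score(model, window) >= threshold:
            hits.append(window)
    return hits


def refine(positives, sequences, assay, rounds=2, threshold=0.6):
    """Alternate prediction and (simulated) experimental testing."""
    positives = list(positives)
    rejected = set()
    for _ in range(rounds):
        model = train(positives)           # retrain on everything confirmed so far
        tested = set(positives) | rejected
        candidates = set()
        for seq in sequences:
            candidates.update(scan(model, seq, tested, threshold))
        for window in sorted(candidates):
            if assay(window):              # stand-in for a binding experiment
                positives.append(window)
            else:
                rejected.add(window)
    return positives, rejected


# Invented sequences, used only to exercise the loop.
known = ["LDELLAAL", "LDQLLEEL"]
proteome = ["MKLDALLAELGG", "GGLEELLAALGG"]
confirmed, discarded = refine(known, proteome, assay=lambda w: w.startswith("LD"))
print(confirmed, discarded)
```

Each round retrains the scorer on every motif confirmed so far, so a validated prediction broadens what the next scan can recognize, while rejected candidates are remembered and never retested.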
A final step, performed in collaboration with KAUST colleagues Mariusz and Lukasz Jaremko, involved three-dimensional structural analyses of the association between newly identified LD motifs and known LD motif-binding proteins.
Using this integrative approach, the researchers were able to identify 12 new human proteins that carry functional LD motifs. “This gives us a good idea of how many of these motifs exist within the human proteome,” says Arold. “It seems there are far fewer than researchers initially suggested. Of course, this does not mean that they are biologically irrelevant.”