
A blindfold approach improves machine learning privacy
A query-based method for extracting knowledge from sensitive datasets without exposing any of the underlying private data could resolve long-standing privacy concerns in machine learning.
A key challenge lies in balancing patient privacy with the opportunity to improve future outcomes when training artificial intelligence (AI) models for applications such as medical diagnosis and treatment. A KAUST-led research team has now developed a machine-learning approach that allows relevant knowledge about a patient’s unique genetic, disease and treatment profile to be passed between AI models without transferring any original data.
“When it comes to machine learning, more data generally improves model quality,” says Norah Alballa, a computer scientist from KAUST. “However, much data is private and hard to share due to legal and privacy concerns. Collaborative learning is an approach that aims to train models without sharing private training data for enhanced privacy and scalability. Still, existing methods often fail in heterogeneous data environments when local data representation is insufficient.”
Learning from sensitive data while preserving privacy is a long-standing problem in AI: it restricts access to large datasets, such as clinical records, that could greatly accelerate research and improve the effectiveness of personalized medicine.
One way to maintain privacy in machine learning is to split a dataset and train separate AI models on individual subsets. Each trained model can then share only what it has learned, rather than the underlying data, without breaching privacy.
This approach, known as federated learning, can work well when the datasets are largely similar, but when distinctly different datasets form part of the training library, the machine-learning process can break down.
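As a rough illustration of that pattern (a generic federated-averaging sketch under assumed names and data, not the KAUST team's method), the hypothetical clients below each fit a small model on their own private split and share only the learned weights with a central aggregator.

```python
# Minimal sketch of the federated-learning pattern described above
# (illustrative only; not the KAUST team's method). Each hypothetical
# "client" fits a model on its own private split and shares only weights.
import numpy as np

rng = np.random.default_rng(0)

def local_train(X, y, epochs=200, lr=0.1):
    """Fit a tiny logistic-regression model on one client's private data."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))          # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)      # gradient step
    return w

# Three hypothetical clients, each holding a private data split.
true_w = np.array([1.5, -2.0, 0.5])
clients = []
for _ in range(3):
    X = rng.normal(size=(200, 3))
    y = (X @ true_w + rng.normal(scale=0.1, size=200) > 0).astype(float)
    clients.append((X, y))

# Each client trains locally; only the weight vectors leave the client.
local_weights = [local_train(X, y) for X, y in clients]

# A central server aggregates the shared weights (a simple average here).
global_w = np.mean(local_weights, axis=0)
print("aggregated model weights:", np.round(global_w, 2))
```

In this toy setup the raw records never leave each client; only the fitted parameters are averaged, which is the privacy benefit, and also the step that becomes fragile when the clients' data differ sharply.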
“These approaches can fail because, in a heterogeneous data environment, a local client can ‘forget’ existing knowledge when new updates interfere with previously learned information,” says Alballa. “In some cases, introducing new tasks or classes from other datasets can lead to catastrophic forgetting, causing old knowledge to be overwritten or diluted.”
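A toy sketch of the forgetting effect Alballa describes (an assumed two-task setup chosen for illustration, not drawn from the study): a single small model is trained on one task and then fine-tuned on a second, conflicting task, after which its accuracy on the first task falls sharply.

```python
# Toy illustration (assumed setup) of catastrophic forgetting:
# sequentially training one model on two conflicting tasks erodes
# what it learned about the first task.
import numpy as np

rng = np.random.default_rng(1)

def train(w, X, y, epochs=300, lr=0.2):
    """Continue training a logistic-regression model from its current weights."""
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

def accuracy(w, X, y):
    return np.mean(((X @ w) > 0) == y)

X = rng.normal(size=(1000, 2))
y_task_a = (X[:, 0] > 0).astype(float)   # task A: sign of feature 0
y_task_b = (X[:, 1] > 0).astype(float)   # task B: sign of feature 1

w = np.zeros(2)
w = train(w, X, y_task_a)                # learn task A
print("task A accuracy after task A:", accuracy(w, X, y_task_a))

w = train(w, X, y_task_b)                # then learn task B on the same weights
print("task A accuracy after task B:", accuracy(w, X, y_task_a))  # degrades
```

Because the second task pulls the shared weights in a different direction, the updates overwrite the features that mattered for the first task, which is the kind of interference that collaborative-learning methods in heterogeneous settings must guard against.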