Imagine being asked to pick out a particular face among a sea of people. Researchers from KAUST have come up with a method to accurately sift complex biological data.
Xin Gao and his team have developed an algorithm that achieves high accuracy in difficult classification problems.
Biological data are often presented with dizzying complexity. They can be made up of many samples, with thousands of features per sample, and need to be converted into a simpler form for analysis.
Popular statistical methods for complexity reduction, such as principal component analysis, assign both positive and negative values to the simplified data. Thus, explains Gao, they do not suit the non-negative nature of some practically useful data, such as image and gene expression data.
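To make the point about negative values concrete, here is a minimal sketch in Python with NumPy. It is an illustration only, not code from the researchers: the function name `pca_scores` and the random toy data are assumptions. It projects strictly non-negative data onto its leading principal components and checks whether the reduced representation contains negative entries.

```python
import numpy as np

def pca_scores(X, k):
    """Project samples (columns of X) onto the top-k principal components."""
    # Center each feature (row) so the components capture variance.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Left singular vectors of the centered data are the principal directions.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k].T @ Xc  # shape: k x n_samples

rng = np.random.default_rng(0)
X = rng.random((50, 20))        # toy non-negative data: 50 features, 20 samples
scores = pca_scores(X, 3)       # reduced representation with 3 features
has_negative = bool((scores < 0).any())
print("input min:", X.min(), "| reduced data has negatives:", has_negative)
```

Because the data are centered first, each row of the reduced representation sums to zero across samples, so negative entries appear whenever that row is not entirely zero, even though every input value is non-negative.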
Instead, Gao and postdoctoral fellow Jingyan Wang, from KAUST’s Computer, Electrical and Mathematical Science and Engineering Division, improved upon a method that does not assign negative numbers: non-negative matrix factorization (NMF). A complex dataset is expressed as a matrix, in which each row is a feature and each column a sample, and is then broken down into simpler matrices that represent the data with fewer features. NMF is first ‘trained’ on known data and then used to represent test data.
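The factorization described above can be sketched as follows. This is plain NMF using the well-known multiplicative update rules for squared error, not the researchers' improved method; the function names `nmf` and `encode`, the iteration count and the toy data are all assumptions made for illustration.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor non-negative X (features x samples) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = rng.random((d, k)) + eps   # basis matrix: features x k
    H = rng.random((k, n)) + eps   # reduced representation: k x samples
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def encode(X_new, W, n_iter=200, eps=1e-9, seed=0):
    """Represent new samples with a fixed, already trained basis W."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X_new.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X_new) / (W.T @ W @ H + eps)
    return H

rng = np.random.default_rng(1)
X_train = rng.random((30, 12))   # toy data: 30 features, 12 training samples
X_test = rng.random((30, 5))     # 5 test samples
W, H_train = nmf(X_train, k=4)   # 'train' on known data
H_test = encode(X_test, W)       # then represent the test data
```

Here each column of `H_train` or `H_test` describes one sample with 4 features instead of 30, and no entry is negative.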
Gao and Wang utilized the fact that each sample in a training set can be assigned to a particular class. To develop Max-Min NMF, they then increased the distance between pairs of samples that belong to different classes. “Instead of dealing with all the inter-class pairs equally, we pick the closest inter-class pair and maximize the distance, so that all other inter-class pairs will also be separated simultaneously,” says Gao.
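The selection step Gao describes, finding the closest pair of samples from different classes, can be sketched as below. This covers only that step: the article does not give the full Max-Min NMF objective or its optimization, so none of that is reproduced here. The function name `closest_interclass_pair`, the use of Euclidean distance and the toy data are assumptions.

```python
import numpy as np

def closest_interclass_pair(H, labels):
    """Return (i, j, distance) for the nearest pair of samples in different classes.

    H holds one sample per column (reduced features x samples).
    Assumes at least two classes are present in `labels`.
    """
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between columns of H.
    diff = H[:, :, None] - H[:, None, :]
    D = np.sqrt((diff ** 2).sum(axis=0))
    # Keep only pairs whose class labels differ.
    inter = labels[:, None] != labels[None, :]
    D_masked = np.where(inter, D, np.inf)
    i, j = np.unravel_index(np.argmin(D_masked), D_masked.shape)
    return int(i), int(j), float(D_masked[i, j])

# Toy reduced data: four samples on a line, two per class.
H = np.array([[0.0, 0.2, 1.0, 3.0],
              [0.0, 0.0, 0.0, 0.0]])
labels = [0, 0, 1, 1]
i, j, d = closest_interclass_pair(H, labels)
print(i, j, d)
```

In this toy case the pair returned is the second and third samples, which sit closest together across the class boundary. Every other inter-class pair is at least as far apart, which is why pushing this one pair apart gives a lower bound on the separation of all the others.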