Machine Learning in Healthcare: When Low Sample Size is not a Limitation

Methodology at the intersection of machine learning and medicine has advanced rapidly in recent years, especially in scenarios where large amounts of high-quality labeled data are available. In most practical situations in medicine, however, data are scarce or difficult to access, the endpoints of interest are rare, or the process of generating labels is difficult, time consuming, and expensive, all of which make it hard to assemble large supervised datasets. In this talk, I will describe three use cases that highlight present challenges and opportunities for the development of machine learning methodology in healthcare. First, I will describe simple word-embedding approaches to bag-of-words document classification and their application to the diagnosis of peripheral artery disease from clinical narratives. Second, I will present an approach for volumetric image classification that leverages attention mechanisms, contrastive learning, and feature-encoding sharing for geographic atrophy prognosis from optical coherence tomography images. Third, I will discuss machine learning approaches to multi-modal and multi-dataset integration for biomarker discovery from molecular (omics) data. To conclude, I will summarize the contributions and insights in each of these directions, in which relatively low sample sizes are the common denominator.
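The first use case rests on a standard idea: represent a clinical note as the average of its word embeddings, then classify that vector. The sketch below illustrates this with a toy, hypothetical embedding table and a nearest-centroid classifier; it is not the speaker's actual method, the vocabulary and vectors are invented for illustration, and a real system would use pretrained embeddings (e.g., word2vec or GloVe) and a learned classifier.

```python
import numpy as np

# Hypothetical 3-d word embeddings for illustration only; real systems
# would load pretrained vectors trained on large text corpora.
EMB = {
    "claudication": np.array([0.9, 0.1, 0.0]),
    "ischemia":     np.array([0.8, 0.2, 0.1]),
    "stenosis":     np.array([0.7, 0.3, 0.0]),
    "cough":        np.array([0.0, 0.9, 0.2]),
    "fever":        np.array([0.1, 0.8, 0.3]),
}

def doc_vector(tokens):
    """Represent a document as the mean of its word embeddings."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vecs, axis=0)

# Toy training notes for two labels: peripheral artery disease vs. other.
train = {
    "pad":   [["claudication", "ischemia"], ["stenosis", "ischemia"]],
    "other": [["cough", "fever"], ["fever", "cough"]],
}

# Nearest-centroid classifier: average the document vectors per class.
centroids = {label: np.mean([doc_vector(d) for d in docs], axis=0)
             for label, docs in train.items()}

def classify(tokens):
    """Assign the label whose centroid is closest to the document vector."""
    v = doc_vector(tokens)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

print(classify(["claudication", "stenosis"]))  # prints "pad"
```

The appeal of this kind of model in low-sample-size settings is that the embeddings carry knowledge from large unlabeled corpora, so the supervised component has very few parameters to fit.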

Brief Biography

Ricardo Henao, a quantitative scientist, is an Assistant Professor in the Department of Biostatistics and Bioinformatics at Duke University. He is also affiliated with the Department of Electrical and Computer Engineering (ECE), the Information Initiative at Duke (iiD), the Center for Applied Genomics and Precision Medicine (CAGPM), the Forge (Duke's center for actionable health data science), and the Duke Clinical Research Institute (DCRI), all at Duke University. The theme of his research is the development of statistical methods and machine learning algorithms based primarily on probabilistic modeling. His expertise covers several fields, including applied statistics, signal processing, pattern recognition, and machine learning. His methods research focuses on hierarchical or multilayer probabilistic models for complex data, such as data characterized by high dimensionality, multiple modalities, more variables than observations, noisy measurements, missing values, and time series. Most of his applied work is dedicated to the analysis of biological data such as gene expression, medical imaging, clinical narratives, and electronic health records. His recent work has focused on the development of machine learning models, including deep learning approaches, for the analysis and interpretation of clinical and biological data, with applications to predictive modeling for diverse clinical outcomes.