Research and experimentation across scientific fields build on the knowledge and ideas found in scholarly literature. The advancement of research and development has thus increased the importance of analyzing and understanding that literature.
However, in recent years, researchers have faced a massive volume of scholarly documents published at an exponentially increasing rate. Analyzing this vast number of publications is far beyond the capability of individual researchers.
This dissertation is motivated by the need for large-scale analysis of the rapidly growing body of scholarly literature for scientific knowledge discovery and information retrieval. In the first part of this dissertation, the interdependencies between scholarly documents are studied. First, I develop Delve, a data-driven search engine supported by a semi-supervised edge classification method that we designed. This system enables users to search and analyze the relationships between datasets and scholarly literature.
Based on the Delve system, I propose to study information extraction as a node classification problem in attributed networks. Specifically, if we can learn the research topics of documents (nodes in a network), we can aggregate documents by topics and retrieve information specific to each topic (e.g., top-k popular datasets).
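To illustrate the aggregation step described above, the following is a minimal sketch, assuming topic labels have already been predicted for each document. The function name `top_k_datasets_by_topic`, the input dictionaries, and the toy data are hypothetical and are not part of the Delve system.

```python
from collections import Counter, defaultdict


def top_k_datasets_by_topic(doc_topics, doc_datasets, k=2):
    """Group documents by predicted topic and rank datasets by usage count.

    doc_topics:   hypothetical mapping document id -> predicted topic label
    doc_datasets: hypothetical mapping document id -> list of datasets it uses
    """
    counts = defaultdict(Counter)
    for doc, topic in doc_topics.items():
        # Count each dataset once per document that uses it.
        counts[topic].update(set(doc_datasets.get(doc, [])))
    return {
        topic: [name for name, _ in counter.most_common(k)]
        for topic, counter in counts.items()
    }


# Toy input, for illustration only.
doc_topics = {"d1": "vision", "d2": "vision", "d3": "nlp", "d4": "vision"}
doc_datasets = {
    "d1": ["ImageNet", "CIFAR-10"],
    "d2": ["ImageNet"],
    "d3": ["SQuAD"],
    "d4": ["ImageNet", "CIFAR-10", "MNIST"],
}

result = top_k_datasets_by_topic(doc_topics, doc_datasets, k=2)
print(result)
```

In this sketch the quality of the retrieved information depends entirely on the predicted topic labels, which is why the classification step matters.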
Node classification in attributed networks poses several challenges: a limited number of labeled nodes, effective fusion of topological structure with node and edge attributes, and the co-existence of multiple labels for one node. Existing node classification approaches address, fully or partially, only a few of these challenges. This dissertation addresses them by proposing semi-supervised multi-class and multi-label node classification models that integrate node and edge attributes with topological relationships.
The second part of this dissertation examines the interdependencies between terms in scholarly literature. I present two algorithms for the automatic hypothesis generation (HG) problem, which refers to the discovery of meaningful implicit connections between scientific terms, including but not limited to diseases, chemicals, drugs, and genes extracted from databases of biomedical publications. The automatic hypothesis generation problem is modeled as future connectivity prediction in a dynamic attributed graph. The key is to capture the temporal evolution of node-pair (term-pair) relations. The proposed algorithms use both the graph structure and the node attributes to encode the temporal node-pair (edge) relationship, and they work in both transductive and inductive settings. Experimental results and case study analyses highlight the effectiveness of the proposed algorithms compared with extensions of the baseline methods.
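The framing of hypothesis generation as future connectivity prediction can be illustrated with a small sketch. It is not one of the proposed algorithms: it swaps in a simple time-decayed common-neighbor heuristic, ignores node attributes, and uses made-up term names. The function `score_future_links` and the `decay` parameter are assumptions introduced only for this example.

```python
from itertools import combinations


def norm(u, v):
    """Return an order-independent key for an undirected term pair."""
    return (u, v) if u <= v else (v, u)


def score_future_links(snapshots, decay=0.5):
    """Rank currently unconnected term pairs by time-decayed common neighbors.

    snapshots: list of edge lists (term pairs), ordered oldest to newest.
    Recent snapshots get weight 1, older ones are discounted by `decay`.
    """
    known = {norm(u, v) for snap in snapshots for u, v in snap}
    nodes = sorted({n for pair in known for n in pair})
    last = len(snapshots) - 1
    scores = {}
    for t, snap in enumerate(snapshots):
        weight = decay ** (last - t)
        # Build the adjacency sets for this time step.
        adj = {}
        for u, v in snap:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        for u, v in combinations(nodes, 2):
            if (u, v) in known:
                continue  # only score pairs that were never connected
            common = adj.get(u, set()) & adj.get(v, set())
            if common:
                scores[(u, v)] = scores.get((u, v), 0.0) + weight * len(common)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy term co-occurrence snapshots (hypothetical terms, not real findings).
snapshots = [
    [("drug_A", "gene_B")],
    [("drug_A", "gene_B"), ("gene_B", "disease_C")],
    [("drug_A", "gene_B"), ("gene_B", "disease_C"),
     ("drug_A", "protein_D"), ("protein_D", "disease_C")],
]

ranked = score_future_links(snapshots)
for pair, score in ranked:
    print(pair, score)
```

Here the pair that has shared intermediate terms across more time steps ranks highest, which is the kind of temporal evidence a learned node-pair encoding is meant to capture in a more expressive way.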
Uchenna Akujuobi is a computer science Ph.D. student advised by Professor Xiangliang Zhang. His research focuses on large-scale learning from scientific literature using machine learning techniques. Prior to his Ph.D., Uchenna Akujuobi obtained an MSc in CS from KAUST in 2016 and a BSc in CS from Saint Petersburg Electrotechnical University in Russia in 2014.