Everything you always wanted to know about single machine large-scale graph analytics
Overview
Abstract: Graphs are a natural way to capture relations among objects or people. This has led to graph processing systems being used in a wide variety of fields, ranging from biology to social networks, and a large number of such systems have been described in the recent literature. The focus of our recent work is on single-machine graph processing systems where the graph is processed from main memory or external storage. We perform a systematic comparison of various techniques proposed to speed up in-memory multicore graph processing. In addition, we take an end-to-end view of execution time, including not only algorithm execution time, but also pre-processing time and the time to load the graph input data from storage. More specifically, we study various data structures to represent the graph in memory, various approaches to pre-processing, and various ways to structure the graph computation. We also investigate approaches to improving cache locality and NUMA-awareness. In doing so, we take our inspiration from existing systems and implement the techniques they propose in a single system. Our main observation is that the cost of pre-processing in many circumstances dominates the cost of algorithm execution, calling into question the benefits of proposed algorithmic optimizations that rely on extensive pre-processing. Equally surprising, using radix sort turns out to be the most efficient way of pre-processing the graph input data into adjacency lists, when the graph input data is already in memory or is loaded from fast storage. Furthermore, we adapt a technique developed for out-of-core graph processing and show that it significantly improves cache locality. When the data does not fit in the RAM of a single machine, applications can scale up to external storage. We evaluate the feasibility and design trade-offs of this approach.
In particular, our evaluation uses an end-to-end approach, taking storage format, pre-processing costs, and algorithmic computation into account when using three different types of storage: HDD, SSD, and PCIe NVMe. Our findings suggest that while specialised data structures may improve the end-to-end time on SSDs and HDDs, there is little gain when using higher-bandwidth/lower-latency NVMe devices. The best pre-processing approach, as well as the best data structure, depends on the application and the ratio of available RAM to graph size. We also show that different ways to overlap I/O and computation benefit different data layouts. Compared to a recent system, which uses high-density compute capability from multiple Intel Xeon processors as well as high bandwidth from multiple NVMe devices to process a trillion-edge graph in minutes, we project that a single socket with a single NVMe device can come close with the right choice of storage format, pre-processing cost, and algorithm. This is joint work with Baptiste Lepers (University of Sydney) and Willy Zwaenepoel (EPFL, University of Sydney).
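To make the pre-processing step concrete: turning an unsorted edge list into adjacency lists amounts to sorting edges by source vertex, which is why a counting/radix sort keyed on vertex IDs works so well. The sketch below (not the system's actual code; function and variable names are illustrative) builds a CSR-style adjacency list with a single counting-sort pass, i.e. one radix digit when vertex IDs fit in one pass.

```python
def edges_to_csr(num_vertices, edges):
    """Build a CSR adjacency list from an unsorted edge list using a
    counting-sort pass keyed on the source vertex."""
    # Pass 1: count the out-degree of every vertex.
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # A prefix sum turns per-vertex counts into row offsets.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Pass 2: scatter each destination into its source's row slot.
    neighbors = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot per source vertex
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors

# Example: a 3-vertex graph with 4 directed edges.
offsets, neighbors = edges_to_csr(3, [(1, 2), (0, 1), (1, 0), (2, 0)])
# offsets   -> [0, 1, 3, 4]   (vertex v's neighbors live in
# neighbors -> [1, 2, 0, 0]    neighbors[offsets[v]:offsets[v+1]])
```

The appeal over a comparison sort is that both passes are linear scans over the edge list, so the cost is O(|E|) with sequential memory access, which matters when pre-processing dominates end-to-end time.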
Biography: Jasmina Malicevic received her BS/MS from the University of Belgrade, Serbia. She is currently a sixth-year PhD student at the Operating Systems Laboratory at EPFL, advised by Professor Willy Zwaenepoel. Her interests are in systems and big data. In particular, she is interested in the impact irregular data structures, such as graphs, have on system design, as well as the implications the type of storage has on graph mining applications. Her recent paper, “Everything you always wanted to know about multicore graph processing but were afraid to ask”, was awarded Best Paper at USENIX ATC 2017.