Big data analytics and large-scale simulations have followed largely independent paths to the high-performance computing frontier, but important opportunities now arise that can be addressed by combining the strengths of each. As a prominent big data application, geospatial statistics is increasingly performance-bound. We present the Exascale GeoStatistics (ExaGeoStat) software, a high-performance library implemented on a wide variety of contemporary hybrid distributed-shared supercomputers, whose primary target is climate and environmental prediction applications. Such software is destined to play an important role at the intersection of big data and extreme simulation: by exploiting recent algorithmic developments in computational linear algebra, it allows applications with prohibitively large memory footprints to be deployed on modern architectures at scales worthy of the data.
In contrast to simulation based on partial differential equations derived from first-principles modeling, ExaGeoStat employs a statistical model based on the evaluation of the Gaussian log-likelihood function, which operates on a large dense covariance matrix. A relatively small ensemble of expensive simulations can be used to parameterize a statistical model from which inexpensive emulations can be drawn after a parameter fitting process. For the dense covariance matrix operations of geospatial statistics to keep up with the growing scale of data sets produced by the sparse Jacobian operations of PDE simulations, data sparsity intrinsic in the physics must be identified and exploited. Parameterized by the Matérn covariance function, the covariance matrix is symmetric and positive definite. The computational tasks involved in evaluating the Gaussian log-likelihood function become daunting as the number n of geographical locations grows, since O(n^2) storage and O(n^3) operations are required. While ExaGeoStat's distributed capability extends traditional "exact" linear algebra approaches, the library also supports several approximation techniques that reduce the complexity of the maximum likelihood operation while respecting a user-specified accuracy. For example, ExaGeoStat supports the Tile Low-Rank (TLR) approximation technique, which exploits the data sparsity of the dense covariance matrix by compressing the off-diagonal tiles up to a user-defined accuracy threshold.
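To make the cost figures above concrete, the following is a minimal sketch of the "exact" evaluation, not ExaGeoStat's implementation. It computes the Gaussian log-likelihood l(theta) = -(n/2) log(2 pi) - (1/2) log det Sigma(theta) - (1/2) z^T Sigma(theta)^{-1} z through a Cholesky factorization, which is where the O(n^2) storage and O(n^3) operations come from. It assumes the Matérn smoothness is fixed at 1/2, in which case the Matérn function reduces to the exponential covariance variance * exp(-distance / range). All function and parameter names are hypothetical, and only the Python standard library is used.

```python
import math


def exp_covariance(locs, variance, range_):
    """Dense n x n covariance matrix: O(n^2) storage.

    Assumption: Matern smoothness fixed at 1/2, i.e. the exponential
    kernel variance * exp(-distance / range_).
    """
    return [[variance * math.exp(-math.dist(p, q) / range_) for q in locs]
            for p in locs]


def cholesky(a):
    """Lower-triangular L with a = L L^T: O(n^3) operations."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if pivot <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            partial = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (a[i][j] - partial) / low[j][j]
    return low


def forward_solve(low, b):
    """Solve low * y = b for a lower-triangular matrix low."""
    y = []
    for i, b_i in enumerate(b):
        partial = sum(low[i][k] * y[k] for k in range(i))
        y.append((b_i - partial) / low[i][i])
    return y


def gaussian_loglik(locs, z, variance, range_):
    """Log-likelihood of zero-mean measurements z observed at locs."""
    n = len(z)
    low = cholesky(exp_covariance(locs, variance, range_))
    y = forward_solve(low, z)
    # log det Sigma = 2 * sum(log L_ii);  z^T Sigma^{-1} z = ||y||^2
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    quad = sum(v * v for v in y)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

A parameter fit would call `gaussian_loglik` repeatedly inside a numerical optimizer over the covariance parameters, so every optimizer iteration pays for one full factorization. This is why approximations that shrink the factorization cost matter at scale.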
Because many environmental characteristics show spatial continuity, i.e., data at two nearby locations are on average more similar than data at two widely spaced locations, other approximations become valid and are provided by ExaGeoStat, such as the diagonal super-tile and mixed-precision approximation methods. In the mixed-precision method, the less significant correlations, which comprise the vast majority of entries in the covariance matrix, are stored in lower precision than the default used for tightly coupled degrees of freedom.
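The storage side of the mixed-precision idea can be illustrated with a small sketch. This is not ExaGeoStat's implementation. It assumes one-dimensional locations that are already sorted, so that tiles far from the diagonal hold correlations between distant points, and it uses an exponential covariance with unit variance. Tiles more than `band` tile-rows away from the diagonal are rounded to IEEE single precision; the rest stay in double. The names `to_single`, `mixed_precision_covariance`, `tile`, and `band` are hypothetical.

```python
import math
import struct


def to_single(x):
    """Round a double to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mixed_precision_covariance(locs, range_, tile, band):
    """Build exp(-|x_i - x_j| / range_) with far-off-diagonal tiles demoted.

    Entry (i, j) belongs to tile (i // tile, j // tile). Tiles whose
    row and column indices differ by more than `band` are rounded to
    single precision. Returns the matrix and the fraction of entries
    that were demoted.
    """
    n = len(locs)
    cov = [[0.0] * n for _ in range(n)]
    demoted = 0
    for i in range(n):
        for j in range(n):
            value = math.exp(-abs(locs[i] - locs[j]) / range_)
            if abs(i // tile - j // tile) > band:
                value = to_single(value)  # weak correlation: lower precision
                demoted += 1
            cov[i][j] = value
    return cov, demoted / (n * n)


def max_abs_error(locs, range_, cov):
    """Largest deviation from the all-double covariance matrix."""
    return max(abs(cov[i][j] - math.exp(-abs(locs[i] - locs[j]) / range_))
               for i in range(len(locs)) for j in range(len(locs)))
```

With 16 equally spaced points, `tile=4`, and `band=0`, three quarters of the entries fall in off-diagonal tiles and are demoted, while the perturbation of each entry stays at the level of single-precision rounding. Whether that perturbation is acceptable for a given fit depends on the accuracy the application requires.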
Sameh Abdulah is a research scientist at the Extreme Computing Research Center, King Abdullah University of Science and Technology (KAUST), Saudi Arabia. Sameh received his MS and Ph.D. degrees from the Ohio State University, Columbus, US, in 2014 and 2016. His work is centered around High-Performance Computing (HPC) applications, Extreme Scale Geostatistics, Extreme Scale Machine Learning and Data Mining (MLDM) algorithms, and fault tolerance for data-intensive applications.