With the hardware technology scaling and the trend on heterogeneous chip design, the existing numerical algorithms and software framework may break down due to load imbalance. There is currently a fundamental mismatch between the underlying hardware architecture with high thread concurrency and the software deployment of numerical libraries, which relies on the traditional bulk synchronous programming model. Numerical software should first squeeze performance out of single node by efficiently running on manycore architectures with processor counts sharing a common memory in the hundreds. Programming and extracting performance from these advanced architecture chips remain a challenging effort, which is further exacerbated in distributed-memory environment. Algorithmic solutions such as fine-grained computations, communication/synchronization-reducing, and mixed precisions come to the rescue. They represent some of the key ingredients to embrace for software libraries moving forward and leverage extreme-scale computing capabilities.
This half-day workshop brings together experts who lead efforts in developing innovative numerical algorithms and scalable HPC software libraries on today's architectures and upcoming Exascale systems.
Below are the main themes of the workshop:
- High performance direct / iterative solvers
- Mixed-precision algorithms exploiting specific hardware features
- Synchronization-reducing and communication-avoiding algorithms
- Performance optimizations for accelerated / hybrid architectures
- Numerical libraries for large-scale scientific applications
Friday, July 2nd: 2PM - 6PM (CET)
02:00pm - 02:10pm: Hatem Ltaief and Bilel Hadri, KAUST, Saudi Arabia: Opening Remarks
02:10pm - 02:40pm: David Keyes, KAUST, Saudi Arabia: Data-sparse Linear Algebra Algorithms for Large-scale Applications on Emerging Architectures (slides)
|A traditional goal of algorithmic optimality -- squeezing out operations -- has been superseded because of evolution in architecture. Algorithms must now squeeze memory data transfers and synchronizations while extra operations on locally cached data cost relatively little time or energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high-computational intensity in approximating a wide class of formally dense operators that arise in exascale applications. They may be regarded as algebraic generalizations of the fast multipole method. Methods based on hierarchical tree-based data structures and their simpler cousins tile low-rank matrices are well suited for early exascale architectures which are provisioned for high processing power relative to memory capacity and memory bandwidth. These data-sparse algorithms are ushering in a renaissance of numerical linear algebra. We describe modules of a software toolkit Hierarchical Computations on Manycore Architectures (HiCMA) that illustrate these features on several applications. Early modules of this open-source project are distributed in software libraries of major vendors. A recent addition H2Opus extends H2 hierarchical matrix operations to distributed memory and GPUs.|
02:40pm - 03:10pm: Daniel Gruenewald, Fraunhofer Institute for Industrial Mathematics, Germany: GaspiLS - A Linear Solver Library Based on The GASPI Programming Model (slides)
The Global Address Space Programming Interface (GASPI) with its reference implementation GPI-2 is based on single-sided communication complemented by light weight synchronization primitives. It is designed for extreme scalability. As such, it aims on truely asynchronous data dependency driven implementations with dynamic load balancing and maximal overlap of communication by computation. Explicit synchronization points can be avoided as much as possible.
GaspiLS is a sparse linear solver library built on top of these principles which provides basic linear algebra objects and iterative solvers for sparse linear problems. Solving linear equations is at the core of many scientific simulations. GaspiLS helps the domain experts by abstracting away the complexities of a hybrid parallel implementation while preserving an efficient and scalable execution using the Gaspi programming model.
We introduce the guiding principles used in GaspiLS and present recent advances in the development of two classes of black-box preconditioners. Algebraic Multigrid (AMG) for symmetric positive definite problems as they appear e.g. in CFD applications and Multi-Level ILU factorizations as a generic preconditioner which combines a MLND approach on the process level with an restrictive additive Schwarz method on the cluster level.
03:10pm - 03:40pm Laura Grigori, INRIA / Sorbonne University, France: Recent advances in randomization techniques for linear algebra (slides)
This talk will discuss several recent advances in using randomization techniques for solving large scale linear algebra operations. It will focus in particular on solving linear systems of equations and computing the low rank approximation of a large matrix. In the context of linear systems of equations, we discuss a randomized Gram-Schmidt process and show that it is efficient as classical Gram-Schmidt and numerically stable as modified Gram-Schmidt. We exploit the usage of mixed precision in this context and discuss its usage in GMRES. We also address the question of computing the low rank approximation of a matrix by using randomization techniques.
03:40pm - 04:10pm Alfredo Buttari, CNRS-IRIT, France: Five-Precision GMRES-based iterative refinement for large
sparse linear systems (slides)
|The recently proposed GMRES-based iterative refinement method in three precisions can handle much more ill-conditioned problems than traditional iterative refinement thanks to the use of preconditioned GMRES for solving the correction equation. Nevertheless, the practical interest of this method may be mitigated by the fact that high precision is needed to apply the preconditioner within the GMRES solver. In this talk we present a variant of this method where requirements on the precisions used within GMRES are relaxed, allowing the use of arbitrary precisions which leads to a five-precision GMRES-based iterative refinement algorithm. We present the associated rounding error analysis obtaining conditions under which the forward and backward errors converge to their attainable values. Our analysis makes use of a new result on the backward stability of MGS-GMRES in two precisions. We discuss the use of this method for the solution of large and sparse linear systems and present an implementation that relies on the use of a parallel, sparse direct solver. The experimental results on a large set of real life problems are in good agreement with our theoretical analysis and show that our method is a robust and versatile tool for solving linear systems of equations.|
04:10pm - 04:40pm Hartwig Anzt, Karlsruhe Institute of Technology, Germany: Toward a modular precision ecosystem for high performance computing (slides)
|On the road to exascale computing, it is usually the explosion in parallelism that is perceived as the main challenge as it requires the redesign of algorithms to enable the efficient use of architectures providing million-way parallelism. Aside from the increasing hardware concurrency, we observe another trend that requires the reconsideration of the traditional computing paradigms: The gap between the compute power available in the arithmetic cores on the one hand and the bandwidth to main memory or between nodes on the other hand keeps increasing dramatically, making data access and communication prohibitively expensive compared to arithmetic operations. In modern processor technology, retrieving values from main memory takes several orders of magnitude longer than performing arithmetic operations, and communicating between distinct nodes of a cluster is again orders of magnitude slower than main memory access. With no disruptive hardware changes on the horizon, we are facing a situation where all high performance computing applications suffer from the slow communication to main memory or in-between nodes. A promising - and maybe the only promising - strategy to overcome this problem is to utilize the bandwidth capacity more carefully, reduce the communication volume and the number of communication points, and - whenever possible - trade communication against computations. Specifically, the idea is to radically decouple the memory precision from the arithmetic precision, employ high precision only in the computations, and lower the precision as much as possible when accessing data in main memory or communicating with remote processors. An important aspect in this context is the design of a "memory accessor" that converts data on the fly between the IEEE high precision arithmetic format that is natively supported by hardware and the memory/communication format. On an abstract level, the idea is to compress data before and after memory operations and only use the working precision in the arithmetic operations. While one generally distinguishes between “lossy compression'' and “lossless compression'', significant bandwidth reduction usually requires the loss of some information. How much information can be disregarded without endangering the application correctness heavily depends on the underlying algorithm and the problem characteristics. In this talk, we motivate and present the idea of decoupling the memory precision from the arithmetic precision, and discuss the potential and limitations of the approach in the sense of preserving the algorithm correctness. We also provide examples for bandwidth-bound numerical linear algebra algorithms that are amenable to the format separation, and evaluate the performance gains we can achieve for real-world problems on recent HPC architectures.|
04:40pm - 05:10pm Ulrike Yang, Lawrence Livermore National Laboratory, USA: Advances in Hypre’s GPU Capabilities (slides)
|With the increasing inclusion of accelerators into current and future high-performance computers, it has become important to enable parallel mathematical libraries to take advantage of the increased performance potential of GPUs. The hypre software library provides high performance preconditioners and solvers for the solution of large sparse linear systems on massively parallel computers with focus on algebraic multigrid (AMG) methods. However, their implementation can be complex. Traditionally, AMG’s setup phase has consisted of the generation of coarsening, interpolation, and coarse grid generation. Breaking these components down into smaller kernels can significantly improve portability. Another way to achieve performance on accelerators is to increase structure in algorithms. This talk will discuss new developments in hypre, including a new class of interpolation operators that consists of simple matrix operations and efforts to develop a semi-structured AMG method.|
05:10pm - 05:55pm Panel moderated by John Shalf, Lawrence Berkeley National Laboratory, USA: Hardware/Software Co-Design: a Myth or a Reality?
05:55pm - 06:00pm: Hatem Ltaief and Bilel Hadri, KAUST, Saudi Arabia: Closing Remarks