Enhanced Parallel ILU(p)-based Preconditioners for Multi-core CPUs and GPUs -- The Power(q)-pattern Method
Application demands and grand challenges in numerical simulation require both highly capable computing platforms and efficient numerical solution schemes. Power constraints and further miniaturization of modern and future hardware give way to multi- and manycore processors with increasingly fine-grained parallelism and deeply nested hierarchical memory systems, as already exemplified by recent graphics processing units. Accordingly, numerical schemes need to be adapted and re-engineered in order to deliver scalable solutions across diverse processor configurations. Portability of parallel software solutions across emerging hardware platforms is another challenge.
This work investigates multi-coloring and re-ordering schemes for block Gauß-Seidel methods and, in particular, for incomplete LU factorizations with and without fill-ins. We consider two matrix re-ordering schemes that deliver flexible and efficient parallel preconditioners. The general idea is to generate block decompositions of the system matrix such that the diagonal blocks are themselves diagonal. In this way, parallelism can be exploited on the block level in a scalable manner. Our goal is to provide widely applicable, out-of-the-box preconditioners that can be used in the context of finite element solvers.
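The block structure described above can be illustrated with a minimal sketch. It assumes a plain greedy coloring of the matrix graph, which is one common choice and not necessarily the scheme used in this work; the function names and the pattern representation (one set of column indices per row) are hypothetical and chosen only for illustration.

```python
def greedy_multicoloring(pattern):
    """Greedy coloring of the matrix graph (illustrative, not the authors' code).

    pattern[i] is the set of column indices j with a_ij != 0.
    Returns one color per unknown such that coupled unknowns differ in color.
    """
    n = len(pattern)
    # Symmetrize: i and j are neighbours if a_ij != 0 or a_ji != 0.
    neighbours = [set() for _ in range(n)]
    for i, cols in enumerate(pattern):
        for j in cols:
            if i != j:
                neighbours[i].add(j)
                neighbours[j].add(i)
    colors = [-1] * n
    for i in range(n):
        used = {colors[j] for j in neighbours[i] if colors[j] >= 0}
        c = 0
        while c in used:  # smallest color not taken by a neighbour
            c += 1
        colors[i] = c
    return colors


def color_permutation(colors):
    """Group unknowns by color: perm[new_index] = old_index (stable sort)."""
    return sorted(range(len(colors)), key=lambda i: colors[i])


def diagonal_blocks_are_diagonal(pattern, colors):
    """True if no off-diagonal entry couples two unknowns of the same color."""
    return all(colors[i] != colors[j]
               for i, cols in enumerate(pattern) for j in cols if i != j)


# Example: tridiagonal pattern of a 1D Laplacian with 6 unknowns.
n = 6
tridiag = [{j for j in (i - 1, i, i + 1) if 0 <= j < n} for i in range(n)]
colors = greedy_multicoloring(tridiag)
perm = color_permutation(colors)
print(colors)  # [0, 1, 0, 1, 0, 1]
print(perm)    # [0, 2, 4, 1, 3, 5]
print(diagonal_blocks_are_diagonal(tridiag, colors))  # True
```

For the tridiagonal example this reproduces the familiar red-black ordering: after permutation, the two diagonal blocks contain only diagonal entries, so all unknowns within one block can be updated in parallel.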
We propose a new method for anticipating the fill-in pattern of ILU($p$) schemes which we call the power($q$)-pattern method. This method is based on an incomplete factorization of the system matrix $A$ subject to a predetermined pattern given by the matrix power $|A|^{p+1}$ and its associated multi-coloring permutation $\pi$. We prove that the obtained sparsity pattern is a superset of the pattern of our modified ILU($p$) factorization applied to $\pi A \pi^{-1}$. As a result, this modified ILU($p$) applied to the multi-colored system matrix has no fill-ins in its diagonal blocks. This leads to an inherently parallel execution of triangular ILU($p$) sweeps.
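The pattern argument above can be checked on a small example. The following is a sketch under stated assumptions: it uses the standard level-of-fill symbolic ILU($p$) rule as a stand-in for the modified ILU($p$) of this work (which is not specified here), a greedy coloring, and a 5-point Laplacian pattern as test matrix. All function names are hypothetical.

```python
def laplacian_pattern(nx, ny):
    """Pattern of a 5-point Laplacian on an nx-by-ny grid (test matrix only)."""
    rows = []
    for y in range(ny):
        for x in range(nx):
            row = {y * nx + x}
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                if 0 <= x + dx < nx and 0 <= y + dy < ny:
                    row.add((y + dy) * nx + (x + dx))
            rows.append(row)
    return rows


def pattern_power(pattern, k):
    """Sparsity pattern of |A|^k, one set of column indices per row."""
    result = [set(row) for row in pattern]
    for _ in range(k - 1):
        result = [set().union(*(pattern[m] for m in row)) for row in result]
    return result


def greedy_multicoloring(pattern):
    """Greedy coloring: unknowns coupled in `pattern` get different colors."""
    n = len(pattern)
    colors = [-1] * n
    for i in range(n):
        used = {colors[j] for j in range(n)
                if j != i and colors[j] >= 0
                and (j in pattern[i] or i in pattern[j])}
        c = 0
        while c in used:
            c += 1
        colors[i] = c
    return colors


def permute_pattern(pattern, perm):
    """Pattern of pi A pi^{-1}, with perm[new_index] = old_index."""
    inv = {old: new for new, old in enumerate(perm)}
    return [{inv[j] for j in pattern[old]} for old in perm]


def symbolic_ilu(pattern, p):
    """Standard level-of-fill symbolic ILU(p); returns the pattern of L + U."""
    lev = [{j: 0 for j in row} for row in pattern]
    for i in range(len(pattern)):
        for k in range(i):
            if k not in lev[i]:
                continue
            for j, lev_kj in lev[k].items():
                if j <= k:
                    continue  # only the upper part of row k contributes
                new = lev[i][k] + lev_kj + 1
                if new <= p and new < lev[i].get(j, p + 1):
                    lev[i][j] = new
    return [set(row) for row in lev]


p = 1
A = laplacian_pattern(4, 4)
power = pattern_power(A, p + 1)             # pattern of |A|^(p+1)
colors = greedy_multicoloring(power)        # coloring of the power pattern
perm = sorted(range(len(A)), key=lambda i: colors[i])
new_colors = [colors[old] for old in perm]  # colors in the new ordering

ilu = symbolic_ilu(permute_pattern(A, perm), p)
power_perm = permute_pattern(power, perm)

is_superset = all(ilu[i] <= power_perm[i] for i in range(len(A)))
no_block_fill = all(new_colors[i] != new_colors[j]
                    for i in range(len(A)) for j in ilu[i] if i != j)
print(is_superset, no_block_fill)
```

Both checks follow from the fact that a fill entry of level at most $p$ corresponds to a path of length at most $p+1$ in the matrix graph, while two unknowns of the same color are never connected by such a path. The triangular sweeps can therefore process all unknowns of one color independently.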
In addition, we describe the integration of the preconditioners into the HiFlow$^3$ open-source finite element package, which provides a portable software solution across diverse hardware platforms. On this basis, we conduct a performance analysis across a variety of test problems on multi-core CPUs and GPUs that demonstrates the efficiency, scalability and flexibility of our approach. Our preconditioners accelerate the solver by factors of up to 1.5, 8 and 85 for three different test problems. The GPU versions of the preconditioned solver are up to 4 times faster than an OpenMP parallel version on eight cores.
The Engineering Mathematics and Computing Lab (EMCL), directed by Prof. Dr. Vincent Heuveline, is a research group at the Interdisciplinary Center for Scientific Computing (IWR).
The EMCL Preprint Series contains publications that were accepted for the Preprint Series of the EMCL and are planned to be published in journals, books, etc. soon.
The EMCL Preprint Series was published under the auspices of the Karlsruhe Institute of Technology (KIT) until April 30, 2013. Since May 1, 2013, it has been published under the auspices of Heidelberg University.