Numerical Defect Correction as an Algorithm-Based Fault Tolerance Technique for Iterative Solvers
Abstract
As hardware devices such as processor cores and memory subsystems built on nanoscale technology nodes become more unreliable, the need for fault-tolerant numerical computing engines, as used in many critical applications with long computation or mission times, is becoming more pronounced. In this paper, we present an algorithm-based fault tolerance (ABFT) scheme for an iterative linear solver engine based on the Conjugate Gradient (CG) method that takes advantage of numerical defect correction. The method is "pay as you go": it incurs runtime overhead only if errors occur and a correction is performed. Our experimental comparison with software-based triple modular redundancy (TMR) shows that the proposed approach has a lower runtime, better fault tolerance, and no occurrence of silent data corruption.
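The abstract names the technique but does not spell out the scheme, so the following is only a minimal sketch of generic defect correction wrapped around CG, not the authors' implementation. It assumes a small dense symmetric positive definite matrix stored as nested lists, and it simulates a fault by corrupting one entry of the inner iterate. All names (`cg`, `defect_correction_solve`, `fault_at`, `faulty_outer`) are hypothetical and chosen for illustration.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def cg(A, b, iters, fault_at=None):
    """Plain CG started from x = 0, for at most `iters` iterations.

    `fault_at` is a hypothetical hook: at that iteration index one entry
    of x is corrupted to imitate a hardware error in the inner solver.
    """
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rs = dot(r, r)
    stop = (1e-14 * norm(b)) ** 2
    for k in range(iters):
        if rs <= stop:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if k == fault_at:
            x[0] += 1.0e3  # injected fault; the recurrence residual r does not see it
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def defect_correction_solve(A, b, tol=1e-10, max_outer=20,
                            inner_iters=10, faulty_outer=None):
    """Outer defect correction loop around a possibly faulty inner CG solve.

    Assumption of this sketch: the defect d = b - A x is computed reliably.
    Returns the solution and the number of inner solves that were needed.
    """
    x = [0.0] * len(b)
    inner_solves = 0
    for outer in range(max_outer):
        d = [bi - axi for bi, axi in zip(b, matvec(A, x))]  # defect
        if norm(d) <= tol * norm(b):
            break
        fault_at = 1 if outer == faulty_outer else None
        e = cg(A, d, inner_iters, fault_at=fault_at)  # solve A e = d
        x = [xi + ei for xi, ei in zip(x, e)]
        inner_solves += 1
    return x, inner_solves


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_clean, solves_clean = defect_correction_solve(A, b)
x_fault, solves_fault = defect_correction_solve(A, b, faulty_outer=0)
print(solves_clean, solves_fault)
```

In this sketch the fault-free run pays only for one extra defect evaluation, while the faulty run needs further inner solves to remove the injected error. This mirrors the "pay as you go" behavior described in the abstract, under the stated assumption that the defect itself is computed correctly.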
The Engineering Mathematics and Computing Lab (EMCL), directed by Prof. Dr. Vincent Heuveline, is a research group at the Interdisciplinary Center for Scientific Computing (IWR).
The EMCL Preprint Series contains publications that have been accepted for the series and are planned for publication in journals, books, etc.
The EMCL Preprint Series was published under the auspices of the Karlsruhe Institute of Technology (KIT) until April 30, 2013. As of May 1, 2013, it is published under the auspices of Heidelberg University.