Session 9
9.1 Project Talk: Optimization of Fault-Tolerance Strategies for Workflow Applications
Aurélien Cavelan (INRIA)
Checkpointing is the traditional fault-tolerance technique when it comes to resilience for large-scale platforms. Unfortunately, as platform scale increases, checkpoints must become more frequent to cope with the decreasing Mean Time Between Failures (MTBF). As such, checkpoint-recovery is expected to become a major bottleneck for applications running on post-petascale platforms. In this paper, we focus on replication as a way of mitigating the checkpoint-recovery overhead. In particular, we investigate the impact of replication on the execution of a single task. A task can be checkpointed and/or replicated, so that if only one replica fails, no recovery is needed. Replication can be done at different levels of the application. We study process replication, where each process
can be replicated several times, and we adopt a passive approach: waiting for the application to either succeed or fail. Finally, we consider both fail-stop and silent errors. We derive closed-form formulas for the expected execution time and first-order approximations for the overhead and the optimal checkpoint interval.
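The abstract does not reproduce the formulas themselves; as a point of reference only, the sketch below computes the classical first-order (Young/Daly) approximation of the optimal checkpoint period, W_opt ≈ sqrt(2·C·μ), with C the checkpoint cost and μ the platform MTBF. The replication-aware formulas of the talk refine this; the numbers used here are purely illustrative.

```python
import math

def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order (Young/Daly) approximation of the optimal
    checkpoint period: W_opt ~ sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def waste_first_order(checkpoint_cost: float, mtbf: float, period: float) -> float:
    """First-order overhead for a given period: checkpointing cost C/W
    plus expected re-execution time W / (2 * mu)."""
    return checkpoint_cost / period + period / (2.0 * mtbf)

# Illustrative numbers only: C = 60 s checkpoint cost, platform MTBF = 1 hour.
C, mu = 60.0, 3600.0
w_opt = young_daly_period(C, mu)
print(f"optimal period ~ {w_opt:.0f} s, overhead ~ {waste_first_order(C, mu, w_opt):.1%}")
```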
9.2 Individual Talk: Failure detection
Yves Robert (INRIA)
This talk briefly describes current methods for failure detection and outlines their shortcomings. It also presents a new algorithm that overcomes some (but not all) of these problems. Finally, it addresses open questions.
9.3 Individual Talk: Identification and impact of failure cascades
Frederic Vivien (INRIA)
Most studies assume that failures are independent and not time-correlated. The questions we address in this work are: can we identify time-correlated failures in supercomputer traces? Could such cascades of failures have an impact on the way we use fault-tolerance mechanisms?
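The abstract does not describe the detection method; one simple, hypothetical way to probe a trace for time correlation is to compare the empirical distribution of failure inter-arrival times against the exponential distribution an independent (Poisson-like) failure process would produce, e.g. with a Kolmogorov-Smirnov test. All names and numbers below are illustrative assumptions, not the study's actual methodology.

```python
import numpy as np
from scipy import stats

def looks_time_correlated(failure_times, alpha=0.05):
    """Hypothetical check: flag the trace as time-correlated if the inter-arrival
    times deviate significantly from an exponential distribution (scale fitted
    from the sample mean), using a Kolmogorov-Smirnov test."""
    gaps = np.diff(np.sort(np.asarray(failure_times, dtype=float)))
    gaps = gaps[gaps > 0]
    _, p_value = stats.kstest(gaps, "expon", args=(0.0, gaps.mean()))
    return p_value < alpha, p_value

# Synthetic example: bursts of 4 failures (a "cascade") roughly every 1000 s.
rng = np.random.default_rng(0)
bursts = np.concatenate([t + rng.exponential(5.0, size=4).cumsum()
                         for t in np.arange(0.0, 50000.0, 1000.0)])
print(looks_time_correlated(bursts))
```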
9.4 Project Talk: New Techniques to Design Silent Data Corruption Detectors "Recent Work on Detecting SDC"
Leonardo Bautista (BSC)
We will present the results obtained in the last 6 months on this topic.
9.5 Individual Talk: Runtime driven online estimation of memory vulnerability
Luc Jaulmes (BSC)
Memory reliability is measured as a fault rate, i.e. a probability of failure over a given amount of time. The missing link to knowing the fault probability of any data stored in memory is its storage duration. By analyzing the memory access patterns of an application, we can thus determine the vulnerability of data stored in memory, and from it the optimal amount of redundancy needed to keep fault probabilities below an acceptable threshold at all times. While such a data vulnerability metric has been approached offline [Luo et al., DSN'14] and with an analytical model [Yu et al., SC'14], we estimate it online using performance counters on real hardware. This allows us to obtain fault probabilities for memory storage dynamically, and opens the door to runtime optimizations. The open problem remains the right set of actuators for a runtime system to adapt the strength of memory protection. One lead is to offer different ECC strengths, either through an adaptable ECC scheme whose amount of redundancy can be adjusted, or through different chips with the option of migrating data under different protection requirements. Another lead is to keep strong ECC at all times and instead tune parameters that impact resilience in order to save power and time, such as reducing DRAM refresh rates wherever the uniform redundancy is superfluous.
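As a concrete, hypothetical illustration of the metric described above, the sketch below accumulates per-region vulnerability as fault rate × size × residency time; in the actual work, the residency time would be derived online from performance counters rather than from explicit calls. The class, method names and the FIT rate are assumptions for illustration only.

```python
from dataclasses import dataclass

FIT_PER_MBIT = 50.0                    # assumed fault rate: failures per 1e9 device-hours per Mbit
FIT_TO_PER_SECOND = 1.0 / (1e9 * 3600.0)

@dataclass
class Region:
    size_bytes: int
    last_access: float                 # time of last access to the region, in seconds
    vulnerability: float = 0.0         # accumulated expected fault count

class VulnerabilityTracker:
    """Hypothetical online tracker: the expected number of faults striking a
    region grows with (fault rate) x (size) x (time the data sits in memory)."""
    def __init__(self):
        self.regions = {}

    def on_access(self, name: str, size_bytes: int, now: float):
        r = self.regions.setdefault(name, Region(size_bytes, now))
        rate = FIT_PER_MBIT * FIT_TO_PER_SECOND * (r.size_bytes * 8 / 1e6)
        r.vulnerability += rate * (now - r.last_access)
        r.last_access = now

# Usage sketch: a runtime would drive on_access() from counter-based sampling.
tracker = VulnerabilityTracker()
tracker.on_access("matrix_A", 64 * 1024 * 1024, now=0.0)
tracker.on_access("matrix_A", 64 * 1024 * 1024, now=120.0)
print(tracker.regions["matrix_A"].vulnerability)
```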
9.6 Individual Talk: Fault Tolerance and Approximate Computing
Osman Unsal (BSC)
There is a correlation between fault tolerance and approximate computing. Everything that is critical for fault tolerance, be it code, data structures, threads or tasks, is not conducive to approximation. On the other hand, anything that is relatively less critical for fault tolerance can be approximated. This means that if parts of an application are annotated for reliability-criticality, the same annotations could be leveraged for approximation without the need to further analyze and annotate the application for approximation. In particular, we have considered for approximation those tasks that were not marked as reliability-critical, either by the programmer or by the runtime; initial results are not conclusive and need further exploration.
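A minimal sketch of the idea, assuming a task-based programming model in which tasks carry a reliability-criticality annotation; the decorator name and the crude "halve the quality" approximation policy are hypothetical, not an existing API from the talk.

```python
import functools

def task(reliability_critical: bool):
    """Hypothetical task annotation: reliability-critical tasks always run
    precisely; non-critical tasks become candidates for approximation."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, approximate=False, **kwargs):
            if reliability_critical:
                # Critical task: execute precisely (e.g. duplicated/checked).
                return fn(*args, **kwargs)
            if approximate:
                # Non-critical task: the runtime may lower its quality knob.
                kwargs["quality"] = kwargs.get("quality", 1.0) * 0.5
            return fn(*args, **kwargs)
        wrapper.reliability_critical = reliability_critical
        return wrapper
    return decorate

@task(reliability_critical=False)
def smooth(values, quality=1.0):
    # Example kernel whose accuracy can be traded for speed via `quality`:
    # lower quality averages over coarser chunks of the input.
    step = max(1, int(1 / quality))
    return [sum(values[i:i + step]) / step for i in range(0, len(values), step)]

print(smooth([1, 2, 3, 4, 5, 6, 7, 8], approximate=True))
```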
9.7 Individual Talk: Fault-tolerance for FPGAs
Osman Unsal (BSC)
Similar to GPUs, the first generations of FPGAs did not offer any hardware-based fault tolerance. However, their potential deployment in mission-critical and HPC environments has led FPGA system integrators to include ECC support for on-chip Block RAM and SRAM memory structures in state-of-the-art FPGAs. However, most of the area of an FPGA is reserved for programmable logic rather than memory (unlike CPUs, which dedicate a majority of the chip area to cache memory structures). Since fault rate is proportional to area, any reliability proposal for FPGAs should therefore target programmable logic structures as well. However, protecting every latch with ECC or any other redundancy technique is not cost-effective. We think that an end-to-end integrated reliability solution involving ABFT, the runtime and the hardware is necessary to select the most vulnerable programmable logic structures of an FPGA and protect them. As an initial step, we created a fault injection mechanism which injects faults into any desired latch of an FPGA. Our initial results support our intuition that certain latches are much more reliability-critical than others.
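To illustrate the kind of campaign such a mechanism enables, here is a minimal sketch of single-bit fault injection into a simulated latch array, assuming a purely software model of the design state; the actual mechanism described in the talk targets real FPGA latches, and the toy "design" below (whose output depends only on half of its latches) exists only so that some injections are benign and others are not.

```python
import random

def run_design(latches):
    """Stand-in for evaluating the design: the toy output depends only on the
    first half of the latches (the 'datapath'); the rest are don't-care state."""
    return sum(bit << (i % 8) for i, bit in enumerate(latches[:len(latches) // 2])) & 0xFF

def inject_fault(latches, index):
    """Flip a single latch (bit) at the requested index."""
    flipped = list(latches)
    flipped[index] ^= 1
    return flipped

def criticality_campaign(latches, trials=1000, seed=0):
    """Estimate how often a random single-latch flip corrupts the output."""
    rng = random.Random(seed)
    golden = run_design(latches)
    corrupted = sum(
        run_design(inject_fault(latches, rng.randrange(len(latches)))) != golden
        for _ in range(trials)
    )
    return corrupted / trials

state = [random.Random(42).randrange(2) for _ in range(256)]
print(f"fraction of injections that corrupt the output: {criticality_campaign(state):.2f}")
```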