DEFENSE Committee: CARLA DOS SANTOS SANTANA

A DOCTORAL DEFENSE committee has been registered by the program.
STUDENT : CARLA DOS SANTOS SANTANA
DATE: 04/10/2024
TIME: 10:00
LOCATION: nPITI
TITLE:

A Configurable Dependability Library for High-Performance Computing Iterative Applications with Interruption Detection and Data Preservation


KEY WORDS:

fault tolerance, interruption detection, data conservation, high-performance computing


PAGES: 100
BIG AREA: Exact and Earth Sciences
AREA: Computer Science
SUMMARY:

High-performance computing, a dynamic field within computer science, provides the processing power necessary for algorithms across diverse domains. Large-scale supercomputers are indispensable for tackling complex problems; however, their size and complexity make them susceptible to failure. This underscores the criticality of employing fault tolerance techniques to mitigate the impact of interruptions or failures. These methods are instrumental in addressing hardware and software malfunctions and preemptive scenarios such as resource reclamation by cloud providers.

Given the imperative for fault tolerance, we present the Dependability Library for Iterative Applications. This library offers a versatile solution for bulk synchronous programs. It simplifies the integration of fault tolerance capabilities into applications and is highly configurable, allowing users to select which functionalities they want to use. The aim is to reduce the effort of implementing fault tolerance approaches, allowing project groups to focus on their specific problem.

The library offers a range of fault tolerance features, including checkpointing, replication, and heartbeat monitoring. Checkpointing saves the application's state at intervals, allowing it to resume from the last saved point after a failure. Replication ensures reliability by allowing a backup unit to take over in case of failure. The library detects possible failures through heartbeat monitoring and can also detect potential resource reclamation. The proposed library is compatible with User-Level Failure Mitigation (ULFM), which allows programs to continue operating after crashes, minimizing downtime.
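
To illustrate the checkpoint/restart idea described above, here is a minimal Python sketch. It is not the library's API: the function names (save_checkpoint, load_checkpoint, run), the pickle file format, the running-sum workload, and the fail_at parameter used to simulate an interruption are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to a temporary file, then rename it over `path`.

    The rename is atomic, so an interruption during the write cannot
    corrupt the previous checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_iters, interval, fail_at=None):
    """Hypothetical iterative application: sums 0..n_iters-1.

    The state is checkpointed every `interval` iterations. If a checkpoint
    exists, execution resumes from it instead of starting over.
    `fail_at` simulates an interruption at the given iteration.
    """
    state = load_checkpoint(path) or {"iteration": 0, "total": 0}
    for i in range(state["iteration"], n_iters):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated interruption")
        state["total"] += i
        state["iteration"] = i + 1
        if state["iteration"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

In this sketch, a run interrupted at iteration 7 with a checkpoint interval of 3 leaves a checkpoint for iteration 6 on disk. Calling run again with the same path repeats only the iterations after that checkpoint and returns the same result as an uninterrupted run.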

Our proposal was successfully applied to the geophysical problem of full-waveform inversion, a standard algorithm in geophysical processing for oil and gas exploration. This application serves as a practical high-performance scenario for analysis. All features were validated, and the overhead in this problem was analyzed using more realistic examples. In our experiments, the application did not lose all of the data processed up to the moment of failure, and it could continue execution even in the presence of node failure, with minimal overhead. This work also presents other case studies at an early stage of applying the library and discusses fault tolerance concepts and related work.


COMMITTEE MEMBERS:
Chair - 1673543 - SAMUEL XAVIER DE SOUZA
External to the Program - 3216921 - TIAGO TAVARES LEITE BARROS - UFRN
External to the Institution - CLAUDE TADONKI
External to the Institution - MICHELA TAUFER
External to the Institution - Herve Chauris
External to the Institution - PHILIPPE OLIVIER ALEXANDRE NAVAUX
Notice registered on: 05/09/2024 07:40