Checkpoints: why, when and how
Author: Sharcnet HPC
Uploaded: 2025-05-08
Views: 129
Checkpointing is a technique that enables a program to save its current state and later resume execution from that saved state. It is useful for long-running jobs, which may be interrupted by unpredictable causes such as system failures (hardware or software), bugs in the running program, or timeouts.
We have a wiki page about checkpoints, but it gives only general guidelines. In this webinar, we will introduce checkpointing through a few concrete examples that illustrate what the state of a program is and how its state at different points of execution is saved and restored. We will discuss various topics related to checkpoints, such as saving frequency, checkpoint file types, and how to implement checkpointing in different categories of computational jobs: serial, threaded, and MPI.
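To make the idea of "program state" concrete, here is a minimal sketch of checkpointing in a serial Python job. It is not taken from the webinar. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the use of `pickle` as the file format, and the fixed saving interval are all assumptions chosen for illustration. The state here is just the loop counter and the running total, and the checkpoint is written to a temporary file and then renamed, so that an interruption during the write does not corrupt the previous checkpoint.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reaches the disk
    os.replace(tmp_path, path)  # atomic replacement of the old checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, saving a checkpoint every `every` steps.

    `stop_after` is a hypothetical knob that simulates an interruption:
    the function returns None once that many steps have been completed.
    """
    # Resume from the last checkpoint if there is one, else start fresh.
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n_steps:
        if stop_after is not None and state["i"] >= stop_after:
            return None  # simulated interruption; unsaved progress is lost
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run("ckpt.pkl", 1000, stop_after=250)` and then `run("ckpt.pkl", 1000)` would redo only the steps after the last saved checkpoint (step 200 here). The `every` parameter reflects the saving-frequency trade-off: saving more often loses less work on interruption but spends more time on I/O.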
_______________________________________________
This webinar was presented by Weiguang Guan (SHARCNET) on May 7th, 2025, as a part of a series of weekly Compute Ontario Colloquia. The webinar was hosted by SHARCNET. The colloquia cover different advanced research computing (ARC) and high performance computing (HPC) topics, are approximately 45 minutes in length, and are delivered by experts in the relevant fields. Further details can be found on this web page: https://www.computeontario.ca/trainin... . Recordings, slides, and other materials can be found here: https://helpwiki.sharcnet.ca/wiki/Onl...
SHARCNET is a consortium of 19 Canadian academic institutions that share a network of high performance computers (http://www.sharcnet.ca). SHARCNET is a part of Compute Ontario (http://computeontario.ca/) and the Digital Research Alliance of Canada (https://alliancecan.ca).