Job Submission & Scheduling | Minnesota Supercomputing Institute | UMN
Author: Minnesota Supercomputing Institute | UMN
Uploaded: 2025-09-17
Views: 180
Topic: Job Submission & Scheduling
Date: September 17, 2025
Presentation Slides and Training Materials: https://tinyurl.com/4uz6d2rv
Learn more at https://www.msi.umn.edu
00:00:00 Introduction to Job Submission and Slurm
00:00:19 Prerequisites (Unix/Linux, Bash scripting)
00:01:30 What is a Scheduler and why is it needed?
00:04:15 Slurm Concepts: Nodes, Cores, Time Limits, Partitions
00:08:40 Slurm Commands: sinfo (View Cluster Status)
00:11:00 Slurm Commands: squeue (View Job Queue)
00:14:00 Slurm Partitions (e.g., interactive, small, bigmem)
00:17:10 Job Limits (Time, Memory, Cores)
00:19:40 Basic Job Submission (Simple Script)
00:23:00 Slurm Directives (#SBATCH)
00:27:00 Submitting the job: sbatch
00:29:00 Monitoring the job: squeue -u username
00:30:30 Deleting a job: scancel
00:32:00 Requesting Resources (Time, Nodes, Cores, Memory)
00:36:00 Requesting GPUs
00:38:00 Slurm Variables (e.g., $SLURM_JOB_ID)
00:40:30 Modules (Managing software environment)
00:43:40 Job Arrays (Parallelizing many similar tasks)
00:46:10 Job Array Demo and Slurm Variables ($SLURM_ARRAY_TASK_ID)
00:50:00 Submitting a Job Array
00:51:30 Understanding Job Array Output
00:52:40 Advanced Job Submission: Dependencies
00:54:30 Dependency Demo (--depend)
00:57:00 Interactive Sessions with srun
01:00:00 Interactive Sessions with Open OnDemand
01:02:10 Slurm Command: sacct (Accounting/Past Job Info)
01:05:00 sacct Demo and Fields
01:07:30 Slurm Command: sstat (Real-time Statistics for running jobs)
01:09:50 Slurm Logs and Troubleshooting
01:13:00 Job Priority and Fairshare
01:15:20 Data Loading Best Practices (I/O)
01:17:30 Num Workers (Optimizing data loading)
01:20:00 Data Loading from different file systems (Home vs. Project)
01:22:20 Checkpointing (Saving progress)
01:25:20 Checkpointing: Example in PyTorch
01:27:00 Job Cancellation in Checkpointing Scenarios
01:29:40 Debugging a Slurm Job
01:31:30 Troubleshooting Tips
01:33:00 Debugging Demo
01:35:40 Slurm Monitoring Tools: Slurm Dashboard
01:38:00 Slurm Dashboard Metrics (CPU, Memory, GPU)
01:40:00 Conclusion and MSI Help Resources