Slurm Commands - HPC Job Management


Slurm Commands

Slurm is a powerful workload manager and job scheduler used in High-Performance Computing (HPC) environments. Mastering its commands is crucial for efficiently managing your computational tasks. Below are some of the most common and essential Slurm commands to help you submit, monitor, and control your jobs.

Submit Slurm Jobs

The sbatch command is used to submit a batch script to Slurm. This script typically contains the commands you want to execute on the cluster, along with Slurm directives (e.g., number of nodes, time limit, partition).

# To submit a new job:
sbatch job.sh
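As an illustration of what job.sh might contain, here is a minimal sketch of a batch script. The job name, partition name, and resource values are assumptions chosen for the example; partitions in particular are site-specific. The #SBATCH directives must appear before the first executable command, otherwise sbatch ignores them.

```shell
#!/bin/bash
# Hypothetical job name, used only for this example.
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Wall-clock limit in HH:MM:SS.
#SBATCH --time=00:10:00
# Hypothetical partition name; replace with one that exists on your cluster.
#SBATCH --partition=debug
# %j expands to the job ID, giving each run its own output file.
#SBATCH --output=example-%j.out

# Slurm sets SLURM_JOB_ID at run time; fall back to "local" for a dry run
# outside the scheduler.
job_id="${SLURM_JOB_ID:-local}"
echo "Job ${job_id} running on $(uname -n)"
```

Because the directives are ordinary shell comments, the script can also be run directly with bash to check the commands before submitting it.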

Monitor Job Queues

The squeue command allows you to view the status of jobs in the queue. You can filter jobs by user, partition, or other criteria.

# To list all jobs for a user:
squeue -u <user>
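The filters mentioned above can be combined. The following sketch wraps two common cases in small shell functions; the function names are hypothetical helpers, while -u, -t, and -p are the squeue options for user, job state, and partition.

```shell
# Hypothetical helper: list one user's jobs in a given state (default PENDING).
# Usage: jobs_in_state <user> [state]
jobs_in_state() {
    squeue -u "$1" -t "${2:-PENDING}"
}

# Hypothetical helper: list all jobs in one partition.
# Usage: jobs_in_partition <partition>
jobs_in_partition() {
    squeue -p "$1"
}
```

For example, jobs_in_state alice RUNNING would show only the running jobs of a user named alice.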

Cancel Slurm Jobs

If you need to stop a running job or remove a pending job from the queue, use the scancel command. You can specify jobs by their ID or name.

# To cancel a job by id:
scancel <job-id>
# To cancel a job by name:
scancel --name <job-name>
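scancel can also select jobs by user and state, which helps when clearing a backlog without touching work that is already running. A minimal sketch, where cancel_pending is a hypothetical helper name:

```shell
# Hypothetical helper: cancel only the pending jobs of one user,
# leaving that user's running jobs untouched.
# Usage: cancel_pending <user>
cancel_pending() {
    scancel --user="$1" --state=PENDING
}
```

Regular users can normally cancel only their own jobs, so in practice the argument is your own user name.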

Inspect Job Details

The scontrol command provides detailed information about Slurm jobs, nodes, and partitions. It's invaluable for troubleshooting and understanding job configurations.

# To list all information for a job:
scontrol show jobid -dd <job-id>
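The output of scontrol is a series of Key=Value pairs, so a single field can be pulled out with standard text tools. The sketch below extracts the job state; job_state is a hypothetical helper, and the parsing assumes the usual space-separated Key=Value layout.

```shell
# Hypothetical helper: print only the JobState value for a job.
# Usage: job_state <job-id>
job_state() {
    # Put each Key=Value pair on its own line, then keep the JobState value.
    scontrol show job "$1" | tr ' ' '\n' | sed -n 's/^JobState=//p'
}
```

This can be handy in scripts that wait for a job to leave the PENDING state.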

Check Job Resource Usage

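Use the sstat command to get near-real-time statistics on the resource consumption of a running job and its steps, such as CPU time, memory, and I/O. sstat does not report on finished jobs; for historical data, use sacct instead.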

# To show status information for a currently running job:
sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <job-id> --allsteps
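For a job that has already finished, accounting data is usually queried with sacct rather than sstat. A minimal sketch, assuming job accounting is enabled on the cluster; job_summary is a hypothetical helper name, and the field list is one reasonable selection rather than a required set.

```shell
# Hypothetical helper: summarize a finished job from the accounting records.
# Usage: job_summary <job-id>
job_summary() {
    sacct -j "$1" --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode
}
```

MaxRSS is reported per job step, so the value often appears on the batch or step rows rather than on the top-level job row.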
