AImageLab-HPC

Scheduler and Job Submission

Last updated: March 31, 2026


AImageLab-HPC compute nodes are accessed exclusively through the SLURM workload manager. Login nodes are intended for light tasks only: editing files, transferring data, compiling code, and short test runs not exceeding a few minutes. All production work must be submitted as a SLURM job.

Note: Running long or resource-intensive processes directly on login nodes is not permitted and may result in your process being terminated.
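Batch jobs are submitted with sbatch, which queues the script and immediately returns the assigned job ID. A minimal sketch (the script name and job ID are illustrative):

```shell
# Submit a job script; sbatch prints the assigned job ID
sbatch job.sh
# Submitted batch job 123456

# Flags passed on the command line override #SBATCH directives in the script
sbatch --time=00:30:00 --qos=all_qos_dbg job.sh
```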

Partitions and QOS

SLURM organises resources into partitions (queues of nodes) and QOS (Quality of Service) levels that define per-user limits and walltime caps.

| Partition | QOS | Max walltime | Max GPUs/user | Notes |
|---|---|---|---|---|
| all_serial (default) | all_serial | 4 h | 4 | Login nodes only. Free of charge. |
| all_usr_prod | all_usr_prod | 24 h | 8 | All GPU nodes. Standard users. |
| all_usr_prod | all_qos_sprod | 24 h | 4 (8 shared) | Auto-assigned to students. |
| all_usr_prod | all_qos_tprod | 24 h | — (8 shared) | Auto-assigned to thesis students. No per-user GPU cap. |
| all_usr_prod | all_qos_dbg | 1 h | 1 | Debug and testing. All users. |
| all_usr_prod | all_qos_lprod | 7 days | 0 (CPU only) | Long CPU-only jobs (e.g. workflow schedulers). |
| all_usr_prod | qos_biz_lprod | 30 days | 8 (group) | Industrial contracts and production demos. Restricted. |
| all_usr_prod | all_qos_bio | 24 h | 0 (CPU only) | Biomedical research team. Restricted. |
| all_usr_prod | all_qos_biol | 30 days | 0 (CPU only) | Biomedical research, long jobs. Restricted. |
| boost_usr_prod | boost_usr_prod | 24 h | 8 | High-end nodes (L40S, A40). Standard users. |
| boost_usr_prod | boost_qos_sprod | 24 h* | 4 (8 shared) | Auto-assigned to students. |
| boost_usr_prod | boost_qos_tprod | 24 h* | — (8 shared) | Auto-assigned to thesis students. No per-user GPU cap. |
| boost_usr_prod | all_qos_dbg | 1 h | 1 | Debug and testing. All users. |
| boost_usr_prod | qos_biz_lprod | 30 days | 8 (group) | Industrial contracts and production demos. Restricted. |

* Capped by the partition’s MaxTime; the QOS itself sets no walltime limit.

all_usr_prod vs boost_usr_prod: boost_usr_prod restricts scheduling to the high-end nodes equipped with L40S and A40 GPUs. all_usr_prod includes all GPU nodes (including older hardware). When in doubt, use all_usr_prod.

Students and thesis students are automatically assigned their respective QOS (all_qos_sprod/all_qos_tprod for all_usr_prod, boost_qos_sprod/boost_qos_tprod for boost_usr_prod) by the job submission system — no --qos flag is needed.
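One way to check which QOS your account has been associated with is via sacctmgr (a sketch; the exact format fields shown are a common choice, not mandated by the cluster):

```shell
# List the partition/QOS associations for your user account
sacctmgr show assoc where user=$USER format=User,Partition,QOS
```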

all_qos_dbg is available on both production partitions and grants immediate or near-immediate access with a 1-hour cap and a maximum of 1 GPU. Use it for quick tests and debugging.
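For example, a quick GPU sanity check can be submitted under the debug QOS without writing a script, using sbatch's --wrap flag (the account placeholder and time limit are illustrative):

```shell
sbatch --partition=all_usr_prod --qos=all_qos_dbg \
       --gres=gpu:1 --time=00:10:00 --account=<project> \
       --wrap="nvidia-smi"
```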

all_qos_lprod is intended for long-running CPU-only processes such as workflow schedulers or pipeline orchestrators; GPUs cannot be requested with this QOS.

qos_biz_lprod, all_qos_bio, and all_qos_biol are restricted QOS available only to specific groups; access is granted by the HPC team.

SLURM Directives

Resources are requested by including #SBATCH directives in your job script, or by passing the equivalent flags directly to sbatch or salloc.

| Directive | Description | Example |
|---|---|---|
| --job-name | Name for the job | #SBATCH --job-name=my_train |
| --output | Standard output file (%j = job ID) | #SBATCH --output=job_%j.out |
| --error | Standard error file | #SBATCH --error=job_%j.err |
| --time | Maximum walltime (hh:mm:ss) | #SBATCH --time=08:00:00 |
| --partition | Partition to submit to | #SBATCH --partition=all_usr_prod |
| --qos | Quality of Service | #SBATCH --qos=all_qos_dbg |
| --account | Project account | #SBATCH --account=<project> |
| --nodes | Number of nodes | #SBATCH --nodes=1 |
| --ntasks | Total number of tasks | #SBATCH --ntasks=1 |
| --cpus-per-task | CPU cores per task | #SBATCH --cpus-per-task=6 |
| --mem | Memory per node | #SBATCH --mem=18G |
| --gres | Generic resources (GPUs) | #SBATCH --gres=gpu:1 |

Defaults when requesting GPUs: each GPU automatically reserves 6 CPU cores and 3 GB of RAM per core (18 GB total) unless you override these with --cpus-per-task and --mem.
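For instance, to pair a single GPU with more CPU cores and memory than the defaults provide, override both directives explicitly (the values below are illustrative):

```shell
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=12   # override the default of 6 cores per GPU
#SBATCH --mem=48G            # override the default of 18 GB per GPU
```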

Job Script Examples

CPU-only Job

Suitable for data preprocessing, postprocessing, or any task that does not require a GPU.

#!/bin/bash
#SBATCH --job-name=preprocess
#SBATCH --partition=all_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=/work/<username>/logs/preprocess_%j.out
#SBATCH --error=/work/<username>/logs/preprocess_%j.err
#SBATCH --account=<project>

module load python/3.11.11-gcc-11.4.0
source /work/<username>/envs/my_env/bin/activate

python preprocess.py --input /work/<username>/data/raw --output /work/<username>/data/processed

Single-GPU Job

The most common use case: training or running inference with a single GPU.

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=all_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --output=/work/<username>/logs/train_%j.out
#SBATCH --error=/work/<username>/logs/train_%j.err
#SBATCH --account=<project>

module load py-torch/2.8.0-gcc-11.4.0-cuda-12.6.3
source /work/<username>/envs/my_env/bin/activate

python train.py --config config.yaml

Multi-GPU Job (single node)

For distributed training across multiple GPUs on the same node using PyTorch torchrun.

#!/bin/bash
#SBATCH --job-name=train_multigpu
#SBATCH --partition=all_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=24
#SBATCH --mem=72G
#SBATCH --time=24:00:00
#SBATCH --output=/work/<username>/logs/train_%j.out
#SBATCH --error=/work/<username>/logs/train_%j.err
#SBATCH --account=<project>

module load py-torch/2.8.0-gcc-11.4.0-cuda-12.6.3
source /work/<username>/envs/my_env/bin/activate

torchrun --nproc_per_node=4 train_distributed.py --config config.yaml

Interactive Job Submission

Interactive jobs are useful for debugging, exploratory work, or tasks that require real-time input. Resources are allocated and billed exactly like batch jobs.

Using salloc

salloc allocates resources and opens an interactive session on the login node. Commands typed in this session without srun still execute on the login node; prefix commands with srun to run them on the allocated compute node.

salloc --partition=all_usr_prod --qos=all_qos_dbg --gres=gpu:1 --time=01:00:00 --account=<project>

srun --pty /bin/bash      # open a shell on the compute node

# inside the shell:
module load py-torch/2.8.0-gcc-11.4.0-cuda-12.6.3
python -c "import torch; print(torch.cuda.is_available())"

exit                      # release the allocation

Using srun --pty

Opens a shell directly on the compute node in a single step:

srun --partition=all_usr_prod --qos=all_qos_dbg --gres=gpu:1 --time=01:00:00 --account=<project> --pty /bin/bash

Recommendation: prefer salloc when you plan to run multiple commands or scripts within the session. With srun --pty, launching additional srun calls inside the session may hang unless you pass --overlap.
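A sketch of the --overlap workaround: inside a session opened with srun --pty, a nested srun would normally wait for resources already consumed by the interactive step, so the new step is told to share them instead.

```shell
# inside the srun --pty shell:
srun --overlap nvidia-smi   # share the existing step's resources instead of waiting
```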

Monitoring Jobs

squeue

Displays pending and running jobs.

squeue -u $USER                         # your jobs only
squeue -j <job_id>                      # specific job
squeue -p all_usr_prod                  # all jobs in a partition
squeue --start -u $USER                 # estimated start time for pending jobs

Custom output format:

squeue -o "%.10i %.12P %.15j %.8u %.2t %.10M %.5D %R" -u $USER

sinfo

Shows node and partition status.

sinfo                                   # summary of all partitions
sinfo -p all_usr_prod                   # specific partition
sinfo -N -p all_usr_prod               # node-level view
sinfo -o "%P %D %t %C"                 # partition, nodes, state, CPUs

scontrol

Queries detailed information about jobs and nodes.

scontrol show job <job_id>              # full job details (resources, node, state)
scontrol show node <node_name>          # node hardware and state
scontrol hold <job_id>                  # prevent a job from starting
scontrol release <job_id>              # release a held job

scancel

Cancels pending or running jobs.

scancel <job_id>                        # cancel a specific job
scancel -u $USER                        # cancel all your jobs
scancel -u $USER -t PD                  # cancel all your pending jobs
scancel -p all_usr_prod -u $USER        # cancel your jobs in a specific partition