AImageLab-HPC

Scheduler and Job Submission

Last updated: March 31, 2026


AImageLab-HPC compute nodes are accessed exclusively through the SLURM workload manager. Login nodes are intended for light tasks only: editing files, transferring data, compiling code, and short test runs not exceeding a few minutes. All production work must be submitted as a SLURM job.

Note: Running long or resource-intensive processes directly on login nodes is not permitted and may result in your process being terminated.
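Batch jobs are submitted with sbatch, which queues the script and immediately returns the assigned job ID. A minimal sketch (the script name and job ID are illustrative):

```shell
# Submit a job script; sbatch prints the assigned job ID
sbatch job.sh
# Submitted batch job 123456

# Flags passed on the command line override #SBATCH directives in the script
sbatch --time=00:30:00 --qos=all_qos_dbg job.sh
```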

Partitions and QOS

SLURM organises resources into partitions (queues of nodes) and QOS (Quality of Service) levels that define per-user limits and walltime caps.

| Partition | QOS | Max walltime | Max GPUs/user | Notes |
|---|---|---|---|---|
| all_serial (default) | all_serial | 4 h | 4 | Login nodes only. Free of charge. |
| all_usr_prod | all_usr_prod | 24 h | 8 | All GPU nodes. Standard users. |
| all_usr_prod | all_qos_sprod | 24 h | 4 (8 shared) | Auto-assigned to students. |
| all_usr_prod | all_qos_tprod | 24 h | — (8 shared) | Auto-assigned to thesis students. No per-user GPU cap. |
| all_usr_prod | all_qos_dbg | 1 h | 1 | Debug and testing. All users. |
| all_usr_prod | all_qos_lprod | 7 days | 0 (CPU only) | Long CPU-only jobs (e.g. workflow schedulers). |
| all_usr_prod | qos_biz_lprod | 30 days | 8 (group) | Industrial contracts and production demos. Restricted. |
| all_usr_prod | all_qos_bio | 24 h | 0 (CPU only) | Biomedical research team. Restricted. |
| all_usr_prod | all_qos_biol | 30 days | 0 (CPU only) | Biomedical research, long jobs. Restricted. |
| boost_usr_prod | boost_usr_prod | 24 h | 8 | High-end nodes (L40S, A40). Standard users. |
| boost_usr_prod | boost_qos_sprod | 24 h* | 4 (8 shared) | Auto-assigned to students. |
| boost_usr_prod | boost_qos_tprod | 24 h* | — (8 shared) | Auto-assigned to thesis students. No per-user GPU cap. |
| boost_usr_prod | all_qos_dbg | 1 h | 1 | Debug and testing. All users. |
| boost_usr_prod | qos_biz_lprod | 30 days | 8 (group) | Industrial contracts and production demos. Restricted. |

* Capped by the partition’s MaxTime; the QOS itself sets no walltime limit.

all_usr_prod vs boost_usr_prod: boost_usr_prod restricts scheduling to the high-end nodes equipped with L40S and A40 GPUs. all_usr_prod includes all GPU nodes (including older hardware). When in doubt, use all_usr_prod.

Students and thesis students are automatically assigned their respective QOS (all_qos_sprod/all_qos_tprod for all_usr_prod, boost_qos_sprod/boost_qos_tprod for boost_usr_prod) by the job submission system — no --qos flag is needed.
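One way to check which QOS your account has been associated with is via sacctmgr (a sketch; the exact format fields shown are a common choice, not mandated by the cluster):

```shell
# List the partition/QOS associations for your user account
sacctmgr show assoc where user=$USER format=User,Partition,QOS
```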

all_qos_dbg is available on both production partitions and grants immediate or near-immediate access with a 1-hour cap and a maximum of 1 GPU. Use it for quick tests and debugging.
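For example, a quick GPU sanity check can be submitted under the debug QOS without writing a script, using sbatch's --wrap flag (the account placeholder and time limit are illustrative):

```shell
sbatch --partition=all_usr_prod --qos=all_qos_dbg \
       --gres=gpu:1 --time=00:10:00 --account=<project> \
       --wrap="nvidia-smi"
```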

all_qos_lprod is intended for long-running CPU-only processes such as workflow schedulers or pipeline orchestrators; GPUs cannot be requested with this QOS.

qos_biz_lprod, all_qos_bio, and all_qos_biol are restricted QOS available only to specific groups; access is granted by the HPC team.

SLURM Directives

Resources are requested by including #SBATCH directives in your job script, or by passing the equivalent flags directly to sbatch or salloc.

| Directive | Description | Example |
|---|---|---|
| --job-name | Name for the job | #SBATCH --job-name=my_train |
| --output | Standard output file (%j = job ID) | #SBATCH --output=job_%j.out |
| --error | Standard error file | #SBATCH --error=job_%j.err |
| --time | Maximum walltime (hh:mm:ss) | #SBATCH --time=08:00:00 |
| --partition | Partition to submit to | #SBATCH --partition=all_usr_prod |
| --qos | Quality of Service | #SBATCH --qos=all_qos_dbg |
| --account | Project account | #SBATCH --account=<project> |
| --nodes | Number of nodes | #SBATCH --nodes=1 |
| --ntasks | Total number of tasks | #SBATCH --ntasks=1 |
| --cpus-per-task | CPU cores per task | #SBATCH --cpus-per-task=6 |
| --mem | Memory per node | #SBATCH --mem=18G |
| --gres | Generic resources (GPUs) | #SBATCH --gres=gpu:1 |

Defaults when requesting GPUs: each GPU automatically reserves 6 CPU cores and 3 GB of RAM per core (18 GB total) unless you override these with --cpus-per-task and --mem.
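For instance, to pair a single GPU with more CPU cores and memory than the defaults provide, override both directives explicitly (the values below are illustrative):

```shell
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=12   # override the default of 6 cores per GPU
#SBATCH --mem=48G            # override the default of 18 GB per GPU
```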

Job Script Examples

CPU-only Job

Suitable for data preprocessing, postprocessing, or any task that does not require a GPU.

#!/bin/bash
#SBATCH --job-name=preprocess
#SBATCH --partition=all_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=/work/<username>/logs/preprocess_%j.out
#SBATCH --error=/work/<username>/logs/preprocess_%j.err
#SBATCH --account=<project>

module load python/3.11.11-gcc-11.4.0
source /work/<username>/envs/my_env/bin/activate

python preprocess.py --input /work/<username>/data/raw --output /work/<username>/data/processed

Single-GPU Job

The most common use case: training or running inference with a single GPU.

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=all_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --output=/work/<username>/logs/train_%j.out
#SBATCH --error=/work/<username>/logs/train_%j.err
#SBATCH --account=<project>

module load py-torch/2.8.0-gcc-11.4.0-cuda-12.6.3
source /work/<username>/envs/my_env/bin/activate

python train.py --config config.yaml

Multi-GPU Job (single node)

For distributed training across multiple GPUs on the same node using PyTorch torchrun.

#!/bin/bash
#SBATCH --job-name=train_multigpu
#SBATCH --partition=all_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=24
#SBATCH --mem=72G
#SBATCH --time=24:00:00
#SBATCH --output=/work/<username>/logs/train_%j.out
#SBATCH --error=/work/<username>/logs/train_%j.err
#SBATCH --account=<project>

module load py-torch/2.8.0-gcc-11.4.0-cuda-12.6.3
source /work/<username>/envs/my_env/bin/activate

torchrun --nproc_per_node=4 train_distributed.py --config config.yaml

Interactive Job Submission

Interactive jobs are useful for debugging, exploratory work, or tasks that require real-time input. Resources are allocated and billed exactly like batch jobs.

Using salloc

salloc allocates resources and opens an interactive session on the login node. Commands typed in this session without srun still execute on the login node; prefix commands with srun to run them on the allocated compute node.

salloc --partition=all_usr_prod --qos=all_qos_dbg --gres=gpu:1 --time=01:00:00 --account=<project>

srun --pty /bin/bash      # open a shell on the compute node

# inside the shell:
module load py-torch/2.8.0-gcc-11.4.0-cuda-12.6.3
python -c "import torch; print(torch.cuda.is_available())"

exit                      # release the allocation

Using srun --pty

Opens a shell directly on the compute node in a single step:

srun --partition=all_usr_prod --qos=all_qos_dbg --gres=gpu:1 --time=01:00:00 --account=<project> --pty /bin/bash

Recommendation: prefer salloc when you plan to run multiple commands or scripts within the session. With srun --pty, launching additional srun calls inside the session may hang unless you pass --overlap.
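A sketch of the --overlap workaround: inside a session opened with srun --pty, a nested srun would normally wait for resources already consumed by the interactive step, so the new step is told to share them instead.

```shell
# inside the srun --pty shell:
srun --overlap nvidia-smi   # share the existing step's resources instead of waiting
```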

Monitoring Jobs

squeue

Displays pending and running jobs.

squeue -u $USER                         # your jobs only
squeue -j <job_id>                      # specific job
squeue -p all_usr_prod                  # all jobs in a partition
squeue --start -u $USER                 # estimated start time for pending jobs

Custom output format:

squeue -o "%.10i %.12P %.15j %.8u %.2t %.10M %.5D %R" -u $USER

sinfo

Shows node and partition status.

sinfo                                   # summary of all partitions
sinfo -p all_usr_prod                   # specific partition
sinfo -N -p all_usr_prod               # node-level view
sinfo -o "%P %D %t %C"                 # partition, nodes, state, CPUs

scontrol

Queries detailed information about jobs and nodes.

scontrol show job <job_id>              # full job details (resources, node, state)
scontrol show node <node_name>          # node hardware and state
scontrol hold <job_id>                  # prevent a job from starting
scontrol release <job_id>              # release a held job

scancel

Cancels pending or running jobs.

scancel <job_id>                        # cancel a specific job
scancel -u $USER                        # cancel all your jobs
scancel -u $USER -t PD                  # cancel all your pending jobs
scancel -p all_usr_prod -u $USER        # cancel your jobs in a specific partition