Scheduler and Job Submission
AImageLab-HPC compute nodes are accessed exclusively through the SLURM workload manager. Login nodes are intended for light tasks only: editing files, transferring data, compiling code, and short test runs not exceeding a few minutes. All production work must be submitted as a SLURM job.
Note: Running long or resource-intensive processes directly on login nodes is not permitted and may result in your process being terminated.
Partitions and QOS
SLURM organises resources into partitions (queues of nodes) and QOS (Quality of Service) levels that define per-user limits and walltime caps.
| Partition | QOS | Max walltime | Max GPUs/user | Notes |
|---|---|---|---|---|
| all_serial (default) | all_serial | 4 h | 4 | Login nodes only. Free of charge. |
| all_usr_prod | all_usr_prod | 24 h | 8 | All GPU nodes. Standard users. |
| all_usr_prod | all_qos_sprod | 24 h | 4 (8 shared) | Auto-assigned to students. |
| all_usr_prod | all_qos_tprod | 24 h | — (8 shared) | Auto-assigned to thesis students. No per-user GPU cap. |
| all_usr_prod | all_qos_dbg | 1 h | 1 | Debug and testing. All users. |
| all_usr_prod | all_qos_lprod | 7 days | 0 (CPU only) | Long CPU-only jobs (e.g. workflow schedulers). |
| all_usr_prod | qos_biz_lprod | 30 days | 8 (group) | Industrial contracts and production demos. Restricted. |
| all_usr_prod | all_qos_bio | 24 h | 0 (CPU only) | Biomedical research team. Restricted. |
| all_usr_prod | all_qos_biol | 30 days | 0 (CPU only) | Biomedical research, long jobs. Restricted. |
| boost_usr_prod | boost_usr_prod | 24 h | 8 | High-end nodes (L40S, A40). Standard users. |
| boost_usr_prod | boost_qos_sprod | 24 h* | 4 (8 shared) | Auto-assigned to students. |
| boost_usr_prod | boost_qos_tprod | 24 h* | — (8 shared) | Auto-assigned to thesis students. No per-user GPU cap. |
| boost_usr_prod | all_qos_dbg | 1 h | 1 | Debug and testing. All users. |
| boost_usr_prod | qos_biz_lprod | 30 days | 8 (group) | Industrial contracts and production demos. Restricted. |
* Capped by the partition’s MaxTime; the QOS itself sets no walltime limit.
all_usr_prod vs boost_usr_prod: boost_usr_prod restricts scheduling to the high-end nodes equipped with L40S and A40 GPUs. all_usr_prod includes all GPU nodes (including older hardware). When in doubt, use all_usr_prod.
Students and thesis students are automatically assigned their respective QOS (all_qos_sprod/all_qos_tprod for all_usr_prod, boost_qos_sprod/boost_qos_tprod for boost_usr_prod) by the job submission system — no --qos flag is needed.
all_qos_dbg is available on both production partitions and grants immediate or near-immediate access with a 1-hour cap and a maximum of 1 GPU. Use it for quick tests and debugging.
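For a quick sanity check under the debug QOS, a one-shot command can also be run without writing a job script (the account name is a placeholder):

```shell
# run nvidia-smi on a compute node under the debug QOS (placeholder account)
srun --partition=all_usr_prod --qos=all_qos_dbg --gres=gpu:1 \
     --time=00:10:00 --account=<project> nvidia-smi
```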
all_qos_lprod is intended for long-running CPU-only processes such as workflow schedulers or pipeline orchestrators; GPUs cannot be requested with this QOS.
qos_biz_lprod, all_qos_bio, and all_qos_biol are restricted QOS available only to specific groups; access is granted by the HPC team.
SLURM Directives
Resources are requested by including #SBATCH directives in your job script, or by passing the equivalent flags directly to sbatch or salloc.
| Directive | Description | Example |
|---|---|---|
| --job-name | Name for the job | #SBATCH --job-name=my_train |
| --output | Standard output file (%j = job ID) | #SBATCH --output=job_%j.out |
| --error | Standard error file | #SBATCH --error=job_%j.err |
| --time | Maximum walltime (hh:mm:ss) | #SBATCH --time=08:00:00 |
| --partition | Partition to submit to | #SBATCH --partition=all_usr_prod |
| --qos | Quality of Service | #SBATCH --qos=all_qos_dbg |
| --account | Project account | #SBATCH --account=<project> |
| --nodes | Number of nodes | #SBATCH --nodes=1 |
| --ntasks | Total number of tasks | #SBATCH --ntasks=1 |
| --cpus-per-task | CPU cores per task | #SBATCH --cpus-per-task=6 |
| --mem | Memory per node | #SBATCH --mem=18G |
| --gres | Generic resources (GPUs) | #SBATCH --gres=gpu:1 |
Defaults when requesting GPUs: each GPU automatically reserves 6 CPU cores and 3 GB of RAM per core (18 GB total) unless you override these with --cpus-per-task and --mem.
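For example, requesting two GPUs implicitly reserves 12 cores and 36 GB of RAM; the defaults can be raised explicitly (a hypothetical directive fragment):

```shell
#SBATCH --gres=gpu:2        # implies 12 cores and 36 GB by default
#SBATCH --cpus-per-task=16  # override: extra cores, e.g. for data loading
#SBATCH --mem=64G           # override: more memory on the node
```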
Job Script Examples
CPU-only Job
Suitable for data preprocessing, postprocessing, or any task that does not require a GPU.
#!/bin/bash
#SBATCH --job-name=preprocess
#SBATCH --partition=all_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=/work/<username>/logs/preprocess_%j.out
#SBATCH --error=/work/<username>/logs/preprocess_%j.err
#SBATCH --account=<project>
module load python/3.11.11-gcc-11.4.0
source /work/<username>/envs/my_env/bin/activate
python preprocess.py --input /work/<username>/data/raw --output /work/<username>/data/processed
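Assuming the script above is saved as preprocess.sbatch (the filename is illustrative), it is submitted with sbatch, which prints the assigned job ID:

```shell
sbatch preprocess.sbatch   # prints "Submitted batch job <job_id>"
squeue -u $USER            # confirm the job is queued or running
```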
Single-GPU Job
The most common use case: training or running inference with a single GPU.
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=all_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --output=/work/<username>/logs/train_%j.out
#SBATCH --error=/work/<username>/logs/train_%j.err
#SBATCH --account=<project>
module load py-torch/2.8.0-gcc-11.4.0-cuda-12.6.3
source /work/<username>/envs/my_env/bin/activate
python train.py --config config.yaml
Multi-GPU Job (single node)
For distributed training across multiple GPUs on the same node using PyTorch torchrun.
#!/bin/bash
#SBATCH --job-name=train_multigpu
#SBATCH --partition=all_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=24
#SBATCH --mem=72G
#SBATCH --time=24:00:00
#SBATCH --output=/work/<username>/logs/train_%j.out
#SBATCH --error=/work/<username>/logs/train_%j.err
#SBATCH --account=<project>
module load py-torch/2.8.0-gcc-11.4.0-cuda-12.6.3
source /work/<username>/envs/my_env/bin/activate
torchrun --nproc_per_node=4 train_distributed.py --config config.yaml
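To keep the process count in sync with the --gres request, torchrun can read SLURM's environment instead of hard-coding 4. This sketch assumes the SLURM_GPUS_ON_NODE variable is exported in the job environment, as it is in recent SLURM releases:

```shell
# use the allocated GPU count rather than a hard-coded value
torchrun --nproc_per_node=${SLURM_GPUS_ON_NODE:-4} train_distributed.py --config config.yaml
```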
Interactive Job Submission
Interactive jobs are useful for debugging, exploratory work, or tasks that require real-time input. Resources are allocated and billed exactly like batch jobs.
Using salloc
salloc allocates resources and opens an interactive session. Commands run without srun execute on the login node; use srun to run on the allocated compute node.
salloc --partition=all_usr_prod --qos=all_qos_dbg --gres=gpu:1 --time=01:00:00 --account=<project>
srun --pty /bin/bash # open a shell on the compute node
# inside the shell:
module load py-torch/2.8.0-gcc-11.4.0-cuda-12.6.3
python -c "import torch; print(torch.cuda.is_available())"
exit # release the allocation
Using srun --pty
Opens a shell directly on the compute node in a single step:
srun --partition=all_usr_prod --qos=all_qos_dbg --gres=gpu:1 --time=01:00:00 --account=<project> --pty /bin/bash
Recommendation: prefer salloc when you plan to run multiple commands or scripts within the session. With srun --pty, launching additional srun calls inside the session may hang unless you pass --overlap.
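For instance, to open a second shell inside an allocation already occupied by an srun --pty session (the job ID is a placeholder, taken from squeue):

```shell
# from another login-node terminal, attach to the existing allocation
srun --jobid=<job_id> --overlap --pty /bin/bash
```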
Monitoring Jobs
squeue
Displays pending and running jobs.
squeue -u $USER # your jobs only
squeue -j <job_id> # specific job
squeue -p all_usr_prod # all jobs in a partition
squeue --start -u $USER # estimated start time for pending jobs
Custom output format:
squeue -o "%.10i %.12P %.15j %.8u %.2t %.10M %.5D %R" -u $USER
sinfo
Shows node and partition status.
sinfo # summary of all partitions
sinfo -p all_usr_prod # specific partition
sinfo -N -p all_usr_prod # node-level view
sinfo -o "%P %D %t %C" # partition, nodes, state, CPUs
scontrol
Queries detailed information about jobs and nodes.
scontrol show job <job_id> # full job details (resources, node, state)
scontrol show node <node_name> # node hardware and state
scontrol hold <job_id> # prevent a job from starting
scontrol release <job_id> # release a held job
scancel
Cancels pending or running jobs.
scancel <job_id> # cancel a specific job
scancel -u $USER # cancel all your jobs
scancel -u $USER -t PD # cancel all your pending jobs
scancel -p all_usr_prod -u $USER # cancel your jobs in a specific partition