AImageLab SRV

SLURM Features


The AImageLab-SRV cluster is a heterogeneous computing environment: its nodes differ in GPU model and in the amount of available VRAM. To use these resources efficiently, users need to understand how to request hardware through SLURM node features. This document lists the node features available on AImageLab-SRV and explains how to request specific hardware configurations when submitting jobs.

Node Features

Node features describe the capabilities of each node, in particular its GPU model and the amount of available VRAM. A job can request features with the SLURM --constraint option, typically written as an #SBATCH directive in the job script, so that it only runs on nodes with suitable hardware.

Supported GPU Features

The current list of supported GPU node features is as follows:

  • gpu_1080_8G
  • gpu_2080Ti_11G
  • gpu_RTX5000_16G
  • gpu_RTX6000_24G
  • gpu_RTXA5000_24G
  • gpu_A40_48G

Each feature is named according to the GPU model and the amount of VRAM it possesses. For example, the feature gpu_RTX6000_24G corresponds to nodes equipped with an NVIDIA RTX 6000 GPU that has 24 GB of VRAM.
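
The naming convention can be unpacked mechanically. The following is a minimal bash sketch, assuming every feature follows the gpu_<MODEL>_<VRAM>G pattern shown in the list above; the function name parse_gpu_feature is a hypothetical helper, not something provided by the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a feature name of the form gpu_<MODEL>_<VRAM>G
# into its GPU model and its VRAM size in GB.
parse_gpu_feature() {
    local feature="$1"
    local rest="${feature#gpu_}"   # drop the "gpu_" prefix
    local model="${rest%_*}"       # text before the last underscore
    local vram="${rest##*_}"       # text after the last underscore, e.g. "24G"
    vram="${vram%G}"               # strip the trailing "G"
    echo "$model $vram"
}

parse_gpu_feature gpu_RTX6000_24G   # prints: RTX6000 24
```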

Submitting Jobs with Node Features

To request node features on the AImageLab-SRV cluster, add a --constraint directive to your SLURM job script. The scheduler then allocates the job only to nodes that advertise a matching feature, which ensures the job gets the GPU resources it needs.

💡 Example: Requesting GPUs with 24 GB of VRAM

If your job requires a GPU with at least 24 GB of VRAM, you can specify the following constraint in your SLURM job script:

#SBATCH --constraint="gpu_RTX6000_24G|gpu_RTXA5000_24G|gpu_A40_48G"

In this expression the | character acts as a logical OR: the job may run on a node equipped with an RTX 6000 (24 GB), an RTX A5000 (24 GB), or an A40 (48 GB) GPU, so it always has access to at least 24 GB of VRAM.
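
For context, the directive might sit in a complete job script like the minimal sketch below. The job name, the --gres=gpu:1 request, and the echo payload are assumptions made for illustration; the cluster may require further options (such as a partition or an account) that this document does not describe.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=vram24_example
#SBATCH --gres=gpu:1
#SBATCH --constraint="gpu_RTX6000_24G|gpu_RTXA5000_24G|gpu_A40_48G"

# SLURM sets SLURMD_NODENAME inside a running job; fall back to a
# placeholder so the script also runs outside the scheduler.
node="${SLURMD_NODENAME:-unknown}"
echo "Job running on node: $node"
```

The script would be submitted with sbatch followed by the script's file name.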

Guidelines for Specifying Constraints

To optimize the use of cluster resources and to avoid unnecessary job delays, users should adhere to the following guidelines when specifying node features:

  1. Allow Higher VRAM GPUs 🟢: Always allow nodes with GPUs that have more VRAM than the minimum required by your job. For example, if your script requires 24 GB of VRAM, it is advisable to include nodes with 48 GB VRAM GPUs in your constraints.

  2. Include Compatible Lower VRAM GPUs 🔄: When your job has a specific VRAM requirement, include every GPU whose VRAM meets it, starting from the smallest one that fits, rather than listing only the high-end models. For instance, if your job needs 9 GB of VRAM, you must allow nodes with GPUs having 11 GB, 16 GB, 24 GB, or 48 GB of VRAM.
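
Both guidelines amount to listing every feature whose VRAM is at least what the job needs. The following is a minimal bash sketch of that rule, assuming the feature list given earlier in this document is still current; build_constraint and the FEATURES array are hypothetical names introduced here, not cluster-provided tools.

```shell
#!/usr/bin/env bash
# Feature names copied from the supported GPU features list in this document.
FEATURES=(gpu_1080_8G gpu_2080Ti_11G gpu_RTX5000_16G
          gpu_RTX6000_24G gpu_RTXA5000_24G gpu_A40_48G)

# Hypothetical helper: print an OR expression containing every feature
# whose VRAM (in GB) is at least the value given as the first argument.
build_constraint() {
    local min_gb="$1" feature vram expr=""
    for feature in "${FEATURES[@]}"; do
        vram="${feature##*_}"   # e.g. "24G"
        vram="${vram%G}"        # e.g. "24"
        if (( vram >= min_gb )); then
            expr+="${expr:+|}${feature}"   # add "|" only between entries
        fi
    done
    echo "$expr"
}

build_constraint 24   # prints: gpu_RTX6000_24G|gpu_RTXA5000_24G|gpu_A40_48G
```

For a job that needs 9 GB, build_constraint 9 yields every feature from gpu_2080Ti_11G upward, matching the example in guideline 2. The resulting string can be pasted into the #SBATCH --constraint line of a job script.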

🚨 Penalties for Non-Compliance

⚠️ Attention: Users who do not comply with the guidelines for specifying node constraints will receive email alerts. Continued non-compliance may result in the user being temporarily blocked from submitting jobs on the cluster. This measure is in place to ensure fair and efficient use of the AImageLab-SRV resources.

Conclusion

Using SLURM node features correctly keeps the cluster efficient. By following the guidelines in this document, users ensure that their jobs are allocated to suitable nodes, which makes better use of the available hardware and reduces wait times for other users.

For further assistance or questions regarding job submission and node features, please contact the AImageLab-SRV support team.

Last updated: August 13, 2024