Containerization

When installing software, you may come across applications that have complex chains of dependencies that are challenging to compile and install. Some software may require very specific versions of libraries that may not be available on AImageLab-SRV or conflict with libraries needed for other applications. You may also need to move between several workstations or HPC platforms, which often requires reinstalling your software on each system. Containers are a good way to tackle all of these issues and more.

Containerization Fundamentals

Containers build upon an idea that has long existed within computing: hardware can be emulated through software. Virtualization simulates some or all components of computation through a software application. Virtual machines use this concept to run an entire operating system as an application on a host system. Containers follow the same idea at a much smaller scale: rather than emulating hardware, they share the host system’s kernel and isolate only the user-space environment.

Containers are portable compartmentalizations of an operating system, software, libraries, data, and/or workflows. Containers offer portability and reproducibility.

  • Portability: containers can run on any system equipped with their specified container engine.
  • Reproducibility: because containers are instances of prebuilt, isolated software, the software behaves the same way every time it runs.

Containers distinguish themselves from virtual machines through their low computational overhead and their ability to utilize all of a host system’s resources. Building containers is a relatively simple process that starts with a container engine.

Container engines

Docker is the most widely used container engine and can be used on any system where you have administrative privileges. However, Docker cannot be run on high-performance computing (HPC) platforms, because its daemon requires privileges that ordinary users do not have.

Singularity is a container engine that does not require administrative privileges to execute. Therefore, it is safe to run on HPC platforms.

Because Docker images are widely available for many software packages, a common use case is to use Singularity to run Docker images.
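
For example, Singularity can pull an image from Docker Hub and convert it into a local Singularity image file (a minimal sketch; the image tag is arbitrary):

# Pull a Docker Hub image and convert it to a Singularity image file (SIF)
singularity pull docker://ubuntu:latest

# Run a command inside the resulting image
singularity exec ubuntu_latest.sif cat /etc/os-release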

Containerization on AImageLab-SRV is done with Singularity and facilitated by NVIDIA Pyxis, which allows you to run and manage containerized applications efficiently within our HPC environment.

Introduction to NVIDIA Pyxis

NVIDIA Pyxis is an open-source SLURM plugin that allows unprivileged users to run containerized tasks, including GPU-accelerated ones, directly through srun. It provides seamless integration with SLURM for scheduling and managing containerized jobs, and it can run images from Docker-compatible registries such as Docker Hub and NVIDIA NGC.

For more detailed information on Pyxis, you can visit the NVIDIA Pyxis GitHub page.
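
As a quick sanity check that Pyxis is available, you can run a single command inside a container directly with srun, without writing a batch script (account and partition are placeholders, as elsewhere in this guide):

# Run one command inside an Ubuntu container on a compute node
srun --account=<account> --partition=<partition> \
     --container-image=ubuntu:latest cat /etc/os-release

If Pyxis is active, the output reports the release of the container image rather than the host operating system.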

Running Containers with Pyxis

To run containers using Pyxis on AImageLab-SRV, you need to follow specific steps to ensure proper configuration and execution. Below is a step-by-step guide to help you get started.

Step 1: Preparing Your Container

  1. Select or Build a Container Image:

    You can use pre-built container images from repositories such as Docker Hub or NVIDIA NGC. If you need a custom container, build it using Docker or Singularity.

  2. Ensure Compatibility:

    Verify that your container image is compatible with the NVIDIA GPU resources available on AImageLab-SRV. For GPU-enabled applications, ensure that the container includes the necessary CUDA libraries; the NVIDIA driver itself is provided by the host. A quick compatibility check is sketched after this list.
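
As that quick compatibility check, you can request a GPU and run nvidia-smi inside the image (the account, partition, and image name are placeholders):

# Request one GPU and verify the container can see it through the host driver
srun --account=<account> --partition=<partition> --gres=gpu:1 \
     --container-image=pytorch/pytorch nvidia-smi

If the image is compatible, nvidia-smi lists the allocated GPU.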

Step 2: Writing Your SLURM Job Script

To submit a job that runs a container using Pyxis, you need to write a SLURM batch script. Below is a sample SLURM job script to run a container:

#!/bin/bash
#SBATCH --job-name=container_job         # Job name
#SBATCH --output=container_job.out       # Output file
#SBATCH --error=container_job.err        # Error file
#SBATCH --account=<account>              # Your account
#SBATCH --partition=<partition>          # Partition
#SBATCH --ntasks=1                       # Number of tasks
#SBATCH --time=01:00:00                  # Time limit

# Define the container image
IMAGE_NAME="ubuntu:latest"

# Run the container using Pyxis
srun --container-image=${IMAGE_NAME} --container-mounts=/path/to/data:/data \
     --container-workdir=/data /bin/bash -c "echo Hello, Container!"

In this script:
  • --container-image specifies the container image to use.
  • --container-mounts mounts host directories into the container, given as SRC:DST pairs.
  • --container-workdir sets the working directory inside the container.
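
For example, a dataset directory and a results directory can be mounted at the same time by separating SRC:DST pairs with commas (both host paths are placeholders):

# Mount a dataset directory and a results directory into the container
srun --container-image=ubuntu:latest \
     --container-mounts=/path/to/data:/data,/path/to/results:/results \
     /bin/bash -c "ls /data /results"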

Step 3: Submitting Your Job

Submit your SLURM job script using the sbatch command:

sbatch your_job_script.sh

Replace your_job_script.sh with the name of your job script file.
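
After submitting, you can monitor the job and, once it finishes, inspect the files named in the #SBATCH directives:

# Check the job's state in the queue
squeue -u $USER

# After completion, inspect the output and error files
cat container_job.out container_job.err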

Advanced Container Configuration

Pyxis supports more advanced container configurations. For detailed examples and configurations, refer to the NVIDIA Pyxis GitHub page.
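
One useful option documented there is --container-name, which keeps a named container alive across srun calls within the same job allocation, so that changes made in one step (for example, installed packages) persist into the next:

# First step: create a named container and install a package inside it
srun --container-image=ubuntu:latest --container-name=mycontainer \
     /bin/bash -c "apt-get update && apt-get install -y curl"

# Second step: reuse the same container; curl is still available
srun --container-name=mycontainer curl --version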

Example: Running a GPU-Accelerated Application

To run a GPU-accelerated application, ensure your container image includes GPU libraries. Modify your SLURM job script as follows:

#!/bin/bash
#SBATCH --job-name=gpu_app
#SBATCH --output=gpu_app.out
#SBATCH --error=gpu_app.err
#SBATCH --account=<account>
#SBATCH --partition=<partition>
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00

IMAGE_NAME="pytorch/pytorch"

srun --container-image=${IMAGE_NAME} --container-mounts=/path/to/data:/data \
     --container-workdir=/data /bin/bash -c "python train_model.py"

In this example, train_model.py is a script that utilizes the GPU. Because the working directory is the mounted /data directory, the script is expected to be in /path/to/data on the host (or already included in the image).

Common Issues and Troubleshooting

  • Container Fails to Start: Ensure that the container image is accessible and compatible with the SLURM node. Check for any errors in the output and error files.

  • GPU Not Available: Verify that you requested GPUs in your SLURM job script and that the container image supports GPU acceleration (a quick check is sketched after this list).

  • Permission Issues: Ensure you have the necessary permissions to access the specified directories and files.
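
If you suspect a GPU visibility problem, a quick check is to ask PyTorch inside the container whether CUDA is available (account, partition, and image are placeholders, as in the examples above):

# Check whether CUDA is visible to PyTorch inside the container
srun --account=<account> --partition=<partition> --gres=gpu:1 \
     --container-image=pytorch/pytorch \
     python -c "import torch; print(torch.cuda.is_available())"

If this prints False, re-check the --gres request in your job script and the CUDA libraries inside the image.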

For further assistance, consult our detailed documentation or contact support.

Last updated: August 24, 2024