CUDA and PyTorch

This page should help you resolve issues when working with CUDA and PyTorch.

Selecting a CUDA Version

Multiple CUDA installations are present on the cluster and can be activated using the module command:

module avail
module add cuda/12.6

If you want your selected CUDA version to be the default for future login sessions, run:

module save default

Note that selecting a CUDA version earlier than 12.5 automatically loads the gcc/11 module, because newer compilers are incompatible with older CUDA versions.

Installing PyTorch

The installation command for PyTorch (after creating and activating a virtual environment) is:

pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu...

The last part of the URL is the desired CUDA version, written as the major and minor version numbers concatenated without a dot (for example, cu126 for 12.6). It must match the version that you activate with module:

module add cuda/12.6
pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
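
The naming rule for the URL suffix can be sketched in a few lines of Python. This is only an illustration: the helper name pytorch_index_url is hypothetical, and the sketch assumes a version string of the form "major.minor". It does not check whether PyTorch actually publishes wheels for that version.

```python
def pytorch_index_url(cuda_version: str) -> str:
    """Build the wheel index URL for a CUDA version such as "12.6".

    Hypothetical helper: it only applies the naming rule (major and minor
    concatenated without a dot) and does not verify that the index exists.
    """
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


print(pytorch_index_url("12.6"))  # https://download.pytorch.org/whl/cu126
```

The printed URL is the value to pass to --index-url in the pip command above.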

The system CUDA version is 12.8, which may be too new for PyTorch, so be sure to select a version for which PyTorch provides wheels.

Adding Additional Packages

When you install packages that depend on torch and do not ship pre-compiled code, you must activate the same CUDA version that you used to install torch. Otherwise you may get errors like the following:

pip install --no-cache-dir spatial-correlation-sampler
...
RuntimeError:
The detected CUDA version (12.1) mismatches the version that was used to compile
PyTorch (12.6). Please make sure to use the same CUDA versions.
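
A mismatch like this can be detected before installing by comparing the toolkit version reported by nvcc with the version PyTorch was built against. The following is a minimal sketch, not a definitive check: the function names are hypothetical, and it assumes that nvcc --version prints a line containing "release <major>.<minor>", as current toolkits do.

```python
import re


def parse_nvcc_release(nvcc_output: str):
    """Extract "major.minor" from the text printed by `nvcc --version`.

    Returns None if no "release X.Y" pattern is found.
    """
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def versions_match(toolkit_version, torch_cuda_version) -> bool:
    """True only if both versions are known and identical."""
    return toolkit_version is not None and toolkit_version == torch_cuda_version


# Sample line in the format nvcc prints; the numbers are illustrative.
sample = "Cuda compilation tools, release 12.1, V12.1.105"
print(versions_match(parse_nvcc_release(sample), "12.6"))  # False
```

On the cluster, the first argument would come from the output of nvcc --version (for example via subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout) and the second from torch.version.cuda.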

Running Jobs

The module command is not available inside running jobs unless you add the following to your batch file or script right after the #SBATCH lines (note the leading dot, which sources the file into the current shell):

. /etc/profile.d/modules.sh
module add cuda/12.6
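
To confirm from inside a job that the module was actually loaded, a Python script can check whether nvcc is on the search path. This is a sketch under the assumption that the cuda module puts nvcc on PATH, which is typical but not confirmed here. The helper name find_nvcc is hypothetical.

```python
import shutil


def find_nvcc(path=None):
    """Return the full path of nvcc, or None if it is not on the search path.

    With path=None the PATH environment variable of the job is used.
    """
    return shutil.which("nvcc", path=path)


nvcc = find_nvcc()
if nvcc is None:
    print("nvcc not found; was 'module add cuda/...' run in this job?")
else:
    print(f"nvcc found at {nvcc}")
```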

Jupyter

For environments prepared by TAs for the cluster's JupyterHub, the correct CUDA version is already active.

If you provide your own custom environment, you can specify the modules to load at server startup.

Testing PyTorch

Run the following snippet in a notebook to check whether PyTorch is installed and whether a GPU is available:

try:
    import torch
    print("PyTorch is installed.")

    if torch.cuda.is_available():
        print("CUDA is available.")
        num_gpus = torch.cuda.device_count()
        print(f"Number of GPUs: {num_gpus}")
        for i in range(num_gpus):
            gpu_name = torch.cuda.get_device_name(i)
            print(f" - GPU {i}: {gpu_name}")
    else:
        print("CUDA is not available.")
except ImportError:
    print("PyTorch is not installed.")
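
The check above only reports availability. As an optional follow-up, the sketch below runs a small computation on the first GPU to confirm that kernels actually execute. The function name gpu_smoke_test and the matrix size are arbitrary choices for illustration.

```python
def gpu_smoke_test(size: int = 256) -> str:
    """Multiply two random matrices on GPU 0 and report the outcome as text."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed."

    if not torch.cuda.is_available():
        return "CUDA is not available."

    device = torch.device("cuda:0")
    a = torch.rand(size, size, device=device)
    b = torch.rand(size, size, device=device)
    c = a @ b
    # Kernel launches are asynchronous; wait so that errors surface here.
    torch.cuda.synchronize()
    name = torch.cuda.get_device_name(0)
    return f"Matrix product of shape {tuple(c.shape)} computed on {name}."


print(gpu_smoke_test())
```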