Titan

Titan, named after the moon of Saturn, is a Cyberinfrastructure Research Computing (CIRC) cluster built to provide access to GPUs.

Titan is currently slated for migration into Ganymede2, the next iteration of the Ganymede campus condo cluster. As such, we are no longer offering buy-ins to Titan; new buy-ins are offered through Ganymede2 instead.

Titan node setup

Titan is configured so multiple jobs can run on one node, depending on the GPUs and memory specified. The following nodes are available in the normal queue:

Node Name               GPU Type                 Number of GPUs  Cores  Memory
compute-[01-02, 04-08]  NVIDIA GeForce RTX 3090  4               8      193 GB
compute-03              NVIDIA GeForce RTX 3090  8               16     386 GB

Using Titan

Logging in

Titan is accessed via SSH. Once your account is activated, you can connect to Titan at titan.circ.utdallas.edu. For example, in a typical terminal client run the command:

ssh <NetID>@titan.circ.utdallas.edu

More information on setting up SSH access to CIRC machines on your computer can be found in our SSH documentation.
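As a convenience, you can add a host entry to your local ~/.ssh/config so that a short alias connects you directly. This is a minimal sketch assuming an OpenSSH client; replace <NetID> with your own NetID:

# ~/.ssh/config entry for Titan (illustrative)
Host titan
    HostName titan.circ.utdallas.edu
    User <NetID>

With this entry in place, running ssh titan connects you to the login node.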

Requesting resources for jobs

Titan, like most of the CIRC systems, uses Slurm for job submission and scheduling. However, Titan has a few special requirements due to its configuration.

GPUs are treated as Slurm Generic Resources (GRES). Slurm provides several convenience options for requesting GPU access for your jobs. For example, if your job needs 4 CPUs, 2 GPUs, and 16 GB of memory on one node, your Slurm batch script should include:

#SBATCH -N 1
#SBATCH -n 4
#SBATCH --gpus=2
#SBATCH --mem=16G

In order for jobs to share nodes, one of --mem, --mem-per-gpu, or --mem-per-cpu must be specified in your batch script or when you request an interactive job. If you don't specify memory, your job will be allocated all memory available on the node and block other jobs from running there.
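Putting these pieces together, a complete batch script might look like the following sketch. The job name, time limit, and the nvidia-smi check are illustrative, not required:

#!/bin/bash
#SBATCH -J gpu-test          # Job name (illustrative)
#SBATCH -N 1                 # One node
#SBATCH -n 4                 # Four CPU cores
#SBATCH --gpus=2             # Two GPUs
#SBATCH --mem=16G            # 16 GB of memory, so the node can be shared
#SBATCH -t 01:00:00          # One-hour time limit (illustrative)

# Confirm which GPUs Slurm assigned before running your code
nvidia-smi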

Similarly, for an interactive job, run:

srun -N 1 -n 4 --gpus=2 --mem=16G --pty /bin/bash
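Once the interactive shell starts, you can confirm the allocation before launching your code. Slurm typically exports CUDA_VISIBLE_DEVICES for GPU allocations, and nvidia-smi should list exactly the GPUs you requested:

# Inside the interactive session: check what Slurm allocated
echo $CUDA_VISIBLE_DEVICES
nvidia-smi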

Using containers

Many GPU codes are distributed via Docker containers. Docker is not allowed on CIRC systems due to security issues. However, Docker containers can be used by running them with Apptainer/Singularity. For example, you can use the TensorFlow Docker container from DockerHub with the following commands:

# Loads the Singularity Module
module load singularity

# Pull the TensorFlow Docker container and transform
# it into a Singularity sandbox
singularity build --sandbox tensorflow_sandbox/ docker://tensorflow/tensorflow

# Run your Python script with the tensorflow container
singularity run -u --nv tensorflow_sandbox python <your_python_script.py>

Using Singularity on Titan requires passing the -u and --nv flags to singularity run, singularity shell, and singularity exec. The -u (--userns) flag runs the container in an unprivileged user namespace, and --nv allows Singularity to "see" the allocated GPUs.
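For a non-interactive job, the same command can go in a Slurm batch script. This sketch assumes the tensorflow_sandbox/ directory built above already exists; the script name train.py is a placeholder for your own:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --gpus=1
#SBATCH --mem=16G

# Load Singularity and run the containerized script on the allocated GPU
module load singularity
singularity run -u --nv tensorflow_sandbox/ python train.py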

Available software

You can view all available modules on Titan by running the command module spider. If you need new software installed or a different version than is provided, please contact circ-assist@utdallas.edu.
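For example, to search for and then load a specific package (the cuda module name and version here are illustrative; module spider reports what is actually installed):

# List all installed modules
module spider

# Show available versions of a package, then load one
module spider cuda
module load cuda/11.8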

Troubleshooting on Titan

The following scenarios commonly come up on Titan. If your problem isn’t listed below, please contact circ-assist@utdallas.edu for help.

My code isn’t using all requested GPUs

Some codes require MPI with one GPU per task in order to use multiple GPUs. If this matches your code's setup, the following Slurm batch script settings assign one GPU and 4 GB of memory per MPI task:

#SBATCH -N 1
#SBATCH -n 2
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-gpu=4G
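With those settings, each of the two MPI tasks gets its own GPU. Inside the same batch script, the application is then launched with srun; my_mpi_gpu_app below is a placeholder for your executable:

# Launch one process per MPI task; Slurm binds one GPU to each
srun ./my_mpi_gpu_app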