The High Performance Computing (HPC) resources at UT Dallas use the Slurm Workload Manager to facilitate access to the compute nodes from the login nodes. If multiple users or jobs need to access the same resources, Slurm also manages queues of pending work to submit to the resources as they become available. You can request resources from Slurm in two ways: interactively and through a batch script.
When you run a job interactively with
srun --pty /bin/bash, you’re provided
with terminal access to the compute node after the requested resources are
allocated. Before requesting an interactive session, you need to know:
- The partition the resources you need to access are in (default: normal on Ganymede), denoted with -p or --partition.
- The number of nodes you need (-N or --nodes).
- Either the total number of tasks needed, or the number of tasks per node (--ntasks for total tasks and --ntasks-per-node for the number of tasks per node).
- (Optional) The number of CPUs needed per task (--cpus-per-task).
For example, the following command requests one node and one task on the normal partition, a setup suitable for a serial job:
srun -N 1 -n 1 -p normal --pty /bin/bash
Similarly, the following command requests two nodes and thirty-two tasks (equivalently, sixteen tasks per node) on the normal partition, which would be appropriate for an MPI-parallelized job:
srun -N 2 -n 32 -p normal --pty /bin/bash
Finally, the following command requests one node, one task, and sixteen CPUs per task, suitable for an OpenMP-parallelized job:
srun -N 1 -n 1 --cpus-per-task 16 --pty /bin/bash
Once your resources are allocated in an interactive job, you have terminal access to the compute nodes requested. From there, you can run your workload application or scripts in a manner suitable to your parallelization method.
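As a quick sanity check inside the session, Slurm exports environment variables describing the allocation. The following sketch shows a few of the standard ones (which variables are set depends on how you requested resources):

```shell
# Run these inside the interactive session on the compute node:
hostname                     # the compute node you were placed on
echo "$SLURM_JOB_ID"         # the job ID Slurm assigned
echo "$SLURM_NTASKS"         # total number of tasks allocated
echo "$SLURM_CPUS_PER_TASK"  # CPUs per task (set only if requested)
```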
Often, it’s preferable to submit your job to the compute nodes non-interactively. By using the batch script method, a job is queued until resources become available and then run with no further input from you. Submitting to the batch system requires writing a batch script composed of Slurm settings and your workload commands.
At the beginning of your batch script, you can specify Slurm batch settings by prefixing each setting with #SBATCH. At minimum, you need to specify:
- The partition to request resources from (default: normal on Ganymede). For example: #SBATCH --partition=normal.
- The number of nodes your workload requires. Example: #SBATCH --nodes=2.
- The total number of tasks required. Example: #SBATCH --ntasks=32. Alternatively, you can specify the number of tasks per node with #SBATCH --ntasks-per-node.
- The maximum time required for your workload in the format Days-Hours:Minutes:Seconds. For example: #SBATCH --time=1-12:00:00 provides a maximum run time of one day and twelve hours.
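Put together, a minimal batch script header looks like the following sketch (the values here are placeholders; adjust them to your workload):

```shell
#!/bin/bash
#SBATCH --partition=normal   # partition to request resources from
#SBATCH --nodes=1            # number of nodes
#SBATCH --ntasks=16          # total number of tasks
#SBATCH --time=0-02:00:00    # maximum run time: two hours

# Your workload commands follow the settings block
echo "Job running with $SLURM_NTASKS tasks"
```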
For a full list of available settings, see the Slurm sbatch documentation (man sbatch).
For example, the following batch script runs a python script parallelized with mpi4py on two nodes with a total of thirty-two tasks and a maximum runtime of one hour:
#!/bin/bash
#SBATCH --partition=normal
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --time=0-01:00:00

prun python my_script.py
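Assuming the batch script above is saved as, say, job.sh (a hypothetical filename), you would submit and monitor it with the standard Slurm commands:

```shell
sbatch job.sh     # submit the script; prints the assigned job ID
squeue -u $USER   # check your pending and running jobs
scancel 12345     # cancel a job by ID if needed (12345 is an example ID)
```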
For more information on prun and parallelization techniques, see the relevant documentation.
The following are some frequently asked questions regarding the use of Slurm.
Why is my job stuck in the queue?
There are various reasons this can happen. The following are some common issues.
The time limit for your job conflicts with a reservation
You can either modify your job to specify a total run time that ends before the reservation begins, or wait until after the reservation window. To see existing reservations, you can run:
scontrol show reservation