Submitting workloads
The High Performance Computing (HPC) resources at UT Dallas use the Slurm Workload Manager to facilitate access to the compute nodes from the login nodes. If multiple users or jobs need to access the same resources, Slurm also manages queues of pending work to submit to the resources as they become available. You can request resources from Slurm in two ways: interactively and through a batch script.
Interactive jobs
When you run a job interactively with `srun --pty /bin/bash`, you’re provided with terminal access to the compute node after the requested resources are allocated. Before requesting an interactive session, you need to know:

- The partition the resources you need are in (default: `normal` on Ganymede), denoted by `-p` or `--partition`.
- The number of nodes you need (`-N` or `--nodes`).
- Either the total number of tasks needed or the number of tasks per node (`-n` or `--ntasks` for total tasks, `--ntasks-per-node` for tasks per node).
- (Optional) The number of CPUs needed per task (`--cpus-per-task`).
Interactive job examples
For example, the following command requests one node and one task on the `normal` partition, a setup suitable for a serial job:

```bash
srun -N 1 -n 1 -p normal --pty /bin/bash
```
Similarly, the following command requests two nodes and thirty-two tasks (equivalently, sixteen tasks per node) on the `normal` partition, which would be appropriate for an MPI-parallelized job:

```bash
srun -N 2 -n 32 -p normal --pty /bin/bash
```
Finally, the following command requests one node, one task, and 16 CPUs per task, suitable for an OpenMP-parallelized job:

```bash
srun -N 1 -n 1 --cpus-per-task 16 --pty /bin/bash
```
Launching work in an interactive job
Once your resources are allocated in an interactive job, you have terminal access to the compute nodes requested. From there, you can run your workload application or scripts in a manner suitable to your parallelization method.
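For instance, the snippet below sketches how you might launch work from inside an interactive session. The module name and the `my_app` executable are placeholders for your own environment and program, not part of the cluster documentation:

```bash
# Inside the interactive session, on the allocated compute node(s):
module load openmpi        # hypothetical module name; load whichever MPI your site provides
mpirun -n 32 ./my_app      # my_app is a placeholder; your site may prefer prun (see the parallelization documentation)

# For an OpenMP workload, match the thread count to --cpus-per-task instead:
export OMP_NUM_THREADS=16
./my_app
```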
Slurm batch scripts
Often, it’s preferable to submit your job to the compute nodes non-interactively. With the batch script method, your job is queued until resources become available and then runs with no further input from you. Submitting to the batch system requires writing a batch script composed of Slurm settings and your workload commands.
Slurm specifications in a batch script
At the beginning of your batch script, you can specify Slurm batch settings by prefixing each setting with `#SBATCH`. At minimum, you need to provide:

- The partition to request resources from (default: `normal` on Ganymede). For example: `#SBATCH --partition=normal`
- The number of nodes your workload requires. Example: `#SBATCH --nodes=2`
- The total number of tasks required. Example: `#SBATCH --ntasks=32`. Alternatively, you can specify the number of tasks per node with `#SBATCH --ntasks-per-node=16`
- The maximum time required for your workload, in the format `Days-Hours:Minutes:Seconds`. For example, `#SBATCH --time=1-12:00:00` provides a maximum run time of one day and twelve hours.
For a full list of available settings, see the `sbatch` documentation.
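Beyond these required settings, `sbatch` accepts many other directives. For instance, the sketch below names the job and directs its output to a file; the option names are standard `sbatch` flags, while the values are placeholders:

```bash
#SBATCH --job-name=my_job         # placeholder job name, shown in squeue output
#SBATCH --output=slurm-%j.out     # write the job's output to a file named after the job ID (%j)
```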
Slurm batch script example
For example, the following batch script runs a Python script parallelized with `mpi4py` on two nodes with a total of thirty-two tasks and a maximum runtime of one hour:
```bash
#!/bin/bash
#SBATCH --partition=normal
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --time=01:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your.email@utdallas.edu

prun python my_script.py
```
For more information on `prun` and parallelization techniques, see the parallelization documentation.
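Once your batch script is written, you submit it to the queue with `sbatch` and can monitor it with `squeue`; the script name below is a placeholder:

```bash
sbatch my_batch_script.sh    # submit the script; Slurm prints the assigned job ID
squeue -u $USER              # list your queued and running jobs
```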
FAQ
The following are some frequently asked questions regarding the use of Slurm.
My job shows Pending (PD) even though free nodes are available
There are various reasons this can happen. The table below lists some common causes and fixes.
| Problem Description | Fix |
|---|---|
| The time limit for your job conflicts with an upcoming reservation | You can either reduce your job's requested time limit so that it finishes before the reservation begins, or wait until after the reservation window. To see existing reservations, you can run `scontrol show reservation`. |
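To see Slurm's own explanation for why a job is pending, you can also check the reason column in the `squeue` output, for example:

```bash
squeue -u $USER -t PD    # list your pending jobs; the NODELIST(REASON) column shows why each is waiting
```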