Ganymede
Ganymede, named after Jupiter’s largest satellite, is a Cyberinfrastructure Research Computing (CIRC) cluster built on the condo model. While Ganymede does have free-to-use queues available to all UT Dallas researchers, a majority of the computational power is provided by nodes purchased for exclusive group access.
For information about purchasing nodes to add to Ganymede, email circ-assist@utdallas.edu.
Ganymede node setup
Ganymede is set up to run only one job per node. When a user submits a job, they will be given exclusive access to the entire node, regardless of how many cores or how much memory is requested. The following partitions are available by default:
Queue Name | Number of nodes | Cores (CPU Architecture) | Memory | Time Limit ([d-]hh:mm:ss)
---|---|---|---|---
debug | 2 | 16 (Intel Sandy Bridge) | 32 GB | 02:00:00
normal | 110 | 16 (Intel Sandy Bridge) | 32 GB | 4-00:00:00
128s | 8 | 16 (Intel Sandy Bridge) | 128 GB | 4-00:00:00
256i | 16 | 20 (Intel Ivy Bridge) | 256 GB | 4-00:00:00
256h | 1 | 16 (Intel Haswell) | 256 GB | 4-00:00:00
The above partitions account for only a fraction of Ganymede's compute capacity. Counting all nodes cluster-wide, Ganymede has nearly 7,300 CPU cores and 36 TB of RAM.
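For example, a minimal submission script targeting one of these partitions might look like the sketch below (the job name, time limit, and program are placeholders; adjust them for your work):
#!/bin/bash
#SBATCH --job-name=example_job    # placeholder job name
#SBATCH --partition=normal        # any queue name from the table above
#SBATCH --nodes=1                 # you get the whole node regardless of the request
#SBATCH --time=01:00:00           # must fit within the partition's time limit

# Your commands go here (placeholder program)
./my_program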
Ganymede storage
Ganymede has two user-writable storage directories, accessible from the
head node (ganymede.utdallas.edu
) and the compute nodes:
Directory | Filesystem Type | Network Speed | Filesystem Size | User Quota (Soft/Hard) | Backup Frequency
---|---|---|---|---|---
/home | NFS | 1/10 gigabit/s[1] | 5 TB | 50/55 GB | Nightly
~/scratch | WekaFS | 40/56/100 gigabit/s[2] | 200 TB | 10 TB | None
/home
/home, due to its small quota, serial nature (recall that it's exported via NFS), and slower speed, should be used only for scripts, runfiles, and smaller output files. Please don't run jobs from /home, as the filesystem and network can easily be saturated, degrading performance for other users. MPI jobs read and write a lot of data, so even a single multi-node MPI job can slow the /home filesystem drastically.
Recall from the chart that /home is backed up nightly, so users' files can be restored in the event of deletion or corruption. If you need a backup restored, email circ-assist@utdallas.edu as soon as possible and a CIRC team member will assist you however they can in getting your files back.
Ceres
Ceres is the primary high-performance Parallel File System (PFS) for CIRC High Performance Computing (HPC) resources. Currently, Ceres is accessible from Ganymede and Ganymede2.
Software
Ceres is based on the Weka parallel filesystem. This filesystem is software-defined, with client and server-side processing.
On both the client and server side, container technology is utilized.
Note: On clients, a single core is dedicated to the Weka container. This means that on a 24-core system, only 23 cores are available for processing jobs.
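As an illustration only (core counts vary by node, and the exact request is up to you), a Slurm script on such a 24-core Weka client could ask for 23 tasks per node so the dedicated core stays free:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=23   # leave 1 of the 24 cores free for the Weka client container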
Hardware
Ceres is hosted on Dell hardware, namely PowerEdge R7515 NVMe servers, with ECS500 spinning-disk enclosures to reduce pressure on the flash storage.
The networking consists of 200 gigabit/s (HDR) Infiniband to export to clusters via the WekaFS protocol and 25 gigabit/s Ethernet to export to other systems via the NFS protocol.
Attached Clusters
Ganymede
Ceres exports to Ganymede as /scratch/ganymede
, with soft links provided
as ~/scratch
. On the cluster side:
Directory | Filesystem Type | Network Speed | Filesystem Size | User Quota (Soft/Hard) | Backup Frequency
---|---|---|---|---|---
/scratch/ganymede | WekaFS | 40/56/100 gigabit/s[3] | 200 TB | 10 TB[4] | None
Ganymede2
Ceres exports to Ganymede2 as /scratch/ganymede2
, with soft links provided
as ~/scratch
. On the cluster side:
Directory | Filesystem Type | Network Speed | Filesystem Size | User Quota (Soft/Hard) | Backup Frequency
---|---|---|---|---|---
/scratch/ganymede2 | WekaFS | 200 gigabit/s | 20 TB | 1 TB[5] | None
Usage
On Ganymede and Ganymede2, ~/scratch
is best utilized for high I/O and "larger than 50 GB"
datasets, TEMPORARILY. Please utilize the scratch space as "copy to, run,
clean up when done" space. The filesystem is a shared resource amongst all
Ganymede users and needs to be kept as clean as possible for performance and
usability reasons.
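A minimal sketch of that "copy to, run, clean up when done" pattern inside a job script (all paths and the program name are placeholders):
# Copy input data to scratch (placeholder paths)
cp -r ~/my_dataset ~/scratch/my_dataset

# Run the job against the scratch copy (placeholder program)
./my_program ~/scratch/my_dataset ~/scratch/results

# Copy results somewhere permanent and clean up scratch when done
cp -r ~/scratch/results ~/results
rm -rf ~/scratch/my_dataset ~/scratch/results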
Note that the network speeds are 40/56/100/200 gigabit/s rather than 1/10/25 gigabit/s. This is because the network link is Infiniband rather than Ethernet; Infiniband is much faster and has much lower latency (0.5 us as opposed to 5-10 ms) than Ethernet. This allows much faster (near-instantaneous) file access when running jobs and roughly 1.3 GB/s file read/write, which is about 10 times faster than the "standard" link to Ganymede's /home. Also, due to its parallel nature, the filesystem doesn't get saturated as easily as /home, which allows more users to run jobs at the same time. MPI jobs are no issue for ~/scratch, so it's highly advised to use that space for running jobs, whether parallel or serial.
If you need persistent data storage in addition to the temporary scratch space, please contact circ-assist@utdallas.edu with the amount of storage you need, how long you need it, and what your workloads are; a CIRC team member will work with you to determine how best to proceed.
Backups
Recall from above that ~/scratch is NOT BACKED UP. To repeat: the entire Ceres filesystem is NOT BACKED UP. Any data on Ceres should be considered volatile, so if it's important, please move it off when your job is complete. The hardware running the storage is robust, but nothing is invincible, so please be cautious with data storage.
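For example, one way to move important results off Ceres at the end of a job (the paths shown are placeholders):
# Copy results from scratch to a permanent location, e.g. /home
rsync -av ~/scratch/results/ ~/results/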
Using Ganymede
Logging in
Ganymede is accessed via SSH. Once your
account is activated, you can connect to Ganymede at ganymede.utdallas.edu
.
For example, in a typical terminal client run the command:
ssh <NetID>@ganymede.utdallas.edu
More information on setting up SSH access to CIRC machines on your computer can be found here.
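As an optional convenience (a sketch only; the Host alias is arbitrary), you can add an entry like the following to ~/.ssh/config on your own computer so that ssh ganymede works directly:
# ~/.ssh/config on your local machine
Host ganymede
    HostName ganymede.utdallas.edu
    User <NetID>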
Submitting jobs
Information on submitting jobs to Ganymede can be found here. However, there is one Slurm directive that should not be included in your submission script:
#SBATCH --account=<NetID>
Attempting to run a Slurm submission script with this directive results in:
[user@ganymede ~]$ sbatch script.sh
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
This error can usually be fixed by removing the --account
flag in your
batch script.
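If you are not sure whether your script contains that directive, a quick check (the script name is a placeholder) is:
# Show any --account lines in your batch script
grep -n -e '--account' script.sh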
Users and accounts are separate concepts in Slurm. Your NetID is a user ID, not an account.
TACC Launcher
Multiple instances of serial programs can run on Ganymede via
TACC Launcher. To use it, load the launcher module in your
submission script:
# Loads the Launcher Module
module load launcher
The TACC webpage has more information on running Launcher jobs.
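A sketch of a Launcher setup inside a submission script, assuming a job file with one serial command per line (the file name is a placeholder; see the TACC documentation for all options):
# Loads the Launcher Module
module load launcher

# Point Launcher at the working directory and the list of commands to run
export LAUNCHER_WORKDIR=$PWD
export LAUNCHER_JOB_FILE=commands.txt

# Start Launcher; it spreads the commands across the allocated cores
$LAUNCHER_DIR/paramrun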
DO NOT use Launcher with MATLAB. MATLAB at UT Dallas is served by a central license server (not run by CIRC), and queuing up many simultaneous MATLAB jobs will crash the license server and break MATLAB for everyone at UT Dallas.
Using containers
Many "bundled" codes are distributed via Docker containers. Docker is not allowed on CIRC systems due to security issues. However, Docker containers can be used by running them with Apptainer/Singularity. For example, you can use the TensorFlow Docker container from DockerHub with the following commands:
# Loads the Singularity Module
module load singularity
# Pull the TensorFlow Docker container and transform
# it into a Singularity sandbox
singularity build --sandbox tensorflow_sandbox/ docker://tensorflow/tensorflow
# Run your Python script with the tensorflow container
singularity run -u --nv tensorflow_sandbox python <your_python_script.py>
Available software
You can view all available modules on Ganymede by running the command module spider (a brief illustration follows below). If you need new software installed, or a different version than the one provided, please contact circ-assist@utdallas.edu.
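For illustration (the package name shown is a placeholder and may not be installed on Ganymede):
# List every available module
module spider

# Search for a specific package
module spider gcc

# Load a module once you know its name (optionally with a version)
module load gcc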