Ganymede2

Ganymede2, the successor cluster to Ganymede, is (like its predecessor) named after Jupiter's largest satellite and built on the condo model. Presently, Ganymede2 is in a mid-to-late "beta" stage and has only a handful of dedicated compute nodes available to all UT Dallas researchers; the large majority of the hardware that makes up Ganymede2 is owned by individual researchers.

For information about purchasing nodes to add to Ganymede2, email circ-assist@utdallas.edu.

Despite the fact that Ganymede2 is primarily owned by individual researchers, the system has what are called "preempt" queues (cpu-preempt and gpu-preempt), which accept job submissions from all Ganymede2 users. Jobs in these queues are heavily de-prioritized relative to the queue owners' own jobs, so any workload submitted to them should be treated as volatile and should make heavy use of checkpointing.

When a preempt job is preempted, that job is killed immediately and forcefully. If data isn’t being constantly saved to an output file, DATA LOSS WILL OCCUR.
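For example, a job intended for the cpu-preempt queue can be wrapped in a batch script that restarts itself from a checkpoint. The sketch below assumes Ganymede2 uses the Slurm scheduler (see the "Submitting jobs" section below); my_solver, its flags, and the checkpoint file name are placeholders for your own application's checkpoint/restart mechanism:

#!/bin/bash
#SBATCH --job-name=preempt-ckpt
#SBATCH --partition=cpu-preempt   # preempt queue: this job may be killed at any time
#SBATCH --time=7-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --requeue                 # ask the scheduler to put the job back in the queue if it is preempted

# Placeholder application: resume from the newest checkpoint if one exists,
# otherwise start fresh. Your program must write its own checkpoints regularly.
if [ -f checkpoint.dat ]; then
    ./my_solver --restart checkpoint.dat
else
    ./my_solver --input input.dat
fi

When the requeued job starts again, it picks up from the last checkpoint instead of losing all of its progress.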

Ganymede2 node setup

Ganymede2, unlike its predecessor, allows multiple jobs per node. Nodes can be in a "mixed" state, which indicates that the node is currently processing multiple jobs at once. GPU nodes with multiple GPUs can have individual GPUs assigned to different jobs, or in some cases each GPU can run multiple jobs at the same time.
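As an illustration, the following batch script sketch requests only a slice of a shared GPU node rather than the whole machine. It assumes the Slurm scheduler; the core, memory, and GPU counts are illustrative, not recommendations:

#!/bin/bash
#SBATCH --partition=gpu-preempt   # shared GPU queue (see the partition table below)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8         # a slice of the node's cores, not all of them
#SBATCH --mem=32G                 # a slice of the node's memory
#SBATCH --gres=gpu:1              # a single GPU; the node's other GPUs stay schedulable

nvidia-smi                        # show which GPU this job was assigned

Nodes that appear with the state "mix" in sinfo output are running more than one such job at once.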

The following partitions are available to all users:

Queue Name  | Number of Nodes | Cores/Threads (CPU Architecture) | Memory  | Time Limit ([d-]hh:mm:ss) | GPUs?              | Use Case
----------- | --------------- | -------------------------------- | ------- | ------------------------- | ------------------ | ---------------------------------------
dev         | 2               | 64/128 (Ice Lake)                | 256 GB  | 2:00:00                   | No                 | Code debugging, job submission testing
normal      | 4               | 64/128 (Ice Lake)                | 256 GB  | 2-00:00:00                | No                 | Normal code runs, CPU only
cpu-preempt | 8               | Various                          | Various | 7-00:00:00                | No                 | Volatile CPU job submission
gpu-preempt | 6               | Various                          | Various | 7-00:00:00                | Yes, various types | Volatile GPU job submission
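For example, to try out a job script interactively on the dev partition before committing to longer runs (a sketch assuming the Slurm scheduler; adjust the core count and time to your needs, within the 2-hour limit):

# Request an interactive shell on a dev node for 30 minutes with 4 cores
srun --partition=dev --time=00:30:00 --ntasks=1 --cpus-per-task=4 --pty bash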

Ganymede2 storage

Ganymede2 has multiple user-writable storage directories, accessible from the login node (ganymede2.circ.utdallas.edu) and the compute nodes:

Directory      | Filesystem Type | Network Speed | Filesystem Size | User Quota (Soft/Hard)  | Backup Frequency
-------------- | --------------- | ------------- | --------------- | ----------------------- | ----------------
/home          | NFS             | 25 gigabit/s  | Up to 50 TB     | 50/55 GB{fn-home-quota} | Nightly
/mfs/io/groups | MooseFS         | 10 gigabit/s  | Varies          | Varies                  | Nightly
/scratch       | WekaFS          | 200 gigabit/s | 20 TB[1]        | None                    | None
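Once logged in, you can check the mounted size and current usage of these filesystems with standard tools, for example:

# Show size, used space, and free space for the user-writable filesystems
df -h /home /mfs/io/groups /scratch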

/home

/home on Ganymede2 is hosted on Io, the CIRC MooseFS storage array. The host storage is all NVMe-based, which allows user environments to be sourced much faster than on a spinning-disk system. However, as on the original Ganymede, /home is intended for scripts, runfiles, and smaller output files. Please don't run jobs from /home, as the network can easily be saturated, degrading the experience for other users. MPI jobs read and write a lot of data, so even a single multi-node MPI job can slow the /home filesystem drastically despite the fast storage backend.

As stated in the above chart, /home on Ganymede2 is backed up nightly. If you need files restored, please email circ-assist@utdallas.edu.
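To see how much of the 50/55 GB /home quota you are currently using, a simple (if slow on large directory trees) check is:

# Summarize the total size of everything under your home directory
du -sh "$HOME"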

Ceres

Ceres is the primary high-performance Parallel File System (PFS) for CIRC High Performance Computing (HPC) resources. Currently, Ceres is accessible from Ganymede and Ganymede2.

Software

Ceres is based on the Weka parallel filesystem. This filesystem is software-defined, with client and server-side processing.

On both the client and server side, container technology is utilized.

Note: On clients, a single core is dedicated to the Weka container. This means that on a 24-core system, only 23 cores are available for processing jobs.
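This reserved core is worth keeping in mind when sizing whole-node jobs on Ceres-attached nodes. As a sketch, assuming the Slurm scheduler, a hypothetical 24-core client node, and that the scheduler does not already hide the reserved core from you:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=23   # 24 cores on the node minus the one dedicated to the Weka container

./my_program                 # placeholder application name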

Hardware

Ceres is hosted on Dell servers, namely PowerEdge R7515 NVMe servers and ECS500 spinning disk enclosures to reduce pressure on the flash storage.

The networking consists of 200 gigabit/s (HDR) Infiniband to export to clusters via the WekaFS protocol and 25 gigabit/s Ethernet to export to other systems via the NFS protocol.

Attached Clusters

Ganymede

Ceres exports to Ganymede as /scratch/ganymede, with soft links provided as ~/scratch. On the cluster side:

Directory                         | Filesystem Type | Network Speed          | Filesystem Size | User Quota (Soft/Hard) | Backup Frequency
--------------------------------- | --------------- | ---------------------- | --------------- | ---------------------- | ----------------
~/scratch (via /scratch/ganymede) | WekaFS          | 40/56/100 gigabit/s[2] | 200 TB          | 10 TB[3]               | None

Please note in the preceding chart that ~/scratch is NOT BACKED UP IN ANY FORM OR FASHION. Important data should NOT be stored on the Ganymede scratch filesystem; keep it in your home directory or, if applicable, your MooseFS /work directory. While we generally ask users to voluntarily clean up ~/scratch, we reserve the right to purge scratch at any time. If you need assistance, please email circ-assist@utdallas.edu.

Ganymede2

Ceres exports to Ganymede2 as /scratch/ganymede2, with soft links provided as ~/scratch. On the cluster side:

Directory                          | Filesystem Type | Network Speed | Filesystem Size | User Quota (Soft/Hard) | Backup Frequency
---------------------------------- | --------------- | ------------- | --------------- | ---------------------- | ----------------
~/scratch (via /scratch/ganymede2) | WekaFS          | 200 gigabit/s | 20 TB           | 1 TB[4]                | None

Usage

On Ganymede and Ganymede2, ~/scratch is best used temporarily for high-I/O and larger-than-50 GB datasets. Please treat the scratch space as "copy to, run, clean up when done" space. The filesystem is a resource shared among all Ganymede users and needs to be kept as clean as possible for performance and usability.

Note that these network speeds are 40/56/100/200 gigabit/s rather than 1/10/25 gigabit/s. This is because the link is Infiniband rather than Ethernet; Infiniband is much faster and has much lower latency (0.5 µs as opposed to 5-10 ms) than Ethernet. This allows near-instantaneous file access when running jobs and roughly 1.3 GB/s file read/write, which is about 10 times faster than the "standard" link to Ganymede's /home. Also, due to its parallel nature, the filesystem doesn't saturate as easily as /home, which allows more users to run jobs at the same time. MPI jobs are no issue for ~/scratch, so it's highly advised to use that space for running jobs, whether parallel or serial.
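A typical "copy to, run, clean up when done" workflow inside a batch script might look like the sketch below (Slurm assumed; the dataset path and program name are placeholders):

#!/bin/bash
#SBATCH --partition=normal
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1

WORKDIR=~/scratch/$SLURM_JOB_ID           # per-job directory on the fast scratch filesystem
mkdir -p "$WORKDIR"

cp ~/datasets/input.dat "$WORKDIR"/       # copy input data onto scratch (placeholder path)
cd "$WORKDIR"

./my_program input.dat > results.out      # run against the fast WekaFS scratch space

cp results.out ~/                         # copy results back to backed-up /home
cd && rm -rf "$WORKDIR"                   # clean up scratch when done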

If you need persistent data storage in addition to the temporary scratch space, please contact circ-assist@utdallas.edu with the amount of storage you need, how long you need it, and what your workloads are; a CIRC team member will work with you to determine how best to proceed.

Backups

Recall from above: ~/scratch is NOT BACKED UP. Repeat, the entire Ceres filesystem is NOT BACKED UP. Any data on Ceres should be considered volatile, so if it's important, please move it off when your job is complete. The hardware running the storage is robust, but nothing is invincible, so please be cautious with data storage.

Cleanup

As Ceres is a resource shared by many users, we ask that everyone clean up their data when their jobs have finished, when they no longer need it, or once it has been copied to an external filesystem.

Buy-ins

We are not currently offering buy-ins to Ceres, to prevent labs from having a significant investment stake in this highly volatile filesystem.

Using Ganymede2

Logging in

Ganymede2 is accessed via SSH. Once your account is activated, you can connect to Ganymede2 at ganymede2.circ.utdallas.edu. For example, in a typical terminal client, run the command:

ssh <NetID>@ganymede2.circ.utdallas.edu

More information on setting up SSH access to CIRC machines on your computer can be found here.
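Optionally, you can add a host entry to your SSH client configuration (typically ~/.ssh/config) so that a short alias works. The alias name below is just an example, and <NetID> should be replaced with your own NetID:

Host ganymede2
    HostName ganymede2.circ.utdallas.edu
    User <NetID>

With that entry in place, ssh ganymede2 connects directly.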

Submitting jobs

Information on submitting jobs to Ganymede2 can be found here.


1. The WekaFS was purchased to replace Petastore on Ganymede. Its addition to Ganymede2 is limited until the filesystem can be expanded.
2. The `smallmem` queue utilizes a 40 gigabit/s link whereas the `normal` and `debug` queues utilize a 56 gigabit/s link. Some privately accessible nodes utilize a 100 gigabit/s link. The storage appliance itself exports the filesystem at 200 gigabit/s.
3. Due to the high capital investment of Ceres and the relatively small amount of space purchased, all Ganymede users have a 10 TB quota on the filesystem to prevent any one user from utilizing more than 5% of the available space. If you require a larger quota, please email circ-assist@utdallas.edu.
4. Due to the high capital investment of Ceres and the relatively small amount of space purchased, all Ganymede2 users have a 1 TB quota on the filesystem to prevent any one user from utilizing more than 5% of the available space. If you require a larger quota, please email circ-assist@utdallas.edu.