Clusters

CCR maintains two clusters; ub-hpc and faculty. Within each cluster exist partitions (queues) that users can request to have their jobs run on. Not all partitions are available to all users. Below are descriptions of the clusters and the partitions.

Compute nodes on all CCR clusters are subject to monthly downtimes. Downtime information is listed on the Downtime Schedule page

Examples on how to run on these clusters and partitions can be found on the HPC jobs page.

UB-HPC Compute Cluster

The ub-hpc cluster contains the following partitions:

  • arm64: Compute nodes with ARM64 processors & specialized GPUs
  • class: Dedicated compute nodes for faculty teaching classes
  • debug: Dedicated compute nodes to help users debug their workflows
  • general-compute: Large pool of compute nodes for academic users
  • industry: Compute nodes industry partners pay to use
  • industry-dgx: NVIDIA DGX GPU compute nodes for industry partners (use qos industry to access)
  • industry-hbm: High-bandwidth memory nodes for industry partners (use qos industry to access)
  • scavenger: Preemptible jobs on all ub-hpc nodes
  • viz: Dedicated compute nodes to allow short jobs for visualization purposes

Access to clusters and partitions is managed through allocations in the CCR allocations portal, ColdFront. For more details on allocation requests and using ColdFront, see the Coldfront page.

UB-HPC Detailed Hardware Specifications

This is a listing of available compute node types in the UB-HPC cluster. Please keep in mind they are not all available to all users and are restricted using QOS values. QOS access is given via allocations in ColdFront. You can view what qos access you have using the slimits command. Also note, the quantity of nodes in the following list is approximate as nodes fail and are removed from service often.

Type of Node Qty Nodes Qty CPUs Processor GPU RAM Network SLURM TAGS Local scratch QOS
"Cascade Lake" Standard Node 96 40 Intel Xeon Gold 6230 - 187GB Infiniband CPU-Gold-6230, CASCADE-LAKE-IB 880GB debug, general-compute, scavenger
"Cascade Lake" Large Memory Node 24 40 Intel Xeon Gold 6230 - 754GB Infiniband CPU-Gold-6230, CASCADE-LAKE-IB, LM 3.5TB general-compute, scavenger
"Cascade Lake" GPU Node 8 40 Intel Xeon Gold 6230 2x V100 32GB (GPU memory) 754GB Infiniband CPU-Gold-6230, CASCADE-LAKE-IB, V100 880GB general-compute, scavenger
"Cascade Lake" GPU Node 1 48 Intel Gold 6240R 12x A16 15GB (GPU memory) 512GB Ethernet Gold-6240R, A16 880GB general-compute, scavenger
"Ice Lake" Standard Node1 67 56 Intel Xeon Gold 6330 - 512GB Infiniband CPU-Gold-6330, ICE-LAKE-IB 880GB industry, general-compute, scavenger
"Ice Lake" Large Memory Node1 20 56 Intel Xeon Gold 6330 - 1TB Infiniband CPU-Gold-6230, CASCADE-LAKE-IB, LM 7TB industry, general-compute, scavenger
"Ice Lake" GPU Node1 20 56 Intel Xeon Gold 6330 2x A100 40GB (GPU memory) 512GB Infiniband CPU-Gold-6230, CASCADE-LAKE-IB, A100 880GB industry, general-compute, scavenger
"Sapphire Rapids" Standard Node1 81 64 Intel Gold 6448Y - 512GB Infiniband CPU-Gold-6448Y, SAPPHIRE-RAPIDS-IB 880GB class, debug, industry, general-compute, scavenger
"Sapphire Rapids" Large Memory Node1 4 64 Intel Gold 6448Y - 2TB Infiniband CPU-Gold-6448Y, SAPPHIRE-RAPIDS-IB, LM 7TB industry, general-compute, scavenger
"Sapphire Rapids" GPU Node1 8 64 Intel Gold 6448Y H100 80GB (GPU memory) 512GB Infiniband CPU-Gold-6448Y, SAPPHIRE-RAPIDS-IB, H100 7TB debug, industry, general-compute, scavenger, viz
"Emerald Rapids" Standard Node 74 64 Intel Platinum 8562Y+ - 512GB HDR Infiniband CPU-Platinum-8562Y, EMERALD-RAPIDS-IB 880GB class, industry, scavenger
"Emerald Rapids" Large Memory Node 6 64 Intel Platinum 8562Y+ - 2TB HDR Infiniband CPU-Platinum-8562Y, EMERALD-RAPIDS-IB, LM 7TB industry, scavenger
"Emerald Rapids" High Bandwidth Memory Node2 8 64 Intel Max 9462 - 125GB HDR Infiniband CPU-Max-9462, HBM 833GB industry-hbm
"Emerald Rapids" GPU Node 10 64 Intel Platinum 8562Y+ 2x NVIDIA L40S 48GB (GPU memory) 1TB HDR Infiniband CPU-Platinum-8562Y, L40S, EMERALD-RAPIDS-IB 7TB class, industry, scavenger
"Emerald Rapids" NVIDIA HGX GPU Node3 2 96 Intel Platinum 8468 8x H100 85GB (GPU memory) 512GB HDR Infiniband CPU-Platinum-8468, H100, EMERALD-RAPIDS-IB 14TB industry-dgx, eai-test, scavenger
"Grace Hopper" GPU Node4 2 72 ARM Neoverse-V2 GH 200 490GB Ethernet CPU-Neoverse-V2 1.8TB arm64
Viz Node5 1 48 Intel Gold 6240R 12x A16 15GB (GPU memory) 512GB Ethernet Gold-6240R, A16 880GB viz
Viz Node5 2 64 Intel Gold 6448Y 2x A40 48GB (GPU memory) 512GB Ethernet Gold-6448Y, A40 7TB class, debug, viz

[1]: These compute nodes are borrowed from the industry partition when usage is lower. At times of higher industry usage, there will be fewer compute nodes available for the general-compute partition.

[2]: The high bandwith memory nodes are available in the industry-hbm partition, available to all users with access to the industry qos. These are not available in the scavenger partition to prevent issues with memory allocations in job.

[3]: The HGX GPU nodes are available in the industry-dgx partition, available to all users with access to the industry qos. These are also available in the scavenger and eai-test partitions.

[4]: The Grace Hopper GPU nodes are available in the arm64 partition and currently require a separate allocation in ColdFront. Academic PIs may request an allocation following our standard instructions.

[5]: The visualization nodes are only available to access through OnDemand. You may request the viz partition and QOS in the OnDemand app forms. The viz partition has a maximum run time of 24 hours and is limited to 1 job per user (queued or running).

UB-HPC Cluster Status

Use the command sqstat for detailed information on cluster status. To find more detailed information on node availability, use the snodes command.

Faculty Compute Cluster

This cluster contains over 50 faculty owned or project specific partitions. Access to these partitions is determined by the owner and managed via allocations in ColdFront. All idle nodes in the faculty cluster are accessible to UB-HPC users in the scavenger partition. Scavenger jobs will be preemptively canceled when jobs are submitted by members of the group that owns the node. To view hardware specifications for nodes in the faculty cluster, use the command: snodes all faculty/scavenger

Caution: Maintenance Downtimes

Jobs on the faculty cluster are allowed to run up until the downtime starts. Please ensure your jobs checkpoint and can restart where they left off OR request only enough time to run your job prior to the 7am cutoff on maintenance days. See the Downtime Schedule for more information.

Node types

CCR has several node types available in our clusters. Each node type is meant for certain tasks. These node types are relatively common for other HPC centers. We will discuss each node type and its intended use below.

Login Nodes

  • Use for: editing scripts, moving files, small data transfers, submitting jobs
  • Do not use for: Heavy computations, building software or long running processes
  • 15 Minute time limit on running processes
  • Many users are typically logged into these at the same time
  • Connections are load balanced across many servers

Compile Nodes

  • Use for: Building Software
  • Do not use for: Heavy computations
  • Access these nodes by typing ssh compile from a login node

Compute Nodes

  • Where jobs are executed after being submitted to the slurm scheduler
  • Intended for heavy computation
  • When running an interactive job will be performing tasks directly on the compute nodes
  • Only users with active jobs can log in to allocated nodes
  • See detailed specs in the hardware specifications list

Debug Nodes

  • Compute nodes reserved for testing and debugging jobs
  • Accessed by submitting a debug job in slurm
  • 1 hour walltime limit on debug jobs
  • See detailed specs in the hardware specifications list

Visualization Nodes

  • Compute nodes dedicated to allow access for those wanting to visualize results
  • Accessed through Open OnDemand
  • 24 hour walltime limit on visualization jobs
  • Select viz in the partition and qos drop down menus of the OnDemand apps
  • See detailed specs in the hardware specifications list

Data Transfer Nodes