Main Page: Difference between revisions

Revision as of 21:52, 13 November 2025

The City University of New York (CUNY) High Performance Computing Center (HPCC) is located on the campus of the College of Staten Island, 2800 Victory Boulevard, Staten Island, New York 10314. HPCC goals are to:

Support the scientific computing needs of CUNY faculty, their collaborators at other universities, their public and private sector partners, and CUNY students and research staff;
Create opportunities for the CUNY research community to develop new partnerships with the government and private sectors; and
Leverage the HPC Center capabilities to acquire additional research resources for its faculty and graduate students in existing and major new programs.

Organization of systems and data storage (architecture)

All user and project data are kept on the Data Storage and Management System (DSMS), which is mounted only on the login node(s) of all servers. Consequently, no jobs can be started directly from DSMS storage. Instead, all jobs must be submitted from a separate (fast but small) /scratch file system, which is mounted on all computational nodes and on all login nodes. As the name suggests, the /scratch file system is not a home directory for accounts and cannot be used for long-term data preservation. Users must use the "staging" procedure described below to ensure preservation of their data, codes, and parameter files. The figure below is a schematic of the environment.

Upon registering with HPCC, every user gets two directories:

• /scratch/<userid> – a temporary workspace on the HPC systems

• /global/u/<userid> – space for the “home directory”, i.e., storage space on the DSMS for programs, scripts, and data

• In some instances a user will also have use of disk space on the DSMS in /cunyZone/home/<projectid> (iRODS).

The /global/u/<userid> directory has a quota (see below for details), while /scratch/<userid> does not. However, the /scratch space is cleaned up following the rules described below. There is no guarantee of any kind that files in /scratch will be preserved through hardware crashes or clean-ups. Access to all HPCC resources is provided by a bastion host called Chizen. The Data Transfer Node, called Cea, allows file transfer between remote sites and /global/u/<userid> or /scratch/<userid> directly.
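The staging procedure itself is not spelled out in this section. As a minimal sketch, assuming staging simply means copying inputs from the DSMS home directory to /scratch before a job and copying results back afterwards, a helper run on a login node (the only place where both file systems are mounted) could look like this. The function name and the example directory names are hypothetical.

```python
import shutil
from pathlib import Path


def stage(src_dir: Path, dst_dir: Path) -> list[Path]:
    """Copy every file under src_dir into dst_dir, keeping the sub-directory layout.

    Returns the destination paths that were written.
    """
    copied = []
    for src in sorted(src_dir.rglob("*")):
        if src.is_file():
            dst = dst_dir / src.relative_to(src_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
            copied.append(dst)
    return copied


# Hypothetical usage on a login node (replace <userid> and the project name):
#   stage(Path("/global/u/<userid>/myproject"), Path("/scratch/<userid>/myproject"))   # stage in
#   stage(Path("/scratch/<userid>/myproject/results"),
#         Path("/global/u/<userid>/myproject/results"))                                # stage out
```

Because files in /scratch are not guaranteed to survive, staging results out promptly after a job finishes is the safer habit.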

HPC systems

The HPC Center operates a variety of architectures in order to support complex and demanding workflows. All computational resources of the different types are united in a single hybrid cluster called Arrow. Arrow deploys symmetric multiprocessor (SMP) nodes with and without GPUs, a distributed shared memory (NUMA) node, fat (large-memory) nodes, and advanced SMP nodes with multiple GPUs. The number of GPUs per node varies between 2 and 8, as do the GPU interface and GPU family. The basic GPU nodes hold two Tesla K20m GPUs (attached through a PCIe interface), while the most advanced ones support eight Ampere A100 GPUs connected via an SXM interface.

Overview of Computational architectures:

SMP servers have several processors (working under a single operating system) which "share everything". All CPU cores access a common memory block via a shared bus or data path. SMP servers support all combinations of memory and CPU (up to the limits of the particular computer). SMP servers are commonly used to run sequential or thread-parallel (e.g., OpenMP) jobs, and they may or may not have GPUs.

A cluster is defined as a single system comprising a set of servers interconnected by a high-performance network. Specific software coordinates programs on and/or across those servers in order to perform computationally intensive tasks. The most common cluster type consists of several identical SMP servers connected via a fast interconnect. Each SMP member of the cluster is called a node. All nodes run independent copies of the same operating system (OS). Some or all of the nodes may incorporate GPUs.

Hybrid clusters combine nodes of different architectures. For instance, the main CUNY-HPCC machine is a hybrid cluster called Arrow. Sixty-two (62) of its nodes are identical GPU-enabled SMP servers, each with 2 x K20m GPUs; 3 are SMP servers with extended memory (fat nodes); one is a distributed shared memory node (NUMA, see below); and 2 are fat SMP servers specially designed to support 8 NVIDIA GPUs per node, connected via an SXM interface. In addition, HPCC operates the cluster Herbert, dedicated only to education.

A distributed shared memory computer is a tightly coupled server in which the memory is physically distributed but logically unified as a single block. The system resembles an SMP, but the possible number of CPU cores and amount of memory are far beyond the limits of an SMP. Because the memory is distributed, access times across the address space are non-uniform; thus, this architecture is called Non-Uniform Memory Access (NUMA). Like SMP systems, NUMA systems are typically used for applications such as data mining and decision support systems, in which processing can be parceled out to a number of processors that collectively work on common data. HPCC operates the NUMA node on Arrow named Appel. This node does not have GPUs.

Infrastructure systems:

o Master Head Node (MHN/Arrow) is a redundant login node from which all jobs on all servers start. This server is not directly accessible from outside the CSI campus. Note that the main server and its login nodes share the same name, Arrow. Thus users can access the Arrow login nodes using the name Arrow or MHN.

o Chizen is a redundant gateway server which provides access to the protected HPCC domain.

o Cea is a file transfer node allowing transfer of files between users’ computers and /scratch space or /global/u/<userid>. Cea is accessible directly (not only via Chizen), but allows only a limited set of shell commands.

Table 1 below provides a quick summary of the attributes of each of the sub-clusters of the main HPC Center cluster, called Arrow.


Master Head Node	Sub System	Tier	Type	Type of Jobs	Nodes	CPU Cores	GPUs	Mem/node	Mem/core	Chip Type	GPU Type and Interface
Arrow	Penzias	Advanced	Hybrid Cluster	Sequential & Parallel jobs w/wo GPU	66	16	2	64 GB	4 GB	SB, EP 2.20 GHz	K20m GPU, PCIe v2
				Sequential & Parallel jobs	1	24	-	1500 GB	62 GB	HL, 2.30 GHz	-
						36	-	768 GB	21 GB		-
						24	-	768 GB	32 GB		-
	Appel		NUMA	Massive Parallel, sequential, OpenMP	1	384	-	11 TB	28 GB	IB, 3 GHz	-
	Cryo		SMP	Sequential and Parallel jobs, with GPU	1	40	8	1500 GB	37 GB	SL, 2.40 GHz	V100 (32GB) GPU, SXM
	Blue Moon		Hybrid Cluster	Sequential and Parallel jobs w/wo GPU	24	32	-	192 GB	6 GB	SL, 2.10 GHz	-
	Blue Moon		Hybrid Cluster	Sequential and Parallel jobs w/wo GPU	2	32	2	192 GB	6 GB	SL, 2.10 GHz	V100(16GB) GPU, PCIe
	Karle		SMP	Visualization, MATLAB/Mathematica	1	36*	-	768 GB	21 GB	HL, 2.30 GHz	-
	Chizen		Gateway	No jobs allowed	-
	CFD	Condo	SMP	Parallel, Seq, OpenMP	1	48	2	768 GB		EM, 4.8 GHz	A40, PCIe, v4
	CFD	Condo	SMP		1	48	-	512 GB		ER, 4.3 GHz	-
	PHYS	Condo	SMP		1	48	2	640 GB		ER, 4 GHz	L40, PCIe, v4
	PHYS	Condo	SMP		1	48	-	512 GB		ER, 4.3 GHz	-
	CHEM	Condo	SMP		1	48	2	256 GB		EM, 2.8 GHz	A30, PCIe, v4
	CHEM	Condo	SMP		1	128	8	512 GB		ER, 2.0 GHz	A100/40, SXM
	ASRC	Condo	SMP		1	48	2	256 GB		ER, 2.8 GHz	A30, PCIe, v4

Note: SB = Intel(R) Sandy Bridge, HL = Intel(R) Haswell, IB = Intel(R) Ivy Bridge, SL = Intel(R) Xeon(R) Gold, ER = AMD(R) EPYC Rome, EM = AMD(R) EPYC Milan, EG = AMD(R) EPYC Genoa
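The Mem/core column in Table 1 appears to be the node memory divided by the core count and rounded down, at least for the rows checked here. This is an observation from the listed figures, not a documented rule. A short sketch of that arithmetic, which can help when deciding how much memory to request per core:

```python
def mem_per_core_gb(mem_per_node_gb: int, cores_per_node: int) -> int:
    """Memory available per core, rounded down to whole GB."""
    return mem_per_node_gb // cores_per_node


# (Mem/node in GB, CPU cores) copied from Table 1 for a few sub-systems.
rows = {
    "Penzias": (64, 16),
    "Cryo": (1500, 40),
    "Blue Moon": (192, 32),
    "Karle": (768, 36),
}

for name, (mem_gb, cores) in rows.items():
    print(f"{name}: {mem_per_core_gb(mem_gb, cores)} GB per core")
```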

Prices and modes of operation

HPCC operates on a cost-recovery model, recovering only the operational costs of the center. The costs are calculated to break even, following the methodology used by CUNY-RF. The costs are reviewed and updated twice a year. The charging scheme is based on units. The unit definitions are given in the table below:


Type of resource	Unit	For V100, A30, A40 or L40	Example on A100
CPU unit	1 cpu core	--	--
GPU unit	4 cpu cores + 1 GPU thread	4 cpu cores + 1 GPU	4 cpu cores and 1/7 A100
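To make the unit definitions concrete, here is a minimal sketch of how a request might be expressed in units. It assumes that each GPU unit bundles 4 cpu cores (as in the table) and that any cores beyond those bundled ones count as separate CPU units. That second rule is an assumption for illustration, not a stated HPCC policy, and no prices are included because none are given here.

```python
CORES_PER_GPU_UNIT = 4  # from the table: one GPU unit = 4 cpu cores + 1 GPU (or GPU share)


def count_units(cpu_cores: int, gpu_units: int = 0) -> dict:
    """Split a request into GPU units and CPU units.

    Assumption (not from the source): cores not bundled into a GPU unit
    are each counted as one CPU unit.
    """
    bundled_cores = gpu_units * CORES_PER_GPU_UNIT
    cpu_units = max(cpu_cores - bundled_cores, 0)
    return {"gpu_units": gpu_units, "cpu_units": cpu_units}


def unit_hours(cpu_cores: int, gpu_units: int, hours: float) -> dict:
    """Unit-hours consumed by a job that holds its resources for `hours`."""
    units = count_units(cpu_cores, gpu_units)
    return {
        "gpu_unit_hours": units["gpu_units"] * hours,
        "cpu_unit_hours": units["cpu_units"] * hours,
    }


# A 16-core job with 2 GPU units running for 10 hours.
print(unit_hours(cpu_cores=16, gpu_units=2, hours=10))
```

Under these assumptions, the 16-core, 2-GPU job uses 8 of its cores inside the two GPU units and the remaining 8 as CPU units.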

Users can choose between the following options:

On-demand computing
Rent a node/unit for the duration of the project

Under "On-Demand" computing, users are charged per unit hour. The table below describes the meaning of the unit:

Partitions and jobs

The only way to submit jobs to HPCC servers is through the SLURM batch system. Every job, regardless of its type (interactive, batch, serial, parallel, etc.), must be submitted via SLURM, which allocates the requested resources on the proper server and starts the job according to a predefined, strict fair-share policy. Computational resources (CPU cores, memory, GPUs) are organized in partitions. Users are granted permission to use one or another partition and the corresponding QOS key. The table below describes the partitions and their limitations.


Partition	Max cores/job	Max jobs/user	Total cores/group	Time limits	Tier
partnsf	128	50	256	240 Hours	Advanced
partchem	128	50	256	240 Hours	Condo
partcfd	96	50	96	240 Hours	Condo
partsym	96	50	96	240 Hours	Condo
partasrc	48	16	16	240 Hours	Condo
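The limits in the table can be checked before a job is submitted. The sketch below copies the table values verbatim into a small lookup and flags requests that exceed the per-job core limit or the time limit. The function and class names are hypothetical, and SLURM itself remains the authority on what is accepted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionLimits:
    max_cores_per_job: int
    max_jobs_per_user: int
    total_cores_per_group: int
    time_limit_hours: int


# Values copied from the partition table above.
PARTITIONS = {
    "partnsf": PartitionLimits(128, 50, 256, 240),
    "partchem": PartitionLimits(128, 50, 256, 240),
    "partcfd": PartitionLimits(96, 50, 96, 240),
    "partsym": PartitionLimits(96, 50, 96, 240),
    "partasrc": PartitionLimits(48, 16, 16, 240),
}


def check_request(partition: str, cores: int, hours: float) -> list[str]:
    """Return a list of problems; an empty list means the request fits the table."""
    limits = PARTITIONS.get(partition)
    if limits is None:
        return [f"unknown partition: {partition}"]
    problems = []
    if cores > limits.max_cores_per_job:
        problems.append(f"{cores} cores exceeds the limit of {limits.max_cores_per_job} per job")
    if hours > limits.time_limit_hours:
        problems.append(f"{hours} hours exceeds the limit of {limits.time_limit_hours} hours")
    return problems


print(check_request("partcfd", cores=128, hours=300))
```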

o production is the main partition, with assigned resources across all servers (except Math and Cryo). It is a routing partition, so jobs are placed in the proper sub-partition automatically. Users may submit sequential, thread-parallel, or distributed-parallel jobs with or without GPUs.

o partedu is for education only. Its assigned resources are on the educational server Herbert. Partedu is accessible only to students (graduate and/or undergraduate) and their professors who are registered for a class supported by HPCC. Access to this partition is limited to the duration of the class.

o partmatlab allows running MATLAB's Distributed Parallel Server across the main cluster. Note, however, that Parallel Toolbox programs can be submitted via the production partition, but only as thread-parallel jobs.

o partdev is dedicated to development. All HPCC users have access to this partition, which has assigned resources of one computational node with 16 cores, 64 GB of memory, and 2 GPUs (K20m). This partition has a time limit of 4 hours.
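As an illustration of a request that fits partdev, the sketch below assembles the text of a SLURM batch script. The `#SBATCH` options used (`--job-name`, `--partition`, `--ntasks`, `--time`, `--gres`, `--qos`) are standard SLURM options, but whether this site expects GPUs to be requested through `--gres`, and which QOS value applies, are assumptions. The program name `./my_program` is hypothetical. The updated SLURM manual distributed by CUNY-HPCC remains the authoritative source.

```python
from typing import Optional


def build_batch_script(
    job_name: str,
    partition: str,
    ntasks: int,
    hours: int,
    command: str,
    gpus: int = 0,
    qos: Optional[str] = None,
) -> str:
    """Return the text of a SLURM batch script for the given request."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={hours:02d}:00:00",  # hours:minutes:seconds
    ]
    if gpus:
        lines.append(f"#SBATCH --gres=gpu:{gpus}")  # assumed GPU request syntax
    if qos:
        lines.append(f"#SBATCH --qos={qos}")  # QOS key granted with the partition
    lines += ["", command]
    return "\n".join(lines) + "\n"


# A development job sized to the partdev limits: 16 cores, 2 GPUs, 4 hours.
print(build_batch_script("demo", "partdev", ntasks=16, hours=4,
                         command="srun ./my_program", gpus=2))
```

The generated script would be saved under /scratch/<userid> and submitted from there, since jobs cannot be started from DSMS storage.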

Hours of Operation

In order to maximize the use of resources, HPCC applies a “rolling” maintenance scheme across all systems. When downtime is needed, HPCC will notify all users a week or more in advance (unless an emergency occurs). The fourth Tuesday morning of each month, from 8:00 AM to 12:00 PM, is normally reserved (but not always used) for scheduled maintenance. Please plan accordingly. Unplanned maintenance to remedy system-related problems may be scheduled as needed outside the above-mentioned days. Reasonable attempts will be made to inform users running on those systems when these needs arise. Users are strongly encouraged to use checkpoints in their jobs.
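Checkpointing means saving enough state periodically that an interrupted job can resume instead of starting over. The sketch below shows one common pattern under simple assumptions: the state fits in a small JSON file, and the loop body (a running sum) is a stand-in for real work. The file name and function names are hypothetical.

```python
import json
import os
from pathlib import Path


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0, "total": 0}


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the state to a temporary file, then rename it over the old checkpoint."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)  # the rename avoids leaving a half-written checkpoint


def run(path: Path, n_steps: int, checkpoint_every: int = 10) -> dict:
    """Run (or resume) the computation up to n_steps, checkpointing as it goes."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        state["total"] += step  # stand-in for the real computation
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If a job running `run(Path("ckpt.json"), 1000)` is stopped for maintenance, resubmitting the same call continues from the last saved step. Since /scratch is not preserved, checkpoint files worth keeping should also be staged back to /global/u/<userid>.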

User Support

Users are strongly encouraged to read this Wiki carefully before submitting ticket(s) for help. In particular, the sections on compiling and running parallel programs, and the section on the SLURM batch queueing system will give you the essential knowledge needed to use the CUNY HPCC systems. We have strived to maintain the most uniform user applications environment possible across the Center's systems to ease the transfer of applications and run scripts among them.

The CUNY HPC Center staff, along with outside vendors, offer regular courses and workshops to the CUNY community in parallel programming techniques, HPC computing architecture, and the essentials of using our systems. Please follow our mailings on the subject and feel free to inquire about such courses. We regularly schedule training visits and classes at the various CUNY campuses. Please let us know if such a training visit is of interest. In the past, topics have included an overview of parallel programming, GPU programming and architecture, using the evolutionary biology software at the HPC Center, the SLURM queueing system at the CUNY HPC Center, mixed GPU-MPI and OpenMP programming, etc. Staff have also presented guest lectures at formal classes throughout the CUNY campuses.

If you have problems accessing your account and cannot login to the ticketing service, please send an email to:

 hpchelp@csi.cuny.edu

Warnings and modes of operation

1. hpchelp@csi.cuny.edu is for questions and account-help communication only and does not accept tickets unless the ticketing system is not operational. For tickets, please use the ticketing system mentioned above. This ensures that the staff member with the most appropriate skill set and job-related responsibility will respond to your questions. During the business week you should expect a response within 48 hours, and quite often the same day. During the weekend you may not get any response.

2. E-mails to hpchelp@csi.cuny.edu must have a valid CUNY e-mail as the reply address. Messages originating from public mailers (Google, Hotmail, etc.) are filtered out.

3. Do not send questions to individual CUNY HPC Center staff members directly. These will be returned to the sender with a polite request to submit a ticket or email the Helpline. This applies to replies to initial questions as well.

The CUNY HPC Center staff members are focused on providing high quality support to its user community, but compared to other HPC Centers of similar size our staff is extremely lean. Please make full use of the tools that we have provided (especially the Wiki), and feel free to offer suggestions for improved service. We hope and expect your experience in using our systems will be predictably good and productive.

User Manual

The old version of the user manual provides PBS, not SLURM, batch scripts as examples. CUNY-HPCC currently uses the SLURM scheduler, so users must check and use only the updated brief SLURM manual distributed with new accounts, or ask CUNY-HPCC for a copy of it.
