Main Page: Difference between revisions

From HPCC Wiki
 
(352 intermediate revisions by the same user not shown)
Line 1: Line 1:
[[File:CUNY-HPCC-HEADER-LOGO.jpg]]
__TOC__
__TOC__


[[Image:hpcc-panorama3.png]]
[[Image:hpcc-panorama3.png]]


The City University of New York (CUNY) High Performance Computing Center (HPCC) is located on the campus of the College of Staten Island, 2800 Victory Boulevard, Staten Island, New York 10314.  The CUNY-HPCC supports computational research and computationally intensive courses offered at all CUNY colleges in fields such as Computer Science, Engineering, Bioinformatics, Chemistry, Materials Science, Genetics, Computational Biology and others.  HPCC provides educational outreach to local schools and supports undergraduates who work in the research programs of the host institution (REU program from NSF). The primary mission of HPCC is:
The City University of New York (CUNY) High Performance Computing Center (HPCC) is located on the
campus of the College of Staten Island, 2800 Victory Boulevard, Staten Island, New York 10314.  HPCC
goals are to:  


* To enable advanced research and scholarship at CUNY colleges by providing faculty, staff, and students with access to high-performance computing, adequate storage resources and visualization resources;
:*Support the scientific computing needs of CUNY faculty, their collaborators at other universities, and their public and private sector partners, and CUNY students and research staff.
* To provide CUNY faculty and their collaborators at other universities, CUNY research staff and CUNY graduate and undergraduate students with expertise in scientific computing, parallel scientific computing (HPC), software development, advanced data analytics, data driven science and simulation science, visualization, advanced database engineering, and others.
:*Create opportunities for the CUNY research community to develop new partnerships with the government and private sectors; and
* Leverage the HPC Center capabilities to acquire additional research resources for CUNY faculty, researchers and students in existing and major new programs.
:*Leverage the HPC Center capabilities to acquire additional research resources for its faculty and graduate students in existing and major new programs.
* Create opportunities for the CUNY research community to win grants from national funding institutions and to develop new partnerships with the government and private sectors.


==Organization of HPC production systems and HPC data storage==
==Organization of systems and data storage (architecture)==


CUNY-HPCC provides a variety of architectures in order to support various types of research and education. The computational systems are organized in 3 tiers: Condominium Tier, Basic Tier and Advanced Tier. The architectures for each tier are discussed below. All tiers access 2 separate file systems: 1) the Data Storage and Management System ('''DSMS''') (a global file system) and 2) the '''scratch''' file system. The DSMS is mounted only on the login node(s) of all servers and is intended to keep user data (home directories) and project data (project directories). Consequently, no jobs can be started directly from '''DSMS''' storage.  Instead, all jobs must be submitted from the separate (fast but small) '''/scratch''' file system, which is mounted on all computational nodes in all tiers and on all login nodes.  As the name suggests, the '''/scratch''' file system is not a home directory for accounts, nor can it be used for long term data preservation.  Users must use the "staging" procedure described below to ensure preservation of their data, codes and parameter files. The figure below is a schematic of the environment.
All user data and project data are kept on the Data Storage and Management System ('''DSMS'''), which is mounted only on the login node(s) of all servers. Consequently, no jobs can be started directly from '''DSMS''' storage.  Instead, all jobs must be submitted from a separate (fast but small) '''/scratch''' file system mounted on all computational nodes and on all login nodes.  As the name suggests, the '''/scratch''' file system is not a home directory for accounts, nor can it be used for long term data preservation.  Users must use the "staging" procedure described below to ensure preservation of their data, codes and parameter files. The figure below is a schematic of the environment.


Upon  registering with HPCC every user will get 2 directories:
Upon  registering with HPCC every user will get 2 directories:
Line 27: Line 26:
==HPC systems==
==HPC systems==


The HPC Center operates a variety of architectures in order to support complex and demanding workflows.  The deployed systems include: distributed memory (also referred to as “cluster”) computers, symmetric multiprocessor (also referred to as SMP) computers and shared memory (also referred to as NUMA) machines.
The HPC Center operates a variety of architectures in order to support complex and demanding workflows.  All computational resources of different types are united into a single hybrid cluster called Arrow. It deploys symmetric multiprocessor (also referred to as SMP) nodes with and without GPU, a distributed shared memory (NUMA) node, fat (large memory) nodes and advanced SMP nodes with multiple GPU. The number of GPU per node varies between 2 and 8, and the GPU interface and GPU family vary as well. Thus the basic GPU nodes hold two Tesla K20m (plugged in through a PCIe interface), while the most advanced ones support eight Ampere A100 GPU connected via the SXM interface.


''Computational Systems'':
''Overview of Computational architectures'':


'''SMP''' servers have several processors (working under a single operating system) which "share everything".  Thus all cpu-cores access a common memory block via a shared bus or data path. SMP servers support all combinations of memory and cpu (up to the limits of the particular computer). The SMP servers are commonly used to run sequential or thread parallel (e.g. OpenMP) jobs and they may or may not have GPU. Currently, HPCC operates several detached SMP servers named '''Math''', '''Cryo''' and '''Karle'''. Karle is a server which does not have GPU and is used for visualizations, visual analytics and interactive MATLAB/Mathematica jobs. '''Math''' is a condominium server without GPU as well. '''Cryo''' (CPU+GPU server) is a specialized server with eight (8) NVIDIA V100 (32G) GPU designed to support large scale multi-core multi-GPU jobs.
'''SMP''' servers have several processors (working under a single operating system) which "share everything".  Thus all cpu-cores access a common memory block via a shared bus or data path. SMP servers support all combinations of memory and cpu (up to the limits of the particular computer). The SMP servers are commonly used to run sequential or thread parallel (e.g. OpenMP) jobs and they may or may not have GPU.


'''Cluster''' is defined as a single system comprising a set of SMP servers interconnected with a high performance network. Specific software coordinates programs on and/or across those servers in order to perform computationally intensive tasks. Each SMP member of the cluster is called a '''node'''. All nodes run independent copies of the same operating system (OS). Some or all of the nodes may incorporate GPU.  The main cluster at HPCC is a hybrid (CPU+GPU) cluster called '''Penzias'''.  Sixty six (66) of the Penzias nodes have 2 x GPU K20m, and the 3 fat nodes (nodes with a large number of CPU-cores and large memory) of the cluster do not have GPU.  In addition, HPCC operates the cluster '''Herbert''', dedicated only to education.
'''Cluster''' is defined as a single system comprising a set of servers interconnected with a high performance network. Specific software coordinates programs on and/or across those servers in order to perform computationally intensive tasks. The most common cluster type consists of several identical SMP servers connected via a fast interconnect.  Each SMP member of the cluster is called a '''node'''. All nodes run independent copies of the same operating system (OS). Some or all of the nodes may incorporate GPU.


A '''Distributed shared memory''' computer is a tightly coupled server in which the memory is physically distributed, but logically unified as a single block. The system resembles SMP, but the possible number of cpu cores and amount of memory are far beyond the limitations of SMP.  Because the memory is distributed, the access times across the address space are non-uniform. Thus, this architecture is called Non Uniform Memory Access (NUMA) architecture.  Similarly to SMP, '''NUMA''' systems are typically used for applications such as data mining and decision support systems, in which processing can be parceled out to a number of processors that collectively work on common data. HPCC operates the '''NUMA''' server called '''Appel'''.  This server does not have GPU.
Hybrid clusters combine nodes of different architectures. For instance, the main CUNY-HPCC machine is a hybrid cluster called '''Arrow'''.  Sixty two (62) of its nodes are identical GPU enabled SMP servers, each with 2 x GPU K20m; 3 are SMP servers with extended memory (fat nodes); one is a distributed shared memory node (NUMA, see below); and 2 are fat SMP servers especially designed to support 8 NVIDIA GPU per node. The GPU in the latter are connected via the SXM interface. In addition, HPCC operates the cluster '''Herbert''', dedicated only to education.
 
A '''Distributed shared memory''' computer is a tightly coupled server in which the memory is physically distributed, but logically unified as a single block. The system resembles SMP, but the possible number of cpu cores and amount of memory are far beyond the limitations of SMP.  Because the memory is distributed, the access times across the address space are non-uniform. Thus, this architecture is called Non Uniform Memory Access (NUMA) architecture.  Similarly to SMP, '''NUMA''' systems are typically used for applications such as data mining and decision support systems, in which processing can be parceled out to a number of processors that collectively work on common data. HPCC operates the '''NUMA''' node at Arrow named '''Appel'''.  This node does not have GPU.


'' Infrastructure systems'':
'' Infrastructure systems'':


o Master Head Node ('''MHN''') is a redundant login node from which all jobs on all servers start. This server is not directly accessible from outside CSI campus.  
o Master Head Node ('''MHN/Arrow''') is a redundant login node from which all jobs on all servers start. This server is not directly accessible from outside the CSI campus. Note that the main server and its login nodes share the same name, Arrow. Thus users can access the Arrow login nodes using either the name Arrow or MHN.


o '''Chizen''' is a redundant gateway server which provides access to protected HPCC domain.
o '''Chizen''' is a redundant gateway server which provides access to protected HPCC domain.
Line 45: Line 46:
o '''Cea''' is a file transfer node allowing transfer of files between users’ computers and /scratch space or /global/u/<userid>. '''Cea''' is accessible directly (not only via '''Chizen'''), but allows only a limited set of shell commands.
o '''Cea''' is a file transfer node allowing transfer of files between users’ computers and /scratch space or /global/u/<userid>. '''Cea''' is accessible directly (not only via '''Chizen'''), but allows only a limited set of shell commands.


'''Table 1''' below provides a quick summary of the attributes of each of the systems available at the HPC Center.
'''Table 1''' below provides a quick summary of the attributes of each of the sub clusters of the main HPC Center cluster, called Arrow.
 
{| class="wikitable"
{| class="wikitable"
|+
|+
!Master Head Node
!Master Head Node
!System
!Sub System
!Tier
!Type
!Type
!Type of Jobs
!Type of Jobs
Line 58: Line 61:
!Mem/core
!Mem/core
!Chip Type
!Chip Type
!GPU Type
!GPU Type and Interface
|-
|-
| rowspan="10" |MHN
| rowspan="17" |'''<big>Arrow</big>'''
| rowspan="4" |Penzias
| rowspan="4" |Penzias
| rowspan="10" |Advanced
| rowspan="4" |Hybrid Cluster
| rowspan="4" |Hybrid Cluster
|Sequential & Parallel jobs w/wo GPU
|Sequential & Parallel jobs w/wo GPU
Line 70: Line 74:
|4 GB
|4 GB
|SB, EP 2.20 GHz
|SB, EP 2.20 GHz
|K20m GPU, PCIe
|K20m GPU, PCIe v2
|-
|-
| rowspan="3" |Sequential & Parallel jobs
| rowspan="3" |Sequential & Parallel jobs
Line 95: Line 99:
|Appel
|Appel
|NUMA
|NUMA
|Massive Parallel, sequential
|Massive Parallel, sequential, OpenMP
|1
|1
|384
|384
Line 113: Line 117:
|37 GB
|37 GB
|SL, 2.40 GHz
|SL, 2.40 GHz
|V100 (32GB) GPU, XSM
|V100 (32GB) GPU, SXM
|-
|-
| rowspan="2" |Blue Moon
| rowspan="2" |Blue Moon
Line 144: Line 148:
|Chizen
|Chizen
|Gateway
|Gateway
|No jobs allowed
| colspan="7" | -
|-
| rowspan="2" |CFD
| rowspan="2" |Condo
| rowspan="2" |SMP
| rowspan="7" |Parallel, Seq, OpenMP
|1
|48
|2
|768 GB
|
|EM, 4.8 GHz
|A40, PCIe, v4
|-
|1
|48
| -
|512 GB
|
|ER, 4.3 GHz
| -
|-
| rowspan="2" |PHYS
| rowspan="2" |Condo
| rowspan="2" |SMP
|1
|48
|2
|2
| colspan="6" | -
|640 GB
|
|ER, 4 GHz
|L40, PCIe, v4
|-
|1
|48
| -
|512 GB
|
|ER, 4.3 GHz
| -
| -
|-
| rowspan="2" |CHEM
| rowspan="2" |Condo
| rowspan="2" |SMP
|1
|48
|2
|256 GB
|
|EM, 2.8 GHz
|A30, PCIe, v4
|-
|1
|128
|8
|512 GB
|
|ER, 2.0 GHz
|A100/40, SXM
|-
|ASRC
|Condo
|SMP
|1
|48
|2
|256 GB
|
|ER, 2.8 GHz
|A30, PCIe, v4
|}
|}
Note: SB = Sandy Bridge, HL = Haswell, IB = Ivy Bridge, SL = Skylake
Note: SB = Intel(R) Sandy Bridge, HL = Intel(R) Haswell, IB = Intel(R) Ivy Bridge, SL = Intel(R) Xeon(R) Gold, ER = AMD(R) EPYC Rome, EM = AMD(R) EPYC Milan, EG = AMD(R) EPYC Genoa


==Partitions and jobs==
== Recovery of  operational costs ==
The only way to submit job(s) to HPCC servers is through the SLURM batch system.  Any job, regardless of its type (interactive, batch, serial, parallel etc.), must be submitted via SLURM. The latter allocates the requested resources on the proper server and starts the job(s) according to a predefined strict fair share policy. Computational resources (cpu-cores, memory, GPU) are organized in partitions. The main partition is called '''production'''. This is a routing partition which distributes the jobs to several sub-partitions depending on the job’s requirements. Thus a serial job submitted to '''production''' will land in the '''partsequential''' partition.  No PBS Pro scripts should ever be used, and all existing PBS scripts must be converted to SLURM before use. The table below shows the limitations of the partitions.
CUNY-HPCC operates on a cost recovery model, recapturing only '''<u>operational costs with no profit (for CUNY users only)</u>'''. The costs are calculated to break even, following the methodology used by CUNY-RF. The costs are reviewed and updated twice a year. The charging scheme is based on the '''<u>unit-hour</u>'''. The unit can be either a CPU unit or a GPU unit. The definitions of these are given in the table below:
{| class="wikitable"
{| class="wikitable mw-collapsible"
|+
!Type of resource
!Unit
!For V100, A30, A40 or L40
!For A100
|-
|CPU unit
|1 cpu core
| --
| --
|-
|GPU unit
|4 cpu cores + 1 GPU thread
|4 cpu cores + 1 GPU
|4 cpu cores and 1/7 A100
|}
Users can choose between following options:
# On-demand computing (for basic and advanced tiers only)
# Rent a node in basic and/or advanced tier for the duration of the project
# Rent a condo node.
 
=== Basic and Advanced Tier ===
Under the "On-Demand" computing mode, users are charged per <u>'''unit-hour'''</u> according to the table above. Users leasing a resource (e.g. a node; the minimum lease is 1 month, 30 days) are charged at the beginning of the lease period. Any extra hours needed to complete the project are charged as on-demand computing. Leasing guarantees 24/7 access to the node(s) (except during maintenance periods), no time limits for the job(s) and a preferred level of support.
{| class="wikitable mw-collapsible"
|+
!Type of service
!Time Limit
!Guaranteed Access
!Support tickets
!Fair Share Policy
!Price CPU unit
!Price GPU unit
|-
|On-demand
|Yes
|No
|Yes
|Strict
|$0.015
|$0.15
|-
|Lease
|No
|Yes
|High Priority
|No
|$0.025
|$0.25
|}
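The unit-hour arithmetic can be illustrated with a short sketch. It uses only the on-demand prices from the table above; the function name is hypothetical, and it assumes that each GPU unit already covers its 4 cpu cores, so that only cores beyond those are counted as separate CPU units. That billing detail is an assumption, not something stated on this page.

```python
from decimal import Decimal

# On-demand prices per unit-hour, copied from the table above.
CPU_UNIT_PRICE = Decimal("0.015")
GPU_UNIT_PRICE = Decimal("0.15")


def on_demand_cost(cpu_units, gpu_units, hours):
    """Cost in dollars of holding the given units for `hours` whole hours.

    cpu_units: cpu cores NOT already covered by a GPU unit (assumption).
    gpu_units: GPU units, each bundling 4 cpu cores with a GPU.
    """
    per_hour = cpu_units * CPU_UNIT_PRICE + gpu_units * GPU_UNIT_PRICE
    return (per_hour * hours).quantize(Decimal("0.01"))


# Example: 2 GPU units plus 8 extra cores for 10 hours.
example = on_demand_cost(cpu_units=8, gpu_units=2, hours=10)
```

Decimal is used instead of float so that prices such as $0.015 are represented exactly. Actual invoices may be computed differently; please confirm with CUNY-HPCC.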
 
=== Condo Tier ===
The condo tier consists of servers purchased and owned by faculty. Owners have unrestricted access to their own server(s) and can borrow another server from the condo tier (free of charge) upon agreement between owners. Non-owners can rent a condo server when the server is free and the owner explicitly agrees. The renter pays the cost recovery fee, which is collected by HPCC and is used to offset the owner's fees. The minimum rent period is 30 days (one month). Long term rent is 3+ months; in this case there is a 10% discount on the total price. The prices are given in the table below:
 
{| class="wikitable mw-collapsible"
|+
!Type of node
!Renters cost/month
!Long term rent cost/month
!CPU/node
!CPU type
!GPU/node
!GPU type
!GPU interface
|-
|Large Hybrid
|$602.52
|$564.86
|128
|EPYC, 2.2 GHz
|8
|A100/80
|SXM
|-
|Small Hybrid
|$205.41
|$192.57
|48
|EPYC, 2.8 GHz
|2
|A40, A30, L40
|PCIe v4
|-
|Medium Non GPU
|$328.65
|$308.11
|96
|EPYC, 4.11 GHz
|48
|None
|NA
|-
|Large Non GPU
|$438.20
|$410.81
|128
|EPYC, 2.0 GHz
|128
|None
|NA
|}
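As a rough budgeting aid, the rent for a condo node can be sketched as follows. The monthly figures are taken directly from the table above rather than recomputed from a discount percentage. The function name is hypothetical, and the rule that the long term rate applies to every month once the rental reaches 3 months is an assumption.

```python
from decimal import Decimal

# (regular, long term) price per month, copied from the table above.
CONDO_RATES = {
    "Large Hybrid": (Decimal("602.52"), Decimal("564.86")),
    "Small Hybrid": (Decimal("205.41"), Decimal("192.57")),
    "Medium Non GPU": (Decimal("328.65"), Decimal("308.11")),
    "Large Non GPU": (Decimal("438.20"), Decimal("410.81")),
}
LONG_TERM_MONTHS = 3  # "long term rent is 3+ months"


def condo_rent(node_type, months):
    """Total rent in dollars for `months` whole months of one condo node."""
    if months < 1:
        raise ValueError("the minimum rent period is one month")
    regular, long_term = CONDO_RATES[node_type]
    # Assumption: the long term rate applies to all months of a 3+ month rental.
    rate = long_term if months >= LONG_TERM_MONTHS else regular
    return rate * months
```

For example, `condo_rent("Small Hybrid", 2)` uses the regular monthly price, while `condo_rent("Small Hybrid", 4)` uses the long term one.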
 
=== Free time ===
In order to establish a project, all new users are entitled to 11520 free CPU hours and 1440 free GPU hours. Any hours above these are charged at "on-demand" rates. Note that '''<u>free time is per user account, not per project</u>''', so any user can have free time only once. External collaborators to CUNY are not normally eligible for free time. Please contact the CUNY-HPCC director for details.
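A minimal sketch of how the free allowance interacts with the on-demand rates is given below. It assumes that "CPU hours" and "GPU hours" here mean CPU and GPU unit-hours, and that the free allowance is simply subtracted before charging; both points are assumptions, and the function name is hypothetical.

```python
from decimal import Decimal

# Free allowance per user account, from the paragraph above.
FREE_CPU_HOURS = 11520
FREE_GPU_HOURS = 1440
# On-demand prices per unit-hour, from the Basic and Advanced Tier table.
CPU_UNIT_PRICE = Decimal("0.015")
GPU_UNIT_PRICE = Decimal("0.15")


def charge_after_free_time(cpu_hours_used, gpu_hours_used):
    """On-demand charge in dollars once the free allowance is used up."""
    billable_cpu = max(0, cpu_hours_used - FREE_CPU_HOURS)
    billable_gpu = max(0, gpu_hours_used - FREE_GPU_HOURS)
    cost = billable_cpu * CPU_UNIT_PRICE + billable_gpu * GPU_UNIT_PRICE
    return cost.quantize(Decimal("0.01"))
```

With these numbers, a user who has consumed 12000 CPU hours and 1000 GPU hours would be billed only for the 480 CPU hours above the allowance.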
 
== Support for research grants ==
'''<u>All proposals dated on Jan 1st 2026 (01/01/26) and later</u>''' that require computational resources '''<u>must include budget for cost recovery fees at CUNY-HPCC.</u>'''  For a project the PI can choose between:
 
* Lease the node(s). This is a useful option for well defined projects and those with a high computational component requiring 100% availability of the computational resource.
* Use "on-demand" resources. This is a flexible option, good for experimental projects or for exploring new areas of study. The downside is that resources are shared among all users under the fair share policy, so immediate access to a resource cannot be guaranteed.
* Participate in the CONDO tier. This is the most beneficial option in terms of availability of resources and level of support. It best fits the focused research of group(s) (e.g. materials science).
 
In all cases the PI can use the appropriate rates listed above to establish correct budget for the proposal.  PI should  '''<u>contact the Director of CUNY-HPCC Dr. Alexander Tzanov</u>'''  (alexander.tzanov@csi.cuny.edu) and discuss  the project's computational  requirements  including optimal and most economical computational workflows, suitable hardware, shared or own resources, CUNY-HPCC support options and any other matter concerning  correct and optimal computational budget for the proposal.     
 
== Partitions and jobs ==
The only way to submit job(s) to HPCC servers is through the SLURM batch system.  Any job, regardless of its type (interactive, batch, serial, parallel etc.), must be submitted via SLURM. The latter allocates the requested resources on the proper server and starts the job(s) according to a predefined strict fair share policy. Computational resources (cpu-cores, memory, GPU) are organized in '''partitions'''. Users are granted permission to use one or another partition and the corresponding QOS key. The table below shows the partitions and their limitations (in progress).
{| class="wikitable mw-collapsible"
|+
|+
!Partition
!Partition
Line 159: Line 347:
!Total cores/group
!Total cores/group
!Time limits
!Time limits
!Tier
!
!GPU types
!
|-
|-
|production
|partnsf
|128
|128
|50
|50
|256
|256
|240 Hours
|240 Hours
|Advanced
|
|K20m, V100/16, A100/40
|
|-
|partchem
|128
|50
|256
|No limit
|Condo
|
|A100/80, A30
|
|-
|partcfd
|96
|50
|96
|No limit
|Condo
|
|A40
|
|-
|partsym
|96
|50
|96
|No limit
|Condo
|
|A30
|
|-
|-
|partedu
|partasrc
|48
|16
|16
|16
|2
|No limit
|216
|Condo
|72 Hours
|
|A30
|
|-
|-
|partmath
|partmatlabD
|128
|128
|128
|128
|50
|256
|240 Hours
|240 Hours
|Advanced
|
|V100/16,A100/40
|
|-
|-
|partmatlab
|partmatlabN
|1972
|384
|50
|50
|1972
|384
|240 Hours
|240 Hours
|Advanced
|
|None
|
|-
|-
|partdev
|partphys
|16
|96
|16
|50
|16
|96
|4 Hours
|No limit
|Condo
|
|L40
|
|}
|}
o '''production''' is the main partition with assigned resources across all servers (except Math and Cryo). It is a routing partition, so the actual job(s) will be placed in the proper sub-partition automatically. Users may submit sequential, thread parallel or distributed parallel jobs with or without GPU.
o '''partedu'''  partition is only for education. Assigned resources are on educational server Herbert. Partedu is accessible only to students (graduate and/or undergraduate) and their professors who are registered for a class supported by HPCC. Access to this partition is limited by the duration of the class.
o '''partmatlab''' partition allows running MATLAB's Distributed Parallel Server across the main cluster. Note, however, that Parallel Toolbox programs can be submitted via the production partition, but only as thread parallel jobs.


o '''partdev''' is dedicated to development. All HPCC users have access to this partition with assigned resources of one computational node with 16 cores, 64 GB of memory and 2 GPU (K20m). This partition has time limit of 4 hours.
* '''partnsf''' is the main partition with assigned resources across all sub-servers. Users may submit sequential, thread parallel or distributed parallel jobs with or without GPU.
* '''partchem''' is a CONDO partition.
* '''partphys''' is a CONDO partition.
* '''partsym''' is a CONDO partition.
* '''partasrc''' is a CONDO partition.
* '''partmatlabD''' partition allows running MATLAB's Distributed Parallel Server across the main cluster.
* '''partmatlabN''' partition provides access to the large MATLAB node with 384 cores and 11 TB of shared memory. It is useful for running parallel MATLAB jobs with the Parallel Toolbox.
* '''partdev''' is dedicated to development. All HPCC users have access to this partition, with assigned resources of one computational node with 16 cores, 64 GB of memory and 2 GPU (K20m). This partition has a time limit of 4 hours.
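To show how a partition and its QOS key end up in a job, the sketch below builds the text of a minimal SLURM batch script. The `#SBATCH` options used (`--job-name`, `--partition`, `--qos`, `--ntasks`, `--time`, `--gres`) are standard SLURM options, but the helper name, the QOS value and the program name are hypothetical placeholders; use the QOS key issued to you by HPCC.

```python
def make_batch_script(job_name, partition, qos, ntasks, hours, command, gpus=0):
    """Return the text of a minimal SLURM batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --qos={qos}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={hours}:00:00",  # hours:minutes:seconds
    ]
    if gpus:
        lines.append(f"#SBATCH --gres=gpu:{gpus}")  # GPUs requested per node
    lines.append("")
    lines.append(command)
    return "\n".join(lines) + "\n"


# Hypothetical job: 16 tasks and 2 GPU in partnsf, within its 240 hour limit.
script = make_batch_script(
    job_name="demo",
    partition="partnsf",
    qos="my_qos_key",  # placeholder, not a real QOS name
    ntasks=16,
    hours=240,
    command="srun ./my_program",
    gpus=2,
)
```

The resulting text would be saved to a file under /scratch/<userid> and submitted with `sbatch` from there, since jobs cannot be started from DSMS storage.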


== Hours of Operation ==
== Hours of Operation ==
The second and fourth Tuesday mornings in the month from 8:00AM to 12PM are normally reserved (but not always used) for scheduled maintenance.  Please plan accordingly.  <br />
In order to maximize the use of resources, HPCC applies a “rolling” maintenance scheme across all systems. When downtime is needed, HPCC will notify all users a week or more in advance (unless an emergency situation occurs).  The fourth Tuesday morning of the month, from 8:00AM to 12PM, is normally reserved (but not always used) for scheduled maintenance.  Please plan accordingly.  Unplanned maintenance to remedy system related problems may be scheduled as needed outside the above mentioned days. Reasonable attempts will be made to inform users running on those systems when these needs arise. Note that users are strongly encouraged to use checkpoints in their jobs.
Unplanned maintenance to remedy system related problems may be scheduled as needed. Reasonable attempts will be made to inform users running on those systems when these needs arise.
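Because maintenance or a crash can interrupt a running job, periodic checkpointing lets a job resume instead of starting over. The following is a generic sketch of the idea, not an HPCC-provided tool: the function names and the toy computation are illustrative only, and real applications often have their own restart mechanisms.

```python
import json
import os


def save_checkpoint(path, state):
    """Write state to path atomically, so a crash never leaves a partial file."""
    tmp = f"{path}.tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # atomic rename within one file system


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return default


def run(path, total_steps, every=100):
    """Toy job: sum the step numbers, checkpointing every `every` steps."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    for step in range(state["step"], total_steps):
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the job is killed and resubmitted with the same checkpoint path, `run` continues from the last saved step rather than from zero.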


== User Support ==
== User Support ==
Users are encouraged to read this Wiki carefully. In particular, the sections on compiling and running
Users are strongly encouraged to read this Wiki carefully before submitting ticket(s) for help. In particular, the sections on compiling and running parallel programs, and the section on the SLURM batch queueing system will give you the essential knowledge needed to use the CUNY HPCC systems.  We have strived to maintain the most uniform user applications environment possible across the Center's systems to ease the transfer of applications
parallel programs, and the section on the SLURM batch queueing system will give you the essential
and run scripts among them.   
knowledge needed to use the CUNY HPC Center systems.  We have strived to maintain the most uniform
user applications environment possible across the Center's systems to ease the transfer of applications
and run scripts among them.  Still, there are some differences, particularly with the SGI (ANDY) and Cray (SALK)
systems.


The CUNY HPC Center staff, along with outside vendors, offer regular courses and workshops to the CUNY
The CUNY HPC Center staff, along with outside vendors, offer regular courses and workshops to the CUNY
community in parallel programming techniques, HPC computing architecture, and the essentials of using our
community in parallel programming techniques, HPC computing architecture, and the essentials of using our
systems. Please follow our mailings on the subject and feel free to inquire about such courses.  We regularly
systems. Please follow our mailings on the subject and feel free to inquire about such courses.  We regularly schedule training visits and classes at the various CUNY campuses.  Please let us know if such a training visit is of interest.  In the past, topics have included an overview of parallel programming, GPU programming and architecture, using the evolutionary biology software at the HPC Center, the SLURM queueing system at the CUNY HPC Center, mixed GPU-MPI and OpenMP programming, etc.  Staff have also presented guest lectures at formal classes throughout the CUNY campuses.
schedule training visits and classes at the various CUNY campuses.  Please let us know if such a training visit
is of interest.  In the past, topics have included an overview of parallel programming, GPU programming and
architecture, using the evolutionary biology software at the HPC Center,  the SLURM queueing system at the
CUNY HPC Center, Mixed GPU-MPI and OpenMP programming, etc.  Staff has also presented guest lectures
at formal classes throughout the CUNY campuses.     


If you have problems accessing your account and cannot login to the ticketing service, please send an email to:
If you have problems accessing your account and cannot login to the ticketing service, please send an email to:
Line 226: Line 460:




1. hpchelp@csi.cuny.edu is for questions and accounts help communication '''only''' and does not accept tickets. For tickets please use  the ticketing system mentioned above. This ensures that the person on staff with the most appropriate skill set and job related responsibility will respond to your questions. During the business week you should expect a 48h response, quite  often even same day response. During the weekend you may not get any response.  
 
1. hpchelp@csi.cuny.edu is for questions and account help communication '''only''' and does not accept tickets unless the ticketing system is not operational. For tickets, please use the ticketing system mentioned above. This ensures that the person on staff with the most appropriate skill set and job related responsibility will respond to your questions. During the business week you should expect a response within 48 hours, and quite often on the same day. During the weekend you may not get any response.


2. '''E-mails to hpchelp@csi.cuny.edu must have a valid CUNY e-mail as reply address.''' Messages originated from public mailers (google, hotmail, etc) are filtered out.
2. '''E-mails to hpchelp@csi.cuny.edu must have a valid CUNY e-mail as reply address.''' Messages originated from public mailers (google, hotmail, etc) are filtered out.
Line 239: Line 474:
== User Manual ==
== User Manual ==


An old version of the user manual can be downloaded at: http://cunyhpc.csi.cuny.edu/publications/User_Manual.pdf.  Note that this manual provides PBS batch scripts as examples. Currently CUNY-HPCC uses SLURM, so users must check the brief SLURM manual distributed with new accounts or ask CUNY-HPCC for a copy of the latter.
The old version of the user manual provides PBS, not SLURM, batch scripts as examples. CUNY-HPCC currently uses the SLURM scheduler, so users must check and use only the updated brief SLURM manual distributed with new accounts, or ask CUNY-HPCC for a copy of it.

Latest revision as of 20:40, 15 November 2025

Hpcc-panorama3.png

The City University of New York (CUNY) High Performance Computing Center (HPCC) is located on the campus of the College of Staten Island, 2800 Victory Boulevard, Staten Island, New York 10314. HPCC goals are to:

  • Support the scientific computing needs of CUNY faculty, their collaborators at other universities, and their public and private sector partners, and CUNY students and research staff.
  • Create opportunities for the CUNY research community to develop new partnerships with the government and private sectors; and
  • Leverage the HPC Center capabilities to acquire additional research resources for its faculty and graduate students in existing and major new programs.

Organization of systems and data storage (architecture)

All user data and project data are kept on the Data Storage and Management System (DSMS), which is mounted only on the login node(s) of all servers. Consequently, no jobs can be started directly from DSMS storage. Instead, all jobs must be submitted from a separate (fast but small) /scratch file system mounted on all computational nodes and on all login nodes. As the name suggests, the /scratch file system is not a home directory for accounts, nor can it be used for long term data preservation. Users must use the "staging" procedure described below to ensure preservation of their data, codes and parameter files. The figure below is a schematic of the environment.

Upon registering with HPCC every user will get 2 directories:

/scratch/<userid> – this is temporary workspace on the HPC systems
/global/u/<userid> – space for “home directory”, i.e., storage space on the DSMS for program, scripts, and data
• In some instances a user will also have use of disk space on the DSMS in /cunyZone/home/<projectid> (IRods).
HPCC structure.png

The /global/u/<userid> directory has a quota (see below for details), while /scratch/<userid> does not. However, the /scratch space is cleaned up following the rules described below. There are no guarantees of any kind that files in /scratch will be preserved during hardware crashes or clean-up. Access to all HPCC resources is provided by a bastion host called Chizen. The Data Transfer Node called Cea allows file transfer from/to remote sites directly to/from /global/u/<userid> or to/from /scratch/<userid>.
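The idea behind staging, copying inputs from the home directory to /scratch before a job and copying results back afterwards, can be sketched as follows. This is only an illustration of the concept: the helper name, the user id and the file names are hypothetical, and the official staging procedure described elsewhere in this wiki takes precedence.

```python
import shutil
from pathlib import Path


def stage(files, src_dir, dst_dir):
    """Copy the named files from src_dir to dst_dir and return the new paths."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    staged = []
    for name in files:
        target = dst_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_dir / name, target)  # copy2 also keeps timestamps
        staged.append(target)
    return staged


# Intended use on a login node, where both file systems are mounted
# (hypothetical user id "jdoe" and project directory "myproject"):
#   stage(["input.dat", "job.sh"], "/global/u/jdoe/myproject", "/scratch/jdoe/myproject")
#   ... submit the job from /scratch/jdoe/myproject and wait for it to finish ...
#   stage(["output.dat"], "/scratch/jdoe/myproject", "/global/u/jdoe/myproject")
```

Copying results back promptly matters because files in /scratch are subject to clean-up and are not guaranteed to survive.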

HPC systems

The HPC Center operates a variety of architectures in order to support complex and demanding workflows. All computational resources of different types are united into a single hybrid cluster called Arrow. It deploys symmetric multiprocessor (also referred to as SMP) nodes with and without GPU, a distributed shared memory (NUMA) node, fat (large memory) nodes and advanced SMP nodes with multiple GPU. The number of GPU per node varies between 2 and 8, and the GPU interface and GPU family vary as well. Thus the basic GPU nodes hold two Tesla K20m (plugged in through a PCIe interface), while the most advanced ones support eight Ampere A100 GPU connected via the SXM interface.

Overview of Computational architectures:

SMP servers have several processors (working under a single operating system) which "share everything". Thus all cpu-cores access a common memory block via a shared bus or data path. SMP servers support all combinations of memory and cpu (up to the limits of the particular computer). The SMP servers are commonly used to run sequential or thread parallel (e.g. OpenMP) jobs and they may or may not have GPU.

A cluster is defined as a single system comprising a set of servers interconnected with a high-performance network. Specific software coordinates programs on and/or across those servers in order to perform computationally intensive tasks. The most common cluster type consists of several identical SMP servers connected via a fast interconnect. Each SMP member of the cluster is called a node. All nodes run independent copies of the same operating system (OS). Some or all of the nodes may incorporate GPUs.

Hybrid clusters combine nodes of different architectures. For instance, the main CUNY-HPCC machine is a hybrid cluster called Arrow. Sixty-two (62) of its nodes are identical GPU-enabled SMP servers, each with 2 x K20m GPUs; 3 are SMP servers with extended memory (fat nodes); one is a distributed shared memory node (NUMA, see below); and 2 are fat SMP servers specially designed to support 8 NVIDIA GPUs per node, connected via an SXM interface. In addition, HPCC operates the cluster Herbert, dedicated solely to education.

A distributed shared memory computer is a tightly coupled server in which the memory is physically distributed but logically unified as a single block. The system resembles an SMP, but the possible number of cpu cores and amount of memory are far beyond the limits of an SMP. Because the memory is distributed, access times across the address space are non-uniform. This architecture is therefore called Non-Uniform Memory Access (NUMA). Like SMP systems, NUMA systems are typically used for applications such as data mining and decision support systems, in which processing can be parceled out to a number of processors that collectively work on common data. HPCC operates the NUMA node of Arrow, named Appel. This node does not have GPUs.

Infrastructure systems:

o Master Head Node (MHN/Arrow) is a redundant login node from which all jobs on all servers start. This server is not directly accessible from outside the CSI campus. Note that the main server and its login nodes share the same name, Arrow. Thus users can access the Arrow login nodes using the name Arrow or MHN.

o Chizen is a redundant gateway server which provides access to the protected HPCC domain.

o Cea is a file transfer node that allows files to be transferred between users’ computers and /scratch space or /global/u/<userid>. Cea is accessible directly (not only via Chizen), but allows only a limited set of shell commands.

Table 1 below provides a quick summary of the attributes of each of the sub-clusters of the main HPC Center system, called Arrow.

| Master Head Node | Sub System | Tier | Type | Type of Jobs | Nodes | CPU Cores | GPUs | Mem/node | Mem/core | Chip Type | GPU Type and Interface |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Arrow | Penzias | Advanced | Hybrid Cluster | Sequential & Parallel jobs w/wo GPU | 66 | 16 | 2 | 64 GB | 4 GB | SB, EP 2.20 GHz | K20m GPU, PCIe v2 |
| | | | | Sequential & Parallel jobs | 1 | 24 | - | 1500 GB | 62 GB | HL, 2.30 GHz | - |
| | | | | | | 36 | - | 768 GB | 21 GB | | - |
| | | | | | | 24 | - | 768 GB | 32 GB | | - |
| | Appel | | NUMA | Massive Parallel, sequential, OpenMP | 1 | 384 | - | 11 TB | 28 GB | IB, 3 GHz | - |
| | Cryo | | SMP | Sequential and Parallel jobs, with GPU | 1 | 40 | 8 | 1500 GB | 37 GB | SL, 2.40 GHz | V100 (32GB) GPU, SXM |
| | Blue Moon | | Hybrid Cluster | Sequential and Parallel jobs w/wo GPU | 24 | 32 | - | 192 GB | 6 GB | SL, 2.10 GHz | - |
| | | | | | 2 | 32 | 2 | | | | V100 (16GB) GPU, PCIe |
| | Karle | | SMP | Visualization, MATLAB/Mathematica | 1 | 36* | - | 768 GB | 21 GB | HL, 2.30 GHz | - |
| | Chizen | | Gateway | No jobs allowed | | | | | | | - |
| | CFD | Condo | SMP | Parallel, Seq, OpenMP | 1 | 48 | 2 | 768 GB | | EM, 4.8 GHz | A40, PCIe, v4 |
| | | | | | 1 | 48 | - | 512 GB | | ER, 4.3 GHz | - |
| | PHYS | Condo | SMP | | 1 | 48 | 2 | 640 GB | | ER, 4 GHz | L40, PCIe, v4 |
| | | | | | 1 | 48 | - | 512 GB | | ER, 4.3 GHz | - |
| | CHEM | Condo | SMP | | 1 | 48 | 2 | 256 GB | | EM, 2.8 GHz | A30, PCIe, v4 |
| | | | | | 1 | 128 | 8 | 512 GB | | ER, 2.0 GHz | A100/40, SXM |
| | ASRC | Condo | SMP | | 1 | 48 | 2 | 256 GB | | ER, 2.8 GHz | A30, PCIe, v4 |

Note: SB = Intel(R) Sandy Bridge, HL = Intel(R) Haswell, IB = Intel(R) Ivy Bridge, SL = Intel(R) Xeon(R) Gold, ER = AMD(R) EPYC Rome, EM = AMD(R) EPYC Milan, EG = AMD(R) EPYC Genoa

Recovery of operational costs

CUNY-HPCC operates on a cost recovery model, recapturing only operational costs with no profit (for CUNY users only). The costs are calculated to break even, following the methodology used by CUNY-RF. The costs are reviewed and updated twice a year. The charging scheme is based on the unit-hour. A unit can be either a CPU unit or a GPU unit. These are defined in the table below:

| Type of resource | Unit | For V100, A30, A40 or L40 | For A100 |
|---|---|---|---|
| CPU unit | 1 cpu core | -- | -- |
| GPU unit | 4 cpu cores + 1 GPU thread | 4 cpu cores + 1 GPU | 4 cpu cores and 1/7 A100 |

Users can choose between the following options:

  1. On-demand computing (for basic and advanced tiers only)
  2. Rent a node in basic and/or advanced tier for the duration of the project
  3. Rent a condo node.

Basic and Advanced Tier

Under the "On-Demand" computing mode, users are charged per unit-hour according to the table above. Users leasing a resource (e.g. a node; the minimum lease is 1 month, i.e. 30 days) are charged at the beginning of the lease period. Any additional hours needed to complete the project are charged as on-demand computing. Leasing guarantees 24/7 access to the node(s) (except during maintenance periods), no time limits for the job(s), and a preferred level of support.

| Type of service | Time Limit | Guaranteed Access | Support tickets | Fair Share Policy | Price CPU unit | Price GPU unit |
|---|---|---|---|---|---|---|
| On-demand | Yes | No | Yes | Strict | $0.015 | $0.15 |
| Lease | No | Yes | High Priority | No | $0.025 | $0.25 |
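As a rough budgeting aid, the unit-hour prices above can be turned into a small estimator. This is only a sketch under stated assumptions: it assumes the charge is simply units x hours x price per unit-hour, and it takes the CPU and GPU unit counts as given (how the cores bundled into a GPU unit are counted for a particular job is not specified here). The function and dictionary names are hypothetical.

```python
# Prices per unit-hour in USD, transcribed from the table above.
PRICES = {
    "on-demand": {"cpu": 0.015, "gpu": 0.15},
    "lease": {"cpu": 0.025, "gpu": 0.25},
}


def unit_hour_cost(mode: str, cpu_units: int, gpu_units: int, hours: float) -> float:
    """Estimated charge, assuming cost = units x hours x price per unit-hour."""
    price = PRICES[mode]
    total = hours * (cpu_units * price["cpu"] + gpu_units * price["gpu"])
    return round(total, 2)  # round to cents


# Example: 16 CPU units and 2 GPU units for 100 hours, on demand.
print(unit_hour_cost("on-demand", 16, 2, 100))  # 54.0
```

Actual invoices may differ from this estimate; the rates are reviewed twice a year, so the constants should be checked against the current table.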

Condo Tier

The condo tier consists of servers purchased and owned by faculty. Owners have unrestricted access to their own server(s) and can borrow other servers from the condo tier (free of charge) upon agreement between owners. Non-owners can rent a condo server when the server is free and the owner explicitly agrees. The renter pays the cost recovery fee, which is collected by HPCC and is used to offset the owner's fees. The minimum rent period is 30 days (one month). Long-term rent is 3+ months; in this case there is a 10% discount on the total price. The prices are given in the table below:

| Type of node | Renter's cost/month | Long-term rent cost/month | CPU/node | CPU type | GPU/node | GPU type | GPU interface |
|---|---|---|---|---|---|---|---|
| Large Hybrid | $602.52 | $564.86 | 128 | EPYC, 2.2 GHz | 8 | A100/80 | SXM |
| Small Hybrid | $205.41 | $192.57 | 48 | EPYC, 2.8 GHz | 2 | A40, A30, L40 | PCIe v4 |
| Medium Non GPU | $328.65 | $308.11 | 96 | EPYC, 4.11 GHz | 48 | None | NA |
| Large Non GPU | $438.20 | $410.81 | 128 | EPYC, 2.0 GHz | 128 | None | NA |
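The monthly rates in the table can be used to estimate the total rent for a given period. The sketch below is an illustration only: it assumes the long-term monthly rate from the table applies to every month of a rental lasting 3 months or more, and the function name and dictionary are hypothetical.

```python
# Monthly rates in USD transcribed from the table above: (standard, long-term).
CONDO_RATES = {
    "Large Hybrid": (602.52, 564.86),
    "Small Hybrid": (205.41, 192.57),
    "Medium Non GPU": (328.65, 308.11),
    "Large Non GPU": (438.20, 410.81),
}
LONG_TERM_MONTHS = 3  # rentals of 3+ months count as long term


def condo_rent(node_type: str, months: int) -> float:
    """Total rent, assuming the long-term rate covers all months of a 3+ month rental."""
    if months < 1:
        raise ValueError("the minimum rent period is one month")
    standard, long_term = CONDO_RATES[node_type]
    rate = long_term if months >= LONG_TERM_MONTHS else standard
    return round(rate * months, 2)


print(condo_rent("Small Hybrid", 1))  # 205.41
print(condo_rent("Small Hybrid", 4))  # 770.28
```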

Free time

In order to establish a project, all new users are entitled to 11520 free CPU hours and 1440 free GPU hours. Any hours above these are charged at the "on-demand" rates. Note that free time is per user account, not per project, so any user can have free time only once. External collaborators to CUNY are not normally eligible for free time. Please contact the CUNY-HPCC director for details.
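The effect of the free allowance on a first bill can be sketched as follows. This is an illustrative calculation, not the billing system: it assumes a "CPU hour" and a "GPU hour" correspond to one CPU or GPU unit-hour at the on-demand prices ($0.015 and $0.15), and the function name is hypothetical.

```python
FREE_CPU_HOURS = 11520
FREE_GPU_HOURS = 1440
ON_DEMAND_CPU = 0.015  # USD per CPU unit-hour (on-demand rate)
ON_DEMAND_GPU = 0.15   # USD per GPU unit-hour (on-demand rate)


def billable_cost(cpu_hours_used: float, gpu_hours_used: float) -> float:
    """Charge for usage above the one-time free allowance, at on-demand rates."""
    cpu_over = max(0, cpu_hours_used - FREE_CPU_HOURS)
    gpu_over = max(0, gpu_hours_used - FREE_GPU_HOURS)
    return round(cpu_over * ON_DEMAND_CPU + gpu_over * ON_DEMAND_GPU, 2)


print(billable_cost(10000, 1000))  # 0.0, still within the free allowance
print(billable_cost(12520, 1440))  # 15.0, 1000 CPU hours over the allowance
```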

Support for research grants

All proposals dated January 1, 2026 (01/01/26) or later that require computational resources must include a budget for cost recovery fees at CUNY-HPCC. For a project, the PI can choose to:

  • lease the node(s). This is a useful option for well-defined projects and those with a high computational component requiring 100% availability of the computational resource.
  • use "on-demand" resources. This is a flexible option, good for experimental projects or for exploring new areas of study. The downside is that resources are shared among all users under the fair share policy, so immediate access to a resource cannot be guaranteed.
  • participate in the CONDO tier. This is the most beneficial option in terms of availability of resources and level of support. It best fits the focused research of group(s) (e.g. materials science).

In all cases the PI can use the appropriate rates listed above to establish a correct budget for the proposal. The PI should contact the Director of CUNY-HPCC, Dr. Alexander Tzanov (alexander.tzanov@csi.cuny.edu), to discuss the project's computational requirements, including optimal and most economical computational workflows, suitable hardware, shared or own resources, CUNY-HPCC support options, and any other matter concerning a correct and optimal computational budget for the proposal.

Partitions and jobs

The only way to submit job(s) to HPCC servers is through the SLURM batch system. Any job, regardless of its type (interactive, batch, serial, parallel, etc.), must be submitted via SLURM. SLURM allocates the requested resources on the proper server and starts the job(s) according to a predefined strict fair share policy. Computational resources (cpu-cores, memory, GPU) are organized in partitions. Users are granted permission to use one or another partition and the corresponding QOS key. The table below describes the partitions and their limitations (in progress).

| Partition | Max cores/job | Max jobs/user | Total cores/group | Time limits | Tier | GPU types |
|---|---|---|---|---|---|---|
| partnsf | 128 | 50 | 256 | 240 Hours | Advanced | K20m, V100/16, A100/40 |
| partchem | 128 | 50 | 256 | No limit | Condo | A100/80, A30 |
| partcfd | 96 | 50 | 96 | No limit | Condo | A40 |
| partsym | 96 | 50 | 96 | No limit | Condo | A30 |
| partasrc | 48 | 16 | 16 | No limit | Condo | A30 |
| partmatlabD | 128 | 50 | 256 | 240 Hours | Advanced | V100/16, A100/40 |
| partmatlabN | 384 | 50 | 384 | 240 Hours | Advanced | None |
| partphys | 96 | 50 | 96 | No limit | Condo | L40 |

  • partnsf is the main partition, with assigned resources across all sub-servers. Users may submit sequential, thread-parallel or distributed-parallel jobs, with or without GPU.
  • partchem is a CONDO partition.
  • partphys is a CONDO partition.
  • partsym is a CONDO partition.
  • partasrc is a CONDO partition.
  • partmatlabD allows users to run MATLAB's Distributed Parallel Server across the main cluster.
  • partmatlabN provides access to the large MATLAB node with 384 cores and 11 TB of shared memory. It is useful for running parallel MATLAB jobs with the Parallel Computing Toolbox.
  • partdev is dedicated to development. All HPCC users have access to this partition, which has assigned resources of one computational node with 16 cores, 64 GB of memory and 2 GPUs (K20m). This partition has a time limit of 4 hours.
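Before submitting, a request can be checked against the per-job limits listed above. The Python sketch below is an illustration under stated assumptions, not an HPCC tool: the limits are transcribed from the partition table and the partdev description, the function names are hypothetical, and the generated header assumes one core per task. A `--qos` line with the user's QOS key would normally be needed as well; it is omitted because the key names are not listed here.

```python
from typing import Optional

# partition -> (max cores per job, time limit in hours or None for "No limit")
PARTITION_LIMITS: dict[str, tuple[int, Optional[int]]] = {
    "partnsf": (128, 240),
    "partchem": (128, None),
    "partcfd": (96, None),
    "partsym": (96, None),
    "partasrc": (48, None),
    "partmatlabD": (128, 240),
    "partmatlabN": (384, 240),
    "partphys": (96, None),
    "partdev": (16, 4),
}


def check_request(partition: str, cores: int, hours: int) -> list[str]:
    """Return a list of limit violations (empty if the request fits)."""
    if partition not in PARTITION_LIMITS:
        return [f"unknown partition: {partition}"]
    max_cores, max_hours = PARTITION_LIMITS[partition]
    problems = []
    if cores > max_cores:
        problems.append(f"{cores} cores exceeds the {max_cores}-core limit")
    if max_hours is not None and hours > max_hours:
        problems.append(f"{hours} h exceeds the {max_hours} h time limit")
    return problems


def sbatch_header(partition: str, cores: int, hours: int) -> str:
    """Render #SBATCH directive lines, assuming one core per task."""
    days, rem = divmod(hours, 24)  # SLURM accepts days-hours:minutes:seconds
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --ntasks={cores}",
        f"#SBATCH --time={days}-{rem:02d}:00:00",
    ])


if not check_request("partnsf", 32, 48):
    print(sbatch_header("partnsf", 32, 48))
```

The check covers only the per-job core and time limits; the per-user job count and per-group core totals depend on what else is running and are enforced by SLURM itself.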

Hours of Operation

In order to maximize the use of resources, HPCC applies a “rolling” maintenance scheme across all systems. When downtime is needed, HPCC will notify all users a week or more in advance (unless an emergency occurs). The fourth Tuesday morning of each month, from 8:00 AM to 12:00 PM, is normally reserved (but not always used) for scheduled maintenance. Please plan accordingly. Unplanned maintenance to remedy system-related problems may be scheduled as needed outside the above-mentioned days. Reasonable attempts will be made to inform users running on those systems when these needs arise. Note that users are strongly encouraged to use checkpoints in their jobs.
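To plan long runs around the reserved window, the fourth Tuesday of a month can be computed directly. The following is a small Python sketch with hypothetical function names; it covers only the regularly reserved 8:00 AM to 12:00 PM slot and cannot know about unplanned maintenance or whether a given window is actually used.

```python
import datetime as dt

TUESDAY = 1  # date.weekday() numbering: Monday is 0


def fourth_tuesday(year: int, month: int) -> dt.date:
    """Date of the fourth Tuesday of the given month."""
    first = dt.date(year, month, 1)
    days_to_first_tuesday = (TUESDAY - first.weekday()) % 7
    return first + dt.timedelta(days=days_to_first_tuesday + 21)


def maintenance_window(year: int, month: int) -> tuple[dt.datetime, dt.datetime]:
    """Reserved window: 8:00 AM to 12:00 PM on the fourth Tuesday."""
    day = fourth_tuesday(year, month)
    return (dt.datetime.combine(day, dt.time(8, 0)),
            dt.datetime.combine(day, dt.time(12, 0)))


def overlaps_maintenance(job_start: dt.datetime, job_end: dt.datetime) -> bool:
    """True if the job interval overlaps a reserved window in any month it spans."""
    year, month = job_start.year, job_start.month
    while (year, month) <= (job_end.year, job_end.month):
        start, end = maintenance_window(year, month)
        if job_start < end and start < job_end:
            return True
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return False


print(fourth_tuesday(2026, 1))  # 2026-01-27
```

A job that would overlap the window is a good candidate for checkpointing, so that it can be restarted if the maintenance slot is used.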

User Support

Users are strongly encouraged to read this Wiki carefully before submitting ticket(s) for help. In particular, the sections on compiling and running parallel programs, and the section on the SLURM batch queueing system will give you the essential knowledge needed to use the CUNY HPCC systems. We have strived to maintain the most uniform user applications environment possible across the Center's systems to ease the transfer of applications and run scripts among them.

The CUNY HPC Center staff, along with outside vendors, offer regular courses and workshops to the CUNY community in parallel programming techniques, HPC computing architecture, and the essentials of using our systems. Please follow our mailings on the subject and feel free to inquire about such courses. We regularly schedule training visits and classes at the various CUNY campuses. Please let us know if such a training visit is of interest. In the past, topics have included an overview of parallel programming, GPU programming and architecture, using the evolutionary biology software at the HPC Center, the SLURM queueing system at the CUNY HPC Center, mixed GPU-MPI and OpenMP programming, etc. Staff have also presented guest lectures at formal classes throughout the CUNY campuses.

If you have problems accessing your account and cannot login to the ticketing service, please send an email to:

 hpchelp@csi.cuny.edu 

Warnings and modes of operation

1. hpchelp@csi.cuny.edu is for questions and account-related help only and does not accept tickets unless the ticketing system is not operational. For tickets, please use the ticketing system mentioned above. This ensures that the person on staff with the most appropriate skill set and job-related responsibility will respond to your questions. During the business week you should expect a response within 48 hours, quite often even the same day. During the weekend you may not get any response.

2. E-mails to hpchelp@csi.cuny.edu must have a valid CUNY e-mail as the reply address. Messages originating from public mailers (Google, Hotmail, etc.) are filtered out.

3. Do not send questions to individual CUNY HPC Center staff members directly. These will be returned to the sender with a polite request to submit a ticket or email the Helpline. This applies to replies to initial questions as well.

The CUNY HPC Center staff members are focused on providing high quality support to its user community, but compared to other HPC Centers of similar size our staff is extremely lean. Please make full use of the tools that we have provided (especially the Wiki), and feel free to offer suggestions for improved service. We hope and expect your experience in using our systems will be predictably good and productive.

User Manual

The old version of the user manual provides PBS, not SLURM, batch scripts as examples. CUNY-HPCC currently uses the SLURM scheduler, so users must check and use only the updated brief SLURM manual distributed with new accounts, or ask CUNY-HPCC for a copy of it.