Introduction to the City University of New York High Performance Computing Center
The City University of New York (CUNY) High Performance Computing Center (HPCC) is located on the campus of the College of Staten Island, 2800 Victory Boulevard, Staten Island, New York 10314. The HPCC's goals are to:
- Support the scientific computing needs of CUNY faculty and their collaborators at other universities, their public and private sector partners, and CUNY students and research staff;
- Create opportunities for the CUNY research community to develop new partnerships with the government and private sectors; and
- Leverage the HPC Center capabilities to acquire additional research resources for its faculty and graduate students in existing and major new programs.
Please send comments on, or corrections to, this wiki to firstname.lastname@example.org.
The HPCC currently operates seven significant systems. The following table summarizes the characteristics of these systems; additional information is provided below the table.
Andy. Andy (andy.csi.cuny.edu) is named in honor of Dr. Andrew S. Grove, an alumnus of the City College of New York and one of the founders of the Intel Corporation (http://educationupdate.com/archives/2005/Dec/html/col-ccnypres.htm). Andy is composed of two distinct computational halves serviced by a single head node and several service nodes. The first and older half (Andy1) is an SGI ICE system (http://www.sgi.com/products/servers/altix/ice/) with 45 dual-socket compute nodes, each socket holding a quad-core 2.93 GHz Intel Core i7 (Nehalem) processor, providing a total of 360 compute cores. Each compute node has 24 Gbytes of memory, or 3 Gbytes per core. Andy1's interconnect is a dual-rail DDR Infiniband (20 Gbit/second) network in which one rail is used to access Andy's Lustre storage system and the other for inter-processor communication. The second and newer half (Andy2) is a cluster of 48 SGI x340 1U compute nodes, each configured similarly to those in Andy1, giving it 384 cores. Andy2's interconnect is a single-rail QDR Infiniband (40 Gbit/second) network serving both its communication traffic and its Lustre storage system. Both Andy1 and Andy2 (360 + 384 = 744 cores) are served by the same head node and home directory, a Lustre parallel file system with 24 Tbytes of usable storage.
Bob. Bob (bob.csi.cuny.edu) is named in honor of Dr. Robert E. Kahn, an alumnus of the City College of New York who, along with Vinton G. Cerf, invented the TCP/IP protocol, the technology used to transmit information over the modern Internet (http://www.economicexpert.com/a/Robert:E:Kahn.htm). Bob is a Dell PowerEdge system consisting of one head node with two sockets of AMD Shanghai native quad-core processors running at 2.3 GHz, and twenty-nine compute nodes of the same type, providing a total of 30 x 8 = 240 cores. Each compute node has 16 Gbytes of memory, or 2 Gbytes per core. Bob has both a standard 1 Gbit Ethernet interconnect and a low-latency SDR Infiniband (10 Gbit/second) interconnect. Bob is currently largely dedicated to running the Gaussian suite of computational chemistry programs.
Karle. Karle (karle.csi.cuny.edu) is named in honor of Dr. Jerome Karle, an alumnus of the City College of New York who was awarded the Nobel Prize in Chemistry in 1985, jointly with Herbert A. Hauptman, for the direct analysis of crystal structures using X-ray scattering techniques. Karle functions both as a gateway and as an interface system for running MATLAB, SAS, Mathematica, and other GUI-oriented applications for CUNY users both within and outside the local area network at the College of Staten Island, where the CUNY HPC Center is located. Karle can be used to run such computations (serial or parallel) locally and directly on Karle, or to submit batch work over the network to the clusters Bob and Andy described above. As a single four-socket node with 4 x 6 = 24 cores, Karle is a highly capable, head-node-like system. Karle's 24 Intel E740-based cores run at 2.4 GHz, and the system has a total of 96 Gbytes of memory, or 4 Gbytes per core. Account allocation on Karle will be limited to users requiring access to the GUI-oriented applications it is intended to run.
Neptune. Neptune (neptune.csi.cuny.edu) functions as a generic gateway or interface system for CUNY users who are not within the local area network at the College of Staten Island, where the CUNY HPC Center is located. Neptune can be addressed using the secure shell command ssh (ssh [-X] neptune.csi.cuny.edu). Neptune is used only as a secure jumping-off point to access the other HPCC systems. HPC workloads should NOT be run on Neptune, which has limited memory and compute power. Work found running on Neptune that consumes significant CPU time, as shown by the 'top' command, will be killed. This applies in general to the head nodes of all the CUNY systems.
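For example, an off-campus user would typically log in to Neptune first and then hop to the system where the work will actually run. A minimal sketch (the username below is a placeholder, not an actual account):

```shell
# Log in to the Neptune gateway from outside the CSI network;
# -X enables X11 forwarding for GUI applications.
# "your_userid" is a placeholder for an actual account name.
ssh -X your_userid@neptune.csi.cuny.edu

# From Neptune, continue on to one of the HPC systems.
ssh andy.csi.cuny.edu
```

Remember that Neptune itself is only a stepping stone; compute jobs belong on the cluster you hop to, and even there only via the batch system, not on the head node.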
Penzias. Penzias is named after Dr. Arno Penzias, a CUNY alumnus and Nobel Laureate in Physics. Penzias is a Dell R720 system consisting of dual head nodes and 72 compute nodes. Each compute node has two sockets of eight-core Intel E5-2660 2.2 GHz processors, for 16 cores per node and a total of 72 x 16 = 1152 cores available for user computations. Each core has 4 Gbytes of memory (64 Gbytes per 16-core node). The interconnect network is FDR Infiniband. The system also has 144 NVIDIA Kepler K20 GPUs.
Salk. Salk (salk.csi.cuny.edu) is named in honor of Dr. Jonas Salk, also an alumnus of the City College of New York and creator of the first vaccine for polio (http://en.wikipedia.org/wiki/Jonas_Salk#College). Salk is a two-cabinet Cray XE6m system built around Cray's latest custom, high-speed Gemini interconnect. Salk consists of 176 dual-socket compute nodes, each containing two 8-core AMD Magny-Cours processors running at 2.3 GHz, for a total of 16 cores per node. This gives the system a total of 2816 cores for the production processing of CUNY's HPC applications. Each node has a total of 32 Gbytes of memory, or 2 Gbytes per core. Gemini is a high-bandwidth, low-latency, high-message-injection-rate interconnect supported by a custom ASIC and a low-level communications protocol developed by Cray. Unlike the other clusters at the CUNY HPC Center, which are connected in a multi-tiered switch topology, the Cray XE6m nodes served by Gemini are laid out in a 2D torus network. Salk is intended to run jobs of a larger scale than the other CUNY HPC Center systems: jobs smaller than 16 cores are not allowed on SALK, while jobs of 1024 cores and larger are. In addition, SALK, through its Gemini interconnect and compilers, supports the Partitioned Global Address Space languages CoArray Fortran and Unified Parallel C. These languages make programming large, distributed-memory parallel systems easier and more scalable.
Zeus. Zeus (zeus.csi.cuny.edu) is focused on supporting users running Gaussian and, now also, the development of CPU-GPU applications. This system (Dell PowerEdge 1950) consists of one head node (2 x 4 cores running at 1.86 GHz) and 18 compute nodes. Eight of the compute nodes (nodes 0 through 7) have two sockets of Intel 2.66 GHz quad-core Harpertown processors, providing a total of eight cores per node. These 8 Harpertown nodes have 2 Gbytes of memory per core, for a total of 16 Gbytes per node. Each Harpertown node also has a ~1 TByte disk drive (/state/partition1) for storing Gaussian scratch files. Two compute nodes (nodes 8 and 9) have two sockets of Intel 2.27 GHz dual-core Woodcrest processors and a total of 6 Gbytes of memory each. Nodes 8 and 9 are also each attached to an NVIDIA Tesla S1070, 1U, 4-way GPU array via dual PCI-Express 2.0 cables to support integrated CPU-GPU computing. Each GPU (4 per 1U Tesla node) has 240 32-bit floating-point units with a peak performance of 1 teraflop (there are also 30 64-bit units). Each GPU also has 4 Gbytes of GPU-local memory. Zeus has another 8 compute nodes (compute-0-10 through compute-0-17), each with a single-socket Intel 2.86 GHz dual-core Woodcrest processor, bringing the system total to 88 compute cores. These nodes may also be used for Gaussian work and include a local 250 Gbyte disk drive for storing Gaussian scratch files. The interconnect network is a standard 1 Gbit Ethernet.
In addition to the above, the HPC Center is installing a new, centralized Storage System and Network. The Storage System will provide an order of magnitude more on-line storage capacity for home directories and project space, directly accessible (although not directly controlled) from any of the HPC Center's installed systems, and will include a large, remote tape archival facility.
The remote tape silo will allow for daily incremental backups, full weekly and monthly backups, and long-term retention of critical research data. An iRODS server will be integrated into the environment and will provide a mechanism for the user community to share data.
- The acquisition of the Storage Network is allowing us to transform the environment from a “server centric” to a “data centric” one.
- At the present time, each system has its own file system for scratch, home directories, and project files.
- The Storage Network will support home directories and project files. This benefits the user in that all files are now in one place accessible from any system. In addition, old servers can be retired and new servers installed without impacting user data.
- Local system disk will be used only for scratch space.
- Offsite storage will be provided for home directories and project files.
- A data transfer node will provide for interconnectivity to instrumentation connected to science DMZs.
- An iRODS server will be provided to support the management of research data.
The HPC Center works to maintain a certain amount of uniformity in its software stack, especially at the user and application level. In general, we have standardized on OpenMPI as our MPI implementation, although vendor versions from Cray and SGI are available (on SALK the Cray version of MPI is the default). While we support the Intel, PGI, and GNU compilers, we have made the Intel compiler suite the default on all systems except SALK. Moving down the stack to the operating systems, we are a Linux shop, although there is some variation in the flavor of Linux supported on each system, dictated by the vendor. On PENZIAS, BOB, and ZEUS, which are Commodity Off-The-Shelf (COTS) clusters from Dell, we support CentOS as part of the Rocks 5.3 release. The operating system running on ANDY is SLES 11 updated with the SGI ProPack SP1 support package. The operating system on SALK, the Cray Linux Environment 3.1 (CLE 3.1), is based on SLES 11. The queuing system in use on all CUNY HPC Center systems is PBS Pro 11, with a queue design that is as identical as possible across the systems. The user application software stack supported on all systems includes the following compilers and parallel library software. Much more detail on each can be found below.
- GNU C, C++ and Fortran compilers;
- Portland Group, Inc. optimizing C, C++, and Fortran compilers with CUDA and GPU support;
- The Intel Cluster Studio including the Intel C, C++ and Fortran compilers, Math and Kernel Library;
- OpenMPI 1.5.5 (Cray's custom MPICH on SALK, SGI's proprietary MPT on ANDY, and Intel's MPI are also available)
SALK, the Cray XE6m system, uses its own proprietary MPI library based on the API to its Gemini interconnect. Cray also provides its own C, C++, and Fortran compilers, which support the Partitioned Global Address Space parallel programming models, Unified Parallel C (UPC) and CoArray Fortran (CAF), respectively.
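As a sketch of how the compiler wrappers and PBS Pro fit together on the commodity clusters, the following compiles a small MPI program with the OpenMPI wrapper and submits it in batch. The program name, node and core counts, and walltime are illustrative placeholders, not Center defaults; on SALK, Cray's own wrappers and MPI would be used instead.

```shell
# Compile an MPI source file with the OpenMPI compiler wrapper.
mpicc -O2 -o hello_mpi hello_mpi.c

# Write a minimal PBS Pro batch script; the resource request and
# walltime below are purely illustrative.
cat > hello.pbs <<'EOF'
#!/bin/bash
#PBS -N hello_mpi
#PBS -l select=2:ncpus=8:mpiprocs=8
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
mpirun -np 16 ./hello_mpi
EOF

# Submit the job and check its place in the queue.
qsub hello.pbs
qstat -u $USER
```

See the PBS Pro section below for the actual queue names and per-system resource limits.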
Hours of Operation
The second and fourth Tuesday mornings of each month, from 8:00 AM to 12:00 PM, are normally reserved (but not always used) for scheduled maintenance. Please plan accordingly. Unplanned maintenance to remedy system-related problems may be scheduled as needed. Reasonable attempts will be made to inform users running on the affected systems when these needs arise.
Users are encouraged to read this Wiki carefully. In particular, the sections on compiling and running parallel programs, and the section on the PBS Pro batch queueing system will give you the essential knowledge needed to use the CUNY HPC Center systems. We have strived to maintain the most uniform user applications environment possible across the Center's systems to ease the transfer of applications and run scripts among them. Still, there are some differences, particularly with the SGI (ANDY) and Cray (SALK) systems.
The CUNY HPC Center staff, along with outside vendors, offer regular courses and workshops to the CUNY community in parallel programming techniques, HPC computing architecture, and the essentials of using our systems. Please follow our mailings on the subject and feel free to inquire about such courses. We regularly schedule training visits and classes at the various CUNY campuses. Please let us know if such a training visit is of interest. In the past, topics have included an overview of parallel programming, GPU programming and architecture, using the evolutionary biology software at the HPC Center, the PBS queueing system at the CUNY HPC Center, mixed GPU-MPI and OpenMP programming, etc. Staff have also presented guest lectures in formal classes throughout the CUNY campuses.
Users with further questions or requiring immediate assistance in use of the systems should send an email to:
Mail to this address is received by the entire CUNY HPC Center support staff. This ensures that the person on staff with the most appropriate skill set and job related responsibility will respond to your questions. During the business week you should expect a same-day response. During the weekend you may or may not get same-day response depending on what staff are reading email that weekend. Please send all technical and administrative questions (including replies) to this address.
Please do not send questions to individual CUNY HPC Center staff members directly.
These will be returned to the sender with a polite request to resend them to 'hpchelp'. This applies to replies in an ongoing exchange as well as to initial questions.
The CUNY HPC Center staff are focused on providing high-quality support to the user community, but compared to other HPC Centers of similar size, our staff is lean. Please make full use of the tools that we have provided (especially this Wiki), and feel free to offer suggestions for improved service. We hope and expect your experience in using our systems will be predictably good and productive.
Data storage, retention/deletion, and back-ups
Each user account, upon creation, is provided a home directory (currently on each system) with a default 50 GB storage ceiling, or disk quota. A user may request an increase in the size of their home directory if there is a special need. The HPC Center will endeavor to satisfy reasonable requests, but storage is not unlimited, and full file systems (especially ones holding large files) make backing up the system more difficult. Please regularly remove unwanted files and directories to minimize this burden, and avoid keeping duplicate copies in multiple locations; file transfer among the HPC Center systems is very fast. Occasionally, HPC Center users have assumed that HPC Center disks could be used to 'park' or archive data generated locally at their own sites. This practice is strictly forbidden.
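For example, rather than keeping a second copy on two systems, a file tree can simply be moved between them with scp. In this sketch the username and paths are placeholders:

```shell
# Copy a results directory from the current system to Bob;
# "your_userid" and the paths are placeholders. Once the copy is
# verified, the local duplicate can be removed to stay under quota.
scp -r ~/project_results your_userid@bob.csi.cuny.edu:~/project_results
```
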
By the end of 2013, the HPC Center will have completed upgrading its storage system and network architecture. This will create a central home directory storage hub for all systems, over 1 PByte in size, with tape backup and high-speed local scratch space on each system. Look for these changes here and in HPC Center mailings.
An incremental backup of user home directories on Andy, Salk, Karle, Bob, and Zeus is performed daily. These backups are retained for three weeks. Full backups are performed weekly and are retained for two months. These backups are stored in a remote location. A full backup is read off tape, bi-monthly, and verified (to ensure backups are readable and restorable).
The following user and system files are backed up:
Retention/Deletion of Home Directories
For active accounts, current Home Directories are retained indefinitely. If a user account is inactive for one year, the HPCC will contact the user and request that the data be removed from the system. If there is no response from the user within three months of the initial notice, or if the user cannot be reached, the Home Directory will be purged.
System temporary/scratch directories
Files in system temporary and scratch directories, as well as home directories on Neptune, are not backed up. There is no provision for retaining data stored in these directories.