Main Page



Introduction to the City University of New York High Performance Computing Center

The City University of New York (CUNY) High Performance Computing Center (HPCC) is located on the campus of the College of Staten Island, 2800 Victory Boulevard, Staten Island, New York 10314. HPCC goals are to:

  • Support the scientific computing needs of CUNY faculty, students, and research staff, as well as their collaborators at other universities and their public and private sector partners;
  • Create opportunities for the CUNY research community to develop new partnerships with the government and private sectors; and
  • Leverage the HPCC's capabilities to acquire additional research resources for its faculty and graduate students in existing and major new programs.

Please send comments on, or corrections to, this wiki to HPChelp@mail.csi.cuny.edu.

Installed systems

The HPCC currently operates seven significant systems. The following table summarizes the characteristics of these systems; additional information is provided below the table.

[Summary table of installed systems: HPCC summary2.jpg]

Andy. Andy (andy.csi.cuny.edu) is named in honor of Dr. Andrew S. Grove, an alumnus of the City College of New York and one of the founders of the Intel Corporation (http://educationupdate.com/archives/2005/Dec/html/col-ccnypres.htm). Andy is composed of two distinct computational halves serviced by a single head node and several service nodes. The first and older half (Andy1) is an SGI ICE system (http://www.sgi.com/products/servers/altix/ice/) with 45 dual-socket compute nodes, each with two Intel 2.93 GHz quad-core Nehalem processors, providing a total of 360 compute cores. Each compute node has 24 Gbytes of memory, or 3 Gbytes of memory per core. Andy1's interconnect is a dual-rail, DDR Infiniband (20 Gbit/second) network in which one rail is used to access its Lustre storage system and the other is used for inter-processor communication. The second and newer half (Andy2) is a cluster of 48 SGI x340 1U compute nodes (each configured similarly to those in Andy1) connected to 24 quad-GPU Fermi S2050 accelerator nodes. Each socket in each x340 compute node on Andy2 has a companion GPU associated with it, for a total of 96 GPUs system-wide. Andy2's interconnect is a single-rail QDR Infiniband (40 Gbit/second) network serving both its communication network and its Lustre storage system. A portion of Andy2 (3 compute nodes, or 24 cores and 6 GPUs) is dedicated to GPU interactive and development work, while the rest (45 dual-socket compute nodes [360 cores] and 90 GPUs) is available for parallel and serial production work in either CPU-only or CPU-GPU mode. Both Andy1 and Andy2 are served by the same head node and home directory, which is a Lustre parallel file system with 24 Tbytes of usable storage.

Athena. Athena (athena.csi.cuny.edu), a Dell PowerEdge 1850, consists of one head node and 86 compute nodes. Each compute node has two sockets, each with an Intel 2.80 GHz Woodcrest dual-core processor providing a total of 4 cores per compute node. Athena has a total of 348 cores, 4 on the head node and 344 available for computation on the compute nodes. Each Athena compute node provides 2 Gbytes of memory per core for a total of 8 Gbytes per node. The interconnect network is a standard 1 Gbit Ethernet.

Bob. Bob (bob.csi.cuny.edu) is named in honor of Dr. Robert E. Kahn, an alumnus of the City College of New York who, along with Vinton G. Cerf, invented the TCP/IP protocol, the technology used to transmit information over the modern Internet (http://www.economicexpert.com/a/Robert:E:Kahn.htm). "Bob" is also a Dell PowerEdge system consisting of one head node with two sockets of AMD Shanghai native quad-core processors running at 2.3 GHz and twenty-nine compute nodes of the same type providing a total of 30 x 8 = 240 cores. Each compute node has 16 Gbytes of memory or 2 Gbytes of memory per core. "Bob" has both a standard 1 Gbit Ethernet interconnect and a low-latency, SDR Infiniband (10 Gbit/second) interconnect.

Karle. Karle (karle.csi.cuny.edu) is named in honor of Dr. Jerome Karle, an alumnus of the City College of New York who was awarded the Nobel Prize in Chemistry in 1985, jointly with Herbert A. Hauptman, for the direct analysis of crystal structures using X-ray scattering techniques. Karle functions both as a gateway and as an interface system for running MATLAB, SAS, MATHEMATICA, and other GUI-oriented applications for CUNY users both within and outside the local area network at the College of Staten Island, where the CUNY HPC Center is located. Karle can be used to run such computations (in serial or parallel) locally and directly on Karle, or to submit batch work over the network to the clusters Bob or Andy described above. As a single four-socket, 4 x 6 = 24-core head-like node, Karle is a highly capable system. Karle's 24 Intel E740-based cores run at 2.4 GHz. Karle has a total of 96 Gbytes of memory, or 4 Gbytes per core. Account allocation on Karle will be limited to those requiring access to the applications it is intended to run.

Neptune. Neptune (neptune.csi.cuny.edu) functions as a gateway or interface system for CUNY users who are not within the local area network at the College of Staten Island, where the CUNY HPC Center is located. Neptune can be addressed using the secure shell command ssh (ssh [-X] neptune.csi.cuny.edu). Neptune is only used as a secure jumping-off point to access other HPCC systems; HPC workloads should not be run on Neptune.
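For example, an off-campus user would first log in to Neptune and then hop from there to one of the HPC systems described above (the user name and the choice of 'andy' below are placeholders; substitute your own account and target system):

# from an off-campus machine, log in to the gateway first
ssh -X your_username@neptune.csi.cuny.edu

# then, from the Neptune prompt, continue on to the target system
ssh -X andy.csi.cuny.edu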

Salk. Salk (salk.csi.cuny.edu) is named in honor of Dr. Jonas Salk, also an alumnus of the City College of New York and creator of the first vaccine for Polio (http://en.wikipedia.org/wiki/Jonas_Salk#College). Salk is a Cray XE6m system interconnected with Cray's latest high-speed, custom Gemini interconnect. Salk consists of 160 dual-socket compute nodes each containing two 8-core AMD Magny-Cours processors running at 2.3 GHz, for a total of 16 cores per node. This gives the system a total of 1280 cores for the production processing of CUNY's HPC applications. Each node has a total of 32 Gbytes of memory, or 2 Gbytes of memory per core. Salk's Gemini interconnect is a high-bandwidth, low-latency, high-message-injection-rate interconnect supported by a custom ASIC and communications protocol developed by Cray. Unlike the other clusters at the CUNY HPC Center, which are connected in a multi-tiered switch topology, the Cray XE6m nodes supported by Gemini are laid out in a 2D torus network. Salk is intended to run jobs of a larger scale than the other CUNY HPC Center systems and/or parallel programs that use CoArray Fortran or Unified Parallel C.

Zeus. Zeus (zeus.csi.cuny.edu) is focused on supporting users running Gaussian09 and, now also, the development of CPU-GPU applications. This system (Dell PowerEdge 1950) consists of one head node (2 x 4 cores running at 1.86 GHz) and 18 compute nodes. Eight of the compute nodes (nodes 0 through 7) have two sockets with Intel 2.66 GHz quad-core Harpertown processors providing a total of eight cores per node. These 8 Harpertown nodes have 2 Gbytes of memory per core for a total of 16 Gbytes per node. Each Harpertown node also has a ~1 TByte disk drive (/state/partition1) for storing Gaussian scratch files.

Compute nodes 8 and 9 have two sockets with Intel 2.27 GHz Woodcrest dual-core processors and a total of 6 Gbytes of memory. Nodes 8 and 9 are also each attached to an NVIDIA Tesla S1070, 1U, 4-way GPU array via dual PCI-Express 2.0 cables to support integrated CPU-GPU computing. Each GPU (4 per 1U Tesla node) has 240 32-bit floating-point units with a peak performance of 1 teraflop (there are 30 64-bit units). Each GPU also has 4 Gbytes of GPU-local memory. Zeus has another 8 compute nodes (compute-0-10 through compute-0-17) that have a single socket with an Intel 2.86 GHz Woodcrest dual-core processor. They may also be used for Gaussian work and include a local 250 Gbyte disk drive for storing Gaussian scratch files. The interconnect network is a standard 1 Gbit Ethernet.

Software Overview

The operating system running on ATHENA, BOB, and ZEUS is CentOS, installed as part of the Rocks 5.3 release. The operating system running on ANDY is SLES 11 updated with the SGI ProPack SP1 support package. The operating system on SALK, the Cray Linux Environment 3.1 (CLE 3.1), is based on SLES 11. The queuing system in use on all CUNY HPC Center systems is PBS Pro 11, with a queue design that is as close to identical as possible across the systems. The user application software stack supported on all systems includes the following compilers and parallel library software; much more detail on each can be found below.

  • GNU C, C++ and Fortran compilers;
  • Portland Group, Inc. optimizing C, C++, and Fortran compilers;
  • The Intel Cluster Studio including the Intel C, C++ and Fortran compilers, Math and Kernel Library;
  • OpenMPI 1.5.1 (MPICH and Intel MPI may also be used, and, on ANDY, SGI's proprietary MPT)

SALK, the Cray XE6 system, uses its own proprietary MPI library based on the API to its Gemini interconnect. Cray also provides its own C, C++, and Fortran compilers, which support the Partitioned Global Address Space (PGAS) parallel programming models Unified Parallel C (UPC) and CoArray Fortran (CAF), respectively.


The following third party applications are currently installed, although not on every system described above. The CUNY HPC Center staff will be happy to work with any user interested in installing additional applications, subject to meeting that application's license requirements.

  • ADF (Amsterdam Density Functional Theory)
  • BEST
  • Bioperl
  • Blast
  • BUPC (Berkeley UPC)
  • CUDA
  • Dalton
  • DLPOLY
  • FASTA
  • Gauss (Economic Modeling)
  • Gaussian03
  • Gaussian09
  • Gromacs
  • HondoPlus
  • Lamarc
  • IMA2
  • Mathematica
  • MATLAB
  • MrBayes
  • Migrate
  • NAMD
  • Network Simulator2 (NS2)
  • NWCHEM
  • Octopus
  • Phoenics
  • R
  • RAxML
  • ROMS
  • Structure
  • Visualization/NAG
  • WRF (Weather Research and Forecasting Code)
  • WRF-Chem

The following graphics, IO, and scientific libraries are also supported.

  • Atlas
  • FFTW (2.1.5, 3.2.2, 3.3.0)
  • GRADS
  • GSL
  • HDF4
  • HDF5
  • IMSL
  • LAPACK
  • MET
  • NCAR
  • NETCDF
  • PNETCDF (Argonne)
  • SPARSEKIT

Hours of Operation

The second and fourth Tuesday mornings of each month, from 8:00 AM to 12:00 PM, are normally reserved (but not always used) for scheduled maintenance. Please plan accordingly. Unplanned maintenance to remedy system-related problems may be scheduled as needed. Reasonable attempts will be made to inform users running on those systems when these needs arise.

User Support

Users are encouraged to read this Wiki carefully. In particular, the sections on compiling and running parallel programs, and the section on the PBS Pro batch queueing system, will give you the essential knowledge needed to use the CUNY HPC Center systems. We have strived to maintain the most uniform user applications environment possible across the Center's systems to ease the transfer of applications and run scripts among them. Still, there are some differences, particularly on the SGI (ANDY) and Cray (SALK) systems.

The CUNY HPC Center staff, along with outside vendors, also offer regular courses to the CUNY community in parallel programming techniques, HPC computing architecture, and the essentials of using our systems. Please follow our mailings on the subject and feel free to inquire about such courses. We regularly schedule training visits and classes at the various CUNY campuses. Please let us know if such a training visit is of interest.

Users with further questions or requiring immediate assistance in use of the systems should send an email to:


hpchelp@mail.csi.cuny.edu

Mail to this address is received by the entire CUNY HPC Center support staff. This ensures that the person on staff with the most appropriate skill set and job related responsibility will respond to your questions. During the business week you should expect a same-day response. During the weekend you may or may not get same-day response depending on what staff are reading email that weekend. Please send all technical and administrative questions (including replies) to this address. Please do not send questions to individual CUNY HPC Center staff members directly.

The CUNY HPC Center staff are focused on providing high-quality support to the Center's user community. Please make full use of the tools that we have provided, and feel free to offer suggestions for improved service. We hope and expect that your experience in using our systems will be predictably good and productive.

Data storage, retention/deletion, and back-ups

Home Directories

Each user account, upon creation, is provided a home directory with a default 50 GB storage ceiling on each system. A user may request an increase in the size of their home directory if there is a special need. The HPCC will endeavor to satisfy reasonable requests, but storage is not unlimited and full file systems (especially large files) make backing up the system more difficult. Please regularly remove unwanted files and directories to minimize this burden.
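A quick way to see how close you are to the 50 GB ceiling, and to find the largest candidates for cleanup, is sketched below using standard Linux commands (no site-specific quota-reporting tool is assumed here):

# total size of your home directory
du -sh $HOME

# top-level files and directories, largest first (sizes in Kbytes)
du -sk $HOME/* | sort -nr | head -10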

An incremental backup of all home directories is performed daily. These backups are retained for three weeks. Full backups are performed weekly and are retained for two months. These backups are stored in a remote location. A full backup is read off tape, bi-monthly, and verified (to ensure backups are readable and restorable). The following user and system files are backed up:

/home
/usr
/
/var
mySQL

Retention/Deletion of Home Directories

For active accounts, current Home Directories are retained indefinitely. If a user account is inactive for one year, the HPCC will contact the user and request that the data be removed from the system. If there is no response from the user within three months of the initial notice, or if the user cannot be reached, the Home Directory will be purged.

System temporary/scratch directories

Files on system temporary and scratch directories are not backed up. There is no provision for retaining data stored in these directories.

Acknowledgements

The CUNY HPCC gratefully acknowledges support from the following sources:

• The acquisition of “Salk”, a Cray XE6m, was made possible by a grant from the National Science Foundation under award CNS-0958379.

• The acquisition of “Andy”, an SGI system with NVIDIA accelerators, was made possible by a grant from the National Science Foundation under award CNS-0855217, as well as funding from the New York City Council made possible through the efforts of Councilman James Oddo.

• CUNY HPCC facility upgrades were funded through the efforts of Staten Island Borough President James P. Molinaro.

• Operating funds for the CUNY HPCC are provided by the College of Staten Island and the City University of New York.

Users of the CUNY HPCC resources should include the following statement in papers, journal articles, and presentations:

“This research was supported, in part, under National Science Foundation Grants CNS-0958379 and CNS-0855217 and the City University of New York High Performance Computing Center.”

The CUNY HPCC requests that users of the system provide it with copies of any such publications. Publications can be forwarded electronically to Hpchelp@csi.cuny.edu.

Important Notice to Users

ANDY, ATHENA, BOB, NEPTUNE, SALK and ZEUS are all currently operational.

In the Fall of 2009, the Buildings and Grounds Department of the College of Staten Island completed a major infrastructure upgrade to the CUNY HPC facility in building 1M. This upgrade includes additional and improved air conditioning, a raised computer room floor, and additional electrical power. This facility upgrade (and additional ones still planned) has enabled the CUNY HPC Center to more easily integrate the several new systems that have been installed at the CUNY HPC Center in the last 12 to 18 months.

This list includes the following newest systems now in use at the new facility: ANDY1 (a 360-core SGI ICE system), ANDY2 (a 384-core, 96-GPU SGI CPU-GPU cluster), and SALK (a 1280-core Cray XE6m system). Another facility expansion planned for the Fall of 2011 will double the current usable floor space, power, and cooling, and provide room for additional CUNY HPC Center compute and storage equipment. A picture of the new facility is shown below:

[Photo of the upgraded CUNY HPC Center facility: Facility.jpg]

Program Compilation and Job Submission

Serial Program Compilation

The CUNY HPC Center supports four different compiler suites at this time: those from Cray, Intel, The Portland Group (PGI), and GNU. Basic serial programs in C, C++, and Fortran can be compiled with any of these offerings, although the Cray compilers are available only on SALK. Man pages (for Cray, 'man cc'; for Intel, 'man icc'; for PGI, 'man pgcc'; for GNU, 'man gcc') and manuals exist for each compiler in each suite and provide details on compiler flags. Optimized performance on a particular system with a particular compiler often depends on the compiler options chosen. Identical flags are accepted by the MPI wrapper equivalents derived from each suite (mpicc, mpif90, etc.). [NOTE: SALK does not use the mpi-prefixed MPI compile and run tools; it has its own.] Program debuggers and performance profilers are also part of each of the suites.

The Intel Compiler Suite

Intel's Cluster Studio (ICS) compilers, debuggers, profilers, and libraries are available on all HPC Center cluster systems.

To check for the default version installed:

icc  -V

Compiling a C program:

icc  -O3 -unroll mycode.c

The line above invokes Intel's C compiler (also used by Intel mpicc). It requests level 3 optimization and that loops be unrolled for performance. To find out more about 'icc', type 'man icc'.

Similarly for Intel Fortran and C++.

Compiling a Fortran program:

ifort -O3 -unroll mycode.f90

Compiling a C++ program:

icpc -O3 -unroll mycode.C

The Portland Group Compiler Suite

The Portland Group Inc. compilers, debuggers, profilers, and libraries are available on all HPC Center cluster systems.

To check for the default version installed:

pgcc  -V

Compiling a C program:

pgcc  -O3 -Munroll mycode.c

The line above invokes PGI's C compiler (also used by PGI mpicc). It requests level 3 optimization and that loops be unrolled for performance. To find out more about 'pgcc', type 'man pgcc'.

Similarly for Fortran and C++.

Compiling a Fortran program:

 pgf90 -O3 -Munroll mycode.f90

Compiling a C++ program:

 pgCC -O3 -Munroll  mycode.C

The Cray Compiler Suite

The HPC Center's Cray XE6 system, SALK, includes the Cray Compiler Environment (CCE) provided by Cray along with the others described here. Cray systems use the 'modules' utility to select a default compiler environment. More detail is provided on 'modules' later. Here we show you how to use modules to select Cray's programming environment, which includes CCE.

Load the Cray programming environment (available on SALK only):

module load PrgEnv-cray

To check for the default version installed:

cc  -V

Compiling a C program:

cc  -O3 -hunroll2 mycode.c

The line above invokes Cray's C compiler. It requests level 3 optimization and that all loops be unrolled for performance. To find out more about the Cray C compiler type 'man craycc'.

Similarly for Fortran and C++.

Compiling a Fortran program:

ftn -O3 -O unroll2 mycode.f90

Compiling a C++ program:

CC -O3 -hunroll2 mycode.C

NOTE: On SALK (Cray XE6), whatever compiler suite is selected using the 'module load' command shown above becomes the default, and the generic command names 'cc', 'ftn', and 'CC' shown here are symbolically associated with the underlying specific commands of that loaded suite (Cray, PGI, Intel, or GNU). The man pages for these generic names provide direction as to what the specific names and man pages are.
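As a hedged illustration (exact module names can vary with the CLE release installed on SALK), switching from the Cray suite to the PGI suite and confirming the change looks like this:

module list                          # show the currently loaded modules
module swap PrgEnv-cray PrgEnv-pgi   # replace the Cray suite with the PGI suite
cc -V                                # 'cc' now invokes the PGI C compiler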

The GNU Compiler Suite

The GNU compilers, debuggers, profilers, and libraries are available on all HPC Center cluster systems.

To check for the default version installed:

gcc  -v

Compiling a C program:

gcc  -O3 -funroll-loops mycode.c

The line above invokes GNU's C compiler (also used by GNU mpicc). It requests level 3 optimization and that loops be unrolled for performance. To find out more about 'gcc', type man gcc.

Similarly for Fortran and C++.

Compiling a Fortran program:

gfortran -O3 -funroll-loops mycode.f90

Compiling a C++ program (uses gcc):

gcc -O3 -funroll-loops mycode.C

OpenMP, OpenMP SMP-Parallel Program Compilation, and PBS Job Submission

All the compute nodes on all the clusters at the CUNY HPC Center include at least 2 sockets and multiple cores. Some have 4 cores per node (ATHENA), some have 8 (ZEUS, BOB, ANDY), and some have 16 (SALK). These multicore SMP compute nodes offer the CUNY HPC Center user community the option of creating parallel programs using the OpenMP Symmetric Multi-Processing (SMP) parallel programming model. SMP parallel programming with the OpenMP model (and other SMP models) has been around for a long time because early parallel HPC systems were built only with shared memories.

In the SMP model, multiple processors work within a single program image and the same memory space. This eliminates the need to copy data from one program (process) image to another (required by MPI) and simplifies the parallel run-time environment significantly. As such, writing parallel programs to the OpenMP standard is generally easier and requires fewer lines of code. However, the size of the problem that can be addressed using OpenMP is limited by the amount of memory on a single compute node, and the parallel performance improvement to be gained is limited by the number of processors (cores) within the node that can address that same memory space. As of Q3 2011 at CUNY's HPC Center, OpenMP applications can run with a maximum of 16 cores (this is on SALK, the Cray XE6 system).

Here, a simple OpenMP parallel version of the standard C "Hello, World!" program is set to run on 8 cores:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NPROCS 8

int main (int argc, char *argv[]) {

   int nthreads, num_threads=NPROCS, tid;

  /* Set the number of threads */
  omp_set_num_threads(num_threads);

  /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
  {

  /* Each thread obtains its thread number */
  tid = omp_get_thread_num();

  /* Each thread executes this print */
  printf("Hello World from thread = %d\n", tid);

  /* Only the master thread does this */
  if (tid == 0)
     {
      nthreads = omp_get_num_threads();
      printf("Total number of threads = %d\n", nthreads);
     }

   }  /* All threads join master thread and disband */

}

An excellent and comprehensive tutorial on OpenMP with examples can be found at the Lawrence Livermore National Laboratory web site (https://computing.llnl.gov/tutorials/openMP).

Compiling OpenMP Programs Using the Intel Compiler Suite

The Intel C compiler requires the '-openmp' option, as follows:

icc  -o hello_omp.exe -openmp hello_omp.c

When run this program produces the following output:

$ ./hello_omp.exe
Hello World from thread = 0
Total number of threads = 8
Hello World from thread = 1
Hello World from thread = 2 
Hello World from thread = 6
Hello World from thread = 4
Hello World from thread = 3
Hello World from thread = 5
Hello World from thread = 7

OpenMP is supported in both Intel's Fortran and C++ compilers as well.

Compiling OpenMP Programs Using the PGI Compiler Suite

The PGI C compiler requires the '-mp' option, as follows:

pgcc  -o hello_omp.exe -mp hello_omp.c

The program produces the same output, although the order of the print statements cannot be predicted and will not be the same over repeated runs. OpenMP is supported in both PGI's Fortran and C++ compilers as well.

Compiling OpenMP Programs Using the Cray Compiler Suite

The Cray C compiler requires the '-h omp' option, as follows:

cc  -o hello_omp.exe -h omp hello_omp.c

The program produces the same output, although the order of the print statements cannot be predicted and will not be the same over repeated runs. OpenMP is supported in both Cray's Fortran and C++ compilers as well.

Compiling OpenMP Programs Using the GNU Compiler Suite

The GNU C compiler requires the '-fopenmp' option, as follows:

gcc  -o hello_omp.exe -fopenmp hello_omp.c

The program produces the same output, although the order of the print statements cannot be predicted and will not be the same over repeated runs. OpenMP is supported in both GNU's Fortran and C++ compilers as well.

Submitting an OpenMP Program to the PBS Batch Queueing System

All non-trivial jobs (development or production, parallel or serial) must be submitted to HPC Center system compute nodes from each system's head node or login node using a PBS script. Jobs run interactively on system head nodes that place a significant and sustained load on the head node will be terminated. Details on the use of PBS are presented later in this document; however, here we present a basic PBS script ('my_ompjob') that can be used to submit any OpenMP SMP program for compute node batch processing.

#!/bin/bash
#PBS -q production
#PBS -N openMP_job
#PBS -l select=1:ncpus=8
#PBS -l place=pack
#PBS -V

# You must explicitly change to your working directory in PBS
# The PBS_O_WORKDIR variable is automatically filled with the path 
# to the directory you submit your job from

cd $PBS_O_WORKDIR

# The PBS_NODEFILE file contains the compute nodes assigned
# to the job by PBS.  Uncommenting the next line will show them.
# cat $PBS_NODEFILE

# It is possible to set the number of threads to be used in
# an OpenMP program using the environment variable OMP_NUM_THREADS.
# This setting is not used here because the number of threads (8)
# was fixed inside the program itself in our example code.
# export OMP_NUM_THREADS=8

./hello_omp.exe

When submitted with 'qsub my_ompjob' a job ID is returned and output will be written to the file 'openMP_job.oXXXX' where XXXX is the job ID.
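A typical submit-and-check session might look like the following (the job ID shown is illustrative only):

qsub my_ompjob          # returns a job ID such as 1234.<server>
qstat -u $USER          # list your jobs; Q = queued, R = running
cat openMP_job.o1234    # view the job's standard output once it finishes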

The key lines in the script are '-l select' and '-l place'. The first defines (1) resource chunk with '-l select=1' and assigns (8) cores to it with ':ncpus=8'. These 8 cores are to be used in concert by our OpenMP executable, hello_omp.exe. Next, the line '-l place=pack' instructs PBS to ensure that all the cores in the resource chunk that we are requesting are placed on a single physical compute node; appending ':excl' (i.e. '-l place=pack:excl') additionally ensures that no other job is placed on the same node.

Packed placement is a requirement for OpenMP jobs because each processor assigned to an OpenMP job works within a single program's memory space or image. If the processors assigned by PBS were on another physical node they would not be usable; if they were assigned to another job as well, they would not be fully available to the OpenMP program and would delay its completion. Here, the selection of (8) cores would consume all the cores available on a single compute node on either BOB or ANDY, forcing PBS to allocate an entire compute node to the OpenMP job. In this case, the OpenMP job will have all of the memory the compute node has at its disposal, knowing that no other job can use it. If fewer cores were selected (say 4), PBS could place another job on the same node using up to (4) cores, which would compete for memory resources proportionally. PBS offers the 'pack:excl' option to force exclusive placement even if the job uses fewer than all the cores on the physical node.

One thing that should be kept in mind when defining the resource requirements of and submitting any PBS script is that jobs with resource requests that are impossible to fulfill on the system where the job is submitted will be queued forever and never run. In our case here, we must know that the system that we are submitting this job to has (8) processors/cores available on a single physical node. At the HPC Center this job will run on the BOB and ANDY systems, but will be queued indefinitely on ATHENA because ATHENA has only 4 cores per physical node. This resource mapping requirement applies to any resource that you might request in your PBS script, not just cores. Resource definition and mapping is discussed in greater detail in the PBS section later in this document.
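If a job does appear to be stuck in the queued state, PBS Pro's 'qstat' command can usually explain why; the scheduler records its reasoning in the job's 'comment' field (the job ID below is illustrative):

qstat -a                          # show the state of all of your jobs
qstat -f 1234 | grep -i comment   # the scheduler's explanation for job 1234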

Note that on SALK, the Cray XE6 system, the final line in the script would require the use of the Cray 'aprun' command, as follows:


aprun -n 1 -d 16 ./hello_omp.exe

Here, 'aprun' is requesting that one process be allocated and that it be allowed to use all 16 cores available on a single SALK node. Because the production queue on SALK allows no jobs requesting fewer than 16 cores, the '-l select' would also have to be changed to:


#PBS -l select=1:ncpus=16
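Putting the two SALK-specific changes together, a complete Cray version of the OpenMP submit script might look like the sketch below (queue names follow the conventions described elsewhere in this document; OMP_NUM_THREADS is shown only as an illustration, since the example program fixes its own thread count):

#!/bin/bash
#PBS -q production
#PBS -N openMP_job
#PBS -l select=1:ncpus=16
#PBS -l place=pack
#PBS -V

cd $PBS_O_WORKDIR

# Optionally set the OpenMP thread count for codes that do not fix it internally
# export OMP_NUM_THREADS=16

# On the Cray, executables must be launched with 'aprun':
# one process (-n 1) allowed to use all 16 cores of a node (-d 16)
aprun -n 1 -d 16 ./hello_omp.exe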

MPI, MPI Parallel Program Compilation, and PBS Batch Job Submission

The Message Passing Interface (MPI) is a hardware-independent parallel programming and communications library callable from C, C++, or Fortran. Quoting from the MPI standard:


MPI is a message-passing application programmer interface (API), together with protocol and semantic specifications for how its features must behave in any implementation.


MPI has become the de facto standard approach for parallel programming in HPC. MPI is a collection of well-defined library calls composing an Applications Program Interface (API) for transferring data (packaged as messages) between completely independent processes with independent address spaces. These processes might be running within a single physical node or across distributed nodes connected by an interconnect such as Gigabit Ethernet or InfiniBand. MPI communication is generally two-sided, with both the sender and receiver of the data actively participating in the communication events. Both point-to-point and collective communication are supported. MPI's goals are high performance, scalability, and portability. MPI remains the dominant parallel programming model used in high-performance computing today, although it is sometimes criticized as being difficult to program with.

The original MPI-1 release was not designed with any special features to support traditional shared memory or distributed-shared memory parallel architectures, and MPI-2 provides only limited distributed, shared-memory support with some one-sided, remote direct memory access (RDMA) routines. Nonetheless, MPI programs are regularly run on shared memory computers because the MPI model is a parallel-architecture-neutral paradigm. Writing parallel programs using the MPI model (as opposed to shared-memory models such as OpenMP described above) requires the careful partitioning of program data among the communicating processes to minimize communication events, which can sap the performance of parallel applications when they are run at larger scale (with more processors).

The CUNY HPC Center supports several versions of MPI, including proprietary versions from Intel, SGI, and Cray; however, with the exception of the Cray, CUNY HPC Center systems have by default standardized on the public domain release of MPI called OpenMPI. While this version will not always perform as well as the proprietary versions mentioned above, it is a reliable version that can be run on most HPC cluster systems. Among the systems currently running at the CUNY HPC Center, only the Cray (SALK) does not support OpenMPI. In the discussion below, we therefore emphasize OpenMPI (except in our treatment of MPI on the Cray) because it can be run on almost every system the CUNY HPC Center supports. Details on how to use Intel's and SGI's proprietary MPIs, and on using MPICH, another public domain version of MPI, will be added later.

OpenMPI (completely different from and not to be confused with OpenMP described above) is a project combining technologies and resources from several previous MPI projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI) with the stated aim of building the best freely available MPI library. OpenMPI represents the merger between three well-known MPI implementations:

  • FT-MPI from the University of Tennessee
  • LA-MPI from Los Alamos National Laboratory
  • LAM/MPI from Indiana University

with contributions from the PACX-MPI team at the University of Stuttgart. These four institutions comprise the founding members of the OpenMPI development team which has grown to include many other active contributors and a very active user group.

These MPI implementations were selected because OpenMPI developers thought that each excelled in one or more areas. The stated driving motivation behind OpenMPI is to bring the best ideas and technologies from the individual projects and create one world-class open source MPI implementation that excels in all areas. The OpenMPI project names several top-level goals:

  • Create a free, open source software, peer-reviewed, production-quality complete MPI-2 implementation.
  • Provide extremely high, competitive performance (low latency or high bandwidth).
  • Directly involve the high-performance computing community with external development and feedback (vendors, 3rd party researchers, users, etc.).
  • Provide a stable platform for 3rd party research and commercial development.
  • Help prevent the "forking problem" common to other MPI projects.
  • Support a wide variety of high-performance computing platforms and environments.

At the CUNY HPC Center, OpenMPI may be used to run jobs compiled with the Intel, PGI, or GNU compilers. Two simple MPI programs, one written in C and another in Fortran, are shown below as examples. For details on programming in MPI, users should consider attending the CUNY HPC MPI workshop (3 days in length), refer to the many online tutorials, or read one of the many books on the subject. A good online tutorial on MPI can be found at LLNL here [1]. A tutorial on parallel programming in general can be found here [2].

Parallel implementations of the "Hello world!" program in C and Fortran are presented here to give the reader a feel for the look of MPI code. These sample codes can be used as test cases in the sections below describing parallel applications compilation and job submission. Again, refer to the tutorials mentioned above or attend the CUNY HPC Center MPI workshop for details on MPI programming.

Example 1. C Example (hello.c)
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
 int rank, size;

 MPI_Init (&argc, &argv);    /* starts MPI */
 /* get current process id */
 MPI_Comm_rank (MPI_COMM_WORLD, &rank);
 /* get number of processes */
 MPI_Comm_size (MPI_COMM_WORLD, &size);
 printf( "Hello world from process %d of %d\n", rank, size );
 MPI_Finalize();
 return 0;
}


Example 2. Fortran example (hello.f90)
program hello
include 'mpif.h'
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
   
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
print*, 'node', rank, ': Hello world'
call MPI_FINALIZE(ierror)
end

An excellent and comprehensive tutorial on MPI with examples can be found at the Lawrence Livermore National Laboratory web site (https://computing.llnl.gov/tutorials/mpi).

An Overview of the CUNY MPI Compilers and Batch Scheduler

PBS Pro 11 is the batch scheduling and queuing system on all CUNY HPC Center systems. The PBS Pro batch queues on all four CUNY HPC Center cluster systems (ANDY, ATHENA, BOB, and ZEUS) are identical in name and largely identical in operation, although maximum job size and job counts have been scaled to the size of each system. On the Cray system (SALK) the queues are very similar but not identical, having been modified to emphasize large core-count jobs. Still, submit scripts developed on one system should generally work on another, although some tuning for differences in the number of cores per compute node can yield performance benefits or more rapid conversion from a queued to a running state. The queue that production jobs should use on all these systems is production (ANDY2 also has a production_qdr and a production_gpu queue, which will be described later). Development work should use the development queue, and interactive work should use the interactive queue. The development and interactive queues have a small segment of each system's resources dedicated to them (except on ZEUS) and have a higher priority than the production queue. For details on the PBS Pro queues, please go to the detailed description of PBS Pro presented below.

Like the PBS Pro queues, the default compiler and MPI stack is the same on all four cluster systems, making it possible to transfer scripts from one system to another with little or no editing (the Cray is again an exception, as mentioned above). The default compilers are those released in February 2011 in Intel's Cluster Studio. The default MPI currently in use is OpenMPI 1.5.1, released in the Spring of 2011 and compiled with the aforementioned Intel compilers. Note that although the default MPI was compiled with the site-default Intel compilers, it is NOT Intel's MPI. Intel's MPI is available if required, but is not the default. In addition, the PGI 11.2 compiler suite (Spring 2011), which includes GPU acceleration compiler directives, is also available, as is OpenMPI 1.5.1 built with these PGI compilers. A user can easily toggle from Intel to PGI and back with the help of CUNY-provided scripts (see below). [NOTE: In the long run, the HPC Center intends to move to using the 'modules' utility, as the Cray already does.] Finally, the compiler and parallel applications stack provided by the Rocks 5.3 roll used to build the CUNY HPC cluster systems is available. This includes gcc 4.1.2 (on ANDY and SALK the 'gcc' defaults are 4.3.3 and 4.3.2, respectively) and either the OpenMPI 1.3.3 or MPICH2 MPI stack. In addition, on ANDY (1 and 2), SGI provides its own high-performance MPI stack called MPT. Users seeking maximum scalability on ANDY should consider using SGI's MPT MPI stack.

Using the MPI-derived compile and run commands (mpicc, mpif90, mpiCC, mpirun, etc.) without full Unix paths will deliver the default Intel-compiled versions of OpenMPI 1.5.1. (NOTE: The Cray system, SALK, uses its own command, 'aprun', to initiate MPI and other parallel jobs; it is presented in a Cray-specific section.) To use the other compiler stacks, care should be taken to update the PATH, MANPATH, and LD_LIBRARY_PATH variables in the user's environment on both the head and compute nodes. The scripts /etc/profile.d/smpi-defaults.[sh,csh] on each system can be adapted and placed in the appropriate "init" files (.bashrc, .cshrc, etc.) in the user's home directory to accomplish this. (NOTE: Home directories on the head and compute nodes are identical, so setting new defaults on the head node should take care of this for all nodes.) The options required by OpenMPI's mpirun should be the same regardless of the cluster used. This should be true even on the systems with InfiniBand interconnects (BOB and ANDY) because of the way OpenMPI was built on those systems: InfiniBand is selected automatically on the InfiniBand systems (BOB and ANDY) and Gigabit Ethernet on the Ethernet systems (ATHENA and ZEUS); if InfiniBand is unavailable, "mpirun" will report that fact and select the Gigabit Ethernet interconnect as an alternative. Sample, basic PBS Pro batch scripts for running parallel jobs are provided here, but there is much more detail provided in the PBS section below.
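To confirm which MPI stack your shell will actually pick up, two standard commands are enough (output varies by system and is not shown here):

which mpicc mpirun    # full paths of the wrapper commands your shell resolves
mpirun --version      # OpenMPI reports its release number (e.g. 1.5.1) here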

Sample Compilations and Production Batch Scripts

These examples should run consistently on all CUNY HPC Center systems (except on SALK).

Intel OpenMPI Parallel C

Compilation (because this is the default, the full path is NOT required):

/share/apps/openmpi-intel/default/bin/mpicc -o hello_mpi.exe ./hello_mpi.c

Intel OpenMPI Parallel FORTRAN

Compilation (again, because this is the default, the full path is NOT required):

/share/apps/openmpi-intel/default/bin/mpif90 -o hello_mpi.exe ./hello_mpi.f90

Intel OpenMPI PBS Submit Script

This script (my.job) sends PBS an 8-processor (core) job, allowing PBS to freely distribute the 8 processors to the least-loaded nodes. For details on the meaning of all the options in this script please see the full PBS Pro section below.

#!/bin/bash
#PBS -q production
#PBS -N openmpi_intel
#PBS -l select=8:ncpus=1
#PBS -l place=free
#PBS -V

# You must explicitly change to your working directory in PBS
# The PBS_O_WORKDIR variable is automatically filled with the path 
# to the directory you submit your job from

cd $PBS_O_WORKDIR

# The PBS_NODEFILE file contains the compute nodes assigned
# to the job by PBS.  Uncommenting the next line will show them.
# cat $PBS_NODEFILE

# Because OpenMPI compiled with the Intel compilers is the default,
# the full path here is NOT required.

/share/apps/openmpi-intel/default/bin/mpirun -np 8 -machinefile $PBS_NODEFILE ./hello_mpi.exe

When submitted with 'qsub my.job' a job ID is returned and output will be written to the file 'openmpi_intel.oXXXX', where XXXX is the job ID.

MPI hello world output:

Hello world from process 2 of 8
Hello world from process 3 of 8
Hello world from process 4 of 8
Hello world from process 1 of 8
Hello world from process 6 of 8
Hello world from process 0 of 8
Hello world from process 5 of 8
Hello world from process 7 of 8

Portland Group OpenMPI Parallel C

Compilation (because this is NOT the default, the full path is shown, but the environment would still need to be toggled to ensure a clean compile under the PGI environment [see below]):

/share/apps/openmpi-pgi/default/bin/mpicc -o hello_mpi.exe ./hello_mpi.c

Portland Group OpenMPI Parallel FORTRAN

Compilation (again, because this is NOT the default, the full path is shown, but the environment would still need to be toggled to ensure a clean compile under the PGI environment [see below]):

/share/apps/openmpi-pgi/default/bin/mpif90 -o hello_mpi.exe ./hello_mpi.f90

Portland Group OpenMPI PBS Submit Script

This script sends PBS an 8-processor (core) job, allowing PBS to freely distribute the 8 processors to the least-loaded nodes. (Note: the only real difference between this script and the Intel script above is in the path to the mpirun command.) For details on the meaning of all the options in this script please see the full PBS Pro section below.

#!/bin/bash
#PBS -q production
#PBS -N openmpi_pgi
#PBS -l select=8:ncpus=1
#PBS -l place=free
#PBS -V

# You must explicitly change to your working directory in PBS
# The PBS_O_WORKDIR variable is automatically filled with the path 
# to the directory you submit your job from

cd $PBS_O_WORKDIR

# The PBS_NODEFILE file contains the compute nodes assigned
# to the job by PBS.  Uncommenting the next line will show them.
# cat $PBS_NODEFILE

# Because OpenMPI PGI is NOT the default, the full path is shown,
# but this does not guarantee a clean run. You must ensure that
# the environment has been toggled to PGI either in the batch script
# or within your init files (see section below).

/share/apps/openmpi-pgi/default/bin/mpirun -np 8 -machinefile $PBS_NODEFILE ./hello_mpi.exe

Similarly, when submitted with 'qsub myjob' a job ID is returned and output will be written to the file 'openmpi_pgi.oXXXX', where XXXX is the job ID.

MPI hello world output:

Hello world from process 2 of 8
Hello world from process 3 of 8
Hello world from process 4 of 8
Hello world from process 1 of 8
Hello world from process 6 of 8
Hello world from process 0 of 8
Hello world from process 5 of 8
Hello world from process 7 of 8

Cray MPI Parallel C

Compilation (because this is the default, the full path is NOT required):

/opt/cray/xt-asyncpe/4.9/bin/cc -o hello_mpi.exe ./hello_mpi.c

Cray MPI Parallel FORTRAN

Compilation (again, because this is the default, the full path is NOT required):

/opt/cray/xt-asyncpe/4.9/bin/ftn -o hello_mpi.exe ./hello_mpi.f90

Also, note that no special wrapper or options are required to instruct SALK's Cray compiler to link in the Cray MPI libraries. The Cray C and Fortran compilers do this automatically.

Cray MPI PBS Submit Script

This script sends PBS a 16-processor (core) job, allowing PBS to freely distribute the 16 processors to the least-loaded nodes. This script asks for 16 cores instead of 8 because the smallest production job allowed on the Cray (SALK) is 16 cores. (NOTE: serial jobs can only be run in the development queue on the Cray.) For details on the meaning of all the options in this script please see the full PBS Pro section below.

#!/bin/bash
#PBS -q production
#PBS -N openmpi_pgi
#PBS -l select=16:ncpus=1
#PBS -l place=free
#PBS -V

# You must explicitly change to your working directory in PBS
# The PBS_O_WORKDIR variable is automatically filled with the path 
# to the directory you submit your job from

cd $PBS_O_WORKDIR

# The PBS_NODEFILE file contains the compute nodes assigned
# to the job by PBS.  Uncommenting the next line will show them.
# cat $PBS_NODEFILE

aprun -n 16 -N 16  ./hello_mpi.exe

Similarly, when submitted with 'qsub myjob' a job ID is returned and output will be written to the file 'openmpi_pgi.oXXXX', where XXXX is the job ID. Note the presence of the 'aprun' command, which replaces 'mpirun' in this Cray-specific PBS submit script. More can be found on 'aprun' and its command-line options below and by entering the command 'man aprun' on the Cray. The Cray requires the following preparation step to place PBS and its commands into your working environment.


module load pbs

This should be placed into your shell's 'init' file to ensure that it occurs automatically.
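For a bash user this amounts to appending the line to '~/.bashrc' on SALK (adjust for your own shell and init file):

# in ~/.bashrc on SALK
module load pbs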

MPI hello world output (shown here from an 8-process run; the 16-process SALK job produces 16 such lines, in unpredictable order):

Hello world from process 2 of 8
Hello world from process 3 of 8
Hello world from process 4 of 8
Hello world from process 1 of 8
Hello world from process 6 of 8
Hello world from process 0 of 8
Hello world from process 5 of 8
Hello world from process 7 of 8

GNU OpenMPI Parallel C

Compilation (because this is NOT the default, the full path is shown, but the environment would still need to be toggled to ensure a clean compile under the GNU environment [see below]). Note that the GNU version of OpenMPI is installed in a different location:

/opt/openmpi/bin/mpicc -o hello_mpi.exe ./hello_mpi.c

GNU OpenMPI Parallel FORTRAN

Compilation (again, because this is NOT the default, the full path is shown, but the environment would still need to be toggled to ensure a clean compile under the GNU environment [see below]):

/opt/openmpi/bin/mpif90 -o hello_mpi.exe ./hello_mpi.f90

The above description applies to running the GNU version of the MPI commands on the CUNY HPC Center's Gigabit Ethernet, Rocks 5.3 systems (ATHENA and ZEUS). On BOB, which is an InfiniBand system, the path to the MPI commands is different:

/usr/mpi/gcc/openmpi-1.2.8/bin/[mpicc,mpif90,mpirun,etc.]

GNU OpenMPI PBS Submit Script

This script sends PBS an 8-processor (core) job, allowing PBS to freely distribute the 8 processors to the least-loaded nodes. (Note: the only real difference between this script and the Intel script above is in the path to the mpirun command.) For details on the meaning of all the options in this script please see the full PBS Pro section below.

#!/bin/bash
#PBS -q production
#PBS -N openmpi_gnu
#PBS -l select=8:ncpus=1
#PBS -l place=free
#PBS -V

# You must explicitly change to your working directory in PBS
# The PBS_O_WORKDIR variable is automatically filled with the path 
# to the directory you submit your job from

cd $PBS_O_WORKDIR

# The PBS_NODEFILE file contains the compute nodes assigned
# to the job by PBS.  Uncommenting the next line will show them.
# cat $PBS_NODEFILE

# Because OpenMPI GNU is NOT the default, the full path is shown,
# but this does not guarantee a clean run. You must ensure that
# the environment has been toggled to GNU either in the batch script
# or within your init files (see section below).

/opt/openmpi/bin/mpirun -np 8 -machinefile $PBS_NODEFILE ./hello_mpi.exe

Similarly, when submitted with 'qsub myjob' a job ID is returned and output will be written to the file 'openmpi_gnu.oXXXX', where XXXX is the job ID.

MPI hello world output:

Hello world from process 2 of 8
Hello world from process 3 of 8
Hello world from process 4 of 8
Hello world from process 1 of 8
Hello world from process 6 of 8
Hello world from process 0 of 8
Hello world from process 5 of 8
Hello world from process 7 of 8

NOTE: The paths used above for the gcc version of OpenMPI apply to ATHENA and ZEUS, which have Gigabit Ethernet interconnects. On BOB the path to the InfiniBand version of the gcc OpenMPI commands and libraries is:

/usr/mpi/gcc/openmpi-1.2.8/[bin,lib]

Other Custom Versions of the MPI Stack

The GNU version of OpenMPI is NOT available on ANDY, which is an SGI system running SLES 11 and is not based on Rocks and Red Hat. SGI's optimized version of MPI, called MPT, is available on ANDY in addition to OpenMPI. SGI's MPT mpirun command is located in:

/opt/sgi/mpt/mpt-1.25/bin/mpirun

Compiling MPI programs for SGI's MPT does not require the MPI wrapper commands that OpenMPI does, but instead uses compiler flags offered directly to the native compilers (icc, pgcc). For more information on using SGI's MPT please inquire with the HPC Center staff or consult the SGI documentation.
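As a hedged sketch only (the '-lmpi' link flag and the single-node mpirun form below follow SGI's usual MPT conventions, but should be confirmed against the SGI documentation before use), an MPT build and run might look like:

# compile and link against SGI MPT with the Intel compiler
icc -o hello_mpi.exe hello_mpi.c -lmpi

# launch with SGI's mpirun rather than OpenMPI's
/opt/sgi/mpt/mpt-1.25/bin/mpirun -np 8 ./hello_mpi.exe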

Setting Your Preferred MPI and Compiler Defaults

As mentioned above the default version of MPI on the CUNY HPC Center clusters is OpenMPI 1.5.1 compiled with the Intel compilers. This default is set by scripts in the /etc/profile.d directory (i.e. smpi-defaults.[sh,csh]). When the mpi-wrapper commands (mpicc, mpif90, mpirun, etc.) are used WITHOUT full path prefixes, these Intel defaults will be invoked. To use either of the other supported MPI environments (OpenMPI compiled with the PGI compilers, or OpenMPI compiled with the GNU compilers) users should set their local environment either from their home directory init files (i.e. .bashrc, .cshrc) or manually in their batch scripts. The script provided below can be used for this.

WARNING: Full path references to non-default mpi-commands will NOT guarantee clean compiles and runs because of the way OpenMPI references the environment it runs in!!

CUNY HPC Center staff recommend fully toggling the site default environment away from Intel to PGI or GNU when the non-default environments are preferred. This can be done relatively easily by commenting out the default and commenting in one of the preferred alternatives referenced in the script provided below. Users may copy the script smpi-defaults.sh (or smpi-defaults.csh) from /etc/profile.d. A copy is provided here for reference. (NOTE: This discussion does NOT apply on the Cray, which uses the 'modules' system to manage its default applications environment.)

# general path settings 
#PATH=/opt/openmpi/bin:$PATH
#PATH=/usr/mpi/gcc/openmpi-1.2.8/bin:$PATH
#PATH=/share/apps/openmpi-pgi/default/bin:$PATH
#PATH=/share/apps/openmpi-intel/default/bin:$PATH
export PATH

# man path settings 
#MANPATH=/opt/openmpi/share/man:$MANPATH
#MANPATH=/usr/mpi/gcc/openmpi-1.2.8/share/man:$MANPATH
#MANPATH=/share/apps/openmpi-pgi/default/share/man:$MANPATH
#MANPATH=/share/apps/openmpi-intel/default/share/man:$MANPATH
export MANPATH

# library path settings 
#LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
#LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.2.8/lib:$LD_LIBRARY_PATH
#LD_LIBRARY_PATH=/share/apps/openmpi-pgi/default/lib:$LD_LIBRARY_PATH
#LD_LIBRARY_PATH=/share/apps/openmpi-intel/default/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

By selectively commenting in the appropriate line in each paragraph above, the default PATH, MANPATH, and LD_LIBRARY_PATH can be set to the MPI compilation stack that the user prefers. The right place to do this is inside the user's .bashrc file (or .cshrc file in the C-shell) in the user's HOME directory. Once done, full path references in the PBS submit scripts listed above become unnecessary and one script will work for any compilation stack.
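One way to do this, sketched here under the assumption of a bash login shell, is to keep a private copy of the defaults script and source it from '.bashrc':

# copy the site script into your home directory and edit it,
# commenting in the openmpi-pgi (or openmpi-gnu) lines
cp /etc/profile.d/smpi-defaults.sh ~/my-mpi-defaults.sh

# then source the edited copy from ~/.bashrc
echo "source ~/my-mpi-defaults.sh" >> ~/.bashrc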

Getting the Right Interconnect for High Performance MPI

A few comments should be made about interconnect control and selection under OpenMPI. First, this question applies ONLY to ANDY and BOB which have both InfiniBand and Gigabit Ethernet interconnects. InfiniBand provides both greater bandwidth and lower latencies than Gigabit Ethernet, and it should be chosen on these systems because it will deliver better performance at a given processor count and greater application scalability.

Both the Intel and Portland Group versions of OpenMPI installed on both ANDY and BOB have been compiled to include the OpenIB libraries. This means that by default the mpirun command will attempt to use the OpenIB libraries at runtime without any special options. If this cannot be done because no InfiniBand devices can be found, a runtime error message will be reported in PBS Pro's error file, and mpirun will attempt to use other libraries and interfaces (namely GigaBit Ethernet, which is TCP/IP based) to run the job. If successful, the job will run to completion, but perform in a sub-optimal way.

To avoid this, or to establish with certainty which communication libraries and devices are being used by your job, there are options that can be used with mpirun to force the choice of one communication device, or the other.

To force the job to use the OpenIB interface (ib0) or fail, use:

mpirun  -mca btl openib,self -np  8 -machinefile $PBS_NODEFILE ./hello_mpi.exe

To force the job to use the GigaBit Ethernet interface (eth0) or fail, use:

mpirun  -mca btl tcp,self -np  8 -machinefile $PBS_NODEFILE ./hello_mpi.exe

Note, this discussion does not apply on the Cray, which uses its own proprietary Gemini interconnect. It is worth noting that the Cray's interconnect is not switch-based like the other systems', but rather a 2D toroidal mesh, on which awareness of a job's placement can be an important consideration when tuning a job for performance at scale.

GPU Parallel Program Compilation and PBS Job Submission

The CUNY HPC Center supports computing with Graphics Processing Units (GPUs). GPUs can be thought of as highly parallel co-processors (or accelerators) connected to a node's CPUs via a PCI Express bus. The HPC Center provides GPU accelerators on two systems, ZEUS (largely for development purposes) and ANDY (for development and production). ZEUS has a rack-mounted NVIDIA Tesla S1070 attached to two dual-socket, dual-core x86-64 compute nodes (compute-0-8 and compute-0-9). This arrangement provides 4 GPUs, one per socket, for CUDA and OpenCL development work on ZEUS. Recently (October of 2010), the HPC Center upgraded ANDY, its fully configured CPU-GPU cluster, with new 1U rack-mounted NVIDIA Fermi S2050 nodes, each of which includes 4 Fermi GPUs of 448 cores apiece. Referred to as ANDY2, this system is coupled to (although distinct from) ANDY1, installed in December of 2009. ANDY2 combines an additional 384 Nehalem cores with 96 NVIDIA Fermi GPUs (4 per Fermi 1U form factor). Each of the 96 Fermis has 448 light-weight cores for parallel floating-point or integer calculation. The details of ANDY's (ANDY1 and ANDY2) architecture are described above in the CUNY HPC Center's system description section.

Each NVIDIA Fermi S2050 (and Tesla S1070) includes 4 NVIDIA GPUs enhanced for scientific use. Two of these 4 GPUs are connected (one per socket) to each x86-64 compute node via a single 16x PCI-Express 2.0 cable. In combination, a Fermi's 448 cores (clocked at 1.147 GHz at CUNY) are capable of 515 double-precision GFlops and more than 1 TFlops single precision. This gives each four-GPU Fermi S2050 a peak single-precision performance of over 4 TFlops and a peak double-precision performance of over 2 TFlops. In combination, the peak single-precision performance of the 96 Fermi GPUs available on ANDY2 is over 100 TFlops. The Tesla GPU's 240 cores (clocked at 1.296 GHz at CUNY) on ZEUS are capable of 993 GFlops in single precision and 78 GFlops in double precision.

Two distinct parallel programming approaches for the HPC Center's GPU resources are described here. The first (a compiler-directives-based extension available in the Portland Group, Inc. (PGI) C and Fortran compilers) delivers ease of use at the expense of somewhat less than highly tuned performance. The second (NVIDIA's Compute Unified Device Architecture, CUDA C, or PGI's CUDA Fortran GPU programming model) provides the ability within C or Fortran to address the GPU hardware more directly for better performance, but at the expense of a somewhat greater programming effort. We will introduce both approaches here, and present the basic steps for GPU parallel program compilation and job submission using PBS for both as well.

GPU Parallel Programming with the Portland Group Compiler Directives

The Portland Group, Inc. (PGI) has taken the lead in building a general purpose, accelerated parallel computing model into its compilers. Programmers can access this new technology at CUNY using PGI's compiler, which supports the use of GPU-specific, compiler directives in standard C and Fortran programs. Compiler directives simplify the programmer's job of mapping parallel kernels onto accelerator hardware and do so without compromising the portability of the user's application. Such a directives-parallelized code can be compiled and run on either the CPU-GPU together, or on the CPU alone. At this time, PGI supports the current, HPC-oriented GPU accelerator products from NVIDIA, but intends to extend its compiler-directives-based approach in the future to other accelerators.

The simplicity of coding with directives is illustrated here with a sample code ('vscale.c') that does a simple, iteration-independent scaling of a vector on both the GPU and CPU in single precision and compares the results:

        #include <stdio.h>
        #include <stdlib.h>
        #include <assert.h>
        
        int main( int argc, char* argv[] )
        {
            int n;      /* size of the vector */
            float *restrict a;  /* the vector */
            float *restrict r;   /* the results */
            float *restrict e;  /* expected results */
            int i;

            /* Set array size */
            if( argc > 1 )
                n = atoi( argv[1] );
            else
                n = 100000;
            if( n <= 0 ) n = 100000;
        
            /* Allocate memory for arrays */
            a = (float*)malloc(n*sizeof(float));
            r = (float*)malloc(n*sizeof(float));
            e = (float*)malloc(n*sizeof(float));

            /* Initialize array */
            for( i = 0; i < n; ++i ) a[i] = (float)(i+1);
        
            /* Scale array and mark for acceleration */
            #pragma acc region
            {
                for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
            }

            /* Scale array on the host to compare */
            for( i = 0; i < n; ++i ) e[i] = a[i]*2.0f;

            /* Check the results and print */
            for( i = 0; i < n; ++i ) assert( r[i] == e[i] );

            printf( "%d iterations completed\n", n );

            return 0;
        }

In this simple example, the only instruction to the compiler required to direct this vector-scaling kernel to the GPU is the compiler directive:

 #pragma acc region

that precedes the second C 'for' loop. A user can build a GPU-ready executable ('vscale.exe' in this case) for execution on ZEUS or ANDY with the following compilation statement:

pgcc -o vscale.exe vscale.c -ta=nvidia -Minfo=accel -fast

The option '-ta=nvidia' declares to the compiler what the destination hardware acceleration technology is going to be (PGI's model is intended to be general, although its implementation for NVIDIA's GPU accelerators is the most advanced to date), and the '-Minfo=accel' option requests output describing what the compiler did to accelerate the code. This output is included here:

main:
     32, Generating copyin(a[0:n-1])
            Generating copyout(r[0:n-1])
     34, Loop is parallelizable
            Accelerator kernel generated
            34, #pragma acc for parallel, vector(256)
                  Using register for 'a'

In the output, the compiler reports what data it intends to copy between CPU memory and GPU accelerator memory. It explains that the C 'for' loop has no loop iteration dependencies and can be run on the accelerator in parallel. It also indicates the vector length (256), the block size of the work to be done on the GPU. Because the array pointer 'a' is declared with the 'restrict' qualifier, it can point only into 'a'. This assures the compiler that pointer-alias-related loop dependencies cannot occur.
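
Although not required for this example, the scheduling choices reported by the compiler can also be written explicitly in the source. The fragment below is a hedged sketch only; the clause syntax simply mirrors the compiler feedback shown above (here asking for a vector length of 128), and the PGI guides remain the authoritative reference for the directive set:

    /* Hypothetical variant of the accelerated loop above: the loop-level
       directive mirrors the compiler feedback and requests a vector
       length of 128 instead of the compiler's default choice. */
    #pragma acc region
    {
        #pragma acc for parallel, vector(128)
        for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
    }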

The Portland Group C and Fortran Programming Guides provide a complete description of PGI's accelerator compiler-directives programming model [3]. Additional introductory material can be found in four PGI white paper tutorials (part1, part2, part3, part4), here: [4], [5], [6], [7].

Submitting Portland Group, GPU-Parallel Programs Using PBS

CUNY has set up special batch queues on ZEUS and ANDY to direct GPU-ready executables to those compute nodes that are connected to GPUs. GPU job submission is very much like other batch job submission under PBS. Here is an example PBS script that can be used to run the GPU-ready executable created above on ANDY (on ZEUS, the queue would be changed to 'development_gpu' and the accelerator name would be changed to 'tesla'):

#!/bin/bash
#PBS -q production_gpu
#PBS -N PGI_GPU_job
#PBS -l select=1:ncpus=1:ngpus=1:accel=fermi
#PBS -l place=free
#PBS -V

echo "Starting PGI GPU job ..."

cd $PBS_O_WORKDIR

echo $PBS_NODEFILE
cat  $PBS_NODEFILE

./vscale.exe

echo "PGI GPU job is done!"

Focusing on what is required and different from non-GPU jobs, the first requirement is the use of the 'production_gpu' routing queue with the '-q' option to PBS. (Note: On Zeus there is no 'production_gpu' queue. Use the 'development_gpu' queue instead):

#PBS -q production_gpu

The second requirement comes on the '-l select' line:

#PBS -l select=1:ncpus=1:ngpus=1:accel=fermi

Here, the script requests 1 PBS Pro resource chunk ('-l select=1') composed of one standard x86-64 processor (ncpus=1) and one GPU (ngpus=1), and, most importantly, it asks that the resource chunk use a compute node with an NVIDIA Fermi GPU accelerator attached (accel=fermi). On the ANDY2 (GPU) side of ANDY, every compute node (gpute-0 through gpute-47) has 2 Fermi GPUs attached (2 x 48 = 96 in all, again one per compute node CPU socket). On ZEUS, only compute-0-8 and compute-0-9 have Tesla GPUs attached, also with one GPU per CPU socket. On ZEUS, the accelerator name would have to be changed to 'accel=tesla'.

These are the essential PBS script requirements for submitting any GPU-Device-ready executable. They apply to the directives-based executable compiled above, but the same script can also be used to run GPU-ready executable code generated from native CUDA C or Fortran, as described in the next example. In the case above, the PGI compiler-directive-marked loop will run in parallel on a single NVIDIA GPU after the data in array 'a' is copied to it across the PCI-Express bus. Other variations are possible, including jobs that combine MPI or OpenMP (or even both) with GPU parallel programming in a single GPU-SMP-MPI multi-parallel job. The HPC Center staff has created code examples that illustrate these multi-parallel programming model approaches and will provide them to interested users at the HPC Center.

GPU Parallel Programming with NVIDIA's CUDA C or PGI's CUDA Fortran Programming Models

The previous section described the recent advances in compiler development from PGI that make utilizing the data-parallel compute power of the GPU more accessible to C and Fortran programmers. Yet, for over 3 years NVIDIA has offered and continued to develop its Compute Unified Device Architecture (CUDA), a direct, NVIDIA-GPU-specific programming environment for C programmers. More recently, PGI has released CUDA Fortran jointly with NVIDIA, offering a second language choice for programming NVIDIA GPUs using CUDA.

In this section, the basics of compiling and running CUDA C and Fortran applications at the CUNY HPC Center are covered. The current default version of CUDA in use at the CUNY HPC Center as of 8-1-11 is CUDA release 3.2.

CUDA is a complete programming environment that includes:

1. A modified version of the C or Fortran programming language for programming the GPU Device and moving data between the CPU Host and the GPU Device.

2. A runtime environment and translator that generates and runs device-specific, CPU-GPU executables from more generic, single, mixed-instruction-set executables.

3. A Software Development Kit (SDK), HPC application-related libraries, and documentation to support the development of CUDA applications.

NVIDIA and PGI have put a lot of effort into making CUDA a flexible, full-featured, and high-performance programming environment similar to those used in HPC to program CPUs. However, CUDA is still a 2-instruction-set, CPU-GPU programming model that must manage two separate memory spaces linked only by the compute node's PCI-Express bus. As such, programming GPUs using CUDA is more complicated than PGI's compiler-directives-based approach presented above, which hides many details from the programmer. Still, CUDA's more explicit, close-to-the-hardware approach offers CUDA programmers the chance to get the best possible performance from the GPU for their particular application.

Adapting a current application or writing a new one for the CUDA CPU-GPU programming model involves dividing that application into those parts that are highly data-parallel and better suited for the GPU Device (the so-called GPU Device code, or device kernel(s)) and those parts that have little or limited data-parallelism and are better suited for execution on the CPU Host (the driver code, or the CPU Host code). In addition, one should inventory the amount of data that must be moved between the CPU Host and GPU Device relative to the amount of GPU computation for each candidate data-parallel GPU kernel. Kernels whose compute-to-communication time ratios are too low should be executed on the CPU.
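
As a rough, hedged illustration of this kind of inventory (the bandwidth figure below is a nominal PCI-Express 2.0 x16 value, not a measured one), the transfer cost for a kernel's data might be estimated as follows and compared against the expected GPU compute time:

#include <stdio.h>

/* Hypothetical back-of-envelope helper: estimated seconds needed to move
   'bytes' of data across a nominal 8 GB/s PCI-Express 2.0 x16 link. */
static double transfer_seconds(double bytes)
{
    const double pcie_bytes_per_sec = 8.0e9;   /* nominal, assumed value */
    return bytes / pcie_bytes_per_sec;
}

int main(void)
{
    double bytes = 100.0 * 1024 * 1024;        /* e.g. a 100 MByte array */

    /* A kernel whose expected GPU compute time is not comfortably larger
       than this estimate is probably better left on the CPU. */
    printf("Estimated transfer time: %g seconds\n", transfer_seconds(bytes));
    return 0;
}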

With the natural GPU-CPU divisions in the application marked, what were once host kernels (usually substantial looping sections in the host code) must be recoded in CUDA C or Fortran for the GPU Device. Also, Host CPU-to-GPU interface code for transferring data to and from the GPU, and for calling the GPU kernel, must be written. Once these steps are completed and the host driver and GPU kernel code are compiled with NVIDIA's 'nvcc' compiler driver (or PGI's CUDA Fortran compiler), the result is a fully executable, mixed CPU-GPU binary (single file, dual instruction set) that typically does the following for each GPU kernel it calls:

1.  Allocates memory for required CPU source and destination arrays on the CPU Host.

2.  Allocates memory for GPU input, intermediate, and result arrays on the GPU Device.

3.  Initializes and/or assigns values to these arrays.

4.  Copies any required CPU Host input data to the GPU Device.

5.  Defines the GPU Device grid, block, and thread dimensions for each GPU kernel.

6.  Calls (executes) the GPU Device kernel code from the CPU Host driver code.

7.  Copies the required GPU Device results back to the CPU Host.

8.  Frees (and perhaps zeroes) memory on the CPU Host and GPU Device that is no longer needed.

The details of the actual coding process are beyond the scope of the discussion here, but are treated in depth in NVIDIA's CUDA C Training Class notes, in NVIDIA's CUDA C Programming Guide, and in PGI's CUDA Fortran Programming Guide [8], [9] and in many tutorials and articles on the web [10].

A Sample CUDA GPU Parallel Program Written in NVIDIA's CUDA C

Here, we present a basic example of a CUDA C application that includes code for all the steps outlined above. It fills and then increments a 2D array on the GPU Device and returns the results to the CPU Host for printing. The example code is presented in two parts--the CPU Host setup or driver code, and the GPU Device or kernel code. This example comes from the suite of examples used by NVIDIA in its CUDA Training Class notes. There are many more involved and HPC-relevant examples (matrixMul, binomialOptions, simpleCUFFT, etc.) provided in NVIDIA's Software Development Toolkit (SDK) which any user of CUDA may download and install in their home directory on their CUNY HPC Center account.

The basic example's CPU Host CUDA C code or driver, simple3_host.cu, is:

#include <stdio.h>

extern __global__ void mykernel(int *d_a, int dimx, int dimy);

int main(int argc, char *argv[])
{
   int dimx = 16;
   int dimy = 16;
   int num_bytes = dimx * dimy * sizeof(int);

   /* Initialize Host and Device Pointers */
   int *d_a = 0, *h_a = 0;

   /* Allocate memory on the Host and Device */
   h_a = (int *) malloc(num_bytes);
   cudaMalloc( (void**) &d_a, num_bytes);

   if( 0 == h_a || 0 == d_a ) {
       printf("couldn't allocate memory\n"); return 1;
   }

   /* Initialize Device memory */
   cudaMemset(d_a, 0, num_bytes);

   /* Define kernel grid and block size */
   dim3 grid, block;
   block.x = 4;
   block.y = 4;
   grid.x = dimx/block.x;
   grid.y = dimy/block.y;

   /* Call Device kernel, asynchronously */
   mykernel<<<grid,block>>>(d_a, dimx, dimy);

   /* Copy results from the Device to the Host*/
   cudaMemcpy(h_a,d_a,num_bytes,cudaMemcpyDeviceToHost);

   /* Print out the results from the Host */
   for(int row = 0; row < dimy; row++) {
      for(int col = 0; col < dimx; col++) {
         printf("%d", h_a[row*dimx+col]);
      }
      printf("\n");
   }

   /* Free the allocated memory on the Device and Host */
   free(h_a);
   cudaFree(d_a);

   return 0;

}

The GPU Device CUDA C kernel code, simple3_device.cu, is:

__global__ void mykernel(int *a, int dimx, int dimy)
{
   int ix = blockIdx.x*blockDim.x + threadIdx.x;
   int iy = blockIdx.y*blockDim.y + threadIdx.y;
   int idx = iy * dimx + ix;

   a[idx] = a[idx] + 1;
}

Using these simple CUDA C routines (or code that you have developed yourself), one can easily create a CPU-GPU executable that is ready to run on one of the CUNY HPC Center's GPU-enabled systems (ZEUS and ANDY).

Because of the variety of source and destination code states that the CUDA programming environment can source, generate, and manage, NVIDIA provides a master program, 'nvcc', called the CUDA compiler driver, to handle all of these possible compilation-phase translations as well as other compiler driver options. The detailed use of 'nvcc' is documented on ZEUS and ANDY by 'man nvcc' and also in NVIDIA's Compiler Driver Manual [11]. NOTE: Compiling CUDA Fortran programs can be accomplished using PGI's standard Fortran compiler, making sure that the CUDA Fortran code is marked with the '.CUF' suffix, as in 'matmul.CUF'.

Among the 'nvcc' command's many groups of options are a series of options that determine what source files 'nvcc' should expect to be offered and what destination files it is expected to produce. A sampling of these compilation phase options includes:

--compile  or  -c      ::  Compile whatever input files are offered (.c, .cc, .cpp, .cu) into object files (*.o files).
--ptx      or  -ptx    ::  Compile all .gpu or .cu input files into device-only .ptx files.
--link     or  -link   ::  Compile whatever input files are offered into an executable (the default).
--lib      or  -lib    ::  Compile whatever input files are offered into a library file (*.a file).

For a typical compilation to an executable, the third option above (which is to supply nothing or simply the string '-link') is used. There are a multitude of other 'nvcc' options that control file and path specifications for libraries and include files, control and pass options to 'nvcc' companion compilers and linkers (this includes much of the gcc stack, which must be in the user's path for 'nvcc' to work correctly), and control code generation, among other things. For a complete description, please see the manual referred to above or the 'nvcc' man page.

Our concern here is generating an executable from the simple example files above that can be used (like the PGI executables generated in the previous section) in a PBS batch submission script. First, we will produce object files (*.o files), and then we will link them into a GPU-Device-ready executable. Here are the 'nvcc' commands for generating the object files:

nvcc -c  simple3_host.cu
nvcc -c  simple3_device.cu

The above commands should be familiar to C programmers and produce 2 object files, simple3_host.o and simple3_device.o in the working directory. Next, the GPU-Device-ready executable is created:

nvcc -o simple3.exe *.o

Again, this should be very familiar to C programmers. It should be noted that these two steps can be combined as follows:

nvcc -o simple3.exe *.cu

No additional libraries or include files are required for this simple example, but in a more complex case like those provided in the CUDA Software Development Kit (SDK), library paths and libraries might be specified using the '-L' and '-l' options, include file paths with the '-I' option, among others. Again, details are provided in the 'nvcc' man page or NVIDIA Compiler Driver manual.

We now have an executable, 'simple3.exe', that can be submitted with PBS to one of the GPU-enabled compute nodes on ZEUS or ANDY; it will create and increment a 2D matrix on the GPU, return the results to the CPU, and print them out.

A Sample CUDA GPU Parallel Program Written in PGI's CUDA Fortran

As mentioned, in addition to CUDA C, PGI and NVIDIA have jointly developed a CUDA Fortran programming model and CUDA Fortran compiler. CUDA Fortran has been fully integrated into PGI's Fortran programming environment. The HPC Center's version of the PGI Fortran compiler fully supports CUDA Fortran.

Here, the same example presented above in CUDA C has been translated by HPC Center staff into CUDA Fortran. The CUDA Fortran host driver (the main program that runs on the compute node host) is presented first, followed by the CUDA Fortran device (GPU) code. The CUDA Fortran model proves to be economical and elegant because it can take advantage of Fortran's array-based syntax. For instance, in CUDA Fortran moving data to and from the device does not require calls to cudaMemcpy() or cudaMemset(), but is accomplished using Fortran's native array assignment across a simple '=' sign.

   program simple3
!
   use cudafor
   use mykernel
!
   implicit none
!
   integer :: dimx = 16, dimy = 16
   integer :: row = 1, col = 1
   integer :: fail = 0
   integer :: asize = 0
!
   integer, allocatable, dimension(:) :: host_a
   integer, device, allocatable, dimension(:) :: dev_a
!
   type(dim3) :: grid, block

   asize = dimx * dimy

   allocate(host_a(asize),dev_a(asize),stat=fail)

   if(fail /= 0) then
      write(*,'(a)') 'couldn''t allocate memory'
      stop
   end if

   dev_a(:) = 0

   block = dim3(4,4,1)
   grid  = dim3(dimx/4,dimy/4,1)

   call mykernel<<<grid,block>>>(dev_a,dimx,dimy)

   host_a(:) = dev_a(:)

   do row=1,dimy
      do col=1,dimx
         write(*,'(i1)', advance='no') host_a((row-1)*dimx+col)
      end do
      write(*,'(/)', advance='no')
   end do

   deallocate(host_a,dev_a)

   end program

Here is the CUDA Fortran device code:

module mykernel
!
   contains
!
   attributes(global) subroutine mykernel(dev_a,dimx,dimy)
!
   integer, device, dimension(:) :: dev_a
   integer, value  :: dimx, dimy
!
   integer :: ix, iy
   integer :: idx

   ix = (blockidx%x-1)*blockdim%x + threadidx%x
   iy = (blockidx%y-1)*blockdim%y + (threadidx%y-1)
   idx = iy * dimx + ix

   dev_a(idx) = dev_a(idx) + 1

   end subroutine

end module mykernel

Compiling CUDA Fortran code is also simple, requiring nothing more than the default PGI compiler. Here is how the above code would be compiled into a device-ready executable that could be submitted in the same manner as the CUDA C original:

pgf90 -Mcuda -fast -o simple3.exe simple3.CUF

The primary thing to remember is to use the '.CUF' suffix on all CUDA Fortran source files. As mentioned above, the basics of CUDA Fortran are presented here [12].

Submitting CUDA (C or Fortran), GPU-Parallel Programs Using PBS

The PBS script for submitting the 'simple3.exe' executable generated by the 'nvcc' compiler driver to ANDY is very similar to the script used for the PGI executable provided above:

#!/bin/bash
#PBS -q production_gpu
#PBS -N CUDA_GPU_job
#PBS -l select=1:ncpus=1:ngpus=1:accel=fermi
#PBS -l place=free
#PBS -V

echo "Starting CUDA GPU job ..."

cd $PBS_O_WORKDIR

echo $PBS_NODEFILE
cat    $PBS_NODEFILE

./simple3.exe

echo "CUDA GPU job is done!"

Focusing on what is required and different from non-GPU jobs, the first requirement is the use of the 'production_gpu' routing queue with the queuing option '-q'. (Note: On Zeus there is no 'production_gpu' queue. Use the 'development_gpu' queue instead and replace the string 'fermi' with 'tesla'):

#PBS -q production_gpu

The second requirement comes on the '-l select' line:

#PBS -l select=1:ncpus=1:ngpus=1:accel=fermi

Here, the script requests 1 PBS Pro resource chunk ('-l select=1') composed of one standard x86-64 processor (ncpus=1) and one GPU (ngpus=1), and, most importantly, it asks that the resource chunk use a compute node with an NVIDIA Fermi GPU accelerator attached (accel=fermi). On the ANDY2 (GPU) side of ANDY, every compute node (gpute-0 through gpute-47) has 2 Fermi GPUs attached (2 x 48 = 96 in all, again one per compute node CPU socket). On ZEUS, only compute-0-8 and compute-0-9 have Tesla GPUs attached, also with one GPU per CPU socket. On ZEUS, the accelerator name would have to be changed to 'accel=tesla'.

These are the essential PBS script requirements for submitting any GPU-Device-ready executable. They apply both to GPU-ready executable code generated from native CUDA C or Fortran and to compiler-directives-based GPU code. Other variations are possible, including jobs that combine MPI or OpenMP (or even both) with GPU parallel programming in a single GPU-SMP-MPI multi-parallel job; a minimal sketch of one such combination follows. These other options are discussed in the more detailed section on PBS Pro below. The HPC Center staff has developed a series of sample codes showing all of these multi-parallel programming model combinations, based on a simple Monte Carlo algorithm for calculating the price of an option. To obtain this example code suite, makefile, and submit scripts, please send a request to hpchelp@csi.cuny.edu.
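
The HPC Center's own multi-parallel examples must be requested as described above; the following is only a hedged, generic sketch of one common ingredient of an MPI-plus-CUDA job, namely having each MPI rank select its own GPU on a node before doing any device work. The round-robin device choice here is an assumption and would need to match the resources actually requested in the PBS select line:

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    int rank = 0, ndev = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Count the GPUs visible on this compute node and bind this rank to
       one of them (round-robin by rank; an assumption, not a requirement). */
    cudaGetDeviceCount(&ndev);
    if (ndev > 0)
        cudaSetDevice(rank % ndev);

    printf("MPI rank %d will use GPU %d of %d\n",
           rank, (ndev > 0) ? rank % ndev : -1, ndev);

    /* ... allocate device memory, launch kernels, exchange MPI messages ... */

    MPI_Finalize();
    return 0;
}

A file like this would typically be compiled with 'nvcc' (or an MPI compiler wrapper linked against the CUDA runtime) and submitted with a select line that requests both ncpus and ngpus in each chunk, as in the scripts above.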

CoArray Fortran and Unified Parallel C (PGAS) Program Compilation and PBS Job Submission

As part of its plan to offer CUNY HPC Center users a unique variety of HPC parallel programming alternatives (beyond even those described above), the HPC Center has recently acquired a 1280 core Cray XE6 system which supports two newer and similar, highly scalable approaches to parallel programming, CoArray Fortran (CAF) and Unified Parallel C (UPC). Both are extensions to their parent languages, Fortran and C respectively, and offer a symbolically concise alternative to the de facto standard, message-passing model, MPI. CAF and UPC are so-called Partitioned Global Address Space (PGAS) parallel programming models.

Both MPI and the PGAS approach to parallel programming rely on a Single Program Multiple Data (SPMD) model. In the SPMD parallel programming model, identical collaborating programs (with fully separate memory spaces, or program images) are executed by different processors that may or may not be separated by a network. Each processor-program produces different parts of the result in parallel by working on different data and taking conditionally different paths through the program. The PGAS approach differs from MPI in that it abstracts away, as much as possible, communication among the processors, reducing the way that communication is expressed to minimal built-in extensions to the base language, in our case C and Fortran. In large part, CAF and UPC are free of extension-related, explicit library calls. With the underlying communication layer abstracted away, PGAS languages appear to provide a single, global memory space among their processes.

In addition, communication among processes in a PGAS program is one-sided in the sense that any process can read and/or write into the memory of any other process without informing it of its actions. Such one-sided communication has the advantage of being economical, lowering the latency (first byte delay) that is part of the cost of communication among parallel processes. Lower latency parallel programs are generally more scalable because they waste less time in communication, especially when the data to be moved are small in size, in fine-grained communication patterns.

Summarizing, PGAS languages such as CAF and UPC offer the following potential advantages over MPI:

1. Communication is abstracted out of the programming model.

2. Process memory is logically unified into a global address space.

3. Parallel work is economically expressed through simple extensions to a base language rather than through a library-call-based API.

4. Parallel coding is easier and more intuitive.

5. Performance and scalability are better because communication latency is lower.

6. Implementation of fine-grained communication patterns is faster, easier.

The primary drawbacks of PGAS programming models include much less widespread support than MPI on common HPC system architectures, such as traditional HPC clusters, and a need for special hardware support to get the best performance out of the PGAS model. Here at the CUNY HPC Center, the Cray XE6 system, SALK, has special PGAS hardware support for both UPC and CAF. The other systems at the HPC Center support Berkeley UPC and Intel CAF on top of standard cluster interconnects, without the advantage of PGAS hardware support.

An Example CoArray Fortran (CAF) Code

The following simple example program includes some of the essential features of the CoArray Fortran (CAF) programming model, including the declaration of co-array variables that span processor images, one-sided data transfer between CAF's memory-space-distinct images via simple assignment statements, the use of critical regions, and synchronization barriers. No attempt is made here to tutor the reader in all of the features of CAF; rather, the goal is to give the reader a feel for the CAF extensions adopted in the Fortran 2008 programming language standard, which now includes co-arrays. This example, which computes PI by numerical integration, can be cut and pasted into a file and run on SALK, the CUNY Cray XE6 system.

A tutorial on the CAF parallel programming model can be found here [13], a more formal description of the language specifications here [14], and the actual CAF standard document as defined and adopted by the Fortran standard's committee for Fortran 2008 here [15].

! 
!  Computing PI by Numerical Integration in CAF
!

program int_pi
!
implicit none
!
integer :: start, end
integer :: my_image, tot_images
integer :: i = 0, rem = 0, mseg = 0, nseg = 0
!
real :: f, x
!

! Declare two CAF scalar CoArrays, each with one copy per image

real :: local_pi[*], global_pi[*]

! Define integrand with Fortran statement function, set result
! accuracy through the number of segments

f(x) = 1.0/(1.0+x*x)
nseg = 4096

! Find out my image name and the total number of images

my_image   = this_image()
tot_images = num_images()

! Each image initializes its part of the CoArrays to zero

local_pi  = 0.0
global_pi = 0.0

! Partition integrand segments across CAF images (processors)

rem = mod(nseg,tot_images)

mseg  = nseg / tot_images
start = mseg * (my_image - 1)
end   = (mseg * my_image) - 1

if ( my_image .eq. tot_images ) end = end + rem

! Compute local partial sums on each CAF image (processor)

do i = start,end
  local_pi = local_pi + f((.5 + i)/(nseg))

! The above is equivalent to the following more explicit code:
!
! local_pi[my_image]= local_pi[my_image] + f((.5 + i)/(nseg))
!

enddo

local_pi = local_pi * 4.0 / nseg

! Add local, partial sums to single global sum on image 1 only. Use
! critical region to prevent read-before-write race conditions. In such
! a region, only one image at a time may pass.

critical
 global_pi[1] = global_pi[1] + local_pi
end critical

! Ensure all partial sums have been added using CAF 'sync all' barrier
! construct before writing out results

sync all

! Only CAF image 1 prints the global result

if( this_image() == 1) write(*,"('PI = ', f10.6)") global_pi

end program

This sample code computes PI in parallel using a numerical integration scheme. Taking the key CAF-specific features present in this example in order, first we find the declaration of two simple scalar co-arrays (local_pi and global_pi) using CAF's square-bracket notation (e.g. sname[*], vname(1:100)[*], or vname(1:8,1:4)[1:4,*]). The square-bracket notation follows the standard Fortran array notation rules, except that the last co-dimension is always indicated with an asterisk ('*') that is expanded so that the number of co-array copies equals the number of images (processes) the application has launched.

Next, the example uses the this_image() and num_images() intrinsic functions to determine each image's ID (a number from 1 to the number of processors requested) and the total number of images or processes requested by the job. These functions' return values are stored in ordinary, image-local Fortran integer variables and are used later in the example to partition the work among the processors and define image-specific paths through the code. After the integral segments are partitioned among the co-array images or processes (using the start and end variables), each image computes its piece of the integral in a standard Fortran do loop. However, the variable local_pi, as noted above, is a co-array. Two notations, one implicit and one explicit (but commented out), are presented. The implicit form, with its square-bracket notation dropped, is allowed (and encouraged for optimization reasons) when only the image-local part of a co-array is referenced by a given image. The explicit form makes it clear through the square-bracket suffix [my_image] that each image is working with a local element of the local_pi co-array. When the practice of dropping the square brackets is adopted as a notational convention, all remote references (which are more time-consuming operations) in CoArray Fortran are immediately identifiable through the presence of square-bracket suffixes in the code. Optimal coding practice should seek to minimize the number of square-bracketed (remote) references.

With the local, partial sums computed by each image and placed in their piece of the local_pi[*] co-array, a global sum is then safely computed and written out only on image 1 with the help of a CAF critical region. Within a critical region, only one image (process) may pass at a time. This ensures that global_pi[1] is accurately summed from each local_pi[my_image], avoiding mistakes that could be caused by simultaneous reads of the same, still partially summed global_pi[1] before each image-specific increment was written. Here, we see the variable global_pi[1] with the square-bracket notation, which is a reminder that each image (process) is writing its result into the memory space of image 1. This is a remote write for all images except image 1.

The last section of the code synchronizes the images (sync all) to ensure all partial sums have been added, and then has image 1 write out the global result. Note that, as written here, only image 1 has the global result. For a more detailed treatment of the CoArray Fortran language extension, now part of the Fortran 2008 standard, please see the web references included above.


The CUNY HPC Center supports CoArray Fortran both on its Cray XE6 system, SALK (which has custom hardware and software support for the UPC and CAF PGAS languages), and on its other systems, where the Intel Cluster Studio provides a beta-level implementation of CoArray Fortran layered on top of Intel's MPI library, an approach that offers CAF's coding simplicity but no performance advantage over MPI.

Here, the process of compiling a CAF program both for Cray's CAF on SALK, and for Intel's CAF on the HPC Center's other systems is described. On the Cray, compiling a CAF program, such as the example above, simply requires adding an option to the Cray Fortran compiler, as follows:

salk:
salk: module load PrgEnv-cray
salk:
salk: ftn -h caf -o int_PI.exe int_PI.f90
salk:
salk: ls
int_PI.exe
salk:

In the sequence above, first the Cray programming environment is loaded using the 'module' command; then the Cray Fortran compiler is invoked with the '-h caf' option to enable the CAF features of the compiler. The result is a CAF-enabled executable that can be run with Cray's parallel job launch command 'aprun'. This compilation was done in dynamic mode, so that any number of processors (CAF images) can be selected at run time using the '-n ##' option to Cray's 'aprun' command. The required form of the 'aprun' command is shown below in the section on CAF program job submission using PBS on the Cray.

To compile for a fixed number of processors (CAF images), a static compile, use the '-X ##' option on the Cray, as follows:

salk:
salk: ftn -X 32 -h caf -o int_PI_32.exe int_PI.f90
salk:
salk: ls
int_PI_32.exe
salk:

In this example, the PI example program has been compiled for 32 processors or CAF images, and therefore must be invoked with that many processors on the 'aprun' command line:

aprun -n 32 -N 16 ./int_PI_32.exe

On the HPC Center's other systems, compilation is conceptually similar, but uses the Intel Fortran compiler 'ifort' and requires a CAF configuration file to be defined by the user. Here is a typical configuration file to compile statically for 16 CAF images, followed by the compilation command. This compilation requests distributed mode, in which distinct CAF images are not expected to be on the same physical node.

andy$cat cafconf.txt
-rr -envall -n 16 ./int_PI.exe
andy$
andy$ifort -o int_PI.exe -coarray=distributed -coarray-config-file=cafconf.txt int_PI.f90

The Intel CAF compiler is relatively new and has had limited testing on CUNY HPC systems. It also makes use of Intel's MPI rather than the CUNY HPC Center default, OpenMPI, which means that Intel CAF jobs will not be properly accounted for. As such, we recommend that the Intel CAF compiler be used for development and testing only, and that production CAF codes be run on SALK using Cray's CAF compiler. An upgrade to the Intel Compiler Suite is planned for the near future, and this should improve the performance and functionality of Intel's CAF compiler release. Additional documentation on using Intel CoArray Fortran is available here.

Submitting CoArray Fortran Parallel Programs Using PBS

Finally, here are two PBS scripts that will run the above CAF executable. First, one for the Cray XE6 system, SALK:

#!/bin/bash
#PBS -q production
#PBS -N CAF_example
#PBS -l select=64:ncpus=1:mem=2000mb
#PBS -l place=free
#PBS -o int_PI.out
#PBS -e int_PI.err
#PBS -V

cd $PBS_O_WORKDIR

aprun -n 64 -N 16 ./int_PI.exe

Above, the dynamically compiled executable is run on 64 SALK (Cray XE6) cores (-n 64), with 16 cores packed onto each physical node (-N 16). More detail is presented below on PBS job submission to the Cray and on the use of the Cray's 'aprun' command. On the Cray, 'man aprun' provides an important and detailed account of the 'aprun' command-line options and their function. One cannot fully understand job control and submission on the Cray (SALK) without understanding the 'aprun' command.

A PBS script for the example code compiled dynamically (or statically) for 16 processors with the Intel compiler (ifort) for execution on one of the HPC Center's more traditional HPC clusters looks like this:

#!/bin/bash
#PBS -q production
#PBS -N CAF_example
#PBS -l select=16:ncpus=1:mem=1920mb
#PBS -l place=scatter
#PBS -V

echo ""
echo -n "The primary compute node hostname is: "
hostname
echo ""
echo -n "The location of the PBS nodefile is: "
echo $PBS_NODEFILE
echo ""
echo "The contents of the PBS nodefile are: "
echo ""
cat  $PBS_NODEFILE
echo ""
NCNT=`uniq $PBS_NODEFILE | wc -l - | cut -d ' ' -f 1`
echo -n "The node count determined from the nodefile is: "
echo $NCNT
echo ""

# Change to working directory
cd $PBS_O_WORKDIR

echo "You are using the following 'mpiexec' and 'mpdboot' commannds: "
echo ""
type mpiexec
type mpdboot
echo ""

echo "Starting the Intel 'mpdboot' daemon on $NCNT nodes ... "
mpdboot -n $NCNT --verbose --file=$PBS_NODEFILE -r ssh
echo ""

mpdtrace
echo ""

echo "Starting an Intel CAF job requesting 16 cores ... "

./int_PI.exe

echo "CAF job finished ... "
echo ""

echo "Making sure all mpd daemons are killed ... "
mpdallexit
echo "PBS CAF script finished ... "
echo ""

Here, the PBS script requests 16 processors (CAF images). It simply names the executable itself, which sets up the Intel CAF runtime environment, engages the 16 processors, and initiates execution. This script is more elaborate because it includes the procedure for setting up and tearing down the Intel MPI environment on the nodes that PBS has selected to run the job.

An Example Unified Parallel C (UPC) Code

The following simple example program includes the essential features of the Unified Parallel C (UPC) programming model, including shared (globally distributed) variable declaration and blocking, one-sided data transfer between UPC's memory-space-distinct threads via simple assignment statements, and synchronization barriers. No attempt is made here to tutor the reader in all of the features of UPC; rather, the goal is to give the reader a feel for the basic UPC extensions to the C programming language. A tutorial on the UPC programming model can be found here [16], a user guide here [17], and a more formal description of the language specifications here [18]. Cray also has its own documentation on UPC [19].

// 
//  Computing PI by Numerical Integration in UPC
//

// Select memory consistency model (default).

#include<upc_relaxed.h> 

#include<math.h>
#include<stdio.h>

// Define integrand with a macro and set result accuracy

#define f(x) (1.0/(1.0+x*x))
#define N 4096

// Declare UPC shared scalar, shared vector array, and UPC lock variable.

shared float global_pi = 0.0;
shared [1] float local_pi[THREADS];
upc_lock_t *lock;

void main(void)
{
   int i;

   // Allocate a single, globally-shared UPC lock. This 
   // function is collective; the initial state is unlocked.

   lock = upc_all_lock_alloc();

   // Each UPC thread initializes its local piece of the
   // shared array.

   local_pi[MYTHREAD] = 0.0;

   // Distribute work across threads using local part of shared
   // array 'local_pi' to compute PI partial sum on thread (processor)

   for(i = 0; i <  N; i++) {
       if(MYTHREAD == i%THREADS) local_pi[MYTHREAD] += (float) f((.5 + i)/(N));
   } 

   local_pi[MYTHREAD] *= (float) (4.0 / N);

   // Compile local, partial sums to single global sum.
   // Use locks to prevent read-before-write race conditions.

   upc_lock(lock);
   global_pi += local_pi[MYTHREAD];
   upc_unlock(lock);

   // Ensure all partial sums have been added with UPC barrier.

   upc_barrier;

   // UPC thread 0 prints the results and frees the lock.

   if(MYTHREAD==0) printf("PI = %f\n",global_pi);
   if(MYTHREAD==0) upc_lock_free(lock);

}

This sample code computes PI in parallel using a numerical integration scheme. Taking the key UPC-specific features present in this example in order, first we find the declaration of the memory consistency model to be used in this code. The default choice is 'relaxed', which is selected explicitly here. The relaxed choice places the burden of ordering dependent shared-memory operations on the programmer, through the use of barriers, fences, and locks. This code includes explicit locks and barriers to ensure that memory operations are complete and that the processors have been synchronized.

Next, three declarations outside the main body of the application demonstrate the use of UPC's shared type. First, a scalar shared variable global_pi is declared. This variable can be read from and written to by any of the UPC threads (processors) allocated to the application by the runtime environment when it is executed. It will hold the final result of the calculation of PI in this example. Shared scalar variables are singular and always reside in the shared memory of THREAD 0 in UPC.

Next, a shared one-dimensional array local_pi with a block size of one (1) and a size of THREADS is declared. The THREADS macro is always set to the number of processors (UPC threads) requested by the job at runtime. All elements in this shared array are accessible by all the threads allocated to the job. The block size of one means that array elements are distributed, one per thread, across the logically Partitioned Global Address Space (PGAS) of this parallel application. One is the default block size for shared arrays, but other sizes are possible, as the sketch below illustrates.
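
As a hedged illustration only (the array name is hypothetical and not part of the example above), a declaration with a non-default block size hands out blocks of consecutive elements to each thread in turn:

/* Hypothetical declaration: blocks of 4 consecutive elements are assigned
   to thread 0, the next 4 to thread 1, and so on, cycling over THREADS,
   instead of the default block size of 1 used by local_pi above. */
shared [4] float blocked_pi[4*THREADS];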

Finally, a pointer to a special shared variable to be used as a lock is declared. Because UPC defines both shared and private memory spaces for each program image or thread, it must support four classes of pointers: private pointers to private data, private pointers to shared data, shared pointers to private data, and shared pointers to shared data (illustrated in the sketch below). The lock pointer declared here refers to a lock object that resides in shared space, so the lock is available to every thread. In the body of the code, the lock's memory is allocated and placed in the unlocked state with the call to upc_all_lock_alloc().
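
The four pointer classes can be written out as in the hedged sketch below; the variable names are hypothetical and chosen only to label each case:

int *p_priv_to_priv;                    /* private pointer to private data */
shared int *p_priv_to_shared;           /* private pointer to shared data  */
int *shared p_shared_to_priv;           /* shared pointer to private data  */
shared int *shared p_shared_to_shared;  /* shared pointer to shared data   */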

Next, each thread initializes its piece of the shared array local_pi to zero with the help of the MYTHREAD macro, which contains the thread identifier of the particular thread that does the assignment. In this case, each UPC thread initializes only the part of the shared array that is in its portion of shared PGAS memory. The standard C for-loop that follows divides the work of integration among the different UPC threads so that each thread works only on its local portion of the shared array local_pi. UPC also provides a work-sharing loop construct, upc_forall, that accomplishes the same thing implicitly (see the sketch below).
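
For reference, here is a hedged sketch of the same loop written with upc_forall, reusing the names (f, N, local_pi) from the example above; with an integer affinity expression (the fourth expression), iteration i is executed by thread i % THREADS, matching the explicit modulus test in the example:

    int i;

    /* Work-sharing form of the integration loop: the affinity expression
       'i' gives iteration i to thread i % THREADS. */
    upc_forall(i = 0; i < N; i++; i) {
        local_pi[MYTHREAD] += (float) f((.5 + i)/(N));
    }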

Processor-local (UPC thread) partial sums are then summed globally, and in a memory-consistent fashion, with the help of the UPC lock functions upc_lock() and upc_unlock(). Without the explicit locking code here, there would be nothing to prevent two UPC threads from reading the current value of global_pi before it had been updated with another thread's partial sum. This would produce an incorrect, under-summed result. Next, a upc_barrier ensures all the summing is completed before the result is printed and the lock's memory is freed.

This example includes some of the more important UPC PGAS-parallel extensions to the C programming language; a complete review of the UPC parallel extension to C is provided in the web documentation referenced above.

As suggested above, the CUNY HPC Center supports UPC both on its Cray XE6 system, SALK (which has custom hardware and software support for the UPC and CAF PGAS languages), and on its other systems, where Berkeley UPC is installed and uses the GASNET library to support the PGAS memory abstraction on top of a number of standard underlying cluster interconnects. At the HPC Center, this includes Ethernet and/or InfiniBand, depending on the CUNY HPC Center cluster system being used.

Here, the process of compiling a UPC program both for Cray's UPC on SALK, and for Berkeley UPC on the HPC Center's other systems is described. On the Cray, compiling a UPC program, such as the example above, simply requires adding an option to the Cray C compiler, as follows:

salk:
salk: module load PrgEnv-cray
salk:
salk: cc -h upc -o int_PI.exe int_PI.c
salk:
salk: ls
int_PI.exe
salk:

First, the Cray programming environment is loaded using the 'module' command; then the Cray C compiler is invoked with the '-h upc' option to enable the UPC elements of the compiler. The result is an executable that can be run with Cray's parallel job launch command 'aprun'. This compilation was done in dynamic mode, so that any number of processors (UPC threads) can be selected at run time using the '-n ##' option to 'aprun'. The required form of the 'aprun' line is shown below in the section on UPC program PBS job submission.

To compile for a fixed number of processors (UPC threads), a static compile, use the '-X ##' option on the Cray, as follows:

salk:
salk: cc -X 32 -h upc -o int_PI_32.exe int_PI.c
salk:
salk: ls
int_PI_32.exe
salk:

In this example, the PI example program has been compiled for 32 processors or UPC threads, and therefore must be invoked with that many processors on the 'aprun' command line:

aprun -n 32 -N 16 ./int_PI_32.exe

On the HPC Center's other systems, compilation is conceptually similar, but uses the Berkeley UPC compiler driver 'upcc'.

andy:
andy: upcc  -o int_PI.exe int_PI.c
andy:
andy: ls
int_PI.exe
andy:

Similarly, the 'upcc' compiler driver from Berkeley allows for static compilations using its -T ## option:

andy:
andy: upcc -T 32  -o int_PI_32.exe int_PI.c
andy:
andy: ls
int_PI_32.exe
andy:

The Berkeley UPC compiler driver has a number of other useful options that are described in its 'man' page. In particular, the -network= option will target the executable for the GASNET communication conduit of the user's choosing on systems that have multiple interconnects (Ethernet and InfiniBand, for instance) or target the default version of MPI as the communication layer. Type 'man upcc' for details.

In general, users can expect better performance from Cray's UPC compiler on SALK, but having UPC on the HPC Center's traditional cluster architectures provides another location for development and supports the wider use of UPC and an alternative to MPI. In theory, well-written UPC code should perform as well as MPI on a standard cluster, while reducing the number of lines of code to achieve that performance. In practice, this is still not always the case; more development and hardware support is still needed to get the best performance from PGAS languages on commodity cluster environments.

Submitting UPC Parallel Programs Using PBS

Finally, here are two PBS scripts that will run the above UPC executable. First, one for the Cray XE6 system, SALK:

#!/bin/bash
#PBS -q production
#PBS -N UPC_example
#PBS -l select=64:ncpus=1:mem=2000mb
#PBS -l place=free
#PBS -o int_PI.out
#PBS -e int_PI.err
#PBS -V

cd $PBS_O_WORKDIR

aprun -n 64 -N 16 ./int_PI.exe

Here the dynamically compiled executable is run on 64 Cray XE6 cores (-n 64), with 16 cores packed onto each physical node (-N 16). More detail is presented below on PBS job submission on the Cray and on the use of the Cray's 'aprun' command. On the Cray, 'man aprun' provides an important and detailed account of the 'aprun' command-line options and their function. One cannot fully understand job control on the Cray (SALK) without understanding 'aprun'.

A similar PBS script for the example code compiled dynamically (or statically) for 32 processors with the Berkeley UPC compiler (upcc) for execution on one of the HPC Center's more traditional HPC clusters looks like this:

#!/bin/bash
#PBS -q production
#PBS -N UPC_example
#PBS -l select=32:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -o int_PI.out
#PBS -e int_PI.err
#PBS -V

cd $PBS_O_WORKDIR

upcrun -n 32 ./int_PI.exe

Here, the PBS script requests 32 processors (UPC threads). It uses the 'upcrun' command to setup the Berkeley UPC runtime environment, engage the 32 processors, and initiate execution. Please type 'man upcrun' for details on the 'upcrun' command and its options.

Available Mathematical Libraries

FFTW Scientific Library

FFTW is a C subroutine library for computing the Discrete Fourier Transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).

The library is described in detail at the FFTW home page at http://www.fftw.org. The CUNY HPC Center has installed FFTW versions 2.1.5 (older), 3.2.2 (default), and 3.3.0 (recent release) on ANDY. All versions were built in both 32-bit and 64-bit floating-point formats using the latest Intel 12.0 release of the Intel compilers. In addition, versions 2.1.5 and 3.3.0 provide an MPI-parallel version of the library. The default version at the CUNY HPC Center is 3.2.2 (64-bit), located in /share/apps/fftw/default/*.

The reason for the extra versions is that over the course of FFTW's development some changes were made to the API for the MPI-parallel library. Version 2.1.5 supports the older MPI-parallel API, and the recently released version 3.3.0 supports a newer MPI-parallel API. NOTE: The default version (3.2.2) does NOT include an MPI-parallel library, as MPI support was skipped in that generation of FFTW. A threads version of each library was also built.

Please refer to the on-line documentation at the FFTW website for details on using the library (whatever the version). With the calls properly included in your code, you can link in the default version at compile and link time with:

icc -o my_fftw.exe my_fftw.c -L/share/apps/fftw/default/lib -lfftw3 

(pgcc or gcc would be used in the same way)

For the non-default versions, substitute the version directory for the string 'default' above. For example, for the new 3.3 release in 32-bit (single precision), use:

icc -o my_fftw.exe my_fftw.c -L/share/apps/fftw/3.3_32bit/lib -lfftw3f

For an MPI-parallel, 64-bit version of 3.3 use:

mpicc -o my_mpi_fftw.exe my_mpi_fftw.c -L/share/apps/fftw/3.3_64bit/lib -lfftw3_mpi

The include files for each release are in the 'include' directory alongside the version's lib directory. The names of all available libraries for each release can be found by simply listing the contents of the appropriate version's lib directory; do this to find the names of the threads version of each library, for instance.
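
For orientation only, here is a minimal, hedged sketch of the basic FFTW 3 plan/execute/destroy call sequence that the link commands above assume; the transform length is arbitrary, and the FFTW documentation remains the authoritative reference:

#include <fftw3.h>

#define N 1024   /* arbitrary transform length for this sketch */

int main(void)
{
    fftw_complex *in, *out;
    fftw_plan plan;
    int i;

    /* Allocate FFTW-aligned input and output arrays */
    in  = fftw_malloc(sizeof(fftw_complex) * N);
    out = fftw_malloc(sizeof(fftw_complex) * N);

    /* Create a 1D forward DFT plan */
    plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    /* Fill the input with a trivial signal (real ramp, zero imaginary part) */
    for (i = 0; i < N; i++) { in[i][0] = (double) i; in[i][1] = 0.0; }

    fftw_execute(plan);        /* compute the transform into out[] */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}

Such a code would be compiled and linked exactly as in the commands shown above, adding the corresponding 'include' directory with '-I' if the header is not found by default.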

GNU Scientific Library

The GNU Scientific Library (GSL) is a numerical library for C and C++ programmers. It is free software under the GNU General Public License.

The library provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. There are over 1000 functions in total with an extensive test suite.

Here is an example of code that uses GSL routines:

#include <stdio.h>
#include <gsl/gsl_sf_bessel.h>
 
int main(void)
{
  double x = 5.0;
  double y = gsl_sf_bessel_J0(x);
  printf("J0(%g) = %.18e\n", x, y);
  return 0;
}

The example program has to be linked to the GSL library upon compilation:

gcc $(/share/apps/gsl/default/bin/gsl-config --cflags) test.c $(/share/apps/gsl/default/bin/gsl-config --libs)

The output is shown below, and should be correct to double-precision accuracy:

J0(5) = -1.775967713143382642e-01

Complete GNU Scientific Library documentation may be found on the official website of the project: http://www.gnu.org/software/gsl/

MKL

Documentation to be added.

IMSL

IMSL (International Mathematics and Statistics Library) is a commercial collection of software libraries of numerical analysis functionality that are implemented in the computer programming languages of C, Java, C#.NET, and Fortran by Visual Numerics.

C and Fortran implementations of IMSL are installed on the Bob cluster under
/share/apps/imsl/cnl701 
and
/share/apps/imsl/fnl600
respectively.

Fortran Example

Here is an example of a FORTRAN program that uses IMSL routines:

! Use files
 
       use rand_gen_int
       use show_int
 
!  Declarations
 
       real (kind(1.e0)), parameter:: zero=0.e0
       real (kind(1.e0)) x(5)
       type (s_options) :: iopti(2)=s_options(0,zero)
       character VERSION*48, LICENSE*48, VERML*48
       external VERML
 
!  Start the random number generator with a known seed.
       iopti(1) = s_options(s_rand_gen_generator_seed,zero)
       iopti(2) = s_options(123,zero)
       call rand_gen(x, iopt=iopti)
 
!     Verify the version of the library we are running
!     by retrieving the version number via verml().
!     Verify correct installation of the license number
!     by retrieving the customer number via verml().
!
      VERSION = VERML(1)
      LICENSE = VERML(4)
      WRITE(*,*) 'Library version:  ', VERSION
      WRITE(*,*) 'Customer number:  ', LICENSE

!  Get the random numbers
       call rand_gen(x)
 
!  Output the random numbers
       call show(x,text='                              X')

! Generate error
       iopti(1) = s_options(15,zero)
       call rand_gen(x, iopt=iopti)
 
       end

To compile this example use

 . /share/apps/imsl/imsl/fnl600/rdhin111e64/bin/fnlsetup.sh

ifort -openmp -fp-model precise -I/share/apps/imsl/imsl/fnl600/rdhin111e64/include -o imslmp imslmp.f90 -L/share/apps/imsl/imsl/fnl600/rdhin111e64/lib -Bdynamic -limsl -limslsuperlu -limslscalar -limslblas -limslmpistub -limf -Xlinker -rpath -Xlinker /share/apps/imsl/imsl/fnl600/rdhin111e64/lib


To run it in batch mode, use the standard submission procedure described in the section Program Compilation and Job Submission. A successful run will generate the following output:

 Library version:  IMSL Fortran Numerical Library, Version 6.0     
 Customer number:  702815                                          
                               X
     1 -    5   9.320E-01  7.865E-01  5.004E-01  5.535E-01  9.672E-01

 *** TERMINAL ERROR 526 from s_error_post.  s_/rand_gen/ derived type option
 ***          array 'iopt' has undefined option (15) at entry (1).

C Example

Here is a more complicated example in C:

#include <stdio.h>
#include <imsl.h>

int main(void)
{
    int         n = 3;
    float       *x;
    static float        a[] = { 1.0, 3.0, 3.0,
                                1.0, 3.0, 4.0,
                                1.0, 4.0, 3.0 };
    static float        b[] = { 1.0, 4.0, -1.0 };

    /*
     * Verify the version of the library we are running by
     * retrieving the version number via imsl_version().
     * Verify correct installation of the error message file
     * by retrieving the customer number via imsl_version().
     */
    char        *library_version = imsl_version(IMSL_LIBRARY_VERSION);
    char        *customer_number = imsl_version(IMSL_LICENSE_NUMBER);

    printf("Library version:  %s\n", library_version);
    printf("Customer number:  %s\n", customer_number);

                                /* Solve Ax = b for x */
    x = imsl_f_lin_sol_gen(n, a, b, 0);
                                /* Print x */
    imsl_f_write_matrix("Solution, x of Ax = b", 1, n, x, 0);
                               /* Generate Error to access error 
                                  message file */
    n =-10;

    printf ("\nThe next call will generate an error \n");
    x = imsl_f_lin_sol_gen(n, a, b, 0);
}

To compile this example use

. /share/apps/imsl/imsl/cnl701/rdhsg111e64/bin/cnlsetup.sh

icc -ansi -I/share/apps/imsl/imsl/cnl701/rdhsg111e64/include -o cmath cmath.c -L/share/apps/imsl/imsl/cnl701/rdhsg111e64/lib -L/share/apps/intel/composerxe-2011.0.084/mkl/lib/em64t -limslcmath -limslcstat -limsllapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -lgfortran -i_dynamic -Xlinker -rpath -Xlinker /share/apps/imsl/imsl/cnl701/rdhsg111e64/lib -Xlinker -rpath -Xlinker /share/apps/intel/composerxe-2011.0.084/mkl/lib/em64t

To run the binary in batch mode, use the standard submission procedure described in the section Program Compilation and Job Submission. A successful run will generate the following output:

Library version:  IMSL C/Math/Library Version 7.0.1
Customer number:  702815
 
       Solution, x of Ax = b
         1           2           3
        -2          -2           3

The next call will generate an error 

*** TERMINAL Error from imsl_f_lin_sol_gen.  The order of the matrix must be
***          positive while "n" = -10 is given.

Training Courses

The CUNY HPCC provides training courses and organizes seminars on various HPC topics. The training courses are provided at no cost and may be held at any CUNY campus site, the CUNY HPCC at the College of Staten Island, or the Graduate Center.

For more information on attending a course, please visit http://www.csi.cuny.edu/cunyhpc/events.html
For information about having a course scheduled, please send an email to hpchelp@mail.csi.cuny.edu

UPCOMING WORKSHOPS:
High Performance Computing on CUNY’s Cray XE6 at the College of Staten Island - 11 March 2011 - 9AM-5PM
High Performance Computing on CUNY’s Cray XE6 at Baruch College - 16-17 Feb 2011 - CLOSED FOR REGISTRATION

To register for a seat, fill out an application at http://www.csi.cuny.edu/cunyhpc/events.html


The curriculum for a typical 2 1/2 day course in parallel programming using the Message Passing Interface Library (MPI) is provided below. The course is typically given as a workshop with hands-on exercises. It is expected that attendees know UNIX (or one of its variants) and either C or FORTRAN.


DAY 1 (Half day; 1:00 PM to 5:00 PM)

    Overview of computer architectures
        Distribution of class materials
        Serial computers
        Vector processors
        Symmetric Multi-processors
        Parallel computers
            Single Instruction Multiple Data
            Multiple Instruction Multiple Data
        Heterogeneous computing with general purpose graphical processing units

    The City University of New York High Performance Computing Initiative
        Why HPC?
        Installed systems
        Future plans

    Getting familiar with the systems
        Account set-up
        Logging on
        Running a sample job

DAY 2  (Full day; 9:00 AM to 5:00 PM)

    Introduction to MPI
        MPI point-to-point communications
        Collectives
        Blocking sends and receives
        Non-blocking sends and receives
        Testing for completion
    Hands-on exercises

DAY 3  (Full day; 9:00 AM to 5:00 PM)

    MPI collectives
        Gather/scatter
        All-to-all
        Performance notes
    OpenMP
        What is OpenMP
        Compiler Directives
        Conditional Compilation
        Environmental Variables
        OpenMP Performance
    Parallel Programming Futures
    Hands-on exercises

User Accounts

Applying for a HPCC Account

Only CUNY faculty, research staff, their collaborators at other universities and their public and private sector partners, and currently enrolled CUNY students (who MUST have a faculty sponsor) are allowed to use the CUNY HPCC systems. Applications for accounts are accepted at any time, but accounts expire on 30 September and must be renewed before then.

A CUNY HPCC account is required to log into the HPCC systems. Faculty, staff or students at CUNY may apply for a HPCC account by following this link: (http://www.csi.cuny.edu/cunyhpc/Accounts.html).

Please be sure to complete all parts of the application, including information on publications, funded projects, and resources required. With regard to the latter, please indicate the number of processor hours that are required for the academic year. For example, if you expect to submit 30 jobs per week, each using 16 processors and each running for 2 hours, then your requirement is for 49,920 processor hours (30 jobs * 52 weeks * 16 processors * 2 hours).

By applying for and obtaining an account, the user agrees to comply with the CUNY Acceptable Use Policy and the HPCC User Account and Password Policy, and to include a citation acknowledging use of the CUNY HPC resources.

Acceptable Use Policy

Use of the computing resources at the HPCC is governed by the CUNY Acceptable Use Policy (AUP). The AUP is documented at

http://portal.cuny.edu/cms/id/cuny/documents/level_3_page/001171.htm and http://www.csi.cuny.edu/privacy/index.html

Citations

Users of the CUNY HPC systems must include the following citation on any publication or presentation that includes results or is based on work using CUNY HPC resources:

"This research was supported, in part, by a grant of computer time from the City University of New York 
 High Performance Computing Center under NSF Grants CNS-0855217 and CNS - 0958379."

Renewal applications should include a list of publications or presentations that resulted from the use of the CUNY HPC resources, as future grants of time will be based, in part, on past research accomplishments.

Users are requested to send a copy of the publication or presentation to the Center, either electronically (hpchelp@mail.csi.cuny.edu) or by mail to CUNY HPC, Building 1M-206, College of Staten Island, 2800 Victory Boulevard, Staten Island, NY 10314.

User Account and Password Policy

A user account is issued to an individual user. Accounts are not to be shared.

By default, all users have access to NEPTUNE, ATHENA, BOB, ZEUS and ANDY. Access to SALK and KARLE is granted by request only. The default disk storage for a general account is 50 GB on each system.

Users are responsible for protecting their passwords. Passwords are not to be shared.

When an account is opened, the user will receive a one-time password sent by mail to his or her university mailing address. Upon receiving the one-time password, the user should log onto the HPCC systems and change it. If the password is not changed within 30 days of issuance, it will expire.

The new password must conform to the CUNY password policy, which requires that it be at least eight (8) characters long, include at least one capitalized letter, one numerical character, and one of the following special characters:

 ! @ # $ % & * = + ) ( 

Passwords are good for 92 days. You will receive a notice two weeks before the end of the 92-day period requesting that you change your password. If you do not change your password, your accounts will be locked and the password will need to be reset.

How to change password

The command to change a password is "passwd". An example of its use follows:

[user.name@athena ~]$ passwd
Changing password for user user.name.
Changing password for user.name
(current) UNIX password: old_password
New UNIX password: new_password
Retype new UNIX password: new_password
passwd: all authentication tokens updated successfully.
[user.name@athena ~]$ 

Groups

All users belong to a group.
To locate your group(s), use the following command:

groups

To share files within a group:

1. Set the group ownership of the file:

chgrp groupname filename

2. Set the file permissions to allow group read and write access:

chmod g+r filename
chmod g+w filename
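
For example, to share a data file with the other members of a group and confirm the result (the file and group names below are hypothetical):

chgrp projectlab results.dat
chmod g+rw results.dat
ls -l results.dat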

Logging in to HPCC

Notice: Users may not access CUNY computer resources without authorization or use them for purposes beyond the scope of authorization. This includes attempting to circumvent CUNY computer resource system protection facilities by hacking, cracking or similar activities, accessing or using another person's computer account, and allowing another person to access or use the user's account. CUNY computer resources may not be used to gain unauthorized access to another computer system within or outside of CUNY. Users are responsible for all actions performed from their computer account that they permitted or failed to prevent by taking ordinary security precautions.

For security reasons, CUNY only allows users to connect using SSH. Secure Shell (abbreviated SSH) is a secure means of connecting to a remote server over an encrypted channel. SSH is a protocol designed to allow a user to log into a remote machine and execute commands on it using encrypted communication between two non-trusted hosts over an insecure network, which older protocols such as Telnet cannot provide.

The HPC systems located at the CUNY HPCC accept IP addresses only from the CSI campus. Users not located on the CSI campus must first log into an authentication server. The authentication server for the HPCC is neptune.csi.cuny.edu. To log into the HPC systems, the user must then ssh from neptune.csi.cuny.edu to the desired HPC system.

Logging in from a Windows machine

If you are using a Windows machine locally, you need to have an SSH client installed on it. While other SSH clients exist, CUNY strongly recommends WinSCP or PuTTY. Once you have an SSH client installed, run it and connect to the HPCC. Documentation on these applications can be found at the links above. Another option is installing Cygwin.

Login from Unix

On Unix/Linux machines, the user should use ssh to log in to the HPCC systems. On most Linux and Unix systems, the ssh command is located in /usr/bin. Please refer to the corresponding man page.

Command

 
$ ssh user.name@neptune.csi.cuny.edu 

will log you onto the authentication server. Once you are logged in there, you are ready to go on to one of the HPCC systems (athena in this example):

[username@neptune ~]$ ssh athena


username@athena's password: YouR_password**HeRE
Last login: Mon Oct 20 13:04:23 2008 from neptune.csi.cuny.edu
Rocks 5.0 (V)
Profile built 19:20 30-Sep-2008
Kickstarted 16:04 30-Sep-2008
[username@athena ~]$

When connecting to any of our hosts for the first time, you are asked to validate the authenticity of the key presented by that host. Once you answer yes, that key will be stored. Future login attempts to that same server will check the key against what is stored in the file:

~/.ssh/known_hosts

In the rare cases when our team reinstalls a host's operating system, the host identification will change. If we perform this sort of maintenance, all users will receive a notice from the HPC team.

When this happens, you will get a message similar to this:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
5c:0b:18:56:b6:cd:12:10:32:cd:1d:a2:9a:cd:e5:1c.
Please contact your system administrator.
Add correct host key in /home/user/.ssh/known_hosts to get rid of this message.
Offending key in /home/user/.ssh/known_hosts:3
RSA host key for neptune.csi.cuny.edu has changed and you have requested strict checking.
Host key verification failed. 

This reads as "the remote host information kept in your ~/.ssh/known_hosts at line #3 does not match that remote host, therefore the ssh connection cannot be established".

To get rid of this message, you need to modify your ~/.ssh/known_hosts. Open it in your favorite text editor and delete line #3. Then try to ssh again. You will be asked whether you want to save the host identification (that is, whether you want to add it to your ~/.ssh/known_hosts). Answer "yes" and proceed normally.
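
On systems with OpenSSH, the offending entry can usually also be removed without a text editor by using ssh-keygen (the host name below is only an example):

ssh-keygen -R neptune.csi.cuny.edu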

However, if you get the above ssh warning without a maintenance notice from us, it is a good idea to contact the HPC staff.

X11 Forwarding or Tunneling

X11 forwarding is required when you are logged in to a remote location but an application's GUI must be displayed locally. This could be done with Mathematica, for instance, if using the command-line interface is not acceptable.

UNIX clients

X11 forwarding or tunneling back through the 'ssh' connection can be enabled by adding the '-X' flag to the 'ssh' command. For users off the CSI campus, the following forwards X11 traffic back from the HPCC gateway system, NEPTUNE, to your desktop:

 
$ ssh -X username@neptune.csi.cuny.edu 

If you then need to log in to ATHENA to run Mathematica, you will have to forward X11 traffic again through the second connection with:

 
$ ssh -X username@athena.csi.cuny.edu 

Note that double-forwarding will be significantly slower and may make working with a GUI from outside of the CSI campus inconvenient.

WINDOWS clients

To allow X11 forwarding from a Windows-based client, CUNY recommends installing Xming, an X server for Windows. Once Xming is downloaded and installed, users should:

  • start the server
  • connect to remote machine using PuTTY with X11 forwarding enabled:

Configputty.JPG

  • once the connection is established, start your X application. For example, type in the console
    xterm
    This will give you an xterm session.

Transfer files between HPCC systems and your PC

Sometimes a user may want to edit files on a Windows PC before uploading them to the HPCC. For Windows users, the easiest way is to use WinSCP. Please note that the SCP protocol should be selected instead of other protocols.

On first use, WinSCP may warn that the server host key was not found in the cache. The user can accept the server host key, and can then upload and download files to and from the HPCC using drag and drop.

Text files (job files, scripts) prepared in Windows may contain non-visible symbols (end-of-line characters, for example) that are not understood by the HPCC systems. To avoid errors related to this, use the dos2unix command:

dos2unix text_file_prepared_in_windows.txt

GNU/Linux and MacOS users may use scp to copy files from the local host to the HPCC systems. Please refer to the scp man page for details.
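
A minimal sketch of copying files through the authentication server from the command line (the file names and user name below are placeholders):

scp ./myjob.sh user.name@neptune.csi.cuny.edu:~/
scp user.name@neptune.csi.cuny.edu:~/results.dat .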

Basic Unix/Linux Commands

UNIX Tutorial

If you are unfamiliar with UNIX or Linux, an excellent online UNIX tutorial can be found in the "User's Guide to UNIX" from the Department of Electronic Engineering, University of Surrey, United Kingdom [20]. Although that link is to a UNIX tutorial, the commands, at the user level, are essentially identical to those of Linux.

vi Usage

While other text editors exist, vi may be the most powerful text editor under UNIX/Linux. Strictly speaking, vi has three modes (insert, command, and last-line mode), but to keep things simple for beginners we present them here in two main categories: input mode and control mode.

When vi starts, it is in control mode. To add or change text, we need to shift to input mode. Pressing the ESC (Escape) key at any time returns vi to control mode.

The next several sections cover basic vi usage in control mode. The Input/Editing section below describes how to enter and edit text. For more information on vi, consult the vi man page or other resources.

Starting vi

To create a new file or edit an existing file, type "vi" followed by the filename at the shell prompt:

$> vi ''filename'' 

In vi control mode, type the following.

To save the current file:
ESC : w
Or to save the current file and exit vi:
ESC : wq

Moving the cursor

The arrow keys work in vi, but not all terminals support them. The movement keys can be used instead to move the cursor around: the "h" and "l" keys move the cursor left and right; the "j" and "k" keys move it down and up.

Here is an illustration of the movement keys:

           k (up)
 h (left)         l (right)
           j (down)

Delete, Undo

The x command deletes the current character.
The dw command deletes the current word.
The dd command deletes the current line.
The undo command u restores text deleted or changed by mistake. It can only restore the most recently deleted or changed text.

Input/Editing

In a vi session, the user must shift to Input Mode before entering text. From control mode, press i to invoke Input Mode.
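
A few other commonly used commands for entering Input Mode from control mode are listed below; press ESC when done to return to control mode.

The i command inserts text before the cursor.
The a command appends text after the cursor.
The o command opens a new line below the cursor.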

tar and gzip/bzip2

tar

On both UNIX and Linux, tar may be the most commonly used archive tool. The synopsis of tar is:

tar [option] [file...]

The most important options in tar are -c, -x, -v, -f and -z. The -c option is used to create an archive and -x to extract one. -v makes tar print information during the archive/extract process. -f names the archive file, and -z tells tar to compress or uncompress the archive with gzip. For example, to archive and compress the text file "water" to water.tar.gz:

tar czvf water.tar.gz water

To extract the text file "water" from the archive file:

tar xzvf water.tar.gz
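
To list the contents of a compressed archive without extracting it, the -t option can be used (assuming GNU tar, which handles the gzip compression transparently):

tar tzvf water.tar.gz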

gzip

There are several compression tools under Unix/Linux, such as compress. gzip (GNU zip) is a compression utility designed as a replacement for compress, and its main advantage over compress is better compression. It has been adopted by the GNU project and is now widely used.

The synopsis for gzip is:

gzip [option] [file …]

The options include -d (decompress), -l (list compressed file contents) and -v (verbose). There is no explicit compress option; compression is gzip's default action.
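
A minimal example of compressing and then decompressing the file "water":

gzip water            # produces water.gz and removes the original file
gzip -d water.gz      # restores the original file water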

bzip2

Similar in use to gzip, bzip2 is a freely available, high-quality data compressor that uses a newer compression algorithm. It typically compresses files to within 10% to 15% of the best available techniques, while being around twice as fast at compression and six times faster at decompression. bzip2 is available at:

http://www.bzip.org/

The synopsis for bzip2 is:

bzip2 [option] [file …]

The generally used options are -z (compress), -d (decompress), and -v (verbose).
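
A corresponding minimal example with bzip2:

bzip2 -z water        # produces water.bz2 and removes the original file
bzip2 -d water.bz2    # restores the original file water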


PBS Pro 11.0, Job Submission, and Queues

As the number and management needs of its systems have grown, CUNY's HPC Center decided to move to a more fully-featured, commercially supported job queueing and scheduling system (workload manager). The HPC Center selected PBS Pro as a replacement for SGE and has fully transitioned to PBS Pro on all of its resources. PBS Pro offers numerous features that improve the full, fair, and effective usage of CUNY's HPC resources. These include several distinct approaches to resource allocation and scheduling (priority-formula-based, fair-share, etc.), interfaces to control application license use, multiple methods of partitioning systems and scheduling jobs between them, and a full-featured usage analysis package (PBS Analytics), among other things. As of mid-January 2011, all CUNY HPC Center multi-node, parallel systems are running the PBS Pro batch scheduling system version 11.0 or higher.

PBS Pro Design and the Cluster Paradigm

PBS Pro places 3 distinct service daemons onto the classic head-node, compute-node(s) cluster architecture. These are the queuing service daemon (known as the Server in PBS Pro), the job scheduling daemon (known as the Scheduler), and the compute or execution host daemon (known as the MOM). The Server and Scheduler typically run on the cluster's head or login node. They receive and distribute jobs submitted by users via the interconnect to the compute nodes in a resource-intelligent fashion. The Server and Scheduler do not run on the compute nodes; only the MOM does. A MOM daemon runs on each of the compute nodes or 'execution hosts', as PBS Pro refers to them. There the MOM accepts, monitors, and manages the work delivered to it by the Scheduler. While possible, a MOM is not typically run on the cluster login node because the head node is not usually tasked for production batch work. A diagram of this basic arrangement is presented here.

PBS Daemons.jpg

A SPECIAL NOTE ABOUT THE CRAY (SALK):

A significant modification of this arrangement obtains on our Cray XE6m system, SALK. On SALK, the PBS Server and Scheduler run on one of several special service nodes (head nodes) referred to as the System DataBase node or the SDB. Users cannot log into this node. On the Cray, the node onto which users log in (salk.csi.cuny.edu) runs ONLY the PBS Pro MOM. In the Cray's case, PBS Pro views the login node as a single, very large virtual compute node with 1280 cores (or 320, 4-core, virtual nodes called numa-nodes). The single PBS Pro MOM on the login node starts and tracks all of the work scheduled on this large set of virtual nodes through Cray's Application Level Placement Scheduler (ALPS). It is ALPS that is fully aware of the Cray compute resources and its interconnect, and it is the ALPS daemon that is responsible for the physical placement of each PBS job submitted by the users onto the Cray's compute nodes. The Cray 'aprun' command functions as an intermediary between the PBS 'qsub' command and the resources it requests, and the ALPS daemon. The resources requested via 'aprun' can never be greater than those reserved by PBS through 'qsub' and the PBS script. More detail will be provided on Cray-specific PBS differences later.

PBS Pro Job Submission Modes

The PBS Pro workload management system is designed to serve as a gateway to the compute node resources of each CUNY system. All jobs (both interactive and batch) submitted through PBS Pro are tracked and placed on the system in a way that efficiently utilizes the resources while keeping potentially competing jobs out of each other's way. PBS Pro can only make optimal decisions about job placement if there is no 'back-door' production work submitted to the cluster's compute nodes without its knowledge. When operating as designed, this results in better overall throughput for the job mix and better individual job performance. As such, on CUNY's HPC systems all application runs (whether interactive or batch, development or production) should be submitted through PBS Pro (again, SALK is a minor exception where two nodes [32 cores] are provided for PBS-independent interactive job execution via 'aprun'). Furthermore, no jobs should be run outside of PBS on CUNY system head nodes. This leaves only code compilation and basic serial testing for the head node. The CUNY HPC Center staff has designed its PBS queue structure to accommodate interactive, development, production, and other classes of work. Jobs submitted to the compute nodes through other means (or run on head nodes) will be killed. Login sessions on compute nodes that do not have PBS-scheduled work from the user will be terminated.

Running Batch Jobs with PBS Pro

Two steps should be completed to submit a typical batch job for execution under PBS Pro. A user must create a job submission script that includes the sequence of commands (serial or parallel) that are to be run by the job (this step is not required for an interactive PBS Pro job). The user must also specify the resources (cores, memory, etc.) required by the job. This may be done within the script itself using PBS-specific comment lines, or may be provided as options on PBS's job submission command line, 'qsub'. These command-line options (or submit script #PBS comment-lines) typically include information on the number of cores (cpus) required, the estimated memory and CPU time required, the name of the job, and the queue into which the job is to be submitted, among other things. The submit script is submitted for execution to the PBS Server daemon through the PBS Pro 'qsub' command (e.g. 'qsub job.script'). Jobs targeted for the compute nodes that are not submitted via 'qsub' will be killed.

The submit script can contain numerous options. These are described in detail in the PBS Pro 11.1 User Guide here, or on-line with 'man qsub'. All options within the submit script to be interpreted by 'qsub' should be placed at the beginning of the script file and must be preceded by the special comment string '#PBS'. Options offered on the 'qsub' command-line override script-defined options. Some of the most important PBS Pro options are presented here:

The option to specify the name that will be given to the job (limited to 15 characters):

#PBS -N job_name

The option to specify the queue that the job will be placed in:

#PBS -q queue_name

A detailed description of the available queues is provided here.

The flag to specify the number and kind of resource chunks required by the job:

#PBS -l select=#:[resource chunk definition]

More detail on this very important option is provided in examples below.

The flag to determine how the job's resource chunks are to be distributed (placed) on the compute nodes:

#PBS -l place=[process placement instructions]

The flag to limit and indicate to PBS what a job's total cpu time requirement will be (useful for short jobs):

#PBS -l cput=HH:MM:SS

The flag to pass the head node environment variables to each compute node process:

#PBS -V 

A SPECIAL NOTE ABOUT THE CRAY (SALK):

The Cray includes an alternative and deprecated, but still functioning set of options for specifying the resources required by the job (the so-called 'mpp' resource options). These will not be covered here in the CUNY HPC Wiki, but can be read about on SALK in the pbs_resources manual pages ('man pbs_resources').

More detailed information on PBS Pro 'qsub' options is available from 'man qsub' on all CUNY HPC Center systems and is available in the PBS Pro 11.1 User Guide here.
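
As noted above, options given directly on the 'qsub' command line override the corresponding '#PBS' lines in the script. For example (the job name here is just a placeholder):

qsub -q production -N short_test job.script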

Submitting Serial (Scalar) Jobs

Serial (scalar) jobs (as opposed to multiprocessor jobs) use only one processor. For example, executing the simple UNIX command 'date' requires only one processor and simply returns the current date and time. While 'date' and most other UNIX commands would not typically be run by themselves in a batch job, one or more longer-running serial HPC applications are often run this way to avoid tying up a local workstation or as part of a parametric study of some large HPC problem space. Preparing and submitting a serial (scalar) job for batch execution requires many of the same steps that are required to submit a more complicated parallel HPC job, and therefore serial jobs serve as a good basic introduction to batch job submission in PBS Pro.

The following steps would typically be required for serial job submission using PBS Pro:

1. Create a new working directory (named serial, for example) in your home directory and move to it by executing the commands:

athena$ mkdir serial
athena$ cd serial

2. Use a text editor (CUNY HPCC suggests vi or vim) to create a submit script file (named serial_job.sh for example) and insert the following lines in it:

#!/bin/bash
#
# My serial PBS test job.
#
#PBS -q production
#PBS -N serial_job
#PBS -l select=1:ncpus=1
#PBS -l place=pack
#PBS -V

# Find out name of compute node
hostname

# Change to working directory
cd $PBS_O_WORKDIR

date

Working through the script line-by-line: the first line selects the shell that will be used to interpret the lines in the script. Everything after a plain # is treated as a regular comment. Everything after a #PBS is an option to be interpreted by the PBS Pro 'qsub' command. The -l select=1:ncpus=1 option (line) above needs some explanation. In PBS Pro, the -l select option specifies the number and kind of resource units to be associated with the job. PBS Pro refers to these resource units as chunks. The number of chunks is defined by '-l select=integer'. Here, 1 chunk has been requested, defined by '-l select=1'. To ask for 2 chunks one would write '-l select=2', and so on.

The particular resources contained in a PBS chunk are specified by the colon-separated list that follows. In this case, only one, 'ncpus', is specified. This is the number of cores (cpus) in this script's chunk. To compute the total quantity of any particular resource requested by the job script, you must multiply the number of chunks by the number of the particular resource specified. Here, to compute the total number of cores to be reserved by this job, you multiply the number of chunks (1 in this case) by the number of cores in each chunk (also 1, defined by ncpus=1), or 1 chunk x 1 ncpus = 1. As this is a serial job, this is precisely what is needed. The date command is not a parallel application and cannot take advantage of multiple processors anyway. In this case, the contents of the resource chunk include 1 core (cpu) explicitly defined, but also a set of default resources defined by the PBS administrator because they were left undefined by the user. Other resources like memory, processor time, application licenses, or disk space can also be explicitly requested in a chunk. Here, when resources are not requested explicitly, the job is given the local site's default setting for the unrequested resource. Resource defaults are inherited from those defined for the execution queue that the job is finally placed in, or from the global Server settings. More involved examples of the -l select option are given below.

When defining resource chunks several things should be kept in mind. No resource "chunk" should be defined that exceeds any of the component resources (cores, memory, disk, etc.) available on any single, physical compute node on the system to which the job is being submitted. This is because PBS resource chunks are 'atomic', and therefore each must be allocated on the system's compute nodes as a whole. If there are no physical nodes that have the resources requested in a PBS chunk, then PBS will find it impossible to run the job. That job will remain queued forever (the 'Q' state) without any error message to the user. This is one of the most common PBS job submission errors. The number and kind of chunk(s) defined by the user (the colon-separated resource list) in the '-l select' statement determine what resources PBS Pro allocates to the job, in combination with any PBS defaults.

Moving on to subsequent lines, the -l place=pack option is not strictly required for this serial job, but is included for illustration. It requests that ALL resource chunks (not just a single chunk, which is atomic) specified in the '-l select=integer' line be allocated on a single physical compute node. In this case, because we are asking for only 1 chunk with 1 core and have not specified other resources in our chunk, it will be easy to fulfill this placement request, but if the -l select option had asked for more resources in total than were available on any individual compute node in the cluster, the job would never run because it would be making a resource request impossible to fulfill. Here, again it would be queued and never run. There would be no way to pack the -l select resource request on a single node.

Unwittingly making either type of impossible-to-fulfill request (packing too many resource chunks, or defining chunks that by themselves are too large) is a common mistake in PBS Pro submit scripts created by beginners. Jobs can also be delayed because the requested resources are temporarily unavailable due to other work on the system. Both possibilities may produce the same "No Resources Available" message at the end of the 'qstat -f JID' output, confusing the user. In the next example, showing a submit script for a symmetric multiprocessing (SMP) parallel job, the issue of proper resource chunk placement comes up again.

The final PBS option in the script, '-V', instructs PBS to transfer the user's current local environment to the compute nodes where the job will be run. This is important because if the Unix paths to commands and libraries were different on the compute nodes, the script or executables that link in libraries dynamically might fail. This is another error common to users new to PBS. Lastly, in the body of the script, the directory is changed to the current working directory (the directory from which the 'qsub' command was issued) with 'cd $PBS_O_WORKDIR'. In this case, because the 'date' command will be found from wherever we execute it, this is not required, but in general when an executable and its input files are referenced without full paths in the PBS script the user must ensure the PBS batch session is started in the correct directory. By default the user is placed in their home directory on the compute nodes just as if they had logged in.

A SPECIAL NOTE ABOUT THE CRAY (SALK):

On the Cray all jobs queued to the production queue must request at least 16 cores on the '-l select' line to be run. The Cray is intended for those running jobs that can scale to larger sizes (at least 64 cores). To run this simple, serial PBS script on SALK one would have to submit it to the development queue by replacing:

#PBS -q production

with:

#PBS -q development

3. Submit the job to PBS Pro by entering the command 'qsub serial_job.sh'. If your submit file is correctly constructed, PBS Pro will respond by reporting the job's request ID (59 in this case) followed by the host name of the system that submitted the job:

athena$ qsub serial_job.sh
59.athena.csi.cuny.edu

A SPECIAL NOTE ABOUT THE CRAY (SALK):

You may find that the PBS Pro commands (qsub, qstat, qdel, etc.) and environment are not active by default on your Cray account. This can be remedied by using the Cray 'module' command, which is used to control environment variable settings such as the $PATH setting on the Cray. You can load the PBS module with:

module load pbs

4. You can check the status of your submitted job with the command 'qstat', which by itself lists all jobs that PBS is managing whether in the queued (Q), running (R), hold (H), suspended (S), or exiting (E) states. To get a full listing for a particular job you can type 'qstat -f JID'. For more detail on the PBS Pro version of 'qstat' please consult the man page with 'man qstat'. The job request ID "59" above is a unique numerical ID assigned by the PBS Pro Server to your job. JID numbers are assigned in ascending order as each user's job gets submitted to PBS. The output from the 'qstat' job monitoring command always lists jobs in job request ID order. The oldest jobs on the system are always at the top of the 'qstat' output; those most recently submitted are at the bottom.
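
For example, on a non-Cray system (the job ID is the one reported above by 'qsub'):

qstat                  # list all jobs PBS Pro is managing
qstat -f 59            # full listing for job 59
qstat -u user.name     # list only the jobs belonging to user.name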

A SPECIAL NOTE ABOUT THE CRAY (SALK):

Because on the Cray PBS must work through the Cray scheduler ALPS, the 'qstat' command does not provide as much information as it does on the HPC Center's other systems. For instance, the 'Time Used' column in the Cray's 'qstat' output will typically show no time accumulated or very little. This is because PBS can track only the time used by the process that submits your job to ALPS, not the time used by the job itself. Fortunately, Cray provides its own command for obtaining such details, 'apstat':

salk$
salk$ qstat

PBS Pro Server salk.csi.cuny.edu at CUNY CSI HPC Center
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
6519.sdb          3D_P30           kconnington       00:00:03 R qlong512        
6831.sdb          Par100           kconnington       00:00:00 R qlong128        
6852.sdb          rat_run64        haralick          00:00:00 R qlong128        
6853.sdb          case6_128        poje              00:00:00 R qlong16         
6854.sdb          case7_128        poje              00:00:00 R qlong16 
        
salk$
salk$
salk$ apstat

Compute node summary
    arch config     up    use   held  avail   down
      XT     80     79     43      3     33      1

No pending applications are present

Total placed applications: 5
Placed  Apid ResId     User   PEs Nodes    Age   State Command
       18255   246 kconning   512    32 161h32m  run   a.out
       18854   589 kconning   128     8   2h01m  run   a.out
       18889   609 haralick    16     1   0h04m  run   mpi_decomposi
       18891   610     poje    16     1   0h04m  run   pvs_128
       18893   611     poje    16     1   0h04m  run   pvs_128

Please consult the 'apstat' man page ('man apstat') for details.


5. Once the job is finished you will see the job's output (Unix std.out) in the file 'serial_job.o59' which is your job name followed by the job ID number. Errors will be written to 'serial_job.e59' (Unix std.err), if there are problems with your job. If for some reason these files cannot be written your account will receive two email messages with their contents included.

Looking at the output of our submitted serial job with the date command in it:

athena$ cat serial_job.o59 
Wed Mar 11 17:15:59 EDT 2011

The output from the 'date' command executed on one of the compute nodes is written there.

Submitting OpenMP Symmetric Multiprocessing (SMP) Parallel Jobs

Symmetric multiprocessing (SMP) requires a multiprocessor (and/or multicore) computer with a unified (physically integrated) memory architecture. SMP programs use two or more processors (cores) to complete parallel work within a single program image within a unified memory space. SMP systems were among the earliest types of multi-processor HPC architectures (pre-dating the current multicore chips by decades), but the SMP architecture supports a limited number of processors compared to a distributed memory system like an HPC cluster. Yet, CUNY's HPC cluster system compute nodes are each themselves small SMP systems with 4 or 8 processors (cores) that can work together on a program within their node's unified memory space. For example, on ATHENA, each compute node has 4 cores (2 sockets with 2 cores each) that share 8 Gbytes of memory. Each of the compute nodes on the SGI system, ANDY, has 8 Nehalem cores (2 sockets with 4 cores each) that share 24 Gbytes of memory. Each node on the newly installed Cray XE6m system, SALK, has 16 AMD Magny-Cours cores (2 sockets with 8 cores each) that share 32 Gbytes of memory. The current trend in microprocessor development away from faster clocks and toward higher on-chip core counts means that the compute nodes of next-generation HPC clusters are likely to have even higher core counts available for SMP parallel operations.

While the core count of SMP systems limits their parallel and peak performance, their integrated memory architecture makes programming them in parallel much simpler. OpenMP (not to be confused with OpenMPI) is a compiler-directive based SMP parallel programming model that is commonly used on SMP systems, and it is supported by CUNY's HPC Center compilers. OpenMP is relatively easy to learn compared to the Message Passing Interface (MPI) parallel programming model, which was designed to work even on distributed memory systems like CUNY's HPC clusters and Cray. Still, some HPC applications, both commercial and researcher-developed, are serial in design. As a first step, they can be re-written in parallel to use the unified memory space of an SMP system or a single cluster compute node.

In an earlier section, an OpenMP parallel version of the standard "Hello World" program was presented. It is a simple matter to incrementally modify the serial PBS Pro submit script presented above to run "Hello World" in SMP-parallel mode on 4 processors (cores) within a single CUNY HPC cluster compute node. The primary differences relate to reserving multiple processors (cores) and then ensuring that they are placed (packed) within a single compute node where OpenMP must function. Here is an example PBS script for an SMP-parallel job, smpjob.sh.

The following steps would typically be required for SMP job submission using PBS Pro:

1. Create a new working directory (named smp, for example) in your home directory, copy your program into it, and compile it by executing the following commands:

athena$ mkdir smp
athena$ cp ./hello_omp.c smp
athena$ cd smp
athena$ icc -openmp -o hello_omp.exe hello_omp.c

Note: On the Cray (SALK) using Cray's compilers you would use 'cc' to compile. Also, the Cray compilers interpret OpenMP directives by default (i.e. the -h omp flag is on by default).

2. Use a text editor (CUNY HPCC suggests vi or vim) to create a submit script file (named smpjob.sh for example) and insert the following lines in it:

#!/bin/bash
#
# My script to run a 4-processor SMP parallel test job
#
#PBS -q production
#PBS -N parallel_hello
#PBS -l select=1:ncpus=4:mem=7680mb
#PBS -l place=pack
#PBS -V

# Find out name of compute node
hostname

# Find out OMP thread count
echo -n "OMP thread count is: "
echo $OMP_NUM_THREADS

# Change to working directory
cd $PBS_O_WORKDIR

./hello_omp.exe

Most of the options in this SMP submit script are the same as those in the serial job script presented above.

Select the queue into which to place the job:

#PBS -q production

Specify a name for the job:

#PBS -N parallel_hello

Define the number and kind of resource chunks needed by the job:

#PBS -l select=1:ncpus=4:mem=7680mb

Describe how the processes in the job should be distributed (or not) across the compute nodes:

#PBS -l place=pack

Export local environment variables to the compute nodes from the submission (head) node:

#PBS -V 

Check to see what PBS has set the OpenMP thread count to:

echo -n "OMP thread count is: "
echo $OMP_NUM_THREADS

Change to the job's working directory (PBS Pro 'qsub' does not have a 'cwd' option):

cd $PBS_O_WORKDIR

Run the SMP-parallel "Hello World" job on 4 processors (cores):

./hello_omp.exe

In the '#PBS' options header section, the primary change is with the '-l select' option. Here, a single resource chunk is still being requested ('-l select=1'), but it is larger than before and is now defined to include 4 processors (cores) and 7.68 Gbytes of memory. The '-l place=pack' option is the same as it was in the serial script, but here it ensures that the 4 cores in this resource chunk are confined (packed) to a single compute node so that they may be used in the SMP parallel programming style. The effect of these changes is to inform the PBS Pro Server that a resource chunk with 4 processors and 7.68 Gbytes of physical memory (more when virtual memory is counted) should be placed on 1 compute node. A careful reader will note that in fact the '-l place=pack' option is unnecessary here because as stated earlier a single resource chunk MUST always fit within a single physical compute node. If we had requested the same total resources, but with '-l select=4:ncpus=1:mem=1920mb' (4 chunks of one fourth the size), then the pack option would be necessary because each chunk might otherwise be placed on a different physical node. In either case, the PBS Scheduler will attempt to place all the resources requested here on a single physical node. If it cannot do this, the job will not be run due to insufficient resources -- not insufficient across the cluster, but within any single cluster compute node.

The compute-node limits apply to the physical memory requested as well. Here, the memory requested (7680mb) is actually 7680 * 1,048,576 = 8,053,063,680 bytes. This is a good high-water maximum for the 4 processors on a compute node with 8 * 2^30 bytes of memory. Again, the size of a resource chunk should never be greater than the size of the largest physical compute node in the cluster that the job is to be run on. On a Linux system, the files /proc/cpuinfo and /proc/meminfo are good sources for determining a compute node's processor counts and memory size. PBS Pro resource chunk defaults have been configured by CUNY HPC Center staff with these values in mind.
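
For example, a quick way to check a node's core count and total physical memory on any of the Linux-based systems is:

grep -c ^processor /proc/cpuinfo
grep MemTotal /proc/meminfo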

Finally, OpenMP SMP programs are able to determine the number of processors available to them from environment variables set by PBS Pro with the help of the resources requested in the '-l select' option. No additional processor specification is needed on the './hello_omp.exe' command line as it would be with an MPI job. For this OpenMP program, PBS Pro sets the OMP_NUM_THREADS environment variable to 4 from the 'ncpus=4' setting of the '-l select' option. This ensures that 4 OMP threads will be used by the OpenMP executable, one for each core PBS Pro has reserved. Additional examples of the interplay between the '-l select' option, OpenMP, and SMP applications are provided below.

3. As with the serial job, submit this SMP job to the PBS Pro Server and the queuing system by entering the command 'qsub smpjob.sh'. If your submit file is correctly constructed, PBS Pro will respond by reporting the job's request ID (71 in this case) followed by the host name of the system that submitted the job:

qsub smpjob.sh
71.athena.csi.cuny.edu

A SPECIAL NOTE ABOUT THE CRAY (SALK):

All jobs run on the Cray's (SALK) compute nodes must be started with Cray's aprun command. This applies to SMP parallel jobs and in this case requires the use of 'aprun' command-line options specific to SMP type work. Mapping PBS Pro resource reservation definitions from the '-l select' line onto 'aprun' command-line options can be confusing. Users of the Cray should read the 'aprun' man page ('man aprun') carefully, paying particular attention to the multiple examples presented near the end. The resources that PBS reserves on the '-l select' line define a limit that bounds what can be requested using 'aprun.' When an 'aprun' request exceeds those PBS reservation boundaries, the user will receive a message of the form:

apsched: claim exceeds reservation's node-count

which indicates that what is being requested via 'aprun' exceeds what has been reserved by PBS on the '-l select' line. Any PBS-reserved resource (ncpus, memory, etc.) can potentially generate this error message.

For this OpenMP job on 4 cores, the correct Cray 'aprun' command would be:

aprun -n 1 -d 4 -N 1 ./hello_omp.exe

Here, the '-n 1' option defines the total number of Cray processing elements (PEs) to be used by the job. This corresponds to the number of PBS chunks specified and cannot exceed that number. The '-d 4' option defines the number of threads to be used per PE. This corresponds to the number of cpus (cores) per PBS chunk and cannot exceed that number. The '-N 1' option defines the number of Cray PEs to be used per Cray physical compute node, in this case 1 out of the 16 cores available per node. This number cannot exceed the number of PEs defined by the '-n' option.

A Cray PE is a software concept, NOT a hardware concept, and corresponds to a distinct Linux process with its own memory space. For instance, an MPI job with 8 ranks would have 8 PEs assigned to it because each MPI rank (process) has its own memory space. A thread is also a software concept, NOT a hardware concept, and corresponds to an independent piece of parallel work within a PE or SMP application. These software concepts are mapped to Cray compute node hardware by the 'aprun' command. Typically, the total number of cores requested by 'aprun' is the product of the number of PEs requested by the '-n' option and the number of threads requested by the '-d' option. This is sometimes referred to as the 'width-by-depth' of the job. This number cannot exceed the number of PBS chunks multiplied by the number of cpus (cores) per chunk defined in the '-l select' line of your PBS script. In our case here, that total is 4, and the 'aprun' command asks for exactly as many cores (4) as PBS has reserved. More detail on the relationship between PBS resource reservations made with '-l select' and the command-line options to the Cray 'aprun' command is provided below.
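
As an illustration of the width-by-depth arithmetic (a sketch only; the executable name is a placeholder), a hybrid job with 4 PEs each running 4 OpenMP threads uses 4 x 4 = 16 cores and could be reserved and launched with:

#PBS -l select=4:ncpus=4:mem=8000mb

aprun -n 4 -d 4 -N 1 ./hybrid.exe

Here the 'aprun' claim (4 PEs x 4 threads = 16 cores) exactly matches the 4 chunks of 4 cores reserved by PBS, and '-N 1' places one PE on each of 4 physical compute nodes.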

Submitting MPI Distributed Memory Parallel Jobs

Taking an incremental approach one step further, just a few modifications to the '#PBS' options in the SMP script above and another to the execution line are required to create a working PBS script for an MPI distributed memory parallel job. Distributed memory parallel programs are by definition designed to make parallel use of an arbitrary collection of interconnected processors (cores) -- whether they happen to be within a single compute node as in the SMP job above or on opposite ends of a cluster's interconnecting switch. Such applications are referred to as distributed because (unlike SMP applications) each cooperating process has its own completely distinct memory space that might reside on any node in a distributed memory system. Message Passing Interface (MPI) communication is accomplished through two-way, message passing between two or more processes managing their own distinct memory spaces. Because of its ability to run on almost any parallel computing architecture, MPI has become the de facto standard parallel programming model for distributed memory and other parallel computing architectures. MPI applications have been shown to scale up to thousands of processors.

In an earlier section above, an example MPI distributed-memory parallel version of the standard "Hello World" program was presented. It is a simple matter to incrementally modify the SMP-parallel PBS submit script in the prior section to run this MPI "Hello World" program on 16 processors (cores). The changes required in the distributed memory parallel PBS script relate to reserving 16 PBS resource chunks large enough for the needs of each MPI process (rank), but small enough to be placed freely on the physical compute nodes of the cluster with enough unused resources (cores, memory, etc.) to hold them. The notion of 'free' placement in PBS allows for putting resource chunks wherever they will fit, including each chunk on separate nodes or some chunks on the same node. Here is an example PBS script for such a distributed-memory, MPI parallel job, dparallel_job.sh.

The following steps would typically be required to submit an MPI program to the CUNY HPC Center PBS Pro batch scheduling system:

1. Create a new sub-directory (named "dparallel" for example) in your home directory, copy your program into it, and compile it by executing the following commands:

andy$ mkdir dparallel
andy$ cp ./hello_mpi.c dparallel
andy$ cd dparallel
andy$ mpicc -o hello_mpi.exe hello_mpi.c

Note: On the Cray (SALK) you would compile the program using 'cc'. On the Cray the MPI library is linked in by default.

2. Use a text editor (CUNY HPCC suggests vi or vim) to create a PBS Pro submit script file (named dparallel_job.sh for example) and insert the following lines in it:

#!/bin/bash
#
# Simple distributed memory MPI PBS Pro batch job
#
#PBS -q production
#PBS -N dparallel_job
#PBS -l select=16:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of primary compute node
hostname

# Change to working directory
cd $PBS_O_WORKDIR

# Start the job on 16 processors using 'mpirun' (different on Cray)
mpirun -np 16 -machinefile $PBS_NODEFILE ./hello_mpi.exe

3. Submit the job to the PBS Pro Server using 'qsub dparallel_job.sh'.

qsub dparallel_job.sh
65.athena.csi.cuny.edu

Here, 65 is the PBS Pro job request ID and 'athena.csi.cuny.edu' is the name of the system from which the job was submitted. (Note: This is not always the system on which the job is run. PBS Pro can be configured to allow users to queue jobs up on any system in a resource grid. At this time, the CUNY HPC Center has system-local scheduling only.) As with the examples above, this MPI job's status can be checked with the command 'qstat', or for a full listing of job 65 'qstat -f 65'.

4. When the job completes its output will be written to the file 'dparallel_job.o65' (errors will go to dparallel_job.e65):

andy$
andy$ cat dparallel_job.o65
Hello world from process 0 of 16
Hello world from process 1 of 16
Hello world from process 4 of 16
Hello world from process 12 of 16
Hello world from process 8 of 16
Hello world from process 2 of 16
Hello world from process 5 of 16
Hello world from process 13 of 16
Hello world from process 9 of 16
Hello world from process 3 of 16
Hello world from process 6 of 16
Hello world from process 14 of 16
Hello world from process 10 of 16
Hello world from process 7 of 16
Hello world from process 15 of 16
Hello world from process 11 of 16

The output file contains the messages generated by each of the 16 MPI processes requested by the 'mpirun' command.

What are the differences in this MPI distributed memory PBS script? First, in the '-l select' line the number of chunks has been set to the number of PEs (and because in this job there is a chunk-to-core equivalency, the number of cores as well) that will be used, 16 in this case. The composition of each chunk has also changed from the SMP job. The distributed memory script defines chunks of 1 core (ncpus=1) and about 2 GBytes of memory each (mem=1920mb). We leave it up to the reader to compute the exact amount being requested using the 2^20 multiplier. PBS Pro lists those compute nodes it has allocated to the job in a file whose PATH is given in the $PBS_NODEFILE variable. Here it is provided as an argument to the 'mpirun' command's '-machinefile' option and is used by the 'mpirun' command to select the compute nodes PBS has reserved for the job.

The only other real difference in the PBS options section is with the '-l place' option. The placement option here has been set to 'free', which allows the PBS Pro scheduler to place the 16 chunks requested, one for each MPI process, on compute nodes (execution hosts in PBS Pro terms) with the required space and the lowest load. This typically results in a distribution that is not even partially packed, as PBS seeks out one node after another that is least utilized. The '-l place' option could also have been set to 'scatter', which would force placement on physically distinct compute nodes regardless of load, if there are enough with the required resources available. In either case, when requesting 16 chunks totaling 16 cores, using 'pack' here would not work on ANDY, because there is NO compute node on ANDY with 16 cores or that much memory. At this point, readers may be asking themselves the question, "How do I get performance-efficient packing of my MPI processes?"

Clearly, having more MPI processes on the same compute node should reduce the time to send messages, at least between those processes, and reduce total communication time. To get closer to a communication-efficient packing scheme, the '-l place=pack' option should be used along with the 'mpiprocs=' resource on the '-l select' line. For example:

#PBS -l select=4:ncpus=4:mem=7680mb:mpiprocs=4
#PBS -l place=pack

This combination asks the scheduler to pack 4 resource chunks of 4 cpus each (16 cpus in total) onto 4 compute nodes. The 'mpiprocs=4' variable requests that each block of 4 processors be scheduled on the same compute node (execution host). This is accomplished by creating a $PBS_NODEFILE file that repeats the name of each node assigned to each of the 4 chunks, 4 times, as follows:

andy$ cat $PBS_NODEFILE
r1i0n3
r1i0n3
r1i0n3
r1i0n3
r1i0n10
r1i0n10
r1i0n10
r1i0n10
r1i1n5
r1i1n5
r1i1n5
r1i1n5
r1i1n6
r1i1n6
r1i1n6
r1i1n6

As assigned, on ANDY this job would run on 4 physical compute nodes (3, 10, 5, and 6) and would use half (4) of the available processors (cores) on each node. Other combinations of chunks and processors are possible and might be preferred on machines with more or fewer cores per node. At the CUNY HPC Center both ANDY and BOB have 8 processors per compute node. SALK (the Cray) has 16 processors (cores). ATHENA has 4 processors per node. The chunk-count determines the number of resource pieces that PBS Pro must find a place for, and the product of the chunk-count and the 'ncpus' resource variable determines the total number of processors (cores) reserved by PBS for the job. How would you change the above 16 processor (core) job to be closely packed (2 chunks of 8 cores each) to be run on ANDY or BOB?
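
One possible answer, following the same pattern (a sketch; the memory figure assumes the 1920 Mbytes per core used above):

#PBS -l select=2:ncpus=8:mem=15360mb:mpiprocs=8
#PBS -l place=free

Because each 8-core chunk already fills an entire ANDY or BOB compute node, the 8 MPI processes in each chunk necessarily end up on the same node regardless of the placement setting.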

While such close packing schemes may be recommended for reducing your job's communication time and speeding its execution once started by PBS, there is a potential down-side when the system that you are submitting to is busy. A busy system is unlikely to have completely unassigned compute nodes. In such a case, a job submitted with a 'close packing' approach will be queued until unassigned nodes become available. On a busy system the wait could be a significant amount of time. As a result, more wall-clock time than might be saved in processor time by packing the job might be wasted. In general, submitting your work with '-l place=free' gives the PBS Scheduler the most flexibility in placing your job and is the best choice for moving your job as quickly as possible from the PBS queued state (Q) to the running state (R) on a busy system.

A SPECIAL NOTE ABOUT THE CRAY (SALK):

To run a similar MPI distributed memory job using PBS on the Cray would require several modifications. Here is the script modified for submission on the Cray:

#!/bin/bash
#
# Simple distributed memory MPI PBS Pro batch job
#
#PBS -q production
#PBS -N dparallel_job
#PBS -l select=16:ncpus=1:mem=2000mb
#PBS -l place=free
#PBS -V

# Find out name of primary compute node
hostname

# Change to working directory
cd $PBS_O_WORKDIR

# Start the job on 16 processors using 'aprun' (on non-Cray systems 'mpirun' is used instead)
aprun -n 16 -d 1 -N 16 ./hello_mpi.exe

First, the PBS-reserved memory per chunk has to be raised because on the Cray 'aprun' requests 2 Gbytes per core by default which is also the memory per core on each Cray compute node. Without this change, Cray's 'apsched' daemon produces the following error message:

apsched: claim exceeds reservation's memory

which indicates that 'aprun' is requesting more memory than the script's PBS-defined resource chunks have reserved. There are ways of asking for more or less memory per core using options to the 'aprun' command, but generally on the Cray 2000 Mbytes (2 Gbytes) should be requested on the PBS '-l select' line per cpu (core) in a resource chunk. In this case, there is 1 cpu in our PBS resource chunk and therefore we need only 2000 Mbytes (mem=2000mb) of memory.

In addition, on the Cray the 'aprun' command must be used to submit the job instead of 'mpirun'. Here, 'aprun' requests 16 PEs ('-n 16') with just 1 thread per PE ('-d 1'), which is the default, and requests 16 PEs per Cray compute node ('-N 16'). This 'aprun' request falls within the boundaries of the resources reserved in the PBS '-l select' line, and therefore is formally scheduled by the ALPS daemon on the Cray. As defined, this request will run on just a single Cray compute node.

Submitting GPU-Accelerated Data Parallel Jobs

The addition of GPU acceleration hardware to the compute nodes on the GPU side of ANDY (ANDY2) and also on ZEUS (compute nodes compute-0-8 and compute-0-9) adds a completely new layer to the parallel programming architecture of these systems, augmenting the OpenMP and MPI parallel programming alternatives described above. With one GPU per socket on each ANDY2 node (two per compute node, 96 in all), users may run CPU-serial jobs that obtain performance acceleration by staging parallel work to the node's attached GPU. Additional combinations are possible. A single program might combine OpenMP symmetric multi-processor (SMP) parallelism with GPU parallel acceleration in which each OpenMP thread controls its own locally attached GPU. Alternatively, users might combine MPI's distributed memory, message-passing CPU parallelism across nodes with GPU parallel acceleration within nodes. Combining all three parallel programming models (MPI, OpenMP, and GPU parallelism) in a single program is even possible, although not often dictated by program requirements.

In a manner similar to OpenMP SMP parallelism, GPU-acceleration takes advantage of an application's loop-level data parallelism by creating a separate execution thread for each independent iteration in single or nested looping structures. It then distributes those threads among the GPU's many small-footprint processors or cores (each HPC Center Fermi GPU has 448 cores; each Tesla has 240 cores). Only loops whose iterations are fully independent can be parallelized on a GPU. Loops with dependencies can sometimes be restructured to eliminate those dependencies and allow for GPU processing. Many of the old concepts used to optimize code loops for vector computers are directly applicable to GPU data parallel acceleration and GPU programming.

While vector systems process loop iteration data in long, strip-mined, loop-iteration-based 'vectors', GPUs process the same loop iteration data in wide, processor-block-mapped 'warps'. Vector systems execute a vector's worth (32, 64, 128) of loop iterations with a single pipelined vector instruction, while GPUs generate a separate instruction sequence (a thread) for each loop iteration and schedule them in blocks called 'warps'. A GPU 'warp' (32 iterations wide) is analogous to a vector. When multiple GPUs are involved (GPU-MPI or GPU-OpenMP programs), loop iteration data is further divided across a collection of GPUs. Programmers are reminded that using GPUs requires them to negotiate the distribution of their data across another level in the memory hierarchy, because a GPU's memory and processing power are accessible only through the attached GPU's (motherboard) PCI Express bus.

Taking either of the parallel batch submission scripts above as a starting point, just a few modifications to the '#PBS' options and the command sequence are required to create a single-CPU, GPU-accelerated data parallel script for PBS Pro. A few additional changes would be required to create a combined GPU-OpenMP or GPU-MPI parallel PBS script. At the CUNY HPC Center only ANDY2 and ZEUS have GPU capability. ANDY2 can be used for either development or production GPU runs using one or more of its 96 NVIDIA Fermi GPUs; ZEUS should only be used for development using one or more of its 4 Tesla GPUs.

Follow the instructions here to submit a basic serial CPU program with GPU acceleration written in CUDA C to the CUNY HPC Center's PBS Pro batch scheduling system:

1. Create a new sub-directory (named "gpuparallel" for example) in your home directory and move into it by executing the following commands:

andy$ mkdir gpuparallel
andy$ cd gpuparallel

2. Use a text editor (CUNY HPCC suggests vi or vim) to create a file for the following CUDA C Host and Device code (simple3.cu in this example) and cut-and-paste these lines into it:

#include <stdio.h>

extern __global__ void kernel(int *d_a, int dimx, int dimy);

/* -------- CPU or HOST Code --------- */

int main(int argc, char *argv[])
{
   int dimx = 16;
   int dimy = 16;
   int num_bytes = dimx * dimy * sizeof(int);

   int *d_a = 0, *h_a = 0; // device and host pointers

   h_a = (int *) malloc(num_bytes);
   cudaMalloc( (void**) &d_a, num_bytes);

   if( 0 == h_a || 0 == d_a ) {
       printf("couldn't allocate memory\n"); return 1;
   }

   cudaMemset(d_a, 0, num_bytes);

   dim3 grid, block;
   block.x = 4;
   block.y = 4;
   grid.x = dimx/block.x;
   grid.y = dimy/block.y;

   kernel<<<grid,block>>>(d_a, dimx, dimy);

   cudaMemcpy(h_a,d_a,num_bytes,cudaMemcpyDeviceToHost);

   for(int row = 0; row < dimy; row++) {
      for(int col = 0; col < dimx; col++) {
         printf("%d", h_a[row*dimx+col]);
      }
      printf("\n");
   }

   free(h_a);
   cudaFree(d_a);

   return 0;

}

/* --------  GPU  or DEVICE Code -------- */

__global__ void kernel(int *a, int dimx, int dimy)
{
   int ix = blockIdx.x*blockDim.x + threadIdx.x;
   int iy = blockIdx.y*blockDim.y + threadIdx.y;
   int idx = iy * dimx + ix;

   a[idx] = a[idx] + 1;
}

3. Compile this basic CUDA C code using 'nvcc', NVIDIA's CUDA C compiler. The default login environment on ANDY and ZEUS has been set up to find 'nvcc' and all standard libraries needed to complete basic compilations. NVIDIA distributes additional libraries with its Software Development Kit (SDK) that may be useful for achieving full performance on production applications. The current default version of the CUDA Programming environment installed on both ANDY and ZEUS is version 3.2.

andy$ nvcc -o ./simple3.exe ./simple3.cu

4. Use a text editor (CUNY HPCC suggests vi or vim) to create a PBS Pro submit script (named gpu_job.sh for example) and insert the lines below. This script selects 1 cpu and 1 companion GPU, on which the CUDA Host (CPU) code and Device (GPU) code listed above run, respectively. The simple3.exe file is a mixed binary that includes both CPU and GPU code, and everything needed for the CUDA runtime environment to negotiate whatever CPU-GPU cross-bus interaction is required to complete its execution. This script is designed to run on ANDY. On ZEUS the queue should be changed to '-q development_gpu' and the accelerator type should be changed to 'accel=tesla'.

#!/bin/bash
#
# Simple 1 CPU, 1 GPU PBS Pro batch job
#
#PBS -q production_gpu
#PBS -N gpu_job
#PBS -l select=1:ncpus=1:ngpus=1:accel=fermi
#PBS -l place=free
#PBS -V

# Find out which compute node the job is using
hostname

# Change to working directory
cd $PBS_O_WORKDIR

# Run executable on a single node using 1 CPU and 1 GPU.
./simple3.exe


5. Submit the job to the PBS Pro Server using 'qsub gpu_job.sh'. You will then get the message:

andy$ qsub gpu_job.sh
551.service0.csi.cuny.edu

Note: 'service0' is the SGI-local name of ANDY's head or login node.

Here, 551 is the PBS Pro job request ID and 'service0.csi.cuny.edu' is the name of the system from which the job was submitted. (Note: This is not always the system on which the job is run. PBS Pro can be configured to allow users to queue jobs up on any system in a resource grid. At this time, the CUNY HPC Center has system-local scheduling only.) As with the examples above, this GPU job's status can be checked with the command 'qstat', or for a full listing of job 551 'qstat -f 551'.

6. When the job completes its output will be written to the file 'gpu_job.o551':

$
$cat gpu_job.o551

gpute-17
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111

Any errors for this job would have been written to 'gpu_job.e551'.

The output file (transferred to and printed on the Host) contains the integer 1 assigned to each element of the integer array a[]. At the top it prints the name of the compute node PBS assigned to the job, 'gpute-17'. On ANDY2, the nodes with GPUs attached are named 'gpute-0' through 'gpute-47', 48 nodes in all, each with 2 Fermi GPUs.

What are the differences from the MPI script-file above? First, because this job is intended for execution on the GPU side of ANDY (ANDY2), the queue to which the job is submitted is not simply 'production', but rather 'production_gpu'. This designation for production-GPU work guarantees that compute nodes on ANDY2 will be selected by PBS Pro rather than those on ANDY1. (Note: On ZEUS, there is no 'production_gpu' queue. Instead use its 'development_gpu' queue).

Next, in the '-l select' line a single chunk is requested with 1 cpu ('ncpus=1'), 1 GPU ('ngpus=1'), and a Fermi type accelerator ('accel=fermi'). This chunk provides the required CPU-GPU pair to complete the job and guarantees that it will be run on a node with a Fermi GPU (on ZEUS one would use 'accel=tesla'). Because all the parallelism in this job is provided by a single GPU, this job will be routed from the 'production_gpu' routing queue to the 'qserial_gpu' execution queue. The 'qserial_gpu' queue will be displayed when the status of the running job is checked using the 'qstat' command as shown here:

andy$ qstat

PBS Pro Server andy.csi.cuny.edu at CUNY CSI HPC Center
Job id           Name           User               Time Use  S     Queue
-----------      --------       -----              -------   --    ---------     
551.service0     gpu_job        andy.grove          0:33      R     qserial_gpu     

The placement option in the script has been set to 'free' which allows the PBS Pro scheduler to place this CPU-GPU chunk anywhere on the ANDY2 side of the ANDY system (or on any compute node with GPUs on ZEUS).
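
For jobs that use more than one CPU-GPU pair (an MPI-GPU code, for example), the same pattern extends by raising the chunk count. A hedged sketch of the resource lines for a 4-GPU run on ANDY2 follows; an MPI launch line would also be needed in the command section, as in the MPI examples earlier in this document:

#PBS -q production_gpu
#PBS -l select=4:ncpus=1:ngpus=1:accel=fermi
#PBS -l place=free

Depending on the processor hours requested, a job of this size would be routed to one of the smaller GPU execution queues described later in this document.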

HPC Center staff has also created example OpenMP-GPU and MPI-GPU codes, makefiles, and PBS scripts that can be requested by sending an email to 'hpchelp@csi.cuny.edu' (very cool stuff)!

Submitting 'Interactive' Batch Jobs

PBS Pro provides a special kind of batch option called interactive-batch. An interactive-batch job is treated just like a regular batch job (in that it is queued up, and has to wait for PBS Pro to provide it with resources before it can run); however, once the resources are provided, the user's terminal input and output are connected to the job in a manner similar to an interactive session. The user is interactively logged into a "master" execution host (compute node), and its resources and the rest of the resources reserved (processors and otherwise) by PBS are held for the interactive job's duration.

Interactive-batch jobs can take a script file like regular batch jobs, but only the '#PBS' options in the header are read. All script-file commands are ignored. It is assumed that the user will supply their commands interactively after the session has started on the assigned execution host. As always, the '#PBS' options can also be supplied on the 'qsub' command-line. All PBS Pro interactive-batch jobs must include the '-I' option on the 'qsub' command line. The following example starts a 4 processor interactive-batch session packed onto a single compute node (compute-0-2 here) from which a 4 processor MPI job is run. One should note that while the resources requested by the interactive-batch job are reserved for the duration of the job, the user does not have to use them all with each interactive job submission.

bob$qsub -I -q interactive -N intjob -V -l select=4:ncpus=1:mem=1920mb -l place=pack
  qsub: waiting for job 73.bob.csi.cuny.edu to start
  qsub: job 73.bob.csi.cuny.edu ready
compute-0-2$
compute-0-2$
compute-0-2$cd dparallel
compute-0-2$cat $PBS_NODEFILE

compute-0-2
compute-0-2
compute-0-2
compute-0-2

compute-0-2$hostname
compute-0-2
$
$mpirun -np 4 -machinefile $PBS_NODEFILE ./hello_mpi.exe
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
$
$CNTRL^D

$hostname
bob.csi.cuny.edu
$
$

The 'qsub' options (provided on the command-line in this case) define all that is needed for this PBS interactive-batch job. Here the '-I' option must be provided on the command-line, and the CUNY HPCC's 'interactive' queue must be selected. This queue has compute nodes dedicated for interactive work that cannot be used by production batch jobs. When the requested resources are found by the PBS Pro Scheduler, the user is logged into one of those compute nodes, the shell prompt returns, and a $PBS_NODEFILE is created. The user must change to the directory from which they wish to run the job, just as in a regular batch script. This is all laid out in the session above, where a 4 processor job is started directly with the 'mpirun' command from the shell prompt. It runs and returns its output to the terminal. More such jobs could be run if desired, although there is a cpu time and wall-clock time limit imposed on interactive sessions. The defaults are 8 and 16 minutes respectively; the maximums are 16 and 32 minutes. An interactive-batch session is terminated by simply logging out of the execution host that PBS provided to the session by typing a CNTRL^D. This logs the user out of the compute node and returns them to the head node where the job was submitted.

Through the 'interactive' queue, CUNY HPCC has reserved compute node resources for interactive-batch jobs only. The 'interactive' queue, along with the 'development' queue described below, has been created to ensure that some system resources are always available for code development. More about these and CUNY HPCC's other PBS Pro queues is provided in a subsequent section.

A SPECIAL NOTE ABOUT THE CRAY (SALK):

The sequence above will work on the Cray if the following changes are made. Change the qsub command to:

qsub -I -q interactive -N intjob -V -l select=4:ncpus=1:mem=2000mb -l place=pack

where the memory requested per cpu (core) is 2000 Mbytes. And replace the 'mpirun' command with:

aprun -n 4 -d 1 -N 4 ./hello_mpi.exe

Because the Cray's login node (salk.csi.cuny.edu) functions as a master compute node or execution host with access to all the compute node resources managed by Cray ALPS daemon, the 'hostname' command will always return the name 'salk.csi.cuny.edu' on the Cray when run from any PBS script. Neither users nor PBS can directly access the Cray's compute nodes.

One last point relating to Cray (SALK) interactive work. On the Cray only, interactive jobs may be run directly from the default command-line using 'aprun', because the HPC Center has reserved 2 Cray compute nodes (2 x 16 cores) for interactive use completely outside of PBS. This compute-node privilege is not to be confused with running jobs directly on the Cray head node (a service node), which is forbidden here as it is on all other CUNY HPC systems; such a head-node job would not use the 'aprun' command to launch it.

An 'aprun' initiated Cray-interactive job would be run simply with:

salk$ aprun -n 4 -d 1 -N 4 ./hello_mpi.exe

without ANY interaction with PBS. Such jobs are limited in size (32 cores) and will be killed if they run for more than 30 wall-clock minutes.

More on PBS Pro resource 'chunks' and the '-l select' Option

With examples of options to the 'qsub' command and how to submit jobs to the PBS Pro Server presented above, the more complete description and additional examples provided here will be easier to understand. The general form for specifying PBS resource 'chunks' to be allocated on the compute nodes with the '-l select' option, is as follows:

-l select=[N:]chunk_type1 + [N:]chunk_type2 + ...

Here, the values of N give the number of each type of chunk requested, and each chunk type is defined with a collection of node-specific, resource-attribute assignments using the '=' sign. Each attribute is separated from the next by a colon, as in:

ncpus=2:mem=1920mb:arch=linux:switch=qdr ...

While it was not seen in the examples above, more than one type of PBS resource chunk can be defined within a single '-l select' option. Additional distinct types are appended with the '+' sign as shown just above. Using multiple chunk types should be a relatively infrequent occurrence at CUNY HPCC because our nodes are in general physically uniform. There are many kinds of node resource attributes (for more detail enter 'man pbs_resources'). Many are built into PBS Pro; some may be site-defined. They can have a variety of data types (long, float, boolean, size, etc.), and they can be consumed by the running jobs when they are requested ('ncpus' for instance) or just used to define a destination. More detail on the node-specific attributes used to define chunks can also be found with 'man pbs_node_attributes'.

Once the number and type of chunks are defined, the PBS Pro scheduler maps the requested resource chunks onto the available system resources. If the system can physically fulfill the request, and no other jobs are already using the resources requested, the job will be run. Jobs with resource requests that are physically impossible to fulfill will never run, although they can be queued with no warning to the user about the request's impossibility. Those that cannot be fulfilled because other jobs are using the resources will be queued and eventually run. To determine exactly what resources you have been given, whether your job is running or not, and the reason why not if it is not, generate your job's full description with the 'qstat -f JID' command. (Note: When no resource is requested by the user, the default values set for the queue that the user's job ends up in are applied to the job. These may not be exactly what is required by the job or wanted by the user.)

Below are a number of additional examples of '-l select' resource requests with an explanation of what is being requested. These are more complicated, synthetic cases to give users the idea of what is possible. They are not designed to apply to any specific CUNY HPC PBS Pro system, and they will not necessarily find an obvious use exactly as provided here. They should give you an idea of the variety of resource requests possible with the '-l select' option. Users are directed to the PBS Pro 11.1 User Guide here for additional examples and a full description of PBS Pro from the user's perspective.

Example 1:

-l select=2:ncpus=1:mem=10gb:arch=linux+3:ncpus=2:mem=8gb:arch=solaris

This job requests two chunk types, 2 of one type and 3 of the other. The first chunk type is to be placed on Linux compute nodes, and the second type on Solaris compute nodes. The first chunk requires nodes with at least 1 processor and 10 GBytes of memory. The second chunk requires nodes with at least 2 processors and 8 GBytes of memory.

Example 2:

-l select=4:ncpus=4:bigmem=true

This job requests 4 chunks of the same type, each with 4 processors and each with a site-specific boolean attribute of 'bigmem'. Nodes with 4 processors available that have been earmarked by the site as large memory nodes would be selected. A total of 16 processors would be allocated by PBS.

Example 3:

-l select=4:ncpus=2:lscratch=250gb
-l place=pack

This job also asks for 4 resource chunks of the same type. Each node selected by PBS must have 2 processors available and 250 Gbytes of node-local scratch space. The job requests a total of 8 processors and then asks for the resources to be packed on a single node. Unless the system that this job is submitted to has nodes with 8 cores and a total of 1 TByte of local storage (4 x 250 GBytes), this job will remain queued indefinitely. The 'lscratch' resource attribute is site-local, and its availability would be determined by a run-time script that checks available disk space. The CUNY HPC Center has defined a local scratch disk resource on ANDY and ZEUS.

Example 4:

-l select=3:ncpus=2:mpiprocs=2

This job requests 3 identical resource chunks, each with 2 processors, for 6 in total. The 'mpiprocs' resource attribute affects how the $PBS_NODEFILE is constructed. It ensures that the $PBS_NODEFILE generated includes two entries for each of the three chunks (and probably nodes) allocated, so that two MPI processes will be run per node. The $PBS_NODEFILE generated from this job script would contain something like this:

node10
node10
node11
node11
node12
node12

Without the 'mpiprocs' attribute there would be only three entries in the file, one for each execution host.
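
A sketch of how the resulting $PBS_NODEFILE would typically be consumed on the non-Cray systems (the executable name here is a placeholder): the MPI process count should match the six entries in the file.

mpirun -np 6 -machinefile $PBS_NODEFILE ./my_mpi.exe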

Example 5:

-l select=1:ncpus=1:mem=7680mb
-l place=pack:excl

This job requests just 1 chunk for a serial job that requires a large amount of memory. The '-l place=pack:excl' option ensures that when the chunk is allocated, no other job will be able to allocate resource chunks on that node -- even if there are additional resources available. This will perhaps idle some processors on that node, but will ensure that its memory is entirely available to this job. The 'pack:excl' stands for pack exclusively (i.e. do not allow any other job on that node). By default, CUNY HPC system nodes have been configured to allow their resources to be divided among multiple jobs. This means that a given node may have more than one user job running on it. An exception is the Cray, where this is prevented by Cray's ALPS scheduler in order to reduce interrupt-driven load imbalance for jobs with high core counts.

Example 6:

-l select=4:ncpus=2:ompthreads=4
-l place=pack:excl

This job is configured to run a hybrid MPI and OpenMP code. The number of processors explicitly requested in total is 8 (the chunk number [4] times the ncpus number [2]). The $PBS_NODEFILE would include 4 compute nodes, one for each chunk (also, one for each of just 4 MPI processes). Assigned nodes would need to have at least 2 processors on them, but could have more. If they had, for instance, 4 processors (cores), then OpenMP would start 4 threads, one per physical processor. If they had only 2, then OpenMP would still run 4 threads on each node, but the 4 threads would compete for the 2 physical cores. This set of options would suit a hyper-threaded processor like the Intel Nehalem that is used on CUNY's ANDY cluster system. (Note: The CUNY HPC Center has limited the number of processes that may run on ANDY's nodes to 8, the same as the number of physical cores). The '-l place=pack:excl' option again ensures that no other jobs will be placed there to compete with this job's 4 OpenMP (SMP) threads. A minimal sketch of the command section for such a script is shown below.
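
The sketch below assumes a placeholder executable name; on the Cray, 'aprun' would be used instead of 'mpirun':

# Change to working directory
cd $PBS_O_WORKDIR

# 4 OpenMP threads per MPI process; PBS normally sets OMP_NUM_THREADS
# from 'ompthreads=4', but it is exported here for clarity
export OMP_NUM_THREADS=4

# One MPI process per chunk (node); $PBS_NODEFILE has 4 entries
mpirun -np 4 -machinefile $PBS_NODEFILE ./hybrid.exe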

A SPECIAL NOTE ABOUT THE CRAY (SALK):

PBS job submission on the Cray is more complicated because of the requirement to use 'aprun' (a relatively complicated utility) to execute your applications and to ensure that the resources 'aprun' requests do not exceed those that PBS has reserved for the job in the '-l select' line. Reading the 'aprun' man page carefully, including the examples at the end, will reduce your frustration in the long run and is highly recommended by HPC Center staff. A few rules of thumb relating the '-l select' line PBS options to the 'aprun' command-line options are provided here to help with basic job submission.

Rule 0:

>> Jobs submitted to the production queue ('-q production') cannot request fewer than 16 PBS resource chunks (PEs). <<

Jobs requiring fewer than 16 chunks (PEs) must be submitted to the 'development' queue. If you try to submit such a job, you will get the following error message:

qsub: Job rejected by all possible destinations

Rule 1:

>> The number of PEs (often equivalent to cores, but not always) per physical compute node, set with the 'aprun' '-N' option, should never be greater than 16 (the maximum per node on SALK) or greater than the total number of PEs requested by the job via the '-n' option. <<

If you try to submit such a job, you will get the following error message:

apsched: -N value cannot exceed largest node size

or else:

aprun: -N cannot exceed -n

Rule 2:

>> The total number of PEs set by the 'aprun' '-n' option and to be used by the application should never exceed the total number of PBS chunks requested in the '-l select=' statement. <<

If you try to submit such a job, you will get the following error message:

apsched: claim exceeds reservation's node-count

Rule 3:

>> The product of the number of PEs requested by the 'aprun' '-n' option and the number of threads requested with the 'aprun' '-d' option should never exceed the product of the number of PBS chunks, '-l select=', and the number of cores per chunk, 'ncpus='. <<

If you try to submit such a job, you will get the following error message:

apsched: claim exceeds reservation's CPUs

Rule 4:

>> By default, the 'aprun' command requests 2 Gbytes (2000 Mbytes) of memory per cpu (core). You should set your PBS '-l select' per-cpu (core) memory resource to 'mem=2000mb' to match this default. <<

This is a per cpu (core) requirement. If your PBS resource chunks have multiple cpus (cores), then the memory requested per chunk should be the appropriate multiple. If you set your memory resource in the PBS '-l select' option to less than 2 Gbytes per cpu (core), you will get the following error message:

apsched: claim exceeds reservation's memory

If you have difficulty, or need to do something special like carefully placing your job for performance on the Cray's 2D torus interconnect or using more than 2 Gbytes of memory per core, you should ask for help via 'hpchelp@csi.cuny.edu' after reading through the 'aprun' man page.
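
As a worked illustration of Rules 1 through 4 (a sketch, not a tested production script), a 16-PE job that runs 2 OpenMP threads per PE could reserve 2 cpus and 4000 Mbytes per chunk and launch with a matching 'aprun' line:

#PBS -l select=16:ncpus=2:mem=4000mb
#PBS -l place=free

aprun -n 16 -d 2 -N 8 ./my_hybrid.exe

Here each chunk reserves 2 cpus and the corresponding multiple of the 2000 Mbytes-per-core default, and the 'aprun' claim (16 PEs, 2 threads each) stays within the 32 cpus reserved, so none of the 'apsched' errors above should be produced.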

CUNY HPC Center PBS Pro Queue Structure

CUNY HPC Center has designed its PBS Pro queue structure to efficiently map our population of batch jobs to HPC Center system resources. PBS Pro has two distinct types of queues, execution queues and routing queues. Routing queues are defined to accept general classes of work (production, development and debugging, etc.). Jobs submitted to routing queues are then directed to their associated execution queues based on the resources requested by the job. Job resource requirements are assigned either explicitly by the user with the '-l select' option as described above, or (when not indicated by the user) implicitly through pre-defined PBS server and PBS queue resource defaults. Jobs in each general class are sorted first by the routing queue selected and then according to their resource requirements for placement in an execution queue. The execution queue is a PBS job's final destination and is shown by the 'qstat' command.

CUNY HPC Center PBS Pro Routing Queues

The routing queues that CUNY HPC Center users will generally use in their PBS Pro batch submission scripts ('-q' option) include:

Routing Queue Type 1:

interactive          ::  A development and debug queue for small scale, short ''interactive''-batch jobs.

Routing Queue Type 2:

development          ::  A development and debug queue for small scale, short batch ''test'' jobs.

Routing Queue Type 3:

production           ::  A production queue for ''production'' work of any scale and length.

On ANDY and ZEUS there are additional routing queues based on minor variations of those above that allow users to direct their jobs to nodes with special features.

On ANDY, users with CPU-only batch production work can submit their jobs to the default production queue described above (production), which offers 360 Intel Nehalem cores attached to a dedicated DDR Infiniband interconnect, OR they can submit their batch production work to the production_qdr queue described below, which offers 360 identical Intel Nehalem cores attached to a 2x faster, but shared, communication-and-storage QDR Infiniband interconnect.

QDR Routing Queue Type 3:

production_qdr           ::  A production queue for ''production'' work of any scale and length.

On ANDY and ZEUS, some nodes have GPUs connected via their PCI Express buses. Nodes of this type each have 2 GPUs attached (48 such nodes on ANDY, and 2 on ZEUS), one GPU per socket. On ANDY these are NVIDIA Fermi GPUs with 448 cores each, and on ZEUS they are NVIDIA Tesla GPUs with 240 cores each. The nodes with these special features can be selected by using:

GPU Routing Queue Type 1 (ANDY only):

interactive_gpu          ::  A development and debug queue for small scale, short ''interactive''-batch jobs using GPUs.

GPU Routing Queue Type 2:

development_gpu          ::  A development and debug queue for small scale, short batch ''test'' jobs using GPUs.

GPU Routing Queue Type 3 (ANDY only):

production_gpu           ::  A production queue for ''production'' work of any scale and length using GPUs.**

(** Note: On ZEUS there is only a 'development_gpu' queue.)

The GPU and QDR sides of ANDY (ANDY2) used for production work are congruent, and the compute nodes there (gpute-0 through gpute-45) can accept either CPU-only work or CPU-GPU work. The type of work run depends on the routing queue selected by the user as indicated above. These two classes of work compete for CPU resources on ANDY2, although GPU jobs consume only 1 CPU per GPU, leaving at least 6 free CPUs on each node for non-GPU work. To reduce the chance that CPU-only jobs will lock out GPU workloads on ANDY2, GPU jobs are given a somewhat higher scheduling priority on the GPU-QDR side of ANDY (ANDY2). The GPU development queue resources on ZEUS, and the GPU interactive and development queue resources on ANDY2, are dedicated to GPU workloads.

Other routing queues have been defined for reservations, dedicated time, idle cycles, and rush jobs. These are currently disabled, but will be activated as CUNY HPC Center develops its 24 x 7 scheduling policy on each of its HPC systems.

Choosing the right routing queue is important because resources have been reserved for each class of work and limits have been set on the resources available in each queue. For instance, jobs submitted to the interactive routing queue are limited to 4 or fewer processors and to a maximum of 32 processor minutes in total. The production routing queues will accept a job of virtually any size and duration and move it to the appropriate execution queue defined below.
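
For example, a short 8-processor test job that fits within the development limits described here could be submitted directly from the command line on one of the non-Cray systems (the script name is a placeholder; on ANDY use mem=2880mb to match its larger per-core memory):

qsub -q development -N testjob -V -l select=8:ncpus=1:mem=1920mb -l place=free test_job.sh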

CUNY HPC Center PBS Pro Execution Queues

At the CUNY HPC Center, one of the currently active routing queues described above MUST be the queue name used with the '-q' option to the PBS Pro 'qsub' command, as in '-q production'. As defined at CUNY HPCC, jobs can ONLY be submitted to routing queues. From there CPU-only jobs are routed to one of the following execution queues based on the resources requested:

Execution Queue 1 (not on SALK):

qint4         ::  A queue limited to ''interactive'' work of not more than 4 processors and 16 total processor minutes.

Execution Queue 2 (not on SALK):

qdev8        ::  A queue limited to batch ''development'' work of not more than 8 processors and 60 total processor minutes.

Execution Queue 3 (not on SALK):

qserial(_qdr)       ::  A queue limited to batch ''production'' work of not more than 1 processor (currently no cpu time limit)

Execution Queue 4:

qshort16(_qdr)    ::  A queue limited to batch ''production'' work of between 2 to 16 processors, and fewer than 32 total processor hours

Execution Queue 5:

qlong16(_qdr)     ::  A queue limited to batch ''production'' work of between 2 and 16 processors, and more than 32 total processor hours (currently no cpu time limit)

Execution Queue 6 (not on SALK):

qshort64(_qdr)    ::  A queue limited to batch ''production'' work of between 17 and 64 processors, and fewer than 128 total processor hours

Execution Queue 7 (not on SALK):

qlong64(_qdr)    ::   A queue limited to batch ''production'' work of between 17 and 64 processors, and more than 128 total processor hours (currently no cpu time limit)

Execution Queue 8 (SALK only):

qshort128         ::  A queue limited to batch ''production'' work of between 17 and 128 processors, and fewer than 128 total processor hours

Execution Queue 9 (SALK only):

qlong128         ::   A queue limited to batch ''production'' work of between 17 and 128 processors, and more than 128 total processor hours (currently no cpu time limit)

Execution Queue 10 (SALK only):

qshort512         ::  A queue limited to batch ''production'' work of between 129 and 512 processors, and fewer than 512 total processor hours

Execution Queue 11 (SALK only):

qlong512         ::   A queue limited to batch ''production'' work of between 129 and 512 processors, and more than 512 total processor hours (currently no cpu time limit)

Execution Queue 12:

qmax(_qdr)        ::  A queue limited to batch ''production'' work of between 65 and 132 processors (currently no cpu time minimum or limit) **

(**) On SALK the 'qmax' queue accepts jobs from 513 to 1024 processors. Special provisions can be made for jobs requiring more resources than are allowed by default in 'qmax'. The small size of ZEUS required that the qshort64 and qlong64 execution queues be replaced by a smaller qmax queue. The parenthetical extensions (_qdr) indicate the name of the QDR-equivalent execution queue on ANDY2.

Because of the typical 1-to-1 pairing of CPUs to GPUs in GPU workloads and the more limited number of GPUs per node on ANDY2 than CPUs per node, the ANDY2 GPU execution queue structure is a somewhat scaled down version of the CPU-only execution queue structure presented above:

GPU Execution Queue 1:

qint1_gpu         ::  A queue limited to ''interactive'' work of not more than 1 CPU, 1 GPU, and 8 total processor minutes.

GPU Execution Queue 2 (ANDY2 and ZEUS):

qdev2_gpu        ::  A queue limited to batch ''development'' work of not more than 2 CPUs, 2 GPUs, and 32 total processor minutes.

GPU Execution Queue 3:

qserial_gpu       ::  A queue limited to batch ''production'' work of not more than 1 CPU and 1 GPU (currently no cpu time limit)

GPU Execution Queue 4:

qshort4_gpu    ::  A queue limited to batch ''production'' work of from 2 to 4 CPUs, 2 to 4 GPUs, and fewer than 32 total processor hours

GPU Execution Queue 5:

qlong4_gpu     ::  A queue limited to batch ''production'' work of from 2 to 4 CPUs, 2 to 4 GPUs, and more than 32 total processor hours (currently no cpu time limit)

GPU Execution Queue 6:

qshort16_gpu    ::  A queue limited to batch ''production'' work of from 5 to 16 CPUs, 5 to 16 GPUs, and fewer than 128 total processor hours

GPU Execution Queue 7:

qlong16_gpu    ::   A queue limited to batch ''production'' work of from 5 to 16 CPUs, 5 to 16 GPUs, and more than 128 total processor hours (currently no cpu time limit)

GPU Execution Queue 8:

qmax_gpu        ::  A queue limited to batch ''production'' work of from 17 to 64 CPUs and 17 to 64 GPUs (currently no cpu time minimum or limit)**

(**) Special provisions can be made for jobs requiring more resources than are allowed by default in 'qmax'. The small size of ZEUS means that only one GPU routing queue (development_gpu) and one GPU execution queue (qdev2_gpu) are available there.

As you can see from the resource limits, the execution queues are designed to contiguously pack the resource request space. Jobs submitted to the routing queues will be sorted according to the resources requested with the '-l select' option and placed in the appropriate execution queues. The job's memory requirements are also considered. The entire physical memory of a given system's compute node has been divided proportionately among the cores available on the node. This value is the default requested for each resource chunk unless otherwise specified by the user, and it sets the amount of the job's total memory to be mapped to the node's physical memory space. Jobs that actually need more memory will have pages that spill out onto disk. Each execution queue limits the amount of memory available to this proportional fraction of the node's memory times the processor (core) count of the job, up to the processor (core) limit for the queue.

Each execution queue has its priority set according to the prevailing usage pattern on each system. Currently, this priority scheme slightly favors jobs that are between 8 and 16 processors in size on all systems except the Cray (SALK). On SALK, jobs between 128 and 512 cores have the highest priority to encourage the execution of large jobs there. Still, a job's priority depends on more than the priority of the execution queue that it ends up in. As it accumulates time in the queued (Q) state, its priority rises, and this new priority is used at the next scheduling cycle (currently every 5 minutes) to decide whether or not to run the job. Furthermore, the current CUNY HPC Center PBS Pro configuration has backfilling enabled, so some smaller jobs with lower priority may be started if there is not enough space to run queued larger jobs with higher priority. This 'priority-formula' based approach to job scheduling may be supplanted by a 'fair-share' approach in the future.

The workload at the CUNY HPC Center is varied, which makes it difficult to achieve perfect utilization while maintaining fair access to the resources by all parties. The objective of the queueing structure's design is to strike a balance between high utilization and fair access. The queue limits and priorities have already been refined several times since PBS Pro became CUNY's default batch scheduler to better meet these goals. Your input is invited on this question, and if you find your jobs have remained queued for periods of more than a day, please feel free to make an inquiry on the matter. We have found that the majority of jobs that are delayed have been delayed unnecessarily due to job submission script errors (impossible-to-fulfill resource requests) or inefficiencies (an alternative script would have allowed the job to start). This is not always the case, and when problems arise their resolution usually leads to a better PBS configuration. The CUNY HPC Center also recommends that users be prepared to run their applications on multiple systems in the event that one system is busy and another is more lightly loaded. From our usage records we know that users who operate in this manner get more processing time overall.

As our familiarity with PBS Pro grows and as the needs of our user community evolve it is likely that the queue structure will continue to be refined and augmented. There may be a need to create additional queues for specific applications for instance. User comments regarding the queue structure are welcome and should be sent to hpchelp@mail.csi.cuny.edu.

Currently Supported User Level Applications

This is an overview of the user-level HPC applications supported by the HPC Center staff for the benefit of the entire CUNY HPC user community. A user can choose to install any application that they are licensed for under their own account, or appeal (based on general interest) to have it installed by HPC Center staff in the shared system directory (usually /share/apps).

Not every user-level application is installed on every system. This is because system architectural differences, load-balancing considerations, licensing limitations, the time required to maintain them, and other factors, sometimes dictate otherwise. Here, we present the current CUNY HPC Center user-level application catalogue and note the system on which each application is installed and licensed to run.

We encourage the CUNY HPC community to help the HPC Center staff create an applications catalogue that is closely tuned to the needs of the community. As such, we hope that users will solicit staff help in growing our application install base to suit the needs of the community, whatever the application discipline (CAE, CFD, COMPCHEM, QCD, BIOINFORMATICS, etc.).

Unless otherwise noted, all applications built locally were built using our default Intel-OpenMPI applications stack. Furthermore, the PBS Pro job submission scripts below are promised to work (at the time this section of the Wiki was written), but the number of processors (cores), memory, and process placement defined in the example scripts is not necessarily optimal for wall-clock or cpu-time performance. Users should apply their knowledge of the application, the system, and the benefit of their experience to choose the optimal combination of processors and memory for their scripts. Details on how to make full use of the PBS Pro job submission options are covered in the PBS Pro section above.

ADCIRC

ADCIRC is a system of computer programs for solving time dependent, free surface circulation and transport problems in two and three dimensions. These programs utilize the finite element method in space allowing the use of highly flexible, unstructured grids. Typical ADCIRC applications have included: (i) modeling tides and wind driven circulation, (ii) analysis of hurricane storm surge and flooding, (iii) dredging feasibility and material disposal studies, (iv) larval transport studies, (v) near shore marine operations.

Currently ADCIRC is available for users on Salk (salk.csi.cuny.edu).

Serial Execution

Create a directory where all the files needed for the job will be kept

# mkdir testadcirc
# cd testadcirc

Copy example from ADCIRC directory and unzip it

# cp /share/apps/adcirc/default/testcase/serial_shinnecock_inlet.zip ./
# unzip ./serial_shinnecock_inlet.zip 
Archive:  ./serial_shinnecock_inlet.zip
  inflating: serial_shinnecock_inlet/fort.14  
  inflating: serial_shinnecock_inlet/fort.15  
  inflating: serial_shinnecock_inlet/fort.16  
  inflating: serial_shinnecock_inlet/fort.63  
  inflating: serial_shinnecock_inlet/fort.64  

Go to unpacked directory

# cd serial_shinnecock_inlet/

Create a file (named 'sendfile' for example, since that name is used in the 'qsub' command below) with the following lines in it; this file will be used to submit the ADCIRC job to the PBS queue:

#!/bin/bash
#PBS -q production
#PBS -N serial_ADCIRC
#PBS -l select=16:ncpus=1:mem=2000mb
#PBS -l place=free
#PBS -j oe
#PBS -o serial_adcirc.out
#PBS -V

cd $PBS_O_WORKDIR
echo "Starting job:  "  ${PBS_JOBID}

aprun -n 1 ./adcirc

echo "Finishing job .... DONE"

Copy serial ADCIRC binary to the working directory

# cp /share/apps/adcirc/default/bin/adcirc ./


And finally submit a job to the PBS queue:

# qsub sendfile

Parallel Execution

Running ADCIRC in a parallel mode requires some additional steps.

Here we present a procedure that allows you to run an example parallel ADCIRC job.

Create a directory where all the files needed for the job will be kept

# mkdir testparadcirc
# cd testparadcirc

Copy example from ADCIRC directory and unzip it

# cp /share/apps/adcirc/default/testcase/serial_shinnecock_inlet.zip ./
# unzip ./serial_shinnecock_inlet.zip 
Archive:  ./serial_shinnecock_inlet.zip
  inflating: serial_shinnecock_inlet/fort.14  
  inflating: serial_shinnecock_inlet/fort.15  
  inflating: serial_shinnecock_inlet/fort.16  
  inflating: serial_shinnecock_inlet/fort.63  
  inflating: serial_shinnecock_inlet/fort.64  

Go to unpacked directory

# cd serial_shinnecock_inlet/

Now we need to partition the domain and decompose the problem:

# /share/apps/adcirc/default/bin/adcprep 

When prompted, enter 8 (for the number of processors) and 1 (to partition the domain). When asked for the ADCIRC UNIT 14 (Grid) file, enter 'fort.14'.

This will print partitioning information to the screen.

Then run the above command again and, when prompted, enter 8 (number of processors) and 2 (to decompose the problem).
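
If you prefer not to type the answers at the prompts, a minimal sketch along the following lines may work, assuming 'adcprep' accepts its answers on standard input in the order of the prompts described above (verify the prompt order interactively first):

printf "8\n1\nfort.14\n" | /share/apps/adcirc/default/bin/adcprep
printf "8\n2\n" | /share/apps/adcirc/default/bin/adcprep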

Copy parallel ADCIRC binary to the working directory

# cp /share/apps/adcirc/default/bin/padcirc ./

At this point you'll have all the files needed to run the job:

# ls 
adc  fort.14  fort.15  fort.16  fort.80  metis_graph.txt  partmesh.txt  PE0000/  PE0001/  PE0002/  PE0003/  PE0004/  PE0005/  PE0006/  PE0007/

The following is the content of the PBS submit file (again named 'sendfile' in this example):

#!/bin/bash
#PBS -q production
#PBS -N parallel_ADCIRC
#PBS -l select=16:ncpus=1:mem=2000mb
#PBS -l place=free
#PBS -j oe
#PBS -o par_adcirc.out
#PBS -V

cd $PBS_O_WORKDIR
echo "Starting job:  "  ${PBS_JOBID}

aprun -n 8 ./padcirc

echo "Finishing job .... DONE"

And finally submit a job to the PBS queue:

# qsub sendfile

ADF Amsterdam Density Functional Theory

ADF (Amsterdam Density Functional) is a Fortran program for calculations on atoms and molecules (in gas phase or solution) from first principles. It can be used for the study of such diverse fields as molecular spectroscopy, organic and inorganic chemistry, crystallography and pharmacochemistry. Some of its key strengths include high accuracy supported by its use of Slater-type orbitals, all-electron relativistic treatment of the heavier elements, and fast parameterized DFT-based semi-empirical methods. A separate program BAND is available for the study of periodic systems: crystals, surfaces, and polymers. The COSMO-RS program is used for calculating thermodynamic properties of (mixed) fluids.

The underlying theory is the Kohn-Sham approach to Density-Functional Theory (DFT). This implies a one-electron picture of the many-electron systems but yields in principle the exact electron density (and related properties) and the total energy. If ADF is a new program for you we recommend that you carefully read Chapter 1, section 1.3 'Technical remarks, Terminology', which presents a discussion of a few ADF-typical aspects and terminology. This will help you to understand and appreciate the output of an ADF calculation. The ADF Manual is located on the web here: [21]

ADF 2012 (and SCM's other programs) is installed on ANDY at the CUNY HPC Center, but is currently licensed only to a limited number of specific working groups. Access is controlled based on licensing. HPC Center users interested in licensing ADF should contact the HPC Center staff via 'hpchelp'. The current group-limited license allows up to 4-way parallel work on any of the nodes on the 'production_qdr' side of the system.

Here is a simple ADF input deck that computes the SCF wave function for HCN. This example can be run with the PBS script shown below on 1 to 4 cores.

Title    HCN Linear Transit, first part
NoPrint  SFO, Frag, Functions, Computation

Atoms      Internal
  1 C  0 0 0       0    0    0
  2 N  1 0 0       1.3  0    0
  3 H  1 2 0       1.0  th  0
End

Basis
 Type DZP
End

Symmetry NOSYM

Integration 6.0 6.0

Geometry
  Branch Old
  LinearTransit  10
  Iterations     30  4
  Converge   Grad=3e-2,  Rad=3e-2,  Angle=2
END

Geovar
  th   180    0
End

End Input

A PBS script ('adf_4.job') configured to use 4 cores is shown here. Note that ADF does not use the version of MPI that the HPC Center supports by default. ADF uses the proprietary version of MPI from SGI that is part of SGI's MPT parallel library package, and this script includes special lines to configure the run accordingly. A side effect of this is that ADF jobs will not clock time in PBS under the 'Time' column when your job is checked with 'qstat'.

#!/bin/bash
# This script runs a 4-cpu (core) ADF job with the 4 cpus 
# packed onto a single compute node. This is the maximum
# number of cores allowed by the 'floating' license. This script
# requests only one half of the resources on an ANDY compute
# node (4 cores, 1 half its memory). 
# 
# The hcn4.input deck in this directory is configured to work
# with these resources, although this computation is really 
# too small to make full use of them. To increase or decrease
# the resources PBS requests (cpus, memory, or disk) change the 
# '-l select' line below and the parameter values in the input deck.
#
#PBS -q production_qdr
#PBS -N adf_4P_job
#PBS -l select=1:ncpus=4:mem=11520mb:lscratch=400gb
#PBS -l place=free
#PBS -V

# list master compute node 
echo ""
echo -n "Hostname is: "
hostname
echo ""

# set environment up to up SGI MPT version of MPI
BASEPATH=/opt/sgi/mpt/mpt-2.02

export PATH=${BASEPATH}/bin:${PATH}
export CPATH=${BASEPATH}/include:${CPATH}
export FPATH=${BASEPATH}/include:${FPATH}
export LD_LIBRARY_PATH=${BASEPATH}/lib:${LD_LIBRARY_PATH}
export LIBRARY_PATH=${BASEPATH}/lib:${LIBRARY_PATH}
export MPI_ROOT=${BASEPATH}

# set the ADF root directory
export ADFROOT=/share/apps/adf
export ADFHOME=${ADFROOT}/2012.01

# point ADF to the ADF license file
export SCMLICENSE=${ADFHOME}/license.txt

# set up ADF scratch directory 
export MY_SCRDIR=`whoami;date '+%m.%d.%y_%H:%M:%S'`
export MY_SCRDIR=`echo $MY_SCRDIR | sed -e 's; ;_;'`
export SCM_TMPDIR=/home/adf/adf_scr/${MY_SCRDIR}_$$

mkdir -p $SCM_TMPDIR

echo "The ADF scratch files for this job are in: ${SCM_TMPDIR}"
echo ""

# explicitly change to your working directory under PBS
cd $PBS_O_WORKDIR

# set the number processors to use in this job
export NSCM=4

# run the ADF job
echo "Starting ADF job ... "
echo ""

adf -n 4 < HCN_4P.inp > HCN_4P.out 

# name output files
mv logfile HCN_4P.logfile

echo ""
echo "ADF job finished ... "

# clean up scratch directory files
/bin/rm -r $SCM_TMPDIR

Much of this script is similar to the script that runs Gaussian jobs, but several things about it should be described in more detail. First, at the moment ADF must be submitted to the 'production_qdr' queue, which is where its license has been limited and where it can use only 4 cores at one time. Second, there is a block in the script that sets up the environment to use the SGI proprietary version of MPI for parallel runs. Next is the NSCM environment variable, which defines the number of cores to use, along with the '-n' option on the command line. Both of these (along with the number of cpus on the PBS '-l select' line at the beginning of the script) must be adjusted to control the number of cores used by the job.
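
For instance, to run the same job on 2 of the licensed cores instead of 4, the three settings would be changed together. A sketch of the changed lines only (the memory shown is simply half of the 4-core request in the script above; the scratch request is left unchanged):

#PBS -l select=1:ncpus=2:mem=5760mb:lscratch=400gb

export NSCM=2

adf -n 2 < HCN_4P.inp > HCN_4P.out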

Note that the 'adf' command is actually a script that generates and runs another script, which in turn runs the 'adf.exe' executable. This generated script (called 'runscript') is built and placed in the user's working directory. It may include some preliminary steps that are NOT run in parallel.

With the HCN input file and PBS script above, you can submit an ADF job on ANDY with:

qsub adf_4.job

All users of ADF must be licensed and placed in the 'gadf' Unix group by HPC Center staff.

BAMOVA

Bamova implements a Bayesian Analysis of Molecular Variance and different likelihood models for three different types of molecular data (including two models for high throughput sequence data), as described in detail in Gompert and Buerkle (2011) and Gompert et al. (2010). Use of the software requires good familiarity with the models described in these papers. It will also likely require some programming to format data for input and to analyze the MCMC output. For more detail on BAMOVA please visit the BAMOVA web site [22] and manual here [23]

Currently, BAMOVA version 1.02 is installed on BOB and ATHENA at the CUNY HPC Center. BAMOVA is a serial program that requires an input file and distance files to run. Here, we show how to run the test input case provided with the downloaded code, 'hapcountexample.txt', which uses the distance file 'distfileexample.txt'. These files may be copied to the user's working directory for submission with:

cp /share/apps/bamova/default/examples/*.txt .

Here is a PBS batch script that works with this example input case:

#!/bin/bash
#PBS -q production
#PBS -N BAMOVA_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin BAMOVA Serial Run ..."
echo ""
/share/apps/bamova/default/bin/bamova -f ./hapcountexample.txt -d ./distfileexample.txt -l 0 -x 250000 -v 0.25 -a 0 -D 0 -w 1 -W 1 -i 0.0 -I 0.0 -T 10
echo ""
echo ">>>> End   BAMOVA Serial Run ..."

It should take less than 30 minutes to run and will produce PBS output and error files beginning with the job name 'BAMOVA_serial'. Details on the meaning of the PBS script are covered above in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

Please note the BAMOVA command line options. These are described in detail in the manual referenced above.

BAYESCAN

This program, BayeScan aims at identifying candidate loci under natural selection from genetic data, using differences in allele frequencies between populations. BayeScan is based on the multinomial-Dirichlet model. One of the scenarios covered consists of an island model in which subpopulation allele frequencies are correlated through a common migrant gene pool from which they differ in varying degrees. The difference in allele frequency between this common gene pool and each subpopulation is measured by a subpopulation specific FST coefficient. Therefore, this formulation can consider realistic ecological scenarios where the effective size and the immigration rate may differ among subpopulations. More detailed information on Bayescan can be found at the web site here [24] and in the manual here [25].

Currently, BAYESCAN version 2.01 is installed on BOB and ATHENA at the CUNY HPC Center. BAYESCAN is a serial program that requires a genotype data input file to run. Here, we show how to run the test input case provided with the downloaded code, 'test_SNPs.txt'. This file may be copied to the user's working directory for submission with:

cp /share/apps/bayescan/default/examples/distro/test_SNPs.txt* .

Here is a PBS batch script that works with this example input case:

#!/bin/bash
#PBS -q production
#PBS -N BYSCAN_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin BAYESCAN Serial Run ..."
echo ""
/share/apps/bayescan/default/bin/bayescan2 test_SNPs.txt -snp
echo ""
echo ">>>> End   BAYESCAN Serial Run ..."

This batch script can be dropped into a file (say bayescan_serial.job) on BOB or ATHENA and run with the following command:

qsub bayescan_serial.job

It should take less than 20 minutes to run and will produce PBS output and error files beginning with the job name 'BYSCAN_serial', along with a number of BAYESCAN-specific files. Details on the meaning of the PBS script are covered above in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The BAYESCAN command line options can be printed using the following:

/share/apps/bayescan/default/bin/bayescan2 --help

These options are described in detail in the manual [26].

BEAST

BEAST is a cross-platform Java program for Bayesian MCMC analysis of molecular sequences. It is entirely orientated towards rooted, time-measured phylogenies inferred using strict or relaxed molecular clock models. It can be used as a method of reconstructing phylogenies, but is also a framework for testing evolutionary hypotheses without conditioning on a single tree topology. BEAST uses MCMC to average over tree space, so that each tree is weighted proportional to its posterior probability. The distribution includes a simple to use user-interface program called 'BEAUti' for setting up standard analyses and a suite of programs for analysing the results. For more detail on BEAST please visit the BEAST web site [27].

Currently, BEAST version 1.6.2 is installed on BOB, ATHENA, and ANDY at the CUNY HPC Center. BEAST is a serial program, but can also be run with the help of a companion library (BEAGLE) on systems with Graphics Processing Units (GPUs). On BOB and ATHENA, BEAST must be run serially, but on ANDY which supports GPU processing and on which the BEAGLE 0.2 GPU library has been installed, BEAST can be run either serially or in GPU-accelerated mode. Benchmarks of BEAST show that GPU acceleration provides significant performance improvement over basic serial operation.

BEAST's user interface program, 'BEAUti', can be run locally on an office workstation or from the head nodes of BOB, ATHENA, or ANDY. The latter option assumes that the user has logged in via the secure shell with X-Windows tunneling enabled (e.g. ssh -X my.name@bob.csi.cuny.edu). Details on using ssh are provided elsewhere in this document. Among other things, BEAUti is used to convert raw '.nex' files into BEAST XML-based input files.
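
Assuming BEAUti is installed alongside the other BEAST programs in the same 'bin' directory (check with 'ls /share/apps/beast/default/bin'), it could be started from an X-forwarded login session with something like:

ssh -X my.name@bob.csi.cuny.edu
/share/apps/beast/default/bin/beauti &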

Once a usable BEAST input file has been created, a PBS batch script must be written to run the job, either in serial mode or in GPU mode (GPU-mode jobs must be run on ANDY). Below, we show how to run both a serial and a GPU-accelerated job with a test input case (testMC3.xml) from the BEAST examples directory. The input file may be copied into the user's working directory from BEAST's installation tree for submission with PBS, as follows:

cp /share/apps/beast/default/examples/testMC3.xml .

Next, a PBS Pro batch script must be created to run your job. The first script below shows a serial run that uses the testMC3.xml XML input file.

#!/bin/bash
#PBS -q production
#PBS -N BEAST_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin BEAST Serial Run ..."
echo ""
/share/apps/beast/default/bin/beast -m 1920 -seed 666 ./testMC3.xml
echo ""
echo ">>>> End   BEAST Serial Run ..."

This script can be dropped into a file (say 'beast_serial.job') on BOB, ATHENA, or ANDY and run with:

qsub beast_serial.job

This case should take less than ten minutes to run and will produce PBS output and error files beginning with the job name 'BEAST_serial', as well as files specific to BEAST. Details on the meaning of the PBS script are covered above in the PBS section of the Wiki. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory in it for the job (on ANDY the value 2880 should be used because of its larger amount of memory per node). The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The HPC Center staff has made two changes to the memory constraints in operation for all the BEAST distributed programs (see list below). First, the default minimum memory size has been raised from 64 MBs to 128 MBs. Second, a maximum memory control option has been added to all the programs. It is not required, and if it is NOT used, programs get the historical default for Linux jobs of 1024 MBs (i.e. 1 GB). If it is used, it must be the first option included on the execution line in your script and takes the form:

-m XXXXX

where the value 'XXXXX' is the new user-selected maximum memory amount. So, the option used in the script above:

-m 1920

would bump up the memory maximum for the 'beast' program to 1,920 MBs (1.92 GBs). Notice that this matches the amount requested per 'chunk' in the PBS '-l select' line above. You should not ask for more memory than you have requested through PBS.

You may wish to request more memory than the per cpu (core) defaults on a system. This can be accomplished by asking for more cores per PBS 'chunk' than you are going to use, but using ALL of the total memory PBS allocates to the multiple cores. For instance, a '-l select' line of:

#PBS -l select=1:ncpus=4:mem=7680mb

requests 4 cpus (cores) and 7,680 MBs of memory. You could pair this request with a 'beast' execution line of:

/share/apps/beast/default/bin/beast -m 7680 -seed 666 ./testMC3.xml

to get 4 times the single-core quantity of memory for your 'beast' run by allocating, but not using, the 4 PBS cores requested in the '-l select' statement. The non-GPU version of 'beast' is serial (it uses only one core). This technique can be used with any of the BEAST distributed programs ('treeannotator', for instance). Remember that ANDY has more available memory per core (2880 MBs) than ATHENA or BOB (1920 MBs), so your base numbers and multipliers should be specific to each system.
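
On ANDY, for example, the equivalent 4-core request would use ANDY's larger per-core figure (4 x 2880 MBs = 11,520 MBs); a sketch of the matching '-l select' and execution lines:

#PBS -l select=1:ncpus=4:mem=11520mb

/share/apps/beast/default/bin/beast -m 11520 -seed 666 ./testMC3.xml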

Note that there are a large number of command-line options available to BEAST. This example uses the defaults, other than setting the seed with '-seed 666'. All of BEAST's options can be listed as follows:

/share/apps/beast/default/bin/beast -help
  Usage: beast [-verbose] [-warnings] [-strict] [-window] [-options] [-working] [-seed] [-prefix <PREFIX>] [-overwrite] [-errors <i>] [-threads <i>] [-java] [-beagle] [-beagle_info] [-beagle_order <order>] [-beagle_instances <i>] [-beagle_CPU] [-beagle_GPU] [-beagle_SSE] [-beagle_single] [-beagle_double] [-beagle_scaling <default|none|dynamic|always>] [-help] [<input-file-name>]
    -verbose Give verbose XML parsing messages
    -warnings Show warning messages about BEAST XML file
    -strict Fail on non-conforming BEAST XML file
    -window Provide a console window
    -options Display an options dialog
    -working Change working directory to input file's directory
    -seed Specify a random number generator seed
    -prefix Specify a prefix for all output log filenames
    -overwrite Allow overwriting of log files
    -errors Specify maximum number of numerical errors before stopping
    -threads The number of computational threads to use (default auto)
    -java Use Java only, no native implementations
    -beagle Use beagle library if available
    -beagle_info BEAGLE: show information on available resources
    -beagle_order BEAGLE: set order of resource use
    -beagle_instances BEAGLE: divide site patterns amongst instances
    -beagle_CPU BEAGLE: use CPU instance
    -beagle_GPU BEAGLE: use GPU instance if available
    -beagle_SSE BEAGLE: use SSE extensions if available
    -beagle_single BEAGLE: use single precision if available
    -beagle_double BEAGLE: use double precision if available
    -beagle_scaling BEAGLE: specify scaling scheme to use
    -help Print this information and stop

  Example: beast test.xml
  Example: beast -window test.xml
  Example: beast -help

The CUNY HPC Center also provides a GPU-accelerated version of BEAST. This version can be run ONLY on ANDY (the serial version can also be run on ANDY). A PBS batch script for running the GPU-accelerated version of BEAST follows:

#!/bin/bash
#PBS -q production_gpu
#PBS -N BEAST_gpu
#PBS -l select=1:ncpus=1:ngpus=1:mem=2880mb:accel=fermi
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin BEAST GPU Run ..."
echo ""
/share/apps/beast/default/bin/beast_gpu -m 2880 -beagle -beagle_GPU  -beagle_single -seed 666 ./testMC3.xml
echo ""
echo ">>>> End   BEAST GPU Run ..."

This script has several unique features. First, the queue to which the job is submitted is 'production_gpu', which is the GPU-enabled queue on ANDY. This is required for a GPU-accelerated run. Next, the '-l select' line includes requests for GPU-related resources. Both 1 processor (ncpus=1) and 1 GPU (ngpus=1) are requested. You need both. The memory requested is larger because ANDY has more CPU memory per processor (core). And the type of GPU accelerator is specified as an NVIDIA Fermi GPU, which has 448 processors to apply to this work load (accel=fermi). These GPU processing cores, while less powerful individually than a CPU core, in concert are what deliver the performance of the highly parallel MCMC algorithm.

In addition, GPU-specific command-line options are required to invoke the GPU version of BEAST. Here we have requested that the 'BEAGLE' GPU library be used and that the computation be run in single precision (32 bits as opposed to 64 bits) on the GPU, which is 2X faster than double precision if single precision is sufficient for your work. We expect identical workloads to perform as much as 5X faster when run in GPU mode.

All the programs that are part of the BEAST 1.6.2 distribution are available, even though we have only discussed 'beast' itself in detail here. The other programs include:

beauti  loganalyser  logcombiner  treeannotator  treestat

Scripts similar in form to the ones above could be used to run any of these programs as well.
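
For example, a serial PBS script for 'treeannotator' might look like the following (a minimal sketch only; the input and output tree file names are hypothetical, and the path assumes 'treeannotator' is installed alongside 'beast'):

#!/bin/bash
#PBS -q production
#PBS -N TREEANN_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# The '-m' option raises the memory maximum as described above
/share/apps/beast/default/bin/treeannotator -m 1920 ./my_run.trees ./my_run_annotated.tree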

BEST

The Bayesian Estimation of Species Trees application (BEST) implements a Bayesian hierarchical model to jointly estimate gene trees and the species tree from multilocus DNA molecular sequence data. It provides a new approach for estimating the mutation-rate-based phylogenetic relationships among species. Its method accounts for deep coalescence, but not for other complicating issues such as horizontal transfer or gene duplication. The program works in conjunction with the popular Bayesian phylogenetics package, MrBayes (Ronquist and Huelsenbeck, Bioinformatics, 2003). BEST's parameters are defined using the 'prset' command from MrBayes. Details on BEST's capabilities and options are available at the BEST web site here [28].

Currently, BEST versions 2.2.0 and 2.3.1 are available on ATHENA, ANDY, and BOB at the CUNY HPC Center. BEST version 2.2.0 is the current default because a special large-memory build is available for it that is not yet available for version 2.3.1. Both versions can be run in either parallel or serial mode.

To run BEST, first a NEXUS-formatted, DNA sequence comparison input file (e.g. a '.nex' file) must be created using MrBayes. See the section on MrBayes below for this. Here is an example NEXUS input file:

#NEXUS

begin data;
   dimensions ntax=17 nchar=432;
   format datatype=dna missing=?;
   matrix
   human       ctgactcctgaggagaagtctgccgttactgccctgtggggcaaggtgaacgtggatgaagttggtggtgaggccctgggcaggctgctggtggtctacccttggacccagaggttctttgagtcctttggggatctgtccactcctgatgctgttatgggcaaccctaaggtgaaggctcatggcaagaaagtgctcggtgcctttagtgatggcctggctcacctggacaacctcaagggcacctttgccacactgagtgagctgcactgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggtctgtgtgctggcccatcactttggcaaagaattcaccccaccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaagtatcac
   tarsier     ctgactgctgaagagaaggccgccgtcactgccctgtggggcaaggtagacgtggaagatgttggtggtgaggccctgggcaggctgctggtcgtctacccatggacccagaggttctttgactcctttggggacctgtccactcctgccgctgttatgagcaatgctaaggtcaaggcccatggcaaaaaggtgctgaacgcctttagtgacggcatggctcatctggacaacctcaagggcacctttgctaagctgagtgagctgcactgtgacaaattgcacgtggatcctgagaatttcaggctcttgggcaatgtgctggtgtgtgtgctggcccaccactttggcaaagaattcaccccgcaggttcaggctgcctatcagaaggtggtggctggtgtggctactgccttggctcacaagtaccac
   bushbaby    ctgactcctgatgagaagaatgccgtttgtgccctgtggggcaaggtgaatgtggaagaagttggtggtgaggccctgggcaggctgctggttgtctacccatggacccagaggttctttgactcctttggggacctgtcctctccttctgctgttatgggcaaccctaaagtgaaggcccacggcaagaaggtgctgagtgcctttagcgagggcctgaatcacctggacaacctcaagggcacctttgctaagctgagtgagctgcattgtgacaagctgcacgtggaccctgagaacttcaggctcctgggcaacgtgctggtggttgtcctggctcaccactttggcaaggatttcaccccacaggtgcaggctgcctatcagaaggtggtggctggtgtggctactgccctggctcacaaataccac
   hare        ctgtccggtgaggagaagtctgcggtcactgccctgtggggcaaggtgaatgtggaagaagttggtggtgagaccctgggcaggctgctggttgtctacccatggacccagaggttcttcgagtcctttggggacctgtccactgcttctgctgttatgggcaaccctaaggtgaaggctcatggcaagaaggtgctggctgccttcagtgagggtctgagtcacctggacaacctcaaaggcaccttcgctaagctgagtgaactgcattgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggttattgtgctgtctcatcactttggcaaagaattcactcctcaggtgcaggctgcctatcagaaggtggtggctggtgtggccaatgccctggctcacaaataccac
   rabbit      ctgtccagtgaggagaagtctgcggtcactgccctgtggggcaaggtgaatgtggaagaagttggtggtgaggccctgggcaggctgctggttgtctacccatggacccagaggttcttcgagtcctttggggacctgtcctctgcaaatgctgttatgaacaatcctaaggtgaaggctcatggcaagaaggtgctggctgccttcagtgagggtctgagtcacctggacaacctcaaaggcacctttgctaagctgagtgaactgcactgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggttattgtgctgtctcatcattttggcaaagaattcactcctcaggtgcaggctgcctatcagaaggtggtggctggtgtggccaatgccctggctcacaaataccac
   cow         ctgactgctgaggagaaggctgccgtcaccgccttttggggcaaggtgaaagtggatgaagttggtggtgaggccctgggcaggctgctggttgtctacccctggactcagaggttctttgagtcctttggggacttgtccactgctgatgctgttatgaacaaccctaaggtgaaggcccatggcaagaaggtgctagattcctttagtaatggcatgaagcatctcgatgacctcaagggcacctttgctgcgctgagtgagctgcactgtgataagctgcatgtggatcctgagaacttcaagctcctgggcaacgtgctagtggttgtgctggctcgcaattttggcaaggaattcaccccggtgctgcaggctgactttcagaaggtggtggctggtgtggccaatgccctggcccacagatatcat
   sheep       ctgactgctgaggagaaggctgccgtcaccggcttctggggcaaggtgaaagtggatgaagttggtgctgaggccctgggcaggctgctggttgtctacccctggactcagaggttctttgagcactttggggacttgtccaatgctgatgctgttatgaacaaccctaaggtgaaggcccatggcaagaaggtgctagactcctttagtaacggcatgaagcatctcgatgacctcaagggcacctttgctcagctgagtgagctgcactgtgataagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggtggttgtgctggctcgccaccatggcaatgaattcaccccggtgctgcaggctgactttcagaaggtggtggctggtgttgccaatgccctggcccacaaatatcac
   pig         ctgtctgctgaggagaaggaggccgtcctcggcctgtggggcaaagtgaatgtggacgaagttggtggtgaggccctgggcaggctgctggttgtctacccctggactcagaggttcttcgagtcctttggggacctgtccaatgccgatgccgtcatgggcaatcccaaggtgaaggcccacggcaagaaggtgctccagtccttcagtgacggcctgaaacatctcgacaacctcaagggcacctttgctaagctgagcgagctgcactgtgaccagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgatagtggttgttctggctcgccgccttggccatgacttcaacccgaatgtgcaggctgcttttcagaaggtggtggctggtgttgctaatgccctggcccacaagtaccac
   elephseal   ttgacggcggaggagaagtctgccgtcacctccctgtggggcaaagtgaaggtggatgaagttggtggtgaagccctgggcaggctgctggttgtctacccctggactcagaggttctttgactcctttggggacctgtcctctcctaatgctattatgagcaaccccaaggtcaaggcccatggcaagaaggtgctgaattcctttagtgatggcctgaagaatctggacaacctcaagggcacctttgctaagctcagtgagctgcactgtgaccagctgcatgtggatcccgagaacttcaagctcctgggcaatgtgctggtgtgtgtgctggcccgccactttggcaaggaattcaccccacagatgcagggtgcctttcagaaggtggtagctggtgtggccaatgccctcgcccacaaatatcac
   rat         ctaactgatgctgagaaggctgctgttaatgccctgtggggaaaggtgaaccctgatgatgttggtggcgaggccctgggcaggctgctggttgtctacccttggacccagaggtactttgatagctttggggacctgtcctctgcctctgctatcatgggtaaccctaaggtgaaggcccatggcaagaaggtgataaacgccttcaatgatggcctgaaacacttggacaacctcaagggcacctttgctcatctgagtgaactccactgtgacaagctgcatgtggatcctgagaacttcaggctcctgggcaatatgattgtgattgtgttgggccaccacctgggcaaggaattcaccccctgtgcacaggctgccttccagaaggtggtggctggagtggccagtgccctggctcacaagtaccac
   mouse       ctgactgatgctgagaagtctgctgtctcttgcctgtgggcaaaggtgaaccccgatgaagttggtggtgaggccctgggcaggctgctggttgtctacccttggacccagcggtactttgatagctttggagacctatcctctgcctctgctatcatgggtaatcccaaggtgaaggcccatggcaaaaaggtgataactgcctttaacgagggcctgaaaaacctggacaacctcaagggcacctttgccagcctcagtgagctccactgtgacaagctgcatgtggatcctgagaacttcaggctcctaggcaatgcgatcgtgattgtgctgggccaccacctgggcaaggatttcacccctgctgcacaggctgccttccagaaggtggtggctggagtggccactgccctggctcacaagtaccac
   hamster     ctgactgatgctgagaaggcccttgtcactggcctgtggggaaaggtgaacgccgatgcagttggcgctgaggccctgggcaggttgctggttgtctacccttggacccagaggttctttgaacactttggagacctgtctctgccagttgctgtcatgaataacccccaggtgaaggcccatggcaagaaggtgatccactccttcgctgatggcctgaaacacctggacaacctgaagggcgccttttccagcctgagtgagctccactgtgacaagctgcacgtggatcctgagaacttcaagctcctgggcaatatgatcatcattgtgctgatccacgacctgggcaaggacttcactcccagtgcacagtctgcctttcataaggtggtggctggtgtggccaatgccctggctcacaagtaccac
   marsupial   ttgacttctgaggagaagaactgcatcactaccatctggtctaaggtgcaggttgaccagactggtggtgaggcccttggcaggatgctcgttgtctacccctggaccaccaggttttttgggagctttggtgatctgtcctctcctggcgctgtcatgtcaaattctaaggttcaagcccatggtgctaaggtgttgacctccttcggtgaagcagtcaagcatttggacaacctgaagggtacttatgccaagttgagtgagctccactgtgacaagctgcatgtggaccctgagaacttcaagatgctggggaatatcattgtgatctgcctggctgagcactttggcaaggattttactcctgaatgtcaggttgcttggcagaagctcgtggctggagttgcccatgccctggcccacaagtaccac
   duck        tggacagccgaggagaagcagctcatcaccggcctctggggcaaggtcaatgtggccgactgtggagctgaggccctggccaggctgctgatcgtctacccctggacccagaggttcttcgcctccttcgggaacctgtccagccccactgccatccttggcaaccccatggtccgtgcccatggcaagaaagtgctcacctccttcggagatgctgtgaagaacctggacaacatcaagaacaccttcgcccagctgtccgagctgcactgcgacaagctgcacgtggaccctgagaacttcaggctcctgggtgacatcctcatcatcgtcctggccgcccacttcaccaaggatttcactcctgactgccaggccgcctggcagaagctggtccgcgtggtggcccacgctctggcccgcaagtaccac
   chicken     tggactgctgaggagaagcagctcatcaccggcctctggggcaaggtcaatgtggccgaatgtggggccgaagccctggccaggctgctgatcgtctacccctggacccagaggttctttgcgtcctttgggaacctctccagccccactgccatccttggcaaccccatggtccgcgcccacggcaagaaagtgctcacctcctttggggatgctgtgaagaacctggacaacatcaagaacaccttctcccaactgtccgaactgcattgtgacaagctgcatgtggaccccgagaacttcaggctcctgggtgacatcctcatcattgtcctggccgcccacttcagcaaggacttcactcctgaatgccaggctgcctggcagaagctggtccgcgtggtggcccatgccctggctcgcaagtaccac
   xenlaev     tggacagctgaagagaaggccgccatcacttctgtatggcagaaggtcaatgtagaacatgatggccatgatgccctgggcaggctgctgattgtgtacccctggacccagagatacttcagtaactttggaaacctctccaattcagctgctgttgctggaaatgccaaggttcaagcccatggcaagaaggttctttcagctgttggcaatgccattagccatattgacagtgtgaagtcctctctccaacaactcagtaagatccatgccactgaactgtttgtggaccctgagaactttaagcgttttggtggagttctggtcattgtcttgggtgccaaactgggaactgccttcactcctaaagttcaggctgcttgggagaaattcattgcagttttggttgatggtcttagccagggctataac
   xentrop     tggacagctgaagaaaaagcaaccattgcttctgtgtgggggaaagtcgacattgaacaggatggccatgatgcattatccaggctgctggttgtttatccctggactcagaggtacttcagcagttttggaaacctctccaatgtctccgctgtctctggaaatgtcaaggttaaagcccatggaaataaagtcctgtcagctgttggcagtgcaatccagcatctggatgatgtgaagagccaccttaaaggtcttagcaagagccatgctgaggatcttcatgtggatcccgaaaacttcaagcgccttgcggatgttctggtgatcgttctggctgccaaacttggatctgccttcactccccaagtccaagctgtctgggagaagctcaatgcaactctggtggctgctcttagccatggctacttc
   ;
end;

begin mrbayes;
   charset non_coding = 1-90 358-432;
   charset coding     = 91-357;
   partition region = 2:non_coding,coding;
   set partition = region;
   lset applyto=(2) nucmodel=codon;
   prset ratepr=variable;
   mcmc ngen=5000 nchains=1 samplefreq=10;
end;

Next, a PBS Pro batch script must be created to run your job. The first script below shows an MPI parallel run of the above '.nex' input file. Note that the number of processors that can be used by the job is limited to the number of chains in the input file. Here, we have just 2 chains and therefore can only request 2 processors. If you make the mistake of asking for more processors than the input file has chains, you will get the following error message at the end of your PBS output file:

      The number of chains must be at least as great
      as the number of processors (in this case 4)

Here is the MPI parallel PBS batch script for BEST that requests 2 processors, one for each chain in the input file:

#!/bin/bash
#PBS -q production
#PBS -N BEST_parallel
#PBS -l select=2:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel executable to run
echo ">>>> Begin BEST Parallel Run ..."
echo ""
mpirun -np 2 -machinefile $PBS_NODEFILE /share/apps/best/default/bin/mbbest ./bglobin.nex
echo ""
echo ">>>> End   BEST Parallel Run ..."

This script can be dropped into a file (say 'best_mpi.job') on BOB, ATHENA, or ANDY and run with:

qsub best_mpi.job

It should take less than five minutes to run and will produce PBS output and error files beginning with the job name 'BEST_parallel'. Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=2:ncpus=1:mem=1920mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 2 resource 'chunks' each with 1 processor (core) and 1,920 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The CUNY HPC Center also provides a serial version of BEST. A PBS batch script for running the serial version of BEST follows:

#!/bin/bash
#PBS -q production
#PBS -N BEST_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin BEST Serial Run ..."
echo ""
/share/apps/best/default/bin/mbbest_serial ./bglobin.nex
echo ""
echo ">>>> End   BEST Serial Run ..."
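>>> End">

This script can be dropped into a file (say 'best_serial.job') and started in the usual way:

qsub best_serial.job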

BPP2

BPP2 uses a Bayesian modeling approach to generate the posterior probabilities of species assignments taking into account uncertainties due to unknown gene trees and the ancestral coalescent process. For tractability, it relies on a user-specified guide tree to avoid integrating over all possible species delimitations. Additional information can be found at the download site here [29].

At the CUNY HPC Center BPP2 version 2.1a is installed on BOB and ATHENA. BPP2 is a serial code that takes its input from a simple text file provided on the command line. Here is an example PBS script that will run the fence lizard test case provided with the distribution archive (/share/apps/bpp2/default/examples):

#!/bin/bash
#PBS -q production
#PBS -N BPP2_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory and executable to run
echo ">>>> Begin BPP2 Serial Run ..."
echo ""
/share/apps/bpp2/default/bin/bpp2 ./lizard.bpp.ctl
echo ""
echo ">>>> End   BPP2 Serial Run ..."

This script can be dropped in to a file (say bpp2.job) and started with the command:

qsub bpp2.job

Running the fence lizard test case should take less than 15 minutes and will produce PBS output and error files beginning with the job name 'BPP2_serial'. Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

DL_POLY

DL_POLY is a general purpose molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith, T.R. Forester and I.T. Todorov. Both serial and parallel versions are available. The original package was developed by the Molecular Simulation Group (now part of the Computational Chemistry Group, MSG) at Daresbury Laboratory under the auspices of the Engineering and Physical Sciences Research Council (EPSRC) for the EPSRC's Collaborative Computational Project for the Computer Simulation of Condensed Phases (CCP5). Later developments were also supported by the Natural Environment Research Council through the eMinerals project. The package is the property of the Central Laboratory of the Research Councils, UK.

DL_POLY versions 2.20 and 3.10 are installed on ANDY under the user applications directory

/share/apps/dlpoly

The default is currently version 2.20, but users can select the more current version by referencing the 3.10 directory explicitly in their scripts.

To run DL_POLY the user needs to provide a set of several files. Those that are required include:

1) The CONTROL file, which indicates to DL_POLY what kind of simulation you want to run, how much data you want to gather, and for how long you want the simulation to run.

2) The CONFIG file, which contains the atom positions, and, depending on how the file was created (e.g. whether this is a configuration created from ‘scratch’ or the end point of another run), the atom's velocities and forces.

3) The FIELD file, which specifies the nature of the intermolecular interactions, the molecular topology, and the atomic properties, such as charge and mass.

Sometimes you may require a fourth file: TABLE, which contains short-ranged potential and force arrays for functional forms not available within DL_POLY (usually because they are too complex, e.g. spline potentials); and/or a fifth file: TABEAM, which contains metal potential arrays for non-analytic or overly complex functional forms; and/or a sixth file: REFERENCE, which is similar to the CONFIG file and contains the ”perfect” crystalline structure of the system.

Several directories are included in the installation tree. The primary executable for DL_POLY and a number of other supporting scripts are located in the directory:

/share/apps/dlpoly/default/bin

A collection of example input files that you may use as test cases are located in the directory:

/share/apps/dlpoly/default/data

The user and installation guide in PDF format are located in the directory:

/share/apps/dlpoly/default/man

Support utilities and programs are found in:

/share/apps/dlpoly/default/utility
/share/apps/dlpoly/default/public

To test DL_POLY, copy the files in

/share/apps/dlpoly/default/data/TEST10/LF

to a working directory (e.g. 'dlpoly') in your $HOME directory (the copy commands are sketched after the script) and run the PBS script provided below using the PBS submission command:

qsub dlpoly.job
#!/bin/bash
# Simple MPI PBS Pro batch to run
# DL_POLY on 8 cpus allowing PBS to
# freely select which cpus to use.
#PBS -q production
#PBS -N testdlpoly
#PBS -l select=8:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

cd $PBS_O_WORKDIR

echo "Starting DLPOLY Job ... "

/share/apps/openmpi-intel/default/bin/mpirun -np 8 -machinefile $PBS_NODEFILE /share/apps/dlpoly/default/bin/dlpoly

echo "DLPOLY Job is done .... results in OUTPUT"

Please refer to the DL_POLY manual for more detailed information on DL_POLY and to the general PBS section in this Wiki for more details on the PBS queuing system. A number of tutorials and further information can also be found online.

GAUSS

GAUSS is an easy-to-use data analysis, mathematical, and statistical environment based on the powerful, fast, and efficient GAUSS Matrix Programming Language. GAUSS is used to solve real-world problems and data analysis problems of exceptionally large scale. GAUSS version 3.2.27 is currently available on ATHENA and BOB. At the CUNY HPC Center GAUSS is typically run in serial mode. (Note: GAUSS should not be confused with the computational chemistry application Gaussian.)

A PBS Pro submit script for GAUSS that runs on 1 processor (core) follows:

#!/bin/bash
#PBS -q production
#PBS -N GAUSS_job
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# You must explicitly change to the working directory in PBS

cd $HOME/my_GAUSS_work

/share/apps/gauss/default/tgauss < ./pxyz.e > pxyz.out
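
This script can be dropped into a file (say 'gauss.job') and submitted in the usual way:

qsub gauss.job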

Here, the file pxyz.e was taken from the GAUSS examples in /share/apps/gauss/examples. Upon successful completion, a run file "graphic.tkf" should be created in the working directory.

pxyz.e:

library pgraph;
graphset;

let v = 100 100 640 480 0 0 1 6 15 0 0 2 2;
wxyz = WinOpenPQG( v, "XYZ Plot", "XYZ" );
call WinSetActive( wxyz );

begwind;
makewind(9,6.855,0,0,1);
makewind(9/2.9,6.855/2.9,0,0,0);
makewind(9/2.9,6.855/2.9,0,3.8,0);
_psurf = 0;
title("\202XYZ Curve - \201Toroidal Spiral");
fonts("simplex complex");
xlabel("X");
ylabel("Y");
zlabel("Z");

setwind(1);
t = seqa(0,.0157,401);
a = .2; b=.8; c=20;
x = 3*((a*sin(c*t)+b) .* cos(t));
y = 3*((a*sin(c*t)+b) .* sin(t));
z = a*cos(c*t);
margin(.5,0,0,0);
ztics(-.3,.3,.3,0);
_pcolor = 10;
view(-3,-2,4);
volume(1,1,.7);
_plwidth = 5;
xyz(x,y,z);

nextwind;
margin(0,0,0,0);
title("");
x = x .* (sin(z)/10);
_paxes = 0;
_pframe = 0;
_pbox = 13;
_pcolor = 11;
_plwidth = 0;
view(15,2,10);
xyz(x,y,z);

nextwind;
_pcolor = 9;
a = .4; b=.4; c=15;
x = 3*((a*sin(c*t)+b) .* cos(t));
y = 3*((a*sin(c*t)+b) .* sin(t));
z = a*cos(c*t);
volume(1,1,.4);
xyz(x,y,z);

endwind;

call WinSetActive( 1 );

Gaussian09 and Gaussian03

The Gaussian series of programs is used by chemists, chemical engineers, biochemists, physicists and others for research in established and emerging areas of chemical interest. Starting from the basic laws of quantum mechanics, Gaussian predicts the energies, molecular structures, and vibrational frequencies of molecular systems, along with numerous molecular properties derived from these basic computation types. It can be used to study molecules and reactions under a wide range of conditions, including both stable species and compounds which are difficult or impossible to observe experimentally such as short-lived intermediates and transition structures.

Gaussian09 is the latest in the Gaussian series of electronic structure programs. Gaussian03 is the most recent prior version. The CUNY HPC Center currently supports both versions, each of which is licensed under different terms. Gaussian03 is licensed for use by users from any CUNY campus. Gaussian09 is currently licensed on a borough-by-borough basis with a cross-licensing clause. At the moment, CUNY institutions in all New York City boroughs, except the Bronx, have licensed Gaussian09 and may use it at the CUNY Center under the cross-license. Gaussian03 is available to all HPC Center users in the CUNY system. Users not licensed to run Gaussian09 who attempt to do so will find that their jobs fail. Access to either version requires that the user be placed in a specific Gaussian application Unix group ('gauss03' and/or 'gauss09'). You can check to see if you are in these Unix groups with the following command:

andy$
andy$ groups
gauss03 gauss09
andy$

Gaussian Compute Resources at CUNY HPC Center

Gaussian09 and Gaussian03 run as a series of executables, the collection of which depends on the type of simulation requested in the input file. As such, while the core, numerically intensive routines are highly optimized, performance of Gaussian on any computer system is highly dependent on the particular problem being solved. Generally speaking, Gaussian can be run as a single core job, an SMP job, or as a distributed parallel job using Linda (a proprietary distributed parallel programming library similar to MPI). At the CUNY HPC Center, Parallel Linda is NOT supported, and therefore parallel runs are limited to the number of cores available on a single compute node. The maximum number is 8 on ZEUS, ANDY, and BOB. Parallel runs with fewer than 8 cores are possible, as are serial runs, and these may end up being scheduled to run by PBS sooner than full 8-way parallel jobs. In constructing runs for fewer than 8 cores, users should reduce the resources requested with the '-l select' statement in proportion to the compute-node maximums on the system being used.
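
For example, a 4-core Gaussian job on ZEUS or BOB might scale the full-node request shown in the sample script further below by half (a sketch only; the scratch space estimate is hypothetical and should reflect your own job's needs):

#PBS -l select=1:ncpus=4:mem=7680mb:lscratch=100gb
#PBS -l place=pack

The '%nproc' and '%mem' directives in the Gaussian input file should be reduced to match.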

At the HPC Center, Gaussian (03 and 09) is installed only on ZEUS, ANDY, and BOB. ZEUS has eight, 8-core Intel Clovertown-based nodes (2 sockets x 4 cores) available for Gaussian jobs. In addition, eight Woodcrest nodes with just 2 cores each are available on ZEUS. On ANDY twelve nodes, each with 8 cores, are available for Gaussian work. ANDY's cores are substantially faster than those on ZEUS (at least 2X), and they include more memory per core (~3 GBytes compared to ~2 Gbytes on ZEUS). On BOB eight nodes each with 8 cores and 2 GBs of memory per core are now also available. BOB's nodes are AMD Shanghai processors and should be faster than those on ZEUS and slower than those on ANDY. The Gaussian resources on ZEUS, ANDY, and BOB are accessed by submitting to the special queue 'production_gau' (see PBS script below).

Gaussian Scratch File Storage Space

Scratch space for Gaussian's temporary files is handled somewhat differently on ANDY than on ZEUS and BOB. On ZEUS and BOB, each 8-core node has 850 GBytes of node-local scratch space for Gaussian scratch files (integral file, read-write file, etc.). ZEUS's 2-core Woodcrest nodes have less, only 300 GBytes of local scratch space. The path to these compute-node directories is '/state/partition1/[g03_scr,g09_scr]' depending on whether the job is a Gaussian03 or Gaussian09 job. This path is used by the PBS script to create a dated subdirectory specific to each job (see script below). If a single Gaussian job is using all the cores on a particular node (this is often the case) then that entire space is available to that job, assuming files from previous jobs have been cleaned up. No more than this is available. On ZEUS and BOB, this space is NOT included in the user's storage quota.

On ANDY Gaussian scratch space is handled differently. On ANDY there is NO compute-node-local scratch file space. Instead, the Gaussian PBS script creates its job-specific Gaussian scratch directory in '/home/gaussian/[g03_scr,g09_scr]'. This location in '/home' is part of ANDY's Lustre parallel file system mounted by ALL nodes (login and compute) on ANDY. When compared to ZEUS and BOB this implies several differences. The first is that on ANDY Gaussian scratch files ARE counted when determining storage use for quotas. If a user had 25 Gbytes in their home directory and 100 GBytes in '/home/gaussian/g09_scr', the total (125 GBytes) would be counted against their quota. If their quota were 80 GBytes, they would be over their quota, and this could result in a Gaussian job failure. Gaussian jobs that cannot write their scratch files (for whatever reason) typically present errors of the following type in their log files:

Erroneous write. Write 39494144 instead of 325185280.
fd = 4
orig len = 325185280 left = 325185280
g_write

Because Gaussian scratch files are generally large, active Gaussian users on ANDY are given a larger initial quota of 250 GBytes (the default is 50 GBytes). The additional 200 GBytes is to be used for the temporary scratch files that Gaussian creates while running and that would typically be removed at the end of each run. It should NOT be used for extra storage in the user's home directory; this will compromise the space available to the user for Gaussian scratch files. Even larger quotas are available on ANDY to Gaussian users by special request when it is clear that their runs will require even more scratch space than 250 GBytes.

A second difference is that on ANDY, all Gaussian jobs on the system are writing into the same general location. While the Lustre parallel file system is very fast, these jobs are competing with each other for bandwidth and storage, along with the space in use for user home directories. ANDY's ~25 TBytes of disk space is plenty if users are economical in their use of it. Users are encouraged to ensure that their scratch file data is removed after each completed Gaussian run. The example PBS script below for submitting Gaussian jobs includes a line to remove scratch files, but this is not always successful. You may have to manually remove your scratch files. The example script prints out the unique name of each Gaussian job's scratch directory. Please police your own use of Gaussian scratch space on ANDY by going to '/home/gaussian/[g09_scr, g03_scr]' and looking for directories that begin with your name and the date that the directory was created. For example:

andy$
andy$ cd /home/gaussian/g09_scr
andy$
andy$ ls
a.eisenberg_09.27.11_12:54:15_25320  a.eisenberg_10.04.11_15:54:18_27212  ffernandez_09.25.11_19:15:55_2297    jarzecki_09.30.11_15:04:28_1661
a.eisenberg_09.27.11_14:37:11_19643  a.eisenberg_10.04.11_15:54:20_29710  ffernandez_09.26.11_12:33:13_15542  michael.green_09.21.11_15:19:23_5986
andy$
andy$ /bin/rm -r michael.green_09.21.11_15:19:23_5986
andy$

Above, a job that created a scratch directory on 9.21.11 is removed because the user (michael.green) knows that this job has completed and the files are no longer needed. If you are not sure when your job started, you can get this information from the full listing of your job's PBS status ('qstat -f JID') by looking for the 'stime' entry, the start time for the job. Clearly, you do not wish to remove the directories of jobs that are currently running.
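
For example, for a (hypothetical) job with ID 12345, the start time can be pulled out of the full status listing like this:

qstat -f 12345 | grep stime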

The HPC Center has created a special PBS resource ('lscratch') to determine the amount of scratch space available at runtime and to start jobs ONLY if that amount of scratch space is available. Users need to be able to accurately estimate the scratch space they require to efficiently set this flag in their PBS script. Jobs requiring the maximum available (~800 GBytes) should allocate an entire, 8-core compute node to themselves and use all eight cores for the run. Finally, Gaussian users should note that Gaussian scratch files are NOT backed up. Users are encouraged to save their checkpoint files in their PBS 'working directories' on /home/user.name if they will be needed for future work. From this location, they will be backed up. Again, Gaussian scratch files in /home/gaussian are NOT backed up.

NOTE: If other users have failed to clean up after themselves, and you request the maximum amount of Gaussian scratch space, it may not be available and your job may sit in the queue.

Gaussian PBS Job Submission

The Gaussian resources on ZEUS, ANDY, and BOB are accessed by submitting to the special queue 'production_gau' (see PBS script below). The same ~800 GByte scratch space limit is enforced on each system. Similarly, Gaussian parallel jobs are limited by the number of cores on a single compute node and must be forced to run on a single node (-l place=pack). Eight (8) is the maximum processor (core) count on ZEUS, ANDY and BOB. ANDY, ZEUS, and BOB have different amounts of memory available per core, as mentioned above, and the path to the scratch directories is slightly different on each system. These differences mean that the PBS scripts required to submit work to ANDY and ZEUS are slightly different. These differences are outlined below.

First, we provide a simple Gaussian input file (a Hartree-Fock geometry optimization of methane), and the companion PBS batch submit script that would allocate all 8 cores on a single ZEUS, ANDY, or BOB compute node and 200 GBytes of compute-node-local storage.

SPECIAL NOTE ON Gaussian03: The PBS batch script below is written to select Gaussian09, but would work for Gaussian03 if all the '09' strings it contains were edited to '03'. In addition, the following line (fix) MUST be included in the Gaussian03 script to get it to work. This is because the Gaussian03 binaries are very old and now only work with a much older release of the Portland Group Compiler that we happened to have saved (release 10.3) on BOB and ZEUS. This line can be added to the script any place before the executable 'g09' is invoked. This older release of the compiler no longer exists on ANDY, and therefore Gaussian03 will NOT run on ANDY at this time.

setenv LD_LIBRARY_PATH /share/apps/pgi/10.3/linux86-64/10.3/libso:"$LD_LIBRARY_PATH"

The Gaussian 09 methane input deck is:

%chk=methane.chk
%mem=16GB
%nproc=8
# hf/6-31g

Title Card Required

0 1
 C                  0.80597015   -1.20895521    0.00000000
 H                  1.16262458   -2.21776521    0.00000000
 H                  1.16264299   -0.70455702    0.87365150
 H                  1.16264299   -0.70455702   -0.87365150
 H                -0.26402985   -1.20894202    0.00000000

END

Notice that we have explicitly requested 16 GBytes of memory with the '%mem=16GB' directive. This will allow the job to make full use of all the memory available on a single ZEUS or BOB compute node (ANDY offers up to 24 GBytes). The input file also instructs Gaussian to use 8 processors which will ensure that all of Gaussian's parallel executables (i.e. links) will run in SMP mode with 8 cores. For this simple methane geometry optimization, requesting these resources (both here and in the PBS script) is a bit extravagant, but both the input file and script can be adapted to other more substantial molecular systems running more accurate calculations.

Here is the Gaussian PBS script:

#!/bin/csh
# This script runs a 8-cpu (core) Gaussian 09 job
# with the 8 cpus packed onto a single compute node 
# to ensure that it will run as an SMP parallel job.
#PBS -q production_gau
#PBS -N methane_opt
#PBS -l select=1:ncpus=8:mem=15360mb:lscratch=200gb
#PBS -l place=pack
#PBS -V

# print out name of master execution host (compute node)
echo ""
echo -n ">>>> PBS Master compute node is: "
hostname

# set the G09 root directory

setenv g09root /share/apps/gaussian

# set the name and location of the G09 scratch directory
# on the compute node.  This is where one needs to go
# to remove left-over script files.

setenv MY_SCRDIR `whoami;date '+%m.%d.%y_%H:%M:%S'`
setenv MY_SCRDIR `echo $MY_SCRDIR | sed -e 's; ;_;'`

setenv GAUSS_SCRDIR /state/partition1/g09_scr/${MY_SCRDIR}_$$
mkdir -p $GAUSS_SCRDIR

# On Andy you would use:
#setenv GAUSS_SCRDIR /home/gaussian/g09_scr/${MY_SCRDIR}_$$
#mkdir -p $GAUSS_SCRDIR

echo $GAUSS_SCRDIR

# run the G09 setup script

source $g09root/g09/bsd/g09.login

# users must explicitly change to their working directory with PBS

cd $PBS_O_WORKDIR

# start the G09 job

$g09root/g09/g09 methane.input

# remove the scratch directory before terminating

/bin/rm -r $GAUSS_SCRDIR

echo 'Job is done!'

To run the job, one must use the standard PBS job submission command as follows:

qsub g09.job

Some of this PBS script's features are worth detailing. First, note that Gaussian scripts conventionally are run using the C-shell. Next, the '-l select' directive requests one PBS resource chunk (see the PBS Pro section below for the definition of a resource chunk) which includes 8 processors (cores) and nearly 16 GBytes of memory. The '-l select' directive also instructs PBS to check to see if there are any compute nodes available with 200 GBytes of storage using the 'lscratch=200gb' directive. As a Gaussian user, you must be able to estimate the amount of scratch storage space your job will need. PBS will keep this job in a queued state until sufficient resources, including sufficient storage space, are found to run the job. Previously completed jobs that have not cleaned up their scratch files may prevent this job from running. The amount of scratch requested is presumed by PBS to be the amount that will be used; therefore, requesting more scratch space than is required by the job may also prevent subsequent jobs from running that might otherwise have the space to run.

Working down further in the script, the '-l place=pack' directive tells PBS to pack the chunk defined in the '-l select' statement onto a single compute node. If no node large enough exists, or none is available at the time of job submission, the job will be queued (perhaps indefinitely). Further along in the script, the Gaussian09 environment variables are set, and the location and name of the job's scratch directory are defined. On BOB and ZEUS, this directory will always be placed in '/state/partition1' on the compute node that PBS assigns to the job. On ANDY it will be in '/home/gaussian'. A job's scratch directory will be given a name composed of the user's name, the date and time of creation, and the process ID unique to the job. Finally, the master Gaussian 09 executable, 'g09', is called to start the job. After job completion, this script should automatically remove the scratch files it created in the scratch directory. Please verify that this has occurred.

Users may choose to run jobs with fewer processors (cores, cpus) and smaller storage space requests than this sample job. This includes one-processor jobs and others using a fraction of a compute node (2 processors, 4 processors). On a busy system, these jobs may start sooner than those requesting a full 8 processors. Selecting the most efficient combination of processors, memory, and storage will ensure that resources will not be wasted and will be available to allocate to the next job submitted.

All users of Gaussian that publish based on its results must include the following citation in the publication to be in compliance with the terms of the license:

Gaussian [03,09], Revision C.02, M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, J. A. Montgomery, Jr., T. Vreven, K. N. Kudin, J. C. Burant, J. M. Millam, S. S. Iyengar, J. Tomasi, V. Barone, B. Mennucci, M. Cossi, G. Scalmani, N. Rega, G. A. Petersson, H. Nakatsuji, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, M. Klene, X. Li, J. E. Knox, H. P. Hratchian, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, P. Y. Ayala, K. Morokuma, G. A. Voth, P. Salvador, J. J. Dannenberg, V. G. Zakrzewski, S. Dapprich, A. D. Daniels, M. C. Strain, O. Farkas, D. K. Malick, A.D. Rabuck, K. Raghavachari, J. B. Foresman, J. V. Ortiz, Q. Cui, A. G. Baboul, S. Clifford, J. Cioslowski, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. L. Martin, D. J. Fox, T. Keith, M. A. Al-Laham, C. Y. Peng, A. Nanayakkara, M. Challacombe, P. M. W. Gill, B. Johnson, W. Chen, M. W. Wong, C. Gonzalez, and J. A. Pople, Gaussian, Inc., Wallingford CT, 2004.

GENOMEPOP2

GenomePop2 is a newer and specialized version of the older program GenomePop (version 1.0). GenomePop2 (version 2.2) is designed to manage SNPs under more flexible and useful settings that are controlled by the user. If you need models with more than 2 alleles you should use the older GenomePop version of the program.

GenomePop2 allows the forward simulation of sequences of biallelic positions. As in the previous version, a number of evolutionary and demographic settings are allowed. Several populations under any migration model can be implemented. Each population consists of a number N of individuals. Each individual is represented by one (haploids) or two (diploids) chromosomes with constant or variable (hotspots) recombination between binary sites. The fitness model is multiplicative, with each derived allele having a multiplicative effect of (1-s * h-E) on the global fitness value. By default E=0 and h=0.5 in diploids, but 1 in homozygotes or in haploids. Selective nucleotide sites undergoing directional selection (positive or negative) in different populations can be defined. In addition, bottleneck and/or population expansion scenarios can be set up by the user during a desired number of generations. Several runs can be executed and a sample of user-defined size is obtained for each run and population. For more detail on how to use GenomePop2, please visit the web site here [30].

The CUNY HPC Center has installed GenomePop2 version 2.2 on BOB and ATHENA. GenomePop2 is a serial code that reads all of its input parameters from a file in the user's working directory called 'GP2Input.txt'. How to set up such a file is explained in the How-To section at the GenomePop2 web site here [31]. The following PBS batch script runs the third example given in the How-To, which defines different SNP ancestral alleles in different populations.

NOTE: Version 1.0.6 of the program has also been installed and can be found at '/share/apps/genomepop/1.0.6/bin/genomepop1'

#!/bin/bash
#PBS -q production
#PBS -N GENPOP2_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin GENPOP2 Serial Run ..."
echo ""
/share/apps/genomepop/default/bin/genomepop2
echo ""
echo ">>>> End   GENPOP2 Serial Run ..."

This script can be dropped in to a file (say genomepop2.job) and started with the command:

qsub genomepop2.job

This test case should take less than a minute to run and will produce PBS output and error files beginning with the job name 'GENPOP2_serial'. Details on the meaning of the PBS script are covered above in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run the job will be printed in the PBS output file by the 'hostname' command.

While it is not visible in the PBS script, your customized 'GP2Input.txt' file MUST be present in the working directory for the job. When the job completes, GenomePop2 will have created a subdirectory called 'GP2_Results' with the results files in it. One could easily adapt this script to run GenomePop version 1.

GROMACS

GROMACS (Groningen Machine for Chemical Simulations) is a full-featured suite of free software, licensed under the GNU General Public License to perform molecular dynamics simulations -- in other words, to simulate the behavior of molecular systems with hundreds to millions of particles, using Newtonian equations of motion. It is primarily used for research on proteins, lipids, and polymers, but can be applied to a wide variety of chemical and biological research questions.

The CUNY HPC Center has installed GROMACS version 4.5.4 (the support tools and primary executable) in both single (32-bit) and double (64-bit) precision mode on ANDY. The double precision version is the default at the HPC Center. All of the GROMACS double-precision executables end in the suffix '_d' to distinguish them from the single-precision version, as in:

/share/apps/gromacs/default/bin/grompp_d

Single-precision executables have no suffix and are accessed from their specific directory tree here:

/share/apps/gromacs/4.5.4_32bit

The primary MD executable may be run in parallel using an MPI or GPU version on ANDY. These are also provided in both single- and double-precision, and include either the suffix 'mpi' or 'gpu'. See the scripts below which use these executables. Details on the GROMACS MD software suite can be found in the GROMACS on-line manual here: [32].

The following PBS batch scripts demonstrate how to run the MD piece of a typical GROMACS computation on ANDY.
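
The 'md_para.tpr' run input file used in the scripts below must already exist in the working directory. It would typically be prepared beforehand with the GROMACS preprocessor; a minimal sketch, assuming hypothetical parameter, coordinate, and topology file names:

/share/apps/gromacs/default/bin/grompp_d -f md_para.mdp -c conf.gro -p topol.top -o md_para.tpr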

#!/bin/bash
#PBS -N par16_test
#PBS -q production_qdr
#PBS -l select=16:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# print out which PBS master compute node you are running on
echo -n "The primary compute node for this job was: "
hostname

# You must explicitly change to your working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel, double-precision executable (mdrun_mpi_d)
mpirun -np 16 -machinefile $PBS_NODEFILE mdrun_mpi_d -px -pf -s md_para.tpr -o md_para.trr -c md_para.gro -e md_para.edr -g md_para.log

This PBS script is fairly typical of others provided here on the HPC Center Wiki for running MPI parallel workloads. The line '-l select=16:ncpus=1:mem=2880mb' requests 16 processors each with 2880 MBs of memory. Next, '-l place=free' instructs PBS to place each processor on the least loaded nodes wherever they happen to be (no packing of processors on a single node is requested). The job is being directed to the 'production_qdr' routing queue, which uses the half of ANDY that includes the QDR (faster) InfiniBand interconnect.

The comments in the script explain the other sections; the 'mpirun' command is used to start the GROMACS 'mdrun_mpi_d' executable with the requested 16 processors. The environment variable '$PBS_NODEFILE' contains the list of the 16 processors that PBS has allocated to this job. To run the single-precision version of MPI parallel GROMACS with this script, one would need to change the 'mpirun' line above to the following, which points directly into the single-precision binary directory:

mpirun -np 16 -machinefile $PBS_NODEFILE /share/apps/gromacs/4.5.4_32bit/bin/mdrun_mpi -px -pf -s md_para.tpr -o md_para.trr -c md_para.gro -e md_para.edr -g md_para.log


The GPU version of the script to run the same problem would look like this:

#!/bin/bash
#PBS -N par_GPU.test
#PBS -q production_gpu
#PBS -l select=1:ncpus=1:ngpus=1:mem=2880mb:accel=fermi
#PBS -l place=free
#PBS -V

# print out which PBS master compute node you are running on
echo -n "The primary compute node for this job was: "
hostname

# You must explicitly change to your working directory in PBS
cd $PBS_O_WORKDIR

# Point to the GPU-enabled, double-precision executable (mdrun_gpu_d); no 'mpirun' is needed
mdrun_gpu_d -px -pf -s md_para.tpr -o md_para.trr -c md_para.gro -e md_para.edr -g md_para.log

There are several important differences from the 16 processor MPI script above. First, all GPU jobs must be sent to the 'production_gpu' routing queue to be allocated GPU resources by PBS. Next, the '-l select' line has changed. It now requests 1 CPU (where the GPU host program runs) and 1 GPU, and names the type of GPU acceleration resource that it needs--in this case an NVIDIA Fermi 2.0 device. ANDY has 96 of these devices on the GPU side of the system. Lastly, the 'mdrun_gpu_d' command line has changed. The 'mpirun' start program is not needed, and the double-precision binary built for ANDY's Fermi GPUs is invoked. The options to this version of the program have not changed. Only a single GPU is used in this example, although each NVIDIA Fermi GPU has 448 small cores that will be dedicated to this job.

For the 'mdrun_mpi_d' command-line options used above a short summary of their meaning and of several others is provided here:

Option     Filename  Type         Description
------------------------------------------------------------
  -s      topol.tpr  Input        Run input file: tpr tpb tpa
  -o       traj.trr  Output       Full precision trajectory: trr trj cpt
  -x       traj.xtc  Output, Opt. Compressed trajectory (portable xdr format)
-cpi      state.cpt  Input, Opt.  Checkpoint file
-cpo      state.cpt  Output, Opt. Checkpoint file
  -c    confout.gro  Output       Structure file: gro g96 pdb etc.
  -e       ener.edr  Output       Energy file
  -g         md.log  Output       Log file
-dhdl      dhdl.xvg  Output, Opt. xvgr/xmgr file
-field    field.xvg  Output, Opt. xvgr/xmgr file
-table    table.xvg  Input, Opt.  xvgr/xmgr file
-tablep  tablep.xvg  Input, Opt.  xvgr/xmgr file
-tableb   table.xvg  Input, Opt.  xvgr/xmgr file
-rerun    rerun.xtc  Input, Opt.  Trajectory: xtc trr trj gro g96 pdb cpt
-tpi        tpi.xvg  Output, Opt. xvgr/xmgr file
-tpid   tpidist.xvg  Output, Opt. xvgr/xmgr file
 -ei        sam.edi  Input, Opt.  ED sampling input
 -eo        sam.edo  Output, Opt. ED sampling output
  -j       wham.gct  Input, Opt.  General coupling stuff
 -jo        bam.gct  Output, Opt. General coupling stuff
-ffout      gct.xvg  Output, Opt. xvgr/xmgr file
-devout   deviatie.xvg  Output, Opt. xvgr/xmgr file
-runav  runaver.xvg  Output, Opt. xvgr/xmgr file
 -px      pullx.xvg  Output, Opt. xvgr/xmgr file
 -pf      pullf.xvg  Output, Opt. xvgr/xmgr file
-mtx         nm.mtx  Output, Opt. Hessian matrix
 -dn     dipole.ndx  Output, Opt. Index file
-multidir    rundir  Input, Opt., Mult. Run directory

Option       Type   Value   Description
------------------------------------------------------
-[no]h       bool   no      Print help info and quit
-[no]version bool   no      Print version info and quit
-nice        int    0       Set the nicelevel
-deffnm      string         Set the default filename for all file options
-xvg         enum   xmgrace  xvg plot formatting: xmgrace, xmgr or none
-[no]pd      bool   no      Use particle decompostion
-dd          vector 0 0 0   Domain decomposition grid, 0 is optimize
-npme        int    -1      Number of separate nodes to be used for PME, -1
                            is guess
-ddorder     enum   interleave  DD node order: interleave, pp_pme or cartesian
-[no]ddcheck bool   yes     Check for all bonded interactions with DD
-rdd         real   0       The maximum distance for bonded interactions with
                            DD (nm), 0 is determine from initial coordinates
-rcon        real   0       Maximum distance for P-LINCS (nm), 0 is estimate
-dlb         enum   auto    Dynamic load balancing (with DD): auto, no or yes
-dds         real   0.8     Minimum allowed dlb scaling of the DD cell size
-gcom        int    -1      Global communication frequency
-[no]v       bool   no      Be loud and noisy
-[no]compact bool   yes     Write a compact log file
-[no]seppot  bool   no      Write separate V and dVdl terms for each
                            interaction type and node to the log file(s)
-pforce      real   -1      Print all forces larger than this (kJ/mol nm)
-[no]reprod  bool   no      Try to avoid optimizations that affect binary
                            reproducibility
-cpt         real   15      Checkpoint interval (minutes)
-[no]cpnum   bool   no      Keep and number checkpoint files
-[no]append  bool   yes     Append to previous output files when continuing
                            from checkpoint instead of adding the simulation
                            part number to all file names
-maxh        real   -1      Terminate after 0.99 times this time (hours)
-multi       int    0       Do multiple simulations in parallel
-replex      int    0       Attempt replica exchange every # steps
-reseed      int    -1      Seed for replica exchange, -1 is generate a seed
-[no]ionize  bool   no      Do a simulation including the effect of an X-Ray
                            bombardment on your system

HOOMD

HOOMD performs general purpose particle dynamics simulations, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to many processor cores on a fast cluster. Unlike some other applications in the particle and molecular dynamics space, HOOMD's developers have worked to implement all of the code's computationally intensive kernels on the GPU, although currently only single-node, single-GPU or OpenMP-GPU runs are possible. There is no MPI-GPU or distributed parallel GPU version available at this time.

HOOMD's object-oriented design patterns make it both versatile and expandable. Various types of potentials, integration methods and file formats are currently supported, and more are added with each release. The code is available and open source, so anyone can write a plugin or change the source to add additional functionality. Simulations are configured and run using simple python scripts, allowing complete control over the force field choice, integrator, all parameters, how many time steps are run, etc. The scripting system is designed to be as simple as possible to the non-programmer.

The HOOMD development effort is led by the Glotzer group at the University of Michigan, but many groups from different universities have contributed code that is now part of the HOOMD main package; see the credits page for the full list. The HOOMD website and documentation are available here [33]. HOOMD version 0.9.2 has been installed on ANDY, whose NVIDIA S2050 Fermi GPUs each provide 448 computational cores. The version installed runs in single-precision (32-bit) mode.

A basic input file in HOOMD's python scripting format is presented here:

$cat test.hoomd
from hoomd_script import *

# create 100 random particles of name A
init.create_random(N=100, phi_p=0.01, name='A')

# specify Lennard-Jones interactions between particle pairs
lj = pair.lj(r_cut=3.0)
lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)

# integrate at constant temperature
all = group.all();
integrate.mode_standard(dt=0.005)
integrate.nvt(group=all, T=1.2, tau=0.5)

# run 10,000 time steps
run(10e3)

Here is a PBS script that will run the above test case on a single ANDY GPU:

#!/bin/bash
#PBS -q production_gpu
#PBS -N HOOMDS_test
#PBS -l select=1:ncpus=1:ngpus=1:accel=fermi
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo ""
echo -n ">>>> PBS Master compute node is: "
hostname

# Change to working directory
cd $PBS_O_WORKDIR

echo ">>>> Begin HOOMD GPU Parallel Run ..."
echo ""
/share/apps/hoomd/default/bin/hoomd test.hoomd 
echo ""
echo ">>>> End   HOOMD GPU Parallel Run ..."

The example above targets one (1) GPU on any compute node with an attached GPU. In the case of ANDY, that is any of the 'gpute-XX' compute nodes on the QDR InfiniBand side of the system. By selecting the '-q production_gpu' PBS routing queue and asking for one (1) GPU with '-l select=1:ncpus=1:ngpus=1:accel=fermi', PBS will ensure that a GPU is available to the HOOMD job. By default, if no options are given to the 'hoomd' command, the executable will first look for a GPU; if it finds one it will use it, and otherwise it will run only on the CPU. GPU-only or CPU-only execution can be requested with the '--mode=gpu' or '--mode=cpu' options on the command line above. NOTE: Options to the 'hoomd' command must be placed AFTER the python input script (i.e. hoomd test.hoomd --mode=gpu). The 'hoomd' executable accepts a variety of options to control runtime behavior; these are described in detail here [34].
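
For example, the invocation styles described above might look like the following (a sketch using the test script from earlier; see the HOOMD documentation for the full option list):

/share/apps/hoomd/default/bin/hoomd test.hoomd              # default: use a GPU if one is found, else run on the CPU
/share/apps/hoomd/default/bin/hoomd test.hoomd --mode=gpu   # require GPU execution
/share/apps/hoomd/default/bin/hoomd test.hoomd --mode=cpu   # force CPU-only execution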

IMa2

The IMa2 application performs basic 'Isolation with Migration' calculations using Bayesian inference and Markov chain Monte Carlo methods. The major conceptual addition in IMa2 relative to the original IMa program is that it can handle data from multiple populations. This requires that the user specify a phylogenetic tree; importantly, the tree must be rooted, and the sequence in time of its internal nodes must be known and specified. More information on IMa2 and IMa can be found in the user manual here [35].

IMa2 is a serial program that is currently installed on BOB and ATHENA at the CUNY HPC Center, and it requires an input file and potentially several additional data files to run. Here we provide a script that will run the test input file supplied by the authors, 'ima2_testinput.u'. Completing this run may also require the prior file ('ima2_priorfile_4pops.txt') and the nested-models file ('ima2_all_nested_models_2_pops.txt'). All of these files can be copied out of the IMa2 installation's examples directory, as follows:

cp /share/apps/ima2/default/examples/ima2_testinput.u .
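
If your run also needs the optional prior and nested-models files mentioned above, they can be copied in the same way (this assumes they sit in the same examples directory):

cp /share/apps/ima2/default/examples/ima2_priorfile_4pops.txt .
cp /share/apps/ima2/default/examples/ima2_all_nested_models_2_pops.txt .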

A working PBS batch script that will complete an IMa2 run is presented here:

#!/bin/bash
#PBS -q production
#PBS -N IMA2_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin IMa2 Serial Run ..."
echo ""
/share/apps/ima2/default/bin/IMa2 -i ima2_testinput.u -o ima2_testoutput.out -q2 -m1 -t3 -b10000 -l100
echo ""
echo ">>>> End   IMa2 Serial Run ..."

This script can be dropped into a file (say 'ima2_serial.job') on either BOB or ATHENA, and run with:

qsub ima2_serial.job

It should take less than a minute to run and will produce PBS output and error files beginning with the job name 'IMA2_serial'. It also produces IMa2's own output files. Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command.
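
While the job is queued or running, its state can be checked with the standard PBS 'qstat' command, and the PBS output and error files (named using the usual <jobname>.o<jobid> and <jobname>.e<jobid> pattern) can be listed once it finishes, for example:

qstat -u $USER
ls IMA2_serial.o* IMA2_serial.e*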

Please take note of the IMa2 options used here. Details on each can be found in the IMa2 manual referenced above.

HONDO PLUS

Hondo Plus 5.1 is a versatile electronic structure code that combines work from the original Hondo application developed by Harry King in the lab of Michel Dupuis and John Rys, and that of numerous subsequent contributors. It is currently distributed from the research lab of Dr. Donald Truhlar at the University of Minnesota. Part of the advantage of Hondo Plus is the availability of source implementations of a wide variety of model chemistries developed over its lifetime that researchers can adapt to their particular needs. The license to use the code requires a literature citation, which is documented in the Hondo Plus 5.1 manual found at:

http://comp.chem.umn.edu/hondoplus/HONDOPLUS_Manual_v5.1.2007.2.17.pdf

The Hondo Plus 5.1 installation at the CUNY HPC Center is the serial version of the application, and it is currently available only on ANDY. It was compiled with the Intel Fortran compiler. The installation directory (/share/apps/hondoplus/default) includes a large number of examples in the form of a test suite of input decks and correct outputs in the directory:

/share/apps/hondoplus/default/examples
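
Because the script below runs out of a '$HOME/hondo' directory, one way to set up the test case is to copy the example input there first (a sketch; check the exact input file name against the examples directory):

mkdir -p $HOME/hondo
cp /share/apps/hondoplus/default/examples/test1.0.1315.in $HOME/hondo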

A simple PBS Pro script to run a Hondo Plus serial job on ANDY is presented here, using one of the test input decks from the examples directory:

#!/bin/bash
# This script runs a serial HondoPlus job in the
# PBS production queue.  The HondoPlus code was compiled
# with the Intel Fortran compiler and the recommended
# settings. The SCM memory scratch space was left at
# the default size.
#PBS -q production
#PBS -N hondo_job
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

hostname

cd $HOME/hondo

echo 'HondoPlus Job starting ... '

/share/apps/hondoplus/default/bin/hondo test1.0.1315.in test1.0.1315.out

# Clean up scratch files by default

echo 'HondoPlus Job is done!'

Hondo Plus was compiled with the default memory sizes as set in the distribution. With the larger memory available on ANDY and many modern Linux cluster systems, compiling a larger-memory version is possible. Those interested should contact CUNY HPC Center help at hpchelp@csi.cuny.edu.

LAMARC

LAMARC is a program which estimates population-genetic parameters such as population size, population growth rate, recombination rate, and migration rates. It approximates a summation over all possible genealogies that could explain the observed sample, which may be sequence, SNP, microsatellite, or electrophoretic data. LAMARC and its sister program Migrate are successor programs to the older programs Coalesce, Fluctuate, and Recombine, which are no longer being supported. The programs are memory-intensive but can run effectively on workstations; the developers support a variety of operating systems. For more detail on LAMARC, please visit the website here [36], see this paper [37], and consult the documentation here [38].

LAMARC version 2.1.6 is currently installed at the CUNY HPC Center on the systems BOB and ATHENA. LAMARC is a serial code that can be compiled with or without a GUI interface. To discourage interactive GUI-based runs on these systems' login nodes, LAMARC has been compiled with the GUI disabled and should be run in command-line mode from a PBS batch script. Below is a PBS batch script that will run the sample XML input file provided with the distribution, 'sample_infile.xml', found in /share/apps/lamarc/default/examples. This run assumes that you have already converted your raw data file into one that is readable by LAMARC; this has already been done for the simple example input. A tutorial on the use of the converter is located here [39].
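
Before submitting, the example input can be copied into your working directory (a simple setup step, assuming the examples path shown above):

cp /share/apps/lamarc/default/examples/sample_infile.xml .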

#!/bin/bash
#PBS -q production
#PBS -N LAMARC_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin LAMARC Serial Run ..."
echo ""
/share/apps/lamarc/default/bin/lamarc ./sample_infile.xml -b
echo ""
echo ">>>> End   LAMARC Serial Run ..."

This script can be dropped into a file (say 'lamarc_serial.job') on either BOB or ATHENA, and run with:

qsub lamarc_serial.job

This sample input file should take less than a minute to run and will produce PBS output and error files beginning with the job name 'LAMARC_serial'. It also produces LAMARC's own output files. Details on the meaning of the PBS script are covered above in the PBS section of this Wiki. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are to be found (freely). The compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

Note the presence of the batch-mode option '-b' on the LAMARC command line. This is required to complete a batch submission, but it assumes that the input file that you are using has everything configured as you wish or that you will be using the default settings. If you do not use the '-b' option your batch job will sit forever waiting for input from your terminal that it will never get, because it is a batch job.

You can edit the input file settings manually using a Unix editor like 'vi', although you will have to work through a lot of XML punctuation to do this. Another approach is to run LAMARC interactively on the login node to generate a customized input file from the defaults-based file created by the converter program: the customized file can be saved from the interactive menu (under a new name) before any run is started, and then used as the batch input as above with the '-b' option.
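
As a sketch of that interactive step, starting LAMARC on the login node without the '-b' option brings up its menu, from which settings can be changed and a new input file written out before exiting (do not start a long run this way on the login node):

/share/apps/lamarc/default/bin/lamarc ./sample_infile.xml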

LAMMPS

LAMMPS is a classical molecular dynamics code that models an ensemble of particles in a liquid, solid, or gaseous state. It can model atomic, polymeric, biological, metallic, granular, and coarse-grained systems using a variety of force fields and boundary conditions. LAMMPS runs efficiently on single-processor desktop or laptop machines, but is designed for parallel computers. It will run on any parallel machine that compiles C++ and supports the MPI message-passing library. This includes distributed- or shared- memory parallel machines and Beowulf-style clusters. LAMMPS can model systems with only a few particles up to millions or billions. LAMMPS is a freely-available open-source code, distributed under the terms of the GNU Public License, which means you can use or modify the code however you wish. LAMMPS is designed to be easy to modify or extend with new capabilities, such as new force fields, atom types, boundary conditions, or diagnostics.

In the most general sense, LAMMPS integrates Newton's equations of motion for collections of atoms, molecules, or macroscopic particles that interact via short- or long-range forces with a variety of initial and/or boundary conditions. For computational efficiency LAMMPS uses neighbor lists to keep track of nearby particles. The lists are optimized for systems with particles that are repulsive at short distances, so that the node-local density of particles never becomes too large. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small 3d sub-domains, one of which is assigned to each processor. Processors communicate and store "ghost" atom information for atoms that border their sub-domain. LAMMPS is most efficient (in a parallel sense) for systems whose particles fill a 3d rectangular box with roughly uniform density. A complete description of LAMMPS can be found in its on-line manual here [40] or from the full PDF manual here [41]

Beyond what is offered in the basic package, LAMMPS includes a number of both standard and user-provided libraries that offer additional numerical models and parallel compute capability (including the use of GPUs). As such, the CUNY HPC Center has built THREE different MPI versions of the code: basic, standard, and all-inclusive. The basic build includes LAMMPS's default built-in methods (KSPACE, MANYBODY, MOLECULE); the standard build includes all the LAMMPS-developer-supported libraries (ASPHERE, CLASS2, COLLOID, DIPOLE, GPU, GRANULAR, KSPACE, MANYBODY, MEAM, MC, MOLECULE, OPT, PERI, POEMS, REAX, REPLICA, SHOCK, SRD, XTC); and the all-inclusive build adds to this all the user-provided libraries (USER-MISC, USER-ATC, USER-AWPMD, USER-CG-CMM, USER-CUDA, USER-EFF, USER-EWALDN, USER-REAXC, USER-SPH). The executable for each build has its own name (respectively: lammps_bsc_mpi, lammps_std_mpi, lammps_all_mpi), which can be used in the PBS job script to select the version of interest. Users wishing to create their own particular versions of LAMMPS should contact the CUNY HPC Center Helpline at 'hpchelp@csi.cuny.edu'.

The discussion in the previous paragraph refers to the version of LAMMPS downloaded and installed at the CUNY HPC Center as of 9.30.11. This version is currently installed on ANDY, the CUNY HPC Center's CPU-GPU cluster, which allows users to make use of LAMMPS GPU capability. In the near future, LAMMPS will also be available on SALK, the HPC Center's 1280 core Cray XE6.

A LAMMPS input deck (in.lj) from the LAMMPS benchmark suite is provided here to allow the reader to run a job using the information from the CUNY HPC Wiki alone:

 3d Lennard-Jones melt

variable        x index 1
variable        y index 1
variable        z index 1

variable        xx equal 20*$x
variable        yy equal 20*$y
variable        zz equal 20*$z

units           lj
atom_style      atomic

lattice         fcc 0.8442
region          box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box      1 box
create_atoms    1 box
mass            1 1.0

velocity        all create 1.44 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

fix             1 all nve

run             100

Here is a working PBS batch job submission script for the HPC Center's ANDY system that runs the basic version of LAMMPS on 8 processors (cores):

#!/bin/bash
#PBS -q production
#PBS -N LAMMPS_test
#PBS -l select=8:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo ""
echo -n ">>>> PBS Master compute node is: "
hostname

# Change to working directory
cd $PBS_O_WORKDIR

echo ">>>> Begin LAMMPS MPI Parallel Run ..."
echo ""
mpirun -np 8 -machinefile $PBS_NODEFILE /share/apps/lammps/default/bin/lammps_bsc_mpi < in.lj
#mpirun -np 8 -machinefile $PBS_NODEFILE /share/apps/lammps/default/bin/lammps_std_mpi < in.lj
#mpirun -np 8 -machinefile $PBS_NODEFILE /share/apps/lammps/default/bin/lammps_all_mpi < in.lj
echo ""
echo ">>>> End   LAMMPS MPI Parallel Run ..."

This PBS script will run the basic version of LAMMPS, with the KSPACE, MANYBODY, and MOLECULE models, in ANDY's 'production' queue on the DDR InfiniBand half of the system (ANDY is a divided system). By changing the queue designation to 'production_qdr', the same script would run the job on the QDR InfiniBand side of the system. QDR (quad-data-rate) InfiniBand has better latency and bandwidth characteristics and should in general scale better than the older DDR InfiniBand implementation, but both should provide good performance relative to Ethernet-based cluster systems.

To run the other versions (i.e. standard, all-inclusive), comment out the first 'mpirun' line above and uncomment the one that follows it. Notice that the executable has a different name in each case. For the all-inclusive version, which has its GPU code enabled, you must also change the job submission queue to 'production_gpu' and edit the '-l select' line to request one GPU for every CPU allocated by PBS. The modified '-l select' for a GPU run would look like this:

#PBS -l select=8:ncpus=1:ngpus=1:mem=2880mb
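
Putting the two changes together, the header lines and 'mpirun' command for an 8-core, 8-GPU run of the all-inclusive build might look like this (a sketch assembled from the script above):

#PBS -q production_gpu
#PBS -l select=8:ncpus=1:ngpus=1:mem=2880mb

mpirun -np 8 -machinefile $PBS_NODEFILE /share/apps/lammps/default/bin/lammps_all_mpi < in.lj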

The installation on the Cray XE6 (SALK) will not include the GPU parallel models because the Cray does not include GPU hardware.

MATHEMATICA

Mathematica is a fully integrated technical computing system that combines fast, high-precision numerical and symbolic computation with data visualization and programming capabilities. Mathematica version 8.0 is currently installed only on ATHENA and KARLE. The basics of running Mathematica on CUNY HPC systems are presented in detail in a separate section below. Additional information on how to use Mathematica can be found at http://www.wolfram.com/learningcenter.

MATLAB

MATLAB is a high-performance language for technical computing that integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. At the CUNY HPC Center, MATLAB jobs can be run only on BOB and should be initiated from a Linux or Windows client on the CSI campus or (for users either on or off campus) from the CUNY HPC Center gateway machine KARLE. When configured correctly, MATLAB generates and places the batch submit scripts required to run a MATLAB job in the user's working directory on BOB's head node, completes the entire batch submission process, and returns the results to the client. Additional detail on MATLAB client set-up and job submission is provided in a separate MATLAB section below.

Migrate

Migrate estimates population parameters, effective population sizes, and migration rates of n populations, using genetic data. It uses a coalescent theory approach, taking into account the history of mutations and the uncertainty of the genealogy. The estimates of the parameter values are achieved by either a maximum likelihood (ML) approach or Bayesian inference (BI). Migrate's output is presented in a text file and in a PDF file. The PDF file eventually will contain all possible analyses, including histograms of posterior distributions. Currently only the main tables (ML + BI), profile likelihood tables (ML), percentiles tables (ML), and posterior histograms (BI) are supported in the PDF. For more detail on Migrate, please visit the Migrate web site here [42], the manual here [43], and the introductory README files in /share/apps/migrate/default/docs.

The current default version of Migrate installed at the CUNY HPC Center on BOB and ATHENA is version 3.2.17. This version can be run in serial mode, in threaded parallel mode, or in MPI parallel mode. In the directory '/share/apps/migrate/example' you can find some example data sets. We demonstrate the execution of the 'parmfile.testml' example using a PBS batch script suitable to each mode of execution. Two input files from the above directory are required ('infile.msat' and 'parmfile.testml') to complete this simulation, and the '-nomenu' command-line option is required for batch jobs to suppress the normal interactive menu prompt.
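
Both required input files can be copied into your working directory in the usual way (assuming the example directory path given above):

cp /share/apps/migrate/example/infile.msat .
cp /share/apps/migrate/example/parmfile.testml .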

A PBS Pro batch script must be created to run your job. The first script shown initiates a MIGRATE MPI parallel run. It requests 8 processors to complete its work.

#!/bin/bash
#PBS -q production
#PBS -N MIGRATE_mpi
#PBS -l select=8:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

echo ">>>> Begin Migrate MPI Parallel Run ..."
echo ""
mpirun -np 8 -machinefile $PBS_NODEFILE /share/apps/migrate/default/bin/migrate-n-mpi ./parmfile.testml -nomenu
echo ""
echo ">>>> End   Migrate MPI Parallel Run ..."

This script can be dropped into a file (say 'migrate_mpi.job') on either BOB or ATHENA, and run with:

qsub migrate_mpi.job

It should take less than 10 minutes to run and will produce PBS output and error files beginning with the job name 'MIGRATE_mpi', as well as output files specific to MIGRATE. Details on the meaning of the PBS script are covered in the PBS section of this Wiki. The most important lines are '#PBS -l select=8:ncpus=1:mem=1920mb' and '#PBS -l place=free'. The first instructs PBS to select 8 resource 'chunks', each with 1 processor (core) and 1,920 MBs of memory, for the job. The second instructs PBS to place this job wherever the least used resources are to be found (freely). The PBS master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command. As this is a parallel job, other compute nodes will potentially be used as well. See the PBS section for details.

The CUNY HPC Center also provides a serial version of MIGRATE. A PBS batch script for running the serial version of MIGRATE (migrate_serial.job) follows:

#!/bin/bash
#PBS -q production
#PBS -N MIGRATE_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

echo ">>>> Begin Migrate Serial Run ..."
echo ""
/share/apps/migrate/default/bin/migrate-n-serial ./parmfile.testml -nomenu
echo ""
echo ">>>> End   Migrate Serial Run ..."

The only changes appear in the new name for the job (MIGRATE_serial), the '-l select' line, which requests only 1 resource 'chunk' instead of 8, and in the name of the MIGRATE executable used, which is now 'migrate-n-serial' instead of 'migrate-n-mpi'.

The threaded version of the PBS script can be created by making similar substitutions and changing the placement from 'free' to 'pack' (again, see the PBS section for details):

3,5c3,5
< #PBS -N MIGRATE_serial
< #PBS -l select=1:ncpus=1:mem=1920mb
< #PBS -l place=free
---
> #PBS -N MIGRATE_threads
> #PBS -l select=1:ncpus=8:mem=15360mb
> #PBS -l place=pack
15c15
< echo ">>>> Begin Migrate Serial Run ..."
---
> echo ">>>> Begin Migrate Pthreads Parallel Run ..."
17c17
< /share/apps/migrate/default/bin/migrate-n-serial ./parmfile.testml -nomenu
---
> /share/apps/migrate/default/bin/migrate-n-threads ./parmfile.testml -nomenu
19c19
< echo ">>>> End   Migrate Serial Run ..."
---
> echo ">>>> End   Migrate Pthreads Parallel Run ..."

NOTE: HPC Center staff has noticed that the performance of MIGRATE on the 'parmfile.testml' test case used here appears to be slow relative to MIGRATE web site benchmark performance data. The threaded version of the code seems particularly slow. We are investigating this to see if it is a real issue that needs correction or is related to an important difference in the input files (11-4-11). You may wish to inquire about the state of this issue with HPC Center staff before running your MIGRATE jobs.

MRBAYES

MrBayes is a program for the Bayesian estimation of phylogeny. Bayesian inference of phylogeny is based upon a quantity called the posterior probability distribution of trees, which is the probability of a tree conditioned on certain observations. The conditioning is accomplished using Bayes's theorem. The posterior probability distribution of trees is impossible to calculate analytically; instead, MrBayes uses a simulation technique called Markov chain Monte Carlo (or MCMC) to approximate the posterior probabilities of trees.

MrBayes version 3.2.1 is installed on ANDY, ATHENA, and BOB. MrBayes is also part of the Rocks Bio Roll on BOB and ATHENA, which provides a collection of bioinformatics applications in the directory /opt/bio, but this is an older version. It could be run from this location on ZEUS as well. Running MrBayes is a two-step process that first requires the creation of the NEXUS-formatted MrBayes input file and then the PBS Pro script to run it. MrBayes can be run in serial, MPI-parallel, or GPU-accelerated (ANDY only) mode.

Here is a NEXUS input file (primates.nex) which includes both a DATA block and a MRBAYES block. The MRBAYES block simply contains the MrBayes runtime commands, each terminated with a semi-colon. The example below shows 12 mitochondrial DNA sequences of primates and yields at least 1,000 samples from the posterior probability distribution. If you need more detail on generating the NEXUS file or on MrBayes in general, please check the MrBayes Wiki here [44] and the on-line manual.

#NEXUS

begin data;
dimensions ntax=12 nchar=898;
format datatype=dna interleave=no gap=-;
matrix
Tarsius_syrichta	AAGTTTCATTGGAGCCACCACTCTTATAATTGCCCATGGCCTCACCTCCTCCCTATTATTTTGCCTAGCAAATACAAACTACGAACGAGTCCACAGTCGAACAATAGCACTAGCCCGTGGCCTTCAAACCCTATTACCTCTTGCAGCAACATGATGACTCCTCGCCAGCTTAACCAACCTGGCCCTTCCCCCAACAATTAATTTAATCGGTGAACTGTCCGTAATAATAGCAGCATTTTCATGGTCACACCTAACTATTATCTTAGTAGGCCTTAACACCCTTATCACCGCCCTATATTCCCTATATATACTAATCATAACTCAACGAGGAAAATACACATATCATATCAACAATATCATGCCCCCTTTCACCCGAGAAAATACATTAATAATCATACACCTATTTCCCTTAATCCTACTATCTACCAACCCCAAAGTAATTATAGGAACCATGTACTGTAAATATAGTTTAAACAAAACATTAGATTGTGAGTCTAATAATAGAAGCCCAAAGATTTCTTATTTACCAAGAAAGTA-TGCAAGAACTGCTAACTCATGCCTCCATATATAACAATGTGGCTTTCTT-ACTTTTAAAGGATAGAAGTAATCCATCGGTCTTAGGAACCGAAAA-ATTGGTGCAACTCCAAATAAAAGTAATAAATTTATTTTCATCCTCCATTTTACTATCACTTACACTCTTAATTACCCCATTTATTATTACAACAACTAAAAAATATGAAACACATGCATACCCTTACTACGTAAAAAACTCTATCGCCTGCGCATTTATAACAAGCCTAGTCCCAATGCTCATATTTCTATACACAAATCAAGAAATAATCATTTCCAACTGACATTGAATAACGATTCATACTATCAAATTATGCCTAAGCTT
Lemur_catta		AAGCTTCATAGGAGCAACCATTCTAATAATCGCACATGGCCTTACATCATCCATATTATTCTGTCTAGCCAACTCTAACTACGAACGAATCCATAGCCGTACAATACTACTAGCACGAGGGATCCAAACCATTCTCCCTCTTATAGCCACCTGATGACTACTCGCCAGCCTAACTAACCTAGCCCTACCCACCTCTATCAATTTAATTGGCGAACTATTCGTCACTATAGCATCCTTCTCATGATCAAACATTACAATTATCTTAATAGGCTTAAATATGCTCATCACCGCTCTCTATTCCCTCTATATATTAACTACTACACAACGAGGAAAACTCACATATCATTCGCACAACCTAAACCCATCCTTTACACGAGAAAACACCCTTATATCCATACACATACTCCCCCTTCTCCTATTTACCTTAAACCCCAAAATTATTCTAGGACCCACGTACTGTAAATATAGTTTAAA-AAAACACTAGATTGTGAATCCAGAAATAGAAGCTCAAAC-CTTCTTATTTACCGAGAAAGTAATGTATGAACTGCTAACTCTGCACTCCGTATATAAAAATACGGCTATCTCAACTTTTAAAGGATAGAAGTAATCCATTGGCCTTAGGAGCCAAAAA-ATTGGTGCAACTCCAAATAAAAGTAATAAATCTATTATCCTCTTTCACCCTTGTCACACTGATTATCCTAACTTTACCTATCATTATAAACGTTACAAACATATACAAAAACTACCCCTATGCACCATACGTAAAATCTTCTATTGCATGTGCCTTCATCACTAGCCTCATCCCAACTATATTATTTATCTCCTCAGGACAAGAAACAATCATTTCCAACTGACATTGAATAACAATCCAAACCCTAAAACTATCTATTAGCTT
Homo_sapiens		AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCTCATTACTATTCTGCCTAGCAAACTCAAACTACGAACGCACTCACAGTCGCATCATAATCCTCTCTCAAGGACTTCAAACTCTACTCCCACTAATAGCTTTTTGATGACTTCTAGCAAGCCTCGCTAACCTCGCCTTACCCCCCACTATTAACCTACTGGGAGAACTCTCTGTGCTAGTAACCACGTTCTCCTGATCAAATATCACTCTCCTACTTACAGGACTCAACATACTAGTCACAGCCCTATACTCCCTCTACATATTTACCACAACACAATGGGGCTCACTCACCCACCACATTAACAACATAAAACCCTCATTCACACGAGAAAACACCCTCATGTTCATACACCTATCCCCCATTCTCCTCCTATCCCTCAACCCCGACATCATTACCGGGTTTTCCTCTTGTAAATATAGTTTAACCAAAACATCAGATTGTGAATCTGACAACAGAGGCTTA-CGACCCCTTATTTACCGAGAAAGCT-CACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAATAACCATGCACACTACTATAACCACCCTAACCCTGACTTCCCTAATTCCCCCCATCCTTACCACCCTCGTTAACCCTAACAAAAAAAACTCATACCCCCATTATGTAAAATCCATTGTCGCATCCACCTTTATTATCAGTCTCTTCCCCACAACAATATTCATGTGCCTAGACCAAGAAGTTATTATCTCGAACTGACACTGAGCCACAACCCAAACAACCCAGCTCTCCCTAAGCTT
Pan	  		AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCTCATTATTATTCTGCCTAGCAAACTCAAATTATGAACGCACCCACAGTCGCATCATAATTCTCTCCCAAGGACTTCAAACTCTACTCCCACTAATAGCCTTTTGATGACTCCTAGCAAGCCTCGCTAACCTCGCCCTACCCCCTACCATTAATCTCCTAGGGGAACTCTCCGTGCTAGTAACCTCATTCTCCTGATCAAATACCACTCTCCTACTCACAGGATTCAACATACTAATCACAGCCCTGTACTCCCTCTACATGTTTACCACAACACAATGAGGCTCACTCACCCACCACATTAATAACATAAAGCCCTCATTCACACGAGAAAATACTCTCATATTTTTACACCTATCCCCCATCCTCCTTCTATCCCTCAATCCTGATATCATCACTGGATTCACCTCCTGTAAATATAGTTTAACCAAAACATCAGATTGTGAATCTGACAACAGAGGCTCA-CGACCCCTTATTTACCGAGAAAGCT-TATAAGAACTGCTAATTCATATCCCCATGCCTGACAACATGGCTTTCTCAACTTTTAAAGGATAACAGCCATCCGTTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAATAACCATGTATACTACCATAACCACCTTAACCCTAACTCCCTTAATTCTCCCCATCCTCACCACCCTCATTAACCCTAACAAAAAAAACTCATATCCCCATTATGTGAAATCCATTATCGCGTCCACCTTTATCATTAGCCTTTTCCCCACAACAATATTCATATGCCTAGACCAAGAAGCTATTATCTCAAACTGGCACTGAGCAACAACCCAAACAACCCAGCTCTCCCTAAGCTT
Gorilla   		AAGCTTCACCGGCGCAGTTGTTCTTATAATTGCCCACGGACTTACATCATCATTATTATTCTGCCTAGCAAACTCAAACTACGAACGAACCCACAGCCGCATCATAATTCTCTCTCAAGGACTCCAAACCCTACTCCCACTAATAGCCCTTTGATGACTTCTGGCAAGCCTCGCCAACCTCGCCTTACCCCCCACCATTAACCTACTAGGAGAGCTCTCCGTACTAGTAACCACATTCTCCTGATCAAACACCACCCTTTTACTTACAGGATCTAACATACTAATTACAGCCCTGTACTCCCTTTATATATTTACCACAACACAATGAGGCCCACTCACACACCACATCACCAACATAAAACCCTCATTTACACGAGAAAACATCCTCATATTCATGCACCTATCCCCCATCCTCCTCCTATCCCTCAACCCCGATATTATCACCGGGTTCACCTCCTGTAAATATAGTTTAACCAAAACATCAGATTGTGAATCTGATAACAGAGGCTCA-CAACCCCTTATTTACCGAGAAAGCT-CGTAAGAGCTGCTAACTCATACCCCCGTGCTTGACAACATGGCTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGACCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAATAACTATGTACGCTACCATAACCACCTTAGCCCTAACTTCCTTAATTCCCCCTATCCTTACCACCTTCATCAATCCTAACAAAAAAAGCTCATACCCCCATTACGTAAAATCTATCGTCGCATCCACCTTTATCATCAGCCTCTTCCCCACAACAATATTTCTATGCCTAGACCAAGAAGCTATTATCTCAAGCTGACACTGAGCAACAACCCAAACAATTCAACTCTCCCTAAGCTT
Pongo     		AAGCTTCACCGGCGCAACCACCCTCATGATTGCCCATGGACTCACATCCTCCCTACTGTTCTGCCTAGCAAACTCAAACTACGAACGAACCCACAGCCGCATCATAATCCTCTCTCAAGGCCTTCAAACTCTACTCCCCCTAATAGCCCTCTGATGACTTCTAGCAAGCCTCACTAACCTTGCCCTACCACCCACCATCAACCTTCTAGGAGAACTCTCCGTACTAATAGCCATATTCTCTTGATCTAACATCACCATCCTACTAACAGGACTCAACATACTAATCACAACCCTATACTCTCTCTATATATTCACCACAACACAACGAGGTACACCCACACACCACATCAACAACATAAAACCTTCTTTCACACGCGAAAATACCCTCATGCTCATACACCTATCCCCCATCCTCCTCTTATCCCTCAACCCCAGCATCATCGCTGGGTTCGCCTACTGTAAATATAGTTTAACCAAAACATTAGATTGTGAATCTAATAATAGGGCCCCA-CAACCCCTTATTTACCGAGAAAGCT-CACAAGAACTGCTAACTCTCACT-CCATGTGTGACAACATGGCTTTCTCAGCTTTTAAAGGATAACAGCTATCCCTTGGTCTTAGGATCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAACAGCCATGTTTACCACCATAACTGCCCTCACCTTAACTTCCCTAATCCCCCCCATTACCGCTACCCTCATTAACCCCAACAAAAAAAACCCATACCCCCACTATGTAAAAACGGCCATCGCATCCGCCTTTACTATCAGCCTTATCCCAACAACAATATTTATCTGCCTAGGACAAGAAACCATCGTCACAAACTGATGCTGAACAACCACCCAGACACTACAACTCTCACTAAGCTT
Hylobates 		AAGCTTTACAGGTGCAACCGTCCTCATAATCGCCCACGGACTAACCTCTTCCCTGCTATTCTGCCTTGCAAACTCAAACTACGAACGAACTCACAGCCGCATCATAATCCTATCTCGAGGGCTCCAAGCCTTACTCCCACTGATAGCCTTCTGATGACTCGCAGCAAGCCTCGCTAACCTCGCCCTACCCCCCACTATTAACCTCCTAGGTGAACTCTTCGTACTAATGGCCTCCTTCTCCTGGGCAAACACTACTATTACACTCACCGGGCTCAACGTACTAATCACGGCCCTATACTCCCTTTACATATTTATCATAACACAACGAGGCACACTTACACACCACATTAAAAACATAAAACCCTCACTCACACGAGAAAACATATTAATACTTATGCACCTCTTCCCCCTCCTCCTCCTAACCCTCAACCCTAACATCATTACTGGCTTTACTCCCTGTAAACATAGTTTAATCAAAACATTAGATTGTGAATCTAACAATAGAGGCTCG-AAACCTCTTGCTTACCGAGAAAGCC-CACAAGAACTGCTAACTCACTATCCCATGTATGACAACATGGCTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGACCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAATAGCAATGTACACCACCATAGCCATTCTAACGCTAACCTCCCTAATTCCCCCCATTACAGCCACCCTTATTAACCCCAATAAAAAGAACTTATACCCGCACTACGTAAAAATGACCATTGCCTCTACCTTTATAATCAGCCTATTTCCCACAATAATATTCATGTGCACAGACCAAGAAACCATTATTTCAAACTGACACTGAACTGCAACCCAAACGCTAGAACTCTCCCTAAGCTT
Macaca_fuscata		AAGCTTTTCCGGCGCAACCATCCTTATGATCGCTCACGGACTCACCTCTTCCATATATTTCTGCCTAGCCAATTCAAACTATGAACGCACTCACAACCGTACCATACTACTGTCCCGAGGACTTCAAATCCTACTTCCACTAACAGCCTTTTGATGATTAACAGCAAGCCTTACTAACCTTGCCCTACCCCCCACTATCAATCTACTAGGTGAACTCTTTGTAATCGCAACCTCATTCTCCTGATCCCATATCACCATTATGCTAACAGGACTTAACATATTAATTACGGCCCTCTACTCTCTCCACATATTCACTACAACACAACGAGGAACACTCACACATCACATAATCAACATAAAGCCCCCCTTCACACGAGAAAACACATTAATATTCATACACCTCGCTCCAATTATCCTTCTATCCCTCAACCCCAACATCATCCTGGGGTTTACCTCCTGTAGATATAGTTTAACTAAAACACTAGATTGTGAATCTAACCATAGAGACTCA-CCACCTCTTATTTACCGAGAAAACT-CGCAAGGACTGCTAACCCATGTACCCGTACCTAAAATTACGGTTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGACCTTAGGAGTCAAAAACATTGGTGCAACTCCAAATAAAAGTAATAATCATGCACACCCCCATCATTATAACAACCCTTATCTCCCTAACTCTCCCAATTTTTGCCACCCTCATCAACCCTTACAAAAAACGTCCATACCCAGATTACGTAAAAACAACCGTAATATATGCTTTCATCATCAGCCTCCCCTCAACAACTTTATTCATCTTCTCAAACCAAGAAACAACCATTTGGAGCTGACATTGAATAATGACCCAAACACTAGACCTAACGCTAAGCTT
M_mulatta		AAGCTTTTCTGGCGCAACCATCCTCATGATTGCTCACGGACTCACCTCTTCCATATATTTCTGCCTAGCCAATTCAAACTATGAACGCACTCACAACCGTACCATACTACTGTCCCGGGGACTTCAAATCCTACTTCCACTAACAGCTTTCTGATGATTAACAGCAAGCCTTACTAACCTTGCCCTACCCCCCACTATCAACCTACTAGGTGAACTCTTTGTAATCGCGACCTCATTCTCCTGGTCCCATATCACCATTATATTAACAGGATTTAACATACTAATTACGGCCCTCTACTCCCTCCACATATTCACCACAACACAACGAGGAGCACTCACACATCACATAATCAACATAAAACCCCCCTTCACACGAGAAAACATATTAATATTCATACACCTCGCTCCAATCATCCTCCTATCTCTCAACCCCAACATCATCCTGGGGTTTACTTCCTGTAGATATAGTTTAACTAAAACATTAGATTGTGAATCTAACCATAGAGACTTA-CCACCTCTTATTTACCGAGAAAACT-CGCGAGGACTGCTAACCCATGTATCCGTACCTAAAATTACGGTTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGACCTTAGGAGTCAAAAATATTGGTGCAACTCCAAATAAAAGTAATAATCATGCACACCCCTATCATAATAACAACCCTTATCTCCCTAACTCTCCCAATTTTTGCCACCCTCATCAACCCTTACAAAAAACGTCCATACCCAGATTACGTAAAAACAACCGTAATATATGCTTTCATCATCAGCCTCCCCTCAACAACTTTATTCATCTTCTCAAACCAAGAAACAACCATTTGAAGCTGACATTGAATAATAACCCAAACACTAGACCTAACACTAAGCTT
M_fascicularis		AAGCTTCTCCGGCGCAACCACCCTTATAATCGCCCACGGGCTCACCTCTTCCATGTATTTCTGCTTGGCCAATTCAAACTATGAGCGCACTCATAACCGTACCATACTACTATCCCGAGGACTTCAAATTCTACTTCCATTGACAGCCTTCTGATGACTCACAGCAAGCCTTACTAACCTTGCCCTACCCCCCACTATTAATCTACTAGGCGAACTCTTTGTAATCACAACTTCATTTTCCTGATCCCATATCACCATTGTGTTAACGGGCCTTAATATACTAATCACAGCCCTCTACTCTCTCCACATGTTCATTACAGTACAACGAGGAACACTCACACACCACATAATCAATATAAAACCCCCCTTCACACGAGAAAACATATTAATATTCATACACCTCGCTCCAATTATCCTTCTATCTCTCAACCCCAACATCATCCTGGGGTTTACCTCCTGTAAATATAGTTTAACTAAAACATTAGATTGTGAATCTAACTATAGAGGCCTA-CCACTTCTTATTTACCGAGAAAACT-CGCAAGGACTGCTAATCCATGCCTCCGTACTTAAAACTACGGTTTCCTCAACTTTTAAAGGATAACAGCTATCCATTGACCTTAGGAGTCAAAAACATTGGTGCAACTCCAAATAAAAGTAATAATCATGCACACCCCCATCATAATAACAACCCTCATCTCCCTGACCCTTCCAATTTTTGCCACCCTCACCAACCCCTATAAAAAACGTTCATACCCAGACTACGTAAAAACAACCGTAATATATGCTTTTATTACCAGTCTCCCCTCAACAACCCTATTCATCCTCTCAAACCAAGAAACAACCATTTGGAGTTGACATTGAATAACAACCCAAACATTAGACCTAACACTAAGCTT
M_sylvanus		AAGCTTCTCCGGTGCAACTATCCTTATAGTTGCCCATGGACTCACCTCTTCCATATACTTCTGCTTGGCCAACTCAAACTACGAACGCACCCACAGCCGCATCATACTACTATCCCGAGGACTCCAAATCCTACTCCCACTAACAGCCTTCTGATGATTCACAGCAAGCCTTACTAATCTTGCTCTACCCTCCACTATTAATCTACTGGGCGAACTCTTCGTAATCGCAACCTCATTTTCCTGATCCCACATCACCATCATACTAACAGGACTGAACATACTAATTACAGCCCTCTACTCTCTTCACATATTCACCACAACACAACGAGGAGCGCTCACACACCACATAATTAACATAAAACCACCTTTCACACGAGAAAACATATTAATACTCATACACCTCGCTCCAATTATTCTTCTATCTCTTAACCCCAACATCATTCTAGGATTTACTTCCTGTAAATATAGTTTAATTAAAACATTAGACTGTGAATCTAACTATAGAAGCTTA-CCACTTCTTATTTACCGAGAAAACT-TGCAAGGACCGCTAATCCACACCTCCGTACTTAAAACTACGGTTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGGCCTTAGGAGTCAAAAATATTGGTGCAACTCCAAATAAAAGTAATAATCATGTATACCCCCATCATAATAACAACTCTCATCTCCCTAACTCTTCCAATTTTCGCTACCCTTATCAACCCCAACAAAAAACACCTATATCCAAACTACGTAAAAACAGCCGTAATATATGCTTTCATTACCAGCCTCTCTTCAACAACTTTATATATATTCTTAAACCAAGAAACAATCATCTGAAGCTGGCACTGAATAATAACCCAAACACTAAGCCTAACATTAAGCTT
Saimiri_sciureus	AAGCTTCACCGGCGCAATGATCCTAATAATCGCTCACGGGTTTACTTCGTCTATGCTATTCTGCCTAGCAAACTCAAATTACGAACGAATTCACAGCCGAACAATAACATTTACTCGAGGGCTCCAAACACTATTCCCGCTTATAGGCCTCTGATGACTCCTAGCAAATCTCGCTAACCTCGCCCTACCCACAGCTATTAATCTAGTAGGAGAATTACTCACAATCGTATCTTCCTTCTCTTGATCCAACTTTACTATTATATTCACAGGACTTAATATACTAATTACAGCACTCTACTCACTTCATATGTATGCCTCTACACAGCGAGGTCCACTTACATACAGCACCAGCAATATAAAACCAATATTTACACGAGAAAATACGCTAATATTTATACATATAACACCAATCCTCCTCCTTACCTTGAGCCCCAAGGTAATTATAGGACCCTCACCTTGTAATTATAGTTTAGCTAAAACATTAGATTGTGAATCTAATAATAGAAGAATA-TAACTTCTTAATTACCGAGAAAGTG-CGCAAGAACTGCTAATTCATGCTCCCAAGACTAACAACTTGGCTTCCTCAACTTTTAAAGGATAGTAGTTATCCATTGGTCTTAGGAGCCAAAAACATTGGTGCAACTCCAAATAAAAGTAATA---ATACACTTCTCCATCACTCTAATAACACTAATTAGCCTACTAGCGCCAATCCTAGCTACCCTCATTAACCCTAACAAAAGCACACTATACCCGTACTACGTAAAACTAGCCATCATCTACGCCCTCATTACCAGTACCTTATCTATAATATTCTTTATCCTTACAGGCCAAGAATCAATAATTTCAAACTGACACTGAATAACTATCCAAACCATCAAACTATCCCTAAGCTT
;
end;

begin mrbayes; 
    set autoclose=yes nowarn=yes; 
    lset nst=6 rates=gamma; 
    mcmc nruns=1 ngen=10000 samplefreq=10; 
end;

A PBS Pro batch script must be created to run your job. The first script below shows an MPI parallel run of the above '.nex' input file. This script selects 4 processors (cores) and allows PBS to put them on any compute node. Note that when running any parallel program, one must be cognizant of the scaling properties of its parallel algorithm; in other words, how much a given job's running time drops as one doubles the number of processors used. All parallel programs arrive at a point of diminishing returns that depends on the algorithm, the size of the problem being solved, and the performance features of the system on which they are run. We might have chosen to run this job on 8, 16, or 32 processors (cores), but would only do so if the improvement in performance scaled. An improvement of less than about 25% after a doubling indicates that a reasonable maximum number of processors has been reached for that particular set of circumstances (for example, if a run takes 100 minutes on 8 cores and 85 minutes on 16, the 16-core count is at or past that point).

Here is the 4 processor MPI parallel PBS batch script:

#!/bin/bash
#PBS -q production
#PBS -N MRBAYES_mpi
#PBS -l select=4:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel executable to run
echo ">>>> Begin MRBAYES MPI Run ..."
echo ""
mpirun -np 4 -machinefile $PBS_NODEFILE /share/apps/mrbayes/default/bin/mb ./primates.nex
echo ""
echo ">>>> End   MRBAYES MPI Run ..."

This script can be dropped into a file (say 'mrbayes_mpi.job') on BOB, ATHENA, or ANDY and run with:

qsub mrbayes_mpi.job

This test case should take no more than a couple of minutes to run and will produce PBS output and error files beginning with the job name 'MRBAYES_mpi'. Other MrBayes-specific outputs will also be produced. Details on the meaning of the PBS script are covered above in this Wiki's PBS section. The most important lines are '#PBS -l select=4:ncpus=1:mem=1920mb' and '#PBS -l place=free'. The first instructs PBS to select 4 resource 'chunks', each with 1 processor (core) and 1,920 MBs of memory, for the job (on ANDY as much as 2,880 MBs might have been selected). The second line instructs PBS to place this job wherever the least used resources are found (i.e. freely). The master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command. As this is a parallel job, other compute nodes may also be called into service to complete this job.

The CUNY HPC Center also provides a serial version of MrBayes. A PBS batch script for running the serial version is easy to prepare from the above by making a few changes. Here is a listing of the differences between the above MPI script and the serial script:

3,4c3,4
< #PBS -N MRBAYES_mpi
< #PBS -l select=4:ncpus=1:mem=1920mb
---
> #PBS -N MRBAYES_serial
> #PBS -l select=1:ncpus=1:mem=1920mb
16c16
< echo ">>>> Begin MRBAYES MPI Run ..."
---
> echo ">>>> Begin MRBAYES Serial Run ..."
18c18
< mpirun -np 4 -machinefile $PBS_NODEFILE /share/apps/mrbayes/default/bin/mb ./primates.nex
---
> /share/apps/mrbayes/default/bin/mb-serial ./primates.nex
20c20
< echo ">>>> End   MRBAYES MPI Run ..."
---
> echo ">>>> End   MRBAYES Serial Run ..."

Finally, it is possible to run MrBayes in GPU-accelerated mode on ANDY. This is an experimental version of the code and users are cautioned to check their results and note their performance to be sure they are getting accurate answers in shorter time periods. Nothing is worse in HPC than going in the wrong direction, more slowly (this principle applies to NYC Taxi rides as well). Here is yet another script that will run the GPU-accelerated version of MrBayes (again, on ANDY only).

#!/bin/bash
#PBS -q production_gpu
#PBS -N MRBAYES_gpu
#PBS -l select=1:ncpus=1:ngpus=1:mem=2880mb:accel=fermi
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the GPU parallel executable to run
echo ">>>> Begin MRBAYES GPU Run ..."
echo ""
/share/apps/mrbayes/default/bin/mb-gpu ./primates.nex
echo ""
echo ">>>> End   MRBAYES GPU Run ..."

There are several differences worth pointing out. First, this job is submitted to the 'production_gpu' queue, which ensures that PBS selects those ANDY compute nodes with attached GPUs (i.e. the compute nodes whose names begin with 'gpute-'). Second, the resource request line ('-l select') requests more than just a processor and some memory: it also requests a GPU (ngpus=1) and a particular flavor of GPU (accel=fermi). The NVIDIA Fermi GPUs on ANDY (96 in all) each have 448 processors, so in requesting 1 GPU here we are getting 448 processors assigned to our task. Individually, GPU processors are less powerful than CPU cores, but in such numbers (if they can be used in parallel) they can deliver significant performance improvements.

The last difference worth noting is the name of the executable, 'mb-gpu', which selects the GPU-accelerated version of the code.

MSMS

MSMS is a tool for generating sequence samples under both neutral models and single-locus selection models. MSMS permits the full range of demographic models provided by its relative MS (Hudson, 2002). In particular, it allows for multiple demes with arbitrary migration patterns, population growth and decay in each deme, and population splits and mergers. Selection (including dominance) can depend on the deme and can also change with time. The program is designed to be command-line compatible with MS; however, no prior knowledge of MS is assumed in this document.

Applications of MSMS include power studies, analytical comparisons, and approximate Bayesian computation, among many others. Because most applications require the generation of a large number of independent replicates, the code is designed to be efficient and fast. For the neutral case, it is comparable to MS and even faster for large recombination rates. With selection, the performance is only slightly slower, making this one of the fastest tools for simulation with selection. MSMS was developed in Java and can run on any hardware that supports Java 1.6.

MSMS version 1.3 has been installed at the CUNY HPC Center on BOB and ATHENA. It can be run serially (1 core) or in multi-threaded parallel mode on a multi-core compute node (e.g. 8 cores on BOB and 4 on ATHENA). MSMS is a command-line-only program; there is no GUI, and you cannot use a mouse to set up simulations. The command line may look intimidating, but in reality it is quite easy to build up very complicated models if need be; the trick is to build the model up one step at a time. MSMS generates sample sequence output and, as such, does not generally require input files. All of MSMS's command-line options are summarized here [45], and a more complete user manual can be found here [46].

Here is a PBS batch script that will start a serial MSMS job on one processor (core) of a single BOB or ATHENA compute node:

#!/bin/bash
#PBS -q production
#PBS -N MSMS_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin MSMS Serial Run ..."
echo ""
/share/apps/msms/default/bin/msms -N 10000 -ms 10 1000 -t 1
echo ""
echo ">>>> End   MSMS Serial Run ..."

This PBS batch script can be dropped into a file (say 'msms_serial.job') and started with the command:

qsub msms_serial.job

This test case should take no more than a minute to run and will produce PBS output and error files beginning with the job name 'MSMS_serial'. The MSMS-specific output will be written to the PBS output file. Details on the meaning of the PBS script are covered above in this Wiki's PBS section. The most important lines are '#PBS -l select=1:ncpus=1:mem=1920mb' and '#PBS -l place=free'. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory. The second line instructs PBS to place this job wherever the least used resources are to be found (i.e. freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The MSMS command itself considers a single diploid population:

msms -N 10000 -ms 10 1000 -t 1

This command tells msms to use an effective population size of 10,000 with the -N option. This option is unique to msms and is important even when not considering selection; generally, it's important to use a large number. While selection is not included in this parameter, it does not affect run times in any way. The '-ms 10 1000' option is the same as the first two options to MS: the first number is the number of samples, the second the number of replicates. After this option, all the normal options of MS can be used and have the same meanings as in MS. The last option, '-t 1', specifies the theta parameter. We have assumed a diploid population, so theta is (4 * N * mutation rate); for example, with -N 10000, a theta of 1 corresponds to a per-locus mutation rate of 1/(4 x 10000) = 2.5e-5 per generation. All parameters are scaled with N in some way.

To run the same simulation in 4-way thread parallel mode, a few minor changes to the serial script above are required:

3,5c3,5
< #PBS -N MSMS_serial
< #PBS -l select=1:ncpus=1:mem=1920mb
< #PBS -l place=free
---
> #PBS -N MSMS_threads
> #PBS -l select=1:ncpus=4:mem=7680mb
> #PBS -l place=pack
18c18
< /share/apps/msms/default/bin/msms -N 10000 -ms 10 1000 -t 1
---
> /share/apps/msms/default/bin/msms -N 10000 -ms 10 1000 -t 1 -threads 4

The '-l select' line requests a PBS 'chunk' of 4 cores and 4 times as much memory. In the threaded job, we ask PBS to 'pack' the 4 cores on the same node. Thread-based parallel programs can make use only of processors (cores) on the same physical node. On BOB, there are 8 cores per compute node. On ATHENA, there are 4 cores per compute node. NOTE: Before running in thread-parallel mode generally, please compare the total run time and results between a serial run and the identical thread-parallel run to make sure that the run time is less and the results are the same.

NAMD

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of processors on high-end parallel platforms and tens of processors on commodity clusters using gigabit ethernet or InfiniBand. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR. NAMD is distributed free of charge with source code. For more detailed information please go to the NAMD website here [47].

NAMD version 2.7b2 is installed on ANDY and BOB for use with their InfiniBand interconnects and on ATHENA and ZEUS for use with their Gigabit Ethernet interconnects. In addition, ZEUS and ANDY have versions of NAMD installed that will compute the non-bonded molecular interactions on those systems' GPUs. A recent short simulation (500 time steps) of the HIV virus showed that a 4-CPU-only run was three times slower (5364 seconds) than the 4-CPU-plus-GPU run with the non-bonded forces computed on the GPU (1737 seconds). We encourage users to explore the GPU version of NAMD after making some comparison runs to be sure the results are equivalent.

A PBS Pro submit script for NAMD that runs the CPU-only version using 'mpirun' on 16 processors, 4 to a compute node, follows. (Note: The script is revised and simplified from earlier versions and no longer uses or requires the 'charmrun' wrapper. Please convert all your NAMD scripts to use 'mpirun' instead of 'charmrun')


#!/bin/bash
#PBS -q production
#PBS -N namd.test
#PBS -l select=4:ncpus=4:mpiprocs=4
#PBS -l place=free
#PBS -V

# PBS Pro requires that you explicitly change to your working directory

cd $HOME/namd

mpirun -np 16  -machinefile $PBS_NODEFILE /share/apps/namd/default/Linux-x86_64-icc/namd2  ./alanin.conf > alanin.out

To run a similar job, but one that uses 4 CPUs for the bonded interactions and an additional 4 GPUs for the non-bonded interactions, the following script could be used:


#!/bin/bash
#PBS -q production_gpu
#PBS -N namd_cuda.test
#PBS -l select=2:ncpus=2:ngpus=2:accel=fermi:mpiprocs=2
#PBS -l place=free
#PBS -V

# PBS Pro requires that you explicitly change to your working directory

cd $HOME/namd

mpirun -np 4  -machinefile $PBS_NODEFILE /share/apps/namd/default/Linux-x86_64-CUDA/namd2  +idlepoll +devices 0,1,0,1   ./hivrt.conf > hivrt.out

Network Simulator-2 (NS2)

NS2 is a discrete event simulator targeted at networking research. NS2 provides substantial support for simulation of TCP, routing, and multicast protocols over wired and wireless (local and satellite) networks. Versions 2.31 and 2.33 are installed on ATHENA and BOB at the CUNY HPC Center. For more detailed information, look here.

Running NS2 is a four-step process: prepare a Tcl script, create a PBS batch script, submit the job, and graph the results.

First, prepare a Tcl script for NS2 like the example (ex.tcl) shown below. This example has 2 nodes with 1 link and uses a UDP agent with the CBR traffic generator.

set ns [new Simulator]
set tr [open trace.out w]
$ns trace-all $tr

proc finish {} {
        global ns tr
        $ns flush-trace
        close $tr
        exit 0
}

set n0 [$ns node]
set n1 [$ns node]

$ns duplex-link $n0 $n1 1Mb 10ms DropTail

set udp0 [new Agent/UDP]
$ns attach-agent $n0 $udp0
set cbr0 [new Application/Traffic/CBR]
$cbr0 set packetSize_ 500
$cbr0 set interval_ 0.005
$cbr0 attach-agent $udp0
set null0 [new Agent/Null]
$ns attach-agent $n1 $null0
$ns connect $udp0 $null0  

$ns at 0.5 "$cbr0 start"
$ns at 4.5 "$cbr0 stop"
$ns at 5.0 "finish"

$ns run

Next, create a PBS batch submit script like the one shown here:

#!/bin/bash
#PBS -q production
#PBS -N NS2-job
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# You must explictly change to your working
# directory in PBS

cd $HOME/my_NS2_wrk

/share/apps/ns2/ns-allinone-2.31/ns-2.31/ns ./ex.tcl

Submit the job (assuming the script above was saved in a file named 'submit') with:

qsub submit

Finally, graph the result. At the HPC Center, 'nam' files can be produced but cannot be viewed there, because they require a graphical environment. Trace Graph is a free network trace file analyzer developed for NS2 trace processing. Trace Graph can support any trace format if it is converted to its own or the NS2 trace format. Supported NS2 trace file formats include:

wired, satellite, wireless, new trace, wired-wireless.

For more information on Trace Graph look here [48].

Users graphing results with Trace Graph must use Linux with the X Window System and perform the following steps:

1. SSH to BOB with X11 forwarding.

ssh -X  your.name@bob.csi.cuny.edu

2. Start tracegraph by typing the command "trgraph".

trgraph

NWChem

NWChem is an ab initio computational chemistry software package which also includes molecular dynamics (MM, MD) and coupled, quantum mechanical and molecular dynamics functionality (QM-MD). It was designed to run on high-performance parallel supercomputers as well as conventional workstation clusters. It aims to be scalable both in its ability to treat large problems efficiently, and in its usage of available parallel computing resources, both processors and memory, whether local or distributed. NWChem has been developed by the Molecular Sciences Software group of the Theory, Modeling & Simulation program of the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). Most of the implementation has been funded by the EMSL Construction Project. The CUNY HPC Center is currently running NWChem 6.0 built using its InfiniBand communications interface. It is installed on both ANDY and BOB.

A sample NWChem input file which does a Hartree-Fock energy calculation on water with the 6-31g* basis set is shown here:

echo
start water2
title "an example simple water calculation"

# The memory options are system specific

memory total 2880 mb global 2160 mb

geometry units au
 O 0       0              0
 H 0       1.430    -1.107
 H 0     -1.430    -1.107
end

basis
  O library 6-31g*
  H library 6-31g*
end

task scf gradient

For details on the content and structure of each section of the NWChem input deck, users should consult the NWChem Users Manual at http://www.emsl.pnl.gov/capabilities/computing/nwchem/docs/usermanual.pdf. The memory directive above deserves comment because it affects job execution and performance. In this example, the standard (and maximum) per-core quantity of memory available on ANDY (2880 mb) has been requested, and a portion of it (75%) has been partitioned to be used as part of the NWChem Global Arrays parallel computing model for parallel computation and communication. These settings provide NWChem with the maximum amount of scratch space for in-memory work and should deliver the best performance on ANDY. The corresponding memory settings for jobs submitted on BOB would be 'memory total 1920 mb global 1440 mb'.


A PBS batch submit script to run this job on 16 processors (cores) is shown here:

#!/bin/csh
#PBS -q production_qdr
#PBS -N water2_631g
# These statements select 16 chunks of 1 core and
# 2880 mb of memory each, and instruct PBS to freely
# place each chunk on the least loaded nodes.
#PBS -l select=16:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

echo "This job's process 0 host is: " `hostname`; echo ""

# Must explicitly change to your working directory under PBS

cd $PBS_O_WORKDIR

# Set up NWCHEM environment, permanent, and scratch directory

setenv NWCHEM_ROOT /share/apps/nwchem/default/
setenv NWCHEM_BASIS_LIBRARY ${NWCHEM_ROOT}/data/libraries/

setenv PERMANENT_DIR $PBS_O_WORKDIR

setenv MY_SCRDIR `whoami;date '+%m.%d.%y_%H:%M:%S'`
setenv MY_SCRDIR `echo $MY_SCRDIR | sed -e 's; ;_;'`

setenv SCRATCH_DIR /home/nwchem/nw5.1_scr/${MY_SCRDIR}_$$
mkdir -p $SCRATCH_DIR

echo "The scratch directory for this run is: $SCRATCH_DIR"; echo ""

# Name and insert scratch directory into input file

# Start NWCHEM job

mpirun -np 16 -machinefile $PBS_NODEFILE ${NWCHEM_ROOT}/bin/nwchem ./water2_631g.nw > water2_631g.out

# Clean up scratch files by default

/bin/rm -r $SCRATCH_DIR

echo 'Job is done!'

The memory selected on the '-l select' line above is sized for ANDY's maximum memory per core. BOB would use 1920mb instead of 2880mb. The rest of the script describes its action in comments. Larger jobs on 32 or more processors can be run. Please consult the sections on the PBS Pro Batch scheduling system for information on how to modify this sample deck for different processor counts.
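
As a sketch, a 32-processor version of the same job would change only the resource request and the 'mpirun' process count (the per-core memory request stays the same on ANDY):

#PBS -l select=32:ncpus=1:mem=2880mb

mpirun -np 32 -machinefile $PBS_NODEFILE ${NWCHEM_ROOT}/bin/nwchem ./water2_631g.nw > water2_631g.out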

Users should become aware of the scaling properties of their work by taking note of the run times at various processor counts. When doubling the processor count improves SCF cycle time by only a modest percentage, further increases in processor count should be avoided. The ANDY system has two distinct interconnects: a DDR InfiniBand network that delivers 20 Gbits per second and a QDR InfiniBand network that delivers 40 Gbits per second. Either will serve NWChem users well, but the QDR network should provide somewhat better scaling. In the example above, the 'production_qdr' queue has been requested, but by dropping the terminating '_qdr' one can select the DDR interconnect. This might be a better choice if the QDR side of the ANDY system is crowded with jobs. On BOB, only an SDR InfiniBand network is available, which runs at 10 Gbits per second.

To get their NWChem jobs to run, each user will need to create, in their $HOME directory, a ".nwchemrc" file that is a copy of (or a symbolic link to) the site-specific "default.nwchemrc" file located in:

/share/apps/nwchem/default/data/

An example of creating such a symbolic link is as follows:

andy$ ln -s /share/apps/nwchem/default/data/default.nwchemrc $HOME/.nwchemrc

(Note: When running NWCHEM on systems with IB interconnects, packing multiple cores on a single node can cause problems for the ARMCI communications conduit. That is why in the example above the resource chunks have just 1 core in them.)

Octopus

Octopus is a pseudopotential real-space package aimed at the simulation of the electron-ion dynamics of one-, two-, and three-dimensional finite systems subject to time-dependent electromagnetic fields. The program is based on time-dependent density-functional theory (TDDFT) in the Kohn-Sham scheme. All quantities are expanded in a regular mesh in real space, and the simulations are performed in real time. The program has been successfully used to calculate linear and non-linear absorption spectra, harmonic spectra, laser induced fragmentation, etc. of a variety of systems. Complete information about the octopus package can be found at its homepage, http://www.tddft.org/programs/octopus. The on-line user manual is available at http://www.tddft.org/programs/octopus/wiki/index.php/Manual.

The MPI-parallel version of Octopus 4.0.0 has been installed on ANDY (the older 3.2.0 release is also installed) along with all of its associated libraries (metis, netcdf, sparsekit, etsfio, etc.). It was built with an Intel-compiled version of OpenMPI 1.5.1 and has passed all of its internal test cases.

A sample Octopus input file (which is required to have the name 'inp') is provided here:

# Sample data file:
#
# This is a simple data file. It will complete a gas phase ground-state
# calculation for the sodium dimer. Consult the manual for a brief
# explanation of each section and the variables.
#

CalculationMode = gs
Units = ev_angstrom
FromScratch = yes

# Explicitly set the parallelization strategy
# to the default (domain decomposition).
ParallelizationStrategy = par_domains

%Coordinates
"Na" | 0.0 | 0.0 |  1.7 | yes
"Na" | 0.0 | 0.0 | -1.7 | yes
%

BoxShape = sphere
Radius  = 8.0
Spacing = 0.3

XCFunctional = lda_x + lda_c_pz

MaximumIter = 200
ConvAbsDens = 1e-6

LCAOStart = lcao_states

EigenSolver = cg
EigenSolverInitTolerance = 1e-2
EigenSolverFinalTolerance = 1e-6
EigenSolverFinalToleranceIteration = 6
EigenSolverMaxIter = 25

TypeOfMixing = broyden

Octopus offers its users two distinct and combinable strategies for parallelizing its runs. The first and default is to parallelize by domain decomposition of the mesh (METIS is used). In the input deck above, this method is chosen explicitly (ParallelizationStrategy = par_domains). The second is to compute the entire domain on each processor, but to do so for some number of distinct states (ParallelizationStrategy = par_states). Users wishing to control the details of Octopus when run in parallel are advised to consult the advanced options section of the manual at http://www.tddft.org/programs/octopus/wiki/index.php/Manual:Advanced_ways_of_running_Octopus.

A sample PBS Pro batch job submission script is shown here:

#!/bin/csh
#PBS -q production_qdr
#PBS -N sodium_gstate
# The next statements select 8 chunks of 1 core and
# 2880mb of memory each (the pro-rated limit per
# core on ANDY), and allow PBS to freely place 
# those resource chunks on the least loaded nodes.
#PBS -l select=8:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

echo "This job's process 0 host is: " `hostname`; echo ""

# Must explicitly change to your working directory under PBS

cd $PBS_O_WORKDIR

# Set up OCTOPUS environment, working, and temporary directory

setenv OCTOPUS_ROOT /share/apps/octopus/default

setenv OCT_WorkDir \'$PBS_O_WORKDIR\'

setenv MY_SCRDIR `whoami;date '+%m.%d.%y_%H:%M:%S'`
setenv MY_SCRDIR `echo $MY_SCRDIR | sed -e 's; ;_;'`

setenv SCRATCH_DIR  /home/octopus/oct3.2_scr/${MY_SCRDIR}_$$
mkdir -p $SCRATCH_DIR
setenv OCT_TmpDir \'/home/octopus/oct3.2_scr/${MY_SCRDIR}_$$\'

echo "The scratch directory for this run is: $OCT_TmpDir"; echo ""

# Start OCTOPUS job

echo 'Your Octopus job is starting!'

mpirun -np 8 -machinefile $PBS_NODEFILE ${OCTOPUS_ROOT}/bin/octopus_mpi > sodium_gstate.out

# Clean up scratch files by default

/bin/rm -r $SCRATCH_DIR

echo 'Your Octopus job is done!'

The memory selected on the '-l select' line above is sized for ANDY's pro-rated maximum memory per core. Please consult the sections on the PBS Pro Batch scheduling system below for information on how to modify this sample deck for different processor counts. The rest of the script describes its action in comments.

Users should become aware of the scaling properties of their work by taking note of the run times at various processor counts. When doubling the processor count improves SCF cycle time by only a modest percentage, further increases in processor count should be avoided. The ANDY system has two distinct interconnects: a DDR InfiniBand network that delivers 20 Gbits per second and a QDR InfiniBand network that delivers 40 Gbits per second. Either will serve Octopus users well, but the QDR network should provide somewhat better scaling.

In the example above, the 'production_qdr' queue has been requested, but by dropping the terminating '_qdr' one can select the DDR interconnect. This might be a better choice if the QDR side of the ANDY system is crowded with jobs.
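
In practice the switch involves changing only the queue line of the batch script; both forms are sketched here:

# QDR InfiniBand side of ANDY:
#PBS -q production_qdr
# DDR InfiniBand side of ANDY (drop the '_qdr' suffix):
#PBS -q production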

PHOENICS

PHOENICS is an integrated Computational Fluid Dynamics (CFD) package for the preparation, simulation, and visualization of processes involving fluid flow, heat or mass transfer, chemical reaction, and/or combustion in engineering equipment, building design, and the environment. More detail is available at the CHAM website, here http://www.cham.co.uk.

Although we expect most users to pre- and post-process their jobs on office-local clients, the CUNY HPC Center has installed the Unix version of the entire PHOENICS package on ANDY. PHOENICS is installed in /share/apps/phoenics/default where all the standard PHOENICS directories are located (d_allpro, d_earth, d_enviro, d_photo, d_priv1, d_satell, etc.). Of particular interest is the MPI parallel version of the 'earth' executable (parexe) which makes full use of the parallel processing power of the ANDY cluster for larger individual jobs. While the parallel scaling properties of PHOENICS jobs will vary depending on the job size, processor type, and the cluster interconnect, larger work loads will generally scale and run efficiently on from 8 to 32 processors, while smaller problems will scale efficiently only up to about 4 processors. More detail on parallel PHOENICS is available at http://www.cham.co.uk/products/parallel.php. Aside from the tightly coupled MPI parallelism of 'parexe', users can run multiple instances of the non-parallel modules on ANDY (including the serial 'earexe' module) when a parametric approach can be used to solve their problems.

As suggested, the entire PHOENICS package is installed on ANDY and users can run the X11 version of the PHOENICS Commander display tool from ANDY's head node if they have connected using 'ssh -X andy.csi.cuny.edu' where the '-X' option ensures that X11 images are passed back to the original client. Still, CUNY has licensed a number of seats for office-local desktop installations of PHOENICS (for either Windows or Linux). Job preparation and post-processing work is generally most efficiently accomplished on the desktop using the Windows version of PHOENICS VR, which can be run directly or from PHOENICS Commander. A rough general outline of the PHOENICS work cycle is:

1.  The user runs VR Editor (preprocessor) on their workstation (or on ANDY) and
    perhaps selects a library case (e.g. 274) making changes to this case to match
    his/her specific requirements.
 
2.  The user leaves the VR editor where input files 'q1' and 'eardat' are created.  
    If the user is preprocessing from their desktop, these files would then be 
    transferred to ANDY using the 'scp' command or via the 'PuTTy' utility for 
    Windows.
 
3.  The user runs the solver on ANDY (typically the parallel version) from their
    working directory using the PBS batch submit script presented below.  This
    script reads the files 'q1' and 'eardat' (and potentially some other input files)
    and writes the key output files 'phi' and 'result'. 
 
4.  The user copies the output files back to their desktop (or not) and runs VR
    Viewer (postprocessor) which reads the graphics output file 'phi', or the user
    views tabular results manually in the 'result' file.

POLIS, available in Linux and Windows, has further useful information on running PHOENICS tutorials, viewing documentation, and on all PHOENICS commands and topics: http://www.cham.co.uk/phoenics/d_polis/polis.htm. Graphical monitoring should be deactivated during parallel runs through ANDY's batch queue. To do this, users should place two leading spaces in front of the command TSTSWP in the 'q1' file. The TSTSWP command is present in most library cases, including case 274, which is a useful test case. Graphical monitoring can be left turned on when running sequential 'earth' on the desktop. This will give useful real-time information on sweeps, values, and convergence progress.

Details on the use of the display and non-parallel PHOENICS tools can be found at the CHAM website and in the CHAM Encyclopaedia at http://www.cham.co.uk/phoenics/d_polis/polis.htm.

Here, the process of setting up a PHOENICS working directory and running the parallel version of 'earth' (parexe) on ANDY is described. As a first step, users would typically create a directory called 'phoenics' in their $HOME directory as follows:

cd; mkdir phoenics

Next, a symbolic link named 'lp36', pointing to the PHOENICS installation root directory named above, should be created inside this new directory:

cd phoenics
ln -s /share/apps/phoenics/default ./lp36

The user must then generate the required input files for the 'earth' module, which, as mentioned above in the PHOENICS work cycle section, are the 'q1' and 'eardat' files created by the VR Editor. These can be generated on ANDY, but it is generally easier to create them with the user's desktop installation of PHOENICS and transfer them to ANDY as sketched below.
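
A minimal sketch of that transfer from a Linux or Mac desktop, copying the two files into the 'phoenics' working directory created above (the user name 'jdoe' is hypothetical; Windows users would use PuTTY's 'pscp' or a similar tool instead):

scp q1 eardat jdoe@andy.csi.cuny.edu:phoenics/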

Once the input files have been created and transferred to the working directory on ANDY, the following PBS Pro batch script can be used to run the job. The progress of the job can be tracked with the 'qstat' command.

#!/bin/bash
#PBS -q production_qdr
#PBS -N phx_test
#PBS -l select=8:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

cd $HOME/phoenics

echo $PBS_NODEFILE
cat  $PBS_NODEFILE

echo  "Running:  mpirun -np 8 -machinefile $PBS_NODEFILE lp36/d_earth/parexe"

mpirun -np 8 -machinefile $PBS_NODEFILE ./lp36/d_earth/parexe

echo "Finished ... "

Constructing a PBS batch script is described in detail elsewhere in this Wiki, but in short this script requests the QDR InfiniBand production queue ('production_qdr'), which runs the job on the side of ANDY with the faster interconnect. It asks for 8 processors (cores), each with 2880 Mbytes of memory, and allows PBS to select those processors based on a least-loaded criterion. Because this is just an 8-processor job, it could be packed onto a single physical node on ANDY for better scaling using '-l place=pack', as sketched below.
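
A sketch of that packed variant, assuming a single ANDY compute node has enough free cores and memory for all eight chunks (only the placement line changes; the rest of the script stays the same):

#PBS -l select=8:ncpus=1:mem=2880mb
#PBS -l place=pack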

During the run, 'parexe' creates (N-1) directories (named Proc00#), where N is the number of processors requested (note: if the Proc00# directories do not already exist, they will be created, but there will be an error message in the PBS error log, which can be ignored). The output from process zero is written into the working directory, and that from each of the other MPI processes is written into its associated 'Proc00#' directory. Upon successful completion, the 'result' file should show that the requested number of iterations (sweeps) was completed and print the starting and ending wall-clock times. At this point, the results (the 'phi' and 'result' files) of the job can be copied back to the user's desktop for post-processing.

NOTE: A bug is present in the non-graphical, batch version of PHOENICS that is used on the CUNY HPC clusters. This problem does not occur in Windows runs. To avoid it, a workaround modification to the 'q1' input file is required. The problem occurs only in jobs that require SWEEP counts greater than 10,000 (e.g. SWEEP=20000). Users requesting larger SWEEP counts must include the following in their 'q1' input files to avoid having their jobs terminated at 10,000 SWEEPs.

USTEER=F

This addition forces a bypass of the graphical IO monitoring capability in PHOENICS and prevents that section of code from capping the SWEEP count at 10,000 SWEEPs.

Finally, PHOENICS has been licensed broadly by the CUNY HPC Center, and it can provide activation keys for any desktop copies whose annual activation keys expire.

R

General Notes

R is a free software environment for statistical computing and graphics. The R language has become a de facto standard among statisticians and is widely used for statistical software development and data analysis.

R is part of the GNU project.

R is available on the following HPCC servers: Athena, Bob, Andy, and Karle. Karle is the only machine where R can be used without submitting jobs through PBS; on all other systems users must submit their R jobs to the PBS queue.

Complete R documentation may be found at http://www.r-project.org/

Running R on Karle

The following is the "Hello World" program written in R:

# Hello World example
a <- c("Hello, world!")
print(a)

To run an R job on the Karle server, save your R script into a file (for example "helloworld.R") and use the following command to launch it:

/share/apps/r/default/bin/R --vanilla --slave < helloworld.R

R GUI

A GUI for R is installed on Karle. To use it, log in to Karle with "ssh -X" and type:

jgr

Running R on cluster machines

In order to run an R job on any of the HPCC's cluster machines (Athena, Bob, or Andy), users must go through PBS. Submitting a serial R job to the PBS queue is exactly the same as submitting any other serial job.

Consider the example above. To run this simple "hello-world" R job, users need a PBS script such as the following:

#!/bin/bash
#PBS -q production
#PBS -N R_job
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -V

echo "Starting R job ..."

cd $PBS_O_WORKDIR

/share/apps/r/default/bin/R --vanilla --slave < helloworld.R

echo "R job is finished."

R jobs may also be run in parallel (e.g. with the help of the "multicore" package). To run an SMP-parallel job, the PBS script should be modified as shown here:

#!/bin/bash
#PBS -q production
#PBS -N R_job
#PBS -l select=1:ncpus=8
#PBS -l place=pack
#PBS -V

echo "Starting SMP-parallel R job ..."

cd $PBS_O_WORKDIR

/share/apps/r/default/bin/R --vanilla --slave < myparalleljob.R

echo "R job is finished."

R packages

In order to install an R package, start R and run the following command:

install.packages("package.name")

and pick a mirror from the list. After the package is installed, load it with:

library(package.name)

Please note that the following packages are already available on "karle":

locfit
VGAM
network
sna
RGraphics
rgl
DMwR
RMySQL
randomForest
xts
tseries
jgr

RAXML

Randomized Axelerated Maximum Likelihood (RAxML) is a program for sequential and parallel maximum-likelihood-based inference of large phylogenetic trees. It is a descendent of fastDNAml, which in turn was derived from Joe Felsenstein's DNAml, part of the PHYLIP package. RAxML 7.2.8 is the latest version and is installed at the CUNY HPC Center on BOB, ATHENA, and ANDY. RAxML is available in both serial and MPI-parallel versions. The MPI-parallel version should be run on four or more cores. Examples of running both a parallel and a serial job are presented below. More information can be found at [49].

To run RAxML, a PHYLIP file of aligned DNA or amino-acid sequences similar to the one below must first be created. This file, 'alg.phy', is in interleaved format:

5 60
Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAG
Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGG
Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGG
Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGG
Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGG

GAAATGGTCAATATTACAAGGT
GAAATGGTCAACATTAAAAGAT
GAAATCGTCAATATTAAAAGGT
GAAATGGTCAATCTTAAAAGGT
GAAATGGTCAATATTAAAAGGT

For more detail about PHYLIP-formatted files, please look at the RAxML manual at the web site referenced above.

Next, create a PBS batch script. Below is an example script that will run the serial version of RAxML. The program options -m, -n, and -s are all required. In order, they specify the substitution model (-m), the output file name (-n), and the sequence file name (-s). Additional options are discussed in the manual.

#!/bin/bash
#PBS -q production
#PBS -N RAXML_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin RAXML Serial Run ..."
echo ""
/share/apps/raxml/default/bin/raxmlHPC -m GTRCAT -n TEST1 -s alg.phy
echo ""
echo ">>>> End   RAXML Serial Run ..."

This script can be dropped into a file (say raxml_serial.job) and submitted to PBS with the following command:

qsub raxml_serial.job

RAxML produces the following output files:

  1. Parsimony starting tree is written to RAxML_parsimonyTree.TEST1.
  2. Final tree is written to RAxML_result.TEST1.
  3. Execution Log File is written to RAxML_log.TEST1.
  4. Execution information file is written to RAxML_info.TEST1.

RAxML is also available in an MPI-parallel version called raxmlHPC-MPI. The MPI-parallelized version can be run on all types of clusters to perform rapid parallel bootstraps or multiple inferences on the original alignment. The MPI version is intended for large production runs (i.e. 100 or 1,000 bootstraps). You can also perform multiple inferences on larger datasets in parallel to find a best-known ML tree for your dataset. Finally, the novel rapid BS algorithm and the associated ML search have also been parallelized with MPI.

The following MPI script selects 4 processors (cores) and allows PBS to put them on any compute node. Note that when running any parallel program one must be cognizant of the scaling properties of its parallel algorithm; in other words, how much does a given job's run time drop as one doubles the number of processors used? All parallel programs arrive at a point of diminishing returns that depends on the algorithm, the size of the problem being solved, and the performance features of the system on which it is run. We might have chosen to run this job on 8, 16, or 32 processors (cores), but would only do so if the improvement in performance scales. An improvement of less than 25% after a doubling is an indication that a reasonable maximum number of processors has been reached under that particular set of circumstances.

#!/bin/bash
#PBS -q production
#PBS -N RAXML_mpi
#PBS -l select=4:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel executable to run
echo ">>>> Begin RAXML MPI Run ..."
echo ""
mpirun -np 4 -machinefile $PBS_NODEFILE /share/apps/raxml/default/bin/raxmlHPC-MPI -m GTRCAT -n TEST2 -s alg.phy -N 4
echo ""
echo ">>>> End   RAXML MPI Run ..."

This test case should take no more than a minute to run and will produce PBS output and error files beginning with the job name 'RAXML_mpi'. Other RAxML-specific outputs will also be produced. Details on the meaning of the PBS script are covered above in this Wiki's PBS section. The most important lines are '#PBS -l select=4:ncpus=1:mem=1920mb' and '#PBS -l place=free'. The first instructs PBS to select 4 resource 'chunks', each with 1 processor (core) and 1,920 MBytes of memory, for the job (on ANDY as much as 2,880 MBytes might have been selected). The second line instructs PBS to place this job wherever the least used resources are found (i.e. freely). The master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command. As this is a parallel job, other compute nodes may also be called into service to complete it. Note that the name of the parallel executable is 'raxmlHPC-MPI' and that in this parallel run we complete four simulations (-N 4).
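
As with the serial case, this script can be dropped into a file (say 'raxml_mpi.job', a name used here just for illustration) and submitted with:

qsub raxml_mpi.job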

SAS

SAS (pronounced "sass", originally Statistical Analysis System) is an integrated system of software products provided by SAS Institute Inc. that enables the programmer to perform:

  • data entry, retrieval, management, and mining
  • report writing and graphics
  • statistical analysis
  • business planning, forecasting, and decision support
  • operations research and project management
  • quality improvement
  • applications development
  • data warehousing (extract, transform, load)
  • platform independent and remote computing

In addition, SAS has many business solutions that enable large scale software solutions for areas such as IT management, human resource management, financial management, business intelligence, customer relationship management and more.

SAS software is currently installed on the Neptune server. In order to run it, users need to:

  • Login to "Karle" server. The procedure is described here. Note that X11 forwarding should be enabled. Read this article for details.
  • start SAS by typing the "sas_en" command.


Stata/MP

Stata is a complete, integrated statistical package that provides tools for data analysis, data management, and graphics. Stata/MP takes advantage of multiprocessor computers. The CUNY HPC Center is licensed to use Stata on up to 8 cores.

Currently Stata/MP is available for users on Karle (karle.csi.cuny.edu).

Stata can be run in two modes:

  • using Command Line Interface
  • using GUI

To start a Stata session on Karle:

1) login to "Karle" server. The procedure is described here. Note that to run Stata in GUI mode X11 forwarding should be enabled. Read this article for details.

2) set the PATH for your session:

# export PATH=$PATH:/share/apps/stata/stata12
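
Optionally, to avoid retyping this at every login, the same line can be appended to the shell start-up file (a minimal sketch, assuming the bash shell and a standard ~/.bashrc):

echo 'export PATH=$PATH:/share/apps/stata/stata12' >> ~/.bashrc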

3) start Stata using

  • stata-mp for CLI
  • xstata-mp for GUI


4) after Stata has successfully started (in either CLI or GUI mode), a welcome message will be printed to the screen:

./stata-mp 

  ___  ____  ____  ____  ____ (R)
 /__    /   ____/   /   ____/
___/   /   /___/   /   /___/   12.0   Copyright 1985-2011 StataCorp LP
  Statistics/Data Analysis            StataCorp
                                      4905 Lakeway Drive
     MP - Parallel Edition            College Station, Texas 77845 USA
                                      800-STATA-PC        http://www.stata.com
                                      979-696-4600        stata@stata.com
                                      979-696-4601 (fax)

2-user 8-core Stata network perpetual license:
       Serial number:  50120553010
         Licensed to:  CUNY HPCC
                       New York

Notes:
      1.  (-v# option or -set maxvar-) 5000 maximum variables
      2.  Command line editing enabled


.

5) the Stata command prompt '.' is now waiting for input. As an example, consider:

. use /share/apps/stata/stata12/auto.dta

This will load '/share/apps/stata/stata12/auto.dta' into the Stata session.

Now Stata routines may be applied to this data:

. describe

Contains data from /share/apps/stata/stata12/auto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          13 Apr 2011 17:45
 size:         3,182                          (_dta has notes)
-------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------
Sorted by:  foreign

. summarize price, detail

                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. Dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188

. 

and so on...


6) Complete documentation on Stata usage can be found under:

/share/apps/stata/stata12/docs
  • Users will need to copy the PDF documents from this directory to their local workstations, as sketched below.
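
A sketch of such a copy from a Linux or Mac workstation (the user name 'jdoe' and the local target directory 'stata-docs' are hypothetical; Windows users can use WinSCP or PuTTY's 'pscp' instead):

scp -r jdoe@karle.csi.cuny.edu:/share/apps/stata/stata12/docs ./stata-docs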

Structurama

Structurama is a program for inferring population structure from genetic data. The program assumes that the sampled loci are in linkage equilibrium and that the allele frequencies for each population are drawn from a Dirichlet probability distribution. The method implements two different models for population structure. First, Structurama implements the method of Pritchard et al. (2000) in which the number of populations is considered fixed. The program also allows the number of populations to be a random variable following a Dirichlet process prior (Pella and Masuda, 2006; Huelsenbeck and Andolfatto, 2007). Importantly, the program can estimate the number of populations under the Dirichlet process prior. Markov chain Monte Carlo (MCMC) is used to approximate the posterior probability that individuals are assigned to specific populations. Structurama also allows the individuals to be admixed. Structurama implements a number of methods for summarizing the results of a Bayesian MCMC analysis of population structure. Perhaps most interestingly, the program finds the mean partition, a partitioning of individuals among populations that minimizes the squared distance to the sampled partitions. More detailed information about Structurama can be found at the web site here [50] and in the manual here [51].

The October 2011 version of Structurama is installed on BOB and ATHENA. Structurama is a serial program with only an interactive command-line interface; therefore, making PBS batch serial runs requires the user to supply, within the PBS batch script, the exact and complete list of commands that an interactive use of the program would have required. In addition, an executable Structurama data file must be present in the PBS working directory. The following PBS batch script shows how this is done using the Unix 'here-document' construction (i.e. <<):

#!/bin/bash
#PBS -q production
#PBS -N STRAMA_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin STRUCTURE RAMA Serial Run ..."
echo ""

/share/apps/structurama/default/bin/st2 << EOF
execute test.inp
yes
quit
EOF

echo ""
echo ">>>> End   STRUCTURE RAMA Serial Run ..."

This script can be dropped into a file (say 'strama_serial.job') and submitted for execution using the following PBS command:

qsub strama_serial.job

A basic test input file should take less than a minute to run and will produce PBS output and error files beginning with the job name 'STRAMA_serial'. Additional, Structurama-specific output files can also be requested. This job will write a Structurama output file called 'strout.p'. Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBytes of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The lines following the full Unix-path reference to the Structurama executable 'st2' show what is required to deliver input to an interactive program in a batch script. The input-equivalent sequence of commands should be placed, one per line, between the first and last 'EOF', which demarcate the entire pseudo-interactive session. NOTE: If you forget to include the final command 'quit', your PBS job will never complete, as it will be waiting for its final termination instructions and will never receive them. Such a job should be deleted with the PBS command 'qdel JID', where JID is the numerical PBS job identification number. If you would like a printout of all the Structurama options, include the line 'help' in your command stream.

Structure

The program Structure is a free software package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly used genetic markers, including SNPs, microsatellites, RFLPs, and AFLPs. More detailed information about Structure can be found at the web site here [52].

Version 2.3.3 of Structure is installed on BOB and ATHENA at the CUNY HPC Center. Structure is a serial program. The following PBS batch script shows how to run a single, basic Structure serial job:

#!/bin/bash
#PBS -q production
#PBS -N STRUCT_simple
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Set the root directory for the 'structure' binary
STROOT=/share/apps/structure/default/bin

# Point to the execution directory to run
echo ">>>> Begin STRUCTURE Serial Run ..."
echo ""
${STROOT}/structure -K 1 -m mainparams -i ./sim.str -o ./sim_k1_run1.out
echo ""
echo ">>>> End   STRUCTURE Serial Run ..."

This script can be dropped into a file (say 'struct_serial.job') and submitted for execution using the following PBS command:

qsub struct_serial.job

This test input file should take less than 5 minutes to run and will produce PBS output and error files beginning with the job name 'STRUCT_simple'. Additional, Structure-specific output files will also be created, including an output file called 'sim_k1_run1.out_f'. Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBytes of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The Structure program requires its own input and data files, properly configured, to run successfully. For the example above these include the input file ('sim.str' above), the 'mainparams' file ('mainparams' above), and the 'extraparams' file (the default name, 'extraparams', is assumed in the example above). The user is responsible for configuring these files correctly for each run, but the data files for this example and others can be found in the directory:

/share/apps/structure/default/examples

on BOB and ATHENA.
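
A sketch of copying those example files into a fresh working directory before submitting the job above (the directory name 'structure_test' is chosen here only for illustration):

mkdir -p $HOME/structure_test
cd $HOME/structure_test
cp /share/apps/structure/default/examples/* .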

Often, Structure users are interested in making multiple runs over a large simulation regime-space. This requires appropriately configured input and parameter files for each individual run. Data file configuration can be done manually or with the help of the Python-based tool StrAuto, which the HPC Center has installed to support making multiple Structure runs. StrAuto is documented at its download site here [53], and all the files, including the primary Python tool 'strauto-0.3.1.py', are available in:

/share/apps/strauto/default

In this process, the StrAuto script, 'strauto-0.3.1.py' (found in '/share/apps/strauto/default/bin'), is run in the presence of a user-created, regime-space configuration file called 'input.py'. This produces a Unix script file called 'runstructure' that can then be used to run the user-defined spectrum of cases, one after another. NOTE: the 'strauto-0.3.1.py' script requires Python 2.7.2 to run correctly. This version is NOT the default version of Python installed on either ATHENA or BOB, and therefore users of StrAuto must invoke the 'strauto-0.3.1.py' script using a specially installed version of Python, as follows:

/share/apps/python/2.7.2/bin/python ./strauto-0.3.1.py

The above command assumes that 'strauto-0.3.1.py' has been copied into the user's directory and that the required 'input.py' file is also present there. The contents of the 'runstructure' file produced can then be integrated into a PBS batch script similar to the simple, single-run script shown above, but designed to run each case in the simulation regime-space in succession. Here is an example of just such a runstructure-adapted PBS script:

#!/bin/bash
#PBS -q production
#PBS -N STRUCT_cmplx
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

#-----------------------------------------------------------------------------------
# This PBS batch script is based on the 'runstructure' script generated by 
# Vikram Chhatre's setup and pre-processing program 'strauto-0.3.1.py' written
# in Python at Texas A&M University to be used with the 'structure' application.
#
# Each 'runstructure' script is custom-generated by the 'strauto-0.3.1.py' python script
# based on a custom input file.  It completes a series of runs over a regime defined 
# by the 'structure' user for that custom input file only.  This means it will only 
# work for that input data file. 
#                   Email: crypticlineage (at) tamu.edu                        
#-----------------------------------------------------------------------------------

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Setup a directory structure for the multiple 'structure' runs
mkdir results_f log harvester
mkdir k1
mkdir k2
mkdir k3
mkdir k4
mkdir k5

cd log
mkdir k1
mkdir k2
mkdir k3
mkdir k4
mkdir k5

cd ..

# Set the root directory for the 'structure' binary
STROOT=/share/apps/structure/default/bin

# Point to the execution directory to run
echo ">>>> Begin Multiple STRUCTURE Serial Runs ..."
echo ""

${STROOT}/structure -K 1 -m mainparams -o k1/sim_k1_run1 2>&1 | tee log/k1/sim_k1_run1.log
${STROOT}/structure -K 1 -m mainparams -o k1/sim_k1_run2 2>&1 | tee log/k1/sim_k1_run2.log
${STROOT}/structure -K 1 -m mainparams -o k1/sim_k1_run3 2>&1 | tee log/k1/sim_k1_run3.log
${STROOT}/structure -K 2 -m mainparams -o k2/sim_k2_run1 2>&1 | tee log/k2/sim_k2_run1.log
${STROOT}/structure -K 2 -m mainparams -o k2/sim_k2_run2 2>&1 | tee log/k2/sim_k2_run2.log
${STROOT}/structure -K 2 -m mainparams -o k2/sim_k2_run3 2>&1 | tee log/k2/sim_k2_run3.log
${STROOT}/structure -K 3 -m mainparams -o k3/sim_k3_run1 2>&1 | tee log/k3/sim_k3_run1.log
${STROOT}/structure -K 3 -m mainparams -o k3/sim_k3_run2 2>&1 | tee log/k3/sim_k3_run2.log
${STROOT}/structure -K 3 -m mainparams -o k3/sim_k3_run3 2>&1 | tee log/k3/sim_k3_run3.log
${STROOT}/structure -K 4 -m mainparams -o k4/sim_k4_run1 2>&1 | tee log/k4/sim_k4_run1.log
${STROOT}/structure -K 4 -m mainparams -o k4/sim_k4_run2 2>&1 | tee log/k4/sim_k4_run2.log
${STROOT}/structure -K 4 -m mainparams -o k4/sim_k4_run3 2>&1 | tee log/k4/sim_k4_run3.log
${STROOT}/structure -K 5 -m mainparams -o k5/sim_k5_run1 2>&1 | tee log/k5/sim_k5_run1.log
${STROOT}/structure -K 5 -m mainparams -o k5/sim_k5_run2 2>&1 | tee log/k5/sim_k5_run2.log
${STROOT}/structure -K 5 -m mainparams -o k5/sim_k5_run3 2>&1 | tee log/k5/sim_k5_run3.log

# Consolidate all results in a single 'zip' file
mv k1 k2 k3 k4 k5  results_f/
cd results_f/
cp k*/*_f . && zip sim_Harvester-Upload.zip *_f && rm *_f
mv sim_Harvester-Upload.zip ../harvester/
cd ..

echo ""
echo ">>>> Zip Archive: sim_Harvester-Upload.zip is Ready ... "
echo ">>>> End  Multiple  STRUCTURE Serial Runs ..."

This script can be dropped into a file (say 'struct_cmplx.job') and submitted for execution using the following PBS command:

qsub struct_cmplx.job

The 'struct_cmplx.job' script runs one Structure job after another each with a slightly different set of input parameters. All the associated files and directories from a successful StrAuto-supported run of Structure using this script can be found on either ATHENA or BOB in:

/share/apps/strauto/default/examples

WRF

The Weather Research and Forecasting (WRF) model is a specific computer program with dual use for weather forecasting and research. It was created through a partnership that includes the National Oceanic and Atmospheric Administration (NOAA), the National Center for Atmospheric Research (NCAR), and more than 150 other organizations and universities in the United States and abroad. WRF is the latest numerical model and application to be adopted by NOAA's National Weather Service as well as the U.S. military and private meteorological services. It is also being adopted by government and private meteorological services worldwide.

There are two distinct WRF development trees and versions, one for production forecasting and another for research and development. NCAR's experimental, advanced research version, called ARW (Advanced Research WRF), features very high resolution and is being used to explore ways of improving the accuracy of hurricane tracking, hurricane intensity, and rainfall forecasts, among a host of other meteorological questions. It is ARW version 3.3.0, along with its pre- and post-processing modules (WPS and WPP) and the MET and GrADS display tools, that is supported here at the CUNY HPC Center. ARW version 3.3.0 is supported on both the CUNY HPC Center SGI (ANDY) and Cray (SALK).

A complete start-to-finish use of ARW requires a significant number of pre-processing, parallel production modeling, and post-processing and display steps. There are several alternative paths that can be taken through each stage. In particular, ARW itself offers users the ability to process either real or idealized weather data. Completing one type of simulation or the other requires different steps and even different versions of the ARW executable. To help our users familiarize themselves with running ARW at the CUNY HPC Center, the steps required to complete a start-to-finish, real-case forecast are presented below. For more complete coverage, the CUNY HPC Center recommends that new users study the detailed description of the ARW package and how to use it at the University Corporation for Atmospheric Research (UCAR) website http://www.mmm.ucar.edu/wrf/OnLineTutorial/Basics/index.html.

WRF Pre-Processing with WPS

The WPS part of the WRF package is responsible for mapping time-equals-zero simulation input data onto the simulation domain's terrain. This process involves the execution of the preprocessing applications geogrid.exe, ungrib.exe, and metgrid.exe. Each of these applications reads its input parameters from the 'namelist.wps' input specifications file. In the example presented here, we will run a weather simulation based on input data from January of 2000 for the eastern United States. These steps should work on both ANDY and SALK, with minor differences as noted. To begin this example, create a working WPS directory and copy the test case namelist file into it.

mkdir -p $HOME/wrftest/wps
cd $HOME/wrftest/wps
cp /share/apps/wrf/default/WPS/namelist.wps .

Next, you should edit the 'namelist.wps' to point to the sample data made available in the WRF installation tree. This involves making sure that the geog_data_path assignment in the geogrid section of the namelist file points to the sample data tree. From an editor make the following assignment:

geog_data_path = '/share/apps/wrf/default/WPS_DATA/geog_v3.1'

Once this is completed, you must symbolically link or copy the geogrid data table directory to your working directory ($HOME/wrftest/wps here).

ln -sf /share/apps/wrf/default/WPS/geogrid ./geogrid

Now, you can run 'geogrid.exe', the geogrid executable, which defines the simulation domains and interpolates the various terrestrial data sets onto the model's grid. The global environment on ANDY has been set to include the path to all the WRF-related executables, including 'geogrid.exe'. On SALK, you must first load the WRF module ('module load wrf') to set the environment. The geogrid executable is an MPI parallel program that could be run in parallel as part of a PBS batch script to complete the combined WRF preprocessing and execution steps, but it often runs only a short while and can be run interactively on ANDY's head node before submitting a full WRF batch job.

From the $HOME/wrftest/wps working directory run:

geogrid.exe > geogrid.out

If you are on SALK (Cray XE6), you will have to run:

 aprun -n 1 geogrid.exe > geogrid.out

Two domain files (geo_em.d01.nc and geo_em.d02.nc) should be produced for this basic test case, as well as a log file and an output file that indicates success at the end with:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!  Successful completion of geogrid.        !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The next required preprocessing step is to run 'ungrib.exe', the ungrib executable. The purpose of ungrib is to unpack 'GRIB' ('GRIB1' and 'GRIB2') meteorological data and pack it into an intermediate file format usable by 'metgrid.exe' in the final preprocessing step.

The data for the January 2000 simulation being documented here has already been downloaded and placed in the WRF installation tree in /share/apps/wrf/default/WPS_DATA. Before running 'ungrib.exe', the WRF installation 'Vtable' file must first be symbolically linked into the working directory with:

$ln -sf /share/apps/wrf/default/WPS/ungrib/Variable_Tables/Vtable.AWIP Vtable
$ls
geo_em.d01.nc  geo_em.d02.nc  geogrid  geogrid.log  namelist.wps  Vtable

The Vtable file specifies which fields, identified by their GRIB codes, are to be unpacked from the GRIB files. For this test case the required Vtable file has already been defined, but users may have to construct a custom Vtable file for their own data.

Next, the GRIB files themselves must also be symbolically linked into the working directory. WRF provides a script to do this.

$link_grib.csh /share/apps/wrf/default/WPS_DATA/JAN00/2000012
$ls
geo_em.d01.nc  geogrid      GRIBFILE.AAA  GRIBFILE.AAC  GRIBFILE.AAE  GRIBFILE.AAG  GRIBFILE.AAI  GRIBFILE.AAK  GRIBFILE.AAM  namelist.wps
geo_em.d02.nc  geogrid.log  GRIBFILE.AAB  GRIBFILE.AAD  GRIBFILE.AAF  GRIBFILE.AAH  GRIBFILE.AAJ  GRIBFILE.AAL  GRIBFILE.AAN  Vtable

Note 'ls' shows that the 'GRIB' files are now present.

Next, two more edits to the 'namelist.wps' file are required: one to set the start and end dates of the simulation to our January 2000 time frame, and a second to set the interval in seconds between the input meteorological data files (21600 seconds / 3600 = 6.0 hours in this case). Edit the 'namelist.wps' file by setting the following in the shared section of the file:

 start_date = '2000-01-24_12:00:00','2000-01-24_12:00:00',
 end_date   = '2000-01-25_12:00:00','2000-01-25_12:00:00',
interval_seconds = 21600

Now you can run 'ungrib.exe' to create the intermediate files required by 'metgrid.exe':

$ungrib.exe > ungrib.out
$ls
FILE:2000-01-24_12  FILE:2000-01-25_06  geo_em.d02.nc  GRIBFILE.AAA  GRIBFILE.AAD  GRIBFILE.AAG  GRIBFILE.AAJ  GRIBFILE.AAM  ungrib.log
FILE:2000-01-24_18  FILE:2000-01-25_12  geogrid        GRIBFILE.AAB  GRIBFILE.AAE  GRIBFILE.AAH  GRIBFILE.AAK  GRIBFILE.AAN  ungrib.out
FILE:2000-01-25_00  geo_em.d01.nc       geogrid.log    GRIBFILE.AAC  GRIBFILE.AAF  GRIBFILE.AAI  GRIBFILE.AAL  namelist.wps  Vtable

After a successful 'ungrib.exe' run you should get the familiar message at the end of the output file:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!  Successful completion of ungrib.  !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Like geogrid, the metgrid executable, 'metgrid.exe', needs to be able to find its table directory in the preprocessing working directory. The metgrid table directory may either be copied or symbolically linked into the working directory.

ln -sf /share/apps/wrf/default/WPS/metgrid ./metgrid

Finally, all the files required for a successful run of 'metgrid.exe' are now in place. Like 'geogrid.exe', 'metgrid.exe' is an MPI parallel program that could be run in PBS batch mode, but it often runs for only a short time and can be run on ANDY's head node, as follows:

$metgrid.exe > metgrid.out 
$ls
FILE:2000-01-24_12  geogrid       GRIBFILE.AAF  GRIBFILE.AAM                       met_em.d02.2000-01-24_12:00:00.nc  metgrid.out
FILE:2000-01-24_18  geogrid.log   GRIBFILE.AAG  GRIBFILE.AAN                       met_em.d02.2000-01-24_18:00:00.nc  namelist.wps
FILE:2000-01-25_00  GRIBFILE.AAA  GRIBFILE.AAH  met_em.d01.2000-01-24_12:00:00.nc  met_em.d02.2000-01-25_00:00:00.nc  ungrib.log
FILE:2000-01-25_06  GRIBFILE.AAB  GRIBFILE.AAI  met_em.d01.2000-01-24_18:00:00.nc  met_em.d02.2000-01-25_06:00:00.nc  ungrib.out
FILE:2000-01-25_12  GRIBFILE.AAC  GRIBFILE.AAJ  met_em.d01.2000-01-25_00:00:00.nc  met_em.d02.2000-01-25_12:00:00.nc  Vtable
geo_em.d01.nc       GRIBFILE.AAD  GRIBFILE.AAK  met_em.d01.2000-01-25_06:00:00.nc  metgrid
geo_em.d02.nc       GRIBFILE.AAE  GRIBFILE.AAL  met_em.d01.2000-01-25_12:00:00.nc  metgrid.log

If you are on SALK (Cray XE6), you will have to run:

 aprun -n 1 metgrid.exe > metgrid.out

Successful runs will produce an output file that includes:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!  Successful completion of metgrid.  !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Note that the met files required by WRF are now present (see the 'ls' output above). At this point, the preprocessing phase of this WRF sample run is complete. We can move on to actually running this real (not ideal) WRF test case using the PBS Pro batch scheduler in MPI parallel mode.

Running a WRF Real Case in Parallel Using PBS

Our frame of reference now turns to running 'real.exe' and 'wrf.exe' in parallel on ANDY or SALK via PBS Pro. As you perhaps noticed in walking through the preprocessing steps above, the preprocessing files are all installed in their own subdirectory (WPS) under the WRF installation tree root (/share/apps/wrf/default). The same is true for the files to run WRF. They reside under the WRF install root in the 'WRFV3' subdirectory.

Within this 'WRFV3' directory, the 'run' subdirectory contains all the common files needed for a 'wrf.exe' run except the 'met' files that were just created in the preprocessing section above and those that are produced by 'real.exe', which is run before 'wrf.exe' in real-data weather forecasts.

Note that the ARW version of WRF allows one to produce a number of different executables depending on the type of run that is needed. Here, we are relying on the fact that the 'em_real' version of the code has already been built. Currently, the CUNY HPC Center has only compiled this version of WRF. Other versions can be compiled upon request. The subdirectory 'test' underneath the 'WRFV3' directory contains additional subdirectories for each type of WRF build (em_real, em_fire, em_hill2d_x, etc.).

To complete an MPI parallel run of this WRF real data case, a 'wrfv3/run' working directory for your run should be created, and it must be filled with the required files from the installation root's 'run' directory, as follows:

$cd $HOME/wrftest
$mkdir -p wrfv3/run
$cd wrfv3/run
$cp /share/apps/wrf/default/WRFV3/run/* .
$rm *.exe
$
$ls
CAM_ABS_DATA     ETAMPNEW_DATA_DBL  LANDUSE.TBL            ozone_lat.formatted   RRTM_DATA          RRTMG_SW_DATA      tr49t85
CAM_AEROPT_DATA  GENPARM.TBL        namelist.input         ozone_plev.formatted  RRTM_DATA_DBL      RRTMG_SW_DATA_DBL  tr67t85
co2_trans        grib2map.tbl       namelist.input.backup  README.namelist       RRTMG_LW_DATA      SOILPARM.TBL       URBPARM.TBL
ETAMPNEW_DATA    gribmap.txt        ozone.formatted        README.tslist         RRTMG_LW_DATA_DBL  tr49t67            VEGPARM.TBL
$

Note that the '*.exe' files were removed in the sequence above after the copy because they are already pointed to by ANDY's and SALK's system PATH variable.

Next, the 'met' files produced during the preprocessing phase above need to be copied or symbolically linked into the 'wrfv3/run' directory.

$
$pwd
/home/guest/wrftest/wrfv3/run
$
$cp ../../wps/met_em* .
$ls
CAM_ABS_DATA       LANDUSE.TBL                        met_em.d02.2000-01-25_00:00:00.nc  README.namelist    SOILPARM.TBL
CAM_AEROPT_DATA    met_em.d01.2000-01-24_12:00:00.nc  met_em.d02.2000-01-25_06:00:00.nc  README.tslist      tr49t67
co2_trans          met_em.d01.2000-01-24_18:00:00.nc  met_em.d02.2000-01-25_12:00:00.nc  RRTM_DATA          tr49t85
ETAMPNEW_DATA      met_em.d01.2000-01-25_00:00:00.nc  namelist.input                     RRTM_DATA_DBL      tr67t85
ETAMPNEW_DATA_DBL  met_em.d01.2000-01-25_06:00:00.nc  namelist.input.backup              RRTMG_LW_DATA      URBPARM.TBL
GENPARM.TBL        met_em.d01.2000-01-25_12:00:00.nc  ozone.formatted                    RRTMG_LW_DATA_DBL  VEGPARM.TBL
grib2map.tbl       met_em.d02.2000-01-24_12:00:00.nc  ozone_lat.formatted                RRTMG_SW_DATA
gribmap.txt        met_em.d02.2000-01-24_18:00:00.nc  ozone_plev.formatted               RRTMG_SW_DATA_DBL
$

The user may have edits to complete in the WRF 'namelist.input' file listed above to craft the exact job they wish to run. The default namelist file copied into our working directory is in large part what is needed for this test run, but we will reduce the total simulation time (for the weather model, not the job) from 12 hours to 1 hour by setting the 'run_hours' variable to 1, as sketched below.
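
A sketch of that edit from the command line, assuming the copied file contains a line of the form 'run_hours = 12,' (any text editor works equally well):

cd $HOME/wrftest/wrfv3/run
sed -i 's/run_hours *= *12,/run_hours = 1,/' namelist.input
grep run_hours namelist.input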

At this point we are ready to submit a PBS job. The PBS Pro batch script below first runs 'real.exe' which creates the WRF input files 'wrfbdy_d01' and 'wrfinput_d01', and then runs 'wrf.exe' itself. Both executables are MPI parallel programs, and here they are both run on 16 processors. Here is the 'wrftest.job' PBS script that will run on ANDY:

#!/bin/bash
#PBS -q production_qdr
#PBS -N wrf_realem
#PBS -l select=16:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

cd $HOME/wrftest/wrfv3/run

cat $PBS_NODEFILE

echo 'Running real.exe prep program'

mpirun -np 16  -machinefile $PBS_NODEFILE /share/apps/wrf/default/WRFV3/run/real.exe

echo 'Running wrf.exe itself'

mpirun -np 16  -machinefile $PBS_NODEFILE /share/apps/wrf/default/WRFV3/run/wrf.exe

echo 'Job is done!'

The full path to each executable is used for illustrative purposes, but both binaries are in the WRF install tree run directory and would be picked up from the system PATH environment variable without the full path. This job requests 16 resource chunks, each with 1 processor and 2880 MBytes of memory, and asks to be run on the QDR InfiniBand (faster interconnect) side of the ANDY system. Details on the use and meaning of the PBS option section of the job are available elsewhere in the CUNY HPC Wiki.

To submit the job type:

qsub wrftest.job

A slightly different version of the script is required to run the same job on SALK (the Cray):

#!/bin/bash
#PBS -q production
#PBS -N wrf_realem
#PBS -l select=16:ncpus=1:mem=2000mb
#PBS -l place=free
#PBS -j oe
#PBS -o wrf_test16_O1.out
#PBS -V

cd $PBS_O_WORKDIR

export XT_SYMMETRIC_HEAP_SIZE=150M

export MALLOC_MMAP_MAX=0
export MALLOC_TRIM_THRESHOLD=536870912

export MPICH_RANK_ORDER=3

echo 'Running real.exe prep program'

aprun -n 16 -N 16 /share/apps/wrf/default/WRFV3/run/real.exe

echo 'Running wrf.exe itself'

aprun -n 16 -N 16 /share/apps/wrf/default/WRFV3/run/wrf.exe

echo 'Job is done!'

A successful run on either ANDY or SALK will produce an 'rsl.out' and an 'rsl.error' file for each processor on which the job ran, so for this test case there will be 16 of each such file. The 'rsl.out' files reflect the run settings requested in the namelist file and then time-stamp the progress the job is making until the total simulation time is completed. The tail end of an 'rsl.out' file for a successful run should look like this:

:
:
Timing for main: time 2000-01-24_12:45:00 on domain   1:    0.06060 elapsed seconds.
Timing for main: time 2000-01-24_12:48:00 on domain   1:    0.06300 elapsed seconds.
Timing for main: time 2000-01-24_12:51:00 on domain   1:    0.06090 elapsed seconds.
Timing for main: time 2000-01-24_12:54:00 on domain   1:    0.06340 elapsed seconds.
Timing for main: time 2000-01-24_12:57:00 on domain   1:    0.06120 elapsed seconds.
Timing for main: time 2000-01-24_13:00:00 on domain   1:    0.06330 elapsed seconds.
 d01 2000-01-24_13:00:00 wrf: SUCCESS COMPLETE WRF
taskid: 0 hostname: gpute-2
taskid: 0 hostname: gpute-2
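
A quick way to confirm success from the command line; the per-rank files are typically named rsl.out.0000, rsl.out.0001, and so on:

cd $HOME/wrftest/wrfv3/run
grep 'SUCCESS COMPLETE WRF' rsl.out.*
tail -5 rsl.out.0000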

Post-Processing and Displaying WRF Results

MET (Model Evaluation Tools)

MET was developed by the National Center for Atmospheric Research (NCAR) Developmental Testbed Center (DTC) through the generous support of the U.S. Air Force Weather Agency (AFWA) and the National Oceanic and Atmospheric Administration (NOAA).

Description

MET is designed to be a highly-configurable, state-of-the-art suite of verification tools. It was developed using output from the Weather Research and Forecasting (WRF) modeling system but may be applied to the output of other modeling systems as well.

MET provides a variety of verification techniques, including:

  • Standard verification scores comparing gridded model data to point-based observations
  • Standard verification scores comparing gridded model data to gridded observations
  • Spatial verification methods comparing gridded model data to gridded observations using neighborhood, object-based, and intensity-scale decomposition approaches
  • Probabilistic verification methods comparing gridded model data to point-based or gridded observations


Usage

MET is a collection of components, each of which requires a specific input deck and generates output files upon a successful run.


1. PB2NC. This tool is used to create NetCDF files from input PrepBufr files containing point observations.

  • Input: One PrepBufr point observation file and one configuration file.
  • Output: One NetCDF file containing the observations that have been retained.

2. ASCII2NC tool is used to create NetCDF files from input ASCII point observations. These NetCDF files are then used in the statistical analysis step.

  • Input: One ASCII point observation file that has been formatted as expected.
  • Output: One NetCDF file containing the reformatted observations.

3. Pcp-Combine Tool (optional) accumulates precipitation amounts into the time interval selected by the user – if a user would like to verify over a different time interval than is included in their forecast or observational dataset.

  • Input: Two or more gridded model or observation files in GRIB1 format containing accumulated precipitation to be combined to create a new accumulation interval.
  • Output: One NetCDF file containing the summed accumulation interval.

4. Gen-Poly-Mask Tool will create a bitmapped masking area from a user specified polygon, i.e. a text file containing a series of latitudes / longitudes. This mask can then be used to efficiently limit verification to the interior of a user specified region.

  • Input: One gridded model or observation file in GRIB1 format and one ASCII file defining a Lat/Lon masking polyline.
  • Output: One NetCDF file containing a bitmap for the masking region defined by the polyline over the domain of the gridded input file.

5. The Point-Stat tool is used for grid-to-point verification, or verification of a gridded forecast field against point-based observations (i.e., surface observing stations, ACARS, rawinsondes, and other observation types that can be described as point observations).

  • Input: One model file either in GRIB1 format or in the NetCDF format output from the Pcp-Combine tool, at least one point observation file in NetCDF format (the output of the PB2NC or ASCII2NC tool), and one configuration file.
  • Output: One STAT file containing all of the requested line types, and several ASCII files for each line type requested.

6. The Grid-Stat tool produces traditional verification statistics when a gridded field is used as the observational dataset.

  • Input: One model file and one observation file either in GRIB1 format or in the NetCDF format output from the Pcp-Combine tool, and one configuration file.
  • Output: One STAT file containing all of the requested line types, several ASCII files for each line type requested, and one NetCDF file containing the matched pair data and difference field for each verification region and variable type/level being verified.


7. The MODE (Method for Object-based Diagnostic Evaluation) tool also uses gridded fields as observational datasets.

  • Input: One model file and one observation file either in GRIB1 format or in the NetCDF format output from the Pcp-Combine tool, and one or two configuration files.
  • Output: One ASCII file containing contingency table counts and statistics, one ASCII file containing single and pair object attribute values, one NetCDF file containing object indices for the gridded simple and cluster object fields, and one PostScript plot containing a summary of the features-based verification.

8. The Wavelet-Stat tool decomposes two-dimensional forecasts and observations according to the Intensity-Scale verification technique described by Casati et al. (2004).

  • Input: One model file and one gridded observation file either in GRIB1 format or in the NetCDF format output from the Pcp-Combine tool, and one configuration file.
  • Output: One STAT file containing the 'ISC' line type, one ASCII file containing intensity-scale information and statistics, one NetCDF file containing information about the wavelet decomposition of forecast and observed fields and their differences, and one PostScript file containing plots and summaries of the intensity-scale verification.

9. The Stat-Analysis tool reads the STAT output of Point-Stat, Grid-Stat, and Wavelet-Stat and can be used to filter the STAT data and produce aggregated continuous and categorical statistics.

  • Input: One or more STAT files output from the Point-Stat and/or Grid-Stat tools and, optionally, one configuration file containing specifications for the analysis job(s) to be run on the STAT data.
  • Output: ASCII output of the analysis jobs will be printed to the screen unless redirected to a file using the “-out” option.

10. The MODE-Analysis tool reads the ASCII output of the MODE tool and can be used to produce summary information about object location, size, and intensity (as well as other object characteristics) across one or more cases.

  • Input: One or more MODE object statistics files from the MODE tool and, optionally, one configuration file containing specification for the analysis job(s) to be run on the object data.
  • Output: ASCII output of the analysis jobs will be printed to the screen unless redirected to a file using the “-out” option.


Detailed documentation of all MET tools can be found at http://www.dtcenter.org/met/users/docs/overview.php

Running MET on Andy with PBS

MET tools are available under

/share/apps/met/default/bin
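If you would rather invoke the tools by name than by full path, this directory can be added to your PATH for the current shell session (a minimal sketch; adjust the syntax for your shell):

 export METHOME=/share/apps/met/default
 export PATH=$METHOME/bin:$PATH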

As an example of running the MET tools on Andy, consider the following. We will run gen_poly_mask. This tool requires two input files, which can be taken from the /share/apps/met/default/data directory.

mkdir ~/met_test
cd ~/met_test
cp /share/apps/met/default/data/poly/CONUS.poly ./
cp /share/apps/met/default/data/sample_fcst/2005080700/wrfprs_ruc13_24.tm00_G212 ./

Now one needs to construct a PBS script that will submit the job to a PBS queue. Use your favorite text editor to create a file named "sendpbs" with the following content:

#!/bin/bash
# Simple MPI PBS Pro batch job
#PBS -N testMET
#PBS -q production
#PBS -l select=1:ncpus=1:mpiprocs=1
#PBS -l place=free
#PBS -V

cd $PBS_O_WORKDIR

export METHOME=/share/apps/met/default

echo "*** Running Gen-Poly-Mask to generate a polyline mask file for the Continental United States ***"
$METHOME/bin/gen_poly_mask ./wrfprs_ruc13_24.tm00_G212 CONUS.poly CONUS_poly.nc -v 2
echo "*** Job is done! ***"

Submit the job using

qsub sendpbs
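While the job is queued or running, its status can be checked from the head node as with any other PBS job (a brief sketch; the job ID reported by 'qsub' will differ):

 qstat -u $USER
 qstat -f <job_id>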

Upon successful completion 3 files will be generated:

  • testMET.eXXXX -- file containing stderr; it should be empty if everything went right
  • testMET.oXXXX -- file containing stdout; in this example it should contain the following:
*** Running Gen-Poly-Mask to generate a polyline mask file for the Continental United States ***
Input Data File:        ./wrfprs_ruc13_24.tm00_G212
Input Poly File:        CONUS.poly
Parsed Grid:            Lambert Conformal (185 x 129)
Parsed Polyline:        CONUS containing 243 points
Points Inside Mask:     5483 of 23865
Output NetCDF File:     CONUS_poly.nc
*** Job is done! ***
  • CONUS_poly.nc -- NetCDF file containing a bitmap for the masking region defined by the polyline over the domain of the gridded input file. Note that this is a binary file, so listing it with 'cat' only exposes readable fragments of its header (the file origin, the Lambert Conformal projection parameters, the 185 x 129 grid dimensions, and the CONUS mask name).
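Because the output is a binary NetCDF file, it is more useful to inspect it with the standard NetCDF utilities than with 'cat'. Assuming the 'ncdump' utility is available on ANDY (an assumption -- it ships with the standard NetCDF distribution), the file's header can be examined with:

 ncdump -h CONUS_poly.nc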

Some Applications In Depth

Mathematica

General notes

“Mathematica” is a fully integrated technical computing system that combines fast, high-precision numerical and symbolic computation with data visualization and programming capabilities. Mathematica version 8.0 is currently installed on the CUNY HPC Center's ATHENA cluster (athena.csi.cuny.edu) and the KARLE standalone server (karle.csi.cuny.edu). The basics of running Mathematica on CUNY HPC systems are presented here. Additional information on how to use Mathematica can be found at http://www.wolfram.com/learningcenter/

Modes of Operation in Mathematica

Mathematica can be run locally on an office workstation, directly on a server or cluster from its head node, or across the network between an office-local client and a remote server (a cluster for instance). It can be run serially or in parallel; its licenses can be provided locally or via a network-resident license server; and it can be run in command-line or GUI mode. The details of installing and running Mathematica on a local office workstation are left to the user. Those modes of operation important to the use of CUNY's HPC resources are discussed here.

Selecting Between GUI and Command-Line Mode

The use of command-line mode or GUI mode is determined by the Mathematica command selected. To use the Mathematica GUI, enter the following command at the user prompt:

$mathematica

To use Mathematica Command Line Interface (CLI), enter:

$math

More detail on these and other Mathematica commands is available through the 'man' command, as in:

$man mathematica
$man math
$man mcc

The lines above provide documentation on the GUI, CLI, and Mathematica C-compiler, respectively.

A Note on Fonts on Unix and Linux Systems

If you have Mathematica installed on your local system, you should already have the correct fonts available for local use. However, when displaying the Mathematica GUI (via X11 forwarding) on your local system while running remotely, some additional preparation may be required to provide X11 locally with the fonts that Mathematica requires. The procedure for setting this up is presented here.

The Mathematica GUI interface supports BDF, TrueType, and Type1 fonts. These fonts are automatically installed for local use by the MathInstaller. Your workstation or personal computer will have access to these fonts if you have installed Mathematica for local use. However, if the Mathematica process is installed and running only on a remote system at the CUNY HPC Center (say ATHENA), then X11 and the Mathematica GUI being displayed on your local machine (through X11 port forwarding) must know where to find the Mathematica fonts locally. Typically, the Mathematica fonts must be added to your local workstation's X11 font path using the 'xset' command, as follows.

First, you must create a client-local directory into which to copy the fonts, for example on a Linux system cd $HOME; mkdir Fonts. Next, you must copy the Mathematica font directories into this local directory from their remote location on ATHENA. They are currently stored in the directory:

/share/apps/mathematica7/SystemFiles/Fonts/

To create local copies in the 'Fonts' directory you created, execute the following commands from your local desktop (this assumes that secure copy (scp) is available on your desktop system):

$
$mkdir Fonts
$
$cd Fonts
$scp -r your.account@athena.csi.cuny.edu:/share/apps/mathematica7/SystemFiles/Fonts/*   .
$
$ls -l
drwxr-xr-x 2 your.account users   4096 Nov  3 16:07 AFM
drwxr-xr-x 2 your.account users 45056 Nov  3 16:08 BDF
drwxr-xr-x 2 your.account users   4096 Nov  3 16:07 SVG
drwxr-xr-x 2 your.account users   4096 Nov  3 16:07 TTF
drwxr-xr-x 2 your.account users   4096 Nov  3 16:07 Type1
$

After you have copied the remote font directories into your local directory, run the following X11 'xset' commands locally:

xset fp+ ${HOME}/Fonts/Type1; xset fp rehash
xset fp+ ${HOME}/Fonts/BDF;    xset fp rehash

For optimal on-screen performance, the Type1 font path should appear before the BDF font path. Hence, ${HOME}/Fonts/Type1 should appear before ${HOME}/Fonts/BDF in the path. You can check font path order by executing the command:

xset q
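To avoid re-entering these 'xset' commands in every session, they can be placed in a local X startup script (a hedged sketch; the exact file, '~/.xprofile' here, depends on your local desktop environment):

 # ~/.xprofile -- run when the local X session starts
 xset fp+ ${HOME}/Fonts/Type1
 xset fp+ ${HOME}/Fonts/BDF
 xset fp rehash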

Additional information on handling Mathematica fonts can be found at http://reference.wolfram.com/mathematica/tutorial/FontsOnUnixAndLinux.html

Using Mathematica on KARLE

Karle is a standalone, four-socket, 4 x 6 = 24 core, head-node-like system and is highly capable. Karle's 24 Intel E740-based cores run at 2.4 GHz, and Karle has a total of 96 Gbytes of memory, or 4 Gbytes per core. Users can run GUI applications on Karle or work from the CLI; selecting between GUI and command-line mode is described above.

Serial Job Example

If Mathematica is started in interactive mode (GUI or CLI), users can enter Mathematica commands as they normally would:

$ /share/apps/mathematica/8.0/Executables/math
Mathematica 8.0 for Linux x86 (64-bit)
Copyright 1988-2011 Wolfram Research, Inc.

In[1]:= Print["Hello World!"]
Hello World!

In[2]:= Table[Random[],{i,1,10}]

Out[2]= {0.22979, 0.168789, 0.257107, 0.724029, 0.466558, 0.588178, 0.186516, 
 
>    0.957024, 0.950642, 0.938009}

In[3]:= Exit[]
$

Alternatively one may put these commands into a text file:

$ cat test.nb
Print["Hello World!"]
Table[Random[],{i,1,10}]
Exit[]

$

and run it using:

/share/apps/mathematica/8.0/Executables/math < test.nb

The following output will be received:

Mathematica 8.0 for Linux x86 (64-bit)
Copyright 1988-2011 Wolfram Research, Inc.

In[1]:= Hello World!

In[2]:= 
Out[2]= {0.67778, 0.737257, 0.862751, 0.623122, 0.253662, 0.541513, 0.776872, 
 
>    0.424682, 0.934039, 0.190007}

In[3]:= 

Parallel Job Example

To run parallel computations in Mathematica on Karle, first start the required number of kernels (the CUNY HPC license allows up to 16 kernels) and then run the actual computation. Consider the following example:

$ cat parallel.nb 

LaunchKernels[8]

With[{base = 10^1000, r = 10^10}, WaitAll[Table[ParallelSubmit[
     While[! PrimeQ[p = RandomInteger[{base, base + r}]], Null]; 
     p], {$KernelCount}]] - base]
$
$
$ /share/apps/mathematica/8.0/Executables/math < parallel.nb 
Mathematica 8.0 for Linux x86 (64-bit)
Copyright 1988-2011 Wolfram Research, Inc.

In[1]:= 
In[1]:= 
Out[1]= {KernelObject[1, local], KernelObject[2, local], 
 
>    KernelObject[3, local], KernelObject[4, local], KernelObject[5, local], 
 
>    KernelObject[6, local], KernelObject[7, local], KernelObject[8, local]}

In[2]:= 
In[2]:= 
Out[2]= {4474664203, 8096247063, 9746330049, 4733134789, 2879419863, 
 
>    377023287, 7848087693, 8139999951}

In[3]:= 
$
The statement LaunchKernels[8] starts 8 local kernels. The rest of the notebook runs the parallel evaluation on those 8 kernels.

Submitting Batch Jobs to the CUNY ATHENA Cluster

Currently, there is no simple and secure method of submitting Mathematica jobs from a remote (user-local or desktop) CUNY installation of Mathematica to ATHENA. This is something that is being pursued. In the meantime, both serial and parallel Mathematica jobs can be submitted from ATHENA's head node, either by constructing a standard batch job or through Mathematica's CLI and GUI built-in batch submission features. To ease the process of debugging such work, we recommend that users test their Mathematica command sequences locally on smaller, but similar, cases before submitting their work to the cluster. The standard batch submission process is simple to set up and imposes the smallest burden on ATHENA's head node. Mathematica's built-in batch submission feature can be used from the Mathematica CLI or through its GUI. Using the GUI requires setting up X11 forwarding (potentially through more than one host, which is explained below), and imposes a greater burden on ATHENA's head node.

Serial Batch Jobs Run with 'qsub' Using a Mathematica Command (Text) File

In the following example, a batch job is created around a locally pre-tested Mathematica command sequence that is then submitted to ATHENA's batch queueing system using the qsub command. The simple Mathematica command sequence shown here computes a matrix of integrals and prints out every element of that matrix. Any valid sequence of Mathematica commands provided in a notebook file, whether tested on an office Mathematica installation or on the cluster head node itself, could be used in this example.

When working remotely from an office or a classroom, a user would validate their command sequence on their local workstation (via a smaller local test run), modify it incrementally to make use of the additional resources available on ATHENA, and then copy, paste, and save the Mathematica command sequence in a notebook file (file.nb) on ATHENA. This last step would be done through a text editor like 'vi' or 'emacs' from a cluster terminal window. From a Windows desktop, the free, secure Windows-to-Linux terminal emulation package, PuTTY, could be used. From a Linux desktop, connecting with secure shell 'ssh' would be the right approach.

Below, a notebook file called "test_run.nb", which does a serial (single worker-kernel) integral calculation (and might have been tested on the user's office Mathematica installation), has been saved on ATHENA from a 'vi' session. Its contents are listed here:

$
$ cat test_run.nb

Print ["Beginning Integral Calculations"]; p=5;
Timing[matr = Table[Integrate[x^(j+i),{x,0,1}], {i,1,p-1}, {j,1,p-1}]//N];
For[i=1, i<p, i++, For[j=1, j<p, j++, Print[matr[[i]][[j]]]]];
Print ["Finished!"];
Quit[];

$

As a serial Mathematica job, this job executes on just one core of just one of ATHENA's compute nodes. The simple batch script offered to 'qsub' to run this job (we will call it serial_run.math here) is listed below. The script is written for PBS Pro, which became the workload manager on ATHENA on 11-18-09. For details on PBS Pro, see the section on using the PBS Pro workload manager elsewhere in the CUNY HPC Wiki.

$
$cat serial_run.math

#!/bin/bash
#PBS -N mmat7_serial1
#PBS -q production
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

cd $HOME/my/working/directory

math -run "<<test_run.nb"

$

This script runs on a single processor (core) within a single ATHENA compute node, invoking a single Mathematica kernel instance. The '-N mmat7_serial1' option names the job 'mmat7_serial1'. The job is directed to ATHENA's production routing queue, which reads the script's resource request information, and places it in the appropriate execution queue. The '-l select=1:ncpus=1:mem=1920mb' option requests one resource 'chunk' composed of 1 processor (core) and 1920 Mbytes of memory. The '-l place=free' option instructs PBS Pro to place the job where it wishes, which will be on the compute node with the lowest load average. The '-V' option ensures that the current local Unix environment is pushed out to the compute node that runs the job. Because this is a batch script with no connection to the terminal, the CLI version of the Mathematica command, 'math', is used.

Save this script in a file for your future use, for example in "serial_run.math". With few modifications, it can be used to run most serial Mathematica batch jobs on ATHENA.

To run this job script use the command:

 qsub serial_run.math 

Like any other batch jobs submitted using 'qsub', you can check the status of your job by running the command 'qstat' or 'qstat -f JID'. Upon completion, the output generated by the job will be written to the file 'mmat7_serial1.oXXXX', where the XXXX is the job ID (number) of the job.

Here is the output from this sample serial batch job:

Mathematica 7.0 for Linux x86 (64-bit)
Copyright 1988-2009 Wolfram Research, Inc.

Beginning Integral Calculations

0.333333
0.25
0.2
0.166667
0.25
0.2
0.166667
0.142857
0.2
0.166667
0.142857
0.125
0.166667
0.142857
0.125
0.111111

Finished!

SMP-Parallel Batch Jobs Run with 'qsub' Using a Mathematica Command (Text) File

The above serial notebook file and serial PBS Pro submit script can now be modified to run in parallel (SMP-parallel) within a single ATHENA compute node, using all of that node's resources--all four cores (on ATHENA) and all of its memory. The changes to the notebook file shown below are the same as those one would make to run the serial job SMP-parallel on a multicore office system.


$
$ cat test_runp.nb

Print["Beginning Integral Calculations"]; p = 5;
DistributeDefinitions[p];
Timing[matr =   
        ParallelTable[
            Integrate[x^(j + i), {x, 0, 1}], {i, 1, p - 1}, {j, 1, p - 1}]//N];
        For[i = 1, i < p, i++, For[j = 1, j < p, j++, Print[matr[[i]][[j]]]]];
Print["Finished!"];
Quit[];

$

The SMP-parallel modified PBS Pro script looks like this:

$
$cat parallel_run.math

#!/bin/bash
#PBS -N mmat7_smp1
#PBS -q production
#PBS -l select=1:ncpus=4:mem=7680mb
#PBS -l place=pack
#PBS -V

cd $HOME/my/working/directory

math -run "<<test_runp.nb"

$

This script runs within a single ATHENA compute node, but invokes 4 Mathematica kernel instances. The '-N mmat7_smp1' option names the job 'mmat7_smp1'. The job is directed to ATHENA's production routing queue just as before, which reads the script's resource request information, and places it in the appropriate execution queue. The '-l select=1:ncpus=4:mem=7680mb' option is different from before. It still requests one resource 'chunk', but now one with 4 processors (4 cores) and 7680 Mbytes of memory. The '-l place=pack' option is also different. The resources requested just fit within a single physical compute node, and 'place=pack' does just that, instructing PBS Pro to place (pack) the job and the requested 4-core resource 'chunk' onto a single physical compute node. As before, the '-V' option ensures that the environment local to the head node is pushed out to the compute node that runs the job. The CLI version of the Mathematica command, 'math', is used again here.
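As with the serial example, this script is submitted to PBS Pro with 'qsub' from the directory containing the notebook and script files:

 qsub parallel_run.math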

Like any other batch jobs submitted using 'qsub', you can check the status of your job by running the command 'qstat' or 'qstat -f JID'. Upon completion, the output generated by the job will be written to the file 'mmat7_smp1.oXXXX', where the XXXX is the job ID (number) of the job.

The output from this SMP-parallel batch job is almost identical to the serial job output. The results are the same, as one would expect, but there is a notification that multiple kernels were used to complete the work:

Mathematica 7.0 for Linux x86 (64-bit)
Copyright 1988-2009 Wolfram Research, Inc.
Beginning Integral Calculations

LaunchKernels::launch: Launching 4 kernels...

SubKernels`Protected`kernelFlush::time:
   Operation LinkWrite timed out after 15. seconds.

SubKernels`Protected`kernelFlush::time:
   Operation LinkWrite timed out after 15. seconds.

SubKernels`Protected`kernelFlush::time:
   Operation LinkWrite timed out after 15. seconds.

General::stop: Further output of SubKernels`Protected`kernelFlush::time
     will be suppressed during this calculation.

0.333333
0.25
0.2
0.166667
0.25
0.2
0.166667
0.142857
0.2
0.166667
0.142857
0.125
0.166667
0.142857
0.125
0.111111
Finished!

Submitting jobs to a cluster that require more parallel resources than are available in a single compute node (4 cores in the case of ATHENA) demands a different approach, in which a separate PBS Pro job is started for each worker (kernel) instance. Currently, this can be done only within the Mathematica 7.0 CLI or GUI framework. Submitting these non-SMP, distributed parallel jobs is described in the next section.

Distributed Parallel Batch Jobs Run Directly from the Mathematica CLI or GUI

To submit distributed batch work to the ATHENA cluster compute nodes using the Mathematica CLI, users must take the following steps:

1) Login to the ATHENA head node (athena.csi.cuny.edu) using 'ssh' as you normally would:

$ssh your.name@athena.csi.cuny.edu

If you are not on CUNY's CSI campus, you will have to access ATHENA through the CSI gateway system, neptune.csi.cuny.edu. (Note: X11 forwarding is not needed to use the Mathematica CLI.)

2) Start Mathematica using its CLI command:

$math

3) Enter the required notebook commands to the Mathematica CLI prompt to complete your work on the cluster compute nodes.

This involves starting a number of paired PBS Pro jobs and Mathematica 7.0 worker-kernel processes on the compute nodes from within the CLI (or GUI) interface. Two jobs would be needed to run 2-way distributed parallel jobs, 4 jobs would be needed to run 4-way distributed parallel jobs, and so on. Once each PBS job (worker-kernel) is started, subsequent submissions of parallel work from the CLI will be partitioned among them in a manner similar to the process that occurs on a multi-core office workstation. The difference in this case is that each kernel runs as a distinct PBS job with its own PBS job ID. This is visible in the output from 'qstat', which will show one PBS job for each kernel spawned. The number of kernels spawned can be selected by the user. Initially, while becoming familiar with this process, CUNY HPCC recommends users work with 2 or 4 kernels (PBS jobs). Later, more kernels can be requested, subject to reasonable scaling and proven performance gains.

To assist users in starting these PBS jobs (worker-kernels), a setup template has been provided in:

/share/apps/mathematica7/AddOns/Applications/ClusterIntegration/PBS.m
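If you wish to customize the template, it can first be copied into a local working directory (a minimal sketch; the destination '~/my_math' is only illustrative):

 mkdir -p ~/my_math
 cp /share/apps/mathematica7/AddOns/Applications/ClusterIntegration/PBS.m ~/my_math/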

This template can be used as is directly from within a Mathematica 7.0 CLI session, or a customized local copy can be referenced instead. The steps required to initiate the PBS jobs (worker-kernels) and submit work to them from the Mathematica CLI are presented here:

$
$math

Mathematica 7.0 for Linux x86 (64-bit)
Copyright 1988-2009 Wolfram Research, Inc.

In[1]:= <<ClusterIntegration`

In[2]:= LaunchKernels[PBS["localhost"],2]

Out[2]= {KernelObject[1, localhost], KernelObject[2, localhost]}

In[3]:= ParallelEvaluate[{$MachineID,$ProcessID}]

Out[3]= {{6525-95739-71389, 3075}, {6501-95744-47317, 32631}}

In[4]:= Print["Beginning Integral Calculations"]; p = 5;

Beginning Integral Calculations

In[5]:= DistributeDefinitions[p];

In[6]:= Timing[matr = ParallelTable[Integrate[x^(j + i), {x, 0, 1}], {i, 1, p - 1}, {j, 1, p - 1}]//N];

In[7]:= For[i = 1, i < p, i++, For[j = 1, j < p, j++, Print[matr[[i]][[j]]]]];

0.333333
0.25
0.2
0.166667
0.25
0.2
0.166667
0.142857
0.2
0.166667
0.142857
0.125
0.166667
0.142857
0.125
0.111111

In[8]:= CloseKernels[];

In[9]:= ^d

$
$

The work done by each Mathematica command in the CLI session above is described here.

In[1]:= <<ClusterIntegration`

The ClusterIntegration command reads in the default PBS.m template file from its location in the Mathematica 7.0 installation tree. The PBS.m template defines the options that PBS Pro uses when starting up each batch job used by Mathematica.

In[2]:= LaunchKernels[PBS["localhost"],2]

The LaunchKernels command actually starts each PBS job (2 in this case) and their associated Mathematica worker-kernel instances. Note that this step can take a minute or two to complete.

In[3]:= ParallelEvaluate[{$MachineID,$ProcessID}]

The ParallelEvaluate command is a general purpose parallel command that in this case is used simply to print out information on the Mathematica kernel-worker processes that have been started by PBS Pro.

In[4]:= Print["Beginning Integral Calculations"]; p = 5;

This command announces the beginning of a parallel integral calculation and sets the variable 'p' to 5.

In[5]:= DistributeDefinitions[p];

The DistributeDefinitions command pushes out the value of 'p' to each of the worker-kernels.

In[6]:= Timing[matr = ParallelTable[Integrate[x^(j + i), {x, 0, 1}], {i, 1, p - 1}, {j, 1, p - 1}]//N];

This command computes the values of the integrals in parallel and times the process.

In[7]:= For[i = 1, i < p, i++, For[j = 1, j < p, j++, Print[matr[[i]][[j]]]]];

The nested For loops print out the results.

In[8]:= CloseKernels[];

The CloseKernels command terminates the PBS worker jobs running on the compute nodes of the cluster. They will then no longer be visible in 'qstat', but could be re-launched with another LaunchKernels command. Prior to termination any number of additional parallel commands may be submitted to the running PBS worker jobs.

In[9]:= ^d

The control-D terminates the Mathematica 7.0 CLI session completely.

The above command sequence is presented just for illustration. Any sequence of parallel commands that a user could run on their multi-core desktop system could be run in this fashion.

To submit work to the ATHENA cluster compute nodes using the Mathematica GUI, the user's local system must be a Linux- or Unix-based system and they must take the following steps:

1) Install Mathematica's fonts on your local Linux machine (see "A Note on Fonts on Unix and Linux Systems" above).

2) Login to the ATHENA head node (athena.csi.cuny.edu) with 'ssh' and with X11 forwarding enabled as you normally would.

$ssh -X your.name@athena.csi.cuny.edu

If you are not on CUNY's CSI campus, you will have to access ATHENA through the CSI gateway system, NEPTUNE (neptune.csi.cuny.edu). The responsiveness of your connection will be reduced because you will have to forward X11 packets twice.

3) Start Mathematica using the GUI command:

$mathematica

The same notebook test command sequence presented above can be used here with the GUI.

Further information on the parallel capabilities of Mathematica 7.0 can be found at Mathematica's website at http://reference.wolfram.com/mathematica/guide/ParallelComputing.html

Submitting Batch Jobs from Remote Locations to CUNY's ATHENA Cluster

A method for doing this is being developed and tested.

For more information on Mathematica:

  • Online documentation is available through the Help menu within the Mathematica notebook front end.
  • The Mathematica Book, 5th Edition (Wolfram Media, Inc., 2003) by Stephen Wolfram.
  • The Mathematica Book is available online.
  • Additional Mathematica documentation is available online.
  • Information on the Parallel Computing Toolkit is available online.
  • Getting Started with Mathematica (Wolfram Research, Inc., 2004).
  • The Wolfram web site http://www.wolfram.com

MATLAB

The MATLAB high-performance language for technical computing integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Typical uses include:

  • Math and computation
  • Algorithm development
  • Data acquisition
  • Modeling, simulation, and prototyping
  • Data analysis, exploration, and visualization
  • Scientific and engineering graphics
  • Application development, including graphical user interface building

MATLAB is an interactive system with both a command line interface (CLI) and Graphical User Interface (GUI) whose basic data element is an array that does not require dimensioning. It allows you to solve many technical computing problems, especially those with matrix and vector formulations, in a fraction of the time it would take to write a program in a scalar, non-interactive language such as C or Fortran. Properly licensed and configured, MATLAB's compute engine can be run serially or in parallel, and on a local desktop or client, or on a remote server or cluster. From the CUNY HPC Center's new MATLAB client system, KARLE (karle.csi.cuny.edu), each of these modes of operation is supported. KARLE is a 4 socket system based on Intel E740 processors with 6 cores per socket giving it a total of 24 physical cores (the E740 processor does not support Intel hyper-threading). Each core is clocked at 2.4 GHz. KARLE includes 4 GBytes of memory per core for a total of 96 GBytes. KARLE is directly accessible from any CUNY campus using the secure shell utility (ssh -X karle.csi.cuny.edu). ALL MATLAB work at the CUNY HPC Center should be started on or from KARLE which was purchased to replace the MATLAB functionality of NEPTUNE and TYPHOON. MATLAB jobs should NOT be run from the head nodes of either BOB or ANDY, the destination systems for MATLAB batch jobs.

Starting MATLAB in GUI or CLI Mode on KARLE

As mentioned above, MATLAB can be run either from its Graphical User Interface (GUI) or from its Command Line Interface (CLI). By default, MATLAB selects the mode for you based on how you have logged into KARLE. If you have logged in using the '-X' option to 'ssh', which allows your 'ssh' session to support the X11 network graphical interface, then MATLAB will be started in GUI mode. If the '-X' option is not used, it will be started in CLI mode. The following examples show each approach.

Setting up MATLAB to run in GUI mode on KARLE:

local$ 
local$ ssh -X my.account@karle.csi.cuny.edu

Notice:
  Users may not access these CUNY computer resources 
without authorization or use it for purposes beyond 
the scope of authorization. This includes attempting
 to circumvent CUNYcomputer resource system protection 
facilities by hacking, cracking or similar activities,
 accessing or using another person's computer account, 
and allowing another person to access or use 
the user's account. CUNY computer resources may not 
be used to gain unauthorized access to another 
computer system within or outside of CUNY. 
Users are responsible for all actions performed 
from their computer account that they permitted or 
failed to prevent by taking ordinary security precautions.

my.account@karle.csi.cuny.edu's password: 
Last login: Fri Feb 17 11:50:02 2012 from 163.238.130.1

[myaccount@karle ~]$
[myaccount@karle ~]$ matlab

(MATLAB GUI windows are displayed on your screen)

Setting up MATLAB to run in CLI mode on KARLE:

local$ 
local$ ssh my.account@karle.csi.cuny.edu

Notice:
  Users may not access these CUNY computer resources 
without authorization or use it for purposes beyond 
the scope of authorization. This includes attempting
 to circumvent CUNYcomputer resource system protection 
facilities by hacking, cracking or similar activities,
 accessing or using another person's computer account, 
and allowing another person to access or use 
the user's account. CUNY computer resources may not 
be used to gain unauthorized access to another 
computer system within or outside of CUNY. 
Users are responsible for all actions performed 
from their computer account that they permitted or 
failed to prevent by taking ordinary security precautions.

my.account@karle.csi.cuny.edu's password: 
Last login: Fri Feb 17 11:50:02 2012 from 163.238.130.1

[myaccount@karle ~]$
[myaccount@karle ~]$ matlab

Warning: No display specified.  You will not be able to display graphics on the screen.

                                               < M A T L A B (R) >
                                     Copyright 1984-2011 The MathWorks, Inc.
                           Version 7.11.1.866 (R2010b) Service Pack 1 64-bit (glnxa64)
                                                February 15, 2011

 
  To get started, type one of these: helpwin, helpdesk, or demo.
  For product information, visit www.mathworks.com.
 
>> 
>>

(MATLAB has defaulted to CLI mode because it does not know where to display things)

Once MATLAB has started in either GUI or CLI mode on KARLE, you should be able to proceed as you would from your own desktop for interactive work, or according to the instructions for batch work on BOB or ANDY presented in the sections below.

Modes of Operation: Local versus Remote (Batch)

Client-local MATLAB jobs (those run directly on KARLE) can be run in serial or in parallel mode. Server-remote MATLAB jobs submitted from KARLE (via either MATLAB's GUI or CLI) can also be run serially or in parallel on the CUNY HPC Center's BOB (bob.csi.cuny.edu) or ANDY (andy.csi.cuny.edu) clusters via the PBS Pro batch scheduler. On KARLE all 24 cores are available, although individual parallel jobs on KARLE should be limited to 8 cores. On BOB and ANDY (using the HPC Center's MATLAB DCS license) up to a total of 32 cores may be used by a single job, or among a collection of competing jobs. Single jobs may be limited to 12 or fewer cores in the future based on demand, total license seats, and evolving usage patterns.

Modes of Operation: Serial versus Parallel

MATLAB also gives its users the option to run jobs serially or in parallel. Parallel jobs (whether local or remote) can be divided into several distinct categories:

Loop-limited parallel work that relies on MATLAB's 'parfor' loop construct to divide the work within looping structures where loop iterations are fully independent. This approach is similar to traditional thread-based parallel programming models such as OpenMP and POSIX Threads.

Distributed or embarrassingly parallel work that relies on MATLAB's 'createTask' construct and divides completely independent workloads among a collection of processors that DO NOT need to communicate. This approach is similar in spirit to the distributed data-processing frameworks MapReduce and Hadoop used by Google and Yahoo.

Single Program Multiple Data (SPMD) parallel work that relies on MATLAB's 'spmd' and 'labindex' constructs to partition the work done on the input data among largely identical, but coupled, paths through a single program. This approach is similar to the traditional MPI programming model.

Graphical Processing Unit (GPU) parallel work that relies on MATLAB functions and/or user provided routines that are GPU enabled. This approach is MATLAB's method of delivering GPU accelerated performance while limiting the amount of specialized programming that GPUs typically require (i.e. CUDA). This capability is only available from batch jobs submitted to ANDY from KARLE.

Each of these parallel job types (as well as serial work) can be run on KARLE interactively (or in the background), or as jobs submitted from KARLE to BOB or ANDY and its PBS Pro batch scheduler; the one exception is GPU-parallel MATLAB work, which cannot be run directly and interactively on KARLE.

Computing PI Serially on KARLE

To illustrate each parallel model, a MATLAB script implementing the classic algorithm for computing PI by numerical integration of the function for the arctangent (1/(1+x**2)) will be presented, coded using each parallel approach, first for local computation on KARLE and then for remote submission to BOB.

First, we present a serial MATLAB script for computing PI using numerical integration locally on KARLE:

%  ----------------------------------------------------------------------
%  M. E. Kress, PhD, July 2010
%  R. B. Walsh,  MS, Aug  2010
%  College of Staten Island, CUNY
%  ----------------------------------------------------------------------
%  Demo MATLAB PI Program for CUNY HPC Wiki:  Local Serial Version
%  ----------------------------------------------------------------------
%  This MATLAB script calculates PI using the trapezoidal rule from the
%  integral of the arctangent (1/(1+x**2)). This is a simple serial code
%  which uses a standard 'for' loop and runs with a matlab pool size of 1.
%  ----------------------------------------------------------------------
%
%  Clear environment and set output format.
%
  clear all; format long eng
%
%  Set processor (lab) pool size
%
  matlabpool open 1
  numprocs = matlabpool('size');
%
%  Open an output file.
%
  fid=fopen('/home/richard.walsh/matlab/serial_PI.txt','wt');
%
%  Define and initialize global variables
%
  mypi = 0.0;
  ttime = 0.0;
%
%  Define and initialize 'for' loop integration variables.
%
  nv = 10000;    %  Set default number of intervals and accuracy
% nv = input('Please define the number of intervals: ')
  ht = 0.0;
  wd = 1.0 / nv;
%
%  Start stopwatch timer to measure compute time.
%
  tic;
%
% This serial 'for' loop, loops over all of 'nv', and computes and sums
% the arctangent function's value at every interval into 'ht'.
%
  for i = 1 : nv
    x = wd * (i - 0.50);
    ht = ht + farc(x);
  end
%
% The numerical integration is completed by multiplying the summed
% function values by the constant interval (differential) 'wd' to get
% the area under the curve.
%
  mypi = wd * ht;
%
%  Stop stopwatch timer.
%
  ttime = toc;
%
% Print total time and calculated value of PI.
%
 fprintf('Number of intervals chosen (nv) was: %d\n', nv);
 fprintf('Number of processors (labs) used was: %d\n', numprocs);
 fprintf('Computed value for PI was: %3.20f\n with error of %3.20f\n', mypi, abs(pi-mypi));
 fprintf('Time to complete the computation was: %6.6f\n', ttime);
%
%
 fprintf(fid,'Number of intervals chosen (nv) was: %d\n', nv);
 fprintf(fid,'Number of processors (labs) used was: %d\n', numprocs);
 fprintf(fid,'Computed value for PI was: %3.20f\n with error of %3.20f\n', mypi, abs(pi-mypi));
 fprintf(fid,'Time to complete the computation was: %6.6f\n', ttime);
%
%   Close output file.
%
 fclose(fid);
%
 matlabpool close;
%
% End of script
%

This script can be entered into the MATLAB CLI or GUI command window (or run simply as 'matlab < serial_PI.m'). It will compute PI to an accuracy that depends on the number of intervals (nv is set to 10,000 here). All the work is done by a single processor that MATLAB refers to as a 'lab'. We will not go into the details of the algorithm here, but readers can find many descriptions of it on the Internet. It is completely defined within the scope of the 'for' loop and the statement that follows it. A key feature of the script is the definition of the MATLAB pool size:

matlabpool open 1;

This statement is not actually required for this serial job, but we include it to illustrate the changes that will take place in moving to parallel operation. Here the MATLAB pool size is set to 1, which forces serial operation. The 'mypi' variable will contain the result of the entire integration (rather than just partials) computed by the single processor ('lab') in the pool. This processor completes every iteration in the 'for' loop. The function 'farc()', which computes 1/(1 + x**2) for each x, must be made available in the MATLAB working directory. While this job runs locally on KARLE and will pick up the file where it was created, when we later submit the job to BOB, 'farc()' will need to be transferred to BOB as a job-dependent file. The job is timed using the 'tic' and 'toc' MATLAB library calls. The accuracy of the computed result is measured by comparing it to MATLAB's internal value for PI (pi) used in the print statements.
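The contents of 'farc.m' are not listed in this Wiki. A minimal sketch of what it might contain, based only on the integrand 1/(1+x**2) described above, is:

 $ cat farc.m
 function y = farc(x)
 %  farc: integrand for the arctangent-based PI calculation, 1/(1 + x^2)
   y = 1.0 ./ (1.0 + x.^2);
 end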

Computing PI Using Loop-Local Parallelism on KARLE

Now, a modified version of the script that runs in parallel using the 'parfor' loop construct is presented.

%  -------------------------------------------------------------------------
%  M. E. Kress, PhD, July 2010
%  R. B. Walsh,  MS, Aug  2010
%  College of Staten Island, CUNY
%  -------------------------------------------------------------------------
%  Demo MATLAB PI Program for CUNY HPC Wiki:  Local Thread-Parallel Version
%  -------------------------------------------------------------------------
%  This MATLAB script calculates PI using the trapezoidal rule from the
%  integral of the arctangent (1/(1+x**2)). This is a simple parallel code
%  which uses a 'parfor' loop and runs with a matlab pool size of 4.
%  -------------------------------------------------------------------------
%
% Clear environment and set output format.
%
 clear all; format long eng;
%
%   Set processor (lab) pool size
%
 matlabpool open 4;
 numprocs = matlabpool('size');
%
%   Open an output file.
%
 fid=fopen('/home/richard.walsh/matlab/parfor_PI.txt','wt');
%
%   Define and initialize global variables
%
 mypi = 0.0;
 ttime = 0.0;
%
%   Define and initialize 'for' loop integration variables.
%
  nv = 10000;    %  Set default number of intervals and accuracy
% nv = input('Please define the number of intervals: ')
  ht = 0.0;
  wd = 1.0 / nv;
%
% Start stopwatch timer to measure compute time.
%
  tic;
%
% This parallel 'parfor' loop divides the interval count 'nv' implicitly among the
% processors (labs) and computes partial sums on each of the arctangent function's value
% at the assigned intervals. MATLAB then combines the partial sums implicitly
% as it leaves the 'parfor' loop construct placing the global sum into 'ht'.
%
  parfor i = 1 : nv
    x = wd * (i - 0.50);
    ht = ht + farc(x);
  end
%
% The numerical integration is completed by multiplying the summed
% function values by the constant interval (differential) 'wd' to get
% the area under the curve.
%
  mypi = wd * ht;
%
%  Stop stopwatch timer.
%
  ttime = toc;
%
% Print total time and calculated value of PI.
%
fprintf('Number of intervals chosen (nv) was: %d\n', nv);
fprintf('Number of processors (labs) used was: %d\n', numprocs);
fprintf('Computed value for PI is: %3.20f\n with error of %3.20f\n', mypi, abs(pi-mypi));
fprintf('Time to complete the computation was: %6.6f\n', ttime);
%
%
fprintf(fid,'Number of intervals chosen (nv) was: %d\n', nv);
fprintf(fid,'Number of processors (labs) used was: %d\n', numprocs);
fprintf(fid,'Computed value for PI is: %3.20f\n with error of %3.20f\n', mypi, abs(pi-mypi));
fprintf(fid,'Time to complete the computation was: %6.6f\n', ttime);
%
%   Close output file.
%
fclose(fid);
%
matlabpool close;
%
% End of script
%


Focusing on the changes, first we see that the MATLAB pool size has been increased to 4 with:

matlabpool open 4;

Next, the 'for' loop has been replaced by the 'parfor' loop, which as the comments make plain, divides the loop's iterations among the 4 processors ('labs') in the pool.

  parfor i = 1 : nv
    x = wd * (i - 0.50);
    ht = ht + farc(x);
  end

The iterations in the loop are assumed to be entirely independent, and by default MATLAB assigns blocks of iterations to each processor (lab) statically and in advance rather than dynamically as each iteration is completed. So, in this case iterations 1 to 2,500 would be assigned to processor 1, iterations 2,501 to 5,000 to processor 2, and so on. Another important feature of the 'parfor' construct is that it automatically generates the global result from each processor's partial result as the loop exits and places that global value in the variable 'ht'.

These are the important differences. When this job is run, the wall-clock time to get the result should be reduced, and the 'numprocs' variable will report that 4 processors were used for the job. An important thing to note is that getting parallel performance gains using the 'parfor' construct requires very few MATLAB script modifications once the serial version of the code has been created. At the same time, this approach is limited to simpler cases where the intrinsic parallelism of the algorithm is confined to loop-level structures and the processors used to do the work are connected to the same memory space (i.e. they are within the same physical compute node).

Computing PI Using SPMD Parallelism on KARLE

The next step is to modify the above 'parfor' loop-local parallelism to use MATLAB's much more general SPMD parallel programming model. Here is the same algorithm adapted to use MATLAB SPMD constructs:

%  ----------------------------------------------------------------------
%  M. E. Kress, PhD, July 2010
%  R. B. Walsh,  MS, Aug  2010
%  College of Staten Island, CUNY
%  ----------------------------------------------------------------------
%  Demo MATLAB PI Program for CUNY HPC Wiki:  Local SPMD Parallel Version
%  ----------------------------------------------------------------------
%  This is a MATLAB SPMD (Single Program Multiple Data) or MPI-like version
%  of the parallel routine for computing PI using the trapezoidal rule
%  and the integral of the arctangent (1/(1+x**2)). This example uses the
%  MATLAB 'labs' abstraction to ascertain the identity of each processor and
%  to assign it its share of the work. Versions of the algorithm appear in
%  "Computational Physics, 2nd Edition" by Landau, Paez, and Bordeianu and
%  "Using MPI" by Gropp, Lusk, and Skjellum.
%  ----------------------------------------------------------------------
%
%  Clear environment and set output format
%
  clear all; format long eng;
%
%  Set processor (lab) pool size
%
  matlabpool open 4;
  numprocs = matlabpool('size');
%
%   Open an output file.
%
  fid=fopen('/home/richard.walsh/matlab/spmd_PI.txt','wt');
%
%  Start the SPMD block which executes the same MATLAB commands on all processors (labs)
%
spmd
%
%  Find out which processor I am using the 'labindex' variable
%
   myid = labindex;
%
%  Define and set composite array variables
%
  mypi = 0.0;
  totpi = 0.0;
  ttime = 0.0;
%
%   Define and initialize 'for' loop integration variables.
%
  nv = 10000;   %  Set default number of intervals and accuracy
% nv = input('Please define the number of intervals: ')
  ht = 0.0;
  wd = 1.0 / nv;
%
%  Start stopwatch timer on processor 1 to measure compute time
%  
  if (myid == 1)
     tic;
  end
%
%  This parallel 'for' loop divides the interval count 'nv' explicitly among the processors
%  (labs) using the processor id 'myid' and the loop step size defined by 'numprocs'. 
%  The partial sums from each processor of the arctangent function's value are then 
%  combined explicitly via the call to the 'gplus()' global reduction function. Because this
%  is part of the SPMD block the global sum is generated on each processor.
%
  for i = myid : numprocs : nv
     x = wd * (i - 0.50);
     ht = ht + farc(x);
  end
%
  mypi = wd * ht;
%
%  The variable 'totpi' is a composite array with one storage location for each 
%  processor (lab) each of which gets the grand total generated by the 'gplus()'
%  function. The grand total delivered to processor (lab) 1, for instance, is stored
%  in totpi{1}.
%
   totpi = gplus(mypi);
%
%  Complete stopwatch timing of computation (including gather by 'gplus()') on 
%  processor (lab) 1. Because the 'gplus()' call is a blocking operation this time
%  is the same as the time to finish the whole calculation. 
%
  if (myid == 1)
     ttime = toc;
  end
%
%  Terminate the SPMD block of the code
%
end
%
%   Print computation time and calculated value of PI. Use the index for processor
%   1 to access processor 1 specific array elements of the composite variables.
%
fprintf('Number of intervals chosen (nv) was: %d\n', nv{1});
fprintf('Number of processors (labs) used was: %d\n', numprocs);
fprintf('Computed value for PI was: %3.20f with error of %3.20f\n', totpi{1}, abs(totpi{1}-pi));
fprintf('Time to complete the computation was: %6.6f seconds\n', ttime{1});
%
%
fprintf(fid,'Number of intervals chosen (nv) was: %d\n', nv{1});
fprintf(fid,'Number of processors (labs) used was: %d\n', numprocs);
fprintf(fid,'Computed value for PI was: %3.20f with error of %3.20f\n', totpi{1}, abs(totpi{1}-pi));
fprintf(fid,'Time to complete the computation was: %6.6f seconds\n', ttime{1});
%
%   Close output file.
%
fclose(fid);
%
matlabpool close;
%
% End of script
%

Looking again at the differences between this SPMD script and the serial and 'parfor' parallel versions above, we see that the SPMD block is marked by the 'spmd' header and a terminating 'end' statement much later in the script.

spmd

.
.
.

end

This entire section of the script is run independently by each processor (lab) generated by the 'matlabpool open 4' command at the top of the script. But, if each processor runs the same section of the script, the question is how is the work divided? Would it not just be computed in its entirety and redundantly 4 times? The division is accomplished in the same way that it would be in an MPI parallel implementation of the PI integration algorithm, using processor-unique IDs and the processor count.

MATLAB provides these constructs using the 'labindex' and 'numprocs' variables within the 'spmd' block. The 'labindex' contains a unique value for each processor counting from 1 while the 'numprocs' variable is assigned the MATLAB pool size at the beginning of the script. The values for each can be used to conditionally direct and control the path of each processor through what is the same script. Here, this is most importantly visible in the 'for' loop:

  for i = myid : numprocs : nv
     x = wd * (i - 0.50);
     ht = ht + farc(x);
  end

The simple 'for' loop returns, but with a starting iteration ('myid') set from the 'labindex' of each processor and with the processor count (lab pool size) used as the step size ('numprocs') for the loop. In this way, the 'for' loop's work is explicitly divided among the 4 processors. Processor 1 gets iterations 1, 5, 9, 13, etc. Processor 2 gets iterations 2, 6, 10, 14, etc., and so on. Each processor ends up with its own unique fraction of the 'for' loop's assigned work. The variables within the SPMD block, including this loop, are MATLAB composite arrays with values and memory locations unique to each processor.

This fact has two important consequences. First, each chunk of 'for' loop work can be run on a physically separate compute node with its own memory space. Second, the sum in the variable 'ht' is only partial on each processor, and the MATLAB programmer (you) must explicitly combine the partial sums to get the correct global result for PI. This is accomplished with the 'gplus()' function in the second line after the 'for' loop with:

totpi = gplus(mypi);

The 'mypi' composite array has a unique value on each processor equal to approximately 1/4 of the value of PI. Processor-specific values can be explicitly referenced using the 'mypi{n}' expression where 'n' is the lab index or processor ID value. The 'gplus()' function is one of a class of global reduction functions that will gather partial results computed by each member of the MATLAB SPMD pool, perform a specific arithmetic operation, and then place the result in each pool member's memory space. In this case, the composite array element 'totpi{n}' on each processor in the pool will receive the global sum of the partial values of PI on each processor. There are other global reduction 'g' functions like 'gprod()', 'gmax()', 'gmin()', etc., each with their own operation type. Refer to the MATLAB website for further information. http://www.mathworks.com/products/parallel-computing/demos.html?file=/products/demos/shipping/distcomp/paralleltutorial_gop.html

The rest of the SPMD script is largely the same as the others; however, a few additional comments are in order. First, note that when the SPMD block is closed, the composite array elements are referenced explicitly in the print statements. The script prints out the results present on processor one (1). Secondly, note that the timing results were only collected from processor one (1). One might wonder whether, if processor one (1) were to complete its partial result faster than the others, the timing results gathered would be in error. This is prevented by the fact that the 'gplus()' function is blocking, which means that each processor (lab) will wait at the 'gplus()' call until all processors have received the global result. This makes the compute time from processor one (1) representative of the time for all.

This script and the others above can all be run directly from the MATLAB CLI or GUI command window on KARLE. Please explore the use of each and look at the timings generated.
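As an alternative to pasting them into the MATLAB command window, each script can also be fed to the MATLAB CLI non-interactively from a KARLE shell prompt, in the same spirit as the 'matlab < serial_PI.m' form mentioned above (a brief sketch; it assumes the SPMD script has been saved as 'spmd_PI.m', and '-nodisplay' simply suppresses graphics):

 matlab -nodisplay < spmd_PI.m > spmd_PI.log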

Running Remote Parallel Jobs on BOB

With some preparation of the communications link between KARLE and BOB, and some minor modifications, the scripts presented above can also be submitted to the PBS Pro batch queues on BOB. The modified script is transferred to BOB automatically, submitted to BOB's PBS batch queuing system, and the results are automatically returned to KARLE. The entire process can be tracked from the MATLAB GUI or CLI on KARLE, although the jobs are also visible on BOB. This process is made possible by building a $HOME file tree on BOB that mirrors the tree on KARLE, together with the secure copy ('scp') and secure shell ('ssh') commands. The procedure for setting up and running the above scripts on BOB is presented here. With the introduction of KARLE, all MATLAB users within the CUNY family (whether local to the College of Staten Island or not) have equal access to both KARLE's client-local MATLAB capability and BOB's remote cluster MATLAB capability. With this addition, all use of MATLAB from NEPTUNE, the older MATLAB client system, is deprecated.

In MATLAB, remote parallel jobs can be divided into two basic classes. A Distributed Parallel job is a workload divided among two or more fully independent MATLAB workers (processors) that generate fully independent PBS pro jobs (unique job IDs). No communication is possible or expected between the processes (jobs). When submitted to BOB from KARLE, the MATLAB client submits several independent jobs, one for each worker, to the PBS batch scheduler on BOB. Each worker works on a piece of the same problem, but runs fully independently of the others. They are queued up by PBS Pro as separately scheduled serial jobs in PBS Pro's serial execution queue, qserial. Each Distributed Parallel job has its own job ID and is run on its own compute node. In other contexts, such "Distributed Parallel" work might be referred to as embarrassingly parallel.

On the other hand, MATLAB Coupled Parallel jobs do not produce independent processes run by separate 'workers', but rather a single, coupled, parallel workload run on separate processors under MATLAB's 'labs' abstraction. Such workloads produce a single PBS batch job with one job ID, even while running on multiple processors. Coupled Parallel jobs are run in PBS Pro's parallel production qlong16 queue and have only one job ID. Communication is presumed to be required between the processes (labs) and relies on MPI-like inter-process communication. Such "Coupled Parallel" work falls into the same category as the Single Program Multiple Data (SPMD) parallel job class introduced above.

Regardless of the type of remote job (Distributed Parallel or Coupled Parallel), one must have set up two-way, passwordless 'ssh' between the submitting client (KARLE or a CSI office workstation) and BOB's head node. This is currently possible only from within CUNY's CSI campus, which is why non-CSI users must use KARLE to submit MATLAB jobs to BOB.

Licensing requirements for client-to-cluster job submission

MATLAB combines its basic tools for manipulating matrices with a large suite of application-specific libraries or 'toolboxes', including its Parallel Computing Toolbox, which is required to submit jobs to a cluster. In order to successfully run parallel MATLAB jobs on BOB, a user must have (or be able to acquire over their campus network) licenses for all the MATLAB components that their job will use. At a minimum, users must have a client-local license for MATLAB itself and the Parallel Computing Toolbox. For those who wish to submit work from deskside clients on the CSI campus, CSI has 5 combined MATLAB and Parallel Computing Toolbox node-locked client licenses to distribute on a case-by-case and temporary basis. With these two licenses and the license that the CUNY HPC Center provides on BOB for the Distributed Computing Server (DCS), basic MATLAB Distributed or Parallel jobs can be run (subject to the 'ssh' requirement above). If the job makes use of other application-specific toolboxes (e.g. Aerospace Toolbox, Bioinformatics Toolbox, Econometrics Toolbox, etc.), it will attempt to acquire those licenses from the CSI campus MATLAB license server.

Note: When running locally on KARLE or submitting batch jobs from KARLE to BOB, these license requirements are already met, although not every MATLAB toolbox has been licensed by CSI.
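
A quick way to confirm from the MATLAB prompt that these components can be licensed is the built-in 'license' function; a minimal sketch (the feature names 'MATLAB' and 'Distrib_Computing_Toolbox' are the ones that appear in the INCREMENT lines of the sample license file below):

license('test', 'MATLAB')                       % returns 1 if a MATLAB license is available
license('test', 'Distrib_Computing_Toolbox')    % returns 1 if the Parallel Computing Toolbox is licensed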

Currently, a properly configured CSI campus client that also requires an application-specific toolbox to complete its work will have two license (.lic) files installed on the system in ${MATLAB_ROOT}/licenses (the value of the MATLAB_ROOT directory can be determined on the machine of interest by typing 'matlabroot' at the MATLAB command-line prompt). The first is the node-local license (say, mylocal.lic) for MATLAB and the Parallel Computing Toolbox, and the second is the network-served license (network.lic) pointing to the campus MATLAB toolbox license server. These are read in alphabetical order at MATLAB startup to obtain proper licensing. Other licensing schemes are conceivable.

The node-local license for MATLAB and the Parallel Computing Toolbox might look something like this. The first INCREMENT block provides the node-local MATLAB capability and the second the Parallel Computing Toolbox capability:

# BEGIN--------------BEGIN--------------BEGIN
# DO NOT EDIT THIS FILE.  Any changes will be overwritten.
# MATLAB license passcode file.
# LicenseNo: 99999
INCREMENT MATLAB MLM 22 01-jan-0000 uncounted 99C9EC4D3695 \
        VENDOR_STRING=vi=30:at=187:pd=1:lo=GM:lu=200:ei=944275: \
        HOSTID=MATLAB_HOSTID=0015179549BA:000000 PLATFORMS="i86_re \
        amd64_re" ISSUED=30-Sep-2009 SN=000000 TS_OK
INCREMENT Distrib_Computing_Toolbox MLM 22 01-jan-0000 uncounted \
        E77E2F473055 \
        VENDOR_STRING=vi=30:at=187:pd=1:lo=GM:lu=200:ei=944275: \
        HOSTID="0015179549ba 0015179549bb 002219504c4f 002219504c51" \
        PLATFORMS="i86_re amd64_re" ISSUED=30-Sep-2009 SN=000000 TS_OK
# END-----------------END-----------------END

The network license for any required Applications Toolboxes would look something like this:

SERVER 163.238.11.65  000f1f8d5c66 27000
USE_SERVER

(The license files above are for illustration only, and are not functional license files.)

Within the CSI campus, a node-local license file for MATLAB, the Parallel Computing Toolbox license, and the network licenses for the MATLAB application toolboxes that CSI supports can be obtained from CUNY's HPC group. In addition, installations of MATLAB on a CSI campus client must include the current on-campus File Installation Key. This discussion does not apply to non-CSI users because the MATLAB installation on KARLE is complete.

In the future, if arrangements are made for non-CSI CUNY sites to have direct 'ssh' access to CUNY's HPC clusters at CSI, those non-CSI sites will need to provide local licensing for MATLAB itself, the Parallel Computing Toolbox, and any Applications Toolboxes they require. For all CUNY users (within and outside of CSI), the CUNY clusters at CSI provide the proper DCS licensing automatically for jobs started on the cluster as long as they arrive with the proper licenses for the Toolboxes they use.

Setting up the client and cluster environment for remote execution

A number of steps must be taken to successfully transfer, submit, and recover MATLAB jobs from the HPC cluster BOB. An important first step is to ensure that the version of MATLAB running locally is identical to the version running on the CUNY cluster, BOB. This has been taken care of for those submitting jobs from KARLE, but could be an issue for those at the CSI campus setting up deskside MATLAB clients. The CUNY HPC Center is currently running MATLAB Version R2010a; to determine the release in general, log in to BOB, run MATLAB's command-line interface (CLI), and enter MATLAB's 'version' command at the >> prompt. If identical versions are not running, the local MATLAB will detect the mismatch, assume there are potential feature incompatibilities, and refuse to submit the job. The error message produced when this occurs is not very diagnostic.

Note: Again, this does not apply to non-CSI users submitting their work from KARLE where the versions already match.
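
The same check can be scripted on the client; a minimal sketch comparing the local release string against R2010a (the release cited above):

localRelease = version('-release');     % returns e.g. '2010a'
if ~strcmp(localRelease, '2010a')
    warning('Local MATLAB release R%s does not match the R2010a release running on BOB.', localRelease);
end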

Next, two-way, passwordless secure login and file transfer must be working correctly in both directions. For Linux-to-Linux transfers this involves following the procedures outlined in the 'ssh-keygen' man page and/or referring to the numerous descriptions on the web. This includes putting the public keys generated with 'ssh-keygen' on both the client (KARLE) and the server (BOB) into the other machine's authorized_keys file. For Windows-to-Linux transfers this is usually accomplished with the help of the Windows remote login utility 'PuTTY'. Please refer to the numerous HOW-TOs on the web to complete this. Windows users who have trouble with this can send email to the CUNY HPC Center helpline 'hpchelp@csi.cuny.edu'. (Note: CSI clients that are behind a firewall or reside on a local subnet behind a router may require special configuration, including port-forwarding of return 'ssh' traffic on port 22 from BOB through the local router to the local client.)
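
One simple way to confirm from the MATLAB prompt on the client that the passwordless link is working is to shell out with MATLAB's 'system' command; a minimal sketch (the '-o BatchMode=yes' option makes ssh fail rather than prompt for a password):

[status, result] = system('ssh -o BatchMode=yes bob.csi.cuny.edu hostname');
if status ~= 0
    warning('Passwordless ssh to BOB does not appear to be configured:\n%s', result);
end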

In addition, on the cluster, passwordless 'ssh' must be allowed for the user from the head node to all of the compute nodes where the MATLAB job might run. This is the default for user accounts on BOB, but it should be checked by the user before submitting jobs. Because the home directory on the head node is shared with all the compute nodes, accomplishing this is a simple matter of including the head node's public key in the 'authorized_keys' file in the user's '.ssh' directory. Again, refer to the ssh-keygen man page or the many on-line sources for more detail.

Once passwordless 'ssh' is operational, the CUNY HPC group recommends studying the sections in MATLAB's Parallel Computing Toolbox User Guide [54]. The sections on 'Programming Distributed Jobs' and 'Programming Parallel Jobs' are particularly useful. The sub-sections titled 'Using the Generic Scheduler Interface' are specific to the topic of submitting remote jobs to the so-called 'Generic Interface', which is the term that MATLAB uses for workload managers generally (PBS, SGE, etc.). Note: Reading through these sections of MATLAB's on-line documentation is strongly recommended before submitting the test jobs provided below.

In addition, an important source of information can be found in the README files in the following directory under MATLAB's root directory or installation tree on your campus-client system or on the head node of BOB:

${MATLAB_ROOT}/toolbox/distcomp/examples/integration/pbs

There are similar directories for other common workload managers at the same level. Since, in a submission from KARLE or a campus client to BOB, there is no shared file system, users should pay particularly close attention to the contents of the 'nonshared' subdirectory in the PBS directory above. It contains guidance for both Linux and Windows clients on non-shared file systems. Further information can be found at the MATLAB website here [55] and here [56].

Computing PI Serially Remotely on BOB

Below is a fully commented MATLAB remote batch job submission script. This can be thought of as the 'boiler-plate' wrapping required to run the serial script for computing PI presented above on BOB instead of KARLE. Much of the explicit scripting presented here can be used to define a configuration template for BOB in the MATLAB GUI, which reduces the number of commands one must enter in the GUI command window. From the text-driven MATLAB CLI, all of these commands would need to be entered to run the job remotely on BOB.

%  ---------------------------------------------------------------------------
%  M. E. Kress, PhD, July 2010
%  R. B. Walsh,  MS, Aug  2010
%  College of Staten Island, CUNY
%  ---------------------------------------------------------------------------
%  Demo MATLAB PI Program for CUNY HPC Wiki:  Remote (BOB) Batch Serial Version 
%  ---------------------------------------------------------------------------
%  This MATLAB script calculates PI using the trapezoidal rule from the
%  integral of the arctangent (1/(1+x**2)). This is a simple serial code
%  which uses a standard 'for' loop and runs with a matlab pool size of 1.
%  This version includes all the code required to complete the submission
%  of a job from the local client (KARLE) to a remote cluster (BOB) for standard
%  serial processing and returns the results to the client for viewing.
%
%  This version is designed to run stand-alone from the MATLAB command line window
%  in the MATLAB GUI or from the text-driven command-line interface (CLI). Many 
%  of the commands in this file could be included in a MATLAB GUI "configuration" 
%  template for batch job submission to BOB, simplifying the script considerably.
%  Versions of this algorithm appear in "Computational Physics, 2nd Edition" by 
%  Landau, Paez, and Bordeianu; and "Using MPI" by Gropp, Lusk, and Skjellum.
%  ---------------------------------------------------------------------------
%
%  Define the 2 arguments to the MATLAB SimpleSubmitFcn: the name of the remote 
%  cluster (server, BOB in this case) running PBS and the path to the server
%  (remote) working directory.
%
clusterHost = 'bob.csi.cuny.edu';
remoteDataLocation = '/home/richard.walsh/matlab';
%
%  Inform MATLAB of the type of remote job scheduler to use. The 'generic' 
%  scheduler is the most flexible and customizable.
%
sched = findResource('scheduler', 'type', 'generic');
%
%  Define the path to the client (local) working directory from which MATLAB
%  stages the job and expects to find all required script files. At the CUNY HPC
%  Center, the client is KARLE (karle.csi.cuny.edu).  Also, set other parameters
%  required by the MATLAB job scheduler like the MATLAB root directory on the 
%  cluster, the file system type, and the OS on the cluster.
%
set(sched, 'DataLocation', '/home/richard.walsh/matlab');
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/default');
set(sched, 'HasSharedFilesystem', false);
set(sched, 'ClusterOsType', 'unix');
%
%  Define the names of auxiliary remote job submission functions
%
set(sched, 'GetJobStateFcn', @pbsGetJobState);
set(sched, 'DestroyJobFcn', @pbsDestroyJob);
%
%  Specify the name of the serial job submission function and its arguments.
%  This function determines the queue and resources used by the job on the server
%  (BOB). MATLAB has two alternative destination queues. Users running test 
%  or development jobs should specify the function with the 'Dev' suffix. Those
%  running production jobs should specify the function with the 'Prod' suffix.
%  Both of these scripts are located on KARLE in the MATLAB tree.
%
set(sched, 'SubmitFcn', {@pbsNonSharedSimpleSubmitFcn_Dev, clusterHost, remoteDataLocation});
%set(sched, 'SubmitFcn', {@pbsNonSharedSimpleSubmitFcn_Prod, clusterHost, remoteDataLocation});
%
%  Create the simple serial job object assigned to the job scheduler function
%
sjob = createJob(sched);
%
%  If this job requires data or function files ('farc.m') to run then they must 
%  be transferred over to the cluster with the main routine at the time of job 
%  submission, unless they are already present in the remote working directory or 
%  they are MATLAB intrinsic functions.  Any file needed to run the job locally 
%  will also be needed to run it remotely.  This is accomplished by defining file 
%  dependencies as shown in the following section.  Put each required file in single
%  quotes and inside {}'s as shown.
%
set(sjob, 'FileDependencies', {'serial_PI_func.m' 'farc.m'});
%
%  Create and name a task (defined here by our serial MATLAB script for computing PI)
%  to be completed by the remote MATLAB job (worker) on BOB. The task, which will be executed
%  on one processor, should be provided in MATLAB function rather than MATLAB script form,
%  which allows the user to indicate which variables must be transferred on input and
%  returned as output.
% 
stask = createTask(sjob,@serial_PI_func,1,{});
%
%   Submit the job to the scheduler on KARLE which moves all files to BOB and initiates
%   the PBS job there.
%
submit(sjob);
%
%   Wait for the remote PBS batch job on BOB to finish. This implies that the
%   batch job has finished successfully and returned its outputs to the client
%   working directory on KARLE.
%
waitForState(sjob, 'finished');
%
%   Get and print output results from disk.
%
results = getAllOutputArguments(sjob);
%
%   End of PbsPiSerial.m
%

Most of the scripting is fully described in the comment sections above, but several things should be called out. First, jobs can be submitted to either the development or production queues. Short-running test jobs like this one should be run in the development queue, which is reserved for short jobs and protected from longer-running production jobs. Jobs that run longer than the development queue CPU time limit of 64 minutes will be killed automatically by PBS. The development queue allows jobs of no more than 8 processors (8 CPU minutes on each processor). To choose the development queue, use the job submit function listed above with the 'Dev' suffix. To use the production queue for longer (no time limit) and more parallel jobs (limit of 16 processors), use the 'Prod' suffix.

Second, as suggested in the comments, all user files required by the job must be included in the file dependencies line and be present in the directory from which MATLAB was run. The example from above is:

set(sjob, 'FileDependencies', {'serial_PI_func.m' 'farc.m'});

Using the MATLAB function format rather than the script format for dependent files is a convenient way to deliver inputs to the job and recover outputs from the job. A MATLAB function requires a header line of the following type:

function [output1, output2, output3, ... ] = serial_PI_func(input1, input2, input3, ... )

Functions may have zero or more inputs and zero or more outputs, depending on what they are computing.
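
For illustration, here is a minimal sketch of the function form (the function name and body are hypothetical stand-ins, not the actual serial_PI_func used above); anything listed in the output bracket is returned to the client through 'getAllOutputArguments()', and any inputs are supplied in the cell array passed to 'createTask':

function [approx_pi, elapsed] = demo_pi_func(nintervals)
%  Hypothetical driver: approximates PI with a simple midpoint sum over
%  4/(1+x^2) on [0,1] and also returns the elapsed compute time.
tic;
x = ((1:nintervals) - 0.5) / nintervals;        % midpoints of the sub-intervals
approx_pi = sum(4 ./ (1 + x.^2)) / nintervals;  % quadrature approximation of PI
elapsed = toc;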

Finally, the 'createTask' command names the job's driver function (serial_PI_func in this case), gives the number of output arguments the function returns (1 here), and supplies the function's input arguments in a cell array (none here). The 'submit' command initiates the job, starting one MATLAB session and one PBS job on BOB for each task defined by each separate 'createTask' command.

Computing PI In Parallel Remotely on BOB

Below is a fully commented MATLAB parallel remote batch job submission script. It is very similar to the 'boiler-plate' wrapping presented above for serial job submission to BOB. Like the serial script, much of the explicit scripting presented here can also be used to define a configuration template for parallel job submission to BOB in the MATLAB GUI, which reduces the number of commands one must enter in the GUI command window. Similarly, from the text-driven MATLAB CLI, all of these commands would need to be entered to run the job remotely on BOB.

%  ---------------------------------------------------------------------------
%  M. E. Kress, PhD, July 2010
%  R. B. Walsh,  MS, Aug  2010
%  College of Staten Island, CUNY
%  ---------------------------------------------------------------------------
%  Demo MATLAB PI Program for CUNY HPC Wiki:  Remote (BOB) Batch SPMD Parallel Version
%  ---------------------------------------------------------------------------
%  This is a MATLAB SPMD (Single Program Multiple Data) or MPI-like version
%  of the parallel algorithm for computing PI using the trapezoidal rule
%  and the integral of the arctangent (1/(1+x**2)). This example generates a
%  MATLAB pool under its 'labs' abstraction, ascertains the names of each processor
%  (lab), and assigns each of them a share of the work. This version of the 
%  algorithm submits the SPMD job from the local client (KARLE) to a remote cluster
%  (BOB) for parallel processing and returns the results to the client for viewing.
%
%  This version is designed to run stand-alone from the MATLAB command line window
%  in the MATLAB GUI or from the text-driven command-line interface (CLI). Many 
%  of the commands in this file could be included in a MATLAB GUI "configuration"
%  template for batch job submission to BOB, simplifying the script considerably.
%  Versions of this algorithm appear in "Computational Physics, 2nd Edition" by
%  Landau, Paez, and Bordeianu; and "Using MPI" by Gropp, Lusk, and Skjellum.
%  ---------------------------------------------------------------------------
%
%  Define the 2 arguments to the MATLAB ParallelSubmitFcn: the name of the remote
%  cluster (server, BOB in this case) running PBS and the path to the server
%  (remote) working directory.
%
clusterHost = 'bob.csi.cuny.edu';
remoteDataLocation = '/home/richard.walsh/matlab';
%
%  Inform MATLAB of the type of remote job scheduler to use. The 'generic' 
%  scheduler is the most flexible and customizable.
%
sched = findResource('scheduler', 'type', 'generic');
%
%  Define the path to the client (local) working directory from which MATLAB
%  stages the job and expects to find all required script files. At the CUNY HPC
%  Center, the client is KARLE (karle.csi.cuny.edu).  Also, set other parameters 
%  required by the MATLAB job scheduler like the MATLAB root directory on the 
%  cluster, the file system type,  and the OS on the cluster.
%
set(sched, 'DataLocation', '/home/richard.walsh/matlab');
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/default');
set(sched, 'HasSharedFilesystem', false);
set(sched, 'ClusterOsType', 'unix');
%
%  Define the names of auxiliary remote job submission functions
%
set(sched, 'GetJobStateFcn', @pbsGetJobState);
set(sched, 'DestroyJobFcn', @pbsDestroyJob);
%
%  Specify the name of the parallel job submission function and its arguments.
%  This function determines the queue and resources used by the job on the server
%  (BOB). MATLAB has two alternative destination queues. Users running test 
%  or development jobs should specify the function with the 'Dev' suffix. Those
%  running production jobs should specify the function with the 'Prod' suffix.
%  Both of these scripts are located on KARLE in the MATLAB tree.
%
set(sched, 'ParallelSubmitFcn', {@pbsNonSharedParallelSubmitFcn_Dev, clusterHost, remoteDataLocation});
%set(sched, 'ParallelSubmitFcn', {@pbsNonSharedParallelSubmitFcn_Prod, clusterHost, remoteDataLocation});
%
%  Create the parallel job object to be assigned to the job scheduler function
%
pjob = createMatlabPoolJob(sched);
%
%  If this job requires data or function files ('farc.m') to run then they must
%  be transferred over to the cluster with the main routine at the time of job 
%  submission, unless they are already present in the remote working directory or
%  they are MATLAB intrinsic functions.  Any file needed to run the job locally
%  will also be needed to run it remotely. This is accomplished by defining file
%  dependencies as shown in the following section.  Put each required file in single
%  quotes and inside {}'s as shown.
%
set(pjob, 'FileDependencies', {'spmd_PI.m' 'farc.m'});
%
%  Define the number of processors (labs, workers) to use for this job. To ensure that
%  you get exactly the processor count you want, specify the maximum and minimum number
%  to be the same.
%
set(pjob, 'MaximumNumberOfWorkers', 4);
set(pjob, 'MinimumNumberOfWorkers', 4);
%
%  Create and name a task (defined here by our parallel SPMD MATLAB script for computing PI)
%  to be completed by the remote MATLAB job (lab, worker) pool. The task, which will be executed
%  by each processor (lab, worker), should be provided in MATLAB function rather than MATLAB 
%  script form, which allows the user to indicate which variables must be transferred on
%  input and returned as output.
%
ptask = createTask(pjob,@spmd_PI_func,4,{});
%
%   Submit the job to the scheduler on KARLE which moves all files to BOB and initiates
%   the PBS job there.
%
submit(pjob);
%
%   Wait for the remote PBS batch job on BOB to finish. This implies that the
%   batch job has finished successfully and returned its outputs to the client
%   working directory on KARLE.
%
waitForState(pjob, 'finished');
%
%  Get and print output results from disk
%
results = getAllOutputArguments(pjob);
%
%   End of PbsPiParallel.m
%

The reader will see that much of this parallel job submission script is the same as the serial script above, but there are important differences that need to be explained. First, the job submit function is different:

set(sched, 'ParallelSubmitFcn', {@pbsNonSharedParallelSubmitFcn_Dev, clusterHost, remoteDataLocation});

Above, the 'pbsNonSharedSimpleSubmitFcn' function was used, while here it is 'pbsNonSharedParallelSubmitFcn', a function specific to parallel jobs. Another difference is that here a MATLAB pool is requested rather than a serial task, and the minimum and maximum pool size is defined. This ensures that the job will use four (4) processors on BOB.

pjob = createMatlabPoolJob(sched);

.
.
.

set(pjob, 'MaximumNumberOfWorkers', 4);
set(pjob, 'MinimumNumberOfWorkers', 4);

The parallel job also has file dependencies, but this script uses the function version of the SPMD-Parallel algorithm presented above for running in SPMD parallel mode on KARLE. Finally, this script creates a single task for the whole pool (the 4-processor pool size having been set by the worker limits above), making it a Coupled Parallel job that PBS starts with one job ID but runs on 4 cores.

ptask = createTask(pjob,@spmd_PI_func,4,{});

Other Examples of Parallel Job Submission to BOB

CUNY's HPC group has successfully run both Distributed Parallel and Coupled Parallel jobs on BOB from KARLE, both from the GUI using a configuration for BOB and from the CLI without a configuration. Below are MATLAB scripts that the CUNY HPC group has used successfully to submit both Distributed Parallel and Coupled Parallel work to BOB from a Linux client. The MATLAB script for Distributed job submission is:

% Define arguments to SubmitFcn
clusterHost = 'bob.csi.cuny.edu';
% This is the path of the working directory on 'bob.csi.cuny.edu'
remoteDataLocation = '/home/<user_id>/matlab_remote';
% Create scheduler object
sched = findResource('scheduler', 'type', 'generic');
% Define a client local working directory in DataLocation on the submitting machine
set(sched, 'DataLocation', '/home/<user_id>/matlab_local');
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/default');
set(sched, 'HasSharedFilesystem', false);
set(sched, 'ClusterOsType', 'unix');
set(sched, 'GetJobStateFcn', @pbsGetJobState);
set(sched, 'DestroyJobFcn', @pbsDestroyJob);
% The SubmitFcn must be a cell array that includes the two additional inputs
set(sched, 'SubmitFcn', {@pbsNonSharedSimpleSubmitFcn_Prod, clusterHost, remoteDataLocation});

j = createJob(sched);
t = createTask(j,@rand,1);
t = createTask(j,@rand,1);
t = createTask(j,@rand,1);
t = createTask(j,@rand,1);

submit(j);

waitForState(j, 'finished');
results = getAllOutputArguments(j);

Function files referenced with the '@' sign (function handles) are presumed to have been made available in the remote working directory. Files placed in other locations may be referenced with their full remote file-system path, or their location may be added to the MATLAB path with the 'addpath' command (a sketch of this approach follows the path below). The last two commands wait for the job to reach the 'finished' state on the client and gather the results for display on the client. The runtime functions needed above (and below) can be obtained for customization from their distribution location in:

${MATLAB_ROOT}/toolbox/distcomp/examples/integration/pbs/nonshared
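
As an illustration of the 'addpath' alternative mentioned above, a minimal sketch (the directory and helper-function names are hypothetical): because the task function runs on the cluster, it can extend the MATLAB path itself before calling helpers staged on BOB outside the job's working directory.

function result = my_remote_task()
%  Hypothetical task function: extend the path on the cluster side, then
%  call a helper that was staged on BOB ahead of time.
addpath('/home/<user_id>/matlab_remote/helpers');   % hypothetical staging directory on BOB
result = some_staged_helper();                      % hypothetical user-supplied helper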

Further information can be found on submitting MATLAB distributed jobs at the MATLAB website here [57].

The MATLAB script for Parallel job submission is listed here (the function colsum.m must be provided in MATLAB's local working directory):

% Define arguments to ParallelSubmitFcn
clusterHost = 'bob.csi.cuny.edu';
% This is the path of the working directory on 'bob.csi.cuny.edu'
remoteDataLocation = '/home/<user_id>/matlab_remote';
sched = findResource('scheduler', 'type', 'generic');
% Define a client local working directory in DataLocation on the submitting machine
set(sched, 'DataLocation', '/home/<user_id>/matlab_local');
set(sched, 'ClusterMatlabRoot', '/share/apps/matlab/default');
set(sched, 'HasSharedFilesystem', false);
set(sched, 'ClusterOsType', 'unix');
set(sched, 'GetJobStateFcn', @pbsGetJobState);
set(sched, 'DestroyJobFcn', @pbsDestroyJob);
% If you want to run parallel jobs, you must specify a ParallelSubmitFcn
set(sched, 'ParallelSubmitFcn', {@pbsNonSharedParallelSubmitFcn_Prod, clusterHost, remoteDataLocation});

pjob = createParallelJob(sched);
% Create a dependency on the parallel function colsum.m to get it transferred to cluster
set(pjob, 'FileDependencies', {'colsum.m'});
% Define the number of processes to use for this job
set(pjob, 'MaximumNumberOfWorkers', 4);
set(pjob, 'MinimumNumberOfWorkers', 4);
t = createTask(pjob,@colsum,1,{});

submit(pjob);

waitForState(pjob, 'finished');
results = getAllOutputArguments(pjob);

Parallel function 'colsum':

function total_sum = colsum
if labindex == 1
    % Send magic square to other labs
    A = labBroadcast(1,magic(numlabs))
else
    % Receive broadcast on other labs
    A = labBroadcast(1)
end

% Calculate sum of column identified by labindex for this lab
column_sum = sum(A(:,labindex))

% Calculate total sum by combining column sum from all labs
total_sum = gplus(column_sum)

It is important to point out that any user-authored code residing on the local client will need to be copied over to the cluster's remote directory and be made available in the MATLAB path. The setting of the 'FileDependencies' property in the Parallel job script above illustrates how to accomplish this automatically as part of the job submission process. In this example, the job is dependent on the user-supplied function 'colsum.m' that is local to the client. The line:

set(pjob, 'FileDependencies', {'colsum.m'});

accomplishes the file transfer automatically. Because 'colsum.m' is written in MATLAB script, it can be transferred as text. However, user-defined functions that need to be compiled (typically ending in the suffix '.mex') must be compiled in the environment in which they will be used. This may mean that users will need to compile their code on the destination machine (the head node of the cluster, BOB, in our case) and provide the compiled result in the remote working directory defined in their submit script (or gather them up on the local system before submission and use the file dependency command as shown above for 'colsum.m'). Further information on file dependencies can be found in the MATLAB User's Guide [58].
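
For example (the MEX file name is hypothetical), a MATLAB-text function and a MEX file already compiled for BOB's 64-bit Linux nodes can both be listed as dependencies so they are shipped to the remote working directory with the job:

%  Hypothetical example: ship both the MATLAB-text function and a
%  pre-compiled MEX file along with the job ('.mexa64' is the 64-bit Linux
%  MEX extension).
set(pjob, 'FileDependencies', {'colsum.m', 'my_kernel.mexa64'});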

All of the scripting described above, once tested and functioning, can be reduced in the MATLAB GUI to a MATLAB 'configuration', which is selectable from a drop-down menu when submitting jobs. Further information on submitting MATLAB distributed parallel jobs can be found at the MATLAB website here [59].

An Outline of the Major Steps Involved in Remote Job Submission

A successful MATLAB job submission ( submit(pjob) ) to BOB from KARLE, driven by the commands above, completes the following steps:

1. The creation of a client-local job directory 'JobXX' in the current
     MATLAB working directory on KARLE, where XX is the MATLAB
     job number.

2. The transfer of the contents of the local 'JobXX' directory via 'ssh'
     from the client to BOB's head node (server) for execution to a mirror
     working directory on BOB.  Running 'get(job)' from MATLAB on KARLE
     will show a state of 'queued' or 'pending' for the job at this point.

3. The assignment of compute nodes to the job by PBS Pro and the queuing
     of the job for execution.  The job will now be visible as queued when 'qstat'
     is run on BOB.  This command is the PBS job monitoring utility.  A production
     job can remain in the queue state for a while depending on both the MATLAB
     and general level of activity on BOB. Users should be familiar with logging into
     BOB and tracking PBS batch jobs there.

4. The start of MATLAB processes on the cluster compute nodes.  The
     job will now be listed as running when 'qstat' is run on BOB.  The job
     may be a serial or parallel job; this will be visible in the 'qstat' output.

5. Job completion on BOB, indicated by a 'finished' state listed in the 'Job.state.mat'
     file on BOB.  The 'qstat' command will now show the job has completed.

6. Job files are transferred back to the client-local directory on KARLE, marking the
    'Job.state.mat' on the client also as 'finished'.  At this point, running the
    'get(job)' command from the client will show a job state of 'finished'.

7. Job results will be available to MATLAB via the 'results = getAllOutputArguments(job)'
     command upon successful completion.
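
These steps can be followed from the client side by querying the job object directly; a minimal sketch using the 'pjob' variable from the parallel example above:

get(pjob, 'State')                      % 'pending', 'queued', 'running', or 'finished'
get(pjob)                               % dump all job properties for inspection
waitForState(pjob, 'finished');         % block until the remote PBS job completes
results = getAllOutputArguments(pjob);  % collect the task outputs returned to KARLE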

There will be slight differences in the protocol between Distributed Parallel and Coupled Parallel jobs, with Distributed jobs showing N separately queued jobs with N job IDs, one for each MATLAB 'worker' task running on its own compute node, and Coupled Parallel jobs showing a single queued job whose MATLAB 'labs' run in concert on N compute nodes.

Help

If you have any questions regarding the CUNY HPCC please email us at: HPCHelp@mail.csi.cuny.edu

Contacts

Other Links