Applications Environment

Using Modules to Run your Applications

Modules is a software package that provides fast and convenient management of the components of a user's environment via modulefiles. When executed by the module command, each modulefile fully configures the environment for its associated application or application group. The modules configuration language also allows conflicts and dependencies between application environments to be managed. The modules software allows users to load (and unload and reload) an application and/or system environment that is specific to their needs, avoiding the need to set and manage a large, one-size-fits-all, generic environment for everyone at login.

Modules is the default approach to managing the user applications environment. The CUNY HPC Center system BOB, currently used almost entirely for Gaussian jobs, will NOT be reconfigured with the modules software. Module version 3.2.9 is the default on the CUNY HPC Center systems.

Using the module package, users can easily set a collection of environment variables that are specific to their compilation, parallel programming, and/or application requirements on the HPC Center's systems. The modules system also makes it convenient to advance or regress compiler, parallel programming, or application versions when defaults are found to have bugs or performance issues. Whatever the task, the modules package can adjust the environment in an orderly way, altering or setting environment variables such as PATH, MANPATH, and LD_LIBRARY_PATH, and providing some basic descriptive information about the application version being loaded and the purpose of the modulefile through the module help facility.

In addition to the application-specific modulefiles, the module package provides a collection of sub-commands given after the initial module command itself, as in "module list" for instance. All of these sub-commands are described in detail in the module man page ("man module"), but a list of some of the more important and commonly used ones is provided here, followed by a brief usage sketch:

Module sub-commands:

list
load
unload
switch
avail
show
help
purge
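
For instance (the modulefile name below is just an example, and the comments describe what each sub-command does):

module avail                # list every modulefile installed on the system
module show mathematica     # display the environment changes a modulefile would make
module help mathematica     # print the modulefile's descriptive help text
module list                 # show the modulefiles currently loaded
module purge                # unload all currently loaded modulefiles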

Modules, Learning by Example

The best way to explain how to use the modules package and its sub-commands is to consider some simple examples of typical workflows that involve modules. Here are two examples. Again, for a more complete description of the modules package, please refer to "man module".

Example 1, Basic Non-Cray System

Consider the unmodified PATH variable right after login to one of the CUNY HPC Center systems. Without any custom or local environmental path settings, it would look something like this with no compiler, parallel programming model, or application-specific information in it:

username@service0:~> echo $PATH | tr -s ':' '\n'
/home/username/bin
/usr/local/bin
/usr/bin
/bin
/usr/bin/X11
/usr/X11R6/bin
/usr/games
/opt/c3/bin

We note that there appears to be no path to the application that we are interested in running, which in this example is Wolfram's Mathematica. Typing "which math" at the terminal to find Mathematica ("math" is the command-line name for Mathematica) yields:

 
username@service0:~>  which math
which: no math in (/home/username/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/c3/bin)

The Mathematica executable "math" is not found in the default PATH variable defined by the system at login. Modules can be used to remedy this problem by adding the required path. To check which module files (if any) are already loaded into our environment, we can type the "module list" sub-command at the terminal prompt:

username@service0:~> module list
No Modulefiles Currently Loaded.
username@service0:~>

No modules are loaded, so the module file for Mathematica has not been loaded, and it is no surprise that the Mathematica command "math" was not found. The next question is whether the HPC Center has installed Mathematica on this system and created a module file for it. To find this out we use the "module avail" sub-command:

username@service0:~> module avail
---------------------------- /share/apps/modules/default/modulefiles_UserApplications --------------------------------------

adf/2012.01(default)         cesm/1.0.3                   hoomd/0.9.2(default)         ncar/5.2.0_NCL(default)      pgi/12.3(default)
auto3dem/4.02(default)       cesm/1.0.4(default)          intel/12.1.3.293(default)    nwchem/6.1.1(default)        phoenics/2009(default)
autodock/4.2.3(default)      cuda/5.0(default)            ls-dyna/6.0.0(default)       octopus/4.0.0(default)       r/2.14.1(default)
beagle/0.2(default)          gromacs/4.5.5_32bit          mathematica/8.0.4(default)   openmpi/1.5.5_intel(default) wrf/3.4.0(default)
best/2.2L(default)           gromacs/4.5.5_64bit(default) matlab/R2012a(default)       openmpi/1.5.5_pgi

--------------------------------- /share/apps/modules/default/modulefiles_System -------------------------------------------

module-info   modules       version/3.2.9

The listing shows all available module files on this system, both those that are related to user applications and those that are more system related. As shown in the output, these two types of module files are stored in different directories. Looking through the application list, there is a module for Mathematica version 8.0.4, which also happens to be the default. On this system the modules package has only just been installed, so only one version of each application has been adapted to the module system, and that version is the default.

The module file responsible for setting up the environment needed to run Mathematica can now be loaded:

module load mathematica

Because there is only one version and it is the default, there is no need to include the version-specific extension to load it. To explicitly load version 8.0.4 (or any other specific and non-default version) one would use:

module load mathematica/8.0.4

After loading, the environmental PATH variable includes the path to Mathematica:

username@service0:~> echo $PATH | tr -s ':' '\n'
/home/username/bin
/usr/local/bin
/usr/bin
/bin
/usr/bin/X11
/usr/X11R6/bin
/usr/games
/opt/c3/bin
/share/apps/mathematica/8.0.4/Executables

This can be verified by rerunning the "which math" command:

username@service0:~> which math
/share/apps/mathematica/8.0.4/Executables/math

Once the head or login node environment variables are properly set, one can create a PBS script to run a Mathematica job on a compute node and ensure that the head or login node environment just set is passed on to the compute nodes by using the "#PBS -V" option inside your PBS script:

#!/bin/bash
#PBS -N mmat8_serial1
#PBS -q production
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin Mathematica Serial Run ..."
echo ""
math -run <test_run.nb > output
echo ""
echo ">>>> End   Mathematica Serial Run ..."

Since the PATH variable in the login environment now includes the location of the Mathematica executable, and the "#PBS -V" option ensures that this environment is passed to the compute node on which the job runs, the last line of the PBS script will execute without environment-related problems.
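
Assuming this script has been saved to a file, say 'math_serial.job' (the name is arbitrary), it can then be submitted to PBS with:

qsub math_serial.job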

Example 2, Less Basic From SALK (Cray System)

Like all of the systems at the CUNY HPC Center, the Cray SALK offers multiple compilers, parallel programming models, libraries, and applications. In addition, SALK uses a custom high-performance interconnect with its own libraries and has its own compiler suite, compiling system, and many other custom libraries. Setting up and/or tearing down a given environment that makes all this work correctly is more complicated than it is on the other systems at the HPC Center. Modules simplifies this process tremendously for the user.

Here is an example of how to swap out the default Cray compiler environment on SALK and swap in the compiler suite from the Portland Group, including all the right MPI libraries from Cray. In this case, we swap in a new release of the Portland Group compilers (not the current default on the Cray) and the version of the NETCDF library that has been compiled with the Portland Group compilers.

Having logged into SALK, we determine what modules have been loaded by default with "module list":

user@salk:~> module list
Currently Loaded Modulefiles:
  1) modules/3.2.6.6
  2) nodestat/2.2-1.0400.31264.2.5.gem
  3) sdb/1.0-1.0400.32124.7.19.gem
  4) MySQL/5.0.64-1.0000.5053.22.1
  5) lustre-cray_gem_s/1.8.6_2.6.32.45_0.3.2_1.0400.6453.5.1-1.0400.32127.1.90
  6) udreg/2.3.1-1.0400.4264.3.1.gem
  7) ugni/2.3-1.0400.4374.4.88.gem
  8) gni-headers/2.1-1.0400.4351.3.1.gem
  9) dmapp/3.2.1-1.0400.4255.2.159.gem
 10) xpmem/0.1-2.0400.31280.3.1.gem
 11) hss-llm/6.0.0
 12) Base-opts/1.0.2-1.0400.31284.2.2.gem
 13) xtpe-network-gemini
 14) cce/8.0.7
 15) acml/5.1.0
 16) xt-libsci/11.1.00
 17) pmi/3.0.0-1.0000.8661.28.2807.gem
 18) rca/1.0.0-2.0400.31553.3.58.gem
 19) xt-asyncpe/5.13
 20) atp/1.5.1
 21) PrgEnv-cray/4.0.46
 22) xtpe-mc8
 23) cray-mpich2/5.5.3
 24) pbs/11.3.0.121723

From the list, we see that the Cray Programming Environment ("PrgEnv-cray/4.0.46") and the Cray Compiler Environment ("cce/8.0.7") are loaded by default, among other things (PBS, MPICH, etc.). To unload these Cray modules and load the Portland Group (PGI) equivalents, we need to know the names of the PGI modules. The "module avail" command will tell us this:

user@salk:~> module avail
.
.
(several sections of output removed)
.
.
------------------------------------------------ /opt/modulefiles -----------------------------------------------------
Base-opts/1.0.2-1.0400.31284.2.2.gem(default)     gcc/4.1.2                                         pbs/11.2.0.113417
PrgEnv-cray/3.1.61                                gcc/4.2.4                                         pbs/11.3.0.121723(default)
PrgEnv-cray/4.0.46(default)                       gcc/4.4.2                                         petsc/3.1.08
PrgEnv-gnu/3.1.61                                 gcc/4.4.4                                         petsc/3.1.09
PrgEnv-gnu/4.0.46(default)                        gcc/4.5.1                                         petsc-complex/3.1.08
PrgEnv-intel/3.1.61                               gcc/4.5.2                                         petsc-complex/3.1.09
PrgEnv-intel/4.0.46(default)                      gcc/4.5.3                                         pgi/12.10
PrgEnv-pathscale/3.1.61                           gcc/4.6.1                                         pgi/12.3
PrgEnv-pathscale/4.0.46(default)                  gcc/4.7.1(default)                                pgi/12.6(default)
PrgEnv-pgi/3.1.61                                 hss-llm/6.0.0(default)                            pgi/12.8
PrgEnv-pgi/4.0.46(default)                        intel/12.1.1.256                                  wrf/3.3.0
acml/4.4.0                                        intel/12.1.4.319(default)                         wrf/3.4.0(default)
acml/5.1.0(default)                               intel/12.1.5.339                                  xt-asyncpe/5.01
admin-modules/1.0.2-1.0400.31284.2.2.gem(default) java/jdk1.6.0_24                                  xt-asyncpe/5.05
amber/12(default)                                 java/jdk1.7.0_03(default)                         xt-asyncpe/5.13(default)
cce/8.0.7(default)                                mazama/6.0.0(default)                             xt-libsci/11.0.00
chapel/1.4.0                                      modules/3.2.6.6(default)                          xt-libsci/11.0.04
chapel/1.5.0(default)                             mrnet/3.0.0(default)                              xt-libsci/11.1.00(default)
fftw/2.1.5.3                                      pathscale/4.0.12.1(default)                       xt-papi/4.2.0
fftw/3.2.2.1(default)                             pathscale/4.0.9                                   xt-papi/4.3.0(default)
fftw/3.3.0.1                                      pbs/11.1.0.111761

There are several versions of the PGI compilers and two versions of the PGI Programming Environment for the Cray (SALK). We are interested in loading PGI's 12.10 release (not the default, which is "pgi/12.6") and the most current release of the PGI programming environment ("PrgEnv-pgi/4.0.46"), which is the default. The PGI programming environment for the Cray provides all the environmental settings required to use the PGI compilers on the Cray, including a number of custom libraries.

Here is a series of module commands to unload the Cray defaults, load the PGI modules mentioned, and load version 4.2.0 of NETCDF compiled with the PGI compilers.

user@salk:~> module unload PrgEnv-cray
user@salk:~> module load PrgEnv-pgi
user@salk:~> module unload pgi
user@salk:~> module load pgi/12.10
user@salk:~> 
user@salk:~> module load netcdf/4.2.0
user@salk:~>
user@salk:~> cc -V
/opt/cray/xt-asyncpe/5.13/bin/cc: INFO: Compiling with CRAYPE_COMPILE_TARGET=native.

pgcc 12.10-0 64-bit target on x86-64 Linux 
Copyright 1989-2000, The Portland Group, Inc.  All Rights Reserved.
Copyright 2000-2012, STMicroelectronics, Inc.  All Rights Reserved.

Several comments about this series of commands may be useful. First, the first three commands do not include version numbers and will therefore load or unload the current default versions. In the third line, we unload the default version of the PGI compiler (version 12.6), which was loaded with the rest of the PGI Programming Environment in the second line. We then load the non-default and more recent release from PGI, version 12.10, in the fourth line. Later, we load NETCDF version 4.2.0 which, because we have already loaded the PGI Programming Environment, will be the version of NETCDF 4.2.0 compiled with the PGI compilers. Finally, we check which compiler the Cray "cc" compiler wrapper actually invokes after this sequence of module commands. We see that indeed "pgcc" version 12.10 is being used.

We can confirm all this by again entering "module list". Notice that the Cray-related compiler modules have been replaced by those from PGI and that NETCDF version 4.2.0 has been loaded. We are now ready to use the new PGI-based compiler environment. It is left as an exercise to the reader to figure out how the series of commands listed above could have been shortened by using the "module swap" sub-command.

user@salk:~> module list
Currently Loaded Modulefiles:
  1) modules/3.2.6.6
  2) nodestat/2.2-1.0400.31264.2.5.gem
  3) sdb/1.0-1.0400.32124.7.19.gem
  4) MySQL/5.0.64-1.0000.5053.22.1
  5) lustre-cray_gem_s/1.8.6_2.6.32.45_0.3.2_1.0400.6453.5.1-1.0400.32127.1.90
  6) udreg/2.3.1-1.0400.4264.3.1.gem
  7) ugni/2.3-1.0400.4374.4.88.gem
  8) gni-headers/2.1-1.0400.4351.3.1.gem
  9) dmapp/3.2.1-1.0400.4255.2.159.gem
 10) xpmem/0.1-2.0400.31280.3.1.gem
 11) hss-llm/6.0.0
 12) Base-opts/1.0.2-1.0400.31284.2.2.gem
 13) xtpe-network-gemini
 14) xtpe-mc8
 15) cray-mpich2/5.5.3
 16) pbs/11.3.0.121723
 17) xt-libsci/11.1.00
 18) pmi/3.0.0-1.0000.8661.28.2807.gem
 19) xt-asyncpe/5.13
 20) atp/1.5.1
 21) PrgEnv-pgi/4.0.46
 22) pgi/12.10
 23) hdf5/1.8.8
 24) netcdf/4.2.0

Applications

This is an overview of the user-level HPC applications supported by the HPC Center staff for the benefit of the entire CUNY HPC user community. A user can choose to install any application that they are licensed for under their own account, or appeal (based on general interest) to have it installed by HPC Center staff in the shared system directory (usually /share/apps).

Not every user-level application is installed on every system, because system architectural differences, load-balancing considerations, licensing limitations, the time required to maintain them, and other factors sometimes dictate otherwise. Here, we present the current CUNY HPC Center user-level application catalogue and note the systems on which each application is installed and licensed to run.

We encourage the CUNY HPC community to help the HPC Center staff create an applications catalogue that is closely tuned to the needs of the community. As such, we hope that users will solicit staff help in growing our application install base to suit their needs, whatever the application discipline (CAE, CFD, COMPCHEM, QCD, BIOINFORMATICS, etc.).

Unless otherwise noted, all applications built locally were built using our default Intel-OpenMPI applications stack. Furthermore, the PBS Pro job submission scripts below are promised to work (at the time this section of the Wiki was written), but the number of processors (cores), memory, and process placement defined in the example scripts are not necessarily optimal for wall-clock or cpu-time performance. Users should use their knowledge of the application, the system, and the benefit of their experience to choose the optimal combination of processors and memory for their scripts. Details on how to make full use of the PBS Pro job submission options are covered in the PBS Pro section below.

ADCIRC

ADCIRC is a system of programs for solving time-dependent, free-surface, circulation and transport problems in two and three dimensions. These programs utilize the finite element method in space, allowing the use of highly flexible, unstructured grids. The ADCIRC distribution includes and integrates the METIS tool for unstructured grid partitioning. In addition, ADCIRC includes a distribution of SWAN, to which it can be coupled to add a near-shore wave simulation model.

Typical ADCIRC applications have included: (i) modeling tides and wind driven circulation, (ii) analysis of hurricane storm surge and flooding, (iii) dredging feasibility and material disposal studies, (iv) larval transport studies, (v) near shore marine operations. For more detail on using ADCIRC, please visit the ADCIRC website here [1] and read the ADCIRC manual [2]. Details on using SWAN with ADCIRC can be found here [3] and at the SWAN web site [4].

The CUNY HPC Center has installed version 50.79 on SALK (the Cray) and ANDY (the SGI) for general academic use. ADCIRC can be run in serial or MPI-parallel mode on either system. ADCIRC has demonstrated good scaling properties up to 512 cores on SALK and 64 cores on ANDY. A step-by-step walk through of running an ADCIRC test case in both serial and parallel mode follows.

Serial Execution

Create a directory where all the files needed to run the serial ADCIRC job will be kept.

salk$ mkdir test_sadcirc
salk$ cd test_sadcirc

Copy the Shinnecock Inlet example from the ADCIRC installation tree and unzip it.

salk$ cp /share/apps/adcirc/default/testcase/serial_shinnecock_inlet.zip ./
salk$ unzip ./serial_shinnecock_inlet.zip 
Archive:  ./serial_shinnecock_inlet.zip
  inflating: serial_shinnecock_inlet/fort.14  
  inflating: serial_shinnecock_inlet/fort.15  
  inflating: serial_shinnecock_inlet/fort.16  
  inflating: serial_shinnecock_inlet/fort.63  
  inflating: serial_shinnecock_inlet/fort.64  

Change into the unpacked subdirectory.

salk$ cd serial_shinnecock_inlet/

There you should find the following files:

salk$ ls
fort.14  fort.15  fort.16  fort.63  fort.64

Next, create a PBS script (e.g., 'sadcirc.job') with the following lines in it to be used to submit the serial ADCIRC job to the Cray (SALK) PBS queues. Note that on SALK running a serial job requires allocating (and wasting most of) 16 processors because fractional compute nodes cannot be allocated on SALK.

#!/bin/bash
#PBS -q production
#PBS -N SADCIRC.test
#PBS -l select=16:ncpus=1:mem=2048mb
#PBS -l place=free
#PBS -j oe
#PBS -o sadcirc.out
#PBS -V

# Find out name of master execution host (compute node)
echo ""
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

echo ">>>> Begin ADCRIC Serial Run ..."
aprun -n 1 /share/apps/adcirc/default/bin/adcirc
echo ">>>> End   ADCRIC Serial Run ..."

And finally to submit the serial job to the PBS queue enter:

salk$ qsub sadcirc.job

Parallel Execution

The steps required to run ADCIRC in parallel include some additional mesh partitioning and decomposition steps based on the number of processors planned for the job. As before, create a directory where all the files needed for the job will be kept:

salk$ mkdir test_padcirc
salk$ cd test_padcirc

Again, copy the Shinnecock Inlet example from the ADCIRC installation tree and unzip it. The starting point for the serial and parallel tests is the same, but for the parallel case the serial data set used above is partitioned and decomposed for the parallel run.

salk$ cp /share/apps/adcirc/default/testcase/serial_shinnecock_inlet.zip ./
salk$ unzip ./serial_shinnecock_inlet.zip 
Archive:  ./serial_shinnecock_inlet.zip
  inflating: serial_shinnecock_inlet/fort.14  
  inflating: serial_shinnecock_inlet/fort.15  
  inflating: serial_shinnecock_inlet/fort.16  
  inflating: serial_shinnecock_inlet/fort.63  
  inflating: serial_shinnecock_inlet/fort.64  

Rename the directory you just unpacked and change into it:

salk$ mv  serial_shinnecock_inlet  parallel_shinnecock_inlet
salk$ cd parallel_shinnecock_inlet/

Now we need to run the ADCIRC preparation program 'adcprep' to partition the serial domain and decompose the problem:

salk$ /share/apps/adcirc/default/bin/adcprep 

When prompted, enter 8 for the number of processors to be used in our parallel example:


  *****************************************
  ADCPREP Fortran90 Version 2.3  10/18/2006
  Serial version of ADCIRC Pre-processor   
  *****************************************
  
 Input number of processors for parallel ADCIRC run:
8

Next, enter 1 to complete partitioning the domain for 8 processors using METIS:


 #-------------------------------------------------------
   Preparing input files for subdomains.
   Select number or action:
     1. partmesh
      - partition mesh using metis ( perform this first)
 
     2. prepall
      - Full pre-process using default names (i.e., fort.14)

      ...

 #-------------------------------------------------------

 calling: prepinput

 use_default =  F
 partition =  T
 prep_all  =  F
 prep_15   =  F
 prep_13   =  F
 hot_local  =  F
 hot_global  =  F

Next, provide the name of the unpartitioned grid file unzipped from the serial test case, fort.14:

Enter the name of the ADCIRC UNIT 14 (Grid) file:
fort.14

This will generate some additional output to your terminal and complete the mesh partition step.

You must then run 'adcprep' again to decompose the problem. When prompted, enter 8 for the number of processors as before, but this time follow it with a 2 to decompose the problem. When this preparation step completes you will find the following files and directories in your working directory:

salk$ ls
fort.15     fort.63  fort.80          partmesh.txt  PE0001  PE0003  PE0005  PE0007
fort.14     fort.16  fort.64     metis_graph.txt  PE0000   PE0002  PE0004  PE0006

The 8 subdirectories created in the second 'adcprep' run contain the partitioned and decomposed problem that each MPI processor (8 in this case) will work on.
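
Because 'adcprep' is interactive, some users prefer to script its prompts rather than typing them. A minimal sketch using here-documents, assuming the prompt order shown above (option 2, prepall, uses default file names, so no further input should be needed):

# Pass 1: partition the mesh into 8 subdomains with METIS (grid file fort.14)
/share/apps/adcirc/default/bin/adcprep <<EOF
8
1
fort.14
EOF

# Pass 2: decompose the full problem for the 8 subdomains
/share/apps/adcirc/default/bin/adcprep <<EOF
8
2
EOF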

Copy the parallel ADCIRC binary to the working directory.

# cp /share/apps/adcirc/default/bin/padcirc ./

At this point you'll have all the files needed to run the parallel job. The files and directories created and required for this 8 core parallel run are shown here:

# ls 
adc  fort.14  fort.15  fort.16  fort.80  metis_graph.txt  partmesh.txt 
PE0000/  PE0001/  PE0002/  PE0003/  PE0004/  PE0005/  PE0006/  PE0007/

Create a PBS script (e.g., 'padcirc.job') with the following lines in it to be used to submit the parallel ADCIRC job to the Cray (SALK) PBS queues:

#!/bin/bash
#PBS -q production
#PBS -N PADCIRC.test
#PBS -l select=16:ncpus=1:mem=2048mb
#PBS -l place=free
#PBS -j oe
#PBS -o padcirc.out
#PBS -V

# Find out name of master execution host (compute node)
echo ""
echo -n ">>>> PBS Master compute node is: "
hostname

# Change to working directory
cd $PBS_O_WORKDIR

echo ">>>> Begin PADCRIC MPI Parallel Run ..."
aprun -n 8 /share/apps/adcirc/default/bin/padcirc
echo ">>>> End   PADCRIC MPI Parallel Run ..."

And finally to submit the parallel job to the PBS queue enter:

salk$ qsub padcirc.job

The CUNY HPC Center has also built and provided a parallel, coupled version of ADCIRC and SWAN to include surface wave effects in the simulation. This executable is called 'padcswan' and can be run with largely the same preparation steps and the same PBS script shown above for 'padcirc'. Details on the minor differences and additional input files required are available at the SWAN websites given above.
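
For example, assuming 'padcswan' is installed alongside 'padcirc', only the executable on the 'aprun' line of the parallel script above would change (the additional SWAN input files mentioned above are also required):

aprun -n 8 /share/apps/adcirc/default/bin/padcswan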

ADF (Amsterdam Density Functional Theory)

ADF (Amsterdam Density Functional) is a Fortran program for calculations on atoms and molecules (in gas phase or solution) from first principles. It can be used for the study of such diverse fields as molecular spectroscopy, organic and inorganic chemistry, crystallography and pharmacochemistry. Some of its key strengths include high accuracy supported by its use of Slater-type orbitals, all-electron relativistic treatment of the heavier elements, and fast parameterized DFT-based semi-empirical methods. A separate program BAND is available for the study of periodic systems: crystals, surfaces, and polymers. The COSMO-RS program is used for calculating thermodynamic properties of (mixed) fluids.

The underlying theory is the Kohn-Sham approach to Density-Functional Theory (DFT). This implies a one-electron picture of the many-electron systems, but yields in principle the exact electron density (and related properties) and the total energy. If ADF is a new program for you, we recommend that you carefully read Chapter 1, section 1.3 'Technical remarks, Terminology', which presents a discussion of a few ADF-typical aspects and terminology. This will help you to understand and appreciate the output of an ADF calculation. The ADF Manual is located on the web here: [5]

ADF 2013 (and SCM's other programs) is installed on ANDY and PENZIAS at the CUNY HPC Center. The older 2012 version is also available on the ANDY server. The current license is group-limited and allows for up to 32 cores of simultaneous ADF use and 8 cores of simultaneous BAND use. This is a floating license limited to the DDR side of ANDY and the PBS 'production' queue. Users not currently in the ADF group should inquire about access by sending an email to 'hpchelp@csi.cuny.edu'.

Here is a simple ADF input deck that computes the SCF wave function for HCN. This example can be run with the PBS script shown below on 1 to 4 cores.

Title    HCN Linear Transit, first part
NoPrint  SFO, Frag, Functions, Computation

Atoms      Internal
  1 C  0 0 0       0    0    0
  2 N  1 0 0       1.3  0    0
  3 H  1 2 0       1.0  th  0
End

Basis
 Type DZP
End

Symmetry NOSYM

Integration 6.0 6.0

Geometry
  Branch Old
  LinearTransit  10
  Iterations     30  4
  Converge   Grad=3e-2,  Rad=3e-2,  Angle=2
END

Geovar
  th   180    0
End

End Input

A PBS script ('adf_4.job') configured to use 4 cores is shown here. Note that ADF does not use the version of MPI that the HPC Center supports by default. ADF uses the proprietary version of MPI from SGI that is part of SGI's MPT parallel library package. This script includes special lines to configure the run for this. A side effect of this is that ADF jobs will not accumulate time under the 'Time' column when the job is checked with 'qstat'.

To include all required environment variables and the path to the ADF executable, first run the module load command (the modules utility is discussed in detail above):

module load adf

The PBS script itself follows:

#!/bin/bash
# This script runs a 4-cpu (core) ADF job using the group
# license of Dr. Vittadello, Dr. Birke, and the CUNY HPC
# Center. This script requests only one half of the resources 
# on an ANDY compute node (4 cores, 1 half its memory). 
#
# The HCN_4P.inp deck in this directory is configured to work
# with these resources, although this computation is really 
# too small to make full use of them. To increase or decrease
# the resources PBS requests (cpus, memory, or disk) change the 
# '-l select' line below and the parameter values in the input deck.
#
#PBS -q production
#PBS -N adf_4P_job
#PBS -l select=1:ncpus=4:mem=11520mb:lscratch=400gb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# set environment up to use SGI's MPT version of MPI rather
# than the CUNY default which is OpenMPI
BASEPATH=/opt/sgi/mpt/mpt-2.02

export PATH=${BASEPATH}/bin:${PATH}
export CPATH=${BASEPATH}/include:${CPATH}
export FPATH=${BASEPATH}/include:${FPATH}
export LD_LIBRARY_PATH=${BASEPATH}/lib:${LD_LIBRARY_PATH}
export LIBRARY_PATH=${BASEPATH}/lib:${LIBRARY_PATH}
export MPI_ROOT=${BASEPATH}

# set the ADF root directory
export ADFROOT=/share/apps/adf
export ADFHOME=${ADFROOT}/2013.01

# point ADF to the ADF license file
export SCMLICENSE=${ADFHOME}/license.txt

# set up ADF scratch directory 
export MY_SCRDIR=`whoami;date '+%m.%d.%y_%H:%M:%S'`
export MY_SCRDIR=`echo $MY_SCRDIR | sed -e 's; ;_;'`
export SCM_TMPDIR=/home/adf/adf_scr/${MY_SCRDIR}_$$

mkdir -p $SCM_TMPDIR

echo ""
echo "The ADF scratch files for this job are in: ${SCM_TMPDIR}"
echo ""

# check important paths
#type mpirun
#type adf

# set the number processors to use in this job to 4
export NSCM=4

# run the ADF job
echo "Starting ADF job ... "
echo ""

adf -n 4 < HCN_4P.inp > HCN_4P.out 2>&1

# name output files
mv logfile HCN_4P.logfile

echo ""
echo "ADF job finished ... "

# clean up scratch directory files
/bin/rm -r $SCM_TMPDIR

Much of this script is similar to the script that runs Gaussian jobs, but the differences should be described in some detail. First, ADF must be submitted to the 'production' queue, which is where its floating license has been limited, and where it can use at most 32 cores at a time for ADF and 8 cores at a time for BAND. Second, there is a block in the script that sets up the environment to use the SGI proprietary version of MPI for parallel runs. Next is the NSCM environment variable, which defines the number of cores to use along with the '-n' option on the command line. Both of these (along with the number of cpus on the PBS '-l select' line at the beginning of the script) must be adjusted to control the number of cores used by the job.
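
As a sketch, to run the same HCN job on 8 cores (still within the 32-core license limit), the three lines to be changed consistently would become the following; the memory value here simply scales the per-core figure used above:

#PBS -l select=1:ncpus=8:mem=23040mb:lscratch=400gb
export NSCM=8
adf -n 8 < HCN_4P.inp > HCN_4P.out 2>&1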

Note that the 'adf' command is actually a script that generates and runs another script, which in turn runs the 'adf.exe' executable. This generated script (called 'runscript') is built and placed in the user's working directory. It typically includes some preliminary steps that are NOT run in parallel.

With the HCN input file and PBS script above, you can submit an ADF job on ANDY with:

qsub adf_4.job

All users of ADF must be licensed and placed in the 'gadf' Unix group by HPC Center staff.

AMBER Assisted Model Building with Energy Refinement

Amber is the collective name for a suite of programs that allow users to carry out molecular dynamics simulations, particularly on biomolecules. None of the individual programs carries this name, but the various parts work reasonably well together, and provide a powerful framework for many common calculations. The term "Amber" is also sometimes used to refer to the empirical force fields that are implemented within the Amber suite. It should be recognized however, that the code and force fields are distinct: several other computer packages have implemented the Amber force fields, and other force fields can be implemented and used with the Amber programs. Furthermore, the force fields are in the public domain, whereas the codes are distributed under a license agreement.

The Amber 12 software suite is divided into two parts: AmberTools12, a collection of freely available programs mostly under the GPL license, and Amber12, which is centered around the 'sander' and 'pmemd' simulation programs, and which continues to be licensed as before, under a more restrictive license. You need to install both parts, starting with AmberTools. The online manual for the AmberTools is located here [6]. The online manual for Amber itself is located here [7]. Amber 12 (2012) represents a significant change from the most recent previous version, Amber 11, which was released in April, 2010. Please see [8] for an overview of the most important changes specific to Amber 12. In particular, Amber 12 includes a carefully validated multi-GPU implementation.

AMBER version 12 is installed on SALK, ANDY, and PENZIAS. The latter has 144 Kepler GPUs, so users who want to use the multi-GPU AMBER implementation, which supports single, double, and mixed precision operation on the GPUs, must submit their jobs on that machine. All servers offer single-threaded and MPI-parallel versions. The low-latency, custom Gemini interconnect on SALK should provide superior scaling performance at higher core counts.

A PBS Pro submit script for AMBER that runs the CPU-only version using 'mpirun' on 16 processors and is appropriate for ANDY follows:

#!/bin/bash
#PBS -q production_qdr
#PBS -N AMBER_MPI
#PBS -l select=16:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the Non-Threaded MPI parallel executable to run
echo ">>>> Begin Non-Threaded AMBER MPI Parallel Run ..."
mpirun -np 16  -machinefile $PBS_NODEFILE sander.MPI -O -i product.in -o model1_product.out \
-c model1_equil.rst -p model1.prmtop -r model1_product.rst -x model1_product.mdcrd > amber_mpi.out 2>&1
echo ">>>> End   Non-Threaded AMBER MPI Parallel Run ...

To run a similar job, but one that uses 16 CPUs for the bonded interactions and an additional 16 GPUs for the non-bonded interactions, the following script could be used on PENZIAS:

#!/bin/bash
#PBS -q production
#PBS -N AMBER_GPU
#PBS -l select=16:ncpus=1:ngpus=1:mem=3680mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the Non-Threaded MPI parallel executable to run
echo ">>>> Begin Non-Threaded AMBER MPI-GPU Parallel Run ..."
mpirun -np 16  -machinefile $PBS_NODEFILE (to be completed)
echo ">>>> End    Non-Threaded AMBER MPI-GPU Parallel Run ..."

SALK has no GPUs, so the above script cannot be run there. In addition, job submission on SALK via PBS is somewhat different from the other CUNY systems. Below is a script that will run a 16-processor (core) job similar to the first script above, but on SALK the amber module must first be loaded:

module load amber

This ensures that the programming environment on the Cray is 'PrgEnv-gnu', that the Cray compiler wrapper points to the GNU suite, and that the right versions of the NETCDF and FFTW libraries are loaded.

Here is the script for SALK:

#!/bin/bash
#PBS -q production
#PBS -N amber_test
#PBS -l select=16:ncpus=1:mem=2048mb
#PBS -l place=free
#PBS -j oe
#PBS -o amber_test16.out
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'aprun' and point to the MPI parallel executable to run
echo ">>>> Begin AMBER MPI Run ..."
echo ""
aprun -n 16 -N 16 sander.MPI -O -i product.in -o model1_product.out -c model1_equil.rst -p model1.prmtop -r model1_product.rst -x model1_product.mdcrd
echo ""
echo ">>>> End   AMBER MPI Run ..."

This script could then be submitted with:

qsub amber.job

The most important difference to note is that on SALK the 'mpirun' command is replaced with Cray's 'aprun' command. The 'aprun' command is used to start all jobs on SALK's compute nodes and mediates the interaction between the PBS script's resource requests and the ALPS resource manager on the Cray. SALK users should familiarize themselves with 'aprun' and its options by reading 'man aprun' on SALK. Users cannot request more resources on their 'aprun' command lines than are defined by the PBS script's resource request lines. There is useful discussion elsewhere on the Wiki about the interaction between PBS and ALPS as mediated by the 'aprun' command and the error messages generated when there is a mismatch.

For any of these jobs to run, all the required auxiliary files must be present in the directory from which the job is run.

AUTODOCK

AutoDock is a suite of automated docking tools. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure. AutoDock actually consists of two main programs: autodock itself performs the docking of the ligand to a set of grids describing the target protein; autogrid pre-calculates these grids. More information about the software may be found at the autodock web-page [9].

Both autodock and autogrid are installed on Andy under the directory "/share/apps/autodock/default" (currently available version is 4.2.3).

A typical AutoDock workflow consists of the following steps.

1) Preparation of the receptor file: processing a molecular structure into an autodock-readable format (for example, receptor.pdb to receptor.pdbqt using AutoDockTools).

2) Preparation of the drug molecule (using AutoDockTools).

3) Generation of the input file needed to run the Autogrid calculation (to generate atomic maps).

4) Generation of the input file needed to run the Autodock calculation (to generate drug-receptor complexes, etc.).

The steps above can easily be done on a local office machine. Once these steps are completed and all of the input files are prepared, a user can copy those files to ANDY to run the computationally intensive part there. We have provided examples of input files in /share/apps/autodock/default/example/in/

These files will be used here to further illustrate Autodock usage.
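
As an aside, steps 1 and 2 above are often done from the command line with the MGLTools/AutoDockTools helper scripts. A sketch, assuming MGLTools and its 'pythonsh' interpreter are installed on your local machine (file names are placeholders):

# Step 1: convert the receptor structure to PDBQT format
pythonsh prepare_receptor4.py -r receptor.pdb -o receptor.pdbqt

# Step 2: convert the ligand (drug molecule) to PDBQT format
pythonsh prepare_ligand4.py -l ligand.pdb -o ligand.pdbqt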

5) Create a working directory and "cd" there:

mkdir ./autodock_test  && cd ./autodock_test

6) Copy the example input files (or the actual files you prepared in steps 1-4) to the working directory:

cp /share/apps/autodock/default/example/in/* ./

7) Create a PBS submit script. Use your favorite text editor to put the following lines into file "autodock.job":

#!/bin/bash
#PBS -N autodock_test
#PBS -q production
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

echo "*** Starting Job ***"

#------------------------------
# generate atomic maps
autogrid4 -p aq.gpf -l output_mapGeneration.glg

# execute autodock
autodock4 -p aq.dpf -l output_autodock.dlg
#------------------------------

echo "*** Job is done! ***"

8) Submit the job to the PBS queue with:

qsub autodock.job

One can check the status of the job using the PBS "qstat" command. Upon successful completion, the outputs will be stored in the files "output_mapGeneration.glg" and "output_autodock.dlg".

9) Analyse the results.

Sample outputs from exactly the same job can be found under "/share/apps/autodock/default/example/out" (files "output_mapGeneration.glg" and "output_autodock.dlg").

BAMOVA

Bamova implements a Bayesian Analysis of Molecular Variance and different likelihood models for three different types of molecular data (including two models for high-throughput sequence data), as described in detail in Gompert and Buerkle (2011) and Gompert et al. (2010). Use of the software requires good familiarity with the models described in these papers. It will also likely require some programming to format data for input and to analyze the MCMC output. For more detail on BAMOVA please visit the BAMOVA web site [10] and the manual here [11].

Currently, BAMOVA version 1.02 is installed on ANDY at the CUNY HPC Center. BAMOVA is a serial program that requires an input file and distance files to run. Here, we show how to run the test input case provided with the downloaded code, 'hapcountexample.txt' which uses the distance file 'distfileexample.txt'. These files may be copied to the user's working directory for PBS job submission with:

cp /share/apps/bamova/default/examples/*.txt .

To include all required environment variables and the path to the BAMOVA executable, run the module load command (the modules utility is discussed in detail above):

module load bamova

Here is a PBS batch script that works with this example input case:

#!/bin/bash
#PBS -q production
#PBS -N BAMOVA_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin BAMOVA Serial Run ..."
bamova -f ./hapcountexample.txt -d ./distfileexample.txt -l 0 -x 250000 -v 0.25 -a 0 -D 0 -w 1 -W 1 -i 0.0 -I 0.0 -T 10  > bamova_ser.out 2>&1
echo ">>>> End   BAMOVA Serial Run ..."

It should take less than 30 minutes to run and will produce PBS output and error files beginning with the job name 'BAMOVA_serial', but the primary BAMOVA application results will be written into the user-specified file at the end of the BAMOVA command line after the greater-than sign. Here it is named 'bamova_ser.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which should have much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

Please note the BAMOVA command line options. These are described in detail in the manual referenced above.

BAYESCAN

This program, BayeScan, aims to identify candidate loci under natural selection from genetic data, using differences in allele frequencies between populations. BayeScan is based on the multinomial-Dirichlet model. One of the scenarios covered consists of an island model in which subpopulation allele frequencies are correlated through a common migrant gene pool from which they differ in varying degrees. The difference in allele frequency between this common gene pool and each subpopulation is measured by a subpopulation-specific FST coefficient. Therefore, this formulation can consider realistic ecological scenarios where the effective size and the immigration rate may differ among subpopulations.

More detailed information on Bayescan can be found at the web site here [12] and in the manual here [13].

Currently, BAYESCAN version 2.1 is installed on ANDY at the CUNY HPC Center. BAYESCAN 2.1 is an SMP-parallel program that uses the OpenMP parallel programming model. It can be run on one or more (even all) cores on a single compute node (ANDY has 8 cores per node), but not across multiple compute nodes. It requires a genotype data input file to run. Here, we show how to run the test input case provided with the downloaded code, 'test_SNPs.txt'. This file may be copied to the user's working directory for submission with:

cp /share/apps/bayescan/default/examples/distro/test_SNPs.txt* .

To include all required environment variables and the path to the BAYESCAN executable, run the module load command (the modules utility is discussed in detail above):

module load bayescan

Here is a PBS batch script that runs this example input case on 2 cores:

#!/bin/bash
#PBS -q production
#PBS -N BAYSCAN_omp
#PBS -l select=1:ncpus=2:mem=5760mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# It is possible to set the number of threads to be used in this
# OpenMP program using the environment variable OMP_NUM_THREADS.
export OMP_NUM_THREADS=2

# Just point to the serial executable to run
echo ">>>> Begin BAYESCAN OpenMP Parallel Run ..."
bayescan_2.1 -threads 2 test_SNPs.txt > bayescan_omp.out 2>&1
echo ">>>> End   BAYESCAN OpenMP Parallel Run ..."

This PBS batch script can be dropped into a file (say bayescan_omp.job) on ANDY and run with the following command:

qsub bayescan_omp.job

It should take less than 20 minutes to run and will produce PBS output and error files beginning with the job name 'BAYSCAN_omp', along with a number of BAYESCAN-specific files. The primary BAYESCAN application results will be written into the user-specified file at the end of the BAYESCAN command line after the greater-than sign. Here it is named 'bayescan_omp.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=2:mem=5760mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 2 processors (cores) and 5,760 MBs of memory in it for the job. The second instructs PBS to place the chunk on any compute node with the required resources available. Because PBS resource chunks are atomic, both cores in the request will be allocated from the same compute node.

One can run this job in serial mode on a single core by setting the thread count to 1 and changing the '-l select' line to the following (the corresponding thread settings in the body of the script are shown after it):

#PBS -l select=1:ncpus=1:mem=2880mb
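
The thread settings in the body of the script would then become (the input and output file names follow the example above):

export OMP_NUM_THREADS=1

bayescan_2.1 -threads 1 test_SNPs.txt > bayescan_ser.out 2>&1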

The master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The BAYESCAN command line options can be printed with the following once the module is loaded:

bayescan_2.1 --help

These options are described in detail in the manual [14].

BEAST

BEAST is a cross-platform Java program for Bayesian MCMC analysis of molecular sequences. It is entirely orientated towards rooted, time-measured phylogenies inferred using strict or relaxed molecular clock models. It can be used as a method of reconstructing phylogenies, but is also a framework for testing evolutionary hypotheses without conditioning on a single tree topology. BEAST uses MCMC to average over tree space, so that each tree is weighted proportional to its posterior probability. The distribution includes a simple to use user-interface program called 'BEAUti' for setting up standard analyses and a suite of programs for analysing the results. For more detail on BEAST (and BEAUTi) please visit the BEAST web site [15].

Currently, BEAST version 1.8.0 is the default version installed on ANDY and PENZIAS at the CUNY HPC Center, but version 2.1.2 has recently been installed and will be made the default after user testing. Earlier versions are also available. BEAST is a serial program, but it can also be run in parallel with the help of a companion library (BEAGLE) on systems with Graphics Processing Units (GPUs). PENZIAS supports GPU processing using NVIDIA Kepler GPUs, and the BEAGLE 1.0 and 2.0 GPU libraries have been installed there for this purpose. (NOTE: GPU processing on ANDY has been eliminated, as the Fermi GPUs there have reached end-of-life.) BEAST can therefore be run either serially or in GPU-accelerated mode on PENZIAS. Benchmarks of BEAST show that GPU acceleration provides significant performance improvement over basic CPU serial operation.

BEAST's user interface program, 'BEAUti', can be run locally on an office workstation or from the head node of ANDY or PENZIAS. The latter option assumes that the user has logged in directly to PENZIAS or ANDY via the secure shell with X-Windows tunneling enabled (e.g. ssh -X my.name@andy.csi.cuny.edu). This second approach is only convenient for those on the College of Staten Island campus who can directly address PENZIAS or ANDY. Moreover, if HPC staff find that using the BEAUti interface is consuming too much CPU time, users will be asked to move their pre-processing work to their desktop, which is the preferred location for pre-processing in general. Details on using ssh to login are provided elsewhere in this document. Among other things, BEAUti is used to convert raw '.nex' files into BEAST XML-based input files. Using ANDY's or PENZIAS's head node for anything compute intensive is forbidden, but these file conversions should be fairly low intensity.

Once a usable BEAST input file has been created (or provided), a PBS batch script must be written to run the job, either in serial mode or in GPU parallel mode. Below, we show how to run both a serial and a GPU-accelerated job with a test input case (testRNA.xml) available in the BEAST examples directory. The input file may be copied into the user's working directory from BEAST's installation tree for submission with PBS, as follows:

cp /share/apps/beast/2.1.2/examples/testRNA.xml .

To include all required environmental variables and the path to the BEAST executable run the modules load command (the modules utility is discussed in detail above):

module load beast/2.1.2

Next, a PBS Pro batch script must be created to run your job. The first script below shows a serial run that uses the testRNA.xml XML input file.

#!/bin/bash
#PBS -q production
#PBS -N BEAST_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin BEAST Serial Run ..."
beast -m 2880 -seed 666 ./testRNA.xml > beast_ser.out 2>&1
echo ">>>> End   BEAST Serial Run ..."

This script can be dropped into a file (say 'beast_serial.job') on ANDY or PENZIAS and run with:

qsub beast_serial.job

This case should take less than fifteen minutes to run and will produce PBS output and error files beginning with the job name 'BEAST_serial', as well as files specific to BEAST. The primary BEAST application output will be written into the user-specified file at the end of the BEAST command line after the greater-than sign. Here it is named 'beast_ser.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered in the PBS section of the Wiki below. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory for the job. The second line instructs PBS to place this job wherever the least loaded compute node resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The HPC Center staff has made two changes to the memory constraints in operation for ALL the programs distributed with BEAST (see the list below). First, the default minimum memory size has been raised from 64 MBs to 256 MBs. Second, a maximum memory control option has been added to all the programs. It is not required; if NOT used, programs will use the historical default for Linux jobs, which is 1024 MBs (i.e. 1 GB). If this option is used, it must be the FIRST option included on the execution line in your script and should take the form:

-m XXXXX

where the value 'XXXXX' is the new user-selected maximum memory setting in MBytes. So, the option used in the script above:

-m 2880

would bump up the memory maximum for the 'beast' program to 2,880 MBytes. Notice that this matches the amount requested per 'chunk' in the PBS '-l select' line above. You should not ask for more memory than you have requested through PBS.

You may wish to request more memory than the per cpu (core) defaults on a system. This can be accomplished by asking for more cores per PBS 'chunk' than you are going to use, but using ALL of the memory PBS allocates to the multiple cores. For instance, a '-l select' line of:

#PBS -l select=1:ncpus=4:mem=11520mb

requests 4 cpus (cores) and 11,520 MBs of memory. You could make this request of PBS and then leave the extra cores unused while asking for the all of the memory allocated with a 'beast' execution line of:

beast -m 11520 -seed 666 ./testRNA.xml

This provides 4 times the single-core quantity of memory for your 'beast' run by allocating, but not using, the 4 PBS cores requested in the '-l select' statement. The non-GPU version of 'beast' is serial in the sense that it uses only one CPU core. This memory-ceiling management option and technique can be used with any of the programs distributed with BEAST; another such distributed program would be 'treeannotator', for instance.
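
For example, a hypothetical 'treeannotator' invocation using the same memory option (the tree file names here are placeholders):

treeannotator -m 2880 input.trees summary.tree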

Remember that ANDY and PENZIAS have 2880 MBs of available memory per core (2.880 GBs).

Note that there are a large number of command line options available to BEAST. This example uses the defaults, other than setting the seed with '-seed 666'. All of BEAST's options can be listed as follows:

beast -help
[richard.walsh@bob beast]$ /share/apps/beast/default/bin/beast -help 
  Usage: beast [-verbose] [-warnings] [-strict] [-window] [-options] [-working] [-seed] [-prefix <PREFIX>] [-overwrite] [-errors <i>] [-threads <i>] [-java] [-beagle] [-beagle_info] [-beagle_order <order>] [-beagle_instances <i>] [-beagle_CPU] [-beagle_GPU] [-beagle_SSE] [-beagle_single] [-beagle_double] [-beagle_scaling <default|none|dynamic|always>] [-help] [<input-file-name>]
    -verbose Give verbose XML parsing messages
    -warnings Show warning messages about BEAST XML file
    -strict Fail on non-conforming BEAST XML file
    -window Provide a console window
    -options Display an options dialog
    -working Change working directory to input file's directory
    -seed Specify a random number generator seed
    -prefix Specify a prefix for all output log filenames
    -overwrite Allow overwriting of log files
    -errors Specify maximum number of numerical errors before stopping
    -threads The number of computational threads to use (default auto)
    -java Use Java only, no native implementations
    -beagle Use beagle library if available
    -beagle_info BEAGLE: show information on available resources
    -beagle_order BEAGLE: set order of resource use
    -beagle_instances BEAGLE: divide site patterns amongst instances
    -beagle_CPU BEAGLE: use CPU instance
    -beagle_GPU BEAGLE: use GPU instance if available
    -beagle_SSE BEAGLE: use SSE extensions if available
    -beagle_single BEAGLE: use single precision if available
    -beagle_double BEAGLE: use double precision if available
    -beagle_scaling BEAGLE: specify scaling scheme to use
    -help Print this information and stop

  Example: beast test.xml
  Example: beast -window test.xml
  Example: beast -help

The CUNY HPC Center also provides a GPU-accelerated version of BEAST. This version can be run ONLY on PENZIAS, which supports GPUs. The same 'beast' module, which also loads the BEAGLE GPU library, must be loaded as shown previously. NOTE: ANDY no longer supports GPU computation, as its Fermi GPUs have reached end-of-life failure rates.

A PBS batch script for running the GPU-accelerated version of BEAST follows:

#!/bin/bash
#PBS -q production
#PBS -N BEAST_gpu
#PBS -l select=1:ncpus=1:ngpus=1:mem=2880mb:accel=kepler
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin BEAST GPU Run ..."
beast_gpu -m 2880 -beagle -beagle_GPU  -beagle_single -seed 666 ./testRNA.xml > beast_gpu.out 2>&1
echo ">>>> End   BEAST GPU Run ..."

This script has several unique features. First, the '-l select' line includes requests for GPU-related resources. Both 1 processor (ncpus=1) and 1 GPU (ngpus=1) are requested; you need both. The type of GPU accelerator is specified as an NVIDIA Kepler GPU (accel=kepler), which has 832 double-precision processors (and 2496 single-precision processors) running at 0.705 GHz to apply to this workload. These GPU processing cores, while less powerful individually than a CPU core, in concert are what deliver the performance of the highly parallel MCMC algorithm.

In addition, GPU-specific command-line options are required to invoke the GPU version of BEAST. Here we have requested that the 'BEAGLE' GPU library be used and that the computation be run in single-precision (32 bits as opposed to 64 bits) on the GPU, which is as much as 3X faster than double-precision on NVIDIA Kepler if you can get by with single-precision.

All the programs that are part of the BEAST 2.1.2 distribution are available, even though we have only discussed 'beast' itself in detail here. The other programs, all of which can be run with similar scripts (see the 'treeannotator' sketch below), include:

addonmanager beast beauti densitree loganalyser  logcombiner  treeannotator
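
For example, 'treeannotator' (which summarizes a posterior sample of trees produced by a 'beast' run) can be submitted with essentially the same serial script. The sketch below is illustrative only; the input/output file names and the '-burnin' value are hypothetical:

#!/bin/bash
#PBS -q production
#PBS -N TREEANN_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Summarize a (hypothetical) tree file from an earlier 'beast' run;
# '-m 2880' applies the same per-core memory ceiling discussed above
echo ">>>> Begin TreeAnnotator Run ..."
treeannotator -m 2880 -burnin 10 ./testRNA.trees ./testRNA.mcc.tre > treeann.out 2>&1
echo ">>>> End   TreeAnnotator Run ..."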

BEST

The Bayesian Estimation of Species Trees application (BEST) implements a Bayesian hierarchical model to jointly estimate gene trees and the species tree from multilocus DNA molecular sequence data. It provides a new approach for estimating the mutation-rate-based, phylogenetic relationships among species. Its method accounts for deep coalescence, but not for other complicating issues such as horizontal transfer or gene duplication. The program works in conjunction with the popular Bayesian phylogenetics package, MrBayes (Ronquist and Huelsenbeck, Bioinformatics, 2003). BEST's parameters are defined using the 'prset' command from MrBayes. Details on BEST's capabilities and options are available at the BEST web site here [16].

Currently, BEST versions 2.2.0 and 2.3.1 are available on ANDY at the CUNY HPC Center. BEST version 2.2.0 is the current default because a special large-memory version is available for it that is not yet available in version 2.3.1. Both versions can be run in either a parallel or serial mode.

To run BEST, first a NEXUS-formatted, DNA sequence comparison input file (e.g. a '.nex' file) must be created using MrBayes. See the section on MrBayes below for this. Here is an example NEXUS input file:

#NEXUS

begin data;
   dimensions ntax=17 nchar=432;
   format datatype=dna missing=?;
   matrix
   human       ctgactcctgaggagaagtctgccgttactgccctgtggggcaaggtgaacgtggatgaagttggtggtgaggccctgggcaggctgctggtggtctacccttggacccagaggttctttgagtcctttggggatctgtccactcctgatgctgttatgggcaaccctaaggtgaaggctcatggcaagaaagtgctcggtgcctttagtgatggcctggctcacctggacaacctcaagggcacctttgccacactgagtgagctgcactgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggtctgtgtgctggcccatcactttggcaaagaattcaccccaccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaagtatcac
   tarsier     ctgactgctgaagagaaggccgccgtcactgccctgtggggcaaggtagacgtggaagatgttggtggtgaggccctgggcaggctgctggtcgtctacccatggacccagaggttctttgactcctttggggacctgtccactcctgccgctgttatgagcaatgctaaggtcaaggcccatggcaaaaaggtgctgaacgcctttagtgacggcatggctcatctggacaacctcaagggcacctttgctaagctgagtgagctgcactgtgacaaattgcacgtggatcctgagaatttcaggctcttgggcaatgtgctggtgtgtgtgctggcccaccactttggcaaagaattcaccccgcaggttcaggctgcctatcagaaggtggtggctggtgtggctactgccttggctcacaagtaccac
   bushbaby    ctgactcctgatgagaagaatgccgtttgtgccctgtggggcaaggtgaatgtggaagaagttggtggtgaggccctgggcaggctgctggttgtctacccatggacccagaggttctttgactcctttggggacctgtcctctccttctgctgttatgggcaaccctaaagtgaaggcccacggcaagaaggtgctgagtgcctttagcgagggcctgaatcacctggacaacctcaagggcacctttgctaagctgagtgagctgcattgtgacaagctgcacgtggaccctgagaacttcaggctcctgggcaacgtgctggtggttgtcctggctcaccactttggcaaggatttcaccccacaggtgcaggctgcctatcagaaggtggtggctggtgtggctactgccctggctcacaaataccac
   hare        ctgtccggtgaggagaagtctgcggtcactgccctgtggggcaaggtgaatgtggaagaagttggtggtgagaccctgggcaggctgctggttgtctacccatggacccagaggttcttcgagtcctttggggacctgtccactgcttctgctgttatgggcaaccctaaggtgaaggctcatggcaagaaggtgctggctgccttcagtgagggtctgagtcacctggacaacctcaaaggcaccttcgctaagctgagtgaactgcattgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggttattgtgctgtctcatcactttggcaaagaattcactcctcaggtgcaggctgcctatcagaaggtggtggctggtgtggccaatgccctggctcacaaataccac
   rabbit      ctgtccagtgaggagaagtctgcggtcactgccctgtggggcaaggtgaatgtggaagaagttggtggtgaggccctgggcaggctgctggttgtctacccatggacccagaggttcttcgagtcctttggggacctgtcctctgcaaatgctgttatgaacaatcctaaggtgaaggctcatggcaagaaggtgctggctgccttcagtgagggtctgagtcacctggacaacctcaaaggcacctttgctaagctgagtgaactgcactgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggttattgtgctgtctcatcattttggcaaagaattcactcctcaggtgcaggctgcctatcagaaggtggtggctggtgtggccaatgccctggctcacaaataccac
   cow         ctgactgctgaggagaaggctgccgtcaccgccttttggggcaaggtgaaagtggatgaagttggtggtgaggccctgggcaggctgctggttgtctacccctggactcagaggttctttgagtcctttggggacttgtccactgctgatgctgttatgaacaaccctaaggtgaaggcccatggcaagaaggtgctagattcctttagtaatggcatgaagcatctcgatgacctcaagggcacctttgctgcgctgagtgagctgcactgtgataagctgcatgtggatcctgagaacttcaagctcctgggcaacgtgctagtggttgtgctggctcgcaattttggcaaggaattcaccccggtgctgcaggctgactttcagaaggtggtggctggtgtggccaatgccctggcccacagatatcat
   sheep       ctgactgctgaggagaaggctgccgtcaccggcttctggggcaaggtgaaagtggatgaagttggtgctgaggccctgggcaggctgctggttgtctacccctggactcagaggttctttgagcactttggggacttgtccaatgctgatgctgttatgaacaaccctaaggtgaaggcccatggcaagaaggtgctagactcctttagtaacggcatgaagcatctcgatgacctcaagggcacctttgctcagctgagtgagctgcactgtgataagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggtggttgtgctggctcgccaccatggcaatgaattcaccccggtgctgcaggctgactttcagaaggtggtggctggtgttgccaatgccctggcccacaaatatcac
   pig         ctgtctgctgaggagaaggaggccgtcctcggcctgtggggcaaagtgaatgtggacgaagttggtggtgaggccctgggcaggctgctggttgtctacccctggactcagaggttcttcgagtcctttggggacctgtccaatgccgatgccgtcatgggcaatcccaaggtgaaggcccacggcaagaaggtgctccagtccttcagtgacggcctgaaacatctcgacaacctcaagggcacctttgctaagctgagcgagctgcactgtgaccagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgatagtggttgttctggctcgccgccttggccatgacttcaacccgaatgtgcaggctgcttttcagaaggtggtggctggtgttgctaatgccctggcccacaagtaccac
   elephseal   ttgacggcggaggagaagtctgccgtcacctccctgtggggcaaagtgaaggtggatgaagttggtggtgaagccctgggcaggctgctggttgtctacccctggactcagaggttctttgactcctttggggacctgtcctctcctaatgctattatgagcaaccccaaggtcaaggcccatggcaagaaggtgctgaattcctttagtgatggcctgaagaatctggacaacctcaagggcacctttgctaagctcagtgagctgcactgtgaccagctgcatgtggatcccgagaacttcaagctcctgggcaatgtgctggtgtgtgtgctggcccgccactttggcaaggaattcaccccacagatgcagggtgcctttcagaaggtggtagctggtgtggccaatgccctcgcccacaaatatcac
   rat         ctaactgatgctgagaaggctgctgttaatgccctgtggggaaaggtgaaccctgatgatgttggtggcgaggccctgggcaggctgctggttgtctacccttggacccagaggtactttgatagctttggggacctgtcctctgcctctgctatcatgggtaaccctaaggtgaaggcccatggcaagaaggtgataaacgccttcaatgatggcctgaaacacttggacaacctcaagggcacctttgctcatctgagtgaactccactgtgacaagctgcatgtggatcctgagaacttcaggctcctgggcaatatgattgtgattgtgttgggccaccacctgggcaaggaattcaccccctgtgcacaggctgccttccagaaggtggtggctggagtggccagtgccctggctcacaagtaccac
   mouse       ctgactgatgctgagaagtctgctgtctcttgcctgtgggcaaaggtgaaccccgatgaagttggtggtgaggccctgggcaggctgctggttgtctacccttggacccagcggtactttgatagctttggagacctatcctctgcctctgctatcatgggtaatcccaaggtgaaggcccatggcaaaaaggtgataactgcctttaacgagggcctgaaaaacctggacaacctcaagggcacctttgccagcctcagtgagctccactgtgacaagctgcatgtggatcctgagaacttcaggctcctaggcaatgcgatcgtgattgtgctgggccaccacctgggcaaggatttcacccctgctgcacaggctgccttccagaaggtggtggctggagtggccactgccctggctcacaagtaccac
   hamster     ctgactgatgctgagaaggcccttgtcactggcctgtggggaaaggtgaacgccgatgcagttggcgctgaggccctgggcaggttgctggttgtctacccttggacccagaggttctttgaacactttggagacctgtctctgccagttgctgtcatgaataacccccaggtgaaggcccatggcaagaaggtgatccactccttcgctgatggcctgaaacacctggacaacctgaagggcgccttttccagcctgagtgagctccactgtgacaagctgcacgtggatcctgagaacttcaagctcctgggcaatatgatcatcattgtgctgatccacgacctgggcaaggacttcactcccagtgcacagtctgcctttcataaggtggtggctggtgtggccaatgccctggctcacaagtaccac
   marsupial   ttgacttctgaggagaagaactgcatcactaccatctggtctaaggtgcaggttgaccagactggtggtgaggcccttggcaggatgctcgttgtctacccctggaccaccaggttttttgggagctttggtgatctgtcctctcctggcgctgtcatgtcaaattctaaggttcaagcccatggtgctaaggtgttgacctccttcggtgaagcagtcaagcatttggacaacctgaagggtacttatgccaagttgagtgagctccactgtgacaagctgcatgtggaccctgagaacttcaagatgctggggaatatcattgtgatctgcctggctgagcactttggcaaggattttactcctgaatgtcaggttgcttggcagaagctcgtggctggagttgcccatgccctggcccacaagtaccac
   duck        tggacagccgaggagaagcagctcatcaccggcctctggggcaaggtcaatgtggccgactgtggagctgaggccctggccaggctgctgatcgtctacccctggacccagaggttcttcgcctccttcgggaacctgtccagccccactgccatccttggcaaccccatggtccgtgcccatggcaagaaagtgctcacctccttcggagatgctgtgaagaacctggacaacatcaagaacaccttcgcccagctgtccgagctgcactgcgacaagctgcacgtggaccctgagaacttcaggctcctgggtgacatcctcatcatcgtcctggccgcccacttcaccaaggatttcactcctgactgccaggccgcctggcagaagctggtccgcgtggtggcccacgctctggcccgcaagtaccac
   chicken     tggactgctgaggagaagcagctcatcaccggcctctggggcaaggtcaatgtggccgaatgtggggccgaagccctggccaggctgctgatcgtctacccctggacccagaggttctttgcgtcctttgggaacctctccagccccactgccatccttggcaaccccatggtccgcgcccacggcaagaaagtgctcacctcctttggggatgctgtgaagaacctggacaacatcaagaacaccttctcccaactgtccgaactgcattgtgacaagctgcatgtggaccccgagaacttcaggctcctgggtgacatcctcatcattgtcctggccgcccacttcagcaaggacttcactcctgaatgccaggctgcctggcagaagctggtccgcgtggtggcccatgccctggctcgcaagtaccac
   xenlaev     tggacagctgaagagaaggccgccatcacttctgtatggcagaaggtcaatgtagaacatgatggccatgatgccctgggcaggctgctgattgtgtacccctggacccagagatacttcagtaactttggaaacctctccaattcagctgctgttgctggaaatgccaaggttcaagcccatggcaagaaggttctttcagctgttggcaatgccattagccatattgacagtgtgaagtcctctctccaacaactcagtaagatccatgccactgaactgtttgtggaccctgagaactttaagcgttttggtggagttctggtcattgtcttgggtgccaaactgggaactgccttcactcctaaagttcaggctgcttgggagaaattcattgcagttttggttgatggtcttagccagggctataac
   xentrop     tggacagctgaagaaaaagcaaccattgcttctgtgtgggggaaagtcgacattgaacaggatggccatgatgcattatccaggctgctggttgtttatccctggactcagaggtacttcagcagttttggaaacctctccaatgtctccgctgtctctggaaatgtcaaggttaaagcccatggaaataaagtcctgtcagctgttggcagtgcaatccagcatctggatgatgtgaagagccaccttaaaggtcttagcaagagccatgctgaggatcttcatgtggatcccgaaaacttcaagcgccttgcggatgttctggtgatcgttctggctgccaaacttggatctgccttcactccccaagtccaagctgtctgggagaagctcaatgcaactctggtggctgctcttagccatggctacttc
   ;
end;

begin mrbayes;
   charset non_coding = 1-90 358-432;
   charset coding     = 91-357;
   partition region = 2:non_coding,coding;
   set partition = region;
   lset applyto=(2) nucmodel=codon;
   prset ratepr=variable;
   mcmc ngen=5000 nchains=1 samplefreq=10;
end;

Next, a PBS Pro batch script must be created to run your job. The first script below shows a MPI parallel script for the above '.nex' input file. Note that the number of processors that can be used by the job is limited to the number of chains in the input file. Here, we have just 2 chains and therefore can only request 2 processors. If you make the mistake of asking for more processors than input file chains, you will get the following error message:

      The number of chains must be at least as great
      as the number of processors (in this case 4)

Also, to include all required environmental variables and the path to the BEST executable run the modules load command (the modules utility is discussed in detail above):

module load best

Here is the MPI parallel PBS batch script for BEST that requests 2 processors, one for each chain in the input file:

#!/bin/bash
#PBS -q production
#PBS -N BEST_parallel
#PBS -l select=2:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel executable to run
echo ">>>> Begin BEST Parallel Run ..."
mpirun -np 2 -machinefile $PBS_NODEFILE mbbest ./bglobin.nex  > best_mpi.out 2>&1
echo ">>>> End   BEST Parallel Run ..."

This script can be dropped into a file (say 'best_mpi.job') on ANDY and run with:

qsub best_mpi.job

It should take less than five minutes to run and will produce PBS output and error files beginning with the job name 'BEST_parallel'. The primary BEST application results will be written into the user-specified file at the end of the BEST command line after the greater-than sign. Here it is named 'best_mpi.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=2:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 2 resource 'chunks', each with 1 processor (core) and 2,880 MBs of memory in it, for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The CUNY HPC Center also provides a serial version of BEST. A PBS batch script for running the serial version of BEST follows:

#!/bin/bash
#PBS -q production
#PBS -N BEST_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin BEST Serial Run ..."
mbbest_serial ./bglobin.nex > best_ser.out 2>&1
echo ">>>> End   BEST Serial Run ..."

BOWTIE2

BOWTIE2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. BOWTIE2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. BOWTIE2 supports gapped, local, and paired-end alignment modes. BOWTIE2 is part of a sequence alignment and analysis tool chain developed at Johns Hopkins, the University of California at Berkeley, and Harvard, and distributed through the Center for Bioinformatics and Computational Biology. The other tools in this collection, CUFFLINKS, SAMTOOLS, and TOPHAT, are also installed at the CUNY HPC Center. Additional information can be found at the BOWTIE2 home page here [17].

At the CUNY HPC Center BOWTIE2 version 2.0.6 is installed on ANDY. BOWTIE2 is a parallel threaded code (pthreads) that takes its input from a simple text file provided on the command line. Below is an example PBS script that will run the lambda virus test case provided with the BOWTIE2 distribution which can be copied from the local installation directory to your current location as follows:

cp /share/apps/bowtie/default/example/reference/lambda_virus.fa .
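
The alignment step in the script below also reads the example file 'reads_1.fq'. The copy command here is only a sketch: it assumes the distribution keeps its example reads in a 'reads' directory alongside the 'reference' directory shown above, which is not confirmed in this document:

cp /share/apps/bowtie/default/example/reads/reads_1.fq .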

To include all required environmental variables and the path to the BOWTIE2 executable run the modules load command (the modules utility is discussed in detail above). BOWTIE2 is the default version of BOWTIE installed at the CUNY HPC Center. The older original version is also installed.

module load bowtie

Running 'bowtie2' from the interactive prompt without any options will provide a brief description of the form of the command-line arguments and options. Here is a PBS batch script that builds the lambda virus index and aligns the sequences in serial mode:

#!/bin/bash
#PBS -q production
#PBS -N BOWTIE2_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin BOWTIE2 Serial Run ..."
echo ""
echo ">>>> Build Index ..."
bowtie2-build lambda_virus.fa lambda_virus > lambda_virus_index.out 2>&1
echo ""
echo ">>>> Align Sequence ..."
bowtie2 -x lambda_virus -U reads_1.fq -S eg1.sam > lambda_virus_align.out 2>&1
echo ""
echo ">>>> End   BOWTIE2 Serial Run ..."

This script can be dropped into a file (say bowtie2.job) and started with the command:

qsub bowtie2.job

Running the lambda virus test case should take less than 2 minutes and will produce PBS output and error files beginning with the job name 'BOWTIE2_serial'. The primary BOWTIE2 application results will be written into the user-specified files at the end of the BOWTIE2 command lines after the greater-than sign. Here they are named 'lambda_virus_index.out' and 'lambda_virus_align.out'. The expression '2>&1' at the end of the command line combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

To run BOWTIE2 in parallel-threads mode several changes to the script are required. Here is a modified script that shows how to run BOWTIE2 using two threads. ANDY has as many as 8 physical compute cores per compute node and therefore as many as 8 threads might be chosen, but the larger the number of cores-threads requested the longer the job may wait to start as PBS looks for a compute node with the free resources requested.

#!/bin/bash
#PBS -q production
#PBS -N BOWTIE2_threads
#PBS -l select=1:ncpus=2:mem=5760mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin BOWTIE2 Threaded Run ..."
echo ""
echo ">>>> Build Index ..."
bowtie2-build lambda_virus.fa lambda_virus > lambda_virus_index.out 2>&1
echo ""
echo ">>>> Align Sequence ..."
bowtie2 -p 2 -x lambda_virus -U reads_1.fq -S eg1.sam > lambda_virus_align2.out 2>&1
echo ""
echo ">>>> End   BOWTIE2 Threaded Run ..."

Notice the difference in the '-l select' line where the resource 'chunk' now includes 2 cores (ncpus=2) and requests twice as much memory as before. Also, notice that the BOWTIE2 command-line now includes the '-p 2' option to run the code with 2 threads working in parallel. Perfectly or 'embarrassingly' parallel workloads can run close to 2, 4, or more times as fast as the same workload in serial mode depending on the number of threads requested, but workloads cannot be counted on to be perfectly parallel.

The speed-ups that you observe will typically be less than perfect and diminish as you ask for more cores-threads. Larger jobs will typically scale more efficiently as you add cores-threads, but users should take note of the performance gains that they see as cores-threads are added and select a core-thread count that provides efficient scaling and avoids diminishing returns.
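
One simple way to measure these gains is to time the threaded step itself and compare runs made with different '-p' values. A minimal sketch using the bash 'time' keyword follows (its timing report goes to standard error and so ends up in the PBS error file); it would replace the alignment lines in the threaded script above:

echo ">>>> Align Sequence ..."
time bowtie2 -p 2 -x lambda_virus -U reads_1.fq -S eg1.sam > lambda_virus_align2.out 2>&1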

BPP2

BPP2 uses a Bayesian modeling approach to generate the posterior probabilities of species assignments taking into account uncertainties due to unknown gene trees and the ancestral coalescent process. For tractability, it relies on a user-specified guide tree to avoid integrating over all possible species delimitations. Additional information can be found at the download site here [18].

At the CUNY HPC Center, BPP2 version 2.1b is installed. BPP2 is a serial code that takes its input from a simple text file provided on the command line. Below is an example PBS script that will run the fence lizard test case provided with the distribution archive (/share/apps/bpp2/default/examples).

To include all required environmental variables and the path to the BPP2 executable run the modules load command (the modules utility is discussed in detail above):

module load bpp2

Here is an example serial PBS batch script for BPP2:

#!/bin/bash
#PBS -q production
#PBS -N BPP2_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Invoke the executable in command-line mode to run
echo ">>>> Begin BPP2 Serial Run ..."
bpp2 ./lizard.bpp.ctl > bpp2_ser.out 2>&1
echo ">>>> End   BPP2 Serial Run ..."

This script can be dropped into a file (say bpp2.job) and started with the command:

qsub bpp2.job

Running the fence lizard test case should take less than 15 minutes and will produce PBS output and error files beginning with the job name 'BPP2_serial'. The primary BPP2 application results will be written into the user-specified file at the end of the BPP2 command line after the greater-than sign. Here it is named 'bpp2_ser.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

BROWNIE

BROWNIE is a program for analyzing rates of continuous character evolution and looking for substantial rate differences in different parts of a tree using likelihood ratio tests and Akaike Information Criterion (AIC) statistics. It now also implements many other methods for examining trait evolution and methods for doing species delimitation.

BROWNIE (version 1.2) is installed on the Andy cluster under the directory "/share/apps/brownie/default/bin/". The directory "/share/apps/brownie/default/examples/" contains two example files.

In order to run one of these examples on Andy follow the steps:

1) create a directory and "cd" there:

mkdir ./brownie_test  && cd ./brownie_test

2) Copy the example input deck to the current directory:

cp /share/apps/brownie/default/examples/ratetest_example.nex ./

3) Create a PBS submit script. Use your favorite text editor to put the following lines into file "brownie_serial.job"

#!/bin/bash
#PBS -q production
#PBS -N BROWNIE_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin BROWNIE Serial Run ..."
brownie ./ratetest_example.nex > brownie_ser.out 2>&1
echo ">>>> End   BROWNIE Serial Run ..."

4) Load the BROWNIE module to include all required environmental variables and the path to the BROWNIE executable (the modules utility is discussed in detail above).

module load brownie

5) Submit the job to the PBS queue using:

qsub brownie_serial.job

Running the rate test case should take less than 15 minutes and will produce PBS output and error files beginning with the job name 'BROWNIE_serial'. The primary BROWNIE application results will be written into the user-specified file at the end of the BROWNIE command line after the greater-than sign. Here it is named 'brownie_ser.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

One can check the status of the job using the "qstat" command (see the example after the file list below). Upon successful completion the following files will be generated:

BrownieBatch.nex
BROWNIE_serial.eXXXX  --- standard error from PBS
BrownieLog.txt
BROWNIE_serial.oXXXX  --- standard output from PBS
RatetestOutput.txt    --- result returned by BROWNIE
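
As noted above, job status can be checked with 'qstat' while the job is queued or running; for example (a minimal sketch, where <JID> is the job ID reported by 'qsub'):

qstat -u your.name     # list your own jobs and their current state
qstat -f <JID>         # full details for one job, including its start time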

CGAL

The Computational Geometry Algorithms Library (CGAL) offers data structures and algorithms including:

  * triangulations (2D constrained triangulations, and Delaunay and periodic triangulations in 2D and 3D)
  * Voronoi diagrams (for 2D and 3D points, 2D additively weighted Voronoi diagrams, and segment Voronoi diagrams)
  * polygons (Boolean operations, offsets, straight skeleton)
  * polyhedra (Boolean operations)
  * arrangements of curves and their applications (2D and 3D envelopes, Minkowski sums)
  * mesh generation (2D Delaunay mesh generation, 3D surface and volume mesh generation, skin surfaces)
  * geometry processing (surface mesh simplification, subdivision and parameterization, estimation of local differential properties, and approximation of ridges and umbilics)
  * alpha shapes
  * convex hull algorithms (in 2D, 3D, and dD)
  * search structures (kd trees for nearest neighbor search, and range and segment trees)
  * interpolation (natural neighbor interpolation and placement of streamlines)
  * shape analysis, fitting, and distances (smallest enclosing sphere of points or spheres, smallest enclosing ellipsoid of points, principal component analysis)
  * kinetic data structures

The library is installed on PENZIAS.
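
No module or compile example is documented here, so the line below is only a sketch; the install prefix, the need to link 'libCGAL' explicitly, and the GMP/MPFR flags are assumptions that depend on how CGAL was built on PENZIAS, and 'my_hull.cpp' is a hypothetical source file:

g++ -O2 -I/share/apps/cgal/default/include my_hull.cpp -L/share/apps/cgal/default/lib -lCGAL -lgmp -lmpfr -o my_hull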

CONSED

CONSED is a DNA sequence analysis finishing tool that provides sequence viewing, editing, alignment, and assembly capabilities from an X Windows graphical user interface (GUI). It makes extensive use of other non-graphical, underlying sequence analysis tools, including PHRED, PHRAP, and CROSSMATCH, that may also be used separately and are described elsewhere in this document. It also includes a viewer called BAMVIEW. The CONSED tool chain is developed and maintained at the University of Washington and is described more completely here [19]. CONSED is provided at the CUNY HPC Center under an academic license that allows use, but not the copying or outbound transfer, of any of the executables or files distributed under this academic license. The license is not transferable in any way, and users wishing to run the application at their own site must acquire a license directly from the authors.

The CUNY HPC Center supports CONSED version 23.0 for interactive use on KARLE. CONSED 23.0 and the tool chain described above are also installed on ANDY to allow for the batch use of the underlying support tools mentioned above and described in detail below. In general, running GUI-based applications on ANDY's login node is discouraged. There should be little need to do this, as KARLE is on the periphery of the CUNY HPC network, making login there direct, and KARLE shares its HOME directory file system with ANDY, making files created on either system immediately available on the other.

Rather than rewrite portions of the CONSED manual here, users are directed to the manual's "Quick Tour" section here [20] and asked to walk through some of the exercises after logging into KARLE. If problems or questions come up, please post them to "hpchelp@csi.cuny.edu". The CONSED 23.0 distribution is installed on KARLE in the following directory:

/share/apps/consed/default

All the files in the distribution can be found there.
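
A typical interactive session would use X11 forwarding so the CONSED GUI can display on your local screen. The sketch below makes two assumptions: that KARLE's login address is 'karle.csi.cuny.edu' and that a 'consed' modulefile is provided; adjust as needed:

ssh -X your.name@karle.csi.cuny.edu    # X11 forwarding; address is an assumption
module load consed                     # assumes a 'consed' modulefile exists
consed &                               # start the CONSED GUI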

CP2K

CP2K is a program to perform atomistic and molecular simulations of solid state, liquid, molecular, and biological systems. It provides a general framework for different methods, such as density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW), and classical pair and many-body potentials. CP2K provides state-of-the-art methods for efficient and accurate atomistic simulations.

At the CUNY HPC Center CP2K version 2.3 is installed on ANDY. CP2K can be built as a serial, MPI-parallel, or MPI-OpenMP-parallel code. At this time, only the MPI-parallel version of the application has been built for production use at the HPC Center. Further information on CP2K is available at the website here [21].

Below is an example PBS script that will run the CP2K H2O-32 test case provided with the CP2K distribution. It can be copied from the local installation directory to your current location as follows:

cp /share/apps/cp2k/2.3/tests/SE/regtest-2/H2O-32.inp .

To include all required environmental variables and the path to the CP2K executable run the modules load command (the modules utility is discussed in detail above).

module load cp2k

Here is the example PBS script:

#!/bin/bash
#PBS -q production
#PBS -N CP2K_MPI.test
#PBS -l select=8:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo ""
echo -n ">>>> PBS Master compute node is: "
hostname

# Change to working directory
cd $PBS_O_WORKDIR

echo ">>>> Begin CP2K MPI Parallel Run ..."
mpirun -np 8 -machinefile $PBS_NODEFILE cp2k.popt ./H2O-32.inp > H2O-32.out 2>&1
echo ">>>> End   CP2K MPI Parallel Run ..."

This script can be dropped into a file (say cp2k.job) and started with the command:

qsub cp2k.job

Running the H2O-32 test case should take less than 5 minutes and will produce PBS output and error files beginning with the job name 'CP2K_MPI.test'. The CP2K application results will be written into the user-specified file at the end of the CP2K command line after the greater-than sign. Here it is named 'H2O-32.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered above in the PBS section. The most important lines are the '#PBS -l select=8:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 8 resource 'chunks' with 1 processor (core) and 2,880 MBs of memory in each for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

CUFFLINKS

CUFFLINKS assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. CUFFLINKS then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols. CUFFLINKS is part of a sequence alignment and analysis tool chain developed at Johns Hopkins, the University of California at Berkeley, and Harvard, and distributed through the Center for Bioinformatics and Computational Biology. The other tools in this collection, BOWTIE, SAMTOOLS, and TOPHAT, are also installed at the CUNY HPC Center. Additional information can be found at the CUFFLINKS home page here [22].

At the CUNY HPC Center CUFFLINKS version 2.0.2 is installed on ANDY. CUFFLINKS is a parallel threaded code (pthreads) that takes its input from a simple text file provided on the command line. Below is an example PBS script that will run the messenger RNA test case provided at the website here [23].

To include all required environmental variables and the path to the CUFFLINKS executable run the modules load command (the modules utility is discussed in detail above):

module load cufflinks

Running 'cufflinks' from the interactive prompt without any options will provide a brief description of the form of the command-line arguments and options. Here is a PBS batch script that runs this test case in serial mode:

#!/bin/bash
#PBS -q production
#PBS -N CLINKS_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Invoke the executable in command-line mode to run
echo ">>>> Begin CLINKS Serial Run ..."
cufflinks ./mRNA_test.sam > mRNA_test.out 2>&1
echo ">>>> End   CLINKS Serial Run ..."

This script can be dropped into a file (say cufflinks.job) and started with the command:

qsub cufflinks.job

Running the mRNA test case should take less than 1 minute and will produce PBS output and error files beginning with the job name 'CLINKS_serial'. The primary CUFFLINKS application results will be written into the user-specified file at the end of the CUFFLINKS command line after the greater-than sign. Here it is named 'mRNA_test.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

To run CUFFLINKS in parallel-threads mode several changes to the script are required. Here is a modified script that shows how to run CUFFLINKS using two threads. ANDY has as many as 8 physical compute cores per compute node and therefore as many as 8 threads might be chosen, but the larger the number of cores-threads requested the longer the job may wait to start as PBS looks for a compute node with the free resources requested.

#!/bin/bash
#PBS -q production
#PBS -N CLINKS_threads
#PBS -l select=1:ncpus=2:mem=5760mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Invoke the executable in command-line mode to run
echo ">>>> Begin CLINKS Threaded Run ..."
cufflinks -p 2 ./clinks_ptest.sam > clinks_ptest.out 2>&1
echo ">>>> End   CLINKS Threaded Run ..."

Notice the difference in the '-l select' line where the resource 'chunk' now includes 2 cores (ncpus=2) and requests twice as much memory as before. Also, notice that the CUFFLINKS command-line now includes the '-p 2' option to run the code with 2 threads working in parallel. Perfectly or 'embarrassingly' parallel workloads can run close to 2, 4, or more times as fast as the same workload in serial mode depending on the number of threads requested, but workloads cannot be counted on to be perfectly parallel.

The speed-ups that you observe will typically be less than perfect and diminish as you ask for more cores-threads. Larger jobs will typically scale more efficiently as you add cores-threads, but users should take note of the performance gains that they see as cores-threads are added and select a core-thread count that provides efficient scaling and avoids diminishing returns.

DL_POLY

DL_POLY is a general purpose molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith, T.R. Forester and I.T. Todorov. Both serial and parallel versions are available. The original package was developed by the Molecular Simulation Group (now part of the Computational Chemistry Group, MSG) at Daresbury Laboratory under the auspices of the Engineering and Physical Sciences Research Council (EPSRC) for the EPSRC's Collaborative Computational Project for the Computer Simulation of Condensed Phases ( CCP5). Later developments were also supported by the Natural Environment Research Council through the eMinerals project. The package is the property of the Central Laboratory of the Research Councils, UK.

DL_POLY versions 2.20 and 3.10 are installed on ANDY under the user applications directory

/share/apps/dlpoly

The default is currently version 2.20, but users can select the more current version by referencing the 3.10 directory explicitly in their scripts.
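
For example (a sketch only; the layout of the 3.10 tree is assumed to mirror the 'default' tree described below), the executable might be referenced as either:

/share/apps/dlpoly/default/bin/dlpoly     # current default, version 2.20
/share/apps/dlpoly/3.10/bin/dlpoly        # explicit version 3.10 (path assumed)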

To run DL_POLY the user needs to provide a set of several files. Those that are required include:

1) The CONTROL file, which indicates to DL_POLY what kind of simulation you want to run, how much data you want to gather, and for how long you want the simulation to run.

2) The CONFIG file, which contains the atom positions, and, depending on how the file was created (e.g. whether this is a configuration created from ‘scratch’ or the end point of another run), the atom's velocities and forces.

3) The FIELD file, which specifies the nature of the intermolecular interactions, the molecular topology, and the atomic properties, such as charge and mass.

Sometimes you may also require one or more of the following optional files:

4) The TABLE file, which contains short-ranged potential and force arrays for functional forms not available within DL_POLY (usually because they are too complex, e.g. spline potentials).

5) The TABEAM file, which contains metal potential arrays for non-analytic or too complex functional forms.

6) The REFERENCE file, which is similar to the CONFIG file and contains the "perfect" crystalline structure of the system.

Several directories are included in the installation tree. The primary executable for DL_POLY and a number of other supporting scripts are located in the directory:

/share/apps/dlpoly/default/bin

A collection of example input files that you may use as test cases are located in the directory:

/share/apps/dlpoly/default/data

The user and installation guide in PDF format are located in the directory:

/share/apps/dlpoly/default/man

Support utilities and programs are found in:

/share/apps/dlpoly/default/utility
/share/apps/dlpoly/default/public

To test DL_POLY, copy the files in

/share/apps/dlpoly/default/data/TEST10/LF

to a working directory (e.g. 'dlpoly') in your $HOME directory.
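
A minimal sketch of this copy step, using the paths given above, might be:

mkdir -p $HOME/dlpoly && cd $HOME/dlpoly
cp /share/apps/dlpoly/default/data/TEST10/LF/* .

Then run the PBS script provided below using the PBS submission command: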

qsub dlpoly.job

The contents of 'dlpoly.job' are as follows:

#!/bin/bash
#PBS -q production
#PBS -N DLPOLY_mpi
#PBS -l select=8:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel executable to run
echo ">>>> Begin DLPOLY MPI Run ..."
mpirun -np 8 -machinefile $PBS_NODEFILE dlpoly > dlpoly_mpi.out 2>&1
>>> Begin DLPOLY">
echo ">>>> End   DLPOLY MPI Run ..."

Please refer to the DL_POLY manual for more detailed information on DL_POLY and to the general PBS section in this Wiki for more details on the PBS queuing system. There are a number of tutorials and further information to be found online by Googling.

GAMESS-US

GAMESS is a program for ab initio molecular quantum chemistry. Briefly, GAMESS can compute SCF wavefunctions including RHF, ROHF, UHF, GVB, and MCSCF. Correlation corrections to these SCF wavefunctions include Configuration Interaction, second-order perturbation theory, and Coupled-Cluster approaches, as well as the Density Functional Theory approximation. Excited states can be computed by CI, EOM, or TD-DFT procedures. Nuclear gradients are available, for automatic geometry optimization, transition state searches, or reaction path following. Computation of the energy hessian permits prediction of vibrational frequencies, with IR or Raman intensities. Solvent effects may be modeled by the discrete Effective Fragment potentials, or continuum models such as the Polarizable Continuum Model. Numerous relativistic computations are available, including infinite order two component scalar corrections, with various spin-orbit coupling options. The Fragment Molecular Orbital method permits many of these sophisticated treatments to be used on very large systems by dividing the computation into small fragments. Nuclear wavefunctions can also be computed, in VSCF, or with explicit treatment of nuclear orbitals by the NEO code.

The program is installed on PENZIAS. Before running the program, please load the proper module file:

module load gamess-us

Below is a simple test input file named test_1.inp for GAMESS:

!  test_1
!    1-A-1 CH2    RHF geometry optimization using GAMESS.
!
!    Although internal coordinates are used (COORD=ZMAT),
!    the optimization is done in Cartesian space (NZVAR=0).
!    This run uses a criterion (OPTTOL) on the gradient
!    which is tighter than default, but very safe.
!
!    This job tests the sp integral module, the RHF module,
!    and the geometry optimization module.
!
!    Using the default search METHOD=STANDARD,
!    FINAL E= -37.2322678015, 8 iters, RMS grad= .0264308
!    FINAL E= -37.2351919062, 7 iters, RMS grad= .0202617
!    FINAL E= -37.2380037239, 7 iters, RMS grad= .0013100
!    FINAL E= -37.2380352917, 8 iters, RMS grad= .0007519
!    FINAL E= -37.2380396312, 5 iters, RMS grad= .0001615
!    FINAL E= -37.2380397693, 5 iters, RMS grad= .0000067
!    FINAL E= -37.2380397698, 3 iters, RMS grad= .0000004
!
 $CONTRL SCFTYP=RHF RUNTYP=OPTIMIZE COORD=ZMT NZVAR=0 $END
 $SYSTEM TIMLIM=1 $END
 $STATPT OPTTOL=1.0E-5  $END
 $BASIS  GBASIS=STO NGAUSS=2 $END
 $GUESS  GUESS=HUCKEL $END
 $DATA
Methylene...1-A-1 state...RHF/STO-2G
Cnv  2

C
H  1 rCH
H  1 rCH  2 aHCH

rCH=1.09
aHCH=110.0
 $END

The program needs to create scratch space on the local node in order to keep small temporary files. Creating the scratch space is done in the following script, so users should not remove or modify the following lines:

export MY_SCRDIR=`whoami;date '+%m.%d.%y_%H:%M:%S'`
export MY_SCRDIR=`echo $MY_SCRDIR | sed -e 's; ;_;'`
export GAMESS_SCRDIR=/state/partition1/g09_scr/${MY_SCRDIR}_$$
mkdir -p $GAMESS_SCRDIR
echo $GAMESS_SCRDIR
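
For example, a job submitted by a (hypothetical) user 'jane.doe' on May 12, 2014 would write its temporary files under a directory named something like:

/state/partition1/g09_scr/jane.doe_05.12.14_10:22:33_12345

where the trailing number is the process ID of the submitting script, appended via '$$'.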

The PBS start up script is given below.

#!/bin/bash
# This script runs a 4-cpu GAMESS-US job
# with the 4 cpus packed onto a single compute node 
# to ensure that it will run as an SMP parallel job.
#PBS -q production
#PBS -N geom
#PBS -l select=1:ncpus=4:mem=7680mb
#PBS -l place=free
#PBS -V

# print out name of master execution host (compute node)
echo ""
echo -n ">>>> PBS Master compute node is: "
hostname

# set the working directory

cd $PBS_O_WORKDIR

# set the name and location of the GAMESS-US scratch directory
# on the compute node.  This is where one needs to go
# to remove left-over script files.

export MY_SCRDIR=`whoami;date '+%m.%d.%y_%H:%M:%S'`
export MY_SCRDIR=`echo $MY_SCRDIR | sed -e 's; ;_;'`

export GAMESS_SCRDIR=/state/partition1/g09_scr/${MY_SCRDIR}_$$
mkdir -p $GAMESS_SCRDIR

echo $GAMESS_SCRDIR
 
# start the  job

gamess 01 test_1.inp -n 4 > test_1.out

# remove the scratch directory before terminating

/bin/rm -r $GAMESS_SCRDIR

echo 'Job is done!'

GARLI

GARLI is a program that performs phylogenetic inference using the maximum-likelihood criterion. Several sequence types are supported, including nucleotide, amino acid and codon. Version 2.0 adds support for partitioned models and morphology-like data types. It is usable on all operating systems, and is written and maintained by Derrick Zwickl at the University of Texas at Austin. Additional information can be found on the GARLI Wiki here [24].

At the CUNY HPC Center, GARLI version 2.0 is installed on ANDY. GARLI has both a serial and an MPI parallel version that takes its input from a simple text configuration file ('garli.conf') and a '.nex' sequence file ('rana.nex' for instance). Like other applications on ANDY, GARLI path and environment variables are controlled using the modules utility. To include all required environmental variables and the path to the GARLI executable run the modules load command (the modules utility is discussed in detail above):

module load garli

Below is an example PBS script that will run the frog ('rana.nex') test case provided with the distribution archive (/share/apps/garli/default/examples/basic). Users can copy the necessary files from this location.
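
For example, assuming both files sit in the example directory named above, they can be copied into your working directory with:

cp /share/apps/garli/default/examples/basic/garli.conf .
cp /share/apps/garli/default/examples/basic/rana.nex .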

#!/bin/bash
#PBS -q production
#PBS -N GARLI_mpi
#PBS -l select=2:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel executable to run
echo ">>>> Begin GARLI MPI Run ..."
mpirun -np 2 -machinefile $PBS_NODEFILE garli_mpi > garli_mpi.out 2>&1
>>> Begin GARLI">
echo ">>>> End   GARLI MPI Run ..."

This script can be dropped into a file (say garli_mpi.job) and started with the command:

qsub garli_mpi.job

Running the 'rana.nex' test case should take less than 15 minutes and will produce PBS output and error files beginning with the job name 'GARLI_mpi'. The primary GARLI application results will be written into the user-specified file at the end of the GARLI command line after the greater-than sign. Here it is named 'garli_mpi.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered in the PBS section above. The most important lines here are the '#PBS -l select=2:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 2 resource 'chunks', each with 1 processor (core) and 2,880 MBs of memory in it, for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

GAUSS

GAUSS is an easy-to-use data analysis, mathematical, and statistical environment based on the powerful, fast, and efficient GAUSS Matrix Programming Language. GAUSS is used to solve real-world problems and data analysis problems of exceptionally large scale. GAUSS version 3.2.27 is currently available on ANDY and BOB. At the CUNY HPC Center GAUSS is typically run in serial mode. (Note: GAUSS should not be confused with the computational chemistry application Gaussian.)

A PBS Pro submit script for GAUSS that runs on 1 processor (core) follows:

#!/bin/bash
#PBS -q production
#PBS -N GAUSS_job
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname
echo ""

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

#
# Point to the serial executable to run
#
echo ">>>> Begin GAUSS Serial Run ..."
# Static executable
/share/apps/gauss/default/tgauss.static < ./pxyz.e > pxyz.out
#
# Dynamic executable
#/share/apps/gauss/default/tgauss < ./quantile.e > gauss.out
#
echo ">>>> End   GAUSS Serial Run ..."

Here, the file pxyz.e was taken from the GAUSS examples in /share/apps/gauss/examples. Upon successful completion, a run file "graphic.tkf" should be created in the working directory.

pxyz.e:

library pgraph;
graphset;

let v = 100 100 640 480 0 0 1 6 15 0 0 2 2;
wxyz = WinOpenPQG( v, "XYZ Plot", "XYZ" );
call WinSetActive( wxyz );

begwind;
makewind(9,6.855,0,0,1);
makewind(9/2.9,6.855/2.9,0,0,0);
makewind(9/2.9,6.855/2.9,0,3.8,0);
_psurf = 0;
title("\202XYZ Curve - \201Toroidal Spiral");
fonts("simplex complex");
xlabel("X");
ylabel("Y");
zlabel("Z");

setwind(1);
t = seqa(0,.0157,401);
a = .2; b=.8; c=20;
x = 3*((a*sin(c*t)+b) .* cos(t));
y = 3*((a*sin(c*t)+b) .* sin(t));
z = a*cos(c*t);
margin(.5,0,0,0);
ztics(-.3,.3,.3,0);
_pcolor = 10;
view(-3,-2,4);
volume(1,1,.7);
_plwidth = 5;
xyz(x,y,z);

nextwind;
margin(0,0,0,0);
title("");
x = x .* (sin(z)/10);
_paxes = 0;
_pframe = 0;
_pbox = 13;
_pcolor = 11;
_plwidth = 0;
view(15,2,10);
xyz(x,y,z);

nextwind;
_pcolor = 9;
a = .4; b=.4; c=15;
x = 3*((a*sin(c*t)+b) .* cos(t));
y = 3*((a*sin(c*t)+b) .* sin(t));
z = a*cos(c*t);
volume(1,1,.4);
xyz(x,y,z);

endwind;

call WinSetActive( 1 );

Gaussian

Gaussian is a set of programs for calculating electronic structure. It is available on BOB, which is the computer dedicated to running only Gaussian jobs; this means that there are no modules on that server, i.e. users do not need to load a module prior to execution. The program is also available on ANDY. Please note that even without modules, jobs are still submitted via the queuing system (see below), i.e. users must create and use a PBS script in order to run a Gaussian job. BOB's nodes have 2 x 4-core processors, i.e. the maximum number of cores available on one node on BOB is 8. In practice a total of 192 cores are allocated for Gaussian jobs, which is almost the entire system. Parallel runs with fewer than 8 cores are possible, as are serial runs, and these may end up being scheduled to run by PBS sooner than full 8-way parallel jobs because nodes with 2 or 4 free cores are easier to find on a busy system.

Gaussian jobs are submitted via the PBS queue 'production_gau' (see the PBS script below). As the level of utilization across our systems varies, Gaussian users should be prepared to submit jobs of any even core-count (2, 4, 6, or 8) to either system by keeping working directories and model scripts on each system. This will give the user options and limit the time their jobs spend waiting in the PBS Gaussian queue. Users should learn how, and be prepared, to submit jobs in any multiple of 2 cores to increase the chances of being scheduled immediately, rather than only waiting for an 8-core node to become completely free. In constructing runs for fewer than 8 cores, users should reduce the resources requested in the PBS '-l select' statement in a proportional manner from the compute-node maximums on the system being used. The HPC Center staff has produced example PBS scripts that demonstrate the differences between runs on 2, 4, and 8 cores. These examples can be requested by sending an email to 'hpchelp@csi.cuny.edu'.

Gaussian Scratch File Storage Space

Scratch space for Gaussian's temporary files is node-local, i.e. all temporary files are written to a local disk installed on each node, not in the user's own directory, and are not counted against the user's quota. The path to these compute-node scratch directories is established in the Gaussian PBS script and is '/state/partition1/[g03_scr,g09_scr]', depending on whether the job is a Gaussian 03 or Gaussian 09 job. If a single Gaussian job is using all the cores on a particular node (this is often the case), then that entire local scratch space is available to that job, assuming files from previous jobs have been cleaned up. Users must not edit their PBS scripts to place Gaussian scratch files anywhere other than the directories used in the recommended scripts. In particular, users MUST NOT place their scratch files in their home directories. The home directory is backed up to tape, and backing up large integrals files to tape will unnecessarily waste backup tapes and increase backup time-to-completion.

Users are encouraged to ensure that their scratch file data is removed after each completed Gaussian run. The example PBS script below for submitting Gaussian jobs includes a final line to remove scratch files, but this is not always successful. You may have to manually remove your scratch files. The example script prints out both the node where the job was run and the unique name of each Gaussian job's scratch directory. Please police your own use of Gaussian scratch space on BOB by going to '/state/partition1/[g09_scr, g03_scr]' and looking for directories that begin with your name and the date that the directory was created. For example:

bob$ ssh compute-0-11
bob$
bob$ cd /state/partition1/g09_scr
bob$
bob$ ls
a.eisenberg_09.27.11_12:54:15_25320  a.eisenberg_10.04.11_15:54:18_27212  ffernandez_09.25.11_19:15:55_2297    jarzecki_09.30.11_15:04:28_1661
a.eisenberg_09.27.11_14:37:11_19643  a.eisenberg_10.04.11_15:54:20_29710  ffernandez_09.26.11_12:33:13_15542  michael.green_09.21.11_15:19:23_5986
bob$
bob$ /bin/rm -r michael.green_09.21.11_15:19:23_5986
bob$

Above, a job that created a scratch directory on 9.21.11 is removed because the user (michael.green) knows that this job has completed and the files are no longer needed. If you are not sure when your job started, you can get this information from the full listing of your job's PBS output (qstat -f JID) and looking for the 'stime' or the start time for the job. Clearly, you do not wish to remove the directories of jobs that are currently running.

The HPC Center has created a special PBS resource ('lscratch') to determine the amount of scratch space available at runtime and to start jobs ONLY if that amount of scratch space is available. Users need to be able to accurately estimate the scratch space they require to efficiently set this flag in their PBS script. Jobs requiring the maximum available (~800 GBytes) should allocate an entire, 8-core compute node to themselves and use all eight cores for the run. As a good rule of thumb you can request 100 GBs per core requested in the PBS script, although this is NOT guaranteed to be enough. Finally, Gaussian users should note that Gaussian scratch files are NOT backed up. Users are encouraged to save their checkpoint files in their PBS 'working directories' on /home/user.name if they will be needed for future work. From this location, they will be backed up. Again, Gaussian scratch files in /home/gaussian are NOT backed up.

NOTE: If other users have failed to clean up after themselves, and you request the maximum amount of Gaussian scratch space, it may not be available and your job may sit in the queue.

Gaussian PBS Job Submission

Gaussian parallel jobs are limited by the number of cores on a single compute node. Eight (8) is the maximum processor (core) count on BOB, and the memory per core is 1,920 MBs. Here, we provide a simple Gaussian input file (a Hartree-Fock geometry optimization of methane) and the companion PBS batch submit script that would allocate 4 cores on a single compute node and 400 GBytes of compute-node-local storage.

SPECIAL NOTE ON Gaussian03: The PBS batch script below is written to select Gaussian09, but would work for Gaussian03 if all the '09' strings it contains were edited to '03'. In addition, the following line (fix) MUST be included in the Gaussian03 script to get it to work. This is because the Gaussian03 binaries are very old and now only work with a much older release of the Portland Group Compiler that we happened to have saved (release 10.3). This line can be added to the script any place before the Gaussian executable is invoked.

setenv LD_LIBRARY_PATH /share/apps/pgi/10.3/linux86-64/10.3/libso:"$LD_LIBRARY_PATH"

The Gaussian 09 methane input deck is:

%chk=methane.chk
%mem=8GB
%nproc=4
# hf/6-31g

Title Card Required

0 1
 C                  0.80597015   -1.20895521    0.00000000
 H                  1.16262458   -2.21776521    0.00000000
 H                  1.16264299   -0.70455702    0.87365150
 H                  1.16264299   -0.70455702   -0.87365150
 H                -0.26402985   -1.20894202    0.00000000

END

Notice that we have explicitly requested 8 GBytes of memory with the '%mem=8GB' directive. This will allow the job to make full use of half of the memory available on a single BOB compute node. The input file also instructs Gaussian to use 4 processors, which will ensure that all of Gaussian's parallel executables (i.e. links) run in SMP mode with 4 cores. For this simple methane geometry optimization, requesting these resources (both here and in the PBS script) is a bit extravagant, but both the input file and the script can be adapted to other, more substantial molecular systems running more accurate calculations. Users can make pro-rated adjustments to the resources requested in BOTH the Gaussian input deck and the PBS submit script to run jobs on 2, 6, or even 8 cores.

Here is the Gaussian PBS script:

#!/bin/csh
# This script runs a 4-cpu (core) Gaussian 09 job
# with the 4 cpus packed onto a single compute node 
# to ensure that it will run as an SMP parallel job.
#PBS -q production_gau
#PBS -N methane_opt
#PBS -l select=1:ncpus=4:mem=7680mb:lscratch=400gb
#PBS -l place=free
#PBS -V

# print out name of master execution host (compute node)
echo ""
echo -n ">>>> PBS Master compute node is: "
hostname

# set the G09 root directory

setenv g09root /share/apps/gaussian

# set the name and location of the G09 scratch directory
# on the compute node.  This is where one needs to go
# to remove left-over scratch files.

setenv MY_SCRDIR `whoami;date '+%m.%d.%y_%H:%M:%S'`
setenv MY_SCRDIR `echo $MY_SCRDIR | sed -e 's; ;_;'`

setenv GAUSS_SCRDIR /state/partition1/g09_scr/${MY_SCRDIR}_$$
mkdir -p $GAUSS_SCRDIR

echo $GAUSS_SCRDIR

# run the G09 setup script

source $g09root/g09/bsd/g09.login

# users must explicitly change to their working directory with PBS

cd $PBS_O_WORKDIR

# start the G09 job

$g09root/g09/g09 methane.input

# remove the scratch directory before terminating

/bin/rm -r $GAUSS_SCRDIR

echo 'Job is done!'

To run the job, one must use the standard PBS job submission command as follows:

qsub g09.job

Some of this PBS script's features are worth detailing. First, note that Gaussian scripts conventionally are run using the C-shell. Next, the '-l select' directive requests one PBS resource chunk (see the PBS Pro section below for the definition of a resource chunk) which includes 4 processors (cores) and nearly 8 GBytes of memory. The '-l select' directive also instructs PBS to check to see if there are any compute nodes available with 400 GBytes of storage using the 'lscratch=400gb' directive. As a Gaussian user, you must be able to estimate the amount of scratch storage space your job will need. PBS will keep this job in a queued state until sufficient resources, including sufficient scratch storage space, are found to run the job. Previously completed jobs that have not cleaned up their scratch files may prevent this job from running. The amount of scratch requested is presumed by PBS to be the amount that will be used; therefore, requesting more scratch space than is required by the job may also prevent subsequent jobs from running that might otherwise have the space to run.

Working down further in the script, the '-l place=free' directive tells PBS to place the chunk defined in the '-l select' statement onto the compute node that is least busy, but still has the 4 free cores required by this chunk. If no node large enough exists at all, or none is available at the time of job submission, the job will be queued (perhaps indefinitely). Note that PBS does not necessarily reject submitted jobs that mistakenly request a resource chunk that cannot be filled on the system where the job was submitted; such a job may just remain in the Q-state indefinitely. Furthermore, the messages explaining Q-state delays at the end of the 'qstat -f JID' output are often not very informative. Jobs that wait a long time in the Q-state may have scripting errors that prevent them from running.

Further along in the script, the Gaussian09 environment variables are set, and the location and name of the job's scratch directory are defined. On BOB this directory will always be placed in '/state/partition1' on the compute node that PBS assigns to the job. A job's scratch directory will be given a name composed of the user's name, the date and time of creation, and the process ID unique to the job. Finally, the script calls the master Gaussian 09 executable, 'g09', to start the job. After job completion, this script should automatically remove the scratch files it created in the scratch directory. Please verify that this has occurred.

Users may choose to run jobs with fewer processors (cores, cpus) and smaller storage space requests than this sample job. This includes one-processor jobs and others using a fraction of a compute node (2 processors, 4 processors, 6 processors). On a busy system, these smaller jobs may start sooner than those requesting a full 8 processors packed on a single node. Selecting the most efficient combination of processors, memory, and storage will ensure that resources are not wasted and will be available to allocate to the next job submitted.

All users of Gaussian that publish based on its results must include the following citation in the publication to be in compliance with the terms of the license:

Gaussian [03,09], Revision C.02, M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, J. A. Montgomery, Jr., T. Vreven, K. N. Kudin, J. C. Burant, J. M. Millam, S. S. Iyengar, J. Tomasi, V. Barone, B. Mennucci, M. Cossi, G. Scalmani, N. Rega, G. A. Petersson, H. Nakatsuji, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, M. Klene, X. Li, J. E. Knox, H. P. Hratchian, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, P. Y. Ayala, K. Morokuma, G. A. Voth, P. Salvador, J. J. Dannenberg, V. G. Zakrzewski, S. Dapprich, A. D. Daniels, M. C. Strain, O. Farkas, D. K. Malick, A.D. Rabuck, K. Raghavachari, J. B. Foresman, J. V. Ortiz, Q. Cui, A. G. Baboul, S. Clifford, J. Cioslowski, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. L. Martin, D. J. Fox, T. Keith, M. A. Al-Laham, C. Y. Peng, A. Nanayakkara, M. Challacombe, P. M. W. Gill, B. Johnson, W. Chen, M. W. Wong, C. Gonzalez, and J. A. Pople, Gaussian, Inc., Wallingford CT, 2004.

GMP

GMP is a library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating-point numbers. There is no practical limit to the precision except the ones implied by the available memory in the machine GMP runs on. GMP has a rich set of functions, and the functions have a regular interface. The library is installed on PENZIAS.
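
GMP is a linkable C library rather than a standalone application, so it is used by compiling your own code against it. A minimal, hedged sketch is shown below; the source file name 'my_bignum.c' is only illustrative, and it assumes the compiler can find the GMP headers and library on its default search paths (add -I and -L flags if the PENZIAS installation lives elsewhere):

gcc my_bignum.c -o my_bignum -lgmp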

Gnuplot

Gnuplot is a portable command-line driven graphing utility. It is installed on the following systems:

  • Karle under /usr/bin/gnuplot
  • Andy under /share/apps/gnuplot/default/bin/gnuplot
  • Bob under /share/apps/gnuplot/default/bin/gnuplot

Extensive documentation of gnuplot is available at the gnuplot homepage.
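
Because jobs on these systems normally run non-interactively, gnuplot is most conveniently driven in batch from a command file or a one-line '-e' expression. A minimal sketch (assuming the build includes the PNG terminal; the output file name is illustrative) is:

/share/apps/gnuplot/default/bin/gnuplot -e "set terminal png; set output 'sine.png'; plot sin(x)"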

GENOMEPOP2

GenomePop2 is a newer and specialized version of the older program GenomePop (version 1.0). GenomePop2 (version 2.2) is designed to manage SNPs under more flexible and useful settings that are controlled by the user. If you need models with more than 2 alleles you should use the older GenomePop version of the program.

GenomePop2 allows the forward simulation of sequences of biallelic positions. As in the previous version, a number of evolutionary and demographic settings are allowed. Several populations under any migration model can be implemented. Each population consists of a number N of individuals. Each individual is represented by one (haploids) or two (diploids) chromosomes with constant or variable (hotspot) recombination between binary sites. The fitness model is multiplicative, with each derived allele having a multiplicative effect of (1 - s*h - E) on the global fitness value. By default E = 0, and h = 0.5 in diploids but 1 in homozygotes or in haploids. Selected nucleotide sites undergoing directional selection (positive or negative) in different populations can be defined. In addition, bottleneck and/or population expansion scenarios can be set by the user over a desired number of generations. Several runs can be executed, and a sample of user-defined size is obtained for each run and population. For more detail on how to use GenomePop2, please visit the web site here [25].

The CUNY HPC Center has installed GenomePop2 version 2.2 on ANDY and BOB. GenomePop2 is a serial code that reads all of its input parameters from a file in the user's working directory called 'GP2Input.txt'. How to set up such a file is explained in the How-To section at the GenomePop2 web-site here [26]. The following PBS batch script runs the third example given in the How-To, which defines different SNP ancestral alleles in different populations.

NOTE: Version 1.0.6 of the program has also been installed and can be found at '/share/apps/genomepop/1.0.6/bin/genomepop1'

#!/bin/bash
#PBS -q production
#PBS -N GENPOP2_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin GENPOP2 Serial Run ..."
echo ""
/share/apps/genomepop/default/bin/genomepop2
echo ""
echo ">>>> End   GENPOP2 Serial Run ..."

This script can be dropped into a file (say genomepop2.job) and started with the command:

qsub genomepop2.job

This test case should take less than a minute to run and will produce PBS output and error files beginning with the job name 'GENPOP2_serial'. Details on the meaning of the PBS script are covered above in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run the job will be printed in the PBS output file by the 'hostname' command.

While it is not visible in this PBS script, your customized 'GP2Input.txt' file MUST be present in the working directory for the job. When the job completes, GenomePop2 will have created a subdirectory called 'GP2_Results' with the results files in it. One could easily adapt this script to run GenomePop version 1.
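
A hedged sketch of the overall workflow is shown below; the source location of the 'GP2Input.txt' file is only illustrative, since each user will have built their own file according to the How-To referenced above:

# stage your parameter file in the job's working directory, then submit
cp ~/my_gp2_inputs/GP2Input.txt .
qsub genomepop2.job

# after the job completes, the results appear in the 'GP2_Results' subdirectory
ls GP2_Results/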

GROMACS

GROMACS (Groningen Machine for Chemical Simulations) is a full-featured suite of free software, licensed under the GNU General Public License to perform molecular dynamics simulations -- in other words, to simulate the behavior of molecular systems with hundreds to millions of particles using Newton's equations of motion. It is primarily used for research on proteins, lipids, and polymers, but can be applied to a wide variety of chemical and biological research questions.

The CUNY HPC Center has installed GROMACS, its support tools, and its primary executable on PENZIAS and ANDY. The more recent version, 4.6.3, is installed on PENZIAS in 64-bit form, with and without GPU support. On ANDY the older versions 4.6.1 and 4.5.5 are installed in both single (32-bit) and double (64-bit) precision. Please note that in 4.5.5 some of the executable naming conventions have changed. All versions are set up to use the modules environment on ANDY. The older versions will need to have the older versions of OpenMPI and the Intel compiler loaded after the defaults are unloaded.

On ANDY all of the GROMACS double-precision executables end in the suffix '_d' to distinguish them from the single-precision versions, as in:

mdrun_mpi_d

On PENZIAS the double-precision executables are the default and there is no 32-bit build. Note that not every tool and executable has been implemented in parallel or for the GPU in a given build, even though their names will include the 'mpi' or 'gpu' suffix. The primary executable, 'mdrun', has been, of course, and it is used in the example scripts below. Details on the GROMACS MD software suite can be found at the website here [27] and in the GROMACS manual here: [28].

The PBS batch scripts below demonstrate how to run the MD part of a typical GROMACS computation on ANDY, but before a PBS script can be submitted, the GROMACS module must be loaded to set up the environment. To get the 32-bit default version enter:

module load gromacs

To see that the GROMACS environment has indeed been loaded enter:

module list

which should list GROMACS and the other default modules that are loaded upon login:

[richard.walsh@andy 16cpu]$module list
Currently Loaded Modulefiles:
  1) pbs/11.3.0.121723              3) intel/13.0.1.117               5) gromacs/4.6.0_32bit_mpi_only
  2) cuda/5.0                       4) openmpi/1.6.3_intel

Once the GROMACS module has been loaded, the following script can be used to run a GROMACS test job:

#!/bin/bash
#PBS -N GRMX_32bit
#PBS -q production_qdr
#PBS -l select=16:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# print out PBS master compute node are you running on 
echo -n "The primary compute node for this job was: "
hostname

# You must explicitly change to your working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel, single-precision executable (mdrun_mpi)
mpirun -np 16 -machinefile $PBS_NODEFILE mdrun_mpi -px -pf -s md_para.tpr -o md_para.trr -c md_para.gro -e md_para.edr -g md_para.log > GRMX_32bit.out 2>&1

This PBS script is fairly typical of others provided here on the HPC Center Wiki for running MPI parallel workloads. The line '-l select=16:ncpus=1:mem=2880mb' requests 16 PBS resource 'chunks', each with 1 processor and 2880 MBytes of memory. Next, '-l place=free' instructs PBS to place each processor on the least loaded nodes, wherever they happen to be (no packing of processors on a single node is requested). The '-V' option ensures that the GROMACS environment that we set up with the modules command, along with the other environmental defaults, is passed on to the compute nodes where the PBS job will be run.

The job is being directed to the 'production_qdr' routing queue which uses that half of ANDY that includes the QDR (faster) InfiniBand interconnect.

The comments in the script explain the sections other than the 'mpirun' command that is used to start the GROMACS 'mdrun_mpi' executable with the requested 16 processors. The environment variable '$PBS_NODEFILE' contains the list of the nodes that PBS has allocated to this job.

To run the 64-bit (double-precision) version of the code, the 32-bit module would have to be unloaded with:

module unload gromacs

and replaced with the 64-bit version:

module load gromacs/4.6.0_64bit_mpi_only
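
Where the modules software supports it, the unload/load pair above can usually be collapsed into a single step with the 'module switch' sub-command (a sketch, assuming the module names shown above):

module switch gromacs gromacs/4.6.0_64bit_mpi_only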

On ANDY the 'mpirun' line in the script above would need to have the '_d' suffix added to it to ensure that the 64-bit, double-precision executable was selected, as in:

mpirun -np 16 -machinefile $PBS_NODEFILE mdrun_mpi_d -px -pf -s md_para.tpr -o md_para.trr -c md_para.gro -e md_para.edr -g md_para.log > GRMX_64bit.out 2>&1


The GPU version of the script to run the same problem would look like this:

#!/bin/bash
#PBS -N GRMX_GPU.test
#PBS -q production_gpu
#PBS -l select=1:ncpus=1:ngpus=1:accel=fermi:mem=2880mb
#PBS -l place=free
#PBS -V

# print out PBS master compute node are you running on 
echo -n "The primary compute node for this job was: "
hostname

# You must explicitly change to your working directory in PBS
cd $PBS_O_WORKDIR

# Point to the GPU-enabled executable (mdrun_gpu); 'mpirun' is not needed for this single-GPU run
mdrun_gpu -px -pf -s md_para.tpr -o md_para.trr -c md_para.gro -e md_para.edr -g md_para.log > GRMX_GPU.out 2>&1

There are several important differences from the 16 processor MPI script above. First, all GPU jobs must be sent to the 'production_gpu' routing queue to be allocated GPU resources by PBS. Next, the '-l select' line has changed. It now requests 1 CPU (where the GPU host program runs) and 1 GPU, and names the type of GPU acceleration resource that it needs--in this case an NVIDIA Fermi 2.0 device. ANDY has 96 of these devices on the GPU side of the system. Lastly, the 'mdrun' command-line has changed. The 'mpirun' start program is not needed and the executable includes the '_gpu' suffix. The command-line options to this version of the program have not changed. Only a single GPU is used in this example, although each NVIDIA Fermi GPU has 448 small cores that will be dedicated to this job. On ANDY there are 2 GPU devices per physical node.

While the GPU version of the code will not perform every calculation on the GPU, benchmarks have shown that 1 GPU can outperform 8 or more CPU cores, depending on their clock and architecture. Of course, with the MPI parallel version you can request more than this number of CPUs. Users should investigate the scaling properties of the different versions and submit their jobs based on their findings and on how busy the CPU and GPU resources on the system are.

For the 'mdrun_mpi' command-line options used above, a short summary of their meaning will be written to the terminal if the '-h' option is used. The GPU version of the code will have some additional options. Here is the MPI list:

Option     Filename  Type         Description
------------------------------------------------------------
  -s      topol.tpr  Input        Run input file: tpr tpb tpa
  -o       traj.trr  Output       Full precision trajectory: trr trj cpt
  -x       traj.xtc  Output, Opt. Compressed trajectory (portable xdr format)
-cpi      state.cpt  Input, Opt.  Checkpoint file
-cpo      state.cpt  Output, Opt. Checkpoint file
  -c    confout.gro  Output       Structure file: gro g96 pdb etc.
  -e       ener.edr  Output       Energy file
  -g         md.log  Output       Log file
-dhdl      dhdl.xvg  Output, Opt. xvgr/xmgr file
-field    field.xvg  Output, Opt. xvgr/xmgr file
-table    table.xvg  Input, Opt.  xvgr/xmgr file
-tabletf    tabletf.xvg  Input, Opt.  xvgr/xmgr file
-tablep  tablep.xvg  Input, Opt.  xvgr/xmgr file
-tableb   table.xvg  Input, Opt.  xvgr/xmgr file
-rerun    rerun.xtc  Input, Opt.  Trajectory: xtc trr trj gro g96 pdb cpt
-tpi        tpi.xvg  Output, Opt. xvgr/xmgr file
-tpid   tpidist.xvg  Output, Opt. xvgr/xmgr file
 -ei        sam.edi  Input, Opt.  ED sampling input
 -eo      edsam.xvg  Output, Opt. xvgr/xmgr file
  -j       wham.gct  Input, Opt.  General coupling stuff
 -jo        bam.gct  Output, Opt. General coupling stuff
-ffout      gct.xvg  Output, Opt. xvgr/xmgr file
-devout   deviatie.xvg  Output, Opt. xvgr/xmgr file
-runav  runaver.xvg  Output, Opt. xvgr/xmgr file
 -px      pullx.xvg  Output, Opt. xvgr/xmgr file
 -pf      pullf.xvg  Output, Opt. xvgr/xmgr file
 -ro   rotation.xvg  Output, Opt. xvgr/xmgr file
 -ra  rotangles.log  Output, Opt. Log file
 -rs   rotslabs.log  Output, Opt. Log file
 -rt  rottorque.log  Output, Opt. Log file
-mtx         nm.mtx  Output, Opt. Hessian matrix
 -dn     dipole.ndx  Output, Opt. Index file
-multidir    rundir  Input, Opt., Mult. Run directory
-membed  membed.dat  Input, Opt.  Generic data file
 -mp     membed.top  Input, Opt.  Topology file
 -mn     membed.ndx  Input, Opt.  Index file

Option       Type   Value   Description
------------------------------------------------------
-[no]h       bool   no      Print help info and quit
-[no]version bool   no      Print version info and quit
-nice        int    0       Set the nicelevel
-deffnm      string         Set the default filename for all file options
-xvg         enum   xmgrace  xvg plot formatting: xmgrace, xmgr or none
-[no]pd      bool   no      Use particle decompostion
-dd          vector 0 0 0   Domain decomposition grid, 0 is optimize
-ddorder     enum   interleave  DD node order: interleave, pp_pme or cartesian
-npme        int    -1      Number of separate nodes to be used for PME, -1
                            is guess
-nt          int    0       Total number of threads to start (0 is guess)
-ntmpi       int    0       Number of thread-MPI threads to start (0 is guess)
-ntomp       int    0       Number of OpenMP threads per MPI process/thread
                            to start (0 is guess)
-ntomp_pme   int    0       Number of OpenMP threads per MPI process/thread
                            to start (0 is -ntomp)
-pin         enum   auto    Fix threads (or processes) to specific cores:
                            auto, on or off
-pinoffset   int    0       The starting logical core number for pinning to
                            cores; used to avoid pinning threads from
                            different mdrun instances to the same core
-pinstride   int    0       Pinning distance in logical cores for threads,
                            use 0 to minimize the number of threads per
                            physical core
-gpu_id      string         List of GPU id's to use
-[no]ddcheck bool   yes     Check for all bonded interactions with DD
-rdd         real   0       The maximum distance for bonded interactions with
                            DD (nm), 0 is determine from initial coordinates
-rcon        real   0       Maximum distance for P-LINCS (nm), 0 is estimate
-dlb         enum   auto    Dynamic load balancing (with DD): auto, no or yes
-dds         real   0.8     Minimum allowed dlb scaling of the DD cell size
-gcom        int    -1      Global communication frequency
-nb          enum   auto    Calculate non-bonded interactions on: auto, cpu,
                            gpu or gpu_cpu
-[no]tunepme bool   yes     Optimize PME load between PP/PME nodes or GPU/CPU
-[no]testverlet bool   no      Test the Verlet non-bonded scheme
-[no]v       bool   no      Be loud and noisy
-[no]compact bool   yes     Write a compact log file
-[no]seppot  bool   no      Write separate V and dVdl terms for each
                            interaction type and node to the log file(s)
-pforce      real   -1      Print all forces larger than this (kJ/mol nm)
-[no]reprod  bool   no      Try to avoid optimizations that affect binary
                            reproducibility
-cpt         real   15      Checkpoint interval (minutes)
-[no]cpnum   bool   no      Keep and number checkpoint files
-[no]append  bool   yes     Append to previous output files when continuing
                            from checkpoint instead of adding the simulation
                            part number to all file names
-nsteps      int    -2      Run this number of steps, overrides .mdp file
                            option
-maxh        real   -1      Terminate after 0.99 times this time (hours)
-multi       int    0       Do multiple simulations in parallel
-replex      int    0       Attempt replica exchange periodically with this
                            period (steps)
-nex         int    0       Number of random exchanges to carry out each
                            exchange interval (N^3 is one suggestion).  -nex
                            zero or not specified gives neighbor replica
                            exchange.
-reseed      int    -1      Seed for replica exchange, -1 is generate a seed
-[no]ionize  bool   no      Do a simulation including the effect of an X-Ray
                            bombardment on your system

HOOMD

HOOMD performs general-purpose particle dynamics simulations, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to many processor cores on a fast cluster. Unlike the developers of some other applications in the particle and molecular dynamics space, the HOOMD developers have worked to implement all of the code's computationally intensive kernels on the GPU, although currently only single-node, single-GPU or OpenMP-GPU runs are possible. There is no MPI-GPU or distributed parallel GPU version available at this time.

HOOMD's object-oriented design patterns make it both versatile and expandable. Various types of potentials, integration methods and file formats are currently supported, and more are added with each release. The code is available and open source, so anyone can write a plugin or change the source to add additional functionality. Simulations are configured and run using simple python scripts, allowing complete control over the force field choice, integrator, all parameters, how many time steps are run, etc. The scripting system is designed to be as simple as possible to the non-programmer.

The HOOMD development effort is led by the Glotzer group at the University of Michigan, but many groups from different universities have contributed code that is now part of the HOOMD main package, see the credits page for the full list. The HOOMD website and documentation are available here [29]. HOOMD version 0.9.2 has been installed on ANDY which has NVIDIA's S2050 Fermi GPUs with 448 computational cores. The version installed runs in single-precision (32-bit) mode.

A basic input file in HOOMD's python scripting format is presented here:

$cat test.hoomd
from hoomd_script import *

# create 100 random particles of name A
init.create_random(N=100, phi_p=0.01, name='A')

# specify Lennard-Jones interactions between particle pairs
lj = pair.lj(r_cut=3.0)
lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)

# integrate at constant temperature
all = group.all();
integrate.mode_standard(dt=0.005)
integrate.nvt(group=all, T=1.2, tau=0.5)

# run 10,000 time steps
run(10e3)

Here is a PBS script that will run the above test case on a single ANDY GPU:

#!/bin/bash
#PBS -q production_gpu
#PBS -N HOOMDS_test
#PBS -l select=1:ncpus=1:ngpus=1:accel=fermi
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo ""
echo -n ">>>> PBS Master compute node is: "
hostname

# Change to working directory
cd $PBS_O_WORKDIR

echo ">>>> Begin HOOMD GPU Parallel Run ..."
echo ""
/share/apps/hoomd/default/bin/hoomd test.hoomd 
echo ""
echo ">>>> End   HOOMD GPU Parallel Run ..."

The example above targets one (1) GPU on any compute node with an attached GPU. In the case of ANDY, that is any of the 'gpute-XX' compute nodes on the QDR InfiniBand side of the system. By selecting the '-q production_gpu' PBS routing queue and asking for one (1) GPU with '-l select=1:ncpus=1:ngpus=1:accel=fermi', PBS will ensure that a GPU is available to the HOOMD job. By default, if no options are offered to the 'hoomd' command, the executable will first look for a GPU; if it finds one it will use it, and otherwise it will run only on the CPU. GPU-only or CPU-only execution can be requested using the '--mode=gpu' or '--mode=cpu' option on the command line above. NOTE: The options to the 'hoomd' command must be placed AFTER the python input script (i.e. hoomd test.hoomd --mode=gpu). The 'hoomd' executable accepts a variety of options to control runtime behavior. These are described in detail here [30].
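
For example, to force a CPU-only run with the same input script (note that the option follows the script name, as required above), the command line in the PBS script would become:

/share/apps/hoomd/default/bin/hoomd test.hoomd --mode=cpu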

HOPSPACK

HOPSPACK (Hybrid Optimization Parallel Search Package) is designed to help users solve a wide range of derivative-free optimization problems, which may be noisy, non-convex, or non-smooth. The basic optimization problem addressed is to minimize an objective function f(x) of n unknowns subject to the constraints AI x ≥ bI, AE x = bE, cI(x) ≥ 0, cE(x) = 0, and l ≤ x ≤ u. The first two constraints specify linear inequalities and equalities with coefficient matrices AI and AE. The next two constraints describe nonlinear inequalities and equalities captured in the functions cI(x) and cE(x). The final constraints denote lower and upper bounds on the variables. HOPSPACK allows variables to be continuous or integer-valued and has provisions for multi-objective optimization problems. In general, the functions f(x), cI(x), and cE(x) can be noisy and nonsmooth, although most algorithms perform best on deterministic functions with continuous derivatives.

Users can design and implement their own solvers, either by writing new code or by building on the existing solvers already provided in the framework. Because all solvers (called citizens) are members of the same global class, they can share assigned resources. The main features of the package are:

  • Only function values are required for the optimization.
  • The user must provide a separate program that can evaluate the objective and nonlinear constraint functions at a given point.
  • A robust implementation of the Generating Set Search (GSS) solver is supplied, including the capability to handle linear constraints.
  • Multiple solvers can run simultaneously and are easily configured to share information.
  • Solvers may share a cache of computed function and constraint evaluations to eliminate duplicate work.
  • Solvers can initiate and control sub-problems.


Before starting HOPSPACK, the user must load the hopspack module file.

module load hopspack

The program can be started with the command:

hopspack <input file>

The input file contains HOPSPACK parameters in a text format.

A PBS script to run a job with HOPSPACK is:

#!/bin/bash
#PBS -q production
#PBS -N hopspack
#PBS -l select=2:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel executable to run
echo ">>>> Begin HOPSPACK_MPI Run ..."
mpirun -np 2 -machinefile $PBS_NODEFILE hopspack input_text_file > hopspack.out 2>&1
echo ">>>> End   HOPSPACK_MPI Run ..."

This script can be incorporated into a file (e.g., hopspack_job) and started with the command:

qsub hopspack_job

IMa2

The IMa2 application performs basic 'Isolation with Migration' calculations using Bayesian inference and Markov chain Monte Carlo methods. The only major conceptual addition to IMa2 that makes it different from the original IMa program is that it can handle data from multiple populations. This requires that the user specify a phylogenetic tree. Importantly, the tree must be rooted, and the sequence in time of internal nodes must be known and specified. More information on IMa2 and IMa can be found in the user manual here [31]

The latest IMa2 (8-26-11 release) is a serial program that is currently installed on ANDY and BOB at the CUNY HPC Center, and requires an input file and potentially several additional data files to run. Here we provide a script that will run the test input file supplied by the authors, 'ima2_testinput.u'. Completing this run may also require the prior file ('ima2_priorfile_4pops.txt') and the nested models file ('ima2_all_nested_models_2_pops.txt'). All these files can be copied out of the IMa2 installation examples directory, as follows:

cp /share/apps/ima2/default/examples/ima2_testinput.u .
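
If they are needed, the prior and nested-models files mentioned above can be copied in the same way (this assumes they are kept in the same examples directory):

cp /share/apps/ima2/default/examples/ima2_priorfile_4pops.txt .
cp /share/apps/ima2/default/examples/ima2_all_nested_models_2_pops.txt .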

A working PBS batch script that will complete an IMa2 run is presented here:

#!/bin/bash
#PBS -q production
#PBS -N IMA2_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin IMa2 Serial Run ..."
echo ""
/share/apps/ima2/default/bin/IMa2 -i ima2_testinput.u -o ima2_testoutput.out -q2 -m1 -t3 -b10000 -l100
echo ""
echo ">>>> End   IMa2 Serial Run ..."

This script can be dropped into a file (say 'ima2_serial.job') on BOB, and run with:

qsub ima2_serial.job

It should take less than a few minutes to run and will produce PBS output and error files beginning with the job name 'IMA2_serial'. It also produces IMa2's own output files. Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

Please take note of the IMa2 options used here. Details on each can be found in the IMa2 manual referenced above.

HONDO PLUS

Hondo Plus 5.1 is a versatile electronic structure code that combines work from the original Hondo application developed by Harry King in the lab of Michel Dupuis and John Rys, and that of numerous subsequent contributors. It is currently distributed from the research lab of Dr. Donald Truhlar at the University of Minnesota. Part of the advantage of Hondo Plus is the availability of source implementations of a wide variety of model chemistries developed over its lifetime that researchers can adapt to their particular needs. The license to use the code requires a literature citation, which is documented in the Hondo Plus 5.1 manual found at:

http://comp.chem.umn.edu/hondoplus/HONDOPLUS_Manual_v5.1.2007.2.17.pdf

The Hondo Plus 5.1 installed at the CUNY HPC Center is the serial version of the application, and it is currently available only on ANDY. It was compiled with the Intel Fortran compiler. The installation directory (/share/apps/hondoplus/default) includes a large number of examples in the form of a test suite of input decks and correct outputs in the directory:

/share/apps/hondoplus/default/examples
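
A hedged sketch for staging one of the supplied test decks into a working directory is shown below; the test file name is taken from the PBS script that follows, although its exact location within the examples tree may differ:

mkdir -p $HOME/hondo
cp /share/apps/hondoplus/default/examples/test1.0.1315.in $HOME/hondo/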

A simple PBS Pro script to run a Hondo Plus serial job on ANDY is presented here using one of the test input decks from the examples directory:

#!/bin/bash
# This script runs a serial HondoPlus job in the 
# PBS qserial queue.  The HondoPlus code was compiled
# with the Intel Fortran compiler and the recommended
# settings. The SCM memory scratch space was left at
# the default size.
#PBS -q production
#PBS -N hondo_job
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

hostname

cd $HOME/hondo

echo 'HondoPlus Job starting ... '

/share/apps/hondoplus/default/bin/hondo test1.0.1315.in test1.0.1315.out

# Clean up scratch files by default

echo 'HondoPlus Job is done!'

Hondo Plus was compiled with the default memory sizes as set in the distribution. With the larger memory available on ANDY and many modern Linux cluster systems compiling a larger version is possible. Those interested should contact CUNY HPC Center help at hpchelp@csi.cuny.edu.

LAMARC

LAMARC is a program which estimates population-genetic parameters such as population size, population growth rate, recombination rate, and migration rates. It approximates a summation over all possible genealogies that could explain the observed sample, which may be sequence, SNP, microsatellite, or electrophoretic data. LAMARC and its sister program MIGRATE are successor programs to the older programs Coalesce, Fluctuate, and Recombine, which are no longer being supported. These programs are memory-intensive, but can run effectively on workstations. They are supported on a variety of operating systems. For more detail on LAMARC please visit the website here [32], read this paper [33], and look at the documentation here [34].

LAMARC version 2.1.8 is currently installed at the CUNY HPC Center on the system ANDY. LAMARC is a serial code that can be compiled with or without a GUI interface. To discourage interactive GUI-based runs on the system's login node, LAMARC has been compiled with the GUI disabled and should be run in command-line mode from a PBS batch script (take note of the '-b' option). Here is a PBS batch script that will run the sample XML input file provided with the distribution, 'sample_infile.xml', in /share/apps/lamarc/default/examples. This run assumes that you have already converted your raw data file into one that is readable by LAMARC; this has already been done for the simple example input. A tutorial on the use of the converter is located here [35].

To include all required environmental variables and the path to the LAMARC executable, run the module load command (the modules utility is discussed in detail above):

module load lamarc

Here is a PBS batch script that runs the sample input case on one processor (serially):

#!/bin/bash
#PBS -q production
#PBS -N LAMARC_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin LAMARC Serial Run ..."
lamarc ./sample_infile.xml -b > sample_infile.out 2>&1
echo ">>>> End   LAMARC Serial Run ..."

This script can be dropped into a file (say 'lamarc_serial.job') on ANDY, and run with:

qsub lamarc_serial.job

This sample input file should take less than a minute to run and will produce PBS output and error files beginning with the job name 'LAMARC_serial'. It also produces LAMARC's own output files. Details on the meaning of the PBS script are covered above in the PBS section of this Wiki. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are to be found (freely). The compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

Note the presence of the batch-mode option '-b' on the LAMARC command line. This is REQUIRED to complete a batch submission, but the absence of other options indicates that the input file that you are using has everything else configured as you wish or that you will be using the default settings. If you do not use the '-b' option your batch job will sit forever waiting for input from your terminal that it will never get, because it is a batch job.

You can customize the input file settings by editing them manually with a Unix editor like 'vi', although you will have to work through a lot of XML punctuation to do this. Another approach is to run LAMARC interactively on the login node to generate a customized input file from the defaults-based file created by the converter program; this customized file can then be saved from the interactive menu before it is run. DO NOT complete the actual computation on the login node. Submit it as a PBS batch job using the script above. Your customized file can be used as the input (with its new name), as in the script above, using the '-b' batch option.
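
A hedged sketch of that interactive-then-batch workflow is shown below (file names are those used in the example above; walk the menus, save the customized settings file, and quit without starting the analysis):

# on the login node: menu-driven session only, to build a customized input file
module load lamarc
lamarc ./sample_infile.xml

# then submit the actual computation in batch using the '-b' option (script above)
qsub lamarc_serial.job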

LAMMPS

LAMMPS is a classical molecular dynamics code that models an ensemble of particles in a liquid, solid, or gaseous state. It can model atomic, polymeric, biological, metallic, granular, and coarse-grained systems using a variety of force fields and boundary conditions. LAMMPS runs efficiently on single-processor desktop or laptop machines, but is also designed for parallel computers, including clusters with and without GPUs. It will run on any parallel machine that compiles C++ and supports the MPI message-passing library. This includes distributed- or shared-memory parallel machines and Beowulf-style clusters. LAMMPS can model systems with only a few particles up to millions or billions. LAMMPS is a freely-available open-source code, distributed under the terms of the GNU Public License, which means you can use or modify the code however you wish. LAMMPS is designed to be easy to modify or extend with new capabilities, such as new force fields, atom types, boundary conditions, or diagnostics. A complete description of LAMMPS can be found in its on-line manual here [36] or from the full PDF manual here [37].

The complete LAMMPS package is installed on the PENZIAS server. Because PENZIAS is a GPU-enabled (Kepler) system, the code is compiled in double-double mode, i.e., double precision for both force and velocity. The abundance of GPUs on PENZIAS makes the use of OpenMP largely unnecessary, because better performance is usually obtained by oversubscribing the Kepler GPUs rather than by adding OpenMP threads. For that reason, the OpenMP package, along with the KIM package, is not installed on PENZIAS. As a rule of thumb, it is recommended to use 2-4 MPI tasks per Kepler GPU in order to gain maximum performance. The packages currently installed on PENZIAS are: ASPHERE, BODY, CLASS2, COLLOID, DIPOLE, FLD, GPU, GRANULAR, KSPACE, MANYBODY, MC, MEAM, MISC, MOLECULE, OPT, PERI, POEMS, REAX, REPLICA, RIGID, SHOCK, SRD, VORONOI, XTC. In addition, the following USER packages are also installed: ATC, AWPMD, CG-CMM, COLVARS, CUDA, EFF, LB, MISC, MOLFILE, PHONON, REAXC and SPH.

Here is a LAMMPS input deck (in.lj) from the LAMMPS benchmark suite.

 3d Lennard-Jones melt

variable        x index 1
variable        y index 1
variable        z index 1

variable        xx equal 20*$x
variable        yy equal 20*$y
variable        zz equal 20*$z

units           lj
atom_style      atomic

lattice         fcc 0.8442
region          box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box      1 box
create_atoms    1 box
mass            1 1.0

velocity        all create 1.44 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

fix             1 all nve

run             100

The script below runs LAMMPS on 8 CPU cores. Before using it, however, load the lammps module with the command:

module load lammps

Here is the PBS script:
#!/bin/bash
#PBS -q production
#PBS -N LAMMPS_test
#PBS -l select=8:ncpus=1
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo ""
echo -n ">>>> PBS Master compute node is: "
hostname

# Change to working directory
cd $PBS_O_WORKDIR

echo ">>>> Begin LAMMPS MPI Parallel Run ..."
echo ""
mpirun -np 8 -machinefile $PBS_NODEFILE /share/apps/lammps/default/bin/lmp_openmpi < in.lj > out_file
echo ""
echo ">>>> End   LAMMPS MPI Parallel Run ..."

On PENZIAS, GPU mode is the default unless it is turned off with the command-line switch '-cuda off'. In order to run on the GPUs, you must also modify the PBS script by editing the '-l select' line to request one GPU for every CPU allocated by PBS. In other words, the '-l select' line for a GPU run would look like this:

#PBS -l select=8:ncpus=1:ngpus=1 

Other changes to the input file and/or the command line are also needed to run on the GPUs. Details regarding the command-line switches are available here [38]. Here is a simple listing:

-c or -cuda
-e or -echo
-i or -in
-h or -help
-l or -log
-p or -partition
-pl or -plog
-ps or -pscreen
-r or -reorder
-sc or -screen
-sf or -suffix
-v or -var

Other requirements for running in GPU mode (or to use any of the various user packages) can be found here [39] and in the LAMMPS User Manual mentioned above. The 'package gpu' and 'package cuda' commands are those of primary interest.
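
As a minimal sketch only, and assuming the default GPU (USER-CUDA) behavior described above, the earlier 8-core benchmark could be forced to run CPU-only by adding the documented '-cuda off' switch to the same command line:

mpirun -np 8 -machinefile $PBS_NODEFILE /share/apps/lammps/default/bin/lmp_openmpi -cuda off < in.lj > out_file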

Finally, note that users interested in a single-precision version of LAMMPS should contact the HPC Center through 'hpchelp@csi.cuny.edu'.

The installation on the Cray XE6 (SALK) does not include the GPU parallel models because the CUNY HPC Center Cray does not have GPU hardware.

LS-DYNA

From its early development in the 1970s, LS-DYNA has evolved into a general purpose material stress, collision, and crash analysis program with many built-in material and structural element models. In recent years, the code has also been adapted for both OpenMP and MPI parallel execution on a variety of platforms. The most recent version, LS-DYNA 971 Revision 6.0.0, is installed on ANDY at the CUNY HPC Center under an academic license held by the City College of New York. The use of this license to do work that is commercial in any way is prohibited.

Details on LS-DYNA's use, input deck construction, and execution options can be found in the LS-DYNA manual here [40]. All files related to the HPC Center installation of version 971 (executables and example inputs) are located in:

/share/apps/lsdyna/default/[bin,examples]

Both 32-bit and 64-bit executables in both serial and parallel versions are provided. The MPI parallel versions use OpenMPI as their MPI parallel library, the HPC Center's default version of MPI. The serial executable can also be run in OpenMP (not to be confused with OpenMPI) node-local SMP-parallel mode. The names of the executable files in the '/share/apps/lsdyna/default/bin' directory are:

ls-dyna_32.exe  ls-dyna_64.exe  ls-dyna-mpp32.exe  ls-dyna_mpp64.exe

Those with the string 'mpp' in the name are the MPI distributed parallel versions of the code. The integer (32 or 64) designates the precision of the build. In the examples below, depending on the type of script being submitted (serial or parallel, 32- or 64-bit), a different executable will be chosen. The scaling properties of LS-DYNA in parallel mode are limited, and users should not carelessly submit parallel jobs requesting large numbers of cores without understanding how their job will scale. A large 128 core job that runs only 5% faster than a 64 core job is a waste of resources. Please examine the scaling properties of your particular job before scaling up.

As is the case with most long running applications run at the CUNY HPC Center, whether parallel or serial, LS-DYNA jobs are run using a PBS batch job submission script. Here we provide some example scripts for both serial and parallel execution.


Note that before using this script you will need to set up the environment for LS-DYNA. On ANDY, "modules" is used to manage environments. Setting up LS-DYNA is done with:

module load ls-dyna

First, here is an example serial execution script (called, say, 'airbag.job') run at 64 bits using the LS-DYNA 'airbag' example ('airbag.deploy.k') from the examples directory above as the input.

#!/bin/bash
#PBS -q production_qdr
#PBS -N ls-dyna_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin LS-DYNA Serial Run ..."
ls-dyna_64.exe i=airbag.deploy.k memory=2000m
echo ">>>> End   LS-DYNA Serial Run ..."

Details on the PBS options at the head of this script file are discussed below, but in summary: '-q production_qdr' selects the routing queue into which the job will be placed, '-N ls-dyna_serial' sets this job's name, '-l select=1:ncpus=1:mem=2880mb' requests 1 PBS resource chunk that includes 1 cpu and 2880 MBytes of memory, and '-l place=free' allows PBS to put the resources needed for the job anywhere on the ANDY system.

The LS-DYNA command line sets the input file to be used and the amount of in-core memory that is available to the job. Note that this executable does NOT include the string 'mpp', which means that it is not the MPI executable. Users can copy the 'airbag.deploy.k' file from the examples directory and cut-and-paste this script to run this job. It takes a relatively short time to run. The PBS command for submitting the job would be:

qsub airbag.job
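
As an aside, the serial executable can also be run in node-local, OpenMP (SMP) parallel mode, as mentioned above. The line below is a hedged sketch only: it assumes the 'ncpu' keyword (used with the MPI executable further below) also sets the SMP thread count, and the matching PBS '-l select' line would need to request ncpus=4 packed on one node:

ls-dyna_64.exe i=airbag.deploy.k ncpu=4 memory=2000m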

Here is a PBS script that runs a 16 processor (core) MPI job. This script is set to run the TopCrunch '3cars' benchmark which is relatively long-running using MPI on 16 processors. There are a few important differences in this script.

#!/bin/bash
#PBS -q production_qdr
#PBS -N ls-dyna_mpi
#PBS -l select=16:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin LS-DYNA MPI Parallel Run ..."
mpirun -np 16 --machinefile $PBS_NODEFILE ls-dyna_mpp64.exe i=3cars_shell2_150ms.k ncpu=16 memory=2000m
echo ">>>> End   LS-DYNA MPI Parallel Run ..."

Focusing on the differences in this script relative to the serial PBS script above: first, the '-l select' line requests not 1 PBS resource chunk, but 16, each with 1 cpu (core) and 2880 MBytes of memory. This provides the necessary resources to run our 16 processor MPI-parallel job. Next, the LS-DYNA command line is different. The LS-DYNA MPI-parallel executable is used (ls-dyna_mpp64.exe), and it is run with the help of the OpenMPI job submission command 'mpirun', which sets the number of processors and the location of those processors on the system. The LS-DYNA keywords also include the string 'ncpu=16' to instruct LS-DYNA that this is to be a parallel run.

Running in parallel on 16 cores in 64-bit mode on ANDY, the '3cars' case takes about 9181 seconds of elapsed time to complete. If the user would like to run this job, they can grab the input files out of the directory '/share/apps/lsdyna/6.0.0/examples/3cars' on ANDY and use the above script.

MATHEMATICA

General notes

“Mathematica” is a fully integrated technical computing system that combines fast, high-precision numerical and symbolic computation with data visualization and programming capabilities. Mathematica version 9.0.1 is currently installed on the CUNY HPC Center's ANDY cluster (andy.csi.cuny.edu) and KARLE standalone server (karle.csi.cuny.edu). The basics of running Mathematica on CUNY HPC systems are presented here. Additional information on how to use Mathematica can be found at http://www.wolfram.com/learningcenter/

Modes of Operation in Mathematica

Mathematica can be run locally on an office workstation, directly on a server or cluster from its head node, or across the network between an office-local client and a remote server (a cluster for instance). It can be run serially or in parallel; its licenses can be provided locally or via a network-resident license server; and it can be run in command-line or GUI mode. The details of installing and running Mathematica on a local office workstation are left to the user. Those modes of operation important to the use of CUNY's HPC resources are discussed here.

Selecting Between GUI and Command-Line Mode

The use of command-line mode or GUI mode is determined by the Mathematica command selected. To use the Mathematica GUI, enter the following command at the user prompt:

$mathematica

To use Mathematica Command Line Interface (CLI), enter:

$math

More detail on these and other Mathematica commands is available through the man command, as in:

$man mathematica
$man math
$man mcc

The lines above provide documentation on the GUI, CLI, and Mathematica C-compiler, respectively.
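
When working from a remote desktop, the GUI mode also requires X11 forwarding so that the interface can display locally. A hedged sketch of such a session (the account and host names are illustrative) is:

# connect with X11 forwarding enabled, then launch the GUI from the remote shell
ssh -X your.account@karle.csi.cuny.edu
mathematica &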

A Note on Fonts on Unix and Linux Systems

If you have Mathematica installed on your local system, you should already have the correct fonts available for local use; but when displaying the Mathematica GUI (via X11 forwarding) on your local system while running remotely, some additional preparation may be required to provide the fonts that Mathematica requires to X11 locally. The procedure for setting this up is presented here.

The Mathematica GUI interface supports BDF, TrueType, and Type1 fonts. These fonts are automatically installed for local use by the MathInstaller. Your workstation or personal computer will have access to these fonts if you have installed Mathematica for local use. However, if the Mathematica process is installed and running only on a remote system at the CUNY HPC Center (say ANDY), then X11 and the Mathematica GUI being displayed on your local machine (through X11 port forwarding) must know where to find the Mathematica fonts locally. Typically, the Mathematica fonts must be added to your local workstation's X11 font path using the 'xset' command, as follows.

First, you must create a client-local directory into which to copy the fonts; for example, on a Linux system: 'cd $HOME; mkdir Fonts'. Next, you must copy the Mathematica font directories into this local directory from their remote location. They are currently stored in the following directories:

on ANDY: /share/apps/mathematica/8.0.4/SystemFiles/Fonts/
on KARLE: /share/apps/mathematica/8.0/SystemFiles/Fonts/

To create local copies in the 'Fonts' directory you created, execute the following commands from your local desktop (this assumes that secure copy (scp) is available on your desktop system):

$
$mkdir Fonts
$
$cd Fonts
$scp -r your.account@karle.csi.cuny.edu:/share/apps/mathematica/8.0/SystemFiles/Fonts/*   .
$
$ls -l
drwxr-xr-x 2 your.account users   4096 Nov  3 16:07 AFM
drwxr-xr-x 2 your.account users 45056 Nov  3 16:08 BDF
drwxr-xr-x 2 your.account users   4096 Nov  3 16:07 SVG
drwxr-xr-x 2 your.account users   4096 Nov  3 16:07 TTF
drwxr-xr-x 2 your.account users   4096 Nov  3 16:07 Type1
$

After you have copied the remote font directories into your local directory, run the following X11 'xset' commands locally:

xset fp+ ${HOME}/Fonts/Type1; xset fp rehash
xset fp+ ${HOME}/Fonts/BDF;    xset fp rehash

For optimal on-screen performance, the Type1 font path should appear before the BDF font path. Hence, ${HOME}/Fonts/Type1 should appear before ${HOME}/Fonts/BDF in the path. You can check font path order by executing the command:

xset q

Additional information on handling Mathematica fonts can be found at http://reference.wolfram.com/mathematica/tutorial/FontsOnUnixAndLinux.html

Using Mathematica on KARLE

Karle is a standalone, four-socket, 4 x 6 = 24 core, head-like node and is a highly capable system. Karle's 24 Intel E740-based cores run at 2.4 GHz. Karle has a total of 96 GBytes of memory, or 4 GBytes per core. Users can run GUI applications on Karle following this approach, or they can use the CLI. Selecting between GUI and command-line mode is described here.

Serial Job Example

If Mathematica is started in interactive mode using the GUI or CLI, users can enter Mathematica commands as they normally would:

$ /share/apps/mathematica/9.0.1/bin/math
Mathematica 9.0 for Linux x86 (64-bit)
Copyright 1988-2013 Wolfram Research, Inc.

In[1]:= Print["Hello World!"]
Hello World!

In[2]:= Table[Random[],{i,1,10}]

Out[2]= {0.22979, 0.168789, 0.257107, 0.724029, 0.466558, 0.588178, 0.186516, 
 
>    0.957024, 0.950642, 0.938009}

In[3] = Exit[]
$

Alternatively one may put these commands into a text file:

$ cat test.nb
Print["Hello World!"]
Table[Random[],{i,1,10}]
In[3] = Exit[]

$

and run it using:

/share/apps/mathematica/9.0.1/bin/math < test.nb

The following output will be received:

Mathematica 9.0 for Linux x86 (64-bit)
Copyright 1988-2011 Wolfram Research, Inc.

In[1]:= Hello World!

In[2]:= 
Out[2]= {0.67778, 0.737257, 0.862751, 0.623122, 0.253662, 0.541513, 0.776872, 
 
>    0.424682, 0.934039, 0.190007}

In[3]:= 
Parallel Job Example

To run parallel computations in Mathematica on Karle, first start the required number of kernels (the CUNY HPC license allows up to 16 kernels) and then run the actual computation. Consider the following example:

$ cat parallel.nb 

LaunchKernels[8]

With[{base = 10^1000, r = 10^10}, WaitAll[Table[ParallelSubmit[
     While[! PrimeQ[p = RandomInteger[{base, base + r}]], Null]; 
     p], {$KernelCount}]] - base]
$
$
$ /share/apps/mathematica/9.0.1/bin/math < parallel.nb 
Mathematica 9.0 for Linux x86 (64-bit)
Copyright 1988-2013 Wolfram Research, Inc.

In[1]:= 
In[1]:= 
Out[1]= {KernelObject[1, local], KernelObject[2, local], 
 
>    KernelObject[3, local], KernelObject[4, local], KernelObject[5, local], 
 
>    KernelObject[6, local], KernelObject[7, local], KernelObject[8, local]}

In[2]:= 
In[2]:= 
Out[2]= {4474664203, 8096247063, 9746330049, 4733134789, 2879419863, 
 
>    377023287, 7848087693, 8139999951}

In[3]:= 
$
The statement 'LaunchKernels[8]' starts 8 local kernels. The rest of the notebook runs the parallel evaluation on those 8 kernels.

Submitting Batch Jobs to the CUNY ANDY Cluster

Currently, there is no simple and secure method of submitting Mathematica jobs from a remote (user local or desktop) CUNY installation of Mathematica to ANDY. This is something that is being pursued. In the meantime, both serial and parallel Mathematica jobs can be submitted from ANDY's head node by constructing a standard batch job. To ease the process of debugging such work, we recommend that users test their Mathematica command sequences locally on smaller, but similar, cases before submitting their work to the cluster. The standard batch submission process is simple to set up and imposes no burden on ANDY's head node.

Serial Batch Jobs Run with 'qsub' Using a Mathematica Command (Text) File

In the following example, a batch job is created around a locally pre-tested Mathematica command sequence that is then submitted to ANDY's batch queueing system using the qsub command. The simple Mathematica command sequence shown here computes a matrix of integrals and prints out every element of that matrix. Any valid sequence of Mathematica commands provided in a notebook file, whether tested on an office Mathematica installation or on the cluster head node itself, could be used in this example.

When working remotely from an office or a classroom, a user would validate their command sequence on their local workstation (via a smaller local test run), modify it incrementally to make use of the additional resources available on ANDY, and then copy, paste, and save the Mathematica command sequence in a notebook file (file.nb) on ANDY. This last step would be done through a text editor like 'vi' or 'emacs' from a cluster terminal window. From a Windows desktop, the free, secure Windows-to-Linux terminal emulation package, PuTTY could be used. From a Linux desktop, connecting with secure shell 'ssh' would be the right approach.

Below, a notebook file called "test_run.nb", which does a serial (single worker-kernel) integral calculation and might have been tested on the user's office Mathematica installation, has been saved on ANDY from a 'vi' session. Its contents are listed here:

$
$ cat test_run.nb

Print ["Beginning Integral Calculations"]; p=5;
Timing[matr = Table[Integrate[x^(j+i),{x,0,1}], {i,1,p-1}, {j,1,p-1}]//N];
For[i=1, i<p, i++, For[j=1, j<p, j++, Print[matr[[i]][[j]]]]];
Print ["Finished!"];
Quit[];

$

As a serial Mathematica job, this job executes on just one core of just one of ANDY's compute nodes. The simple batch script offered to 'qsub' to run this job (we will call it serial_run.math here) is listed below. This script is written in the PBS Pro form, which became the workload manager on ANDY on 11-18-09. For details on PBS Pro see the section on using the PBS Pro workload manager elsewhere in the CUNY HPC Wiki.

$
$cat serial_run.math

#!/bin/bash
#PBS -N mmat8_serial1
#PBS -q production
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

cd $PBS_O_WORKDIR

math -run <test_run.nb > output

$

This script runs on a single processor (core) within a single ANDY compute node, invoking a single Mathematica kernel instance. The '-N mmat8_serial1' option names the job 'mmat8_serial1'. The job is directed to ANDY's production routing queue, which reads the script's resource request information and places it in the appropriate execution queue. The '-l select=1:ncpus=1:mem=1920mb' option requests one resource 'chunk' composed of 1 processor (core) and 1920 Mbytes of memory. The '-l place=free' option instructs PBS Pro to place the job where it wishes, which will be on the compute node with the lowest load average. The '-V' option ensures that the current local Unix environment is pushed out to the compute node that runs the job. Because this is a batch script with no connection to the terminal, the CLI version of the Mathematica command, 'math', is used.

Save this script in a file for your future use, for example in "serial_run.math". With few modifications, it can be used to run most serial Mathematica batch jobs on ANDY.

To run this job script use the command:

 qsub serial_run.math 

Like any other batch job submitted using 'qsub', you can check the status of your job by running the command 'qstat' or 'qstat -f JID'. Upon completion, the output generated by the job will be written to the file 'output'.

Here is the output from this sample serial batch job:

Mathematica 8.0 for Linux x86 (64-bit)
Copyright 1988-2011 Wolfram Research, Inc.

In[1]:= Beginning Integral Calculations

In[2]:= 
In[3]:= 0.333333
0.25
0.2
0.166667
0.25
0.2
0.166667
0.142857
0.2
0.166667
0.142857
0.125
0.166667
0.142857
0.125
0.111111

In[4]:= Finished!

In[5]:= 
SMP-Parallel Batch Jobs Run with 'qsub' Using a Mathematica Command (Text) File

Mathematica provides some easy-to-use methods for performing parallel computations in the so-called SMP regime. This mode of operation allows users to use the cores available within a single compute node. KARLE, as a standalone computational node, has 24 cores, and each of ANDY's nodes has 8 cores. Consider the following Mathematica notebook:


$
$ cat test_smp.nb
(* perform some computations in serial mode *)
Timing[Table[{i,Plus @@ (#[[2]] &) /@ FactorInteger[(10^i - 1)/9]}, {i, 60, 70}]]

(* initialize 4 MathKernels*)
Needs["SubKernels`LocalKernels`"]
(* object 'mykernels' contains information about 4 computational instances *)
mykernels = LaunchKernels[LocalMachine[4]];

(* Let every kernel report its existence *)
ParallelEvaluate[$MachineName, mykernels]

(* Perform the same computation as before, now with ParallelTable on those 4 kernels *)
Timing[ParallelTable[{i,Plus @@ (#[[2]] &) /@ FactorInteger[(10^i - 1)/9]}, {i, 60, 70}]]

Exit[]
$

This job first performs some computations using only one core. After that, a set of 4 computational kernels is created by Mathematica and similar computations are repeated in parallel. The PBS script that sends this job to the queue is:

$
$cat parallel_run.math

#PBS -N mmat_smp
#PBS -q production
#PBS -l select=4:ncpus=1
#PBS -l place=pack
#PBS -V

cd $PBS_O_WORKDIR

math -run <test_smp.nb > output
$

There are two important things to note here: 1) "#PBS -l place=pack" -- the user must request that PBS pack the allocated resources onto a single physical compute node. 2) "#PBS -l select=4:ncpus=1" -- the user requests 4 cores from PBS (4 'chunks' with ncpus=1 each). This is important because 4 computational kernels were created in the Mathematica notebook.

As before, the '-V' option ensures that the environment local to the head node is pushed out to the compute node that runs the job. The CLI version of the Mathematica command, 'math', is used again here.

Like any other PBS job, this SMP-parallel Mathematica job is submitted to the PBS queue using the "qsub parallel_run.math" command.

The results of this SMP-parallel batch job will be stored in the file 'output':

Mathematica 8.0 for Linux x86 (64-bit)
Copyright 1988-2011 Wolfram Research, Inc.

In[1]:= 
Out[1]= {8.14851, {{60, 20}, {61, 7}, {62, 5}, {63, 14}, {64, 15}, {65, 7}, 
 
>     {66, 15}, {67, 3}, {68, 10}, {69, 6}, {70, 12}}}

In[2]:= 
In[2]:= 
In[2]:= 
In[2]:= 
In[3]:= 
In[4]:= 
Out[4]= {r1i0n8, r1i0n8, r1i0n8, r1i0n8}

In[5]:= 
In[5]:= 
Out[5]= {0.544033, {{60, 20}, {61, 7}, {62, 5}, {63, 14}, {64, 15}, {65, 7}, 
 
>     {66, 15}, {67, 3}, {68, 10}, {69, 6}, {70, 12}}}

In[6]:= 
In[6]:= 
In[6]:=

It can be seen in the output that the computations were first done in serial mode on one core. Line "Out[4]" is the output from "ParallelEvaluate[$MachineName, mykernels]". As expected, all 4 computational kernels report the same $MachineName (they were started on the same host). "Out[5]" is the original computation performed in parallel on 4 MathKernels.


Submitting Batch Jobs from Remote Locations to Clusters

A method for doing this is being developed and tested.

For more information on Mathematica:

  • Online documentation is available through the Help menu within the Mathematica notebook front end.
  • The Mathematica Book, 5th Edition (Wolfram Media, Inc., 2003) by Stephen Wolfram.
  • The Mathematica Book is available online.
  • Additional Mathematica documentation is available online.
  • Information on the Parallel Computing Toolkit is available online.
  • Getting Started with Mathematica (Wolfram Research, Inc., 2004).
  • The Wolfram web site http://www.wolfram.com

MATLAB

The MATLAB high-performance language for technical computing integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Typical uses include:

  • Math and computation
  • Algorithm development
  • Data acquisition
  • Modeling, simulation, and prototyping
  • Data analysis, exploration, and visualization
  • Scientific and engineering graphics
  • Application development, including graphical user interface building

At the CUNY HPC Center, MATLAB jobs can be run locally on KARLE or in batch on BOB and ANDY, and should be initiated from a Linux or Windows client on the CSI campus or (for users either on or off campus) from the CUNY HPC Center gateway machine KARLE. When configured correctly, MATLAB generates and places the batch submit scripts required to run a MATLAB job in the user's working directory on the cluster's head node and completes the entire batch submission process, returning the results to the client.

MATLAB is an interactive system with both a command line interface (CLI) and Graphical User Interface (GUI) whose basic data element is an array that does not require dimensioning. It allows you to solve many technical computing problems, especially those with matrix and vector formulations, in a fraction of the time it would take to write a program in a scalar, non-interactive language such as C or Fortran. Properly licensed and configured, MATLAB's compute engine can be run serially or in parallel, and on a local desktop or client, or on a remote server or cluster. From the CUNY HPC Center's new MATLAB client system, KARLE (karle.csi.cuny.edu), each of these modes of operation is supported. KARLE is a 4 socket system based on Intel E740 processors with 6 cores per socket giving it a total of 24 physical cores (the E740 processor does not support Intel hyper-threading). Each core is clocked at 2.4 GHz. KARLE includes 4 GBytes of memory per core for a total of 96 GBytes. KARLE is directly accessible from any CUNY campus using the secure shell utility (ssh -X karle.csi.cuny.edu). ALL MATLAB work at the CUNY HPC Center should be started on or from KARLE which was purchased to completely replace the MATLAB functionality of NEPTUNE and TYPHOON. MATLAB jobs should NOT be run from the head nodes of either BOB or ANDY, the destination systems for MATLAB PBS batch jobs.

Starting MATLAB in GUI or CLI Mode on KARLE

As mentioned above, MATLAB can be run either from its Graphical User Interface (GUI) or from its Command Line Interface (CLI). By default, MATLAB selects the mode for you based on how you have logged into KARLE. If you have logged in using the '-X' option to 'ssh', which allows your 'ssh' session to support the X11 network graphical interface, then MATLAB will be started in GUI mode. If the '-X' option is not used, it will be started in CLI mode. The following examples show each approach.

Setting up MATLAB to run in GUI mode on KARLE:

local$ 
local$ ssh -X my.account@karle.csi.cuny.edu

Notice:
  Users may not access these CUNY computer resources 
without authorization or use it for purposes beyond 
the scope of authorization. This includes attempting
 to circumvent CUNYcomputer resource system protection 
facilities by hacking, cracking or similar activities,
 accessing or using another person's computer account, 
and allowing another person to access or use 
the user's account. CUNY computer resources may not 
be used to gain unauthorized access to another 
computer system within or outside of CUNY. 
Users are responsible for all actions performed 
from their computer account that they permitted or 
failed to prevent by taking ordinary security precautions.

my.account@karle.csi.cuny.edu's password: 
Last login: Fri Feb 17 11:50:02 2012 from 163.238.130.1

[my.account@karle ~]$
[my.account@karle ~]$ matlab

(MATLAB GUI windows are displayed on your screen)

Setting up MATLAB to run in CLI mode on KARLE:

local$ 
local$ ssh my.account@karle.csi.cuny.edu

Notice:
  Users may not access these CUNY computer resources 
without authorization or use it for purposes beyond 
the scope of authorization. This includes attempting
 to circumvent CUNYcomputer resource system protection 
facilities by hacking, cracking or similar activities,
 accessing or using another person's computer account, 
and allowing another person to access or use 
the user's account. CUNY computer resources may not 
be used to gain unauthorized access to another 
computer system within or outside of CUNY. 
Users are responsible for all actions performed 
from their computer account that they permitted or 
failed to prevent by taking ordinary security precautions.

my.account@karle.csi.cuny.edu's password: 
Last login: Fri Feb 17 11:50:02 2012 from 163.238.130.1

[my.account@karle ~]$
[my.account@karle ~]$ matlab

Warning: No display specified.  You will not be able to display graphics on the screen.

                                               < M A T L A B (R) >
                                     Copyright 1984-2011 The MathWorks, Inc.
                           Version 7.11.1.866 (R2010b) Service Pack 1 64-bit (glnxa64)
                                                February 15, 2011

 
  To get started, type one of these: helpwin, helpdesk, or demo.
  For product information, visit www.mathworks.com.
 
>> 
>>

(MATLAB has defaulted to CLI mode because the X11 DISPLAY variable is not set)

Once MATLAB has started in either GUI or CLI mode on KARLE, you should be able to proceed as you would from your own desktop for interactive work, or according to the instructions for batch work on BOB or ANDY presented in the sections below.

Modes of Operation: Local versus Remote (Batch)

Client-local MATLAB jobs (those run directly on KARLE) can be run in serial or in parallel mode. Server-remote MATLAB jobs submitted from KARLE (via either MATLAB's GUI or CLI) can also be run serially or in parallel on the CUNY HPC Center's clusters BOB (bob.csi.cuny.edu) or ANDY (andy.csi.cuny.edu) through the PBS Pro batch scheduler there. All 24 of KARLE's cores are available, although individual parallel jobs on KARLE should be limited to 8 cores and will compete with whatever other jobs happen to be running there. On BOB and ANDY (using the HPC Center's MATLAB DCS license) up to a total of 16 cores may be used by a single job, or 32 cores among a collection of competing jobs for different Unix groups. MATLAB jobs run on BOB or ANDY are given their own resources by the PBS batch job scheduler. In the future, individual MATLAB jobs may be limited to 12 or fewer cores based on demand, total license seats, and evolving usage patterns at the HPC Center.

Modes of Operation: Serial versus Parallel

MATLAB also gives its users the option to run jobs serially or in parallel. Parallel jobs (whether local or remote) can be divided into several distinct categories:

Loop-limited parallel work that relies on MATLAB's 'parfor' loop construct to divide work within looping structures where loop iterations are fully independent. This approach is similar to the traditional thread-based, SMP parallel programming models, OpenMP and POSIX Threads.

Independent or embarrassingly parallel work that relies on MATLAB's 'createTask' construct and divides completely independent workloads (tasks) among a collection of processors that DO NOT need to communicate. This approach is similar to global parallel search tools MapReduce and Hadoop used by Google and Yahoo. MATLAB R2012a refers to this kind of parallel work as Independent Parallel.

Single Program Multiple Data (SPMD) parallel work that relies on MATLAB's 'spmd' and 'labindex' constructs to partition work on a single input data stream among processor-id conditioned paths through a single program. This approach is similar to the traditional MPI programming model. MATLAB R2012a refers to this kind of parallel work as SPMD or Communicating Parallel.

Graphical Processing Unit (GPU) parallel work that relies on MATLAB functions and/or user-provided routines that are GPU-enabled. This approach is MATLAB's method of delivering GPU-accelerated performance while limiting the amount of specialized programming that GPUs typically require (i.e. CUDA). This capability is only available from batch jobs submitted to ANDY from KARLE; BOB and KARLE have NO GPU resources. A brief sketch of GPU-enabled MATLAB functions is given after this list.

Each of these parallel job types (as well as serial work) can be run on KARLE interactively (or in the background) or as jobs submitted from KARLE to BOB or ANDY and its PBS Pro batch scheduler; the one exception is GPU-parallel MATLAB work, which cannot be run directly on KARLE or BOB. Such work must be submitted to ANDY from KARLE.
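
Because the GPU-parallel model is not illustrated elsewhere in this document, a minimal, hypothetical sketch of MATLAB's GPU-enabled functions is given here. It assumes a GPU-equipped compute node (such work must be submitted to ANDY in batch) and the Parallel Computing Toolbox; the variable names are illustrative only and are not part of any HPC Center script.

%  Hypothetical GPU sketch: move data to the GPU, apply a GPU-enabled
%  function, and gather the result back into host memory.
  A = gpuArray(rand(1000));     % copy a 1000 x 1000 random matrix to the GPU
  B = fft(A);                   % GPU-enabled function executes on the device
  C = gather(B);                % copy the result back to the client workspace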

Computing PI Serially on KARLE

To illustrate each MATLAB parallel model described above, a MATLAB script of the classic algorithm for computing PI by numerical integration of the arctangent integrand (1/(1+x**2)) is presented here, first for local computation on KARLE and then for remote batch submission to ANDY or BOB.

First, we present a serial MATLAB script for computing PI using numerical integration locally on KARLE:

%  ----------------------------------------------------------------------
%  M. E. Kress, PhD, July 2010
%  R. B. Walsh,  MS, Aug  2010
%  College of Staten Island, CUNY
%  ----------------------------------------------------------------------
%  Demo MATLAB PI Program for CUNY HPC Wiki:  Local Serial Version
%  ----------------------------------------------------------------------
%  This MATLAB script calculates PI using the trapezoidal rule from the
%  integral of the arctangent (1/(1+x**2)). This is a simple serial code
%  which uses a standard 'for' loop and runs with a matlab pool size of 1.
%  ----------------------------------------------------------------------
%
%  Clear environment and set output format.
%
  clear all; format long eng
%
%  Set processor (lab) pool size
%
  matlabpool open 1
  numprocs = matlabpool('size');
%
%  Open an output file.
%
  fid=fopen('/home/richard.walsh/matlab/serial_PI.txt','wt');
%
%  Define and initialize global variables
%
  mypi = 0.0;
  ttime = 0.0;
%
%  Define and initialize 'for' loop integration variables.
%
  nv = 10000;    %  Set default number of intervals and accuracy
% nv = input('Please define the number of intervals: ')
  ht = 0.0;
  wd = 1.0 / nv;
%
%  Start stopwatch timer to measure compute time.
%
  tic;
%
% This serial 'for' loop, loops over all of 'nv', and computes and sums
% the arctangent function's value at every interval into 'ht'.
%
  for i = 1 : nv
    x = wd * (i - 0.50);
    ht = ht + farc(x);
  end
%
% The numerical integration is completed by multiplying the summed
% function values by the constant interval (differential) 'wd' to get
% the area under the curve.
%
  mypi = wd * ht;
%
%  Stop stopwatch timer.
%
  ttime = toc;
%
% Print total time and calculated value of PI.
%
 fprintf('Number of intervals chosen (nv) was: %d\n', nv);
 fprintf('Number of processors (labs) used was: %d\n', numprocs);
 fprintf('Computed value for PI was: %3.20f\n with error of %3.20f\n', mypi, abs(pi-mypi));
 fprintf('Time to complete the computation was: %6.6f\n', ttime);
%
%
 fprintf(fid,'Number of intervals chosen (nv) was: %d\n', nv);
 fprintf(fid,'Number of processors (labs) used was: %d\n', numprocs);
 fprintf(fid,'Computed value for PI was: %3.20f\n with error of %3.20f\n', mypi, abs(pi-mypi));
 fprintf(fid,'Time to complete the computation was: %6.6f\n', ttime);
%
%   Close output file.
%
 fclose(fid);
%
 matlabpool close;
%
% End of script
%

This script can be entered into the MATLAB CLI or GUI command window (or simply as 'matlab < serial_PI.m'). It will compute PI to an accuracy that depends on the number of intervals (nv is set to 10,000 here). All the work is done by a single processor that MATLAB refers to as a 'lab'. We will not go into the details of the algorithm here, but readers can find many descriptions of it on the Internet. The algorithm is completely defined within the scope of the MATLAB 'for' loop and the statement that follows it. A key feature of the script is the definition of the MATLAB pool size:

matlabpool open 1;

This statement is not actually required for this serial job, but we include it to illustrate the changes that will take place in moving to parallel operation. Here, the MATLAB pool size is set to 1 which forces serial operation. The 'mypi' variable will contain the result of the entire integration (rather than just partials) computed by the single processor ('lab') in the pool. This processor completes every iteration in the 'for' loop.

The function 'farc()', which computes 1/(1 + x**2) for each x, must be made available in the user's MATLAB working directory. While this serial job runs locally on KARLE and will pick up the file where it was created there, later when we submit the job to ANDY or BOB, 'farc()' will need to be transferred to ANDY or BOB as a job dependent file. The job is timed using the 'tic' and 'toc' MATLAB library calls. The accuracy of the computed result is measured by comparing the computed result to MATLAB's internal value for PI (pi) used in the print statements.
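
The listing of 'farc.m' is not reproduced in this document, so the following is a minimal, hypothetical sketch of what it might contain. Note that the prose above describes the integrand as 1/(1 + x**2); because the scripts compare their summed result directly against MATLAB's built-in 'pi', the conventional scaling factor of 4 is assumed here (the integral of 4/(1+x^2) over [0,1] is PI). The actual file used at the HPC Center may differ.

%  farc.m -- hypothetical sketch of the integrand function called by the PI
%  scripts. The factor of 4 is an assumption that scales the arctangent
%  integrand so its integral over [0,1] equals PI.
function y = farc(x)
  y = 4.0 ./ (1.0 + x.^2);
end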

Computing PI Using Loop-Local Parallelism on KARLE

Now, a modified version of the script that runs in parallel using the 'parfor' loop construct is presented.

%  -------------------------------------------------------------------------
%  M. E. Kress, PhD, July 2010
%  R. B. Walsh,  MS, Aug  2010
%  College of Staten Island, CUNY
%  -------------------------------------------------------------------------
%  Demo MATLAB PI Program for CUNY HPC Wiki:  Local Thread-Parallel Version
%  -------------------------------------------------------------------------
%  This MATLAB script calculates PI using the trapezoidal rule from the
%  integral of the arctangent (1/(1+x**2)). This is a simple parallel code
%  which uses a 'parfor' loop and runs with a matlab pool size of 4.
%  -------------------------------------------------------------------------
%
% Clear environment and set output format.
%
 clear all; format long eng;
%
%   Set processor (lab) pool size
%
 matlabpool open 4;
 numprocs = matlabpool('size');
%
%   Open an output file.
%
 fid=fopen('/home/richard.walsh/matlab/parfor_PI.txt','wt');
%
%   Define and initialize global variables
%
 mypi = 0.0;
 ttime = 0.0;
%
%   Define and initialize 'for' loop integration variables.
%
  nv = 10000;    %  Set default number of intervals and accuracy
% nv = input('Please define the number of intervals: ')
  ht = 0.0;
  wd = 1.0 / nv;
%
% Start stopwatch timer to measure compute time.
%
  tic;
%
% This parallel 'parfor' loop divides the interval count 'nv' implicitly among the
% processors (labs) and computes partial sums on each of the arctangent function's value
% at the assigned intervals. MATLAB then combines the partial sums implicitly
% as it leaves the 'parfor' loop construct placing the global sum into 'ht'.
%
  parfor i = 1 : nv
    x = wd * (i - 0.50);
    ht = ht + farc(x);
  end
%
% The numerical integration is completed by multiplying the summed
% function values by the constant interval (differential) 'wd' to get
% the area under the curve.
%
  mypi = wd * ht;
%
%  Stop stopwatch timer.
%
  ttime = toc;
%
% Print total time and calculated value of PI.
%
fprintf('Number of intervals chosen (nv) was: %d\n', nv);
fprintf('Number of processors (labs) used was: %d\n', numprocs);
fprintf('Computed value for PI is: %3.20f\n with error of %3.20f\n', mypi, abs(pi-mypi));
fprintf('Time to complete the computation was: %6.6f\n', ttime);
%
%
fprintf(fid,'Number of intervals chosen (nv) was: %d\n', nv);
fprintf(fid,'Number of processors (labs) used was: %d\n', numprocs);
fprintf(fid,'Computed value for PI is: %3.20f\n with error of %3.20f\n', mypi, abs(pi-mypi));
fprintf(fid,'Time to complete the computation was: %6.6f\n', ttime);
%
%   Close output file.
%
fclose(fid);
%
matlabpool close;
%
% End of script
%

Focusing on the changes, first we see that the MATLAB pool size has been increased to 4 with:

matlabpool open 4;

Next, the 'for' loop has been replaced by the 'parfor' loop, which as the comments make plain, divides the loop's iterations among the 4 processors ('labs') in the pool.

  parfor i = 1 : nv
    x = wd * (i - 0.50);
    ht = ht + farc(x);
  end

The iterations in the loop are assumed to be entirely independent, and by default MATLAB assigns blocks of iterations to each processor (lab) statically and in advance rather than dynamically as each iteration is completed. So, in this case iterations 1 to 2,500 would be assigned to processor 1, iterations 2,501 to 5,000 to processor 2, and so on. Another important feature of the 'parfor' construct is that it automatically generates the global result from each processor's partial result as the loop exits and places that global value in the variable 'ht'.

These are the important differences. When this job is run, the wall-clock time to get the result should be reduced, and the 'numprocs' variable will report that 4 processors were used for the job. An important thing to note is that getting parallel performance gains using the 'parfor' construct requires very few MATLAB script modifications once the serial version of the code has been created. At the same time, this approach is limited to simpler cases where the intrinsic parallelism of the algorithm is confined to loop-level structures and the processors used to do the work are connected to the same memory space (i.e. they are within the same physical compute node). In this case that is KARLE.

Computing PI Using SPMD Parallelism on KARLE

The next step is to modify the above 'parfor' loop-local parallelism to use MATLAB's much more general MPI-like SPMD parallel programming model. Here is the same algorithm adapted to use MATLAB SPMD constructs:

%  ----------------------------------------------------------------------
%  M. E. Kress, PhD, July 2010
%  R. B. Walsh,  MS, Aug  2010
%  College of Staten Island, CUNY
%  ----------------------------------------------------------------------
%  Demo MATLAB PI Program for CUNY HPC Wiki:  Local SPMD Parallel Version
%  ----------------------------------------------------------------------
%  This is a MATLAB SPMD (Single Program Multiple Data) or MPI-like version 
%  of the MPI parallel routine for computing PI using the trapezoidal rule 
%  and the integral of the arctangent (1/(1+x**2)). This example uses the 
%  MATLAB 'labs' abstraction to ascertain the names of the processors and
%  assign them their share of the work. Versions of the algorithm appear in
%  "Computational Physics, 2nd Edition" by Landau, Paez, and Bordeianu and
%  "Using MPI" by Gropp, Lusk, and Skjellum.
%  ----------------------------------------------------------------------
%
%  Clear environment and set output format
%
  clear all; format long eng;
%
%  Set processor (lab) pool size
%
  matlabpool open 4;
  numprocs = matlabpool('size');
%
%   Open an output file.
%
  fid=fopen('/home/richard.walsh/matlab/spmd_PI.txt','wt');
%
%  Start the SPMD block which executes the same MATLAB commands on all processors (labs)
%
spmd
%
%  Find out which processor I am using the 'labindex' variable
%
   myid = labindex;
%
%  Define and set composite array variables
%
  mypi = 0.0;
  totpi = 0.0;
  ttime = 0.0;
%
%   Define and initialize 'for' loop integration variables.
%
  nv = 10000;   %  Set default number of intervals and accuracy
% nv = input('Please define the number of intervals: ')
  ht = 0.0;
  wd = 1.0 / nv;
%
%  Start stopwatch timer on processor 1 to measure compute time
%  
  if (myid == 1)
     tic;
  end
%
%  This parallel 'for' loop divides the interval count 'nv' explicitly among the processors
%  (labs) using the processor id 'myid' and the loop step size defined by 'numprocs'. 
%  The partial sums from each processor of the arctangent function's value are then 
%  combined explicitly via the call to the 'gplus()' global reduction function. Because this
%  is part of the SPMD block the global sum is generated on each processor.
%
  for i = myid : numprocs : nv
     x = wd * (i - 0.50);
     ht = ht + farc(x);
  end
%
  mypi = wd * ht;
%
%  The variable 'totpi' is a composite array with one storage location for each 
%  processor (lab) each of which gets the grand total generated by the 'gplus()'
%  function. For instance, the grand total delivered to processor (lab) 1 is stored
%  in totpi{1}.
%
   totpi = gplus(mypi);
%
%  Complete stopwatch timing of computation (including gather by 'gplus()') on 
%  processor (lab) 1. Because the 'gplus()' call is a blocking operation this time
%  is the same as the time to finish the whole calculation. 
%
  if (myid == 1)
     ttime = toc;
  end
%
%  Terminate the SPMD block of the code
%
end
%
%   Print computation time and calculated value of PI. Use the index for processor
%   1 to access processor 1 specific array elements of the composite variables.
%
fprintf('Number of intervals chosen (nv) was: %d\n', nv{1});
fprintf('Number of processors (labs) used was: %d\n', numprocs);
fprintf('Computed value for PI was: %3.20f with error of %3.20f\n', totpi{1}, abs(totpi{1}-pi));
fprintf('Time to complete the computation was: %6.6f seconds\n', ttime{1});
%
%
fprintf(fid,'Number of intervals chosen (nv) was: %d\n', nv{1});
fprintf(fid,'Number of processors (labs) used was: %d\n', numprocs);
fprintf(fid,'Computed value for PI was: %3.20f with error of %3.20f\n', totpi{1}, abs(totpi{1}-pi));
fprintf(fid,'Time to complete the computation was: %6.6f seconds\n', ttime{1});
%
%   Close output file.
%
fclose(fid);
%
matlabpool close;
%
% End of script
%

Looking again at the differences between this SPMD script and the serial and 'parfor' parallel versions above, we see that the SPMD block is marked by the 'spmd' header and the terminating 'end' statement much later in the script.

spmd

.
.
.

end

This entire section of the script is run independently by each processor (lab) generated by the 'matlabpool open 4' command at the top of the script. But, if each processor runs the same section of the script, the question is how is the work divided? Would it not just be computed in its entirety and redundantly 4 times? The division is accomplished in the same way that it would be in an MPI parallel implementation of the PI integration algorithm, using processor-unique IDs and the processor count.

MATLAB provides these constructs using the 'labindex' and 'numprocs' variables within the 'spmd' block. The 'labindex' contains a unique value for each processor in the pool counting from 1 while the 'numprocs' variable is assigned the MATLAB pool size at the beginning of the script. The values for each can be used to conditionally direct and control the path of each processor through what is the same script. Here, this is most importantly visible in the 'for' loop:

  for i = myid : numprocs : nv
     x = wd * (i - 0.50);
     ht = ht + farc(x);
  end

The basic 'for' loop appears again, but with a starting iteration ('myid') set from the 'labindex' of each processor and the processor count (lab pool size) used as the step size ('numprocs') for the loop. In this way, the 'for' loop's work is explicitly divided among the 4 processors. Processor 1 gets iterations 1, 5, 9, 13, etc. Processor 2 gets iterations 2, 6, 10, 14, etc., and so on. Each processor ends up with its own unique fraction of the 'for' loop's assigned work. The variables within the SPMD block, including this loop, are MATLAB composite arrays with values and memory locations unique to each processor.

This fact has two important consequences that have implications for later SPMD work on the distributed nodes of our remote clusters, BOB and ANDY. First, each chunk of 'for' loop work can be run on a physically separate compute node with its own memory space. Second, the sum in the variable 'ht' is only partial on each processor, and the MATLAB programmer (you) must explicitly combine the partial sums to get the correct global result for PI. This is accomplished with the 'gplus()' function in the second line after the 'for' loop with:

totpi = gplus(mypi);

The 'mypi' composite array result has a unique value on each processor equal to approximately 1/4 of the value of PI. Processor-specific values can be explicitly referenced using the 'mypi{n}' expression, where 'n' is the lab index or processor ID value. The 'gplus()' function is one of a class of global reduction functions that will gather partial results computed by each member of a MATLAB SPMD pool, perform a specific arithmetic operation, and then place the result in each pool member's memory space. In this case, the composite array element 'totpi{n}' on each processor in the pool will receive the global sum of the partial values of PI on each processor. There are other global reduction 'g' functions like 'gprod()', 'gmax()', 'gmin()', etc., each with its own operation type. Refer to the MATLAB website for further information: http://www.mathworks.com/products/parallel-computing/demos.html?file=/products/demos/shipping/distcomp/paralleltutorial_gop.html
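
As a brief, hypothetical illustration of composite-array indexing and global reduction (a sketch only, assuming a pool of 4 labs is already open):

%  Sketch: each lab stores its own value in 'partial'; gplus() delivers the
%  global sum to every lab; composite indexing then reads lab-specific values.
spmd
  partial = labindex;            % lab 1 holds 1, lab 2 holds 2, and so on
  total   = gplus(partial);      % every lab receives the global sum (10)
end
partial{3}                       % the value held by lab 3
total{1}                         % the reduced sum as seen by lab 1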

The rest of the SPMD script is largely the same as the others; however, a few additional comments are in order. First, note that when the SPMD block is closed, the composite array elements are referenced explicitly in the print statements. The script prints out the results present on processor one (1). Second, note that the timing results were only collected from processor one (1). One might wonder whether, if processor one (1) were to complete its partial result faster, the timing results gathered would be in error. This is prevented by the fact that the 'gplus()' call is blocking, which means that each processor (lab) will wait at the 'gplus()' call until all processors have received the global result. This makes the compute time from processor one (1) representative of the time for all.

This script and the others above can all be run directly from the MATLAB CLI or GUI command window on KARLE. Please explore the use of each and look at the timings generated.

Running Remote Parallel Jobs on BOB or ANDY

With some preparation of the communications link between KARLE and BOB or ANDY, and some other minor modifications, the scripts presented above can also be submitted to the PBS Pro batch queues on BOB or ANDY. The modified script is transferred automatically, submitted to BOB or ANDY's PBS batch queuing system, and the results are automatically returned to KARLE. The entire process can be tracked from the MATLAB GUI or CLI on KARLE, although the jobs are also visible on BOB or ANDY.

This process is made possible by building a $HOME file tree on BOB or ANDY that mirrors the tree on KARLE, together with the secure copy ('scp') and secure shell ('ssh') commands. The procedure for setting up and running the above scripts remotely on ANDY is presented here, although with a simple string substitution they will work on BOB as well. With the introduction of KARLE, all MATLAB users within the CUNY family (whether local to the College of Staten Island or not) have equal access to both KARLE's client-local MATLAB capability and ANDY and BOB's remote-cluster MATLAB capability.

In MATLAB, remote parallel jobs can be divided into two basic classes. An Independent Parallel job is a workload divided among two or more fully independent MATLAB workers (processors) that generate fully independent PBS Pro jobs (unique job IDs). MATLAB manages these as tasks within a single MATLAB job, but to PBS they are separate jobs with distinct PBS JIDs. No communication is possible or expected between such Independent Parallel processes (jobs).

When submitted to ANDY or BOB from KARLE, the MATLAB client submits several independent jobs, one for each worker, to the PBS batch scheduler. Each worker works on a piece of the same problem, but runs fully independently of the others. They are queued up by PBS Pro as separately scheduled serial jobs in PBS Pro's serial execution queue, qserial. Each Independent Parallel job has its own job ID and is run on its own compute node. In other contexts, such "Independent Parallel" work might be referred to as embarrassingly parallel.

On the other hand, MATLAB Communicating Parallel jobs do not produce independent processes run by separate 'workers', but produce a single, coupled, parallel workload run on separate processors under MATLAB's 'labs' abstraction. Such workloads produce a single PBS batch job with one job ID, even while running on multiple processors. Communicating Parallel jobs are run in PBS Pro's parallel production qlong16 queue and have only one job ID. Communication is presumed to be required between the processes (labs) and relies on MPI-like inter-process communication. Such "Communicating Parallel" work falls into the same category as the Single Program Multiple Data (SPMD) parallel job class that we introduced above.

Regardless of the type of remote job (Independent Parallel or Communicating Parallel), one must have set up two-way, passwordless 'ssh' between the submitting client (KARLE or a CSI office workstation) and ANDY or BOB's head node. This is currently possible only from within CUNY's CSI campus, and this is the reason non-CSI users must use KARLE to submit MATLAB jobs to ANDY or BOB.

Licensing requirements for client-to-cluster job submission

MATLAB combines its basic tools for manipulating matrices with a large suite of applications-specific libraries or 'toolboxes', including its Parallel Computing Toolbox, which is required to submit jobs to a cluster. In order to successfully run parallel MATLAB jobs on ANDY or BOB, a user must have (or be able to acquire over their campus network) licenses for all the MATLAB components that will be used by their job. At a minimum, users must have a client-local license for MATLAB itself and the Parallel Computing Toolbox. For those who wish to submit work from deskside MATLAB clients within the CSI campus network, CSI has 5 combined MATLAB and Parallel Computing Toolbox node-locked client licenses to distribute on a case-by-case and temporary basis. With these two licenses and the licenses that CUNY HPC provides on ANDY and BOB for the Distributed Computing Server (DCS), basic MATLAB Distributed or Parallel jobs can be run (governed by the 'ssh' requirement above). If the job makes use of other applications-specific toolboxes (e.g. Aerospace Toolbox, Bioinformatics Toolbox, Econometrics Toolbox, etc.), it will attempt to acquire those licenses from the CSI campus MATLAB license server as well. As such, remote job submissions from a MATLAB-licensed CSI desk-side system that use a licensed MATLAB toolbox will require three distinct license acquisition events to succeed: one on the local desk-side computer, one from the HPC Center's MATLAB toolbox license server (this runs on NEPTUNE), and one from the HPC Center's MATLAB DCS license server (this runs on ANDY).

Note: When running locally on KARLE or submitting batch jobs from KARLE to ANDY or BOB, these license requirements will be met, although not every MATLAB toolbox has been licensed by CSI.

Currently, a properly configured CSI campus client that also requires an application-specific toolbox to complete its work will have two license.lic files installed on their system in ${MATLAB_ROOT}/licenses (the value for the MATLAB_ROOT directory can be determined on the machine of interest by typing 'matlabroot' at the MATLAB command-line prompt). The first will be the node-local license (say, my_machine.lic) for MATLAB and the Parallel Computing Toolbox, and the second will be the network-served license (network.lic) pointing to the campus MATLAB toolbox license server (this is NEPTUNE as mentioned above). These are read in alphabetical order upon MATLAB startup to obtain proper licensing. Other licensing schemes are conceivable.

The node-local license for MATLAB and the Parallel Computing Toolbox might look something like this. The first INCREMENT block provides the node-local MATLAB capability and the second provides the Parallel Computing Toolbox capability:

# BEGIN--------------BEGIN--------------BEGIN
# DO NOT EDIT THIS FILE.  Any changes will be overwritten.
# MATLAB license passcode file.
# LicenseNo: 99999
INCREMENT MATLAB MLM 22 01-jan-0000 uncounted 99C9EC4D3695 \
        VENDOR_STRING=vi=30:at=187:pd=1:lo=GM:lu=200:ei=944275: \
        HOSTID=MATLAB_HOSTID=0015179549BA:000000 PLATFORMS="i86_re \
        amd64_re" ISSUED=30-Sep-2009 SN=000000 TS_OK
INCREMENT Distrib_Computing_Toolbox MLM 22 01-jan-0000 uncounted \
        E77E2F473055 \
        VENDOR_STRING=vi=30:at=187:pd=1:lo=GM:lu=200:ei=944275: \
        HOSTID="0015179549ba 0015179549bb 002219504c4f 002219504c51" \
        PLATFORMS="i86_re amd64_re" ISSUED=30-Sep-2009 SN=000000 TS_OK
# END-----------------END-----------------END

The network license for any required Applications Toolboxes would look something like this:

SERVER neptune.csi.cuny.edu 002219A46FF7 28000
USE_SERVER

(The license files above are for illustration only, and are not functional license files.)

Within the CSI campus, a node-local license file for MATLAB, the Parallel Computing Toolbox license, and the network licenses for the MATLAB applications tool boxes that CSI supports can be obtained from CUNY's HPC group. In addition, installations of MATLAB on a CSI campus client must have included the current on-campus File Installation Key. This discussion does not apply to non-CSI users because the MATLAB installation on KARLE is complete.

In the future, if arrangements are made for non-CSI CUNY sites to have direct 'ssh' access to CUNY's HPC clusters at CSI, those non-CSI sites will need to provide local licensing for MATLAB itself, the Parallel Computing Toolbox, and any Applications Toolboxes they require. For all CUNY users (within and outside of CSI), the CUNY clusters at CSI provide the proper DCS licensing automatically for jobs started on the cluster as long as they arrive with the proper licenses for the Toolboxes they use. Jobs initiated from KARLE will meet this requirement.

Setting up the client and cluster environment for remote execution

A number of steps must be taken to successfully transfer, submit, and recover MATLAB jobs submitted on KARLE from the HPC clusters ANDY and BOB. An important first step is to ensure that the version of MATLAB running locally is identical to the version running on the CUNY clusters. This has been taken care of for those submitting jobs from KARLE, but could be an issue for those at the CSI campus setting up deskside MATLAB clients. The CUNY HPC Center is currently running MATLAB Version R2012a, but to determine the release generally, log in to KARLE, run MATLAB's command-line interface (CLI), and at the >> prompt enter MATLAB's 'version' command. If identical versions are not running, the local MATLAB will detect a mismatch, assume there are potential feature incompatibilities, and not submit your job. The error message produced when this occurs is not very diagnostic.
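
For example, from the MATLAB prompt on KARLE (a sketch; the exact strings depend on the installed release):

>> version                 % full version and release string of the local MATLAB
>> version('-release')     % just the release name, e.g. '2012a'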

Note: Again, this does not apply to non-CSI users submitting their work from KARLE where the versions already match.

Next, two-way, passwordless secure login and file transfer in both directions must be working correctly between KARLE, and BOB and ANDY. For Linux-to-Linux transfers this involves following the procedures outlined in the 'ssh-keygen' man page and/or referring to the numerous descriptions on the web. This includes putting the public keys generated with 'ssh-keygen' on both the client (KARLE) and the servers (ANDY and BOB) into the other machine's authorized_keys file. For Windows-to-Linux transfers this is usually accomplished with the help of the Windows remote login utility 'PuTTY'. Please refer to the numerous HOW TOs on the web to complete this. Windows users that have trouble with this can send email to the CUNY HPC Center helpline 'hpchelp@csi.cuny.edu' (Note: CSI clients that are behind a firewall or reside on a local subnet behind a router may require special configuration, including port-forwarding of return 'ssh' traffic on port 22 from ANDY and BOB through the local router to the local client). Non-CSI users initiating jobs from KARLE can rely on the standard procedures from two Linux systems.

In addition, on the cluster, passwordless 'ssh' must be allowed for the user from the head node to all of the compute nodes where the MATLAB job might run. This is the default for user accounts on ANDY and BOB, but it should be checked by the user before submitting jobs. Because the home directory on the head node is shared with all the compute nodes, accomplishing this is a simple matter of including the head node's public key in the 'authorized_keys' file in the user's '.ssh' directory. Again refer to the ssh-keygen man page or many on-line sources for more detail here.

Once passwordless 'ssh' is operational, the CUNY HPC group recommends studying the sections in MATLAB's Parallel Computing Toolbox User Guide [41]. The sections on 'Programming Distributed Jobs' and 'Programming Parallel Jobs' are particularly useful. The sub-sections titled 'Using the Generic Scheduler Interface' are specific to the topic of submitting remote jobs to the so-called 'Generic Interface', which is the term that MATLAB uses for workload managers generally (PBS, SGE, etc.). Note: Reading through these sections of MATLAB's on-line documentation is strongly recommended before submitting the test jobs provided below.

In addition, an important source of information can be found in the README files in the following directories under MATLAB's root directory or installation tree on your campus-client system or on the head node of BOB:

${MATLAB_ROOT}/toolbox/distcomp/examples/integration/pbs
${MATLAB_ROOT}/toolbox/local/pbs/unix

There are similar directories for other common workload managers at the same level. Since, in a submission from KARLE or a campus client to ANDY or BOB, there is not a shared file system, users should pay particularly close attention to the contents of the 'nonshared' subdirectory in the first PBS directory above. There is guidance for both Linux and Windows clients on non-shared file systems there. Further information can be found at the MATLAB website here [42] and here [43].

Computing PI Serially Remotely on ANDY or BOB

Below is a fully commented MATLAB remote batch job submission script. This can be thought of as the 'boiler-plate' wrapping that is required to run the serial script for computing PI presented above on ANDY or BOB instead of on KARLE. Much of the explicit scripting presented here can be used to define a MATLAB configuration template for ANDY or BOB within the MATLAB GUI on KARLE to reduce the number of explicit commands one must enter in the GUI command window to submit a batch job. From the text-driven MATLAB CLI, all of these commands would need to be entered to run the job remotely on ANDY (or BOB with the proper string substitutions).

%  ---------------------------------------------------------------------------
%  M. E. Kress, PhD, July 2010
%  R. B. Walsh,  MS, Aug  2010
%  College of Staten Island, CUNY
%  ---------------------------------------------------------------------------
%  Demo MATLAB PI Program for CUNY HPC Wiki:  Remote (ANDY) Batch Serial Version 
%  ---------------------------------------------------------------------------
%  This MATLAB script calculates PI using the trapezoidal rule from the
%  integral of the arctangent (1/(1+x**2)). This is a simple serial code
%  which uses a standard 'for' loop and runs with a matlab pool size of 1.
%  This version includes all the code required to complete the submission
%  of a job from the local client (KARLE) to a remote cluster (ANDY) for standard
%  serial processing and returns the results to the client for viewing.
%
%  This version is designed to run stand-alone from the MATLAB command line window
%  in the MATLAB GUI or from the text-driven command-line interface (CLI). Many 
%  of the commands in this file could be included in a MATLAB GUI "configuration" 
%  template for batch job submission to ANDY simplifying the script considerably.
%  Versions of this algorithm appear in "Computational Physics, 2nd Edition" by 
%  Landau, Paez, and Bordeianu; and "Using MPI" by Gropp, Lusk, and Skjellum.
%  ---------------------------------------------------------------------------
%
%  Define the name of the remote cluster (server, ANDY in this case) running PBS
%
clusterHost = 'andy.csi.cuny.edu';
%
%  Create the remote cluster object and define the path on the client (local) to
%  the working directory from which MATLAB stages the job and expects to find all
%  required script files. At the CUNY HPC Center, the client is typically KARLE
%  at (karle.csi.cuny.edu).
%
andy = parallel.cluster.Generic( 'JobStorageLocation', '/home/richard.walsh/matlab' );
%
%  Define the path on the server (remote) to the working directory to which MATLAB
%  stages the PBS job and expects to find all required files and scripts. In this
%  script this will be on ANDY.
%
remoteJobStorageLocation = '/home/richard.walsh/matlab_remote';
%
%  Set other parameters required by the MATLAB job scheduler like the MATLAB root
%  directory on the cluster, the file system type, and the OS on the cluster.
%
set(andy, 'ClusterMatlabRoot', '/share/apps/matlab/default');
set(andy, 'HasSharedFilesystem', false);
set(andy, 'OperatingSystem', 'unix');
%
%  Define the names of auxiliary remote job submission functions
%
set(andy, 'GetJobStateFcn', @getJobStateFcn);
set(andy, 'DeleteJobFcn', @deleteJobFcn);
%
%  Specify the name of the serial job submission function and its arguments.
%  This function determines the queue and resources used by the job on the server
%  (ANDY). MATLAB has two alternative destination queues. Users running test
%  or development jobs should specify the function with the 'Dev' suffix. Those
%  running production jobs should specify the function with the 'Prod' suffix.
%  Both of these scripts are located on KARLE in the MATLAB tree.
%
% set(andy, 'IndependentSubmitFcn', {@independentSubmitFcn_Dev, clusterHost, remoteJobStorageLocation});
set(andy, 'IndependentSubmitFcn', {@independentSubmitFcn_Prod, clusterHost, remoteJobStorageLocation});
%
%  Create the Independent Serial job object to be assigned to the job scheduler function
%
sjob = createJob(andy);
%
%  If this job requires data or function files ('farc.m') to run then they must
%  be transferred over to the cluster with the main routine at the time of job
%  submission, unless they are already present in the remote working directory or
%  they are MATLAB intrinsic functions.  Any file needed to run the job locally
%  will also be needed to run it remotely. This is accomplished by defining file
%  dependencies as shown in the following section.  Put each required file in single
%  quotes and inside {}'s as shown.
%
set(sjob, 'AttachedFiles', {'serial_PI_func.m' 'farc.m'});
%
%  Create and name a task (defined here by our serial MATLAB script for computing PI)
%  to be completed by the remote MATLAB job (worker) on ANDY. The task which will be executed
%  on one processor should be provided in MATLAB function rather than MATLAB script form
%  which allows the users to indicate which variables must be transferred on input and
%  returned as output.
% 
stask = createTask(sjob,@serial_PI_func,1,{});
%
%   Submit the job to the MATLAB scheduler on KARLE which moves all files to ANDY and initiates
%   the PBS job there.
%
submit(sjob);
%
%   Wait for the remote PBS batch job on ANDY to finish. This implies that the
%   batch job has finished successfully and returned its outputs to the client
%   working directory on KARLE.
%
wait(sjob, 'finished');
%
%  Get and print output results from disk
%
results = fetchOutputs(sjob);
%
%   End of PbsPiSerial.m

Most of the scripting is fully described in the comment sections above, but several things should be called out. First, jobs can be submitted to either the development or production queues. Short running test jobs like this should be run in the development queue, which is reserved and protected from longer running production jobs. Jobs that run longer than the development queue CPU time limit of 64 minutes will be killed automatically by PBS. The development queue allows jobs of no more than 8 processors (8 CPU minutes on each processor). To choose the development queue, use the job submit function listed above with the 'Dev' suffix. To use the production queue for longer (no time limit) and more parallel jobs (limit of 16 processors), use the 'Prod' suffix.

Second, as suggested in the comments, all user files required by the job must be included in the file dependencies line and be present in the directory from which MATLAB was run. The example from above is:

set(sjob, 'AttachedFiles', {'serial_PI_func.m' 'farc.m'});

Using the MATLAB function format rather than script format for dependent files is a convenient way to deliver inputs to the job and recover outputs from the job. A MATLAB function requires a header line of the following type:

function [output1, output2, output3, ...] = serial_PI_func(input1, input2, input3, ...)

Functions may have zero or more inputs and zero or more outputs depending on what they are computing.
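
For illustration, the serial PI script presented earlier could be recast in function form along the following lines. This is a hypothetical sketch only; the actual 'serial_PI_func.m' attached to the job may differ, but it must return one output to match the 'createTask' call in the script above.

%  serial_PI_func.m -- hypothetical function form of the serial PI script,
%  returning the computed value so that fetchOutputs() can collect it.
function mypi = serial_PI_func()
  nv = 10000;                     % number of integration intervals
  wd = 1.0 / nv;                  % interval width
  ht = 0.0;
  for i = 1 : nv
    x  = wd * (i - 0.50);
    ht = ht + farc(x);            % 'farc.m' must be attached with the job
  end
  mypi = wd * ht;                 % returned as the task's single output
end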

Finally, the 'createTask' command names the job's driver function (serial_PI_func in this case), sets the number of processors (1 here), and names the function's arguments (none here). The 'submit' command initiates the job, starting one session and one PBS job on ANDY for each task defined by each separate 'createTask' command.
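
Once the 'wait' call returns and 'fetchOutputs(sjob)' has been called as in the script above, the returned cell array can be examined on KARLE. A sketch, assuming 'serial_PI_func' returns a single value:

mypi = results{1};               % the single value returned by serial_PI_func
fprintf('PI computed remotely: %3.20f\n', mypi);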

Computing PI In Parallel Remotely on ANDY or BOB

Below is a fully commented MATLAB parallel remote batch job submission script. It is very similar to the 'boiler-plate' wrapping presented above for serial job submission to ANDY. Like the serial script, much of the explicit scripting presented here can also be used to define a cluster (ANDY or BOB) configuration template for parallel job submission in the MATLAB GUI that reduces the number of commands one must enter in the GUI command window. And similarly, from the text-driven MATLAB CLI, all of these commands would need to be entered to run the job remotely on ANDY. To run the same script on BOB, simply complete the appropriate string substitutions.

%  ---------------------------------------------------------------------------
%  M. E. Kress, PhD, July 2010
%  R. B. Walsh,  MS, Aug  2010
%  College of Staten Island, CUNY
%  ---------------------------------------------------------------------------
%  Demo MATLAB PI Program for CUNY HPC Wiki:  Remote (ANDY) Batch SPMD Parallel Version
%  ---------------------------------------------------------------------------
%  This is a MATLAB SPMD (Single Program Multiple Data) or MPI-like version
%  of the parallel algorithm for computing PI using the trapezoidal rule
%  and the integral of the arctangent (1/(1+x**2)). This example generates a
%  MATLAB pool under its 'labs' abstraction, ascertains the names of each processor
%  (lab), and assigns each of them a share of the work. This version of the 
%  algorithm submits the SPMD job from the local client (KARLE) to a remote cluster
%  (ANDY) for parallel processing and returns the results to the client for viewing.
%
%  This version is designed to run stand-alone from the MATLAB command line window
%  in the MATLAB GUI or from the text-driven command-line interface (CLI). Many 
%  of the commands in this file could be included in a MATLAB GUI "configuration"
%  template for batch job submission to ANDY simplifying the script considerably.
%  Versions of this algorithm appear in "Computational Physics, 2nd Edition" by
%  Landau, Paez, and Bordeianu; and "Using MPI" by Gropp, Lusk, and Skjellum.
%  ---------------------------------------------------------------------------
%
%  Define the name of the remote cluster (server, ANDY in this case) running PBS
%
clusterHost = 'andy.csi.cuny.edu';
%
%  Create the remote cluster object and define the path on the client (local) to
%  the working directory from which MATLAB stages the job and expects to find all
%  required script files. At the CUNY HPC Center, the client is typically KARLE
%  at (karle.csi.cuny.edu).
%
andy = parallel.cluster.Generic( 'JobStorageLocation', '/home/richard.walsh/matlab' );
%
%  Define the path on the server (remote) to the working directory to which MATLAB
%  stages the PBS job and expects to find all required files and scripts. In this 
%  script this will be on ANDY.
%
remoteJobStorageLocation = '/home/richard.walsh/matlab_remote';
%
%  Set other parameters required by the MATLAB job scheduler like the MATLAB root
%  directory on the cluster, the file system type, and the OS on the cluster.
%
set(andy, 'ClusterMatlabRoot', '/share/apps/matlab/default');
set(andy, 'HasSharedFilesystem', false);
set(andy, 'OperatingSystem', 'unix');
%
%  Define the names of auxiliary remote job submission functions
%
set(andy, 'GetJobStateFcn', @getJobStateFcn);
set(andy, 'DeleteJobFcn', @deleteJobFcn);
%
%  Specify the name of the parallel job submission function and its arguments.
%  This function determines the queue and resources used by the job on the server
%  (ANDY). MATLAB has two alternative destination queues. Users running test
%  or development jobs should specify the function with the 'Dev' suffix. Those
%  running production jobs should specify the function with the 'Prod' suffix.
%  Both of these scripts are located on KARLE in the MATLAB tree.
%
% set(andy, 'CommunicatingSubmitFcn', {@communicatingSubmitFcn_Dev, clusterHost, remoteJobStorageLocation});
set(andy, 'CommunicatingSubmitFcn', {@communicatingSubmitFcn_Prod, clusterHost, remoteJobStorageLocation});
%
%  Create the Communicating Parallel job object to be assigned to the job scheduler function
%
pjob = createCommunicatingJob(andy);
%
%  If this job requires data or function files ('farc.m') to run then they must
%  be transferred over to the cluster with the main routine at the time of job 
%  submission, unless they are already present in the remote working directory or
%  they are MATLAB intrinsic functions.  Any file needed to run the job locally
%  will also be needed to run it remotely. This is accomplished by defining file
%  dependencies as shown in the following section.  Put each required file in single
%  quotes and inside {}'s as shown.
%
set(pjob, 'AttachedFiles', {'spmd_PI_func.m' 'farc.m'});
%
%  Define the number of processors (labs, workers) to use for this job. To ensure that
%  you get exactly the processor count you want, specify the maximum and minimum number to be
%  the same.
%
set(pjob, 'NumWorkersRange', [4,4]);
%
%  Create and name a task (defined here by our parallel SPMD MATLAB script for computing PI)
%  to be completed by the remote MATLAB job (lab, worker) pool. The task which will be executed
%  by each processor (lab, worker) should be provided in MATLAB >>function<< rather than MATLAB 
%  script form which allows the users to indicate which variables must be transferred on
%  input and returned as output.
%
ptask = createTask(pjob,@spmd_PI_func,4,{});
%
%   Submit the job to the MATLAB scheduler on KARLE which moves all files to ANDY and initiates
%   the PBS job there.
%
submit(pjob);
%
%   Wait for the remote PBS batch job on ANDY to finish. This implies that the
%   batch job has finished successfully and returned its outputs to the client
%   working directory on KARLE.
%
wait(pjob, 'finished');
%
%  Get and print output results from disk
%
results = fetchOutputs(pjob);
%
%   End of PbsPiParallel.m

The reader will see that much of this parallel job submission script is the same as the serial script above, but there are important differences that need to be explained. First, the job submit function is different:

set(andy, 'CommunicatingSubmitFcn', {@communicatingSubmitFcn_Prod, clusterHost, remoteJobStorageLocation});

Above, the 'independentSubmitFcn_Prod' function was used, while here it is 'communicatingSubmitFcn_Prod', a function specific to communicating parallel jobs. Another difference is that here a MATLAB pool is requested rather than a serial task, and the minimum and maximum pool size is defined. This ensures that the job will use exactly four (4) processors on ANDY.

pjob = createCommunicatingJob(andy);

.
.
.

set(pjob, 'NumWorkersRange', [4,4]);

This parallel job also has file dependencies, but this script uses the function version of the SPMD-parallel algorithm presented above for running in SPMD parallel mode on KARLE. Finally, this script creates the task (run cooperatively by the four labs in the pool) that completes the computation, making it a Communicating Parallel job that will be started by PBS with one job ID but using 4 cores. Note that the third argument to 'createTask' is the number of output arguments the task function returns, not a processor count.

ptask = createTask(pjob,@spmd_PI_func,4,{});

Computing Remotely on ANDY Using MATLAB's GPU capability

In addition to the remote (batch), serial and parallel, CPU-based cluster computing described above, MATLAB supports the use of GPUs in remote job submissions to the HPC Center's ANDY system, which has 96 Fermi GPUs. This includes both the use of MATLAB's growing number of built-in GPU-parallel functions and operators, as well as user-authored and compiled CUDA C and CUDA Fortran routines.

Over the last three years, MATLAB has added a higher-level way of expressing CUDA-like host-to-device and device-to-host data motion and CUDA kernel computation directly in MATLAB scripts. The HPC Center has created the required middleware PBS job-launch scripts to support MATLAB GPU computing, similar to those supporting remote MATLAB CPU computing. In this section, some basic examples are provided on how to use ANDY's GPU capability while running MATLAB from KARLE. For some algorithmic kernels, GPUs can provide factors of 2, 3, and even 10 in performance improvement over pure single-CPU alternatives. For this reason, the staff at the HPC Center encourages its MATLAB users to try out the MATLAB GPU computing methods described here.

This first simple example script makes use of a MATLAB pre-defined function for computing FFTs on the GPU. It is an "Independent" (as opposed to a "Communicating") MATLAB job that uses only 1 CPU and 1 GPU to complete its work. Two files are required and presented below. The first is the MATLAB GPU submit script, which is very similar to those above used for CPU "Independent" job submissions. The second is a GPU-targeted source script (".m" file) using MATLAB internal GPU functions and operators that is transferred from the client (KARLE) to the server (ANDY) and run on ANDY's GPUs under PBS.

Here is the MATLAB submit script that can be offered to the MATLAB command-line on KARLE (or to develop a GPU job submission context from within the MATLAB GUI):

%
% Initialize arguments to MatLab SubmitFcn for an independent parallel
% GPU job (1 task) to be submitted to the PBS production_mtlg queue. --rbw
%

% Define destination cluster for remote GPU job
clusterHost = 'andy.csi.cuny.edu';

% Define the remote working directory on server-remote (cluster)
remoteJobStorageLocation = '/home/richard.walsh/matlab_remote';

% Create cluster object and define client-local working directory
% for the client JobStorageLocation
andy = parallel.cluster.Generic( 'JobStorageLocation', '/home/richard.walsh/matlab' );

% Define other cluster object job submission variables
set(andy, 'ClusterMatlabRoot', '/share/apps/matlab/default');
set(andy, 'HasSharedFilesystem', false);
set(andy, 'OperatingSystem', 'unix');
set(andy, 'GetJobStateFcn', @getJobStateFcn);
set(andy, 'DeleteJobFcn', @deleteJobFcn);

% Name and associate a job submission function (script) with the cluster object
% To run an independent parallel job, you must specify the IndependentSubmitFcn
set(andy, 'IndependentSubmitFcn', {@independentSubmitFcn_ProdG, clusterHost, remoteJobStorageLocation});

% Create the independent parallel GPU job
igpujob = createJob(andy);

% Create a dependency on the GPU parallel function file gfft.m to get it transferred to cluster
set(igpujob, 'AttachedFiles', {'gfft.m'});

% Create the individual sub-tasks (1) in the independent parallel GPU job
igputask = createTask(igpujob,@gfft,1,{});

% Submit the independent parallel GPU job (1 task)
submit(igpujob);

% Wait until the independent GPU job registers a finished state on the 
% submitting client
wait(igpujob, 'finished');

% Collect and print results
results = fetchOutputs(igpujob)

This is very similar to the CPU submit scripts presented elsewhere in that it fills in the variables required to work with the remote system (the MATLAB "cluster object"), defines the remote job submit function and the task to be executed by the job, submits the job, and then checks for completion and prints the output. The main difference is the name of the MATLAB "Independent" submit function (independentSubmitFcn_ProdG), which includes the terminating "G" to select a GPU-ready PBS script when it is executed on ANDY. The other difference is in the function file being transferred from the submission directory on KARLE (/home/richard.walsh/matlab) to the remote working directory on ANDY (/home/richard.walsh/matlab_remote). This GPU-capable function script is called "gfft.m" and is presented here:

function hstdata = gfft

hstrand = rand(32, 'single');

devrand = gpuArray(hstrand);

devfft  = fft(devrand);
devsum  = ( real(devfft) + devrand ) * 6;

hstdata = gather(devsum);

disp(hstdata);

This file must be present in your working directory on KARLE, but is like any other MATLAB function file in that no special pre-compilation or processing for the remote GPU is required. It simply uses MATLAB's built-in GPU-targeted functions. Walking quickly through the "gfft" function, the first line gives it its name ("gfft") and the name of the one output variable ("hstdata") that it returns. The next line fills the 32 x 32 MATLAB array "hstrand" on the host CPU with single-precision random numbers using MATLAB's random number generation function. This array is then moved from the CPU host to the GPU device with the "gpuArray" operator, much as CUDA's "cudaMemcpy()" function works. The next two lines complete operations on the GPU using MATLAB's built-in GPU functions, because the array "devrand" is known to reside in the GPU device's memory. The result "devsum" (still on the GPU) is copied back to the function's output array "hstdata" on the host CPU by the "gather(devsum)" call, and finally this host-resident array is written to MATLAB's output.

To complete a run using this example, bring up MATLAB in command-line mode and paste the first script into the window. At the point at which the submit command is executed you will be prompted for your user name and password on ANDY. The second file, "gfft.m", should be available in your working directory on KARLE so that it can be transferred to ANDY prior to its execution on the GPU. More details on GPU computing within MATLAB are available here [44].
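For instance, a minimal sketch of driving this from KARLE's shell (assuming the submit commands above have been saved in a file named 'gpu_fft_submit.m', a hypothetical name, in the same directory as 'gfft.m') might look like:

# On KARLE, from the directory containing gpu_fft_submit.m and gfft.m
ls gpu_fft_submit.m gfft.m

# Start MATLAB without the GUI and run the submit script; MATLAB will prompt
# for your ANDY user name and password when the submit() command executes
matlab -nodisplay -nosplash -r "gpu_fft_submit; exit"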

The approach described above can be extended to include user-defined MATLAB GPU kernel functions built from user-authored CUDA C or CUDA Fortran source files. The GPU job submission script and GPU function file presented above are modified slightly, after creating a CUDA PTX file from a basic CUDA source kernel (".cu" file), to support this second approach.

Here is the simple CUDA C kernel ("addv.cu") that will be executed via MATLAB's remote batch GPU-job submission capability:

__global__ void addv( double * v1, const double * v2 )
{
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}

This is a simple kernel that does an element-wise floating-point add of two one-dimensional vectors (one thread per element within a single thread block) passed in as arguments on the GPU. It must be prepared for execution on the GPU by prior compilation with the NVIDIA compiler "nvcc" as follows:

nvcc -ptx  addv.cu

This creates, in addition to the CUDA source file "addv.cu", a PTX assembly file ("addv.ptx") ready to be transferred and just-in-time compiled prior to execution on the GPU. Both the CUDA source and PTX assembly files are text files that must be transferred from the local client (KARLE) to the remote cluster prior to execution (or already be in place on ANDY). Note that KARLE does not have attached GPU devices and therefore does not have CUDA installed on it. The CUDA source file must be compiled with "nvcc" on ANDY.
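A minimal sketch of preparing the PTX file on ANDY and copying it back to the local MATLAB working directory on KARLE follows; the 'cuda' module name and the paths used are assumptions, so check 'module avail' on ANDY and substitute your own directories:

# On ANDY, where nvcc is installed
module load cuda
nvcc -ptx addv.cu                 # produces addv.ptx alongside addv.cu

# Back on KARLE, pull both text files into the local MATLAB working directory
scp andy.csi.cuny.edu:addv.cu  ~/matlab/
scp andy.csi.cuny.edu:addv.ptx ~/matlab/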

Lines for the modified GPU-ready remote job submit script are presented here:

%
% Initialize arguments to MatLab SubmitFcn for an independent parallel
% GPU job (1 task) to be submitted to the PBS production_mtlg queue. --rbw
%

% Define destination cluster for remote GPU job
clusterHost = 'andy.csi.cuny.edu';

% Define the remote working directory on server-remote (cluster)
remoteJobStorageLocation = '/home/richard.walsh/matlab_remote';

% Create cluster object and define client-local working directory
% for the client JobStorageLocation
andy = parallel.cluster.Generic( 'JobStorageLocation', '/home/richard.walsh/matlab' );

% Define other cluster object job submission variables
set(andy, 'ClusterMatlabRoot', '/share/apps/matlab/default');
set(andy, 'HasSharedFilesystem', false);
set(andy, 'OperatingSystem', 'unix');
set(andy, 'GetJobStateFcn', @getJobStateFcn);
set(andy, 'DeleteJobFcn', @deleteJobFcn);

% Name and associate a job submission function (script) with the cluster object
% To run an independent parallel job, you must specify the IndependentSubmitFcn
set(andy, 'IndependentSubmitFcn', {@independentSubmitFcn_ProdG, clusterHost, remoteJobStorageLocation});

% Create the independent parallel GPU job
igpujob = createJob(andy);

% Create dependencies on the CUDA source, PTX, and MATLAB wrapper function files
% to get them transferred to the cluster (list them in a single call, since
% repeated set() calls would overwrite the 'AttachedFiles' property)
set(igpujob, 'AttachedFiles', {'addv.cu' 'addv.ptx' 'gaddv.m'});

% Create the individual sub-tasks (1) in the independent parallel GPU job
igputask = createTask(igpujob,@gaddv,1,{});

% Submit the independent parallel GPU job (1 task)
submit(igpujob);

% Wait until the independent GPU job registers a finished state on the 
% submitting client
wait(igpujob, 'finished');

% Collect and print results
results = fetchOutputs(igpujob)

Focusing on the few lines of difference, in the file dependency section three files are named as dependencies needing to be transferred: the two files associated with the "addv" CUDA source, and the MATLAB function file "gaddv.m" that will set up and run our CUDA kernel. Our example kernel is simple, but in principle any CUDA kernel can be used in a similar fashion.

Here is the "gaddv" MATLAB function file that link the user-authored CUDA function with MATLAB's GPU kernel object and then runs the kernel:

function hvect = gaddv

kaddv = parallel.gpu.CUDAKernel('addv.ptx', 'addv.cu', 'addv');

bsize = 128;
kaddv.ThreadBlockSize = bsize;

gvect = feval(kaddv, ones(bsize, 1), ones(bsize, 1));

hvect = gather(gvect);

disp(hvect);

The procedure for creating and integrating user-authored CUDA kernel functions is described in more detail on the MATLAB website here [45]. Looking at this MATLAB GPU kernel function file "gaddv" in more detail, we see that it will produce a single array (vector) of output on the GPU and return it to the host CPU in the array variable "hvect".

The most important line in the file is perhaps this one:

kaddv = parallel.gpu.CUDAKernel('addv.ptx', 'addv.cu', 'addv');

This establishes the user-authored function as a callable MATLAB CUDA kernel object named "kaddv" that has a number of properties associated with it. These are discussed in more detail in the web reference above, but the arguments name the transferred files and, lastly, provide the string to search for to locate the function's entry point in the PTX file.

Next, the CUDA kernel block size is set for the new CUDA function at 128 threads. This is attached to the MATLAB CUDA kernel object property "kaddv.ThreadBlockSize." On the next line, the function "kaddv" is finally evaluated (called) with the MATLAB "feval" command. The arguments name the function and offer two arrays (vectors) of numbers of length 128 as arguments, something the original CUDA source expects. Once evaluated, the result on the GPU device is "gathered" over to the host CPU and placed into the "hvect" array where it is displayed.

This basic example can be used as a guide to get any user-authored CUDA source function to run on ANDY's GPUs. MATLAB offers an additional approach that would allow the "gaddv" function to be placed inside the driving script rather than transferred as a separate file. To read about this approach, look here [46].

Other Examples of MATLAB CPU Parallel Job Submission

CUNY's HPC group has successfully submitted both MATLAB Independent Parallel and Communicating Parallel jobs from KARLE to both BOB and ANDY, both from the MATLAB GUI (using configuration files for BOB and ANDY) and from the CLI (without configuration files). Below are basic example MATLAB scripts that have been used successfully to submit Independent Parallel and Communicating Parallel work to ANDY from our KARLE Linux client. These scripts could be modified to run on BOB by changing all occurrences of the string 'andy' to 'bob'.

The example MATLAB script for Independent Parallel job submission is listed here:

%
% Initialize arguments to MatLab SubmitFcn for an independent parallel
% job (4 tasks) to be submitted to the PBS production queue. --rbw
%

% Define the destination cluster for the remote job
clusterHost = 'andy.csi.cuny.edu';

% Define the remote working directory on remote server (cluster)
remoteJobStorageLocation = '/home/richard.walsh/matlab_remote';

% Create cluster object and use a local 'matlab' directory 
% for the JobStorageLocation
andy = parallel.cluster.Generic( 'JobStorageLocation', '/home/richard.walsh/matlab' );

% Define other cluster object job submission variables
set(andy, 'ClusterMatlabRoot', '/share/apps/matlab/default');
set(andy, 'HasSharedFilesystem', false);
set(andy, 'OperatingSystem', 'unix');
set(andy, 'GetJobStateFcn', @getJobStateFcn);
set(andy, 'DeleteJobFcn', @deleteJobFcn);

% Name and associate a job submission function (script) with the cluster object
% To run an independent parallel job, you must specify the IndependentSubmitFcn
set(andy, 'IndependentSubmitFcn', {@independentSubmitFcn_Prod, clusterHost, remoteJobStorageLocation});

% Create the independent parallel job
ijob = createJob(andy);

% Create the individual sub-tasks (4) in the independent parallel job
itask = createTask(ijob,@rand,1);
itask = createTask(ijob,@rand,1);
itask = createTask(ijob,@rand,1);
itask = createTask(ijob,@rand,1);

% Submit the independent parallel job (4 tasks)
submit(ijob);

% Wait until the independent job registers a finished state on the 
% submitting client
wait(ijob, 'finished');

% Collect and print results
results = fetchOutputs(ijob)

References to function files in the default remote MATLAB working directory on BOB or ANDY are preceded by the '@' sign, and those files are presumed to have been made available there (/share/apps/matlab/default/toolbox/local). Files placed in other locations may be referenced with the full remote file system path or through the MATLAB addpath command. The last two commands in the script wait for the 4 job tasks to achieve a 'finished' state on KARLE, the submitting client (wait()), and grab the results for display on the client (fetchOutputs(ijob)). The runtime functions needed above (and below) can be obtained for further customization from their distribution location in:

${MATLAB_ROOT}/toolbox/local/pbs/unix
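As a sketch only (the exact file names here are inferred from the function handles used above, and the machine on which you customize them may vary), the production submit functions could be copied into a personal MATLAB directory for editing as follows:

# MATLAB_ROOT is assumed to point at the MATLAB installation root,
# e.g. /share/apps/matlab/default
export MATLAB_ROOT=/share/apps/matlab/default
cp ${MATLAB_ROOT}/toolbox/local/pbs/unix/independentSubmitFcn_Prod.m    ~/matlab/
cp ${MATLAB_ROOT}/toolbox/local/pbs/unix/communicatingSubmitFcn_Prod.m  ~/matlab/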

Further information can be found on submitting MATLAB Independent Parallel jobs at the MATLAB website here: [47].

The MATLAB script for Communicating Parallel job submission is listed here. The function 'colsum.m' must be provided in MATLAB's local submitting client working directory on KARLE:

%
% Initialize arguments to MatLab SubmitFcn for a communicating
% parallel job to be submitted to the PBS production queue. --rbw
%

% Define destination cluster for the remote job
clusterHost = 'andy.csi.cuny.edu';

% Define the remote working directory on remote server (cluster)
remoteJobStorageLocation = '/home/richard.walsh/matlab_remote';

% Create cluster object and use a local 'matlab' directory 
% for the JobStorageLocation
andy = parallel.cluster.Generic( 'JobStorageLocation', '/home/richard.walsh/matlab' );

% Define other cluster object job submission variables
set(andy, 'ClusterMatlabRoot', '/share/apps/matlab/default');
set(andy, 'HasSharedFilesystem', false);
set(andy, 'OperatingSystem', 'unix');
set(andy, 'GetJobStateFcn', @getJobStateFcn);
set(andy, 'DeleteJobFcn', @deleteJobFcn);

% Name and associate a job submission function (script) with cluster object
% To run a communicating parallel job, you must specify a CommunicatingSubmitFcn
set(andy, 'CommunicatingSubmitFcn', {@communicatingSubmitFcn_Prod, clusterHost, remoteJobStorageLocation});

% Create a communicating parallel job
cjob = createCommunicatingJob(andy);

% Define the maximum and minimum number of processes (4) for this job
set(cjob, 'NumWorkersRange', [4,4]);

% Create a dependency on the parallel function colsum.m to get it transferred to cluster
set(cjob, 'AttachedFiles', {'colsum.m'});

% Create the individual sub-task(s) (1) in the communicating parallel job
ctask = createTask(cjob,@colsum,1,{});

% Submit the communicating parallel job
submit(cjob);

% Wait until the communicating job registers a finished state on the
% submitting client
wait(cjob, 'finished');

% Collect and print results
results = fetchOutputs(cjob)

Parallel function 'colsum':

function total_sum = colsum
if labindex == 1
    % Send magic square to other labs
    A = labBroadcast(1,magic(numlabs))
else
    % Receive broadcast on other labs
    A = labBroadcast(1)
end

% Calculate sum of column identified by labindex for this lab
column_sum = sum(A(:,labindex))

% Calculate total sum by combining column sum from all labs
total_sum = gplus(column_sum)

It is important to point out that any user-authored code residing on the local client (KARLE) and used in your MATLAB script will need to be copied over to the remote directory on the cluster (ANDY or BOB) and made available in the MATLAB path. The setting of the 'AttachedFiles' property above in the parallel job script illustrates how to accomplish this automatically as part of the job submission process. In this example, the job is dependent on the user-supplied function 'colsum.m' that is local to the client. The line:

set(cjob, 'AttachedFiles', {'colsum.m'});

accomplishes the file transfer automatically during the execution of the script. Because 'colsum.m' is written as a MATLAB script, it can be transferred as text. However, user-defined functions that need to be compiled (typically ending in the suffix '.mex') must be compiled in the environment in which they will be used. This may mean that users will need to compile these supporting application files on the destination machine (the head node of the cluster, ANDY or BOB, in our case) and provide the compiled result in the remote working directory defined in their submit script.
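As a minimal sketch (the source file name 'mysolver.c' is hypothetical, and the path to the 'mex' tool is assumed from the ClusterMatlabRoot setting used above), compiling such a function on the ANDY head node and leaving the result in the remote working directory might look like:

# On ANDY's head node
cd /home/richard.walsh/matlab_remote
/share/apps/matlab/default/bin/mex mysolver.c    # produces mysolver.mexa64 on 64-bit Linux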

Further information on file dependencies can be found in the MATLAB User's Guide [48].

Much of the scripting described above, once tested and working, can be captured in the MATLAB GUI as a configuration that is then applied as boilerplate from the drop-down job submission menu.

Further information can be found on submitting MATLAB distributed parallel jobs at the MATLAB website here [49].

An Outline of the Major Steps Involved in Remote Job Submission

A successful MATLAB job submission ( submit(cjob) ) to ANDY or BOB from KARLE driven by the commands above completes the following steps:

1. The creation of a client-local job directory 'JobXX' in the current local
     MATLAB working directory on KARLE, where XX is the MATLAB
     job number.

2. The transfer of the contents of the local 'JobXX' directory via 'ssh'
     from the client to ANDY's head node (server) for execution to a mirror
     remote working directory on ANDY (matlab_remote).  Running 'get(cjob)'
     from MATLAB on KARLE will show a state of 'queued' or 'pending' for the
     job at this point.

3. The assignment of compute nodes to the job by PBS Pro and the queuing of
     the job for execution.  The job will then become visible as queued when 'qstat'
     is run on ANDY.  This command is the PBS job monitoring utility.  A production
     job can remain in the Q-state for a while depending on both MATLAB
     and general user activity on ANDY.  Users should be familiar with logging into
     ANDY and BOB to track PBS batch jobs there with 'qstat'.  See the PBS section
     elsewhere in this document.

4. The start of a MATLAB job on the cluster compute node(s).  The job will now
     be listed as running in the 'qstat' output.

5.  Job completion on ANDY is indicated by a 'finished' state listed in the 'Job.state.mat' 
     file.  The 'qstat' command run on ANDY will now show the PBS job has completed.

6.  Job files are transferred back to the client-local directory on KARLE marking the
    'Job.state.mat' on KARLE (the client) also as 'finished'.  At this point, running the
    'get(job)' command from the client GUI or CLI will show a job state of 'finished'.

7. Job results will be available to MATLAB via the 'results = fetchOutputs(cjob);'
     command upon successful end-to-end completion in MATLAB.

There will be slight differences in this protocol between Independent Parallel and Communicating Parallel jobs, with Independent Parallel jobs showing N separate queued jobs with N job IDs, one for each MATLAB 'worker' task running on its own compute node, and Communicating Parallel jobs showing a single queued job for all the MATLAB 'labs' running in concert on N compute nodes.
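To track the corresponding PBS job(s) during steps 3 through 5 above, log into the cluster and use 'qstat'; for example:

# On ANDY (or BOB), list your own jobs, then query one of them in detail
qstat -u $USER
qstat -f <job_id>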

Migrate

Migrate estimates population parameters, effective population sizes, and migration rates of n populations using genetic data. It uses a coalescent theory approach, taking into account the history of mutations and the uncertainty of the genealogy. The estimates of the parameter values are achieved by either a Maximum Likelihood (ML) approach or Bayesian Inference (BI). Migrate's output is presented in a text file and in a PDF file. The PDF file will eventually contain all possible analyses, including histograms of posterior distributions. Currently only the main tables (ML + BI), profile likelihood tables (ML), percentiles tables (ML), and posterior histograms (BI) are supported in the PDF. For more detail on Migrate, please visit the Migrate web site here [50], the manual here [51], and the introductory README files in /share/apps/migrate/default/docs.

The current default version of Migrate installed at the CUNY HPC Center on BOB is version 3.4.1. This version can be run in serial mode, in threaded parallel mode, or in MPI parallel mode. In the directory '/share/apps/migrate/example' you can find some example data sets. We demonstrate the execution of the 'parmfile.testml' example using a PBS batch script suitable to each mode of execution. Two input files from the above directory are required ('infile.msat' and 'parmfile.testml') to complete this simulation, and the '-nomenu' command-line option is required when running in batch to suppress the normal interactive command-line prompt.
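For example, the two input files can be staged from the example directory into a working directory (the directory name used here is illustrative) before submitting the job:

# On BOB, copy the example inputs into a working directory
mkdir -p ~/migrate_test
cd ~/migrate_test
cp /share/apps/migrate/example/infile.msat     .
cp /share/apps/migrate/example/parmfile.testml .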

A PBS Pro batch script must be created to run your job. The first script shown initiates a MIGRATE MPI parallel run. It requests 8 processors to complete its work.

#!/bin/bash
#PBS -q production
#PBS -N MIGRATE_mpi
#PBS -l select=8:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

echo ">>>> Begin Migrate MPI Parallel Run ..."
echo ""
mpirun -np 8 -machinefile $PBS_NODEFILE /share/apps/migrate/default/bin/migrate-n-mpi ./parmfile.testml -nomenu
echo ""
echo ">>>> End   Migrate MPI Parallel Run ..."

This script can be dropped into a file (say 'migrate_mpi.job') on BOB, and run with:

qsub migrate_mpi.job

It should take less than 10 minutes to run and will produce PBS output and error files beginning with the job name 'MIGRATE_mpi', as well as output files specific to MIGRATE. Details on the meaning of the PBS script are covered in the PBS section of this Wiki. The most important lines are '#PBS -l select=8:ncpus=1:mem=1920mb' and '#PBS -l place=free'. The first instructs PBS to select 8 resource 'chunks', each with 1 processor (core) and 1,920 MBs of memory in it, for the job. The second instructs PBS to place this job wherever the least used resources are to be found (freely). The PBS master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command. As this is a parallel job, other compute nodes will potentially be used as well. See the PBS section for details.

The CUNY HPC Center also provides a serial version of MIGRATE. A PBS batch script for running the serial version of MIGRATE (migrate_serial.job) follows:

#!/bin/bash
#PBS -q production
#PBS -N MIGRATE_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

echo ">>>> Begin Migrate Serial Run ..."
echo ""
/share/apps/migrate/default/bin/migrate-n-serial ./parmfile.testml -nomenu
echo ""
echo ">>>> End   Migrate Serial Run ..."

The only changes appear in the new name for the job (MIGRATE_serial), the '-l select' line, which requests only 1 resource 'chunk' instead of 8, and in the name of the MIGRATE executable used, which is now 'migrate-n-serial' instead of 'migrate-n-mpi'.

The threaded version of the PBS script can be created by making similar substitutions and a change to 'pack' rather than 'free' placement (again, see the PBS section for details):

3,5c3,5
< #PBS -N MIGRATE_serial
< #PBS -l select=1:ncpus=1:mem=1920mb
< #PBS -l place=free
---
> #PBS -N MIGRATE_threads
> #PBS -l select=1:ncpus=8:mem=15360mb
> #PBS -l place=pack
15c15
< echo ">>>> Begin Migrate Serial Run ..."
---
> echo ">>>> Begin Migrate Pthreads Parallel Run ..."
17c17
< /share/apps/migrate/default/bin/migrate-n-serial ./parmfile.testml -nomenu
---
> /share/apps/migrate/default/bin/migrate-n-threads ./parmfile.testml -nomenu
19c19
< echo ">>>> End   Migrate Serial Run ..."
---
> echo ">>>> End   Migrate Pthreads Parallel Run ..."

NOTE: HPC Center staff has noticed that the performance of MIGRATE on the 'parmfile.testml' test case used here appears to be slow relative to MIGRATE web site benchmark performance data. The threaded version of the code seems particularly slow. We are investigating this to see if it is a real issue that needs correction or is related to an important difference in the input files (11-4-11). You may wish to inquire about the state of this issue with HPC Center staff before running your MIGRATE jobs.

MPFR

The MPFR library is a C library for multiple-precision floating-point computations with correct rounding. MPFR has been continuously supported by INRIA, and the current main authors come from the Caramel and AriC project-teams at Loria (Nancy, France) and LIP (Lyon, France) respectively; see more on the credit page. MPFR is based on the GMP multiple-precision library. The main goal of MPFR is to provide a library for multiple-precision floating-point computation which is both efficient and has well-defined semantics. It copies the good ideas from the ANSI/IEEE-754 standard for double-precision floating-point arithmetic (53-bit significand). The library is installed on PENZIAS.
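As a minimal sketch (the install prefix '/share/apps/mpfr/default' is an assumption patterned after the other packages described on this page; check with HPC Center staff or 'module avail' on PENZIAS for the actual location), a C program using MPFR would be compiled and linked on PENZIAS roughly as follows:

# MPFR depends on GMP, so both libraries are linked
gcc -I/share/apps/mpfr/default/include -L/share/apps/mpfr/default/lib \
    -o mpfr_test mpfr_test.c -lmpfr -lgmp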

MRBAYES

MrBayes is a program for the Bayesian estimation of phylogeny. Bayesian inference of phylogeny is based upon a quantity called the posterior probability distribution of trees, which is the probability of a tree conditioned on certain observations. The conditioning is accomplished using Bayes's theorem. The posterior probability distribution of trees is impossible to calculate analytically; instead, MrBayes uses a simulation technique called Markov chain Monte Carlo (or MCMC) to approximate the posterior probabilities of trees.

MrBayes version 3.2.1 is installed on ANDY. In order to set up the environment required to run MrBayes, the corresponding module needs to be loaded first. This is done with:

module load mrbayes

Running MrBayes is a two-step process that first requires the creation of the NEXUS-formatted MrBayes input file and then the PBS Pro script to run it. MrBayes can be run in serial, MPI-parallel, or GPU-accelerated mode.

Here is a NEXUS input file (primates.nex) which includes both a DATA block and a MRBAYES block. The MRBAYES block simply contains the MrBayes runtime commands terminated with a semi-colon. The example below shows 12 mitochondrial DNA sequences of primates and yields at least 1,000 samples from the posterior probability distribution. If you need more detail on generating the NEXUS file or on MrBayes in general, please check the MrBayes Wiki here [52] and the on-line manual.

#NEXUS

begin data;
dimensions ntax=12 nchar=898;
format datatype=dna interleave=no gap=-;
matrix
Tarsius_syrichta	AAGTTTCATTGGAGCCACCACTCTTATAATTGCCCATGGCCTCACCTCCTCCCTATTATTTTGCCTAGCAAATACAAACTACGAACGAGTCCACAGTCGAACAATAGCACTAGCCCGTGGCCTTCAAACCCTATTACCTCTTGCAGCAACATGATGACTCCTCGCCAGCTTAACCAACCTGGCCCTTCCCCCAACAATTAATTTAATCGGTGAACTGTCCGTAATAATAGCAGCATTTTCATGGTCACACCTAACTATTATCTTAGTAGGCCTTAACACCCTTATCACCGCCCTATATTCCCTATATATACTAATCATAACTCAACGAGGAAAATACACATATCATATCAACAATATCATGCCCCCTTTCACCCGAGAAAATACATTAATAATCATACACCTATTTCCCTTAATCCTACTATCTACCAACCCCAAAGTAATTATAGGAACCATGTACTGTAAATATAGTTTAAACAAAACATTAGATTGTGAGTCTAATAATAGAAGCCCAAAGATTTCTTATTTACCAAGAAAGTA-TGCAAGAACTGCTAACTCATGCCTCCATATATAACAATGTGGCTTTCTT-ACTTTTAAAGGATAGAAGTAATCCATCGGTCTTAGGAACCGAAAA-ATTGGTGCAACTCCAAATAAAAGTAATAAATTTATTTTCATCCTCCATTTTACTATCACTTACACTCTTAATTACCCCATTTATTATTACAACAACTAAAAAATATGAAACACATGCATACCCTTACTACGTAAAAAACTCTATCGCCTGCGCATTTATAACAAGCCTAGTCCCAATGCTCATATTTCTATACACAAATCAAGAAATAATCATTTCCAACTGACATTGAATAACGATTCATACTATCAAATTATGCCTAAGCTT
Lemur_catta		AAGCTTCATAGGAGCAACCATTCTAATAATCGCACATGGCCTTACATCATCCATATTATTCTGTCTAGCCAACTCTAACTACGAACGAATCCATAGCCGTACAATACTACTAGCACGAGGGATCCAAACCATTCTCCCTCTTATAGCCACCTGATGACTACTCGCCAGCCTAACTAACCTAGCCCTACCCACCTCTATCAATTTAATTGGCGAACTATTCGTCACTATAGCATCCTTCTCATGATCAAACATTACAATTATCTTAATAGGCTTAAATATGCTCATCACCGCTCTCTATTCCCTCTATATATTAACTACTACACAACGAGGAAAACTCACATATCATTCGCACAACCTAAACCCATCCTTTACACGAGAAAACACCCTTATATCCATACACATACTCCCCCTTCTCCTATTTACCTTAAACCCCAAAATTATTCTAGGACCCACGTACTGTAAATATAGTTTAAA-AAAACACTAGATTGTGAATCCAGAAATAGAAGCTCAAAC-CTTCTTATTTACCGAGAAAGTAATGTATGAACTGCTAACTCTGCACTCCGTATATAAAAATACGGCTATCTCAACTTTTAAAGGATAGAAGTAATCCATTGGCCTTAGGAGCCAAAAA-ATTGGTGCAACTCCAAATAAAAGTAATAAATCTATTATCCTCTTTCACCCTTGTCACACTGATTATCCTAACTTTACCTATCATTATAAACGTTACAAACATATACAAAAACTACCCCTATGCACCATACGTAAAATCTTCTATTGCATGTGCCTTCATCACTAGCCTCATCCCAACTATATTATTTATCTCCTCAGGACAAGAAACAATCATTTCCAACTGACATTGAATAACAATCCAAACCCTAAAACTATCTATTAGCTT
Homo_sapiens		AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCTCATTACTATTCTGCCTAGCAAACTCAAACTACGAACGCACTCACAGTCGCATCATAATCCTCTCTCAAGGACTTCAAACTCTACTCCCACTAATAGCTTTTTGATGACTTCTAGCAAGCCTCGCTAACCTCGCCTTACCCCCCACTATTAACCTACTGGGAGAACTCTCTGTGCTAGTAACCACGTTCTCCTGATCAAATATCACTCTCCTACTTACAGGACTCAACATACTAGTCACAGCCCTATACTCCCTCTACATATTTACCACAACACAATGGGGCTCACTCACCCACCACATTAACAACATAAAACCCTCATTCACACGAGAAAACACCCTCATGTTCATACACCTATCCCCCATTCTCCTCCTATCCCTCAACCCCGACATCATTACCGGGTTTTCCTCTTGTAAATATAGTTTAACCAAAACATCAGATTGTGAATCTGACAACAGAGGCTTA-CGACCCCTTATTTACCGAGAAAGCT-CACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAATAACCATGCACACTACTATAACCACCCTAACCCTGACTTCCCTAATTCCCCCCATCCTTACCACCCTCGTTAACCCTAACAAAAAAAACTCATACCCCCATTATGTAAAATCCATTGTCGCATCCACCTTTATTATCAGTCTCTTCCCCACAACAATATTCATGTGCCTAGACCAAGAAGTTATTATCTCGAACTGACACTGAGCCACAACCCAAACAACCCAGCTCTCCCTAAGCTT
Pan	  		AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCTCATTATTATTCTGCCTAGCAAACTCAAATTATGAACGCACCCACAGTCGCATCATAATTCTCTCCCAAGGACTTCAAACTCTACTCCCACTAATAGCCTTTTGATGACTCCTAGCAAGCCTCGCTAACCTCGCCCTACCCCCTACCATTAATCTCCTAGGGGAACTCTCCGTGCTAGTAACCTCATTCTCCTGATCAAATACCACTCTCCTACTCACAGGATTCAACATACTAATCACAGCCCTGTACTCCCTCTACATGTTTACCACAACACAATGAGGCTCACTCACCCACCACATTAATAACATAAAGCCCTCATTCACACGAGAAAATACTCTCATATTTTTACACCTATCCCCCATCCTCCTTCTATCCCTCAATCCTGATATCATCACTGGATTCACCTCCTGTAAATATAGTTTAACCAAAACATCAGATTGTGAATCTGACAACAGAGGCTCA-CGACCCCTTATTTACCGAGAAAGCT-TATAAGAACTGCTAATTCATATCCCCATGCCTGACAACATGGCTTTCTCAACTTTTAAAGGATAACAGCCATCCGTTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAATAACCATGTATACTACCATAACCACCTTAACCCTAACTCCCTTAATTCTCCCCATCCTCACCACCCTCATTAACCCTAACAAAAAAAACTCATATCCCCATTATGTGAAATCCATTATCGCGTCCACCTTTATCATTAGCCTTTTCCCCACAACAATATTCATATGCCTAGACCAAGAAGCTATTATCTCAAACTGGCACTGAGCAACAACCCAAACAACCCAGCTCTCCCTAAGCTT
Gorilla   		AAGCTTCACCGGCGCAGTTGTTCTTATAATTGCCCACGGACTTACATCATCATTATTATTCTGCCTAGCAAACTCAAACTACGAACGAACCCACAGCCGCATCATAATTCTCTCTCAAGGACTCCAAACCCTACTCCCACTAATAGCCCTTTGATGACTTCTGGCAAGCCTCGCCAACCTCGCCTTACCCCCCACCATTAACCTACTAGGAGAGCTCTCCGTACTAGTAACCACATTCTCCTGATCAAACACCACCCTTTTACTTACAGGATCTAACATACTAATTACAGCCCTGTACTCCCTTTATATATTTACCACAACACAATGAGGCCCACTCACACACCACATCACCAACATAAAACCCTCATTTACACGAGAAAACATCCTCATATTCATGCACCTATCCCCCATCCTCCTCCTATCCCTCAACCCCGATATTATCACCGGGTTCACCTCCTGTAAATATAGTTTAACCAAAACATCAGATTGTGAATCTGATAACAGAGGCTCA-CAACCCCTTATTTACCGAGAAAGCT-CGTAAGAGCTGCTAACTCATACCCCCGTGCTTGACAACATGGCTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGACCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAATAACTATGTACGCTACCATAACCACCTTAGCCCTAACTTCCTTAATTCCCCCTATCCTTACCACCTTCATCAATCCTAACAAAAAAAGCTCATACCCCCATTACGTAAAATCTATCGTCGCATCCACCTTTATCATCAGCCTCTTCCCCACAACAATATTTCTATGCCTAGACCAAGAAGCTATTATCTCAAGCTGACACTGAGCAACAACCCAAACAATTCAACTCTCCCTAAGCTT
Pongo     		AAGCTTCACCGGCGCAACCACCCTCATGATTGCCCATGGACTCACATCCTCCCTACTGTTCTGCCTAGCAAACTCAAACTACGAACGAACCCACAGCCGCATCATAATCCTCTCTCAAGGCCTTCAAACTCTACTCCCCCTAATAGCCCTCTGATGACTTCTAGCAAGCCTCACTAACCTTGCCCTACCACCCACCATCAACCTTCTAGGAGAACTCTCCGTACTAATAGCCATATTCTCTTGATCTAACATCACCATCCTACTAACAGGACTCAACATACTAATCACAACCCTATACTCTCTCTATATATTCACCACAACACAACGAGGTACACCCACACACCACATCAACAACATAAAACCTTCTTTCACACGCGAAAATACCCTCATGCTCATACACCTATCCCCCATCCTCCTCTTATCCCTCAACCCCAGCATCATCGCTGGGTTCGCCTACTGTAAATATAGTTTAACCAAAACATTAGATTGTGAATCTAATAATAGGGCCCCA-CAACCCCTTATTTACCGAGAAAGCT-CACAAGAACTGCTAACTCTCACT-CCATGTGTGACAACATGGCTTTCTCAGCTTTTAAAGGATAACAGCTATCCCTTGGTCTTAGGATCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAACAGCCATGTTTACCACCATAACTGCCCTCACCTTAACTTCCCTAATCCCCCCCATTACCGCTACCCTCATTAACCCCAACAAAAAAAACCCATACCCCCACTATGTAAAAACGGCCATCGCATCCGCCTTTACTATCAGCCTTATCCCAACAACAATATTTATCTGCCTAGGACAAGAAACCATCGTCACAAACTGATGCTGAACAACCACCCAGACACTACAACTCTCACTAAGCTT
Hylobates 		AAGCTTTACAGGTGCAACCGTCCTCATAATCGCCCACGGACTAACCTCTTCCCTGCTATTCTGCCTTGCAAACTCAAACTACGAACGAACTCACAGCCGCATCATAATCCTATCTCGAGGGCTCCAAGCCTTACTCCCACTGATAGCCTTCTGATGACTCGCAGCAAGCCTCGCTAACCTCGCCCTACCCCCCACTATTAACCTCCTAGGTGAACTCTTCGTACTAATGGCCTCCTTCTCCTGGGCAAACACTACTATTACACTCACCGGGCTCAACGTACTAATCACGGCCCTATACTCCCTTTACATATTTATCATAACACAACGAGGCACACTTACACACCACATTAAAAACATAAAACCCTCACTCACACGAGAAAACATATTAATACTTATGCACCTCTTCCCCCTCCTCCTCCTAACCCTCAACCCTAACATCATTACTGGCTTTACTCCCTGTAAACATAGTTTAATCAAAACATTAGATTGTGAATCTAACAATAGAGGCTCG-AAACCTCTTGCTTACCGAGAAAGCC-CACAAGAACTGCTAACTCACTATCCCATGTATGACAACATGGCTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGACCCAAAAATTTTGGTGCAACTCCAAATAAAAGTAATAGCAATGTACACCACCATAGCCATTCTAACGCTAACCTCCCTAATTCCCCCCATTACAGCCACCCTTATTAACCCCAATAAAAAGAACTTATACCCGCACTACGTAAAAATGACCATTGCCTCTACCTTTATAATCAGCCTATTTCCCACAATAATATTCATGTGCACAGACCAAGAAACCATTATTTCAAACTGACACTGAACTGCAACCCAAACGCTAGAACTCTCCCTAAGCTT
Macaca_fuscata		AAGCTTTTCCGGCGCAACCATCCTTATGATCGCTCACGGACTCACCTCTTCCATATATTTCTGCCTAGCCAATTCAAACTATGAACGCACTCACAACCGTACCATACTACTGTCCCGAGGACTTCAAATCCTACTTCCACTAACAGCCTTTTGATGATTAACAGCAAGCCTTACTAACCTTGCCCTACCCCCCACTATCAATCTACTAGGTGAACTCTTTGTAATCGCAACCTCATTCTCCTGATCCCATATCACCATTATGCTAACAGGACTTAACATATTAATTACGGCCCTCTACTCTCTCCACATATTCACTACAACACAACGAGGAACACTCACACATCACATAATCAACATAAAGCCCCCCTTCACACGAGAAAACACATTAATATTCATACACCTCGCTCCAATTATCCTTCTATCCCTCAACCCCAACATCATCCTGGGGTTTACCTCCTGTAGATATAGTTTAACTAAAACACTAGATTGTGAATCTAACCATAGAGACTCA-CCACCTCTTATTTACCGAGAAAACT-CGCAAGGACTGCTAACCCATGTACCCGTACCTAAAATTACGGTTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGACCTTAGGAGTCAAAAACATTGGTGCAACTCCAAATAAAAGTAATAATCATGCACACCCCCATCATTATAACAACCCTTATCTCCCTAACTCTCCCAATTTTTGCCACCCTCATCAACCCTTACAAAAAACGTCCATACCCAGATTACGTAAAAACAACCGTAATATATGCTTTCATCATCAGCCTCCCCTCAACAACTTTATTCATCTTCTCAAACCAAGAAACAACCATTTGGAGCTGACATTGAATAATGACCCAAACACTAGACCTAACGCTAAGCTT
M_mulatta		AAGCTTTTCTGGCGCAACCATCCTCATGATTGCTCACGGACTCACCTCTTCCATATATTTCTGCCTAGCCAATTCAAACTATGAACGCACTCACAACCGTACCATACTACTGTCCCGGGGACTTCAAATCCTACTTCCACTAACAGCTTTCTGATGATTAACAGCAAGCCTTACTAACCTTGCCCTACCCCCCACTATCAACCTACTAGGTGAACTCTTTGTAATCGCGACCTCATTCTCCTGGTCCCATATCACCATTATATTAACAGGATTTAACATACTAATTACGGCCCTCTACTCCCTCCACATATTCACCACAACACAACGAGGAGCACTCACACATCACATAATCAACATAAAACCCCCCTTCACACGAGAAAACATATTAATATTCATACACCTCGCTCCAATCATCCTCCTATCTCTCAACCCCAACATCATCCTGGGGTTTACTTCCTGTAGATATAGTTTAACTAAAACATTAGATTGTGAATCTAACCATAGAGACTTA-CCACCTCTTATTTACCGAGAAAACT-CGCGAGGACTGCTAACCCATGTATCCGTACCTAAAATTACGGTTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGACCTTAGGAGTCAAAAATATTGGTGCAACTCCAAATAAAAGTAATAATCATGCACACCCCTATCATAATAACAACCCTTATCTCCCTAACTCTCCCAATTTTTGCCACCCTCATCAACCCTTACAAAAAACGTCCATACCCAGATTACGTAAAAACAACCGTAATATATGCTTTCATCATCAGCCTCCCCTCAACAACTTTATTCATCTTCTCAAACCAAGAAACAACCATTTGAAGCTGACATTGAATAATAACCCAAACACTAGACCTAACACTAAGCTT
M_fascicularis		AAGCTTCTCCGGCGCAACCACCCTTATAATCGCCCACGGGCTCACCTCTTCCATGTATTTCTGCTTGGCCAATTCAAACTATGAGCGCACTCATAACCGTACCATACTACTATCCCGAGGACTTCAAATTCTACTTCCATTGACAGCCTTCTGATGACTCACAGCAAGCCTTACTAACCTTGCCCTACCCCCCACTATTAATCTACTAGGCGAACTCTTTGTAATCACAACTTCATTTTCCTGATCCCATATCACCATTGTGTTAACGGGCCTTAATATACTAATCACAGCCCTCTACTCTCTCCACATGTTCATTACAGTACAACGAGGAACACTCACACACCACATAATCAATATAAAACCCCCCTTCACACGAGAAAACATATTAATATTCATACACCTCGCTCCAATTATCCTTCTATCTCTCAACCCCAACATCATCCTGGGGTTTACCTCCTGTAAATATAGTTTAACTAAAACATTAGATTGTGAATCTAACTATAGAGGCCTA-CCACTTCTTATTTACCGAGAAAACT-CGCAAGGACTGCTAATCCATGCCTCCGTACTTAAAACTACGGTTTCCTCAACTTTTAAAGGATAACAGCTATCCATTGACCTTAGGAGTCAAAAACATTGGTGCAACTCCAAATAAAAGTAATAATCATGCACACCCCCATCATAATAACAACCCTCATCTCCCTGACCCTTCCAATTTTTGCCACCCTCACCAACCCCTATAAAAAACGTTCATACCCAGACTACGTAAAAACAACCGTAATATATGCTTTTATTACCAGTCTCCCCTCAACAACCCTATTCATCCTCTCAAACCAAGAAACAACCATTTGGAGTTGACATTGAATAACAACCCAAACATTAGACCTAACACTAAGCTT
M_sylvanus		AAGCTTCTCCGGTGCAACTATCCTTATAGTTGCCCATGGACTCACCTCTTCCATATACTTCTGCTTGGCCAACTCAAACTACGAACGCACCCACAGCCGCATCATACTACTATCCCGAGGACTCCAAATCCTACTCCCACTAACAGCCTTCTGATGATTCACAGCAAGCCTTACTAATCTTGCTCTACCCTCCACTATTAATCTACTGGGCGAACTCTTCGTAATCGCAACCTCATTTTCCTGATCCCACATCACCATCATACTAACAGGACTGAACATACTAATTACAGCCCTCTACTCTCTTCACATATTCACCACAACACAACGAGGAGCGCTCACACACCACATAATTAACATAAAACCACCTTTCACACGAGAAAACATATTAATACTCATACACCTCGCTCCAATTATTCTTCTATCTCTTAACCCCAACATCATTCTAGGATTTACTTCCTGTAAATATAGTTTAATTAAAACATTAGACTGTGAATCTAACTATAGAAGCTTA-CCACTTCTTATTTACCGAGAAAACT-TGCAAGGACCGCTAATCCACACCTCCGTACTTAAAACTACGGTTTTCTCAACTTTTAAAGGATAACAGCTATCCATTGGCCTTAGGAGTCAAAAATATTGGTGCAACTCCAAATAAAAGTAATAATCATGTATACCCCCATCATAATAACAACTCTCATCTCCCTAACTCTTCCAATTTTCGCTACCCTTATCAACCCCAACAAAAAACACCTATATCCAAACTACGTAAAAACAGCCGTAATATATGCTTTCATTACCAGCCTCTCTTCAACAACTTTATATATATTCTTAAACCAAGAAACAATCATCTGAAGCTGGCACTGAATAATAACCCAAACACTAAGCCTAACATTAAGCTT
Saimiri_sciureus	AAGCTTCACCGGCGCAATGATCCTAATAATCGCTCACGGGTTTACTTCGTCTATGCTATTCTGCCTAGCAAACTCAAATTACGAACGAATTCACAGCCGAACAATAACATTTACTCGAGGGCTCCAAACACTATTCCCGCTTATAGGCCTCTGATGACTCCTAGCAAATCTCGCTAACCTCGCCCTACCCACAGCTATTAATCTAGTAGGAGAATTACTCACAATCGTATCTTCCTTCTCTTGATCCAACTTTACTATTATATTCACAGGACTTAATATACTAATTACAGCACTCTACTCACTTCATATGTATGCCTCTACACAGCGAGGTCCACTTACATACAGCACCAGCAATATAAAACCAATATTTACACGAGAAAATACGCTAATATTTATACATATAACACCAATCCTCCTCCTTACCTTGAGCCCCAAGGTAATTATAGGACCCTCACCTTGTAATTATAGTTTAGCTAAAACATTAGATTGTGAATCTAATAATAGAAGAATA-TAACTTCTTAATTACCGAGAAAGTG-CGCAAGAACTGCTAATTCATGCTCCCAAGACTAACAACTTGGCTTCCTCAACTTTTAAAGGATAGTAGTTATCCATTGGTCTTAGGAGCCAAAAACATTGGTGCAACTCCAAATAAAAGTAATA---ATACACTTCTCCATCACTCTAATAACACTAATTAGCCTACTAGCGCCAATCCTAGCTACCCTCATTAACCCTAACAAAAGCACACTATACCCGTACTACGTAAAACTAGCCATCATCTACGCCCTCATTACCAGTACCTTATCTATAATATTCTTTATCCTTACAGGCCAAGAATCAATAATTTCAAACTGACACTGAATAACTATCCAAACCATCAAACTATCCCTAAGCTT
;
end;

begin mrbayes; 
    set autoclose=yes nowarn=yes; 
    lset nst=6 rates=gamma; 
    mcmc nruns=1 ngen=10000 samplefreq=10; 
end;

A PBS Pro batch script must be created to run your job. The first script below shows an MPI parallel run of the above '.nex' input file. This script selects 4 processors (cores) and allows PBS to put them on any compute node. Note that when running any parallel program one must be cognizant of the scaling properties of its parallel algorithm; in other words, how much a given job's running time drops as one doubles the number of processors used. All parallel programs arrive at a point of diminishing returns that depends on the algorithm, the size of the problem being solved, and the performance features of the system on which it is run. We might have chosen to run this job on 8, 16, or 32 processors (cores), but would only do so if the improvement in performance scales. Improvements of less than 25% after a doubling are an indication that a reasonable maximum number of processors has been reached for that particular set of circumstances. For example, if doubling from 8 to 16 cores cuts the run time from 100 minutes only to 85 minutes (a 15% improvement), 8 cores is close to the practical limit for that input.

Here is the 4 processor MPI parallel PBS batch script:

#!/bin/bash
#PBS -q production
#PBS -N MRBAYES_mpi
#PBS -l select=4:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel executable to run
echo ">>>> Begin MRBAYES MPI Run ..."
echo ""
mpirun -np 4 -machinefile $PBS_NODEFILE /share/apps/mrbayes/default/bin/mb ./primates.nex > primates.out 2>&1
echo ""
echo ">>>> End   MRBAYES MPI Run ..."

This script can be dropped into a file (say 'mrbayes_mpi.job') on either BOB or ANDY and run with:

qsub mrbayes_mpi.job

This test case should take no more than a couple of minutes to run and will produce PBS output and error files beginning with the job name 'MRBAYES_mpi'. Other MrBayes-specific outputs will also be produced. Details on the meaning of the PBS script are covered above in this Wiki's PBS section. The most important lines are '#PBS -l select=4:ncpus=1:mem=1920mb' and '#PBS -l place=free'. The first instructs PBS to select 4 resource 'chunks', each with 1 processor (core) and 1,920 MBs of memory in it, for the job (on ANDY as much as 2,880 MBs might have been selected). The second line instructs PBS to place this job wherever the least used resources are found (i.e. freely). The master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command. As this is a parallel job, other compute nodes may also be called into service to complete this job.

The CUNY HPC Center also provides a serial version of MrBayes. A PBS batch script for running the serial version is easy to prepare from the above by making a few changes. Here is a listing of the differences between the above MPI script and the serial script:

3,4c3,4
< #PBS -N MRBAYES_mpi
< #PBS -l select=4:ncpus=1:mem=1920mb
---
> #PBS -N MRBAYES_serial
> #PBS -l select=1:ncpus=1:mem=1920mb
16c16
< echo ">>>> Begin MRBAYES MPI Run ..."
---
> echo ">>>> Begin MRBAYES Serial Run ..."
18c18
< mpirun -np 4 -machinefile $PBS_NODEFILE /share/apps/mrbayes/default/bin/mb ./primates.nex
---
> /share/apps/mrbayes/default/bin/mb-serial ./primates.nex
20c20
< echo ">>>> End   MRBAYES MPI Run ..."
---
> echo ">>>> End   MRBAYES Serial Run ..."

Finally, it is possible to run MrBayes in GPU-accelerated mode on ANDY. This is an experimental version of the code and users are cautioned to check their results and note their performance to be sure they are getting accurate answers in shorter time periods. Nothing is worse in HPC than going in the wrong direction, more slowly (this principle applies to NYC Taxi rides as well). Here is yet another script that will run the GPU-accelerated version of MrBayes (again, on ANDY only).

#!/bin/bash
#PBS -q production_gpu
#PBS -N MRBAYES_gpu
#PBS -l select=1:ncpus=1:ngpus=1:mem=2880mb:accel=fermi
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the GPU parallel executable to run
echo ">>>> Begin MRBAYES GPU Run ..."
echo ""
/share/apps/mrbayes/default/bin/mb-gpu ./primates.nex
echo ""
echo ">>>> End   MRBAYES GPU Run ..."

There are several differences worth pointing out. First, this job is submitted to the 'production_gpu' queue, which ensures that PBS selects those ANDY compute nodes with attached GPUs (i.e. the compute nodes beginning with 'gpute-'). Second, the resource request line '-l select' requests more than just a processor and some memory; it also requests a GPU (ngpus=1) and a particular flavor of GPU (accel=fermi). The NVIDIA Fermi GPUs on ANDY (96 in all) each have 448 processors. In requesting 1 GPU here, we are getting 448 processors assigned to our task. Individually, GPU processors are less powerful than CPU processors, but in such numbers (if they can be used in parallel), they can deliver significant performance improvements.

The last difference worth noting is the name of the executable, 'mb-gpu', which selects the GPU-accelerated version of the code.

msABC

msABC is a program for simulating various neutral evolutionary demographic scenarios based on the software ms (Hudson 2002). msABC extends ms by calculating a multitude of summary statistics. Therefore, msABC is suitable for performing the sampling step of an Approximate Bayesian Computation (ABC) analysis under various neutral demographic models. The main advantages of msABC are (i) use of various prior distributions, such as uniform, Gaussian, log-normal, and gamma, (ii) implementation of a multitude of summary statistics for one or more populations, (iii) an efficient implementation, which allows the analysis of hundreds of loci and chromosomes even on a single computer, and (iv) extended flexibility, such as simulation of loci of variable size and simulation of missing data.

msABC is suitable for analysing multi-locus data. A major assumption is that the loci are independent. Therefore, it is suitable for cases where a multitude of loci from a single chromosome or a genome are available and the loci are located far enough apart to be considered independent.

Complete documentation of msABC software can be found here.

msABC version 20120315 has been installed at the CUNY HPC Center on the ANDY cluster. It runs in serial mode (on 1 core), and all msABC jobs need to be submitted to the PBS queue. Here is a step-by-step example of setting up and starting a typical msABC job.

First you need to load the corresponding module (you can read more about modules here):

module load msabc

After that create a PBS script and save it to a file (let's call it "msabc.job" for example):

#!/bin/bash
#PBS -q production
#PBS -N msABC
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -V

echo "Starting msABC job ..."

cd $PBS_O_WORKDIR

msABC 40 10 -t 100 -I 2 20 20 0.01 > test.out

echo "Finishing msABC job"


Here 1 core is requested from the PBS manager. The actual msABC process is started by the line "msABC 40 10 -t 100 -I 2 20 20 0.01 > test.out", and the job output is redirected to the file "test.out".

Submit the job to the PBS queue using

qsub msabc.job

You can check the status of the job with the "qstat" command. Upon successful completion a few files will be generated in your working directory:

# ls
log.txt  msABC.e245823  msABC.o245823  seedms  msabc.job  test.out

Files msABC.e245823 and msABC.o245823 will contain standard error and standard output respectively. "seedms" is a seed file and test.out is a file with all the outputs as discussed above.

MSMS

MSMS is a tool to generate sequence samples under both neutral models and single-locus selection models. MSMS permits the full range of demographic models provided by its relative MS (Hudson, 2002). In particular, it allows for multiple demes with arbitrary migration patterns, population growth and decay in each deme, and for population splits and mergers. Selection (including dominance) can depend on the deme and also change with time. The program is designed to be command-line compatible with MS; however, no prior knowledge of MS is assumed for this document.

Applications of MSMS include power studies, analytical comparisons, and approximate Bayesian computation, among many others. Because most applications require the generation of a large number of independent replicates, the code is designed to be efficient and fast. For the neutral case, it is comparable to MS and even faster for large recombination rates. For selection, the performance is only slightly slower, making this one of the fastest tools for simulation with selection. MSMS was developed in Java and can run on any hardware that supports Java 1.6.

MSMS version 1.3 has been installed at the CUNY HPC Center on BOB. It can be run serially (1 core) or in multi-threaded parallel mode on a multi-core compute node (e.g. 8 cores on BOB and 4 on ANDY). MSMS is a command-line only program; there is no GUI, and you cannot use a mouse to set up simulations. The command line may look intimidating, but in reality it is quite easy to build up very complicated models if need be. The trick is to build the model up one step at a time. MSMS generates sample-sequence outputs and, as such, does not generally require input files. All of MSMS's command-line options are summarized here [53] and a more complete user manual can be found here [54].

Here is a PBS batch script that will start a serial MSMS job on one processor (core) of a single BOB compute node:

#!/bin/bash
#PBS -q production
#PBS -N MSMS_serial
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin MSMS Serial Run ..."
echo ""
/share/apps/msms/default/bin/msms -N 10000 -ms 10 1000 -t 1
echo ""
echo ">>>> End   MSMS Serial Run ..."

This PBS batch script can be dropped into a file (say 'msms_serial.job') and started with the command:

qsub msms_serial.job

This test case should take no more than a minute to run and will produce PBS output and error files beginning with the job name 'MSMS_serial'. The MSMS-specific output will be written to the PBS output file. Details on the meaning of the PBS script are covered above in this Wiki's PBS section. The most important lines are '#PBS -l select=1:ncpus=1:mem=1920mb' and '#PBS -l place=free'. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory. The second line instructs PBS to place this job wherever the least used resources are to be found (i.e. freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The MSMS command itself considers a single diploid population:

msms -N 10000 -ms 10 1000 -t 1

This command tells msms to use an effective population size of 10000 via the -N option. This option is unique to msms and is important even when selection is not being considered; generally, it is important to use a large number, and the value of N does not affect run times in any way. The '-ms 10 1000' option is the same as the first two options to MS: the first is the number of samples, the second the number of replicates. After this option, all the normal options of ms can be used and have the same meanings as in MS. The last option, '-t 1', specifies the theta parameter. We have assumed a diploid population, so theta is 4 * N * (mutation rate); with N = 10000 and theta = 1, this corresponds to a per-locus mutation rate of 2.5e-5. All parameters are scaled with N in some way.
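Because msms keeps MS's command-line conventions, the same kinds of options used in the msABC example above carry over directly; a sketch of a two-deme run with symmetric migration (the particular values here are illustrative only) would be:

# 20 samples split evenly across 2 demes, 1000 replicates,
# theta of 5, and a symmetric scaled migration rate of 1.0
/share/apps/msms/default/bin/msms -N 10000 -ms 20 1000 -t 5 -I 2 10 10 1.0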

To run the same simulation in 4-way thread parallel mode, a few minor changes to the serial script above are required:

3,5c3,5
< #PBS -N MSMS_serial
< #PBS -l select=1:ncpus=1:mem=1920mb
< #PBS -l place=free
---
> #PBS -N MSMS_threads
> #PBS -l select=1:ncpus=4:mem=7680mb
> #PBS -l place=pack
18c18
< /share/apps/msms/default/bin/msms -N 10000 -ms 10 1000 -t 1
---
> /share/apps/msms/default/bin/msms -N 10000 -ms 10 1000 -t 1 -threads 4

The '-l select' line requests a PBS 'chunk' of 4 cores and 4 times as much memory. In the threaded job, we ask PBS to 'pack' the 4 cores on the same node. Thread-based parallel programs can make use only of processors (cores) on the same physical node. On BOB, there are 8 cores per compute node. NOTE: Before running in thread-parallel mode generally, please compare the total run time and results between a serial run and the identical thread-parallel run to make sure that the run time is less and the results are the same.

NAMD

NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. [55].

MPI parallel versions of NAMD without GPU support are installed on SALK, ANDY, and PENZIAS. In addition, a parallel, GPU-enabled version of the code is installed on PENZIAS. When the non-GPU version is used, the low-latency, custom Gemini interconnect on SALK should provide superior scaling performance at higher core counts. In order to use the code, please load the appropriate module first. The following line will load the application environment for non-GPU-enabled NAMD on ANDY and PENZIAS:

module load namd

A batch submit script for NAMD that runs the CPU-only version on ANDY and PENZIAS using 'mpirun' on 16 processors, 4 to a compute node, follows. Please note that on PENZIAS the que is called production. Example below is for ANDY. For PENZIAS use production instead of production_qdr and comment line 19 (below ON ANDY line) and uncomment line 21 (below ON PENZIAS line).

#!/bin/bash
# for PENZIAS use production instead of production_qdr
#PBS -q production_qdr
#PBS -N NAMD_MPI
#PBS -l select=4:ncpus=4:mpiprocs=4
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the Non-Threaded MPI parallel executable to run
>>> Begin Non-Threaded">
echo ">>>> Begin Non-Threaded NAMD MPI Parallel Run ..."
# ON ANDY use the line:
mpirun -np 16  -machinefile $PBS_NODEFILE /share/apps/namd/default/Linux-x86_64-icc/namd2  ./hivrt.conf > hivrt_mpi.out
# ON PENZIAS use the line below and comment out the one above:
# mpirun -np 16  -machinefile $PBS_NODEFILE namd2  ./hivrt.conf > hivrt_mpi.out

echo ">>>> End   Non-Threaded NAMD MPI Parallel Run ..."
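>>> End   Non-Threaded">

As with the other applications, this script can be dropped into a file (say 'namd_mpi.job', an example name) and submitted with:

qsub namd_mpi.job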

To use the GPU-enabled NAMD version, users must run on PENZIAS. First load the module file:

module load namd/2.9_cuda_5_5

For a job similar to the one above, but using 4 CPU cores for the bonded interactions and an additional 4 GPUs for the non-bonded interactions, the following script could be used. Please note that this setup is valid only on PENZIAS, since it is the only server that has GPUs.

#!/bin/bash
#PBS -q production
#PBS -N NAMD_GPU
#PBS -l select=2:ncpus=2:ngpus=2:accel=kepler:mpiprocs=2
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the Non-Threaded MPI parallel executable to run
>>> Begin Non-Threaded NAMD MPI-GPU">
echo ">>>> Begin Non-Threaded NAMD MPI-GPU Parallel Run ..."
mpirun -np 4  -machinefile $PBS_NODEFILE namd2  +idlepoll +devices 0,1,0,1   ./hivrt.conf > hivrt_gpu.out
echo ">>>> End    Non-Threaded NAMD MPI-GPU Parallel Run ..."

Job submission on SALK via PBS is somewhat different. First, there is no module file. The script below shows how to run a 16-processor (core) job on SALK:

#!/bin/bash
#PBS -q production
#PBS -N NAMD_MPI
#PBS -l select=16:ncpus=1:mem=2048mb
#PBS -l place=free
#PBS -j oe
#PBS -o NAMD_MPI.out
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'aprun' and point to the Non-Threaded MPI parallel executable to run
echo ">>>> Begin NAMD Non-Threaded MPI Run ..."
aprun -n 16 -N 16 -cc cpu /share/apps/namd/default/CRAY-XT-g++/namd2  ./hivrt.conf > hivrt.out
echo ">>>> End   NAMD Non-Threaded MPI Run ..."

The most important difference to note is that on SALK the 'mpirun' command is replaced with Cray's 'aprun' command. The 'aprun' command is used to start all jobs on SALK's compute nodes and mediates the interaction between the PBS script's resource requests and the ALPS resource manager on the Cray. SALK users should familiarize themselves with 'aprun' and its options by reading 'man aprun' on SALK. Users cannot request more resources on their 'aprun' command lines than are defined by the PBS script's resource request lines. There is a useful discussion elsewhere in this Wiki about the interaction between PBS and ALPS as mediated by the 'aprun' command and the error messages generated when there is a mismatch.

For any of these jobs to run, all the required auxiliary files must be present in the directory from which the job is run.
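
A minimal pre-submission check along the following lines can catch a missing file before the job starts; the file names and extensions here are hypothetical and should be replaced with those actually referenced by your configuration file:

# List the input files named in the NAMD configuration file
grep -iE '^(structure|coordinates|parameters)' hivrt.conf

# Confirm that the referenced files are present in the working directory
ls -l hivrt.conf *.psf *.pdb *.prm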

Network Simulator-2 (NS2)

NS2 is a discrete event simulator targeted at networking research. NS2 provides substantial support for simulation of TCP, routing, and multicast protocols over wired and wireless (local and satellite) networks. Versions 2.31 and 2.33 are installed on BOB at the CUNY HPC Center. For more detailed information look here.

Running NS2 is a four-step process:

1. Prepare a Tcl script for NS2 like the example (ex.tcl) shown below. This example has 2 nodes with 1 link and uses a UDP agent with the CBR traffic generator.

set ns [new Simulator]
set tr [open trace.out w]
$ns trace-all $tr

proc finish {} {
        global ns tr
        $ns flush-trace
        close $tr
        exit 0
}

set n0 [$ns node]
set n1 [$ns node]

$ns duplex-link $n0 $n1 1Mb 10ms DropTail

set udp0 [new Agent/UDP]
$ns attach-agent $n0 $udp0
set cbr0 [new Application/Traffic/CBR]
$cbr0 set packetSize_ 500
$cbr0 set interval_ 0.005
$cbr0 attach-agent $udp0
set null0 [new Agent/Null]
$ns attach-agent $n1 $null0
$ns connect $udp0 $null0  

$ns at 0.5 "$cbr0 start"
$ns at 4.5 "$cbr0 stop"
$ns at 5.0 "finish"

$ns run

2. Create a PBS batch submit script like the one shown here:

#!/bin/bash
#PBS -q production
#PBS -N NS2-job
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# You must explictly change to your working
# directory in PBS

cd $HOME/my_NS2_wrk

/share/apps/ns2/ns-allinone-2.31/ns-2.31/ns ./ex.tcl

3. Submit the job (assuming the script above was saved in a file named 'submit') with:

qsub submit

4. Graph the result. At the HPC Center, 'nam' files can be produced, but cannot be run because they require a graphical environment for execution. Trace Graph is a free network trace file analyzer developed for NS2 trace processing. Trace Graph can support any trace format if converted to its own or the NS2 trace format. Supported NS2 trace file formats include:

wired, satellite, wireless, new trace, wired-wireless.

For more information on Trace Graph look here [56].

Users graphing results with Trace Graph must use Linux with the X Window System and perform the following steps:

1. SSH to BOB with X11 forwarding.

ssh -X  your.name@bob.csi.cuny.edu

2. Start tracegraph by typing the command "trgraph".

trgraph

NWChem

NWChem is an ab initio computational chemistry software package which also includes molecular dynamics (MM, MD) and coupled quantum mechanical and molecular dynamics functionality (QM-MD). It was designed to run on high-performance parallel supercomputers as well as conventional workstation clusters. It aims to be scalable both in its ability to treat large problems efficiently, and in its usage of available parallel computing resources, both processors and memory, whether local or distributed. NWChem has been developed by the Molecular Sciences Software group of the Theory, Modeling & Simulation program of the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). Most of the implementation has been funded by the EMSL Construction Project. The CUNY HPC Center is currently running NWChem 6.3 revision 2 built using its InfiniBand communications interface. It is strongly recommended that users run NWChem on PENZIAS; the software is also installed on ANDY, but it will be phased out there in the near future. Both the ability to run and the performance of a run depend very much on proper settings of the start-up directives in the NWChem input file. For details on the content and structure of each section of the NWChem input deck users should consult the NWChem Users Manual at http://www.emsl.pnl.gov/capabilities/computing/nwchem/docs/usermanual.pdf.

One particular directive - memory - allows the user to specify the amount of memory PER PROCESSOR CORE that NWChem can use for the job. If this directive is omitted from the input file, NWChem will use the default setting, which is currently only 400 MB. In NWChem there are three distinct regions of memory: stack, heap, and global. On PENZIAS (and on all distributed memory systems) all three types of memory compete for the same pool, i.e. the total memory size is stack+heap+global. The default partition is 25% heap, 25% stack, and 50% global; thus 4096 mb will be partitioned as 1024 MB for stack, 1024 MB for heap, and 2048 MB for global. In the following example the first two lines are equivalent: they allocate the total available per-core memory on a PENZIAS node and use the default partitioning. The third line does the same but changes the partition by allocating 75% of the total memory as global memory.

memory 3686 mb
memory total 3686 mb
memory total 3686 global 2764 mb
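
Applying the default 25/25/50 split described above, the 'memory 3686 mb' request in the first line would therefore be partitioned as roughly 921 MB of stack, 921 MB of heap, and 1843 MB of global memory.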

NWChem recognizes the following memory units:

real 
double
integer
byte
kb (kilobyte)
mb (megabyte)
mw (megawords - 8 bytes)
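
Because a word is 8 bytes, a request such as 'memory total 460 mw' is roughly equivalent to 'memory total 3680 mb'; the figure is given here only to illustrate the unit conversion.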

A sample NWChem input file which does SCF calculation on water with 6-31g* basis set is shown here:

echo
start water2
title "an example simple water calculation"

# The memory options are system specific see below 

memory total   .... mb global .... mb

geometry units au
 O 0       0              0
 H 0       1.430    -1.107
 H 0     -1.430    -1.107
end

basis
  O library 6-31g*
  H library 6-31g*
end

task scf gradient

If the run is to be done in parallel on PENZIAS, the standard (and maximum) per-core quantity of memory available on a node, along with the portion of it used by NWChem's Global Arrays computing model, should not exceed 3686 mb per core. Thus the memory line for parallel runs on PENZIAS should look like:

memory total 3686 mb global 2764 mb

On ANDY it would be 'memory total 2880 mb global 2160 mb'. Single-core runs on PENZIAS require different settings.

On servers with a slower InfiniBand interconnect, NWChem can cause problems with the ARMCI communications conduit. In simple terms, this means that on ANDY it is recommended to use 1 core per chunk. The interconnect on PENZIAS can support full utilization of the nodes, i.e. the maximum number of cores per chunk is 8.

Below is a PBS batch submit script to run the above example on 16 processors (cores) on PENZIAS. Note that on PENZIAS the main queue is called 'production'.

#!/bin/csh

#PBS -q production
#PBS -N testwater_wikig
#PBS -l select=16:ncpus=1:mem=3686mb
#PBS -l place=free
#PBS -V

echo "This job's process 0 host is: " `hostname`; echo ""

# Must explicitly change to your working directory under PBS

cd $PBS_O_WORKDIR

# Set up NWCHEM environment, permanent, and scratch directory

setenv NWCHEM_ROOT /share/apps/nwchem/default/
setenv NWCHEM_BASIS_LIBRARY ${NWCHEM_ROOT}/data/libraries/

# This line sets up the permanent directory for the run.
setenv PERMANENT_DIR $PBS_O_WORKDIR

setenv MY_SCRDIR `whoami;date '+%m.%d.%y_%H:%M:%S'`
setenv MY_SCRDIR `echo $MY_SCRDIR | sed -e 's; ;_;'`

# ATTENTION: REPLACE test.nw with the name of your INPUT DATA file. Do not edit the RUN_FILE part.
# Do not change the syntax.

set INPUT_FILE="test.nw"
set RUN_FILE="runner.nw"

# UNCOMMENT below line when run on ANDY
# setenv SCRATCH_DIR /home/nwchem/nw6.3_scr/${MY_SCRDIR}_$$

# COMMENT below line when run on ANDY
setenv SCRATCH_DIR /state/partition1/nw6.3_scr/${MY_SCRDIR}_$$

mkdir -p $SCRATCH_DIR

# This line names and inserts the scratch_dir directive into the input file. Do not omit this line.

`sed "1i\scratch_dir $SCRATCH_DIR" $INPUT_FILE > $RUN_FILE`

echo "The scratch directory for this run is: $SCRATCH_DIR"; echo ""

# Start NWCHEM job. Replace test.out with your own output file name. DO NOT replace RUN_FILE.

mpirun -np 16 -machinefile $PBS_NODEFILE ${NWCHEM_ROOT}/bin/nwchem ./$RUN_FILE > test.out

# Clean up scratch files by default

/bin/rm -r $SCRATCH_DIR
/bin/rm -r $RUN_FILE

echo 'Job is done'   

Please consult the sections on the PBS Pro Batch scheduling system for information on how to modify this sample deck for different processor counts and about the meaning of each of the PBS script lines.

On PENZIAS, in order to utilize all available memory, and depending on the particular system being studied, it is sometimes better to use the PBS construction below. Please note that, to run successfully, the memory placement must match the job requirements. Please read carefully page 2-10 of the NWChem user manual.

#PBS -l select=8:ncpus=2:mem=3686mb

In order to run on 128 cores on PENZIAS the line should look like:

#PBS -l select=16:ncpus=8:mem=3686mb

Please do not forget to adjust the memory requirements both in the PBS script and in the NWChem input file according to your particular molecular system. Remember that performance depends on how memory is allocated as well, so try a few allocation schemes before picking the best one for your job. For instance, for some small molecular systems it is possible to gain performance by reducing memory and keeping the job on a single node.

The optimal number of cores also depends on the molecular system being studied. There is no such thing as 'one size fits all'. Users can find the optimal number of processor cores by starting with a small number (e.g. 4 or 8) and doubling the number of cores for each consecutive run. Repeat the process until no significant improvement in performance is recorded over two consecutive runs. Further increases in CPU cores should then be avoided.

Finally, to get their NWChem jobs to run, each user will need to create in their $HOME directory a ".nwchemrc" file that is a copy of, or a symbolic link to, the site-specific "default.nwchemrc" file located in:

/share/apps/nwchem/default/data/

The symbolic link can be created with the command:

ln -s /share/apps/nwchem/default/data/default.nwchemrc $HOME/.nwchemrc
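
A quick way to confirm that the link is in place is:

ls -l $HOME/.nwchemrc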

Users may also check the Q/A section of this document for common mistakes and their solutions.

Octopus

Octopus is a pseudopotential real-space package aimed at the simulation of the electron-ion dynamics of one-, two-, and three-dimensional finite systems subject to time-dependent electromagnetic fields. The program is based on time-dependent density-functional theory (TDDFT) in the Kohn-Sham scheme. All quantities are expanded in a regular mesh in real space, and the simulations are performed in real time. The program has been successfully used to calculate linear and non-linear absorption spectra, harmonic spectra, laser induced fragmentation, etc. of a variety of systems. Complete information about the octopus package can be found at its homepage, http://www.tddft.org/programs/octopus. The on-line user manual is available at http://www.tddft.org/programs/octopus/wiki/index.php/Manual.

The MPI parallel version of Octopus 4.1.1 has been installed on PENZIAS and ANDY (the older 4.0.0 release is also installed on ANDY) with all its associated libraries (metis, netcdf, sparskit, etsf_io, etc.). It was built with an Intel-compiled version of OpenMPI 1.6.4 and has passed all its internal test cases.

A sample Octopus input file (required to have the name 'inp') is provided here:

# Sample data file:
#
# This is a simple data file. It will complete a gas phase ground-state
# calculation for a neon atom. Please consult the Octopus manual for a
# brief explanation of each section and the variables.
#
FromScratch = yes

CalculationMode = gs

ParallelizationStrategy = par_domains

Dimensions = 1
Spacing = 0.2
Radius = 50.0
ExtraStates = 1

TheoryLevel = independent_particles

%Species
  "Neon1D" | 1 | spec_user_defined | 10 | "-10/sqrt(0.25 + x^2)"
%

%Coordinates
  "Neon1D" | 0
%

ConvRelDens = 1e-7

Octopus offers its users two distinct and combinable strategies for parallelizing its runs. The first and default is to parallelize by domain decomposition of the mesh (METIS is used). In the input deck above, this method is chosen explicitly (ParallelizationStrategy = par_domains). The second is to compute the entire domain on each processor, but to do so for some number of distinct states (ParallelizationStrategy = par_states). Users wishing to control the details of Octopus when run in parallel are advised to consult the advanced options section of the manual at http://www.tddft.org/programs/octopus/wiki/index.php/Manual:Advanced_ways_of_running_Octopus.

A sample PBS Pro batch job submission script that will run on PENZIAS for the above input file is shown here:

#!/bin/csh
#PBS -q production
#PBS -N neon_gstate
# The next statements select 8 chunks of 1 core and
# 3840mb of memory each (the pro-rated limit per
# core on PENZIAS), and allow PBS to freely place
# those resource chunks on the least loaded nodes.
#PBS -l select=8:ncpus=1:mem=3840mb
#PBS -l place=free
#PBS -V

# Check to see if the Octopus module is loaded.
(which octopus_mpi > /dev/null) >& /dev/null
if ($status) then
echo ""
echo "Please run: 'module load octopus'"
echo "before submitting this script. Exiting ... "
echo ""
exit
else
echo ""
endif

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname
echo ""

# Must explicitly change to your working directory under PBS
cd $PBS_O_WORKDIR

# Set up OCTOPUS environment, working, and temporary directory

setenv OCTOPUS_ROOT /share/apps/octopus/default

setenv OCT_WorkDir \'$PBS_O_WORKDIR\'

setenv MY_SCRDIR `whoami;date '+%m.%d.%y_%H:%M:%S'`
setenv MY_SCRDIR `echo $MY_SCRDIR | sed -e 's; ;_;'`

setenv SCRATCH_DIR  /state/partition1/oct4.1_scr/${MY_SCRDIR}_$$
mkdir -p $SCRATCH_DIR
setenv OCT_TmpDir \'/state/partition1/oct4.1_scr/${MY_SCRDIR}_$$\'

echo "The scratch directory for this run is: $OCT_TmpDir"

# Start OCTOPUS job

echo ""
echo ">>>> Begin OCTOPUS MPI Parallel Run ..."
mpirun -np 8 -machinefile $PBS_NODEFILE octopus_mpi > neon_gstate.out
echo ">>>> End   OCTOPUS MPI Parallel Run ..."
echo ""

# Clean up scratch files by default

/bin/rm -r $SCRATCH_DIR

echo 'Your Octopus job is done!'


This script requests 8 resource 'chunks' each with 1 processor. The memory selected on the '-l select' line is sized to PENZIAS's pro-rated maximum memory per core. Please consult the sections on the PBS Pro Batch scheduling system below for information on how to modify this sample deck for different processor counts. The rest of the script describes its action in comments. Before this script will run the user must load the Octopus module with:

module load octopus

which by default loads Octopus version 4.1.1. This script would need to be modified as follows to run on ANDY:

< # 3840mb of memory each (the pro-rated limit per
---
> # 2880mb of memory each (the pro-rated limit per
8c8
< #PBS -l select=8:ncpus=1:mem=3840mb
---
> #PBS -l select=8:ncpus=1:mem=2880mb
41c41
< setenv SCRATCH_DIR  /state/partition1/oct4.1_scr/${MY_SCRDIR}_$$
---
> setenv SCRATCH_DIR  /home/octopus/oct4.1_scr/${MY_SCRDIR}_$$
43c43
< setenv OCT_TmpDir \'/state/partition1/oct4.1_scr/${MY_SCRDIR}_$$\'
---
> setenv OCT_TmpDir \'/home/octopus/oct4.1_scr/${MY_SCRDIR}_$$\'

Users should become aware of the scaling properties of their work by taking note of the run times at various processor counts. When doubling the processor count improves SCF cycle time by only a modest percentage, further increases in processor count should be avoided. The ANDY system has two distinct interconnects. One is a DDR InfiniBand network that delivers 20 Gbits per second of performance and the other is a QDR InfiniBand network that delivers 40 Gbits per second. Either will serve Octopus users well, but the QDR network should provide somewhat better scaling. PENZIAS has a still faster FDR InfiniBand network and should provide the best scaling. The HPC Center is interested in the scaling you observe on its systems and reports are welcome.

In the example above, the 'production' queue has been requested which works on both ANDY (DDR InfiniBand) and PENZIAS (FDR InfiniBand), but by adding a terminating '_qdr' one can select the QDR interconnect on ANDY. Selecting the right queue based on system activity will ensure that your job starts as soon as possible.
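
For example, to target the QDR side of ANDY, the queue request line in the script above would simply become:

#PBS -q production_qdr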

OpenMM

OpenMM is both a library and a stand-alone application that provides tools for modern molecular modeling and simulation. As a library it can be hooked into any code, allowing that code to do molecular modeling with minimal extra coding. Moreover, OpenMM has a strong emphasis on hardware acceleration via GPUs, thus providing not just a consistent API, but much greater performance than what one could get from just about any other code available. OpenMM was developed as part of the Physics-Based Simulation project led by Prof. Pande.

OpenSees

OpenSees, the Open System for Earthquake Engineering Simulation, is an object-oriented, open source software framework. It allows users to create both serial and parallel finite element computer applications for simulating the response of structural and geotechnical systems subjected to earthquakes and other hazards. OpenSees is primarily written in C++ and uses several Fortran and C numerical libraries for linear equation solving, and material and element routines. The software is installed on PENZIAS.

ParGAP

ParGAP is built on top of the GAP system. The latter is a system for computational discrete algebra, with particular emphasis on Computational Group Theory. GAP provides a programming language, a library of thousands of functions implementing algebraic algorithms written in the GAP language, as well as large data libraries of algebraic objects. The ParGAP (Parallel GAP) package itself provides a way of writing parallel programs using the GAP language. Former names of the package were ParGAP/MPI and GAP/MPI; the word MPI refers to the Message Passing Interface, a well-known standard for parallelism. ParGAP is based on the MPI standard, and this distribution includes a subset implementation of MPI, to provide a portable layer with a high level interface to BSD sockets. Since knowledge of MPI is not required for use of this software, the package is now referred to as simply ParGAP. For more information visit the author's ParGAP home page at: http://www.ccs.neu.edu/home/gene/pargap.html. The package is installed on the PENZIAS cluster. Users must load the pargap module:


module load pargap

The example below, named parlist.g, parallelizes the processing of a list.


#WARNING:  Read this with Read(), _not_ ParRead()

#Environment: None
#TaskInput:   elt, where elt is an element of argument, list
#TaskOutput:  fnc(elt), where fnc is argument
#Task:        Compute fnc(elt) from elt [ Hence, DoTask = fnc ]
#UpdateEnvironment:  None

ParInstallTOPCGlobalFunction( "MyParList",
function( list, fnc )
  local result, iter;
  result := [];
  iter := Iterator(list);
  MasterSlave( function() if IsDoneIterator(iter) then return NOTASK;
                          else return NextIterator(iter); fi; end,
               fnc,
               function(input,output) result[input] := output;
                                      return NO_ACTION; end,
               Error
             );
  return result;
end );

ParInstallTOPCGlobalFunction( "MyParListWithAglom",
function( list, fnc, aglomCount )
  local result, iter;
  result := [];
  iter := Iterator(list);
  MasterSlave( function() if IsDoneIterator(iter) then return NOTASK;
                          else return NextIterator(iter); fi; end,
               fnc,
               function(input,output)
                 local i;
                 for i in [1..Length(input)] do
                   result[input[i]] := output[i];
                 od;
                 return NO_ACTION;
               end,
               Error,  # Never called, can specify anything
               aglomCount
             );
  return result;
end );

The PBS script to run this program on PENZIAS is as follows:

#!/bin/bash
#PBS -N par_list
#PBS -q production
#PBS -l select=1:ncpus=4
#PBS -l place=free
#PBS -V

cd $PBS_O_WORKDIR

mpirun -np 4 -machinefile  $PBS_NODEFILE  pargap ./parlist.g >> parlist_out

 

This will create a master and 3 slave processes. Further information and a detailed manual on how to program with parallel GAP can be found at http://www.gap-system.org/Manuals/pkg/pargap/doc/manual.pdf

POPABC

PopABC is a computer package to estimate historical demographic parameters of closely related species/populations (e.g. population size, migration rate, mutation rate, recombination rate, splitting events) within an Isolation-with-Migration model. The software performs coalescent simulation in the framework of approximate Bayesian computation (ABC; Beaumont et al., 2002). PopABC can also be used to perform Bayesian model choice to discriminate between different demographic scenarios. The program can be used either for research or for education and teaching purposes. Further details and a manual can be found at the POPABC website here [57]

POPABC version 13.07.09 is currently installed on ANDY at the CUNY HPC Center. POPABC is a collection of serial programs. Here, we show how to run the 'toydata' example input case provided with the downloaded code. The example input files may be copied to the user's working directory for submission with:

cp /share/apps/bayescan/default/examples/toy*  .

To include all required environment variables and the path to the POPABC executables, run the module load command (the modules utility is discussed in detail above):

module load popabc

Here is a PBS batch script that runs this example input case:

#!/bin/bash
#PBS -q production
#PBS -N POPABC_test
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executables to run
echo ">>>> Begin POPABC Serial Run ..."
echo ">>>> Running summdata ... "
summdata.exe toytable.len toysim.sst toytarget > POPABC_SUMM.out 2>&1
echo "Done ..."
echo ">>>> Running simulate ... "
simulate.exe toyprior.prs toytarget.ssz toysim.sst toydata 0 0 0 > POPABC_SIM.out 2>&1
echo "Done ..."
echo ">>>> Running rejection ... "
rejection.exe toydata.dat toytarget.trg toyresults 16 18 0.01 > POPABC_REJECT.out 2>&1
echo "Done ..."
echo ">>>> End   POPABC Serial Run ..."

This PBS batch script can be dropped into a file (say popabc_serial.job) on ANDY and run with the following command:

qsub popabc_serial.job

It should take only a few minutes to run and will produce PBS output and error files beginning with the job name 'POPABC_test', along with a number of POPABC-specific files. The primary POPABC application results will be written into the user-specified file given at the end of each POPABC command line after the greater-than sign. Here, three executables are run and their outputs are written to files named 'POPABC_*.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered above in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this chunk on any compute node with the required resources available.

POPABC is described in detail in the manual [58].

PHOENICS

PHOENICS is an integrated Computational Fluid Dynamics (CFD) package for the preparation, simulation, and visualization of processes involving fluid flow, heat or mass transfer, chemical reaction, and/or combustion in engineering equipment, building design, and the environment. More detail is available at the CHAM website, here http://www.cham.co.uk.

Although we expect most users to pre- and post-process their jobs on office-local clients, the CUNY HPC Center has installed the Unix version of the entire PHOENICS package on ANDY. PHOENICS is installed in /share/apps/phoenics/default where all the standard PHOENICS directories are located (d_allpro, d_earth, d_enviro, d_photo, d_priv1, d_satell, etc.). Of particular interest on ANDY is the MPI parallel version of the 'earth' executable, 'parexe', which makes full use of the parallel processing power of the ANDY cluster for larger individual jobs. While the parallel scaling properties of PHOENICS jobs will vary depending on the job size, processor type, and the cluster interconnect, larger work loads will generally scale and run efficiently on 8 to 32 processors, while smaller problems will scale efficiently only up to about 4 processors. More detail on parallel PHOENICS is available at http://www.cham.co.uk/products/parallel.php. Aside from the tightly coupled MPI parallelism of 'parexe', users can run multiple instances of the non-parallel modules on ANDY (including the serial 'earexe' module) when a parametric approach can be used to solve their problems.

As suggested, the entire PHOENICS 2011 package is installed on ANDY and users can run the X11 version of the PHOENICS Commander display tool from ANDY's head node if they have connected using 'ssh -X andy.csi.cuny.edu' where the '-X' option ensures that X11 images are passed back to the original client. Doing this from outside the College of Staten Island campus where the CUNY HPC Center is located may produce poor results because the X11 traffic will have to be forwarded through the HPC Center gateway system. CUNY has also licensed a number of seats for office-local desktop installations of PHOENICS (for either Windows or Linux) so that this should not be necessary. Job preparation and post-processing work is generally most efficiently accomplished on the local desktop using the Windows version of PHOENICS VR, which can be run directly or from PHOENICS Commander.

A rough general outline of the PHOENICS work cycle is:

1.  The user runs VR Editor (preprocessor) on their workstation (or on ANDY) and
    perhaps selects a library case (e.g. 274) making changes to this case to match
    his/her specific requirements.
 
2.  The user leaves the VR editor where input files 'q1' and 'eardat' are created.  
    If the user is preprocessing from their desktop, these files would then be 
    transferred to ANDY using the 'scp' command or via the 'PuTTy' utility for 
    Windows.
 
3.  The user runs the solver on ANDY (typically the parallel version, 'parexe') from
    their working directory using the PBS batch submit script presented below.  This
    script reads the files 'q1' and 'eardat' (and potentially some other input files)
    and writes the key output files 'phi' and 'result'. 
 
4.  The user copies these output files back to their desktop (or not) and runs VR
    Viewer (postprocessor) which reads the graphics output file 'phi', or the user
    views tabular results manually in the 'result' file.

POLIS, available in Linux and Windows, has further useful information on running PHOENICS including tutorials, viewing documentation, and on all PHOENICS commands and topics here [59]. Graphical monitoring should be deactivated during parallel runs in ANDY's batch queue. To do this users should place two leading spaces in front of the command TSTSWP in the 'q1' file. The TSTSWP command is present in most library cases, including case 274 which is a useful test case. Graphical monitoring can be left turned on when running sequential 'earexe' on the desktop. This gives useful realtime information on sweeps, values, and the convergence progress.

Details on the use of the display and non-parallel PHOENICS tools can be found at the CHAM website and in the CHAM Encyclopaedia here [60].

The process of setting up a PHOENICS working directory and running the parallel version of 'earth' (parexe) on ANDY is described below. As a first step, users would typically create a directory called 'phoenics' in their $HOME directory as follows:

cd; mkdir phoenics

Next, the default PHOENICS installation root directory (version 2011 is the current default) named above should be symbolically linked to the 'lp36' subdirectory:

cd phoenics
ln -s /share/apps/phoenics/default ./lp36

The user must then generate the required input files for the 'earth' module which, as mentioned above in the PHOENICS work cycle section, are the 'q1' and 'eardat' files created by the VR Editor. These can be generated on ANDY, but it is generally easier to do this from the user's desktop installation of PHOENICS.

Because the current default version of PHOENICS, version 2011, was built with an earlier version of MPI that is no longer the default, users must use the modules command to unload the current defaults and load the previous set before submitting the PHOENICS PBS script below. This is a fairly simple procedure:

$module list
Currently Loaded Modulefiles:
  1) pbs/11.3.0.121723     2) cuda/5.0              3) intel/13.0.1.117      4) openmpi/1.6.3_intel
$
$module unload intel/13.0.1.117
$module unload openmpi/1.6.3_intel
$
$module load intel/12.1.3.293
$
$module load openmpi/1.5.5_intel
Note: Intel compilers will be set to version 12.1.3.293
$
$module list
Currently Loaded Modulefiles:
  1) pbs/11.3.0.121723     2) cuda/5.0              3) intel/12.1.3.293      4) openmpi/1.5.5_intel

Once the input files have been created and placed (transferred) into the working directory and the older modules have been loaded on ANDY, the following PBS Pro batch script can be used to run the job on ANDY. The progress of the job can be tracked with the PBS 'qstat' command.

#!/bin/bash
#PBS -q production_qdr
#PBS -N phx_test
#PBS -l select=8:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# Take a look at the set of compute nodes that PBS gave you
echo $PBS_NODEFILE
cat  $PBS_NODEFILE

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the parallel executable to run
echo ">>>> Begin PHOENICS MPI Parallel Run ..."
echo ""
echo "mpirun -np 8 -machinefile $PBS_NODEFILE ./lp36/d_earth/parexe"
mpirun -np 8 -machinefile $PBS_NODEFILE ./lp36/d_earth/parexe
echo ""
echo ">>>> End   PHOENICS MPI Parallel Run ..."

The job can be submitted (assuming the script was saved in a file named '8Proc.job') with:

qsub 8Proc.job

Constructing a PBS batch script is described in detail elsewhere in this Wiki document, but in short this script requests the QDR InfiniBand production queue ('production_qdr'), which runs the job on the side of ANDY with the fastest interconnect. It asks for 8 processors (cores), each with 2,880 MBytes of memory, and allows PBS to select those processors based on least-loaded criteria. Because this is just an 8-processor job, it could be packed onto a single physical node on ANDY for better scaling using '-l place=pack', but this would delay its start, as PBS would have to locate a completely free node.
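
For instance, the resource request lines for such a packed 8-core run would read:

#PBS -l select=8:ncpus=1:mem=2880mb
#PBS -l place=pack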

During the run, 'parexe' creates (N-1) directories (named Proc00#), where N is the number of processors requested (note: if the Proc00# directories do not already exist they will be created, but there will be an error message in the PBS error log, which can be ignored). The output from process zero is written into the working directory from which the script was submitted. The output from each of the other MPI processes is written into its associated 'Proc00#' directory. Upon successful completion, the 'result' file should show that the requested number of iterations (sweeps) was completed and print the starting and ending wall-clock times. At this point, the results (the 'phi' and 'result' files) from the PBS parallel job can be copied back to the user's desktop for post-processing.

NOTE: A bug is present in the non-graphical, batch version of PHOENICS that is used on the CUNY HPC clusters. This problem does not occur in Windows runs. To avoid the problem, a workaround modification to the 'q1' input file is required. The problem occurs only in jobs that require SWEEP counts greater than 10,000 (e.g. SWEEP=20000). Users requesting larger SWEEP counts must include the following in their 'q1' input files to avoid having their jobs terminated at 10,000 SWEEPs.

USTEER=F

This addition forces a bypass of the graphical IO monitoring capability in PHOENICS and prevents that section of code from capping the SWEEP count at 10,000 SWEEPs.

Finally, PHOENICS has been licensed broadly by the CUNY HPC Center, and it can provide activation keys for any desktop copies whose annual activation keys expire.

PHRAP-PHRED

PHRAP and PHRED are part of the DNA sequence analysis tool set that also includes the programs CROSSMATCH and SWAT. These tools are described in detail here [61], but a brief description of both, extracted from their manuals, follows. PHRED and PHRAP (along with CONSED) can be used for both small sequence assemblies and larger shotgun analyses. This makes the tools a perhaps under-utilized set for smaller non-genomic groups. Some variables may need to be adjusted, particularly in CONSED, but researchers who have multiple sequences from a small locus can use the suite, starting from their chromatogram files.

PHRAP is a program for shotgun sequence assembly, but it can also be used for small sequence assemblies. Its key features include its use of data quality information, both direct (from phred trace analysis) and indirect (from pairwise read comparisons), to delineate the likely accurate base calls in each read. This helps to discriminate repeats. It permits the use of the full reads in assembly, and allows a highly accurate consensus sequence to be generated. A probability of error is computed for each consensus sequence position, which can be used to focus human editing on particular regions. This helps to automate decision-making about where additional data are needed and provides users of the final sequence with information about local variations in quality. The PHRAP documentation is available here [62]

PHRED reads DNA sequencer trace data, calls bases, assigns quality values to the bases, and writes the base calls and quality values to output files. Phred can read trace data from chromatogram files in the SCF, ABI, and ESD formats. It automatically determines the file format, and whether the chromatogram file was compressed using gzip, bzip2, or UNIX compress. After calling bases, phred writes the sequences to files in either FASTA format, the format suitable for XBAP, PHD format, or the SCF format. Quality values for the bases are written to FASTA format files or PHD files, which can be used by the phrap sequence assembly program in order to increase the accuracy of the assembled sequence. The PHRED documentation is available here [63]

All the tools referenced above are installed at the CUNY HPC Center on both KARLE and ANDY. They may be run directly on KARLE, in command-line interactive mode, in the background (Unix batch), or within the CONSED GUI framework using the 'phredPhrap' scripting tool. The run times are generally short. On ANDY, they should be run from within the CUNY HPC Center PBS batch processing framework if the jobs will take more than a minute or two of wall-clock time. On both KARLE and ANDY, PHRED version 71220 is the default. Similarly, version 1.090518 of PHRAP is the default.

Below is a sample PBS batch script for ANDY that reproduces each step that the CONSED 'phredPhrap' script completes when it is run on KARLE. This script is meant to give you an idea of how any of these tools can be run in batch mode on ANDY. Not all of these steps are always required; PBS scripts that run only one or two of the tools present in this example can also be constructed. Details on the command-line options for each tool can be found in the manuals pointed to above.

Prior to running this example, a directory with example starting input data and the environment for each tool must be set up. One can obtain the standard test case from the PHRED installation tree on ANDY as follows:

$mkdir mytest
$
$cd mytest
$
$tar -xvf /share/apps/phred/default/data/STD.tar
$

This will create a collection of directories, some with input files, that will be referenced by the PBS batch script. These directories are listed here:

[richard.walsh@andy standard]$ls -l 
total 28
drwx------ 2 richard.walsh hpcadmin 4096 2012-12-28 17:37 chromat_dir
drwx------ 2 richard.walsh hpcadmin 4096 2012-12-28 17:37 chromats_to_add
drwx------ 2 richard.walsh hpcadmin 4096 2013-01-02 12:56 edit_dir
drwx------ 2 richard.walsh hpcadmin 4096 2013-01-25 13:51 phdball_dir
drwx------ 3 richard.walsh hpcadmin 4096 2013-02-27 12:38 phd_dir
drwx------ 2 richard.walsh hpcadmin 4096 2013-01-25 13:51 sff_dir
drwx------ 2 richard.walsh hpcadmin 4096 2013-01-25 13:51 solexa_dir
[richard.walsh@andy standard]$

Next, the environment for each of the required tools must be loaded using the modules command.

$
$module load phred
$module load phrap
$module load consed
$

Although CONSED is not used directly in this PBS script, files in its installation tree are referenced and its module must therefore be loaded. With the above steps completed, the following PBS batch script can be run on ANDY:

#!/bin/bash
#PBS -q production
#PBS -N PHRED_PHRAP.job
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Echoing the location of the phred_phrap parameter file
echo ""
echo "Using parameter file: $PHRED_PARAMETER_FILE"
echo ""

# Define the location of the consed screen files for cross_match
export SCREEN_PATH=${CONSED_HOME}/lib/screenLibs

# Just point to the serial executable to run
echo ">>>> Begin PHRED-PHRAP Batch Serial Run ..."
echo ""
echo ">>>> Running phred ... "
phred -id chromat_dir -pd phd_dir > phred.out 2>&1
echo "Done ..."
echo ">>>> Running phd2fasta ... "
phd2fasta -id phd_dir -os seqs_fasta -oq seqs_fasta.screen.qual > phd2fasta.out 2>&1
echo "Done ..."
echo ">>>> Running cross_match ... "
cross_match seqs_fasta ${SCREEN_PATH}/vector.seq -minmatch 12 -minscore 20 -screen > cross_match.out 2>&1
echo "Done ..."
echo ">>>> Running phrap ... "
phrap seqs_fasta.screen -new_ace > phrap.out 2>&1
echo "Done ..."
echo ""
echo ">>>> End   PHRED-PHRAP Batch Serial Run ..."

This script should be copied into a file in the same directory into which you 'untar-ed' the files above (here named 'mytest'). This would typically be done in an editor like 'vi' or 'emacs'. Assuming that the name given to this PBS script file is 'phred_phrap.job', the PBS job can be submitted with the following command:

qsub phred_phrap.job

This script walks the original sequence found in 'chromat_dir' through all of the steps that the 'phredPhrap' script would complete interactively on KARLE. Notice that four distinct programs are run, each with its own set of options. They produce all the 'seqs_fasta' files required for viewing in CONSED. Users may wish to run only one of the tools, in which case only one execution line, for perhaps 'phred' or 'phd2fasta', would be required in the script.

It should take only a few minutes to run and will produce PBS output and error files beginning with the job name 'PHRED_PHRAP', along with a number of tool-specific output files. The primary application results will be written into the user-specified file given at the end of each command line after the greater-than sign. Here, four executables are run and each writes an output file named 'XXX.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script options are covered above in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this chunk on any compute node with the required resources available. All the programs run by this script are assumed by PBS to be serial jobs.

Python

Python is a programming language that lets you work more quickly and integrate your systems more effectively. You can learn to use Python and see almost immediate gains in productivity and lower maintenance costs. [64]

There are two supported versions installed on the ANDY system:

  • Python 3.1.3 located under /share/apps/python/3.1.3/bin
  • Python 2.7.3 located under /share/apps/epd/7.3-2/bin


In order to make the Python binaries available, the user needs to load the corresponding module:

module load python/3.1.3

(for version 3.1.3) or

module load python/2.7.3

(for version 2.7.3). After the module is loaded, the Python interpreter can be invoked simply with the "python" command.

Version 2.7.3 is installed as the Enthought Python Distribution [65]. EPD contains over 100 libraries such as SciPy, NumPy, matplotlib, IPython, PyTables, PIL, wxPython, etc.
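
Longer-running Python scripts on ANDY should be submitted through PBS like any other serial job. A minimal sketch follows; the job name, script name ('myscript.py'), and output file are examples only, and the python module should be loaded (as above) before submission so that '#PBS -V' carries the environment into the batch job:

#!/bin/bash
#PBS -q production
#PBS -N Python_serial
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -V

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Run the Python script and capture its output
python ./myscript.py > myscript.out 2>&1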

R

General Notes

R is a free software environment for statistical computing and graphics. The R language has become a de facto standard among statisticians for the development of statistical software, and is widely used for statistical software development and data analysis. R is available on the following HPC Center servers: Bob, Andy, Karle and Penzias. Karle is the only machine where R can be used without submitting jobs to the PBS manager. On all other systems users must submit their R jobs via the PBS batch scheduler.

In order to use R (version 3.0.2) on Andy and Penzias please load the following modules:

   module load openmpi
   module load mkl
   module load r/3.0.2d

Complete R documentation may be found at http://www.r-project.org/

Running R on Karle

The following is the "Hello World" program written in R:

# Hello World example
a <- c("Hello, world!")
print(a)

To run an R job on the Karle server, save your R script into a file (for example "helloworld.R") and use the following command to launch it:

/share/apps/r/default/bin/R --vanilla --slave < helloworld.R

R GUI

A GUI for R is installed on Karle. To use it, log in to Karle with "ssh -X" and type:

jgr

Running R on cluster machines

In order to run an R job on any of the HPC Center's cluster machines (Bob or Andy), users should use the PBS manager. Submitting a serial R job to the PBS queue is exactly the same as submitting any other serial job.

Consider the above example. To run this simple "hello world" R job, users need a PBS script:

#!/bin/bash
#PBS -q production
#PBS -N R_job
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -V

echo "Starting R job ..."

cd $PBS_O_WORKDIR

/share/apps/r/default/bin/R --vanilla --slave < helloworld.R

echo "R job is finished."

R jobs may also be run in parallel (e.g. with the help of the "multicore" package). To run an SMP-parallel job, the PBS script should be modified as shown here:

#!/bin/bash
#PBS -q production
#PBS -N R_job
#PBS -l select=1:ncpus=8
#PBS -l place=pack
#PBS -V

echo "Starting SMP-parallel R job ..."

cd $PBS_O_WORKDIR

/share/apps/r/default/bin/R --vanilla --slave < myparalleljob.R

echo "R job is finished."

R packages

In order to install an R package, start R and run the following command:

install.packages("package.name")

and pick a mirror from the list. After the package is installed, load it with

library(package.name)

Please note that the following packages are already available on "karle":

locfit
VGAM
network
sna
RGraphics
rgl
DMwR
RMySQL
randomForest
xts
tseries
jgr

RAXML

Randomized Axelerated Maximum Likelihood (RAxML) is a program for sequential and parallel maximum likelihood based inference of large phylogenetic trees. It is a descendant of fastDNAml, which in turn was derived from Joe Felsenstein's DNAml, part of the PHYLIP package. RAxML 7.4.5 is the latest version and is installed at the CUNY HPC Center on ANDY. Version 7.2.8 is also available. RAxML is available in both serial and MPI parallel versions. The MPI-parallel version should be run on four or more cores. Examples of running both parallel and serial jobs are presented below. More information can be found here [66]

To run RAxML first a PHYLIP file of aligned DNA or amino-acid sequences similar to the one shown here must be created. This file, 'alg.phy', is in interleaved format:

5 60
Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAG
Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGG
Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGG
Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGG
Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGG

GAAATGGTCAATATTACAAGGT
GAAATGGTCAACATTAAAAGAT
GAAATCGTCAATATTAAAAGGT
GAAATGGTCAATCTTAAAAGGT
GAAATGGTCAATATTAAAAGGT

For more detail about PHYLIP formatted files, please look at the RAxML manual here [67] at the web site referenced above. There is also a tutorial here [68]

To include all required environment variables and the path to the RAXML executable, run the module load command (the modules utility is discussed in detail above):

module load raxml

Next create a PBS batch script. Below is an example script that will run the serial version of RAxML. The program options -m,-n,-s are all required. In order, they specify the substitution model (-m), the output file name (-n), and the sequence file name (-s). Additional options are discussed in the manual.

#!/bin/bash
#PBS -q production
#PBS -N RAXML_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin RAXML Serial Run ..."
raxmlHPC -y -m GTRCAT -n TEST1 -p 12345 -s alg.phy > raxml_ser.out 2>&1
echo ">>>> End   RAXML Serial Run ..."

This script can be dropped into a file (say raxml_serial.job) and submitted to PBS with the following command:

qsub raxml_serial.job

RAxML produces the following output files:

  1. Parsimony starting tree is written to RAxML_parsimonyTree.TEST1.
  2. Final tree is written to RAxML_result.TEST1.
  3. Execution Log File is written to RAxML_log.TEST1.
  4. Execution information file is written to RAxML_info.TEST1.

RAxML is also available in an MPI-parallel version called raxmlHPC-MPI. The MPI-parallelized version can be run on all types of clusters to perform rapid parallel bootstraps, or multiple inferences on the original alignment. The MPI version is intended for executing large production runs (i.e. 100 or 1,000 bootstraps). You can also perform multiple inferences on larger datasets in parallel to find a best-known ML tree for your dataset. Finally, the novel rapid BS algorithm and the associated ML search have also been parallelized with MPI.

The following MPI script selects 4 processors (cores) and allows PBS to put them on any compute node. Note that when running any parallel program one must be cognizant of the scaling properties of its parallel algorithm; in other words, how much does a given job's run time drop as one doubles the number of processors used? All parallel programs arrive at a point of diminishing returns that depends on the algorithm, the size of the problem being solved, and the performance features of the system on which it is run. We might have chosen to run this job on 8, 16, or 32 processors (cores), but would only do so if the improvement in performance scaled. An improvement of less than 25% after a doubling is an indication that a reasonable maximum number of processors has been reached under that particular set of circumstances.

#!/bin/bash
#PBS -q production
#PBS -N RAXML_mpi
#PBS -l select=4:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Use 'mpirun' and point to the MPI parallel executable to run
echo ">>>> Begin RAXML MPI Run ..."
mpirun -np 4 -machinefile $PBS_NODEFILE raxmlHPC-MPI -m GTRCAT -n TEST2 -s alg.phy -N 4 > raxml_mpi.out 2>&1
echo ">>>> End   RAXML MPI Run ..."
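>>> End   RAXML">

As with the serial case, this script can be dropped into a file (say 'raxml_mpi.job', an example name) and submitted with:

qsub raxml_mpi.job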

This test case should take no more than a minute to run and will produce PBS output and error files beginning with the job name 'RAXML_mpi'. Other RAxML-specific outputs will also be produced. Details on the meaning of the PBS script are covered above in this Wiki's PBS section. The most important lines are '#PBS -l select=4:ncpus=1:mem=2880mb' and '#PBS -l place=free'. The first instructs PBS to select 4 resource 'chunks', each with 1 processor (core) and 2,880 MBs of memory in it, for the job (on ANDY as much as 2,880 MBs per core may be selected). The second line instructs PBS to place this job wherever the least used resources are found (i.e. freely).

The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command. As this is a parallel job, other compute nodes may also be called into service to complete it. Note that the name of the parallel executable is 'raxmlHPC-MPI' and that in this parallel run we complete four inferences (-N 4). The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the PBS spool directory on /var.

SAGE

Sage can be used to study elementary and advanced, pure and applied mathematics. This includes a huge range of mathematics, including basic algebra, calculus, elementary to very advanced number theory, cryptography, numerical computation, commutative algebra, group theory, combinatorics, graph theory, exact linear algebra and much more. Sage is built out of nearly 100 open-source packages and features a unified interface that integrates their functionality into a common experience. It is well-suited for education and research. Details on SAGE, including tutorials on its use are available at [69].

The CUNY HPC Center has installed SAGE 5.7 (version 5.0 is also available) on KARLE, a 24-processor gateway server system that supports a number of interactive and GUI-oriented applications such as MATLAB, SAS, R, and Mathematica. SAGE is not currently installed on our cluster systems and is therefore not available for PBS batch use. If there is interest in this (e.g. submitting large numbers of serial SAGE jobs to the batch queues), users should make a request through the HPC Center email address 'hpchelp@csi.cuny.edu'.

Users with accounts on KARLE can simply start SAGE with:

$sage
----------------------------------------------------------------------
| Sage Version 5.7, Release Date: 2013-02-19 
| Type "notebook()" for the browser-based notebook interface. 
| Type "help()" for help. 
----------------------------------------------------------------------
sage: 
sage: 

SAMTOOLS

SAMTOOLS provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing, and generating alignments in a per-position format. SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM is a compact format that aims to:

  • be flexible enough to store all the alignment information generated by various alignment programs;
  • be simple enough to be easily generated by alignment programs or converted from existing formats;
  • allow most operations on the alignment to work without loading the whole alignment into memory;
  • allow the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.

SAMTOOLS is part of a sequence alignment and analysis tool chain developed at Johns Hopkins, the University of California at Berkeley, and Harvard, and distributed through the Center for Bioinformatics and Computational Biology. The other tools in this collection, BOWTIE, CUFFLINKS, and TOPHAT, are also installed at the CUNY HPC Center. Additional information can be found at the SAMTOOLS home page here [70].

At the CUNY HPC Center, SAMTOOLS version 0.1.18 is installed on ANDY. SAMTOOLS is a collection of utilities for extracting, reformatting, and displaying nucleotide sequences. The primary tool is called 'samtools' and offers a large number of command-line options. For smaller tasks, SAMTOOLS can be run interactively, but it should be run in PBS batch mode when larger, longer tasks are anticipated. NOTE: display tasks cannot be run in pure PBS batch mode because their output must be displayed. Larger display tasks should be run in PBS interactive mode as described in the PBS section elsewhere in this document.
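
An interactive PBS session for such display work can be requested with the '-I' option to 'qsub'; the resource values below are only an example, and full details on interactive mode are given in the PBS section:

qsub -I -q production -l select=1:ncpus=1:mem=2880mb -l place=free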

Below is an example PBS script that will convert the 'toy.sam' file provided with the distribution from SAM to BAM format. This and all the example files can be copied from the local installation directory to your current location as follows:

cp /share/apps/samtools/default/examples/* .

To include all required environmental variables and the path to the SAMTOOLS executables run the modules load command (the modules utility is discussed in detail above):

module load samtools

Running 'samtools' from the interactive prompt without any options will print a brief description of the command-line arguments and options. Here is a PBS batch script that does a short format conversion in batch mode:

#!/bin/bash
#PBS -q production
#PBS -N SAMTLS_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin SAMTLS Serial Run ..."
samtools view -bS toy.sam > toy.bam 2> toy.err
echo ">>>> End   SAMTLS Serial Run ..."

This script can be dropped into a file (say samtools_ser.job) and started with the command:

qsub samtools_ser.job

Running this conversion test case should take less than 1 minute and will produce PBS output and error files beginning with the job name 'SAMTLS_serial'. The primary SAMTOOLS results will be written into the user-specified file named after the greater-than sign on the 'samtools' command line; here it is named 'toy.bam'. The expression '2> toy.err' at the end of the command line directs Unix standard error to the file 'toy.err'. Users should always explicitly specify the name of the application's output files to ensure that they are written directly into the user's working directory, which has much more disk space than the PBS spool directory on /var.
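
Once the conversion completes, the new BAM file can be examined with a few more 'samtools' sub-commands. The short sequence below is only a sketch, run interactively after 'module load samtools', and assumes the 'toy.bam' file produced by the script above (the two-argument 'sort' form shown is the output-prefix syntax of the 0.1.x series):

samtools sort toy.bam toy.sorted        # sort alignments by position; writes toy.sorted.bam
samtools index toy.sorted.bam           # build the .bai index needed for fast region look-ups
samtools view toy.sorted.bam | head     # print the first few alignments back as SAM text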

Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this job on the least busy node where the requested resources can be found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

SAS

SAS (pronounced "sass", originally Statistical Analysis System) is an integrated system of software products provided by SAS Institute Inc. that enables the programmer to perform:

  • data entry, retrieval, management, and mining
  • report writing and graphics
  • statistical analysis
  • business planning, forecasting, and decision support
  • operations research and project management
  • quality improvement
  • applications development
  • data warehousing (extract, transform, load)
  • platform independent and remote computing

In addition, SAS has many business solutions that enable large scale software solutions for areas such as IT management, human resource management, financial management, business intelligence, customer relationship management and more.

SAS software is currently installed on the Neptune server. In order to run it, users need to:

  • Login to the "Karle" server. The procedure is described here. Note that X11 forwarding should be enabled. Read this article for details.
  • Start SAS by typing the "sas_en" command.


Schrödinger

Schrödinger is a software suite comprising routines for computational chemistry, docking, homology modeling, protein x-ray crystallography refinement, bioinformatics, ADME prediction, cheminformatics, enterprise informatics, pharmacophore searching, molecular simulation, quantum mechanics, and materials science.

The Schrödinger suite is currently installed on Karle under

/share/apps/schrodinger/2013-1 

In order to use the GUI-based routines one needs to login to Karle with X-forwarding enabled:

ssh user.name@karle.csi.cuny.edu -X

See here for more details.

After logging in to Karle you need to set up the correct environment for Schrödinger:

export SCHRODINGER=/share/apps/schrodinger/2013-1
export SCHROD_LICENSE_FILE=53000@neptune.csi.cuny.edu

This needs to be done every time you login to Karle. Alternatively, these two lines can be appended to the .bashrc file in your $HOME, as sketched below.
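
For example, the two exports can be appended to '.bashrc' in a single step with a shell here-document; this is just a convenience sketch and only needs to be run once:

cat >> $HOME/.bashrc << 'EOF'
export SCHRODINGER=/share/apps/schrodinger/2013-1
export SCHROD_LICENSE_FILE=53000@neptune.csi.cuny.edu
EOF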

After this you are ready to use the software. For example to start Maestro execute the following command in the terminal:

$SCHRODINGER/maestro -SGL & 

To see what other Schrödinger packages are available on Karle, run:

ls $SCHRODINGER

Extensive documentation on Schrödinger is available here.

Stata/MP

Stata is a complete, integrated statistical package that provides tools for data analysis, data management, and graphics. Stata/MP takes advantage of multiprocessor computers. The CUNY HPC Center is licensed to use Stata on up to 8 cores.

Currently Stata/MP is available for users on Karle (karle.csi.cuny.edu).

Stata can be run in two regimes:

  • using Command Line Interface
  • using GUI

To start a Stata session on Karle:

1) login to "Karle" server. The procedure is described here. Note that to run Stata in GUI mode X11 forwarding should be enabled. Read this article for details.

2) set PATH for your user:

export PATH=$PATH:/share/apps/stata/stata12

3) start Stata using

  • stata-mp for CLI
  • xstata-mp for GUI


4) After Stata is successfully started (in either CLI or GUI mode), a welcome message will be printed to the screen:

./stata-mp 

  ___  ____  ____  ____  ____ (R)
 /__    /   ____/   /   ____/
___/   /   /___/   /   /___/   12.0   Copyright 1985-2011 StataCorp LP
  Statistics/Data Analysis            StataCorp
                                      4905 Lakeway Drive
     MP - Parallel Edition            College Station, Texas 77845 USA
                                      800-STATA-PC        http://www.stata.com
                                      979-696-4600        stata@stata.com
                                      979-696-4601 (fax)

2-user 8-core Stata network perpetual license:
       Serial number:  50120553010
         Licensed to:  CUNY HPCC
                       New York

Notes:
      1.  (-v# option or -set maxvar-) 5000 maximum variables
      2.  Command line editing enabled


.

5) The Stata command prompt '.' is now waiting for input. As an example, consider:

. use /share/apps/stata/stata12/auto.dta

This will load '/share/apps/stata/stata12/auto.dta' into Stata session.

Now Stata routines may be applied to this data:

. describe

Contains data from /share/apps/stata/stata12/auto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          13 Apr 2011 17:45
 size:         3,182                          (_dta has notes)
-------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------
Sorted by:  foreign

. summarize price, detail

                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. Dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188

. 

and so on...


6) Complete documentation on Stata usage can be found under

/share/apps/stata/stata12/docs
  • Users will need to copy pdf documents from this directory to their local workstations.
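
Stata can also be driven non-interactively from the shell, which is convenient for longer analyses. The sketch below assumes a hypothetical do-file named 'myanalysis.do' in the current directory; in Stata's batch mode ('-b') the session output is written to a log file ('myanalysis.log') rather than to the screen:

export PATH=$PATH:/share/apps/stata/stata12
stata-mp -b do myanalysis.do     # runs the do-file to completion and writes myanalysis.log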

Structurama

Structurama is a program for inferring population structure from genetic data. The program assumes that the sampled loci are in linkage equilibrium and that the allele frequencies for each population are drawn from a Dirichlet probability distribution. Two different models for population structure are implemented.

First, Structurama offers the method of Pritchard et al. (2000) in which the number of populations is considered fixed. The program also allows the number of populations to be a random variable following a Dirichlet process prior (Pella and Masuda, 2006; Huelsenbeck and Andolfatto, 2007). Importantly, the program can estimate the number of populations under the Dirichlet process prior. Markov chain Monte Carlo (MCMC) is used to approximate the posterior probability that individuals are assigned to specific populations. Structurama also allows the individuals to be admixed. Structurama implements a number of methods for summarizing the results of a Bayesian MCMC analysis of population structure. Perhaps most interestingly, the program finds the mean partition, a partitioning of individuals among populations that minimizes the squared distance to the sampled partitions. More detailed information about Structurama can be found at the web site here [71] and in the manual here [72].

The February 2014 version of Structurama is installed on ANDY and PENZIAS. Structurama is a serial program with only an interactive command-line interface; therefore, making PBS batch serial runs requires the user to supply, within the PBS batch script, the exact and complete list of commands that an interactive use of the program would have required. In addition to referencing the executable 'st2', a Structurama data file must be present in the PBS working directory. The following PBS batch script shows how this is done using the Unix 'here-document' construction (i.e., <<):

#!/bin/bash
#PBS -q production
#PBS -N STRAMA_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin STRUCTURE RAMA Serial Run ..."
echo ""

st2 << EOF
execute test.inp
yes
quit
EOF

echo ""
echo ">>>> End   STRUCTURE RAMA Serial Run ..."

After the Structurama module is loaded, this script can be dropped into a file (say 'strama_serial.job') and submitted for execution, as follows:


module avail structurama
----------------------------------- /share/apps/modules/default/modulefiles_UserApplications ------------------------------------
structurama/10.30.11        structurama/2.2.14(default)

module load structurama

qsub strama_serial.job

A basic test input file should take less than a minute to run and will produce PBS output and error files beginning with the job name 'STRAMA_serial'. Additional Structurama-specific output files can also be requested. This job will write a Structurama output file called 'strout.p'.

Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that it finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The lines following the reference to the Structurama executable 'st2' show what is required to deliver input to an interactive program in a batch script. The input-equivalent sequence of commands should be placed, one per line, between the first and last 'EOF', which demarcate the entire pseudo-interactive session. NOTE: If you forget to include the final command 'quit', your PBS job will never complete, as it will be waiting for its final termination instructions and will never receive them. Such a job should be deleted with the PBS command 'qdel JID', where JID is the numerical PBS job identification number. If you would like a printout of all the Structurama options, include the line 'help' in your command stream.
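
For instance, a here-document that also prints the option summary before running the analysis might look like the following sketch; everything between the two 'EOF' markers is fed to 'st2' exactly as if it had been typed interactively:

st2 << EOF
help
execute test.inp
yes
quit
EOF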

Structure

The program Structure is a free software package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly-used genetic markers, including SNPs, microsatellites, RFLPs and AFLPs. More detailed information about Structure can be found at the web site here [73].

Version 2.3.3 of Structure is installed on BOB at the CUNY HPC Center. Structure is a serial program. The following PBS batch script shows how to run a single, basic Structure serial job:

#!/bin/bash
#PBS -q production
#PBS -N STRUCT_simple
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Set the root directory for the 'structure' binary
STROOT=/share/apps/structure/default/bin

# Point to the execution directory to run
echo ">>>> Begin STRUCTURE Serial Run ..."
echo ""
${STROOT}/structure -K 1 -m mainparams -i ./sim.str -o ./sim_k1_run1.out
echo ""
echo ">>>> End   STRUCTURE Serial Run ..."

This script can be dropped into a file (say 'struct_serial.job') and submitted for execution using the following PBS command:

qsub struct_serial.job

This test input file should take less than 5 minutes to run and will produce PBS output and error files beginning with the job name 'STRUCT_simple'. Additional Structure-specific output files will also be created, including an output file called 'sim_k1_run1.out_f'. Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=1920mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 1,920 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources are found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

The Structure program requires its own input and data files, properly configured, to run successfully. For the example above these include the input file ('sim.str' above), the 'mainparams' file ('mainparams.10mil.k1' above), and the 'extraparams' file (the default name, 'extraparams' is used in the example above). The user is responsible for configuring these files correctly for each run, but the data files for this example and others can be found in the directory:

/share/apps/structure/default/examples

on BOB.

Often, Structure users are interested in making multiple runs over a large simulation regime-space. This requires appropriately configured input and parameter files for each individual run. Data file configuration can be done manually or with the help of the Python-based tool StrAuto. The HPC Center has installed StrAuto to support making multiple Structure runs. StrAuto is documented at its download site here [74], and all the files, including the primary Python-based tool 'strauto-0.3.1.py', are available in:

/share/apps/strauto/default

In this process, the StrAuto script 'strauto-0.3.1.py' (found in '/share/apps/strauto/default/bin') is run in the presence of a user-created, regime-space configuration file called 'input.py'. This produces a Unix script file called 'runstructure' that can then be used to run the user-defined spectrum of cases, one after another. NOTE: the 'strauto-0.3.1.py' script requires Python 2.7.2 to run correctly. This version is NOT the default version of Python installed on BOB; therefore, users of StrAuto must invoke the 'strauto-0.3.1.py' script using a specially installed version of Python, as follows:

/share/apps/epd/7.3-2/bin/python ./strauto-0.3.1.py

The above command assumes that 'strauto-0.3.1.py' has been copied into the user's directory and that the required 'input.py' file is also present there. The contents of the 'runstructure' file produced can then be integrated into a PBS batch script similar to the simple, single-run script shown above, but designed to run each case in the simulation regime-space in succession. Here is an example of just such a runstructure-adapted PBS script:

#!/bin/bash
#PBS -q production
#PBS -N STRUCT_cmplx
#PBS -l select=1:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

#-----------------------------------------------------------------------------------
# This PBS batch script is based on the 'runstructure' script generated by 
# Vikram Chhatre's setup and pre-processing program 'strauto-0.3.1.py' written
# in Python at Texas A&M University to be used with the 'structure' application.
#
# Each 'runstructure' script is custom-generated by the 'strauto-0.3.1.py' Python
# script based on a custom input file.  It completes a series of runs over a regime defined
# by the 'structure' user for that custom input file only.  This means it will only
# work for that input data file.
#                   Email: crypticlineage (at) tamu.edu                        
#-----------------------------------------------------------------------------------

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Setup a directory structure for the multiple 'structure' runs
mkdir results_f log harvester
mkdir k1
mkdir k2
mkdir k3
mkdir k4
mkdir k5

cd log
mkdir k1
mkdir k2
mkdir k3
mkdir k4
mkdir k5

cd ..

# Set the root directory for the 'structure' binary
STROOT=/share/apps/structure/default/bin

# Point to the execution directory to run
echo ">>>> Begin Multiple STRUCTURE Serial Runs ..."
echo ""

${STROOT}/structure -K 1 -m mainparams -o k1/sim_k1_run1 2>&1 | tee log/k1/sim_k1_run1.log
${STROOT}/structure -K 1 -m mainparams -o k1/sim_k1_run2 2>&1 | tee log/k1/sim_k1_run2.log
${STROOT}/structure -K 1 -m mainparams -o k1/sim_k1_run3 2>&1 | tee log/k1/sim_k1_run3.log
${STROOT}/structure -K 2 -m mainparams -o k2/sim_k2_run1 2>&1 | tee log/k2/sim_k2_run1.log
${STROOT}/structure -K 2 -m mainparams -o k2/sim_k2_run2 2>&1 | tee log/k2/sim_k2_run2.log
${STROOT}/structure -K 2 -m mainparams -o k2/sim_k2_run3 2>&1 | tee log/k2/sim_k2_run3.log
${STROOT}/structure -K 3 -m mainparams -o k3/sim_k3_run1 2>&1 | tee log/k3/sim_k3_run1.log
${STROOT}/structure -K 3 -m mainparams -o k3/sim_k3_run2 2>&1 | tee log/k3/sim_k3_run2.log
${STROOT}/structure -K 3 -m mainparams -o k3/sim_k3_run3 2>&1 | tee log/k3/sim_k3_run3.log
${STROOT}/structure -K 4 -m mainparams -o k4/sim_k4_run1 2>&1 | tee log/k4/sim_k4_run1.log
${STROOT}/structure -K 4 -m mainparams -o k4/sim_k4_run2 2>&1 | tee log/k4/sim_k4_run2.log
${STROOT}/structure -K 4 -m mainparams -o k4/sim_k4_run3 2>&1 | tee log/k4/sim_k4_run3.log
${STROOT}/structure -K 5 -m mainparams -o k5/sim_k5_run1 2>&1 | tee log/k5/sim_k5_run1.log
${STROOT}/structure -K 5 -m mainparams -o k5/sim_k5_run2 2>&1 | tee log/k5/sim_k5_run2.log
${STROOT}/structure -K 5 -m mainparams -o k5/sim_k5_run3 2>&1 | tee log/k5/sim_k5_run3.log

# Consolidate all results in a single 'zip' file
mv k1 k2 k3 k4 k5  results_f/
cd results_f/
cp k*/*_f . && zip sim_Harvester-Upload.zip *_f && rm *_f
mv sim_Harvester-Upload.zip ../harvester/
cd ..

echo ""
echo ">>>> Zip Archive: sim_Harvester-Upload.zip is Ready ... "
echo ">>>> End  Multiple  STRUCTURE Serial Runs ..."

This script can be dropped into a file (say 'struct_cmplx.job') and submitted for execution using the following PBS command:

qsub struct_cmplx.job

The 'struct_cmplx.job' script runs one Structure job after another, each with a slightly different set of input parameters. All the associated files and directories from a successful StrAuto-supported run of Structure using this script can be found on BOB in:

/share/apps/strauto/default/examples

TOPHAT

TOPHAT is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. TOPHAT is part of a sequence alignment and analysis tool chain developed at Johns Hopkins, the University of California at Berkeley, and Harvard, and distributed through the Center for Bioinformatics and Computational Biology. The other tools in this collection, BOWTIE, CUFFLINKS, and SAMTOOLS, are also installed at the CUNY HPC Center. Additional information can be found at the TOPHAT home page here [75].

At the CUNY HPC Center, TOPHAT version 2.0.7 is installed on ANDY. TOPHAT is a parallel threaded code (pthreads) that takes its input from a simple text file provided on the command line. Below is an example PBS script that will run the mRNA test case provided with the distribution; the test files can be copied from the local installation directory to your current location as follows:

cp /share/apps/tophat/default/examples/* .

To include all required environmental variables and the path to the TOPHAT executable run the modules load command (the modules utility is discussed in detail above):

module load tophat

Running 'tophat' from the interactive prompt without any options will print a brief description of the command-line arguments and options. Here is a PBS batch script that builds the index and aligns the sequences in serial mode:

#!/bin/bash
#PBS -q production
#PBS -N TOPHAT_serial
#PBS -l select=1:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin TOPHAT Serial Run ..."
tophat -r 20 test_ref reads_1.fq reads_2.fq > tophat_mrna.out 2>&1
echo ">>>> End   TOPHAT Serial Run ..."

This script can be dropped into a file (say tophat_ser.job) and started with the command:

qsub tophat_ser.job

Running the mRNA test case should take less than 2 minutes and will produce PBS output and error files beginning with the job name 'TOPHAT_serial'. The primary TOPHAT application results will be written into the user-specified file at the end of the TOPHAT command line after the greater-than sign; here it is named 'tophat_mrna.out'. The expression '2>&1' at the end of the command line combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the PBS spool directory on /var.

Details on the meaning of the PBS script are covered below in the PBS section. The most important lines are the '#PBS -l select=1:ncpus=1:mem=2880mb' and the '#PBS -l place=free' lines. The first instructs PBS to select 1 resource 'chunk' with 1 processor (core) and 2,880 MBs of memory in it for the job. The second instructs PBS to place this job wherever the least used resources can be found (freely). The master compute node that PBS finally selects to run your job will be printed in the PBS output file by the 'hostname' command.

To run TOPHAT in parallel-threads mode several changes to the script are required. Here is a modified script that shows how to run TOPHAT using two threads. ANDY has as many as 8 physical compute cores per compute node, so as many as 8 cores-threads might be chosen. Once a parallel job starts it will generally (not always) complete in less time, but jobs requesting a larger number of cores-threads or more memory per node may wait longer to start on a busy system as PBS looks for a compute node with all the resources requested.

#!/bin/bash
#PBS -q production
#PBS -N TOPHAT_threads
#PBS -l select=1:ncpus=2:mem=5760mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
hostname

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Point to the execution directory to run
echo ">>>> Begin TOPHAT Threaded Run ..."
tophat -p 2 -r 20 test_ref reads_1.fq reads_2.fq > tophat_thrds.out 2>&1
echo ">>>> End   TOPHAT Threaded Run ..."

Notice the difference in the '-l select' line where the resource 'chunk' now includes 2 cores (ncpus=2) and requests twice as much memory as before. Also, notice that the TOPHAT command-line now includes the '-p 2' option to run the code with 2 threads working in parallel. Perfectly or 'embarrassingly' parallel workloads can run close to 2, 4, or more times as fast as the same workload in serial mode depending on the number of threads requested, but workloads cannot be counted on to be perfectly parallel.

The speedups that you observe will typically be less than perfect and diminish as you ask for more cores-threads. Large data jobs will typically scale more efficiently as you add cores-threads, but users should take note of the performance gains that they see as cores-threads are added and select a core-thread count that provides efficient scaling and avoids diminishing returns.
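
A simple way to gauge this is to prefix the TOPHAT command in the batch script with the shell's 'time' keyword and compare the elapsed ('real') times reported in the PBS error file across runs made with different '-p' values. This is only a sketch based on the threaded script above:

time tophat -p 2 -r 20 test_ref reads_1.fq reads_2.fq > tophat_thrds.out 2>&1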

Thrust Library (CUDA)

Thrust is a C++ template library for CUDA based on the Standard Template Library (STL). Thrust allows you to implement high performance parallel applications with minimal programming effort through a high-level interface that is fully interoperable with CUDA C. As of CUDA 4.1, Thrust has been integrated into the default CUDA distribution. The HPC Center currently runs CUDA 5.5 as the default on PENZIAS, which includes the Thrust library.

Thrust provides a rich collection of data parallel primitives such as scan, sort, and reduce, which can be combined together to implement complex algorithms with concise, readable source code. By describing your computation in terms of these high-level abstractions you provide Thrust with the freedom to select the most efficient implementation automatically. As a result, Thrust can be utilized in rapid prototyping of CUDA applications, where programmer productivity matters most, as well as in production, where robustness and absolute performance are crucial.

More detail on the Thrust library is available here [76]. There is a collection of example codes here [77]. The Thrust manual is available here [78].

Here is a basic C++ example code, which creates and fills a vector on the Host, resizes it, copies it to the Device, modifies it there, and prints out the modified values.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <iostream>

int main(void)
{
    // H has storage for 4 integers
    thrust::host_vector<int> H(4);

    // initialize individual elements
    H[0] = 14;
    H[1] = 20;
    H[2] = 38;
    H[3] = 46;
    
    // H.size() returns the size of vector H
    std::cout << "H has size " << H.size() << std::endl;

    // print contents of H
    for(int i = 0; i < H.size(); i++)
        std::cout << "H[" << i << "] = " << H[i] << std::endl;

    // resize H
    H.resize(2);
    
    std::cout << "H now has size " << H.size() << std::endl;

    // Copy host_vector H to device_vector D
    thrust::device_vector<int> D = H;
    
    // elements of D can be modified
    D[0] = 99;
    D[1] = 88;
    
    // print contents of D
    for(int i = 0; i < D.size(); i++)
        std::cout << "D[" << i << "] = " << D[i] << std::endl;

    // H and D are automatically deleted when the function returns
    return 0;
}

Assuming this source file is called 'vectcopy.cu', it can be compiled on PENZIAS with:

nvcc -o vectcopy.exe vectcopy.cu

Once compiled, the 'vectcopy.exe' executable can be run using the following PBS script:

#!/bin/bash
#PBS -q production_gpu
#PBS -N THRUST_vcopy
#PBS -l select=1:ncpus=1:ngpus=1 
#PBS -l place=free
#PBS -V

# Find out which compute node the job is using
echo ""
echo -n "Running job on compute node ... " 
hostname

echo ""
echo "PBS node file is located here ... "  $PBS_NODEFILE
echo -n "PBS node file contains ... "
cat  $PBS_NODEFILE
echo ""

# Change to working directory
cd $PBS_O_WORKDIR

# Running executable on a single, gpu-enabled
# compute node using 1 CPU and 1 GPU.
echo "CUDA job is starting ... "
echo ""

./vectcopy.exe

echo ""
echo "CUDA job is done!"

VMD

VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting. It was developed by The Theoretical and Computational Biophysics Group at the University of Illinois. It is documented on the TCB's homepage.

VMD is installed on Karle. To use its command-line interface, login to Karle as usual and start VMD by typing "vmd" followed by return, or alternatively use the full path: "/share/apps/vmd/default/bin/vmd".

In order to use VMD in GUI-mode, login to Karle with -X option (see this article for details) and start VMD as described above.

WRF

The Weather Research and Forecasting (WRF) model is a specific computer program with dual use for both weather forecasting and weather research. It was created through a partnership that includes the National Oceanic and Atmospheric Administration (NOAA), the National Center for Atmospheric Research (NCAR), and more than 150 other organizations and universities in the United States and abroad. WRF is the latest numerical model and application to be adopted by NOAA's National Weather Service as well as the U.S. military and private meteorological services. It is also being adopted by government and private meteorological services worldwide.

There are two distinct WRF development trees and versions, one for production forecasting and another for research and development. NCAR's experimental, advanced research version, called ARW (Advanced Research WRF), features very high resolution and is being used to explore ways of improving the accuracy of hurricane tracking, hurricane intensity, and rainfall forecasts, among a host of other meteorological questions. It is ARW version 3.4.1, along with its pre- and post-processing modules (WPS and WPP), and the MET and GrADS display tools that are supported here at the CUNY HPC Center. ARW version 3.4.1 is supported on both the CUNY HPC Center SGI (ANDY) and Cray (SALK) systems. Versions 3.3.0 and 3.4.0 are also still installed and available if needed. The CUNY HPC Center build of 3.4.1 includes the NCAR Command Language (NCL) tools on both SALK and ANDY.

A complete start-to-finish use of ARW requires a significant number of steps in pre-processing, parallel production modeling, and post-processing and display. There are several alternative paths that can be taken through each stage. In particular, ARW itself offers users the ability to process either real or idealized weather data. Completing one type of simulation or the other requires different steps and even different user-compiled versions of the ARW executable. To help our users familiarize themselves with running ARW at the CUNY HPC Center, the steps required to complete a start-to-finish, real-case forecast are presented below. For more complete coverage, the CUNY HPC Center recommends that new users study the detailed description of the ARW package and how to use it at the University Corporation for Atmospheric Research (UCAR) website here [79].

WRF Pre-Processing with WPS

The WPS part of the WRF package is responsible for mapping time-equals-zero simulation input data onto the simulation domain's terrain. This process involves the execution of the preprocessing applications geogrid.exe, ungrib.exe, and metgrid.exe. Each of these applications reads its input parameters from the 'namelist.wps' input specifications file.

NOTE: In general, these steps do not take much processing time; however, in some cases they may. When users discover that pre-processing steps are running longer than five minutes as interactive jobs on the head node of either ANDY or SALK, they should instead be run as batch jobs. HPC Center staff may decide to kill long-running interactive pre-processing steps if they are slowing head node performance.

In the example presented here, we will run a weather simulation based on input data provided from January of 2000 for the eastern United States. These steps should work both on ANDY and SALK with minor differences as noted. To begin this example, create a working WPS directory and copy the test case namelist file into it.

mkdir -p $HOME/wrftest/wps
cd $HOME/wrftest/wps
cp /share/apps/wrf/default/WPS/namelist.wps .

Next, you should edit the 'namelist.wps' to point to the sample data made available in the WRF installation tree. This involves making sure that the 'geog_data_path' assignment in the geogrid section of the namelist file points to the sample data tree. From an editor make the following assignment:

geog_data_path = '/share/apps/wrf/default/WPS_DATA/geog'

Once this is completed, you must symbolically link or copy the geogrid data table directory to your working directory ($HOME/wrftest/wps here).

ln -sf /share/apps/wrf/default/WPS/geogrid ./geogrid

Now, you can run 'geogrid.exe', the geogrid executable, which defines the simulation domains and interpolates the various terrestrial data sets between the model's grid lines. The global environment on ANDY has been set to include the path to all the WRF-related executables including 'geogrid.exe'. On SALK, you must load the WRF module ('module load wrf') first to set the environment. The geogrid executable is an MPI parallel program which could be run in parallel as part of a PBS batch script to complete the combined WRF preprocessing and execution steps, but often it runs only a short while and can be run interactively on ANDY's head node before submitting a full WRF batch job.

First you will have to load the WRF module with:

module load wrf

Once this is done from the $HOME/wrftest/wps working directory run:

geogrid.exe > geogrid.out

On Salk (Cray system) you will have to run:

 aprun -n 1 geogrid.exe > geogrid.out

Note that 'geogrid.exe' is an MPI program and can be run in parallel. Long running WRF pre-processing jobs should be run either with more cores per node interactively as above (with -n 8, or -n 16) or as complete PBS batch jobs, so that SALK's interactive nodes are not held by long running jobs.
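
As a sketch only, a minimal PBS batch job for a longer-running 'geogrid.exe' step on SALK might look like the script below. The job name and the 8-core count are illustrative, and the script assumes the WRF module has been loaded in the submitting shell so that '#PBS -V' carries the resulting environment over to the job:

#!/bin/bash
#PBS -q production
#PBS -N wps_geogrid
#PBS -l select=8:ncpus=1
#PBS -l place=free
#PBS -V

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

echo ">>>> Running geogrid.exe in batch ..."
aprun -n 8 geogrid.exe > geogrid.out
echo ">>>> Finished geogrid.exe batch run ..."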

Two domain files should be produced (geo_em.d01.nc geo_em.d02.nc) for this basic test case, as well as a log and output file which indicates success at the end with:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!  Successful completion of geogrid.        !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The next required preprocessing step is to run 'ungrib.exe', the ungrib executable. The purpose of ungrib is to unpack 'GRIB' ('GRIB1' and 'GRIB2') meteorological data and pack it into an intermediate file format usable by 'metgrid.exe' in the final preprocessing step.

The data for the January 2000 simulation being documented here has already been downloaded and placed in the WRF installation tree in /share/apps/wrf/default/WPS_DATA. Before running 'ungrib.exe', the WRF installation 'Vtable' file must first be symbolically linked into the working directory with:

$ln -sf /share/apps/wrf/default/WPS/ungrib/Variable_Tables/Vtable.AWIP Vtable
$ls
geo_em.d01.nc  geo_em.d02.nc  geogrid  geogrid.log  namelist.wps  Vtable

The Vtable file specifies which fields to unpack from the GRIB files. The Vtables list the fields and their GRIB codes that must be unpacked. For this test case the required Vtable file has already been defined, but users may have to construct a custom Vtable file for their data.

Next, the GRIB files themselves must also be symbolically linked into the working directory. WRF provides a script to do this.

$link_grib.csh /share/apps/wrf/default/WPS_DATA/JAN00/2000012
$ls
geo_em.d01.nc  geogrid      GRIBFILE.AAA  GRIBFILE.AAC  GRIBFILE.AAE  GRIBFILE.AAG  GRIBFILE.AAI  GRIBFILE.AAK  GRIBFILE.AAM  namelist.wps
geo_em.d02.nc  geogrid.log  GRIBFILE.AAB  GRIBFILE.AAD  GRIBFILE.AAF  GRIBFILE.AAH  GRIBFILE.AAJ  GRIBFILE.AAL  GRIBFILE.AAN  Vtable

Note 'ls' shows that the 'GRIB' files are now present.

Next, two more edits to the 'namelist.wps' file are required: one to set the start and end dates for the simulation to our January 2000 time frame, and a second to set 'interval_seconds', the number of seconds between the time-varying meteorological input files (21600 seconds / 3600 = 6 hours in this case). Edit the 'namelist.wps' file by setting the following in the shared section of the file:

 start_date = '2000-01-24_12:00:00','2000-01-24_12:00:00',
 end_date   = '2000-01-25_12:00:00','2000-01-25_12:00:00',
interval_seconds = 21600

Now you can run 'ungrib.exe' to create the intermediate files required by 'metgrid.exe':

$ungrib.exe > ungrib.out
$ls
FILE:2000-01-24_12  FILE:2000-01-25_06  geo_em.d02.nc  GRIBFILE.AAA  GRIBFILE.AAD  GRIBFILE.AAG  GRIBFILE.AAJ  GRIBFILE.AAM  ungrib.log
FILE:2000-01-24_18  FILE:2000-01-25_12  geogrid        GRIBFILE.AAB  GRIBFILE.AAE  GRIBFILE.AAH  GRIBFILE.AAK  GRIBFILE.AAN  ungrib.out
FILE:2000-01-25_00  geo_em.d01.nc       geogrid.log    GRIBFILE.AAC  GRIBFILE.AAF  GRIBFILE.AAI  GRIBFILE.AAL  namelist.wps  Vtable

Note that 'ungrib.exe', unlike the other pre-processing tools mentioned here, is NOT an MPI parallel program and for larger WRF jobs can run for a fairly long time. Long running 'ungrib.exe' pre-processing jobs should be run as complete PBS batch jobs, so that SALK's interactive nodes are not held for hours at a time.

After a successful 'ungrib.exe' run you should get the familiar message at the end of the output file:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Successful completion of ungrib.!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Like geogrid, the metgrid executable, 'metgrid.exe' needs to be able to find its table directory in the preprocessing working directory. The metgrid table directory may either be copied or symbolically linked into the working directory location.

ln -sf /share/apps/wrf/default/WPS/metgrid ./metgrid

Finally, all the files required for a successful run of 'metgrid.exe' have been provided. Like 'geogrid.exe', 'metgrid.exe' is an MPI parallel program that could be run in PBS batch mode, but often runs for only a short time and can be run on ANDY's head node, as follows:

$metgrid.exe > metgrid.out 
$ls
FILE:2000-01-24_12  geogrid       GRIBFILE.AAF  GRIBFILE.AAM                       met_em.d02.2000-01-24_12:00:00.nc  metgrid.out
FILE:2000-01-24_18  geogrid.log   GRIBFILE.AAG  GRIBFILE.AAN                       met_em.d02.2000-01-24_18:00:00.nc  namelist.wps
FILE:2000-01-25_00  GRIBFILE.AAA  GRIBFILE.AAH  met_em.d01.2000-01-24_12:00:00.nc  met_em.d02.2000-01-25_00:00:00.nc  ungrib.log
FILE:2000-01-25_06  GRIBFILE.AAB  GRIBFILE.AAI  met_em.d01.2000-01-24_18:00:00.nc  met_em.d02.2000-01-25_06:00:00.nc  ungrib.out
FILE:2000-01-25_12  GRIBFILE.AAC  GRIBFILE.AAJ  met_em.d01.2000-01-25_00:00:00.nc  met_em.d02.2000-01-25_12:00:00.nc  Vtable
geo_em.d01.nc       GRIBFILE.AAD  GRIBFILE.AAK  met_em.d01.2000-01-25_06:00:00.nc  metgrid
geo_em.d02.nc       GRIBFILE.AAE  GRIBFILE.AAL  met_em.d01.2000-01-25_12:00:00.nc  metgrid.log

If you are on SALK (Cray XE6), you will have to run:

 aprun -n 1 metgrid.exe > metgrid.out

Note that 'metgrid.exe' is an MPI program and can be run in parallel. Long running WRF pre-processing jobs should be run either with more cores per node interactively as above (with -n 8, or -n 16) or as complete PBS batch jobs, so that SALK's interactive nodes are not held by long running jobs.

Successful runs will produce an output file that includes:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!  Successful completion of metgrid.  !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Note that the met files required by WRF are now present (see the 'ls' output above). At this point, the preprocessing phase of this WRF sample run is complete. We can move on to actually running this real (not ideal) WRF test case using the PBS Pro batch scheduler in MPI parallel mode.

Running a WRF Real Case in Parallel Using PBS

Our frame of reference now turns to running 'real.exe' and 'wrf.exe' in parallel on ANDY or SALK via PBS Pro. As you perhaps noticed in walking through the preprocessing steps above, the preprocessing files are all installed in their own subdirectory (WPS) under the WRF installation tree root (/share/apps/wrf/default). The same is true for the files to run WRF. They reside under the WRF install root in the 'WRFV3' subdirectory.

Within this 'WRFV3' directory, the 'run' subdirectory contains all the common files needed for a 'wrf.exe' run except the 'met' files that were just created in the preprocessing section above and those that are produced by 'real.exe', which is run before 'wrf.exe' in real-data weather forecasts.

Note that the ARW version of WRF allows one to produce a number of different executables depending on the type of run that is needed. Here, we are relying on the fact that the 'em_real' version of the code has already been built. Currently, the CUNY HPC Center has only compiled this version of WRF. Other versions can be compiled upon request. The subdirectory 'test' underneath the 'WRFV3' directory contains additional subdirectories for each type of WRF build (em_real, em_fire, em_hill2d_x, etc.).

To complete an MPI parallel run of this WRF real data case, a 'wrfv3/run' working directory for your run should be created, and it must be filled with the required files from the installation root's 'run' directory, as follows:

$cd $HOME/wrftest
$mkdir -p wrfv3/run
$cd wrfv3/run
$cp /share/apps/wrf/default/WRFV3/run/* .
$rm *.exe
$
$ls
CAM_ABS_DATA       ETAMPNEW_DATA.expanded_rain      LANDUSE.TBL            ozone_lat.formatted   RRTM_DATA_DBL      SOILPARM.TBL  URBPARM_UZE.TBL
CAM_AEROPT_DATA    ETAMPNEW_DATA.expanded_rain_DBL  MPTABLE.TBL            ozone_plev.formatted  RRTMG_LW_DATA      tr49t67       VEGPARM.TBL
co2_trans          GENPARM.TBL                      namelist.input         README.namelist       RRTMG_LW_DATA_DBL  tr49t85
ETAMPNEW_DATA      grib2map.tbl                     namelist.input.backup  README.tslist         RRTMG_SW_DATA      tr67t85
ETAMPNEW_DATA_DBL  gribmap.txt                      ozone.formatted        RRTM_DATA             RRTMG_SW_DATA_DBL  URBPARM.TBL
$

Note that the '*.exe' files were removed in the sequence above after the copy because they are already pointed to by ANDY's and SALK's system PATH variable.

Next, the 'met' files produced during the preprocessing phase above need to be copied or symbolically linked into the 'wrfv3/run' directory.

$
$pwd
/home/guest/wrftest/wrfv3/run
$
$cp ../../wps/met_em* .
$ls
CAM_ABS_DATA                     grib2map.tbl                       namelist.input         RRTM_DATA_DBL      tr67t85
CAM_AEROPT_DATA                  gribmap.txt                        namelist.input.backup  RRTMG_LW_DATA      URBPARM.TBL
co2_trans                        LANDUSE.TBL                        ozone.formatted        RRTMG_LW_DATA_DBL  URBPARM_UZE.TBL
ETAMPNEW_DATA                    met_em.d01.2000-01-24_12:00:00.nc  ozone_lat.formatted    RRTMG_SW_DATA      VEGPARM.TBL
ETAMPNEW_DATA_DBL                met_em.d01.2000-01-25_12:00:00.nc  ozone_plev.formatted   RRTMG_SW_DATA_DBL
ETAMPNEW_DATA.expanded_rain      met_em.d02.2000-01-24_12:00:00.nc  README.namelist        SOILPARM.TBL
ETAMPNEW_DATA.expanded_rain_DBL  met_em.d02.2000-01-25_12:00:00.nc  README.tslist          tr49t67
GENPARM.TBL                      MPTABLE.TBL                        RRTM_DATA              tr49t85
$

The user may need to edit the WRF 'namelist.input' file to craft the exact job they wish to run. The default namelist file copied into our working directory is in large part what is needed for this test run, but we will reduce the total simulation time (for the weather model, not the job) from 12 hours to 1 hour by setting the 'run_hours' variable to 1, as sketched below.
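
In the &time_control section of 'namelist.input' this amounts to an edit along the following lines; only the run-length entries are shown here, and everything else in the copied file is left as-is:

 &time_control
 run_days    = 0,
 run_hours   = 1,
 run_minutes = 0,
 run_seconds = 0,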

At this point we are ready to submit a PBS job. The PBS Pro batch script below first runs 'real.exe' which creates the WRF input files 'wrfbdy_d01' and 'wrfinput_d01', and then runs 'wrf.exe' itself. Both executables are MPI parallel programs, and here they are both run on 16 processors. Here is the 'wrftest.job' PBS script that will run on ANDY:

#!/bin/bash
#PBS -q production_qdr
#PBS -N wrf_realem
#PBS -l select=16:ncpus=1:mem=2880mb
#PBS -l place=free
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
echo ""
hostname
echo ""

# Find out the contents of the PBS node file which names the node
# allocated by PBS
echo -n ">>>> PBS Node file contains: "
echo ""
cat $PBS_NODEFILE
echo ""

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Just point to the pre-processing executable to run
echo ">>>> Running REAL.exe executable ..."
mpirun -np 16  -machinefile $PBS_NODEFILE /share/apps/wrf/default/WRFV3/run/real.exe
echo ">>>> Running WRF.exe executable ..."
mpirun -np 16  -machinefile $PBS_NODEFILE /share/apps/wrf/default/WRFV3/run/wrf.exe
echo ">>>> Finished WRF test run ..."

The full path to each executable is used for illustrative purposes, but both binaries (real.exe and wrf.exe) are in the WRF install tree run directory and would be picked up from the system PATH environmental variable without the full path. This job requests 16 resource chunks, each with 1 processor and 2880 MBytes of memory. This job asks to be run on the QDR InfiniBand (faster interconnect) side of the ANDY system. Details on the use and meaning of the PBS option section of the job are available elsewhere in the CUNY HPC Wiki.

To submit the job type:

qsub wrftest.job

A slightly different version of the script is required to run the same job on SALK (the Cray):

#!/bin/bash
#PBS -q production
#PBS -N wrf_realem
#PBS -l select=16:ncpus=1
#PBS -l place=free
#PBS -j oe
#PBS -o wrf_test16_O1.out
#PBS -V

# Find out name of master execution host (compute node)
echo -n ">>>> PBS Master compute node is: "
echo ""
hostname
echo ""

# Find out the contents of the PBS node file which names the node
# allocated by PBS
echo -n ">>>> PBS Node file contains: "
echo ""
cat $PBS_NODEFILE
echo ""

# You must explicitly change to the working directory in PBS
cd $PBS_O_WORKDIR

# Tune some MPICH parameters on the Cray
export MALLOC_MMAP_MAX=0
export MALLOC_TRIM_THRESHOLD=536870912
export MPICH_RANK_ORDER=3

# Just point to the pre-processing executable to run
echo ">>>> Running REAL.exe executable ..."
aprun -n 16  /share/apps/wrf/default/WRFV3/run/real.exe
echo ">>>> Running WRF.exe executable ..."
aprun -n 16  /share/apps/wrf/default/WRFV3/run/wrf.exe
echo ">>>> Finished WRF test run ..."

A successful run on either ANDY or SALK will produce an 'rsl.out' and an 'rsl.error' file for each processor on which the job ran, so for this test case there will be 16 of each. The 'rsl.out' files reflect the run settings requested in the namelist file and then time-stamp the progress the job makes until the total simulation time is completed. The tail end of an 'rsl.out' file for a successful run should look like this:

:
:
v
Timing for main: time 2000-01-24_12:45:00 on domain   1:    0.06060 elapsed seconds.
Timing for main: time 2000-01-24_12:48:00 on domain   1:    0.06300 elapsed seconds.
Timing for main: time 2000-01-24_12:51:00 on domain   1:    0.06090 elapsed seconds.
Timing for main: time 2000-01-24_12:54:00 on domain   1:    0.06340 elapsed seconds.
Timing for main: time 2000-01-24_12:57:00 on domain   1:    0.06120 elapsed seconds.
Timing for main: time 2000-01-24_13:00:00 on domain   1:    0.06330 elapsed seconds.
 d01 2000-01-24_13:00:00 wrf: SUCCESS COMPLETE WRF
taskid: 0 hostname: gpute-2
taskid: 0 hostname: gpute-2

Post-Processing and Displaying WRF Results

Xmgrace

Grace is a WYSIWYG 2D plotting tool for the X Window System and M*tif. Xmgrace is developed at the Plasma Laboratory, Weizmann Institute of Science. More information about its capabilities can be found at the web page http://plasma-gate.weizmann.ac.il/Grace/

Grace is installed on Karle. To use its command-line interface, login to Karle as usual and start Grace by typing "xmgrace" followed by return, or alternatively use the full path: "/share/apps/xmgrace/default/grace/bin/xmgrace". In order to use Grace in GUI mode, login to Karle with the -X option (see this article for details) and start Xmgrace as described above.

MET (Model Evaluation Tools)

MET was developed by the National Center for Atmospheric Research (NCAR) Developmental Testbed Center (DTC) through the generous support of the U.S. Air Force Weather Agency (AFWA) and the National Oceanic and Atmospheric Administration (NOAA).

Description

MET is designed to be a highly-configurable, state-of-the-art suite of verification tools. It was developed using output from the Weather Research and Forecasting (WRF) modeling system but may be applied to the output of other modeling systems as well.

MET provides a variety of verification techniques, including:

  • Standard verification scores comparing gridded model data to point-based observations
  • Standard verification scores comparing gridded model data to gridded observations
  • Spatial verification methods comparing gridded model data to gridded observations using neighborhood, object-based, and intensity-scale decomposition approaches
  • Probabilistic verification methods comparing gridded model data to point-based or gridded observations


Usage

MET is a collection of components, each of which requires its own input deck and generates output upon a successful run.


1. PB2NC. This tool is used to create NetCDF files from input PrepBufr files containing point observations.

  • Input: One PrepBufr point observation file and one configuration file.
  • Output: One NetCDF file containing the observations that have been retained.

2. ASCII2NC tool is used to create NetCDF files from input ASCII point observations. These NetCDF files are then used in the statistical analysis step.

  • Input: One ASCII point observation file that has been formatted as expected.
  • Output: One NetCDF file containing the reformatted observations.

3. Pcp-Combine Tool (optional) accumulates precipitation amounts into the time interval selected by the user – if a user would like to verify over a different time interval than is included in their forecast or observational dataset.

  • Input: Two or more gridded model or observation files in GRIB1 format containing accumulated precipitation to be combined to create a new accumulation interval.
  • Output: One NetCDF file containing the summed accumulation interval.

4. Gen-Poly-Mask Tool will create a bitmapped masking area from a user specified polygon, i.e. a text file containing a series of latitudes / longitudes. This mask can then be used to efficiently limit verification to the interior of a user specified region.

  • Input: One gridded model or observation file in GRIB1 format and one ASCII file defining a Lat/Lon masking polyline.
  • Output: One NetCDF file containing a bitmap for the masking region defined by the polyline over the domain of the gridded input file.

5. Point-Stat Tool is used for grid-to-point verification, or verification of a gridded forecast field against a point-based observation (i.e., surface observing stations, ACARS, rawinsondes, and other observation types that could be described as a point observation).

  • Input: One model file either in GRIB1 format or in the NetCDF format output from the Pcp-Combine tool, at least one point observation file in NetCDF format (as the output of the PB2NC or ASCII2NC tool), and one configuration file.
  • Output: One STAT file containing all of the requested line types, and several ASCII files for each line type requested.

6. Grid-Stat Tool produces traditional verification statistics when a gridded field is used as the observational dataset.

  • Input: One model file and one observation file either in GRIB1 format or in the NetCDF format output from the Pcp-Combine tool, and one configuration file.
  • Output: One STAT file containing all of the requested line types, several ASCII files for each line type requested, and one NetCDF file containing the matched pair data and difference field for each verification region and variable type/level being verified.


7. The MODE (Method for Object-based Diagnostic Evaluation) tool also uses gridded fields as observational datasets.

  • Input: One model file and one observation file either in GRIB1 format or in the NetCDF format output from the Pcp-Combine tool, and one or two configuration files.
  • Output: One ASCII file containing contingency table counts and statistics, one ASCII file containing single and pair object attribute values, one NetCDF file containing object indices for the gridded simple and cluster object fields, and one PostScript plot containing a summary of the features-based verification.

8. The Wavelet-Stat tool decomposes two-dimensional forecasts and observations according to the Intensity-Scale verification technique described by Casati et al. (2004).

  • Input: One model file and one gridded observation file either in GRIB1 format or in the NetCDF format output from the Pcp-Combine tool, and one configuration file.
  • Output: One STAT file containing the 'ISC' line type, one ASCII file containing intensity-scale information and statistics, one NetCDF file containing information about the wavelet decomposition of forecast and observed fields and their differences, and one PostScript file containing plots and summaries of the intensity-scale verification.

9. The Stat-Analysis tool reads the STAT output of Point-Stat, Grid-Stat, and Wavelet-Stat and can be used to filter the STAT data and produce aggregated continuous and categorical statistics.

  • Input: One or more STAT files output from the Point-Stat and/or Grid-Stat tools and, optionally, one configuration file containing specifications for the analysis job(s) to be run on the STAT data.
  • Output: ASCII output of the analysis jobs will be printed to the screen unless redirected to a file using the "-out" option.

10. The MODE-Analysis tool reads the ASCII output of the MODE tool and can be used to produce summary information about object location, size, and intensity (as well as other object characteristics) across one or more cases.

  • Input: One or more MODE object statistics files from the MODE tool and, optionally, one configuration file containing specification for the analysis job(s) to be run on the object data.
  • Output: ASCII output of the analysis jobs will be printed to the screen unless redirected to a file using the "-out" option.


Detailed documentation of all MET tools can be found at http://www.dtcenter.org/met/users/docs/overview.php

Running MET at Andy with PBS

MET tools are available under

/share/apps/met/default/bin

As an example of running the MET tools on ANDY, consider the following. We will run gen_poly_mask, which requires two input files; they can be taken from the '/share/apps/met/default/data' directory.

mkdir ~/met_test
cd ~/met_test
cp /share/apps/met/default/data/poly/CONUS.poly ./
cp /share/apps/met/default/data/sample_fcst/2005080700/wrfprs_ruc13_24.tm00_G212 ./

Now one needs to construct a PBS script that will send the job to a PBS queue. Use your favorite text editor and create a file "sendpbs" with the following content:

#!/bin/bash
# Simple MPI PBS Pro batch job
#PBS -N testMET
#PBS -q production
#PBS -l select=1:ncpus=1:mpiprocs=1
#PBS -l place=free
#PBS -V

cd $PBS_O_WORKDIR

export METHOME=/share/apps/met/default

echo "*** Running Gen-Poly-Mask to generate a polyline mask file for the Continental United States ***"
$METHOME/bin/gen_poly_mask ./wrfprs_ruc13_24.tm00_G212 CONUS.poly CONUS_poly.nc -v 2
echo "*** Job is done! ***"

Submit the job using

qsub sendpbs
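The job can be monitored with the usual PBS commands while it is queued or running, and the output files can be examined once it finishes; for example:

 qstat -u $USER           # show your queued and running jobs
 ls testMET.o* testMET.e* # PBS writes stdout/stderr to these files when the job ends
 cat testMET.o*           # inspect the captured Gen-Poly-Mask output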

Upon successful completion, three files will be generated:

  • testMET.eXXXX -- file with stderr. It should be empty if everything went right.
  • testMET.oXXXX -- file with stdout. In this example it should contain the following:
*** Running Gen-Poly-Mask to generate a polyline mask file for the Continental United States ***
Input Data File:        ./wrfprs_ruc13_24.tm00_G212
Input Poly File:        CONUS.poly
Parsed Grid:            Lambert Conformal (185 x 129)
Parsed Polyline:        CONUS containing 243 points
Points Inside Mask:     5483 of 23865
Output NetCDF File:     CONUS_poly.nc
*** Job is done! ***
  • CONUS_poly.nc -- NetCDF file containing a bitmap for the masking region defined by the polyline over the domain of the gridded input file. Because this is a binary NetCDF file, it is not readable with cat or a text editor. In this example its global attributes record the file origin (generated by the gen_poly_mask tool from the polyline file CONUS.poly), the grid projection (Lambert Conformal: p1_deg 25.0 degrees_north, p2_deg 25.0 degrees_north, p0_deg 12.19 degrees_north, l0_deg -133.459 degrees_east, lcen_deg -95.0 degrees_east, d_km 40.635 km, r_km 6367.47 km), the grid size (nx 185, ny 129 grid points), and a CONUS mask variable dimensioned lat by lon.
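To inspect CONUS_poly.nc in human-readable form, a NetCDF header dump is more useful than viewing the raw bytes. This assumes the standard netCDF utility ncdump is available on Andy (it may first need to be placed on the PATH, for example via a netcdf module if one is installed):

 # Print only the header: dimensions, variables, and global attributes
 ncdump -h CONUS_poly.nc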
