Program Compilation and Job Submission

Serial Program Compilation

The CUNY HPC Center supports four different compiler suites at this time: those from Cray, Intel, The Portland Group (PGI), and GNU. Basic serial programs in C, C++, and Fortran can be compiled with any of these offerings, although the Cray compilers are available only on SALK. Man pages (e.g. for Cray, man cc; for Intel, man icc; for PGI, man pgcc; for GNU, man gcc) and manuals exist for each compiler in each suite and provide details on specific compiler flags. Optimized performance on a particular system with a particular compiler often depends on the compiler options chosen. Identical flags are accepted by the MPI-wrapped versions of each compiler (mpicc, mpif90, etc. [NOTE: SALK does not use mpi-prefixed MPI compile and run tools; it has its own]). Program debuggers and performance profilers are also part of each of these suites.

  • The Intel Compiler Suite
  • The GNU Compiler Suite

The Intel Compiler Suite

Intel's Cluster Studio (ICS) compilers, debuggers, profilers, and libraries are available on all HPC Center cluster systems. Note that the 'icc' compiler suite has been phased out by Intel and will soon be removed from the HPCC environment.

To check for the default version installed on systems:

icc  -V

Compiling a serial C program on systems other than SALK:

icc  -O3 -unroll mycode.c

The line above invokes Intel's C compiler (also used by the default OpenMPI 'mpicc' wrapper for icc). It requests level 3 optimization and asks that loops be unrolled for performance. To find out more about 'icc', type 'man icc'.

Similarly for Intel Fortran and C++.

Compiling a serial Fortran program on systems other than SALK:

ifort -O3 -unroll mycode.f90

Compiling a serial C++ program on systems other than SALK:

icpc -O3 -unroll mycode.C

On SALK, Cray's generic wrappers (cc, CC, ftn) are used for each compiler suite (Intel, PGI, Cray, GNU). To map SALK's Cray wrappers to the Intel compiler suite, users must unload the default Cray compiler modules and load the Intel compiler modules, as follows:

module unload cce
module unload PrgEnv-cray
module load PrgEnv-intel
module load intel

This completes the following mappings and also sets the Cray environment to link against the version of MPICH2 built for Cray's custom interconnect, as well as other Intel-specific Cray library builds. Once the Intel modules are loaded, you may compile either serial or MPI parallel programs on SALK:

cc   ==>  icc
CC   ==>  icpc
ftn  ==>  ifort

Once the mapping is complete, the mapped commands listed above will invoke the corresponding Intel compiler and recognize that compiler's Intel options. NOTE: Using the Intel compiler names directly on SALK will likely cause problems because the Cray-specific libraries (such as the Cray version of MPI) will not be included in the link phase, unless the intention is to run the executable only on the Cray login node.

So to compile a serial (or MPI parallel) C program on SALK after loading the Intel modules:

cc -O3 -unroll mycode.c

Doing the same on SALK for Intel Fortran and C++ programs is left as an exercise for the reader.
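For reference, a minimal sketch of that exercise (using the wrapper-to-compiler mappings and flags shown above) would be:

ftn -O3 -unroll mycode.f90     # ftn maps to ifort once the Intel modules are loaded
CC  -O3 -unroll mycode.C       # CC maps to icpc once the Intel modules are loaded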

The GNU Compiler Suite

The GNU compilers, debuggers, profilers, and libraries are available on all HPC Center cluster systems, although unlike the other compilers mentioned, the default and mix of installed versions may not be the same on each system. This is because the HPC Center runs different versions of Linux (SUSE and CentOS) at different release levels.

To check for the default version installed:

gcc  -v

Compiling a serial C program on systems other than SALK:

gcc  -O3 -funroll-loops mycode.c

The line above invokes GNU's C compiler (also used by GNU mpicc). It requests level 3 optimization and that loops be unrolled for performance. To find out more about 'gcc', type 'man gcc'.

Similarly for Fortran and C++.

Compiling a serial Fortran program on systems other than SALK:

gfortran -O3 -funroll-loops mycode.f90

Compiling a serial C++ program (uses g++) on systems other than SALK:

g++ -O3 -funroll-loops mycode.C

On SALK, Cray's generic wrappers (cc, CC, ftn) are used for each compiler suite (Intel, PGI, Cray, GNU). To map SALK's Cray wrappers to the GNU compiler suite, users must unload the currently loaded programming environment modules (here, the Intel environment loaded in the previous example) and load the GNU compiler modules, as follows:

module unload PrgEnv-intel
module load PrgEnv-gnu

This completes the following mappings and also sets the environment to link against the version of MPICH2 built for Cray's custom interconnect, as well as other GNU-specific Cray library builds. Once the GNU modules are loaded, you may compile either serial or MPI parallel programs on SALK:

cc   ==>  gcc
CC   ==>  g++
ftn  ==>  gfortran

Once the mapping is complete, the mapped commands listed above will invoke the corresponding GNU compiler and recognize that compiler's GNU options. NOTE: Using the GNU compiler names directly on SALK will likely cause problems because the Cray-specific libraries (such as the Cray version of MPI) will not be included in the link phase, unless the intention is to run the executable only on the Cray login node.

So to compile a serial (or MPI parallel) C program on SALK after loading the GNU modules:

cc -O3 -funroll-loops mycode.c

Doing the same on SALK for GNU Fortran and C++ programs is left as an exercise for the reader.
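Likewise, a minimal sketch for the GNU case would be:

ftn -O3 -funroll-loops mycode.f90     # ftn maps to gfortran once the GNU modules are loaded
CC  -O3 -funroll-loops mycode.C       # CC maps to g++ once the GNU modules are loaded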

OpenMP, OpenMP SMP-Parallel Program Compilation, and SLURM Job Submission

All the compute nodes on all the systems at the CUNY HPC Center include at least 2 sockets and multiple cores. Some have 8 cores (ZEUS, ANDY), and some have 16 (SALK and PENZIAS). These multicore, SMP compute nodes offer the CUNY HPC Center user community the option of creating parallel programs using the OpenMP Symmetric Multi-Processing (SMP) parallel programming model. SMP parallel programming with the OpenMP model (and other SMP models) is the original parallel processing model, because the earliest parallel HPC systems were built only with shared memories. The Cray X-MP (circa 1982) was among the first systems in this class. Shared-memory, multi-socket, multi-core designs are now typical of even today's desktop and portable PC and Mac systems. On the CUNY HPC Center systems, each compute node is similarly a shared-memory, symmetric multi-processing system that can compute in parallel using the OpenMP shared-memory model.

In the SMP model, multiple processors work simultaneously within a single program's memory space (image). This eliminates the need to copy data from one program (process) image to another (required by MPI) and simplifies the parallel run-time environment significantly. As such, writing parallel programs to the OpenMP standard is generally easier and requires many fewer lines of code. However, the size of the problem that can be addressed using OpenMP is limited by the amount of memory on a single compute node, and similarly the parallel performance improvement to be gained is limited by the number of processors (cores) within that single node.

As of Q4 2012 at CUNY's HPC Center, OpenMP applications can run with a maximum of 16 cores (this is on SALK, the Cray XE6m system). Most of the HPC Center's other systems are limited to 8 core OpenMP parallelism.

  • Compiling OpenMP Programs Using the Intel Compiler Suite
  • Compiling OpenMP Programs Using the GNU Compiler Suite
  • Submitting an OpenMP Program to the SLURM Batch Queueing System


Here, a simple OpenMP parallel version of the standard C "Hello, World!" program is set to run on 8 cores:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NPROCS 8

int main (int argc, char *argv[]) {

   int nthreads, num_threads=NPROCS, tid;

  /* Set the number of threads */
  omp_set_num_threads(num_threads);

  /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
  {

  /* Each thread obtains its thread number */
  tid = omp_get_thread_num();

  /* Each thread executes this print */
  printf("Hello World from thread = %d\n", tid);

  /* Only the master thread does this */
  if (tid == 0)
     {
      nthreads = omp_get_num_threads();
      printf("Total number of threads = %d\n", nthreads);
     }

   }  /* All threads join master thread and disband */

}

An excellent and comprehensive tutorial on OpenMP with examples can be found at the Lawrence Livermore National Lab web site: (https://computing.llnl.gov/tutorials/openMP)

Compiling OpenMP Programs Using the Intel Compiler Suite

The Intel C compiler requires the '-openmp' option, as follows:

icc  -o hello_omp.exe -openmp hello_omp.c

When run, the program above produces the following output:

$ ./hello_omp.exe
Hello World from thread = 0
Total number of threads = 8
Hello World from thread = 1
Hello World from thread = 2 
Hello World from thread = 6
Hello World from thread = 4
Hello World from thread = 3
Hello World from thread = 5
Hello World from thread = 7

OpenMP is supported in Intel's C, C++, and Fortran compilers; as such, a Fortran version of the program above could be used to produce similar results. An important feature of OpenMP threads is that they are logical entities that are not by default locked to physical processors. The code above requesting 8 threads would run and produce similar results on a compute node with only 2 or 4 processors, or even 1 processor; it would simply take more wall-clock time to complete.

When more threads are requested than there are physical processors (cores) on the motherboard, they simply compete for access to the actual number of physical cores available. Under such circumstances, the maximum program speed-up is limited to the number of unshared physical processors (cores) available to the OpenMP job, less the overhead required to start OpenMP (this ignores Intel's 'hyperthreading', which allows two threads to share sub-resources not in simultaneous use within a single processor).
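If you prefer to set the thread count at run time rather than hard-coding it, one common approach (a sketch, assuming the omp_set_num_threads() call is removed from the source) is to export OMP_NUM_THREADS before running the executable:

export OMP_NUM_THREADS=4     # bash; under csh use: setenv OMP_NUM_THREADS 4
./hello_omp.exe              # now runs with 4 threads instead of the hard-coded 8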

Compiling OpenMP Programs Using the GNU Compiler Suite

The GNU C compiler requires its '-fopenmp' option for OpenMP programs, as follows:

gcc  -o hello_omp.exe -fopenmp hello_omp.c

The program produces the same output, and again the order of the print statements cannot be predicted and will not necessarily be the same over repeated runs.

OpenMP is supported in GNU's C, C++, and Fortran compilers; therefore a Fortran or C++ version of the program above could be used to produce similar results.
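For example (a sketch, assuming hypothetical Fortran and C++ sources hello_omp.f90 and hello_omp.cpp), the GNU Fortran and C++ compilers take the same '-fopenmp' flag:

gfortran -o hello_omp_f.exe -fopenmp hello_omp.f90
g++ -o hello_omp_cpp.exe -fopenmp hello_omp.cpp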

Submitting an OpenMP Program to the SLURM Batch Queueing System

All non-trivial jobs (development or production, parallel or serial) must be submitted to the HPC Center compute nodes from each system's head or login node using a SLURM script. Interactive jobs that place a significant and sustained load on a head node will be terminated. Details on the use of SLURM are presented later in this document; however, here we present a basic SLURM script ('my_ompjob') that can be used to submit any OpenMP SMP program for batch processing on one of the CUNY HPC Center compute nodes.

#!/bin/bash
#SLURM -q production
#SLURM -N openMP_job
#SLURM -l select=1:ncpus=8
#SLURM -l place=free
#SLURM -V

# You must explicitly change to your working directory in SLURM
# The SLURM_O_WORKDIR variable is automatically filled with the path 
# to the directory you submit your job from

cd $SLURM_O_WORKDIR
echo ""

# The SLURM_NODEFILE file contains the compute nodes assigned
# to the job by SLURM.  Uncommenting the next line will show them.

cat $SLURM_NODEFILE
echo ""

# It is possible to set the number of threads to be used in
# an OpenMP program using the environment variable OMP_NUM_THREADS.
# This setting is not used here because the number of threads (8)
# was fixed inside the program itself in our example code.
# export OMP_NUM_THREADS=8

./hello_omp.exe

When submitted with 'qsub my_ompjob' a job ID XXXX is returned and the output will be written to the file 'openMP_job.oXXXX' where XXXX is the job ID, unless otherwise redirected on the command-line.
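For example (the job ID shown is purely illustrative):

qsub my_ompjob            # returns a job ID, e.g. 12345
cat openMP_job.o12345     # view the program's output once the job has finished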

The key lines in the script are '-l select' and '-l place'. The first defines (1) resource chunk with '-l select=1' and assigns (8) cores to it with ':ncpus=8'. SLURM must allocate these (8) cores on a single node because they are all part of a single SLURM resource 'chunk' ('chunks' are atomic) to be used in concert by our OpenMP executable, hello_omp.exe.

Next, the line '-l place=free' instructs SLURM to place this chunk anywhere it can find 8 free cores. As mentioned, SLURM resource 'chunks' are indivisible across compute nodes; this job can therefore only be run on a single compute node. It would never run on a system with only 4 cores per compute node, and on those with only 8 cores per node SLURM would have to find a node with no other jobs running on it. This is exactly what we want for an OpenMP job: a one-to-one mapping of physically free cores to the OpenMP threads requested, with no other jobs scheduled by SLURM (or outside of SLURM's purview) running on and competing for those 8 cores.

Placement on a node with as many free physical cores as OpenMP threads is optimal for OpenMP jobs because each processor assigned to an OpenMP job works within that single program's memory space or image. If the processors assigned by SLURM were on another compute node they would not be usable; if they were assigned to another job on the same compute node they would not be fully available to the OpenMP program and would delay its completion.

Here, the selection of 8 cores will consume all the cores available on a single compute node on ANDY. This forces SLURM to find and allocate an entire compute node to the OpenMP job. In this case, the OpenMP job will also have all of the memory the compute node has at its disposal, knowing that no other jobs will be assigned to it by SLURM. If fewer cores were selected (say 4), SLURM could place another job on the same ANDY compute node using as many as 4 cores. This job would compete for memory resources proportionally, but would have its own cores. SLURM offers the 'pack:excl' option to force exclusive placement even if the job uses fewer than all the cores on the physical node. One might wish to do this to run a single-core job and have it use all the memory on the compute node.
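As an illustration (a sketch in the same directive style used above), a single-core job that still keeps a whole node to itself could request:

#SLURM -l select=1:ncpus=1
#SLURM -l place=pack:excl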

One thing that should be kept in mind when defining SLURM resource requirements and submitting any SLURM script is that jobs with resource requests that are impossible to fulfill on the system where they are submitted will be queued forever and never run. In our case, we must know that the system we are submitting this job to has at least 8 processors (cores) available on a single physical compute node. At the HPC Center this job would run on either ANDY or SALK, but would be queued indefinitely on any system that has fewer than 8 cores per physical node. This resource-mapping requirement applies to any resource that you might request in your SLURM script, not just cores. Resource definition and mapping are discussed in greater detail in the SLURM section later in this document.

Note that on SALK, the Cray XE6m system, the SLURM script would require the use of Cray's compute-node, job launch command 'aprun', as follows:

#!/bin/bash
#SLURM -q production
#SLURM -N openMP_job
#SLURM -l select=1:ncpus=16:mem=32768mb
#SLURM -l place=free
#SLURM -j oe
#SLURM -o openMP_job.out
#SLURM -V

# You must explicitly change to your working directory in SLURM
# The SLURM_O_WORKDIR variable is automatically filled with the path 
# to the directory you submit your job from

cd $SLURM_O_WORKDIR
echo ""

# The SLURM_NODEFILE file contains the compute nodes assigned
# to the job by SLURM.  Uncommenting the next line will show them.

cat $SLURM_NODEFILE
echo ""

# It is possible to set the number of threads to be used in
# an OpenMP program using the environment variable OMP_NUM_THREADS.
# This setting is not used here because the number of threads (8)
# was fixed inside the program itself in our example code.
# export OMP_NUM_THREADS=8

aprun -n 1 -d 16 ./hello_omp.exe

Here, 'aprun' requests that one process be allocated to a compute node ('-n 1') and that it be given all 16 cores available on a single SALK compute node ('-d 16'). Because the production queue on SALK allows no jobs requesting fewer than 16 cores, the '-l select' line was also changed. The define in the original C source code should also be changed to set the number of OpenMP threads to 16, so that no allocated cores are wasted on the compute node, as in:

#define NPROCS 16
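Alternatively (a sketch, again assuming the omp_set_num_threads() call is removed from the source), the thread count can be controlled from the SLURM script itself and matched to the aprun depth:

export OMP_NUM_THREADS=16
aprun -n 1 -d 16 ./hello_omp.exe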

MPI, MPI Parallel Program Compilation, and SLURM Batch Job Submission

The Message Passing Interface (MPI) is a hardware-independent parallel programming and communications library callable from C, C++, or Fortran. Quoting from the MPI standard:

MPI is a message-passing application programmer interface (API), together with protocol and semantic specifications for how its features must behave in any implementation.

MPI has become the de facto standard approach for parallel programming in HPC. MPI is a collection of well-defined library calls composing an Applications Program Interface (API) for transferring data (packaged as messages) between completely independent processes with independent address spaces. These processes might be running within a single physical node, as required above with OpenMP, or distributed across nodes connected by an interconnect such as Gigabit Ethernet or InfiniBand. MPI communication is generally two-sided, with both the sender and receiver of the data actively participating in the communication events. Both point-to-point and collective communication (one-to-many; many-to-one; many-to-many) are supported. MPI's goals are high performance, scalability, and portability. MPI remains the dominant parallel programming model used in high-performance computing today, although it is sometimes criticized as difficult to program with.

  • An Overview of the CUNY MPI Compilers and Batch Scheduler
  • Sample Compilations and Production Batch Scripts
    • Intel OpenMPI Parallel C
    • Intel OpenMPI Parallel FORTRAN
    • Intel OpenMPI SLURM Submit Script
    • GNU OpenMPI Parallel C
    • GNU OpenMPI Parallel FORTRAN
    • GNU OpenMPI SLURM Submit Script
    • Other System-Local Custom Versions of the MPI Stack
  • Setting Your Preferred MPI and Compiler Defaults
  • Getting the Right Interconnect for High Performance MPI


The original MPI-1 release was not designed with any special features to support traditional shared-memory or distributed, shared-memory parallel architectures, and MPI-2 provides only limited distributed, shared-memory support with some one-sided, remote direct memory access (RDMA) routines. Nonetheless, MPI programs are regularly run on shared memory computers because the MPI model is an architecture-neutral parallel programming paradigm. Writing parallel programs using the MPI model (as opposed to shared-memory models such as OpenMP described above) requires the careful partitioning of program data among the communicating processes to minimize the communication events that can sap the performance of parallel applications, especially when they are run at larger scale (with more processors).

The CUNY HPC Center supports several versions of MPI, including proprietary versions from Intel, SGI, and Cray; however, with the exception of the Cray, CUNY HPC Center systems have by default standardized on the open-source release of MPI called OpenMPI (not to be confused with OpenMP [yes, this is confusing]). While this version will not always perform as well as the proprietary versions mentioned above, it is a reliable version that can be run on most HPC cluster systems. Among the systems currently running at the CUNY HPC Center, only the Cray (SALK) does not support OpenMPI. It instead uses a custom version of MPICH2 based on Cray's Gemini interconnect communication protocol. In the discussion below, we therefore emphasize OpenMPI (except in our treatment of MPI on the Cray) because it can be run on almost every system the CUNY HPC Center supports. Details on how to use Intel's and SGI's proprietary MPIs, and on using MPICH, another open-source version of MPI, will be added later.

OpenMPI (completely different from and not to be confused with OpenMP described above) is a project combining technologies and resources from several previous MPI projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI) with the stated aim of building the best freely available MPI library. OpenMPI represents the merger between three well-known MPI implementations:

  • FT-MPI from the University of Tennessee
  • LA-MPI from Los Alamos National Laboratory
  • LAM/MPI from Indiana University

with contributions from the PACX-MPI team at the University of Stuttgart. These four institutions comprise the founding members of the OpenMPI development team which has grown to include many other active contributors and a very active user group.

These MPI implementations were selected because OpenMPI developers thought that each excelled in one or more areas. The stated driving motivation behind OpenMPI is to bring the best ideas and technologies from the individual projects and create one world-class open source MPI implementation that excels in all areas. The OpenMPI project names several top-level goals:

  • Create a free, open source software, peer-reviewed, production-quality complete MPI-2 implementation.
  • Provide extremely high, competitive performance (low latency or high bandwidth).
  • Directly involve the high-performance computing community with external development and feedback (vendors, 3rd party researchers, users, etc.).
  • Provide a stable platform for 3rd party research and commercial development.
  • Help prevent the "forking problem" common to other MPI projects.
  • Support a wide variety of high-performance computing platforms and environments.

At the CUNY HPC Center, OpenMPI may be used to run jobs compiled with the Intel or GNU compilers. Two simple MPI programs, one written in C and another in Fortran, are shown below as examples. For details on programming in MPI, users should consider attending the CUNY HPC Center MPI workshop (3 days in length), refer to the many online tutorials, or read one of the books on the subject. A good online tutorial on MPI can be found at LLNL here [1]. A tutorial on parallel programming in general can be found here [2].

Parallel implementations of the "Hello world!" program in C and Fortran are presented here to give the reader a feel for the look of MPI code. These sample codes can be used as test cases in the sections below describing parallel application compilation and job submission. Again, refer to the tutorials mentioned above or attend the CUNY HPC Center MPI workshop for details on MPI programming.

Example 1. C Example (hello_mpi.c)
#include <stdio.h>

/* include MPI specific data types and definitions */
#include <mpi.h>

int main (int argc, char *argv[])
{
 int rank, size;

/* set up the MPI runtime environment */
 MPI_Init (&argc, &argv);  

/* get current process id */
 MPI_Comm_rank (MPI_COMM_WORLD, &rank);

/* get number of processes */
 MPI_Comm_size (MPI_COMM_WORLD, &size);

 printf( "Hello world from process %d of %d\n", rank, size );

/* break down the MPI runtime environment */
 MPI_Finalize();

 return 0;

}


Example 2. Fortran example (hello_mpi.f90)
program hello

! include MPI specific data types and definitions
include 'mpif.h'

integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)

! set up the MPI runtime environment
call MPI_INIT(ierror)

! get number of processes
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)

! get current process id
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

print*, 'Hello world from process ', rank, ' of ', size

! break down the MPI runtime environment
call MPI_FINALIZE(ierror)

end

An excellent and comprehensive tutorial on MPI with examples can be found at the Lawrence Livermore National Lab web site: (https://computing.llnl.gov/tutorials/mpi)

Sample Compilations and Production Batch Scripts

These examples could be used to compile the sample programs above and should run consistently on all CUNY HPC Center systems except SALK, which as mentioned has its own compiler wrappers.

OpenMPI (Intel compiler) Parallel C code

Compilation (again, because the Intel-compiled version of OpenMPI is the default, the full path shown here is NOT required):

/share/apps/openmpi-intel/default/bin/mpicc -o hello_mpi.exe ./hello_mpi.c

OpenMPI (Intel compiler) Parallel FORTRAN code

Compilation (again, because the Intel-compiled version of OpenMPI is the default, the full path shown here is NOT required):

/share/apps/openmpi-intel/default/bin/mpif90 -o hello_mpi.exe ./hello_mpi.f90

OpenMPI (Intel compiler) SLURM Submit Script

The script below (my_mpi.job) requests that SLURM schedule an 8 processor (core) job and allows SLURM to freely distribute the 8 processors requested to any free nodes. For details on the meaning of all the options in this script, please see the full SLURM section below.

#!/bin/bash
#SLURM -q production
#SLURM -N openmpi_intel
#SLURM -l select=8:ncpus=1
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)

echo -n ">>>> SLURM Master compute node is: "
hostname
echo ""

# You must explicitly change to your working directory in SLURM
# The SLURM_O_WORKDIR variable is automatically filled with the
# path to the directory you submit your job from

cd $SLURM_O_WORKDIR

# The SLURM_NODEFILE file contains the compute nodes assigned
# to your job by SLURM.  Uncommenting the next line will show them.

echo ">>>> SLURM Assigned these nodes to your job: "
echo ""
cat $SLURM_NODEFILE
echo ""

# Because OpenMPI compiled with the Intel compilers is the default,
# the full path here is NOT required.

/share/apps/openmpi-intel/default/bin/mpirun -np 8 -machinefile $SLURM_NODEFILE ./hello_mpi.exe

When submitted with 'qsub my_mpi.job' a job ID is returned and output will be written to the file called 'openmpi_intel.oXXXX' where XXXX is the job ID. Errors will be written to 'openmpi_intel.eXXXX' where XXXX is the job ID.

MPI hello world output:


>>>> SLURM Master compute node is: r1i0n6

>>>> SLURM Assigned these nodes to your job: 

r1i0n6
r1i0n7
r1i0n8
r1i0n9
r1i0n10
r1i0n14
r1i1n0
r1i1n1

Hello world from process 0 of 8
Hello world from process 7 of 8
Hello world from process 5 of 8
Hello world from process 4 of 8
Hello world from process 6 of 8
Hello world from process 3 of 8
Hello world from process 1 of 8
Hello world from process 2 of 8

OpenMPI (GNU compiler) Parallel C

Coming soon.
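In the meantime, a minimal sketch (assuming the GNU-compiled OpenMPI installation in /opt/openmpi referenced in the submit script below) would be:

/opt/openmpi/bin/mpicc -o hello_mpi.exe ./hello_mpi.c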

OpenMPI (GNU compiler) Parallel FORTRAN

Coming soon.
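Likewise, a minimal Fortran sketch under the same assumption:

/opt/openmpi/bin/mpif90 -o hello_mpi.exe ./hello_mpi.f90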

OpenMPI (GNU compiler) SLURM Submit Script

This script sends SLURM an 8 processor (core) job, allowing SLURM to freely distribute the 8 processors to the least loaded nodes. (Note: the only real difference between this script and the Intel script above is in the path to the mpirun command.) For details on the meaning of all the options in this script, please see the full SLURM section below.

#!/bin/bash
#SLURM -q production
#SLURM -N openmpi_gnu
#SLURM -l select=8:ncpus=1
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)

echo -n ">>>> SLURM Master compute node is: "
hostname
echo ""

# You must explicitly change to your working directory in SLURM
# The SLURM_O_WORKDIR variable is automatically filled with the path 
# to the directory you submit your job from

cd $SLURM_O_WORKDIR

# The SLURM_NODEFILE file contains the compute nodes assigned
# to the job by SLURM.  Uncommenting the next line will show them.

echo ">>>> SLURM Assigned these nodes to your job: "
echo ""
cat $SLURM_NODEFILE
echo ""

# Because the GNU build of OpenMPI is NOT the default, the full path is shown here,
# but this does not guarantee a clean run. You must ensure that the
# environment has been toggled to GNU either in this batch script or
# within your init files (see section below).

/opt/openmpi/bin/mpirun -np 8 -machinefile $SLURM_NODEFILE ./hello_mpi.exe

When submitted with 'qsub myjob' a job ID is returned and output will be written to the file called 'openmpi_gnu.oXXXX' where XXXX is the job ID. Errors will be written to 'openmpi_gnu.eXXXX' where XXXX is the job ID.

MPI hello world output:


>>>> SLURM Master compute node is: r1i0n3

>>>> SLURM Assigned these nodes to your job:

r1i0n3
r1i0n7
r1i0n8
r1i0n9
r1i0n10
r1i0n14
r1i1n0
r1i1n1

Hello world from process 0 of 8
Hello world from process 7 of 8
Hello world from process 5 of 8
Hello world from process 4 of 8
Hello world from process 6 of 8
Hello world from process 3 of 8
Hello world from process 1 of 8
Hello world from process 2 of 8

NOTE: The paths used above for the gcc version of OpenMPI apply only to ZEUS, which has a GE interconnect. On BOB, the path to the InfiniBand version of the gcc OpenMPI commands and libraries is:

/usr/mpi/gcc/openmpi-1.2.8/[bin,lib]

Setting Your Preferred MPI and Compiler Defaults

As mentioned above, the default version of MPI on the CUNY HPC Center clusters is OpenMPI 1.5.5 compiled with the Intel compilers. This default is set by scripts in the /etc/profile.d directory (i.e. smpi-defaults.[sh,csh]). When the MPI wrapper commands (mpicc, mpif90, mpirun, etc.) are used WITHOUT full path prefixes, these Intel defaults will be invoked. To use either of the other supported MPI environments (OpenMPI compiled with the PGI compilers, or OpenMPI compiled with the GNU compilers), users should set their local environment either in their home directory init files (i.e. .bashrc, .cshrc) or manually in their batch scripts. The script provided below can be used for this.

WARNING: Full path references by themselves to non-default MPI commands will NOT guarantee error-free compiles and runs because of the way OpenMPI references the environment it runs in!

CUNY HPC Center staff recommend fully toggling the site default environment away from Intel to PGI or GNU when the non-default environments are preferred. This can be done relatively easily by commenting out the default and commenting in one of the preferred alternatives referenced in the script provided below. Users may copy the script smpi-defaults.sh (or smpi-defaults.csh) from /etc/profile.d. A copy is provided here for reference. (NOTE: This discussion does NOT apply on the Cray, which uses the 'modules' system to manage its default applications environment.)

# general path settings 
#PATH=/opt/openmpi/bin:$PATH
#PATH=/usr/mpi/gcc/openmpi-1.2.8/bin:$PATH
#PATH=/share/apps/openmpi-pgi/default/bin:$PATH
#PATH=/share/apps/openmpi-intel/default/bin:$PATH
export PATH

# man path settings 
#MANPATH=/opt/openmpi/share/man:$MANPATH
#MANPATH=/usr/mpi/gcc/openmpi-1.2.8/share/man:$MANPATH
#MANPATH=/share/apps/openmpi-pgi/default/share/man:$MANPATH
#MANPATH=/share/apps/openmpi-intel/default/share/man:$MANPATH
export MANPATH

# library path settings 
#LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
#LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.2.8/lib:$LD_LIBRARY_PATH
#LD_LIBRARY_PATH=/share/apps/openmpi-pgi/default/lib:$LD_LIBRARY_PATH
#LD_LIBRARY_PATH=/share/apps/openmpi-intel/default/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

By selectively commenting in the appropriate line in each paragraph above, the default PATH, MANPATH, and LD_LIBRARY_PATH can be set to the MPI compilation stack that the user prefers. The right place to do this is inside the user's .bashrc file (or .cshrc file in the C-shell) in the user's HOME directory. Once done, full path references in the SLURM submit scripts listed above become unnecessary, and one script would work for any compilation stack.
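For example, a minimal sketch of the selections that toggle the defaults to the GNU OpenMPI stack (the /opt/openmpi paths from the script above) leaves only these lines uncommented:

PATH=/opt/openmpi/bin:$PATH
MANPATH=/opt/openmpi/share/man:$MANPATH
LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
export PATH MANPATH LD_LIBRARY_PATH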

This approach can be used to set the MPI environment to older, non-default versions of OpenMPI still installed in /share/apps/openmpi-[intel,pgi].

Getting the Right Interconnect for High Performance MPI

A few comments should be made about interconnect control and selection under OpenMPI. First, this question applies ONLY to ANDY and HERBERT which have both InfiniBand and Gigabit Ethernet interconnects. InfiniBand provides both greater bandwidth and lower latencies than Gigabit Ethernet, and it should be chosen on these systems because it will deliver better performance at a given processor count and greater application scalability.

Both the Intel and Portland Group versions of OpenMPI installed on ANDY and HERBERT have been compiled to include the OpenIB libraries. This means that by default the mpirun command will attempt to use the OpenIB libraries at runtime without any special options. If this cannot be done because no InfiniBand devices can be found, a runtime error message will be reported in SLURM's error file, and mpirun will attempt to use other libraries and interfaces (namely Gigabit Ethernet, which is TCP/IP based) to run the job. If successful, the job will run to completion, but it will perform in a sub-optimal way.

To avoid this, or to establish with certainty which communication libraries and devices are being used by your job, there are options that can be used with mpirun to force the choice of one communication device, or the other.

To force the job to use the OpenIB interface (ib0) or fail, use:

mpirun  -mca btl openib,self -np  8 -machinefile $SLURM_NODEFILE ./hello_mpi.exe

To force the job to use the GigaBit Ethernet interface (eth0) or fail, use:

mpirun  -mca btl tcp,self -np  8 -machinefile $SLURM_NODEFILE ./hello_mpi.exe

Note that this discussion does not apply on the Cray, which uses its own proprietary Gemini interconnect. It is worth noting that the Cray's interconnect is not switch-based like the other systems, but rather a 2D toroidal mesh, for which awareness of job placement on the mesh can be an important consideration when tuning a job for performance at scale.

GPU Parallel Program Compilation and SLURM Job Submission

The CUNY HPC Center supports computing with Graphics Processing Units (GPUs). GPUs can be thought of as highly parallel co-processors (or accelerators) connected to a node's CPUs via a PCI Express bus. The HPC Center provides GPU accelerators on PENZIAS, which has 144 NVIDIA Tesla K20m GPUs (two in every compute node). Specifications of each GPU (as found by the 'deviceQuery' utility) are as follows:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K20m"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 4800 MBytes (5032706048 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Clock rate:                                706 MHz (0.71 GHz)
  Memory Clock rate:                             2600 Mhz
  Memory Bus Width:                              320-bit
  L2 Cache Size:                                 1310720 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Each of the 144 GPU devices delivers a peak performance of 3,524 GFLOPS. The K20m cards are installed on the motherboard and connected via a PCIe 2.0 x16 interface.


  • GPU Parallel Programming with the Portland Group Compiler Directives
  • Submitting Portland Group, GPU-Parallel Programs Using SLURM
  • GPU Parallel Programming with NVIDIA's CUDA C or PGI's CUDA Fortran Programming Models
    • A Sample CUDA GPU Parallel Program Written in NVIDIA's CUDA C
    • A Sample CUDA GPU Parallel Program Written in PGI's CUDA Fortran
  • Submitting CUDA (C or Fortran), GPU-Parallel Programs Using SLURM
  • Submitting CUDA (C or Fortran), GPU-Parallel Programs and Functions Using MATLAB


Two distinct parallel programming approaches for the HPC Center's GPU resources are described here. The first (a compiler-directives-based extension available in the Portland Group, Inc. (PGI) C and Fortran compilers) delivers ease of use at the expense of somewhat less than highly tuned performance. The second (NVIDIA's Compute Unified Device Architecture, CUDA C, or PGI's CUDA Fortran GPU programming model) provides the ability within C or Fortran to more directly address the GPU hardware for better performance, but at the expense of a somewhat greater programming effort. We introduce both approaches here, and present the basic steps for GPU parallel program compilation and job submission using SLURM for each.

GPU Parallel Programming with the Portland Group Compiler Directives

The Portland Group, Inc. (PGI) has taken the lead in building a general purpose, accelerated parallel computing model into its compilers. Programmers can access this new technology at CUNY using PGI's compiler, which supports the use of GPU-specific, compiler directives in standard C and Fortran programs. Compiler directives simplify the programmer's job of mapping parallel kernels onto accelerator hardware and do so without compromising the portability of the user's application. Such a directives-parallelized code can be compiled and run on either the CPU-GPU together, or on the CPU alone. At this time, PGI supports the current, HPC-oriented GPU accelerator products from NVIDIA, but intends to extend its compiler-directives-based approach in the future to other accelerators.

The simplicity of coding with directives is illustrated here with a sample code ('vscale.c') that does a simple iteration independent scaling of a vector on both the GPU and CPU in single precision and compares the results:

        #include <stdio.h>
        #include <stdlib.h>
        #include <assert.h>
        
        int main( int argc, char* argv[] )
        {
            int n;      /* size of the vector */
            float *restrict a;  /* the vector */
            float *restrict r;   /* the results */
            float *restrict e;  /* expected results */
            int i;

            /* Set array size */
            if( argc > 1 )
                n = atoi( argv[1] );
            else
                n = 100000;
            if( n <= 0 ) n = 100000;
        
            /* Allocate memory for arrays */
            a = (float*)malloc(n*sizeof(float));
            r = (float*)malloc(n*sizeof(float));
            e = (float*)malloc(n*sizeof(float));

            /* Initialize array */
            for( i = 0; i < n; ++i ) a[i] = (float)(i+1);
        
            /* Scale array and mark for acceleration */
            #pragma acc region
            {
                for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
            }

            /* Scale array on the host to compare */
                for( i = 0; i < n; ++i ) e[i] = a[i]*2.0f;

            /* Check the results and print */
            for( i = 0; i < n; ++i ) assert( r[i] == e[i] );

            printf( "%d iterations completed\n", n );

            return 0;
        }

In this simple example, the only code and instruction to the compiler required to direct this vector scaling kernel to the GPU is the compiler directive:

 #pragma acc region

that precedes the second C 'for' loop. A user can build a GPU-ready executable ('vscale.exe' in this case) for execution on ZEUS or ANDY with the following compilation statement:

pgcc -o vscale.exe vscale.c -ta=nvidia -Minfo=accel -fast

The option '-ta=nvidia' declares to the compiler what the destination hardware acceleration technology is going to be (PGI's model is intended to be general, although its implementation for NVIDIA's GPU accelerators is the most advanced to date), and the '-Minfo=accel' option requests output describing what the compiler did to accelerate the code. This output is included here:

main:
     29, Generating copyout(r[:n-1])
           Generating copyin(a[:n-1])
           Generating compute capability 1.0 binary
           Generating compute capability 2.0 binary
     31, Loop is parallelizable
           Accelerator kernel generated
           31, #pragma acc for parallel, vector(256) /* blockIdx.x threadIdx.x */
               CC 1.0 :   3 registers; 48 shared,   4 constant, 0 local memory bytes;   100% occupancy
               CC 2.0 : 10 registers;   4 shared, 60 constant, 0 local memory bytes;   100% occupancy

In the output, the compiler explains where and what it intends to copy between CPU memory and GPU accelerator memory. It explains that the C 'for' loop has no loop iteration dependencies and can be run on the accelerator in parallel. It also indicates the vector length (256, the block size of the work to be done on the GPU). Because the array pointer 'a[]' is declared with 'restrict', it will point only into 'a'. This assures the compiler that pointer-alias-related loop dependencies cannot occur.

The Portland Group C and Fortran Programming Guides provide a complete description of its accelerator compiler-directives programming model [3]. Additional introductory material can be found in four PGI white paper tutorials (part1, part2, part3, part4), here: [4], [5], [6], [7].

Submitting Portland Group, GPU-Parallel Programs Using SLURM

GPU job submission is very much like other batch job submission under SLURM. Here is a SLURM example script that can be used to run the GPU-ready executable created above on PENZIAS:

#!/bin/bash
#SLURM -q production
#SLURM -N pgi_gpu_job
#SLURM -l select=1:ncpus=1:ngpus=1
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# You must explicitly change to the working directory in SLURM
cd $SLURM_O_WORKDIR


echo ">>>> Begin PGI GPU Compiler Directives-based run ..."
echo ""
./vscale.exe
echo ""
echo ">>>> End   PGI GPU Compiler Directives-based run ..."

The only difference from the non-GPU submit script is in the "select" statement. By adding the "ngpus=1" directive, the user instructs SLURM to allocate 1 GPU device per chunk. Altogether, 1 CPU and 1 GPU are requested in the above script. Consider a different script:

#!/bin/bash
#SLURM -q production
#SLURM -N pgi_gpu_job
#SLURM -l select=4:ncpus=4:ngpus=2
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# You must explicitly change to the working directory in SLURM
cd $SLURM_O_WORKDIR


echo ">>>> Begin PGI GPU Compiler Directives-based run ..."
echo ""
./vscale.exe
echo ""
echo ">>>> End   PGI GPU Compiler Directives-based run ..."

Here SLURM is instructed to allocate 4 chunks of resources, with each chunk having 4 CPUs and 2 GPUs, totaling 16 CPUs and 8 GPUs. Note that the ngpus parameter may only take the values 0, 1, or 2: there are 2 GPUs per compute node, and therefore if more than 2 GPUs per chunk are requested, SLURM will fail to find a compute node that matches the request (SLURM chunks are 'atomic' with respect to actual hardware). This is an important limitation to keep in mind when creating SLURM scripts.
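For example (a sketch), a request that stays within the two-GPUs-per-node limit while using more of each node's cores might be:

#SLURM -l select=2:ncpus=8:ngpus=2

whereas any chunk asking for ngpus=3 or more could never be matched to a node, and the job would sit in the queue indefinitely.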

These are the essential SLURM script requirements for submitting any GPU-Device-ready executable. This applies to the executable compiled with compiler directives above, but the same script might also be used to run GPU-ready executable code generated from native CUDA C or Fortran code as described in the next example. In the case above, the PGI compiler-directive-marked loops will run in parallel on a single NVIDIA GPU after the data in array 'a[]' is copied to it across the PCI-Express bus.

Other variations are possible, including jobs that combine MPI or OpenMP (or even both of these) and GPU parallel programming in a single GPU-SMP-MPI multi-parallel job. There is not enough space to cover these approaches here, but the HPC Center staff has created code examples that illustrate these multi-parallel programming model approaches and will provide them to interested users at the HPC Center.

GPU Parallel Programming with NVIDIA's CUDA C or PGI's CUDA Fortran Programming Models

The previous section described the recent advances in compiler development from PGI that make utilizing the data-parallel compute power of the GPU more accessible to C and Fortran programmers. This trend has continued with the definition and adoption of the OpenACC standard by PGI, Cray, and CAPS. OpenACC is an OpenMP-like portable standard for obtaining accelerated performance on GPUs and other accelerators using compiler directives. It is based on the approaches already developed by PGI, Cray, and CAPS over the last several years.

Yet, for over 5 years NVIDIA has offered and continued to develop its Compute Unified Device Architecture (CUDA), a direct, NVIDIA-GPU-specific programming environment for C programmers. More recently, PGI has released CUDA Fortran jointly with NVIDIA, offering a second language choice for programming NVIDIA GPUs using CUDA.

In this section, the basics of compiling and running CUDA C and CUDA Fortran applications at the CUNY HPC Center are covered. The current default version of CUDA in use at the CUNY HPC Center as of 11-27-12 is CUDA release 5.0.

CUDA is a complete programming environment that includes:

1.  A modified version of the C or Fortran programming language for programming the GPU Device and
   moving data between the CPU Host and the GPU Device.

2. A runtime environment and translator that generates and runs device-specific, CPU-GPU
  executables from more generic, single, mixed-instruction-set executables.

3. A Software Development Kit (SDK), HPC application-related libraries, and documentation
  to support the development of CUDA applications.

NVIDIA and PGI have put a lot of effort into making CUDA a flexible, full-featured, and high-performance programming environment similar to those in use in HPC to program CPUs. However, CUDA is still a 2-instruction-set, CPU-GPU programming model that must manage two separate memory spaces linked only by the compute node's PCI-Express bus. As such, programming GPUs using CUDA is more complicated than PGI's compiler-directives-based approach presented above, which hides many of these details from the programmer. Still, CUDA's more explicit, close-to-the-hardware approach offers CUDA programmers the chance to get the best possible performance from the GPU for their particular application by carefully controlling SM register use and occupancy.

Adapting a current application or writing a new one for the CUDA CPU-GPU programming model involves dividing that application into those parts that are highly data-parallel and better suited for the GPU Device (the so-called GPU Device code, or device kernel(s)) and those parts that have little or limited data-parallelism and are better suited for execution on the CPU Host (the driver code, or the CPU Host code). In addition, one should inventory the amount of data that must be moved between the CPU Host and GPU Device relative to the amount of GPU computation for each candidate data-parallel GPU kernel. Kernels whose compute-to-communication time ratios are too small should be executed on the CPU.

With the natural GPU-CPU divisions in the application identified, what were once host kernels (usually substantial looping sections in the host code) must be recoded in CUDA C or Fortran for the GPU Device. Also, Host CPU-to-GPU interface code for transferring data to and from the GPU, and for calling the GPU kernel, must be written. Once these steps are completed and the host driver and GPU kernel code are compiled with NVIDIA's 'nvcc' compiler driver (or PGI's CUDA Fortran compiler), the result is a fully executable, mixed CPU-GPU binary (single file, dual instruction set) that typically does the following for each GPU kernel it calls:

1.  Allocates memory for required CPU source and destination arrays on the CPU Host.

2.  Allocates memory for GPU input, intermediate, and result arrays on the GPU Device.

3.  Initializes and/or assigns values to these arrays.

4.  Copies any required CPU Host input data to the GPU Device.

5.  Defines the GPU Device grid, block, and thread dimensions for each GPU kernel.

6.  Calls (executes) the GPU Device kernel code from the CPU Host driver code.

7.  Copies the required GPU Device results back to the CPU Host.

8.  Frees (and perhaps zeroes) memory on the CPU Host and GPU Device that is no longer needed.

The details of the actual coding process are beyond the scope of the discussion here, but are treated in depth in NVIDIA's CUDA C Training Class notes, in NVIDIA's CUDA C Programming Guide, and in PGI's CUDA Fortran Programming Guide [8], [9] and in many tutorials and articles on the web [10].

A Sample CUDA GPU Parallel Program Written in NVIDIA's CUDA C

Here, we present a basic example of a CUDA C application that includes code for all the steps outlined above. It fills and then increments a 2D array on the GPU Device and returns the results to the CPU Host for printing. The example code is presented in two parts--the CPU Host setup or driver code, and the GPU Device or kernel code. This example comes from the suite of examples used by NVIDIA in its CUDA Training Class notes. There are many more involved and HPC-relevant examples (matrixMul, binomialOptions, simpleCUFFT, etc.) provided in NVIDIA's Software Development Toolkit (SDK) which any user of CUDA may download and install in their home directory on their CUNY HPC Center account.

The basic example's CPU Host CUDA C code or driver, simple3_host.cu, is:

#include <stdio.h>

extern __global__ void mykernel(int *d_a, int dimx, int dimy);

int main(int argc, char *argv[])
{
   int dimx = 16;
   int dimy = 16;
   int num_bytes = dimx * dimy * sizeof(int);

   /* Initialize Host and Device Pointers */
   int *d_a = 0, *h_a = 0;

   /* Allocate memory on the Host and Device */
   h_a = (int *) malloc(num_bytes);
   cudaMalloc( (void**) &d_a, num_bytes);

   if( 0 == h_a || 0 == d_a ) {
       printf("couldn't allocate memory\n"); return 1;
   }

   /* Initialize Device memory */
   cudaMemset(d_a, 0, num_bytes);

   /* Define kernel grid and block size */
   dim3 grid, block;
   block.x = 4;
   block.y = 4;
   grid.x = dimx/block.x;
   grid.y = dimy/block.y;

   /* Call Device kernel, asynchronously */
   mykernel<<<grid,block>>>(d_a, dimx, dimy);

   /* Copy results from the Device to the Host*/
   cudaMemcpy(h_a,d_a,num_bytes,cudaMemcpyDeviceToHost);

   /* Print out the results from the Host */
   for(int row = 0; row < dimy; row++) {
      for(int col = 0; col < dimx; col++) {
         printf("%d", h_a[row*dimx+col]);
      }
      printf("\n");
   }

   /* Free the allocated memory on the Device and Host */
   free(h_a);
   cudaFree(d_a);

   return 0;

}

The GPU Device CUDA C kernel code, simple3_device.cu, is:

__global__ void mykernel(int *a, int dimx, int dimy)
{
   int ix = blockIdx.x*blockDim.x + threadIdx.x;
   int iy = blockIdx.y*blockDim.y + threadIdx.y;
   int idx = iy * dimx + ix;

   a[idx] = a[idx] + 1;
}

Using these simple CUDA C routines (or code that you have developed yourself), one can easily create a CPU-GPU executable that is ready to run on one of the CUNY HPC Center's GPU-enabled systems (PENZIAS).

Because of the variety of source and destination code states that the CUDA programming environment must source, generate, and manage, NVIDIA has provided a master program, 'nvcc', called the CUDA compiler driver, to handle all of these possible compilation-phase translations as well as other compiler driver options. The detailed use of 'nvcc' is documented on PENZIAS by 'man nvcc' and also in NVIDIA's Compiler Driver Manual [11]. NOTE: Compiling CUDA Fortran programs can be accomplished using PGI's standard-release Fortran compiler, making sure that the CUDA Fortran code is marked with the '.CUF' suffix, as in 'matmul.CUF'. More on this a bit later.

Among the 'nvcc' command's many groups of options are a series of options that determine what source files 'nvcc' should expect to be offered and what destination files it is expected to produce. A sampling of these compilation phase options includes:

--compile  or  -c     ::  Compile whatever input files are offered (.c, .cc, .cpp, .cu) into object files (*.o file).
--ptx      or  -ptx   ::  Compile all .gpu or .cu input files into device-only .ptx files.
--link     or  -link  ::  Compile whatever input files are offered into an executable (the default).
--lib      or  -lib   ::  Compile whatever input files are offered into a library file (*.a file).

For a typical compilation to an executable, the third and default option above (which is to supply nothing or simply the string '-link') is used. There are a multitude of other 'nvcc' options that control file and path specifications for libraries and include files, control and pass options to 'nvcc' companion compilers and linkers (this includes much of the gcc stack, which must be in the user's path for 'nvcc' to work correctly), and control code generation, among other things. For a complete description, please see the manual referred to above or the 'nvcc' man page. All this complexity relates to the fact that with CUDA one is working in a multi-source and meta-code environment.

Our concern here is generating an executable from the simple example files presented above that can be used (like the PGI executables generated in the previous section) in a SLURM batch submission script. First, we will produce object files (*.o files), and then we will link them into a GPU-Device-ready executable. Here are the 'nvcc' commands for generating the object files:

nvcc -c  simple3_host.cu
nvcc -c  simple3_device.cu

The above commands should be familiar to C programmers and will produce 2 object files, simple3_host.o and simple3_device.o in the working directory. Next, the GPU-Device-ready executable is created with:

nvcc -o simple3.exe *.o

Again, this should be very familiar to C programmers. It should be noted that these two steps can be combined as follows:

nvcc -o simple3.exe *.cu

No additional libraries or include files are required for this simple example, but in a more complex case like those provided in the CUDA Software Development Kit (SDK), library paths and libraries might be specified using the '-L' and '-l' options, include file paths with the '-I' option, among others. Again, details are provided in the 'nvcc' man page or NVIDIA Compiler Driver manual.

We now have an executable, 'simple3.exe', that can be submitted with SLURM to one of the GPU-enabled compute nodes on PENZIAS and that will create and increment a 2D matrix on the GPU, return the results to the CPU, and print them out.

A Sample CUDA GPU Parallel Program Written in PGI's CUDA Fortran

As mentioned, in addition to CUDA C, PGI and NVIDIA have jointly developed a CUDA Fortran programming model and CUDA Fortran compiler. CUDA Fortran has been fully integrated into PGI's Fortran programming environment. The HPC Center's version of the PGI Fortran compiler fully supports CUDA Fortran.

Here, the same example presented above in CUDA C has been translated by HPC Center staff into CUDA Fortran. The CUDA Fortran host driver, or main program, that runs on the compute-node host is presented first, followed by the CUDA Fortran device (GPU) code. The CUDA Fortran model proves to be economical and elegant because it can take advantage of Fortran's array-based syntax. For instance, in CUDA Fortran moving data to and from the device does NOT require calls to cudaMemcpy() or cudaMemset(); it is accomplished using Fortran's native array assignment capability across a simple assignment '=' sign.

   program simple3
!
   use cudafor
   use mykernel_mod
!
   implicit none
!
   integer :: dimx = 16, dimy = 16
   integer :: row = 1, col = 1
   integer :: fail = 0
   integer :: asize = 0
!
   integer, allocatable, dimension(:) :: host_a
   integer, device, allocatable, dimension(:) :: dev_a
!
   type(dim3) :: grid, block

   asize = dimx * dimy

   allocate(host_a(asize),dev_a(asize),stat=fail)

   if(fail /= 0) then
      write(*,'(a)') 'couldn''t allocate memory'
      stop
   end if

   dev_a(:) = 0

   block = dim3(4,4,1)
   grid  = dim3(dimx/4,dimy/4,1)

   call mykernel<<<grid,block>>>(dev_a,dimx,dimy)

   host_a(:) = dev_a(:)

   do row=1,dimy
      do col=1,dimx
         write(*,'(i1)', advance='no') host_a((row-1)*dimx+col)
      end do
      write(*,'(/)', advance='no')
   end do

   deallocate(host_a,dev_a)

   end program

Here is the CUDA Fortran device code:

module mykernel_mod
!
   contains
!
   attributes(global) subroutine mykernel(dev_a,dimx,dimy)
!
   integer, device, dimension(:) :: dev_a
   integer, value  :: dimx, dimy
!
   integer :: ix, iy
   integer :: idx

   ix = (blockidx%x-1)*blockdim%x + threadidx%x
   iy = (blockidx%y-1)*blockdim%y + (threadidx%y-1)
   idx = iy * dimx + ix

   dev_a(idx) = dev_a(idx) + 1

   end subroutine

end module mykernel_mod

Compiling CUDA Fortran code is also simple, requiring nothing more than the default PGI compiler. Here is how the above code would be compiled into a device-ready executable that can be submitted in the same manner as the CUDA C original.

pgf90 -Mcuda -fast -o simple3.exe simple3.CUF

The primary thing to remember is to use the '.CUF' suffix on all CUDA Fortran source files. As mentioned above, the basics of CUDA Fortran are presented here [12].

Submitting CUDA (C or Fortran), GPU-Parallel Programs Using SLURM

The SLURM script for submitting the 'simple3.exe' executable generated by the 'nvcc' compiler driver to PENZIAS is very similar to the script used for the PGI executable provided above:

#!/bin/bash
#SLURM -q production
#SLURM -N CUDA_GPU_job
#SLURM -l select=1:ncpus=1:ngpus=1
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# You must explicitly change to the working directory in SLURM
cd $SLURM_O_WORKDIR

# Point to the CUDA executable to run the job
echo ">>>> Begin SIMPLE CUDA C or Fortran Run ..."
echo ""
./simple3.exe
echo ""
echo ">>>> End   SIMPLE CUDA C or Fortran Run ..."

This script is almost the same as the one explained in the section "Submitting Portland Group, GPU-Parallel Programs Using SLURM" above.

These are the essential SLURM script requirements for submitting any GPU-Device-ready executable. This applies both to GPU-ready executable code generated from native CUDA C or Fortran code and to compiler-directives-based GPU code. Other variations are possible, including jobs that combine MPI or OpenMP (or even both) with GPU parallel programming in a single GPU-SMP-MPI multi-parallel job. These other options are discussed in the more detailed section on SLURM below. The HPC Center staff has developed a series of sample codes showing all of these multi-parallel programming model combinations based on a simple Monte Carlo algorithm for calculating the price of an option. To obtain this example code suite, makefile, and submit scripts, please send a request to hpchelp@csi.cuny.edu.

Submitting CUDA (C or Fortran), GPU-Parallel Programs and Functions Using MATLAB

Please refer to the details in the subsection on MATLAB GPU computing below within the larger section on using MATLAB at the CUNY HPC.

CoArray Fortran and Unified Parallel C (PGAS) Program Compilation and SLURM Job Submission

As part of its plan to offer CUNY HPC Center users a unique variety of HPC parallel programming alternatives (beyond even those described above), the HPC Center supports a two-cabinet, 2816-core Cray XE6m system called SALK. This system supports two newer, similar, language-integrated, and highly scalable approaches to parallel programming: CoArray Fortran (CAF) and Unified Parallel C (UPC). Both are extensions of their parent languages, Fortran and C respectively, and offer a symbolically concise alternative to the de facto standard message-passing model, MPI. CAF and UPC are so-called Partitioned Global Address Space (PGAS) parallel programming models. Unlike MPI, CAF and UPC are not based on a subroutine-library call API.

Both MPI and the PGAS approach to parallel programming rely on a Single Program Multiple Data (SPMD) model. In the SPMD parallel programming model, identical collaborating programs (with fully separate memory spaces, or program images) are executed by different processors that may or may not be separated by a network. Each processor-program produces a different part of the result in parallel by working on different data and taking conditionally different paths through the same code. The PGAS approach differs from MPI in that it abstracts away as much as possible, reducing the way that communication is expressed to minimal built-in extensions to the base language, in our case C and Fortran. In large part, CAF and UPC are free of extension-related, explicit library calls. With the underlying communication layer abstracted away, PGAS languages appear to provide a single, global memory space spanning their processes.

In addition, communication among processes in a PGAS program is one-sided in the sense that any process can read and/or write into the memory of any other process without informing it of its actions. Such one-sided communication has the advantage of being economical, lowering the latency (first byte delay) that is part of the cost of communication among different parallel processes. Lower latency parallel programs are generally more scalable because they waste less time in communication, especially when the data to be moved are small in size, in finer-grained communication patterns.
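As a schematic illustration of this one-sided style (a minimal, self-contained UPC sketch; the array name counter is hypothetical and not taken from the examples later in this section), each thread below writes its ID directly into a neighbouring thread's element of a shared array, and the target thread posts no matching receive:

#include <upc_relaxed.h>
#include <stdio.h>

/* One element of this shared array has affinity to each UPC thread */
shared int counter[THREADS];

int main(void)
{
   /* One-sided remote write: store this thread's ID directly into the
      right-hand neighbour's element of the shared array */
   counter[(MYTHREAD + 1) % THREADS] = MYTHREAD;

   /* Wait until every thread's write has completed */
   upc_barrier;

   if (MYTHREAD == 0)
      printf("counter[0] = %d (written by thread %d)\n",
             counter[0], THREADS - 1);

   return 0;
}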

  • An Example CoArray Fortran (CAF) Code
  • Submitting CoArray Fortran Parallel Programs Using SLURM
  • An Example Unified Parallel C (UPC) Code
  • Submitting UPC Parallel Programs Using SLURM

Summarizing, PGAS languages such as CAF and UPC offer the following potential advantages over MPI:

1. Explicit communication is abstracted out of the PGAS programming model.

2. Process memory is logically unified into a global address space.

3. Parallel work is economically expressed through simple extensions
    to a base language, rather than through a library-call-based API.

4. Parallel coding is easier and more intuitive.

5. Performance and scalability are better because communication latency is lower.

6. Implementation of fine-grained communication patterns is faster, easier.

The primary drawbacks of PGAS programming models include much less widespread support than MPI on common HPC system architectures such as traditional HPC clusters, and the need for special hardware support to get best-case performance out of the PGAS model. Here at the CUNY HPC Center, the Cray XE6m system, SALK, has a custom interconnect (Gemini) that supports both UPC and CAF. These PGAS languages can be run on standard clusters, but the performance is not typically as good. The HPC Center supports Berkeley UPC and Intel CAF on top of standard cluster interconnects without the advantage of PGAS hardware support.

An Example CoArray Fortran (CAF) Code

The following simple example program includes some of the essential features of the CoArray Fortran (CAF) programming model, including multiple-processor, image-spanning co-array variable declaration; one-sided data transfer between CAF's memory-space-distinct images via simple assignment statements; and the use of critical regions and synchronization barriers. No attempt is made here to tutor the reader in all of the features of CAF; rather, the goal is to give the reader a feel for the CAF extensions adopted in the Fortran 2008 standard, which now includes co-arrays. This example, which computes PI by numerical integration, can be cut and pasted into a file and run on SALK.

A tutorial on the CAF parallel programming model can be found here [13], a more formal description of the language specifications here [14], and the actual CAF standard document as defined and adopted by the Fortran standard's committee for Fortran 2008 here [15].

! 
!  Computing PI by Numerical Integration in CAF
!

program int_pi
!
implicit none
!
integer :: start, end
integer :: my_image, tot_images
integer :: i = 0, rem = 0, mseg = 0, nseg = 0
!
real :: f, x
!

! Declare two CAF scalar CoArrays, each with one copy per image

real :: local_pi[*], global_pi[*]

! Define integrand with Fortran statement function, set result
! accuracy through the number of segments

f(x) = 1.0/(1.0+x*x)
nseg = 4096

! Find out my image name and the total number of images

my_image   = this_image()
tot_images = num_images()

! Each image initializes its part of the CoArrays to zero

local_pi  = 0.0
global_pi = 0.0

! Partition integrand segments across CAF images (processors)

rem = mod(nseg,tot_images)

mseg  = nseg / tot_images
start = mseg * (my_image - 1)
end   = (mseg * my_image) - 1

if ( my_image .eq. tot_images ) end = end + rem

! Compute local partial sums on each CAF image (processor)

do i = start,end
  local_pi = local_pi + f((.5 + i)/(nseg))

! The above is equivalent to the following more explicit code:
!
! local_pi[my_image]= local_pi[my_image] + f((.5 + i)/(nseg))
!

enddo

local_pi = local_pi * 4.0 / nseg

! Add local, partial sums to single global sum on image 1 only. Use
! critical region to prevent read-before-write race conditions. In such
! a region, only one image at a time may pass.

critical
 global_pi[1] = global_pi[1] + local_pi
end critical

! Ensure all partial sums have been added using CAF 'sync all' barrier
! construct before writing out results

sync all

! Only CAF image 1 prints the global result

if( this_image() == 1) write(*,"('PI = ', f10.6)") global_pi

end program

This sample code computes PI in parallel using a numerical integration scheme. Taking its key CAF-specific features in order, first we find the declaration of two simple scalar co-arrays (local_pi and global_pi) using CAF's square-bracket co-array notation (e.g. sname[*], vname(1:100)[*], or vname(1:8,1:4)[1:4,*]). The square-bracket notation follows the standard Fortran array-notation rules, except that the last co-dimension is always indicated with an asterisk (*), which is expanded so that the number of co-array copies equals the number of images (processes) the application has launched.

Next, the example uses the this_image() and num_images() intrinsic functions to determine each image's ID (a number from 1 to the number of processors requested) and the total number of images or processes requested by the job. These functions' return values are stored in ordinary, image-local Fortran integer variables and are used later in the example to partition the work among the processors and to define image-specific paths through the code. After the integral segments are partitioned among the CoArray images or processes (using the start and end variables), each image computes its piece of the integral in what is a standard Fortran do loop. However, the variable local_pi, as noted above, is a co-array. Two notations, one implicit and one explicit (but commented out), are presented. The implicit form, with its square-bracket notation dropped, is allowed (and encouraged for optimization reasons) when only the image-local part of a co-array is referenced by a given image. The explicit form makes it clear through the square-bracket suffix [my_image] that each image is working with a local element of the local_pi co-array. When the practice of dropping the square brackets is adopted as a notational convention, all remote co-array references (which are more time-consuming operations) are immediately, visually identifiable in the code by their square-bracket suffixes. Optimal coding practice should seek to minimize the use of square-bracketed references where possible.

With the local partial sums computed by each image and placed in its piece of the local_pi[*] co-array, a global sum is then safely computed and later written out only on image 1 with the help of a CAF critical region. Within a critical region, only one image (process) may pass at a time. This ensures that global_pi[1] is accurately accumulated from each local_pi[my_image], avoiding the mistakes that could be caused by simultaneous reads of a still partially summed global_pi[1] before each image-specific increment had been written. Here, the variable global_pi[1] carries the square-bracket notation, a reminder that each image (process) is writing its result into the memory space of image 1. This is a remote write for every image except image 1.

The last section of the code synchronizes the images (sync all) to ensure all partial sums have been added, and then has image 1 write out the global result. Note that, as written here, only image 1 has the global result. For a more detailed treatment of the CoArray Fortran language extension, now part of the Fortran 2008 standard, please see the web references included above.


The CUNY HPC Center supports CoArray Fortran both on its Cray XE6 system, SALK (which has custom hardware and software support for the UPC and CAF PGAS languages), and on its other systems, where the Intel Cluster Studio provides a beta-level implementation of CoArray Fortran layered on top of Intel's MPI library, an approach that offers CAF's coding simplicity but no performance advantage over MPI.

Here, the process of compiling a CAF program is described both for Cray's CAF on SALK and for Intel's CAF on the HPC Center's other systems. On the Cray, compiling a CAF program such as the example above simply requires adding an option to the Cray Fortran compiler, as follows:

salk:
salk: module load PrgEnv-cray
salk:
salk: ftn -h caf -o int_PI.exe int_PI.f90
salk:
salk: ls
int_PI.exe
salk:

In the sequence above, first the Cray programming environment is loaded using the 'module' command; then the Cray Fortran compiler is invoked with the -h caf option to include the CAF features of the Fortran compiler. The result is a CAF-enabled executable that can be run with Cray's parallel job initiation command 'aprun'. This compilation was done in dynamic mode so that any number of processors (CAF images) can be selected at run time using the -n ## option to Cray's 'aprun' command. The required form of the 'aprun' command is shown below in the section on CAF program job submission using SLURM on the Cray.

To compile for a fixed number of processors (a static compile) or CAF images use the -X ## option on the Cray, as follows:

salk:
salk: ftn -X 32 -h caf -o int_PI_32.exe int_PI.f90
salk:
salk: ls
int_PI_32.exe
salk:

In this example, the PI example program has been compiled for 32 processors or CAF images, and therefore must be invoked with that many processors on the 'aprun' command line:

aprun -n 32 -N 16 ./int_PI_32.exe

On the HPC Center's other systems, compilation is conceptually similar, but uses the Intel Fortran compiler 'ifort' and requires a CAF configuration file to be defined by the user. Here is a typical configuration file to compile statically for 16 CAF images, followed by the compilation command. This compilation requests a distributed-mode compilation, in which distinct CAF images are not expected to be on the same physical node.

andy$cat cafconf.txt
-rr -envall -n 16 ./int_PI.exe
andy$
andy$ifort -o int_PI.exe -coarray=distributed -coarray-config-file=cafconf.txt int_PI.f90

The Intel CAF compiler is relatively new and has had limited testing on CUNY HPC systems. It also makes use of Intel's MPI rather than the CUNY HPC Center default, OpenMPI, which means that Intel CAF jobs will not be properly accounted for. As such, we recommend that the Intel CAF compiler be used for development and testing only, and that production CAF codes be run on SALK using Cray's CAF compiler. An upgrade to the Intel Compiler Suite is planned for the near future, and this should improve the performance and functionality of Intel's CAF compiler release. Additional documentation on using Intel CoArray Fortran is available here.

Submitting CoArray Fortran Parallel Programs Using SLURM

Finally, here are two SLURM scripts that will run the above CAF executable. First, one for the Cray XE6 system, SALK:

#!/bin/bash
#SLURM -q production
#SLURM -N CAF_example
#SLURM -l select=64:ncpus=1:mem=2000mb
#SLURM -l place=free
#SLURM -o int_PI.out
#SLURM -e int_PI.err
#SLURM -V

cd $SLURM_O_WORKDIR

aprun -n 64 -N 16 ./int_PI.exe

Above, the dynamically compiled executable is run on 64 SALK Cray XE6 cores (-n 64) with 16 cores packed to a physical node (-N 16). More detail is presented below on SLURM job submission to the Cray and on the use of the Cray 'aprun' command. On the Cray, 'man aprun' provides an important and detailed account of the 'aprun' command-line options and their function. One cannot fully understand job control and submission on the Cray (SALK) without understanding the 'aprun' command.

A SLURM script for the example code compiled dynamically (or statically) for 16 processors with the Intel compiler (ifort) for execution on one of the HPC Center's more traditional HPC clusters looks like this:

#!/bin/bash
#SLURM -q production
#SLURM -N CAF_example
#SLURM -l select=16:ncpus=1:mem=1920mb
#SLURM -l place=scatter
#SLURM -V

echo ""
echo -n "The primary compute node hostname is: "
hostname
echo ""
echo -n "The location of the SLURM nodefile is: "
echo $SLURM_NODEFILE
echo ""
echo "The contents of the SLURM nodefile are: "
echo ""
cat  $SLURM_NODEFILE
echo ""
NCNT=`uniq $SLURM_NODEFILE | wc -l - | cut -d ' ' -f 1`
echo -n "The node count determined from the nodefile is: "
echo $NCNT
echo ""

# Change to working directory
cd $SLURM_O_WORKDIR

echo "You are using the following 'mpiexec' and 'mpdboot' commannds: "
echo ""
type mpiexec
type mpdboot
echo ""

echo "Starting the Intel 'mpdboot' daemon on $NCNT nodes ... "
mpdboot -n $NCNT --verbose --file=$SLURM_NODEFILE -r ssh
echo ""

mpdtrace
echo ""

echo "Starting an Intel CAF job requesting 16 cores ... "

./int_PI.exe

echo "CAF job finished ... "
echo ""

echo "Making sure all mpd daemons are killed ... "
mpdallexit
echo "SLURM CAF script finished ... "
echo ""

Here, the SLURM script requests 16 processors (CAF images). It simply names the executable itself to set up the Intel CAF runtime environment, engage the 16 processors, and initiate execution. This script is more elaborate because it includes the procedure for setting up and tearing down the Intel MPI environment on the nodes that SLURM has selected to run the job.

An Example Unified Parallel C (UPC) Code (IN REVIEW)

The following simple example program includes the essential features of the Unified Parallel C (UPC) programming model, including shared (globally distributed) variable declaration and blocking, one-sided data transfer between UPC's memory-space-distinct threads via simple assignment statements, and synchronization barriers. No attempt is made here to tutor the reader in all of the features of UPC; rather, the goal is to give the reader a feel for basic UPC extensions to the C programming language. A tutorial on the UPC programming model can be found here [16], a user guide here [17], and a more formal description of the language specifications here [18]. Cray also has its own documentation on UPC [19].

// 
//  Computing PI by Numerical Integration in UPC
//

// Select memory consistency model (default).

#include<upc_relaxed.h> 

#include<math.h>
#include<stdio.h>

// Define integrand with a macro and set result accuracy

#define f(x) (1.0/(1.0+x*x))
#define N 4096

// Declare UPC shared scalar, shared vector array, and UPC lock variable.

shared float global_pi = 0.0;
shared [1] float local_pi[THREADS];
upc_lock_t *lock;

void main(void)
{
   int i;

   // Allocate a single, globally-shared UPC lock. This 
   // function is collective, intial state is unlocked.

   lock = upc_all_lock_alloc();

   // Each UPC thread initializes its local piece of the
   // shared array.

   local_pi[MYTHREAD] = 0.0;

   // Distribute work across threads using local part of shared
   // array 'local_pi' to compute PI partial sum on thread (processor)

   for(i = 0; i <  N; i++) {
       if(MYTHREAD == i%THREADS) local_pi[MYTHREAD] += (float) f((.5 + i)/(N));
   } 

   local_pi[MYTHREAD] *= (float) (4.0 / N);

   // Compile local, partial sums to single global sum.
   // Use locks to prevent read-before-write race conditions.

   upc_lock(lock);
   global_pi += local_pi[MYTHREAD];
   upc_unlock(lock);

   // Ensure all partial sums have been added with UPC barrier.

   upc_barrier;

   // UPC thread 0 prints the results and frees the lock.

   if(MYTHREAD==0) printf("PI = %f\n",global_pi);
   if(MYTHREAD==0) upc_lock_free(lock);

}

This sample code computes PI in parallel using a numerical integration scheme. Taking the key UPC-specific features present in this example in order, first we find the declaration of the memory consistency model to be used in this code. The default choice is relaxed, which is selected explicitly here. The relaxed model places on the programmer the burden of ordering any dependent shared-memory operations, through the use of barriers, fences, and locks. This code includes explicit locks and barriers to ensure that memory operations are complete and that the processors have been synchronized.
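For illustration only (none of these lines appear in the example above, and the strict header would replace, not accompany, upc_relaxed.h in a given source file), the consistency model can also be set file-wide or chosen per variable with the strict and relaxed qualifiers defined by the UPC specification:

#include <upc_strict.h>        /* make strict ordering the file-wide default instead */

strict  shared int flag;       /* accesses to flag are always globally ordered */
relaxed shared int scratch;    /* accesses to scratch may be reordered         */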

Next, three declarations outside the main body of the application demonstrate the use of UPC's shared type. First, a scalar shared variable global_pi is declared. This variable can be read from and written to by any of the UPC threads (processors) allocated to the application by the runtime environment when it is executed. It will hold the final result of the calculation of PI in this example. Shared scalar variables are singular and always reside in the shared memory of THREAD 0 in UPC.

Next, a shared one-dimensional array local_pi with a block size of one (1) and a size of THREADS is declared. The THREADS macro is always set to the number of processors (UPC threads) requested by the job at runtime. All elements of this shared array are accessible by all threads allocated to the job. The block size of one means that array elements are distributed, one per thread, across the logically Partitioned Global Address Space (PGAS) of this parallel application. One is the default block size for shared arrays, but other sizes are possible, as illustrated just below.
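For example (an illustration only; neither declaration appears in the program above), a declared block size larger than one distributes contiguous chunks of elements round-robin across the threads:

shared     float cyclic[THREADS*8];    /* default block size of 1: one element per thread, round-robin  */
shared [4] float chunked[THREADS*8];   /* elements 0-3 on thread 0, elements 4-7 on thread 1, and so on */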

Finally, a pointer to a special shared lock variable is declared. Because UPC defines both shared and private memory spaces for each program image or thread, it must support four classes of pointers: private pointers to private data, private pointers to shared data, shared pointers to private data, and shared pointers to shared data. The lock pointer declared here refers to data in the shared space, which makes the lock itself available to all threads (each thread holds its own copy of the pointer). In the body of the code, the lock's memory is allocated and placed in the unlocked state with the call to upc_all_lock_alloc().
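For reference, the four pointer classes can be declared as follows (a generic illustration with hypothetical names; none of these declarations appear in the example above):

int *p1;                  /* private pointer to private data                  */
shared int *p2;           /* private pointer to shared data (the common case) */
int *shared p3;           /* shared pointer to private data (rarely useful)   */
shared int *shared p4;    /* shared pointer to shared data                    */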

Next, each thread initializes its piece of the shared array local_pi to zero with the help of the MYTHREAD macro, which contains the thread identifier of the particular thread making the assignment. In this way, each UPC thread initializes only the part of the shared array that lies in its portion of the shared PGAS memory. The standard C for-loop that follows divides the work of integration among the different UPC threads so that each thread works only on its local portion of the shared array local_pi. UPC also provides a work-sharing loop construct, upc_forall, that accomplishes the same thing implicitly, as sketched just below.
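As a sketch of that alternative (not part of the program above), the explicit thread test in the for-loop could be written with upc_forall; its fourth expression, the affinity expression, assigns each iteration to a thread, and an integer affinity such as i places iteration i on thread i % THREADS, exactly as the if-test does:

   // Work-sharing form of the integration loop (sketch only)
   upc_forall(i = 0; i < N; i++; i) {
       local_pi[MYTHREAD] += (float) f((.5 + i)/(N));
   }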

Processor-local (UPC thread) partial sums are then summed globally, and in a memory-consistent fashion, with the help of the UPC lock functions upc_lock() and upc_unlock(). Without the explicit locking here, nothing would prevent two UPC threads from reading the same current value of global_pi before either had written back its updated partial sum, producing an incorrect, under-summed result. Next, a upc_barrier ensures that all the summing is complete before the result is printed and the lock's memory is freed.

This example includes some of the more important UPC PGAS-parallel extensions to the C programming language, but a complete review of the UPC parallel extensions to C is provided in the web documentation referenced above.

As suggested above, the CUNY HPC Center supports UPC both on its Cray XE6 system, SALK (which has custom hardware and software support for the UPC and CAF PGAS languages), and on its other systems, where Berkeley UPC is installed and uses the GASNET library to support the PGAS memory abstraction on top of a number of standard underlying cluster interconnects. At the HPC Center this includes Ethernet and/or InfiniBand, depending on the CUNY HPC Center cluster system being used.

Here, the process of compiling a UPC program is described both for Cray's UPC on SALK and for Berkeley UPC on the HPC Center's other systems. On the Cray, compiling a UPC program such as the example above simply requires adding an option to the Cray C compiler, as follows:

salk:
salk: module load PrgEnv-cray
salk:
salk: cc -h upc -o int_PI.exe int_PI.c
salk:
salk: ls
int_PI.exe
salk:

First, the Cray programming environment is loaded using the 'module' command; then the Cray compiler is invoked with the -h upc option to include the UPC elements of the compiler. The result is an executable that can be run with Cray's parallel job initiation command 'aprun'. This compilation was done in dynamic mode so that any number of processors (UPC threads) can be selected at run time using the -n ## option to 'aprun'. The required form of the 'aprun' line is shown below in the section on UPC program SLURM job submission.

To compile for a fixed number of processors (a static compile) or UPC threads use the -X ## option on the Cray, as follows:

salk:
salk: cc -X 32 -h upc -o int_PI_32.exe int_PI.c
salk:
salk: ls
int_PI_32.exe
salk:

In this example, the PI example program has been compiled for 32 processors or UPC threads, and therefore must be invoked with that many processors on the 'aprun' command line:

aprun -n 32 -N 16 ./int_PI_32.exe

On the HPC Center's other systems, compilation is conceptually similar, but uses the Berkeley UPC compiler driver 'upcc'.

andy:
andy: upcc  -o int_PI.exe int_PI.c
andy:
andy: ls
int_PI.exe
andy:

Similarly, the 'upcc' compiler driver from Berkeley allows for static compilations using its -T ## option:

andy:
andy: upcc -T 32  -o int_PI_32.exe int_PI.c
andy:
andy: ls
int_PI_32.exe
andy:

The Berkeley UPC compiler driver has a number of other useful options that are described in its 'man' page. In particular, the -network= option will target the executable for the GASNET communication conduit of the user's choosing on systems that have multiple interconnects (Ethernet and InfiniBand, for instance) or target the default version of MPI as the communication layer. Type 'man upcc' for details.

In general, users can expect better performance from Cray's UPC compiler on SALK, but having UPC on the HPC Center's traditional cluster architectures provides another location for development and supports the wider use of UPC as an alternative to MPI. In theory, well-written UPC code should perform as well as MPI on a standard cluster while reducing the number of lines of code needed to achieve that performance. In practice, this is not always the case; more development and hardware support are still needed to get the best performance from PGAS languages on commodity cluster environments.

Submitting UPC Parallel Programs Using SLURM (IN REVIEW)

Finally, here are two SLURM scripts that will run the above UPC executable. First, one for the Cray XE6 system, SALK:

#!/bin/bash
#SLURM -q production
#SLURM -N UPC_example
#SLURM -l select=64:ncpus=1:mem=2000mb
#SLURM -l place=free
#SLURM -o int_PI.out
#SLURM -e int_PI.err
#SLURM -V

cd $SLURM_O_WORKDIR

aprun -n 64 -N 16 ./int_PI.exe

Here the dynamically compiled executable is run on 64 Cray XE6 cores (-n 64), 16 cores packed to a physical node (-N 16). More detail is presented below on SLURM job submission on the Cray and on the use of the Cray 'aprun' command. On the Cray, 'man aprun' provides an important and detailed account of the 'aprun' command-line options and their function. One cannot fully understand job control on the Cray (SALK) without understanding 'aprun'.

A similar SLURM script for the example code compiled dynamically (or statically) for 32 processors with the Berkeley UPC compiler (upcc) for execution on one of the HPC Center's more traditional HPC clusters looks like this:

#!/bin/bash
#SLURM -q production
#SLURM -N UPC_example
#SLURM -l select=32:ncpus=1:mem=1920mb
#SLURM -l place=free
#SLURM -o int_PI.out
#SLURM -e int_PI.err
#SLURM -V

cd $SLURM_O_WORKDIR

upcrun -n 32 ./int_PI.exe

Here, the SLURM script requests 32 processors (UPC threads). It uses the 'upcrun' command to set up the Berkeley UPC runtime environment, engage the 32 processors, and initiate execution. Please type 'man upcrun' for details on the 'upcrun' command and its options.

Available Mathematical Libraries

  • FFTW Scientific Library
  • GNU Scientific Library
  • MKL
  • IMSL
    • Fortran Example
    • C Example

FFTW Scientific Library

FFTW is a C subroutine library for computing the Discrete Fourier Transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).

The library is described in detail at the FFTW home page at http://www.fftw.org. The CUNY HPC Center has installed FFTW versions 2.1.5 (older), 3.2.2 (default), and 3.3.0 (recent release) on ANDY. All versions were built in both 32-bit and 64-bit floating-point formats using the latest Intel 12.0 release of their compilers. In addition, versions 2.1.5 and 3.3.0 support an MPI-parallel version of the library. The default version at the CUNY HPC Center is version 3.2.2 (64-bit), located in /share/apps/fftw/default/*.

The reason for the extra versions is that over the course of FFTW's development some changes were made to the API of the MPI-parallel library. Version 2.1.5 supports the older MPI-parallel API and the recently released version 3.3.0 supports a newer MPI-parallel API. NOTE: The default version (3.2.2) does NOT include an MPI-parallel library, because MPI support was skipped in that generation of FFTW. A threads version of each library was also built.

Please refer to the on-line documentation at the FFTW website for details on using the library (whatever the version). With the calls properly included in your code, you can link against the default version at compile and link time with:

icc -o my_fftw.exe my_fftw.c -L/share/apps/fftw/default/lib -lfftw3 

(pgcc or gcc would be used in the same way)

For the non-default versions substitute the version directory for the string 'default' above. For example, for the new 3.3 release in 32-bit use:

icc -o my_fftw.exe my_fftw.c -L/share/apps/fftw/3.3_32bit/lib -lfftw3f

For an MPI-parallel, 64-bit version of 3.3 use:

mpicc -o my_mpi_fftw.exe my_mpi_fftw.c -L/share/apps/fftw/3.3_64bit/lib -lfftw3_mpi

The include files for each release are in the 'include' directory alongside that version's lib directory. The names of all available libraries for each release can be found by simply listing the contents of the appropriate version's lib directory. Do this, for instance, to find the names of the threads version of each library.
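To give a concrete sense of what such a code looks like, here is a minimal, self-contained sketch using the serial double-precision FFTW 3.x interface (an illustration only; the file name my_fftw.c, the transform length, and the test data are arbitrary, and the FFTW documentation remains the authoritative reference for the API):

#include <stdio.h>
#include <fftw3.h>

int main(void)
{
   const int n = 8;
   fftw_complex *in, *out;
   fftw_plan p;
   int i;

   /* Allocate FFTW-aligned input and output arrays */
   in  = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * n);
   out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * n);

   /* Fill the input with a simple test signal */
   for (i = 0; i < n; i++) {
      in[i][0] = (double) i;   /* real part      */
      in[i][1] = 0.0;          /* imaginary part */
   }

   /* Plan and execute a 1D forward DFT */
   p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
   fftw_execute(p);

   for (i = 0; i < n; i++)
      printf("out[%d] = %f + %fi\n", i, out[i][0], out[i][1]);

   /* Clean up */
   fftw_destroy_plan(p);
   fftw_free(in);
   fftw_free(out);
   return 0;
}

It can be compiled and linked exactly as shown above, adding -I/share/apps/fftw/default/include (the include directory that sits alongside the version's lib directory) so that the compiler can find fftw3.h.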

GNU Scientific Library

The GNU Scientific Library (GSL) is a numerical library for C and C++ programmers. It is free software under the GNU General Public License.

The library provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. There are over 1000 functions in total with an extensive test suite.

Here is an example of code that uses GSL routines:

#include <stdio.h>
#include <gsl/gsl_sf_bessel.h>
 
int main(void)
{
  double x = 5.0;
  double y = gsl_sf_bessel_J0(x);
  printf("J0(%g) = %.18e\n", x, y);
  return 0;
}

The example program has to be linked against the GSL library when it is compiled:

gcc $(/share/apps/gsl/default/bin/gsl-config --cflags) test.c $(/share/apps/gsl/default/bin/gsl-config --libs)

The output is shown below, and should be correct to double-precision accuracy:

J0(5) = -1.775967713143382642e-01

Complete GNU Scientific Library documentation may be found on the official website of the project: http://www.gnu.org/software/gsl/

MKL

Documentation to be added.

IMSL

IMSL (International Mathematics and Statistics Library) is a commercial collection of numerical analysis software libraries implemented by Visual Numerics in the C, Java, C#.NET, and Fortran programming languages.

C and Fortran implementations of IMSL are installed on the Bob cluster under

/share/apps/imsl/cnl701 

and

/share/apps/imsl/fnl600

respectively.

Fortran Example

Here is an example of a Fortran program that uses IMSL routines:

! Use files
 
       use rand_gen_int
       use show_int
 
!  Declarations
 
       real (kind(1.e0)), parameter:: zero=0.e0
       real (kind(1.e0)) x(5)
       type (s_options) :: iopti(2)=s_options(0,zero)
       character VERSION*48, LICENSE*48, VERML*48
       external VERML
 
!  Start the random number generator with a known seed.
       iopti(1) = s_options(s_rand_gen_generator_seed,zero)
       iopti(2) = s_options(123,zero)
       call rand_gen(x, iopt=iopti)
 
!     Verify the version of the library we are running
!     by retrieving the version number via verml().
!     Verify correct installation of the license number
!     by retrieving the customer number via verml().
!
      VERSION = VERML(1)
      LICENSE = VERML(4)
      WRITE(*,*) 'Library version:  ', VERSION
      WRITE(*,*) 'Customer number:  ', LICENSE

!  Get the random numbers
       call rand_gen(x)
 
!  Output the random numbers
       call show(x,text='                              X')

! Generate error
       iopti(1) = s_options(15,zero)
       call rand_gen(x, iopt=iopti)
 
       end

To compile this example, use:

 . /share/apps/imsl/imsl/fnl600/rdhin111e64/bin/fnlsetup.sh

ifort -openmp -fp-model precise -I/share/apps/imsl/imsl/fnl600/rdhin111e64/include -o imslmp imslmp.f90 -L/share/apps/imsl/imsl/fnl600/rdhin111e64/lib -Bdynamic -limsl -limslsuperlu -limslscalar -limslblas -limslmpistub -limf -Xlinker -rpath -Xlinker /share/apps/imsl/imsl/fnl600/rdhin111e64/lib


To run it in batch mode, use the standard submission procedure described in the section Program Compilation and Job Submission. A successful run will generate the following output:

 Library version:  IMSL Fortran Numerical Library, Version 6.0     
 Customer number:  702815                                          
                               X
     1 -    5   9.320E-01  7.865E-01  5.004E-01  5.535E-01  9.672E-01

 *** TERMINAL ERROR 526 from s_error_post.  s_/rand_gen/ derived type option
 ***          array 'iopt' has undefined option (15) at entry (1).

C Example

Here is a more complicated example in C.

#include <stdio.h>
#include <imsl.h>

int main(void)
{
    int         n = 3;
    float       *x;
    static float        a[] = { 1.0, 3.0, 3.0,
                                1.0, 3.0, 4.0,
                                1.0, 4.0, 3.0 };
    static float        b[] = { 1.0, 4.0, -1.0 };

    /*
     * Verify the version of the library we are running by
     * retrieving the version number via imsl_version().
     * Verify correct installation of the error message file
     * by retrieving the customer number via imsl_version().
     */
    char        *library_version = imsl_version(IMSL_LIBRARY_VERSION);
    char        *customer_number = imsl_version(IMSL_LICENSE_NUMBER);

    printf("Library version:  %s\n", library_version);
    printf("Customer number:  %s\n", customer_number);

                                /* Solve Ax = b for x */
    x = imsl_f_lin_sol_gen(n, a, b, 0);
                                /* Print x */
    imsl_f_write_matrix("Solution, x of Ax = b", 1, n, x, 0);
                               /* Generate Error to access error 
                                  message file */
    n =-10;

    printf ("\nThe next call will generate an error \n");
    x = imsl_f_lin_sol_gen(n, a, b, 0);
}

To compile this example, use:

. /share/apps/imsl/imsl/cnl701/rdhsg111e64/bin/cnlsetup.sh

icc -ansi -I/share/apps/imsl/imsl/cnl701/rdhsg111e64/include -o cmath cmath.c -L/share/apps/imsl/imsl/cnl701/rdhsg111e64/lib -L/share/apps/intel/composerxe-2011.0.084/mkl/lib/em64t -limslcmath -limslcstat -limsllapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -lgfortran -i_dynamic -Xlinker -rpath -Xlinker /share/apps/imsl/imsl/cnl701/rdhsg111e64/lib -Xlinker -rpath -Xlinker /share/apps/intel/composerxe-2011.0.084/mkl/lib/em64t

To run the binary in batch mode, use the standard submission procedure described in the section Program Compilation and Job Submission. A successful run will generate the following output:

Library version:  IMSL C/Math/Library Version 7.0.1
Customer number:  702815
 
       Solution, x of Ax = b
         1           2           3
        -2          -2           3

The next call will generate an error 

*** TERMINAL Error from imsl_f_lin_sol_gen.  The order of the matrix must be
***          positive while "n" = -10 is given.