Program Compilation

Program Compilation and Job Submission (In REVIEW)

Serial Program Compilation

The CUNY HPC Center currently supports four compiler suites: those from Cray, Intel, The Portland Group (PGI), and GNU. Basic serial programs in C, C++, and Fortran can be compiled with any of these offerings, although the Cray compilers are available only on SALK. Man pages (for Cray, man cc; for Intel, man icc; for PGI, man pgcc; for GNU, man gcc) and manuals exist for each compiler in each suite and provide details on specific compiler flags. Optimized performance on a particular system with a particular compiler often depends on the compiler options chosen. The MPI-wrapped versions of each compiler (mpicc, mpif90, etc.) accept the same flags as the underlying compiler. [NOTE: SALK does not use mpi-prefixed MPI compile and run tools; it has its own.] Program debuggers and performance profilers are also part of each of these suites.

  • The Intel Compiler Suite
  • The GNU Compiler Suite

The Intel Compiler Suite

Intel's Cluster Studio (ICS) compilers, debuggers, profilers, and libraries are available on all HPC Center cluster systems. Note that the icc suite is being phased out by Intel and will soon be removed from the HPCC environment.

To check for the default version installed on systems:

icc  -V

Compiling a serial C program on systems other than SALK:

icc  -O3 -unroll mycode.c

The line above invokes Intel's C compiler (also used by the default OpenMPI 'mpicc' wrapper for icc). It requests level 3 optimization and asks that loops be unrolled for performance. To find out more about 'icc', type 'man icc'.

Similarly for Intel Fortran and C++.

Compiling a serial Fortran program on systems other than SALK:

ifort -O3 -unroll mycode.f90

Compiling a serial C++ program on systems other than SALK:

icpc -O3 -unroll mycode.C

On SALK, Cray's generic wrappers (cc, CC, ftn) are used for each compiler suite (Intel, PGI, Cray, GNU). To map SALK's Cray wrappers to the Intel compiler suite, users must unload the default Cray compiler modules and load the Intel compiler modules, as follows:

module unload cce
module unload PrgEnv-cray
module load PrgEnv-intel
module load intel

Loading these modules establishes the mappings listed below and also sets the Cray environment to link against Cray's build of MPICH2 for its custom interconnect, as well as other Intel-specific Cray library builds. Once the Intel modules are loaded, you may compile either serial or MPI parallel programs on SALK:

cc      ==>   icc
CC     ==>   icpc
ftn     ==>   ifort

Once the mapping is complete, the mapped commands listed above will invoke the corresponding Intel compiler and accept that compiler's options. NOTE: Using the Intel compiler names directly on SALK will likely cause a problem, because the Cray-specific libraries (such as the Cray version of MPI) will not be included in the link phase, unless the intention is to run the executable only on the Cray login node.

So to compile a serial (or MPI parallel) C program on SALK after loading the Intel modules:

cc -O3 -unroll mycode.c

Doing the same on SALK for Intel Fortran and C++ programs is left as an exercise for the reader.

The GNU Compiler Suite

The GNU compilers, debuggers, profilers, and libraries are available on all HPC Center cluster systems, although unlike the other compilers mentioned, the default and mix of installed versions may not be the same on each system. This is because the HPC Center runs different versions of Linux (SUSE and CentOS) at different release levels.

To check for the default version installed:

gcc  -v

Compiling a serial C program on systems other than SALK:

gcc  -O3 -funroll-loops mycode.c

The line above invokes GNU's C compiler (also used by GNU mpicc). It requests level 3 optimization and that loops be unrolled for performance. To find out more about 'gcc', type 'man gcc'.

Similarly for Fortran and C++.

Compiling a serial Fortran program on systems other than SALK:

gfortran -O3 -funroll-loops mycode.f90

Compiling a serial C++ program (uses g++, the GNU C++ driver, so that the C++ standard library is linked) on systems other than SALK:

g++ -O3 -funroll-loops mycode.C

On SALK, Cray's generic wrappers (cc, CC, ftn) are used for each compiler suite (Intel, PGI, Cray, GNU). To map SALK's Cray wrappers to the GNU compiler suite, users must unload the currently loaded programming environment module and load the GNU one. The commands below assume that the Intel environment from the previous section is still loaded:

module unload PrgEnv-intel
module load PrgEnv-gnu

Loading this module establishes the mappings listed below and also sets the environment to link against Cray's build of MPICH2 for its custom interconnect, as well as other GNU-specific Cray library builds. Once the GNU modules are loaded, you may compile either serial or MPI parallel programs on SALK:

cc      ==>   gcc
CC     ==>   g++
ftn     ==>   gfortran

Once the mapping is complete, the mapped commands listed above will invoke the corresponding GNU compiler and accept that compiler's options. NOTE: Using the GNU names directly on SALK will likely cause a problem, because the Cray-specific libraries (such as the Cray version of MPI) will not be included in the link phase, unless the intention is to run the executable only on the Cray login node.

So to compile a serial (or MPI parallel) C program on SALK after loading the GNU modules:

cc -O3 -funroll-loops mycode.c

Doing the same on SALK for GNU Fortran and C++ programs is left as an exercise for the reader.

OpenMP, OpenMP SMP-Parallel Program Compilation, and SLURM Job Submission

All the compute nodes on all the systems at the CUNY HPC Center include at least 2 sockets and multiple cores. Some have 8 cores (ZEUS, ANDY), and some have 16 (SALK and PENZIAS). These multicore, SMP compute nodes offer the CUNY HPC Center user community the option of creating parallel programs using the OpenMP Symmetric Multi-Processing (SMP) parallel programming model. SMP parallel programming with OpenMP (and other SMP models) follows the original parallel processing model, because the earliest parallel HPC systems were built only with shared memories. The Cray X-MP (circa 1982) was among the first systems in this class. Shared-memory, multi-socket, and multi-core designs are now typical of even today's desktop and portable PC and Mac systems. On the CUNY HPC Center systems, each compute node is similarly a shared-memory, symmetric multi-processing system that can compute in parallel using the OpenMP shared-memory model.

In the SMP model, multiple processors work simultaneously within a single program's memory space (image). This eliminates the need to copy data from one program (process) image to another (required by MPI) and simplifies the parallel run-time environment significantly. As such, writing parallel programs to the OpenMP standard is generally easier and requires many fewer lines of code. However, the size of the problem that can be addressed using OpenMP is limited by the amount of memory on a single compute node, and similarly the parallel performance improvement to be gained is limited by the number of processors (cores) within that single node.

As of Q4 2012 at CUNY's HPC Center, OpenMP applications can run with a maximum of 16 cores (this is on SALK, the Cray XE6m system). Most of the HPC Center's other systems are limited to 8 core OpenMP parallelism.

  • Compiling OpenMP Programs Using the Intel Compiler Suite
  • Compiling OpenMP Programs Using the GNU Compiler Suite
  • Submitting an OpenMP Program to the SLURM Batch Queueing System

Here, a simple OpenMP parallel version of the standard C "Hello, World!" program is set to run on 8 cores:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NPROCS 8

int main (int argc, char *argv[]) {

  int nthreads, num_threads=NPROCS, tid;

  /* Set the number of threads */
  omp_set_num_threads(num_threads);

  /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
  {

    /* Each thread obtains its thread number */
    tid = omp_get_thread_num();

    /* Each thread executes this print */
    printf("Hello World from thread = %d\n", tid);

    /* Only the master thread does this */
    if (tid == 0)
      {
        nthreads = omp_get_num_threads();
        printf("Total number of threads = %d\n", nthreads);
      }

  }  /* All threads join master thread and disband */

  return 0;
}

An excellent and comprehensive tutorial on OpenMP with examples can be found at the Lawrence Livermore National Lab web site: (https://computing.llnl.gov/tutorials/openMP)

Compiling OpenMP Programs Using the Intel Compiler Suite

The Intel C compiler requires the '-openmp' option (spelled '-qopenmp' in newer Intel compiler releases), as follows:

icc  -o hello_omp.exe -openmp hello_omp.c

When run, the program above produces the following output:

$ ./hello_omp.exe
Hello World from thread = 0
Total number of threads = 8
Hello World from thread = 1
Hello World from thread = 2 
Hello World from thread = 6
Hello World from thread = 4
Hello World from thread = 3
Hello World from thread = 5
Hello World from thread = 7

OpenMP is supported in Intel's C, C++, and Fortran compilers; as such, a Fortran version of the program above could be used to produce similar results. An important feature of OpenMP threads is that they are logical entities that are not by default locked to physical processors. The code above requesting 8 threads would run and produce similar results on a compute node with only 2 or 4 processors, or even 1 processor. In these cases, it would simply take more wall-clock time to complete.

When more threads are requested than there are physical processors on the motherboard, the threads simply compete for access to the physical cores actually available. Under such circumstances, the maximum program speedup is limited to the number of unshared physical processors (cores) available to the OpenMP job, less the overhead required to start OpenMP (this ignores Intel's 'hyperthreading', which allows two threads to share sub-resources not in simultaneous use within a single processor).

Compiling OpenMP Programs Using the GNU Compiler Suite

The GNU C compiler requires its '-fopenmp' option for OpenMP programs, as follows:

gcc  -o hello_omp.exe -fopenmp hello_omp.c

The program produces the same output, although the order of the print statements cannot be predicted and will not necessarily be the same over repeated runs.

OpenMP is supported in GNU's C, C++, and Fortran compilers; therefore a Fortran version of the program above could be used to produce similar results.

Submitting an OpenMP Program to the SLURM Batch Queueing System

All non-trivial jobs (development or production, parallel or serial) must be submitted to HPC Center system compute nodes from each system's head or login node using a SLURM script. Jobs run interactively on system head nodes that place a significant and sustained load on the head node will be terminated. Details on the use of SLURM are presented later in this document; however, here we present a basic SLURM script ('my_ompjob') that can be used to submit any OpenMP SMP program for batch processing on one of the CUNY HPC Center compute nodes.

#!/bin/bash
#SLURM -q production
#SLURM -N openMP_job
#SLURM -l select=1:ncpus=8
#SLURM -l place=free
#SLURM -V

# You must explicitly change to your working directory in SLURM
# The SLURM_O_WORKDIR variable is automatically filled with the path
# to the directory you submit your job from

cd $SLURM_O_WORKDIR
echo ""

# The SLURM_NODEFILE file contains the compute nodes assigned
# to the job by SLURM.  The next line prints them.

cat $SLURM_NODEFILE
echo ""

# It is possible to set the number of threads to be used in
# an OpenMP program using the environment variable OMP_NUM_THREADS.
# This setting is not used here because the number of threads (8)
# was fixed inside the program itself in our example code.
# export OMP_NUM_THREADS=8

./hello_omp.exe

When submitted with 'qsub my_ompjob' a job ID XXXX is returned and the output will be written to the file 'openMP_job.oXXXX' where XXXX is the job ID, unless otherwise redirected on the command-line.

The key lines in the script are '-l select' and '-l place'. The first defines one resource chunk with '-l select=1' and assigns 8 cores to it with ':ncpus=8'. SLURM must allocate these 8 cores on a single node because they are all part of a single SLURM resource 'chunk' ('chunks' are atomic) to be used in concert by our OpenMP executable, hello_omp.exe.

Next, the line '-l place=free' instructs SLURM to place this chunk anywhere it can find 8 free cores. As mentioned, SLURM resource 'chunks' are indivisible across compute nodes, and therefore this job can only be run on a single compute node. It would never run on a system with only 4 cores per compute node, and on those with only 8 cores per node SLURM would have to find a node with no other jobs running on it. This is exactly what we want for an OpenMP job: a one-to-one mapping of physically free cores to the OpenMP threads requested, with no other jobs scheduled by SLURM (or outside of SLURM's purview) to run and compete for those 8 cores.

Placement on a node with as many free physical cores as OpenMP threads is optimal for OpenMP jobs because each processor assigned to an OpenMP job works within that single program's memory space or image. If the processors assigned by SLURM were on another compute node they would not be usable; if they were assigned to another job on the same compute node they would not be fully available to the OpenMP program and would delay its completion.

Here, the selection of 8 cores will consume all the cores available on a single compute node on ANDY. This forces SLURM to find and allocate an entire compute node to the OpenMP job. In this case, the OpenMP job will also have all of the memory of the compute node at its disposal, knowing that no other jobs will be assigned to it by SLURM. If fewer cores were selected (say 4), SLURM could place another job on the same ANDY compute node using as many as 4 cores. This job would compete for memory resources proportionally, but would have its own cores. SLURM offers the 'pack:excl' option to force exclusive placement even if the job uses fewer than all the cores on the physical node. One might wish to do this to run a single-core job and have it use all the memory on the compute node.

One thing that should be kept in mind when defining SLURM resource requirements and in submitting any SLURM script is that jobs with resource requests that are impossible to fulfill on the system where the job is submitted will be queued forever and never run. In our case here, we must know that the system that we are submitting this job to has at least 8 processors (cores) available on a single physical compute node. At the HPC Center this job would run on either ANDY or SALK, but would be queued indefinitely on any system that has fewer than 8 cores per physical node. This resource mapping requirement applies to any resource that you might request in your SLURM script, not just cores. Resource definition and mapping is discussed in greater detail in the SLURM section later in this document.

Note that on SALK, the Cray XE6m system, the SLURM script would require the use of Cray's compute-node, job launch command 'aprun', as follows:

#!/bin/bash
#SLURM -q production
#SLURM -N openMP_job
#SLURM -l select=1:ncpus=16:mem=32768mb
#SLURM -l place=free
#SLURM -j oe
#SLURM -o openMP_job.out
#SLURM -V

# You must explicitly change to your working directory in SLURM
# The SLURM_O_WORKDIR variable is automatically filled with the path
# to the directory you submit your job from

cd $SLURM_O_WORKDIR
echo ""

# The SLURM_NODEFILE file contains the compute nodes assigned
# to the job by SLURM.  The next line prints them.

cat $SLURM_NODEFILE
echo ""

# It is possible to set the number of threads to be used in
# an OpenMP program using the environment variable OMP_NUM_THREADS.
# This setting is not used here because the number of threads (8)
# was fixed inside the program itself in our example code.
# export OMP_NUM_THREADS=8

aprun -n 1 -d 16 ./hello_omp.exe

Here, 'aprun' is requesting that one process be allocated to a compute node ('-n 1') and that it be given all 16 cores available on a single SALK compute node ('-d 16'). Because the production queue on SALK allows no jobs requesting fewer than 16 cores, the '-l select' line was also changed. The define in the original C source code should also be changed to set the number of OpenMP threads to 16 so that no allocated cores are wasted on the compute node, as in:

#define NPROCS 16

MPI, MPI Parallel Program Compilation, and SLURM Batch Job Submission

The Message Passing Interface (MPI) is a hardware-independent parallel programming and communications library callable from C, C++, or Fortran. Quoting from the MPI standard:

MPI is a message-passing application programmer interface (API), together with protocol and semantic specifications for how its features must behave in any implementation.

MPI has become the de facto standard approach for parallel programming in HPC. MPI is a collection of well-defined library calls composing an Applications Program Interface (API) for transferring data (packaged as messages) between completely independent processes with independent address spaces. These processes might be running within a single physical node, as required above with OpenMP, or distributed across nodes connected by an interconnect such as Gigabit Ethernet or InfiniBand. MPI communication is generally two-sided, with both the sender and receiver of the data actively participating in the communication events. Both point-to-point and collective communication (one-to-many; many-to-one; many-to-many) are supported. MPI's goals are high performance, scalability, and portability. MPI remains the dominant parallel programming model used in high-performance computing today, although it is sometimes criticized as difficult to program with.

  • An Overview of the CUNY MPI Compilers and Batch Scheduler
  • Sample Compilations and Production Batch Scripts
    • Intel OpenMPI Parallel C
    • Intel OpenMPI Parallel FORTRAN
    • Intel OpenMPI SLURM Submit Script
    • GNU OpenMPI Parallel C
    • GNU OpenMPI Parallel FORTRAN
    • GNU OpenMPI SLURM Submit Script
    • Other System-Local Custom Versions of the MPI Stack
  • Setting Your Preferred MPI and Compiler Defaults
  • Getting the Right Interconnect for High Performance MPI


The original MPI-1 release was not designed with any special features to support traditional shared-memory or distributed, shared-memory parallel architectures, and MPI-2 provides only limited distributed, shared-memory support with some one-sided, remote direct memory access routines (RDMA). Nonetheless, MPI programs are regularly run on shared memory computers because the MPI model is an architecture-neutral parallel programming paradigm. Writing parallel programs using the MPI model (as opposed to shared-memory models such as OpenMP described above) requires the careful partitioning of program data among the communicating processes to minimize the communication events that can sap the performance of parallel applications, especially when they are run at larger scale (with more processors).

The CUNY HPC Center supports several versions of MPI, including proprietary versions from Intel, SGI, and Cray; however, with the exception of the Cray, CUNY HPC Center systems by default have standardized on the open-source release of MPI called OpenMPI (not to be confused with OpenMP [yes, this is confusing]). While this version will not always perform as well as the proprietary versions mentioned above, it is a reliable version that can be run on most HPC cluster systems. Among the systems currently running at the CUNY HPC Center, only the Cray (SALK) does not support OpenMPI. It instead uses a custom version of MPICH2 based on Cray's Gemini interconnect communication protocol. In the discussion below, we therefore emphasize OpenMPI (except in our treatment of MPI on the Cray) because it can be run on almost every system the CUNY HPC Center supports. Details on how to use Intel's and SGI's proprietary MPIs, and on using MPICH, another open-source version of MPI, will be added later.

OpenMPI (completely different from and not to be confused with OpenMP described above) is a project combining technologies and resources from several previous MPI projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI) with the stated aim of building the best freely available MPI library. OpenMPI represents the merger between three well-known MPI implementations:

  • FT-MPI from the University of Tennessee
  • LA-MPI from Los Alamos National Laboratory
  • LAM/MPI from Indiana University

with contributions from the PACX-MPI team at the University of Stuttgart. These four institutions comprise the founding members of the OpenMPI development team which has grown to include many other active contributors and a very active user group.

These MPI implementations were selected because OpenMPI developers thought that each excelled in one or more areas. The stated driving motivation behind OpenMPI is to bring the best ideas and technologies from the individual projects and create one world-class open source MPI implementation that excels in all areas. The OpenMPI project names several top-level goals:

  • Create a free, open source software, peer-reviewed, production-quality complete MPI-2 implementation.
  • Provide extremely high, competitive performance (low latency or high bandwidth).
  • Directly involve the high-performance computing community with external development and feedback (vendors, 3rd party researchers, users, etc.).
  • Provide a stable platform for 3rd party research and commercial development.
  • Help prevent the "forking problem" common to other MPI projects.
  • Support a wide variety of high-performance computing platforms and environments.

At the CUNY HPC Center, OpenMPI may be used to run jobs compiled with the Intel or GNU compilers. Two simple MPI programs, one written in C and another in Fortran, are shown below as examples. For details on programming in MPI, users should consider attending the CUNY HPC MPI workshop (3 days in length), refer to the many online tutorials, or read one of the books on the subject. A good online tutorial on MPI can be found at LLNL here [1]. A tutorial on parallel programming in general can be found here [2].

Parallel implementations of the "Hello world!" program in C and Fortran are presented here to give the reader a feel for the look of MPI code. These sample codes can be used as test cases in the sections below describing parallel application compilation and job submission. Again, refer to the tutorials mentioned above or attend the CUNY HPC Center MPI workshop for details on MPI programming.

Example 1. C Example (hello_mpi.c)
#include <stdio.h>

/* include MPI specific data types and definitions */
#include <mpi.h>

int main (int argc, char *argv[])
{
 int rank, size;

/* set up the MPI runtime environment */
 MPI_Init (&argc, &argv);  

/* get current process id */
 MPI_Comm_rank (MPI_COMM_WORLD, &rank);

/* get number of processes */
 MPI_Comm_size (MPI_COMM_WORLD, &size);

 printf( "Hello world from process %d of %d\n", rank, size );

/* break down the MPI runtime environment */
 MPI_Finalize();

 return 0;

}


Example 2. Fortran example (hello_mpi.f90)
program hello

! include MPI specific data types and definitions
include 'mpif.h'

integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)

! set up the MPI runtime environment
call MPI_INIT(ierror)

! get current process id
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

! get number of processes
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)

print*, 'Hello world from process ', rank, ' of ', size

! break down the MPI runtime environment
call MPI_FINALIZE(ierror)

end

An excellent and comprehensive tutorial on MPI with examples can be found at the Lawrence Livermore National Lab web site: (https://computing.llnl.gov/tutorials/mpi)

Sample Compilations and Production Batch Scripts

These examples could be used to compile the sample programs above and should run consistently on all CUNY HPC Center systems except SALK, which as mentioned has its own compiler wrappers.

OpenMPI (Intel compiler) Parallel C code

Compilation (again, because the Intel-compiled version of OpenMPI is the default, the full path shown here is NOT required):

/share/apps/openmpi-intel/default/bin/mpicc -o hello_mpi.exe ./hello_mpi.c

OpenMPI (Intel compiler) Parallel FORTRAN code

Compilation (again, because the Intel-compiled version of OpenMPI is the default, the full path shown here is NOT required):

/share/apps/openmpi-intel/default/bin/mpif90 -o hello_mpi.exe ./hello_mpi.f90

OpenMPI (Intel compiler) SLURM Submit Script

The script below (my_mpi.job) requests that SLURM schedule an 8 processor (core) job and allows SLURM to freely distribute the 8 processors requested to any free nodes. For details on the meaning of all the options in this script please see the full SLURM section below.

#!/bin/bash
#SLURM -q production
#SLURM -N openmpi_intel
#SLURM -l select=8:ncpus=1
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)

echo -n ">>>> SLURM Master compute node is: "
hostname
echo ""

# You must explicitly change to your working directory in SLURM
# The SLURM_O_WORKDIR variable is automatically filled with the
# path to the directory you submit your job from

cd $SLURM_O_WORKDIR

# The SLURM_NODEFILE file contains the compute nodes assigned
# to your job by SLURM.  The lines below print them.

echo ">>>> SLURM Assigned these nodes to your job: "
echo ""
cat $SLURM_NODEFILE
echo ""

# Because OpenMPI compiled with the Intel compilers is the default,
# the full path here is NOT required.

/share/apps/openmpi-intel/default/bin/mpirun -np 8 -machinefile $SLURM_NODEFILE ./hello_mpi.exe

When submitted with 'qsub my_mpi.job' a job ID is returned and output will be written to the file called 'openmpi_intel.oXXXX' where XXXX is the job ID. Errors will be written to 'openmpi_intel.eXXXX' where XXXX is the job ID.

MPI hello world output:


>>>> SLURM Master compute node is: r1i0n6

>>>> SLURM Assigned these nodes to your job: 

r1i0n6
r1i0n7
r1i0n8
r1i0n9
r1i0n10
r1i0n14
r1i1n0
r1i1n1

Hello world from process 0 of 8
Hello world from process 7 of 8
Hello world from process 5 of 8
Hello world from process 4 of 8
Hello world from process 6 of 8
Hello world from process 3 of 8
Hello world from process 1 of 8
Hello world from process 2 of 8

OpenMPI (GNU compiler) Parallel C

Coming soon.

OpenMPI (GNU compiler) Parallel FORTRAN

Coming soon.

OpenMPI (GNU compiler) SLURM Submit Script

This script sends SLURM an 8 processor (core) job, allowing SLURM to freely distribute the 8 processors to the least loaded nodes. (Note: the only real difference between this script and the Intel script above is in the path to the mpirun command.) For details on the meaning of all the options in this script please see the full SLURM section below.

#!/bin/bash
#SLURM -q production
#SLURM -N openmpi_gnu
#SLURM -l select=8:ncpus=1
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)

echo -n ">>>> SLURM Master compute node is: "
hostname
echo ""

# You must explicitly change to your working directory in SLURM
# The SLURM_O_WORKDIR variable is automatically filled with the path 
# to the directory you submit your job from

cd $SLURM_O_WORKDIR

# The SLURM_NODEFILE file contains the compute nodes assigned
# to the job by SLURM.  The lines below print them.

echo ">>>> SLURM Assigned these nodes to your job: "
echo ""
cat $SLURM_NODEFILE
echo ""

# Because OpenMPI GNU is NOT the default, the full path is shown here,
# but this does not guarantee a clean run. You must ensure that the
# environment has been toggled to GNU either in this batch script or
# within your init files (see section below).

/opt/openmpi/bin/mpirun -np 8 -machinefile $SLURM_NODEFILE ./hello_mpi.exe

When submitted with 'qsub myjob' a job ID is returned and output will be written to the file called 'openmpi_gnu.oXXXX' where XXXX is the job ID. Errors will be written to 'openmpi_gnu.eXXXX' where XXXX is the job ID.

MPI hello world output:


>>>> SLURM Master compute node is: r1i0n3

>>>> SLURM Assigned these nodes to your job:

r1i0n3
r1i0n7
r1i0n8
r1i0n9
r1i0n10
r1i0n14
r1i1n0
r1i1n1

Hello world from process 0 of 8
Hello world from process 7 of 8
Hello world from process 5 of 8
Hello world from process 4 of 8
Hello world from process 6 of 8
Hello world from process 3 of 8
Hello world from process 1 of 8
Hello world from process 2 of 8

NOTE: The paths used above for the gcc version of OpenMPI apply only to ZEUS, which has a GE interconnect. On BOB, the path to the InfiniBand version of the gcc OpenMPI commands and libraries is:

/usr/mpi/gcc/openmpi-1.2.8/[bin,lib]

Setting Your Preferred MPI and Compiler Defaults

As mentioned above the default version of MPI on the CUNY HPC Center clusters is OpenMPI 1.5.5 compiled with the Intel compilers. This default is set by scripts in the /etc/profile.d directory (i.e. smpi-defaults.[sh,csh]). When the mpi-wrapper commands (mpicc, mpif90, mpirun, etc.) are used WITHOUT full path prefixes, these Intel defaults will be invoked. To use either of the other supported MPI environments (OpenMPI compiled with the PGI compilers, or OpenMPI compiled with the GNU compilers) users should set their local environment either in their home directory init files (i.e. .bashrc, .cshrc) or manually in their batch scripts. The script provided below can be used for this.

WARNING: Full path references to non-default MPI commands will NOT by themselves guarantee error-free compiles and runs because of the way OpenMPI references the environment it runs in!!

CUNY HPC Center staff recommend fully toggling the site default environment away from Intel to PGI or GNU when the non-default environments are preferred. This can be done relatively easily by commenting out the default and uncommenting one of the preferred alternatives referenced in the script provided below. Users may copy the script smpi-defaults.sh (or smpi-defaults.csh) from /etc/profile.d. A copy is provided here for reference. (NOTE: This discussion does NOT apply on the Cray, which uses the 'modules' system to manage its default applications environment.)

# general path settings 
#PATH=/opt/openmpi/bin:$PATH
#PATH=/usr/mpi/gcc/openmpi-1.2.8/bin:$PATH
#PATH=/share/apps/openmpi-pgi/default/bin:$PATH
#PATH=/share/apps/openmpi-intel/default/bin:$PATH
export PATH

# man path settings 
#MANPATH=/opt/openmpi/share/man:$MANPATH
#MANPATH=/usr/mpi/gcc/openmpi-1.2.8/share/man:$MANPATH
#MANPATH=/share/apps/openmpi-pgi/default/share/man:$MANPATH
#MANPATH=/share/apps/openmpi-intel/default/share/man:$MANPATH
export MANPATH

# library path settings 
#LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
#LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.2.8/lib:$LD_LIBRARY_PATH
#LD_LIBRARY_PATH=/share/apps/openmpi-pgi/default/lib:$LD_LIBRARY_PATH
#LD_LIBRARY_PATH=/share/apps/openmpi-intel/default/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

By selectively uncommenting the appropriate line in each paragraph above, the default PATH, MANPATH, and LD_LIBRARY_PATH can be set to the MPI compilation stack that the user prefers. The right place to do this is inside the user's .bashrc file (or .cshrc file in the C-shell) in the user's HOME directory. Once done, full path references in the SLURM submit scripts listed above become unnecessary and one script would work for any compilation stack.

This approach can also be used to set the MPI environment to older non-default versions of OpenMPI still installed in /share/apps/openmpi-[intel,pgi].

Getting the Right Interconnect for High Performance MPI

A few comments should be made about interconnect control and selection under OpenMPI. First, this question applies ONLY to ANDY and HERBERT which have both InfiniBand and Gigabit Ethernet interconnects. InfiniBand provides both greater bandwidth and lower latencies than Gigabit Ethernet, and it should be chosen on these systems because it will deliver better performance at a given processor count and greater application scalability.

The Intel and Portland Group versions of OpenMPI installed on ANDY and HERBERT have both been compiled to include the OpenIB libraries. This means that by default the mpirun command will attempt to use the OpenIB libraries at runtime without any special options. If this cannot be done because no InfiniBand devices can be found, a runtime error message will be reported in the job's SLURM error file, and mpirun will attempt to use other libraries and interfaces (namely Gigabit Ethernet, which is TCP/IP based) to run the job. If successful, the job will run to completion, but with sub-optimal performance.

To avoid this, or to establish with certainty which communication libraries and devices are being used by your job, there are options that can be used with mpirun to force the choice of one communication device, or the other.

To force the job to use the OpenIB interface (ib0) or fail, use:

mpirun  -mca btl openib,self -np  8 -machinefile $SLURM_NODEFILE ./hello_mpi.exe

To force the job to use the Gigabit Ethernet interface (eth0) or fail, use:

mpirun  -mca btl tcp,self -np  8 -machinefile $SLURM_NODEFILE ./hello_mpi.exe

Note that this discussion does not apply on the Cray, which uses its own proprietary Gemini interconnect. The Cray's interconnect is not switch-based like those of the other systems, but rather a 2D toroidal mesh, for which awareness of job placement on the mesh can be an important consideration when tuning a job for performance at scale.

GPU Parallel Program Compilation and SLURM Job Submission

The CUNY HPC Center supports computing with Graphics Processing Units (GPUs). GPUs can be thought of as highly parallel co-processors (or accelerators) connected to a node's CPUs via a PCI Express bus. The HPC Center provides GPU accelerators on PENZIAS, which has 144 NVIDIA Tesla K20m GPUs (two per compute node). The specifications of each GPU (as reported by the 'deviceQuery' utility) are as follows:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K20m"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 4800 MBytes (5032706048 bytes)
  (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
  GPU Clock rate:                                706 MHz (0.71 GHz)
  Memory Clock rate:                             2600 Mhz
  Memory Bus Width:                              320-bit
  L2 Cache Size:                                 1310720 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Each of the 144 GPU devices has a peak single-precision performance of 3,524 GFLOPS. The K20m cards are installed on the motherboard and connected via a PCIe 2.0 x16 interface.


  • GPU Parallel Programming with the Portland Group Compiler Directives
  • Submitting Portland Group, GPU-Parallel Programs Using SLURM
  • GPU Parallel Programming with NVIDIA's CUDA C or PGI's CUDA Fortran Programming Models
    • A Sample CUDA GPU Parallel Program Written in NVIDIA's CUDA C
    • A Sample CUDA GPU Parallel Program Written in PGI's CUDA Fortran
  • Submitting CUDA (C or Fortran), GPU-Parallel Programs Using SLURM
  • Submitting CUDA (C or Fortran), GPU-Parallel Programs and Functions Using MATLAB


Two distinct parallel programming approaches for the HPC Center's GPU resources are described here. The first (a compiler-directives-based extension available in The Portland Group, Inc. (PGI) C and Fortran compilers) delivers ease of use at the expense of somewhat less than highly tuned performance. The second (NVIDIA's Compute Unified Device Architecture, CUDA C, or PGI's CUDA Fortran GPU programming model) provides the ability within C or Fortran to address the GPU hardware more directly for better performance, but at the expense of somewhat greater programming effort. We introduce both approaches here, and present the basic steps for GPU parallel program compilation and job submission using SLURM for both.

GPU Parallel Programming with the Portland Group Compiler Directives

The Portland Group, Inc. (PGI) has taken the lead in building a general-purpose, accelerated parallel computing model into its compilers. Programmers can access this technology at CUNY using PGI's compilers, which support the use of GPU-specific compiler directives in standard C and Fortran programs. Compiler directives simplify the programmer's job of mapping parallel kernels onto accelerator hardware and do so without compromising the portability of the user's application. Such directives-parallelized code can be compiled and run either on the CPU and GPU together, or on the CPU alone. At this time, PGI supports the current HPC-oriented GPU accelerator products from NVIDIA, but intends to extend its compiler-directives-based approach to other accelerators in the future.

The simplicity of coding with directives is illustrated here with a sample code ('vscale.c') that performs a simple, iteration-independent scaling of a vector on both the GPU and the CPU in single precision and compares the results:

        #include <stdio.h>
        #include <stdlib.h>
        #include <assert.h>
        
        int main( int argc, char* argv[] )
        {
            int n;      /* size of the vector */
            float *restrict a;  /* the vector */
            float *restrict r;   /* the results */
            float *restrict e;  /* expected results */
            int i;

            /* Set array size */
            if( argc > 1 )
                n = atoi( argv[1] );
            else
                n = 100000;
            if( n <= 0 ) n = 100000;
        
            /* Allocate memory for arrays */
            a = (float*)malloc(n*sizeof(float));
            r = (float*)malloc(n*sizeof(float));
            e = (float*)malloc(n*sizeof(float));

            /* Initialize array */
            for( i = 0; i < n; ++i ) a[i] = (float)(i+1);
        
            /* Scale array and mark for acceleration */
            #pragma acc region
            {
                for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
            }

            /* Scale array on the host to compare */
            for( i = 0; i < n; ++i ) e[i] = a[i]*2.0f;

            /* Check the results and print */
            for( i = 0; i < n; ++i ) assert( r[i] == e[i] );

            printf( "%d iterations completed\n", n );

            return 0;
        }

In this simple example, the only code and instruction to the compiler required to direct this vector scaling kernel to the GPU is the compiler directive:

 #pragma acc region

that precedes the second C 'for' loop. A user can build a GPU-ready executable ('vscale.exe' in this case) for execution on PENZIAS with the following compilation statement:

pgcc -o vscale.exe vscale.c -ta=nvidia -Minfo=accel -fast

The option '-ta=nvidia' tells the compiler what the destination hardware acceleration technology is going to be (PGI's model is intended to be general, although its implementation for NVIDIA's GPU accelerators is the most advanced to date), and the '-Minfo=accel' option requests output describing what the compiler did to accelerate the code. This output is included here:

main:
     29, Generating copyout(r[:n-1])
           Generating copyin(a[:n-1])
           Generating compute capability 1.0 binary
           Generating compute capability 2.0 binary
     31, Loop is parallelizable
           Accelerator kernel generated
           31, #pragma acc for parallel, vector(256) /* blockIdx.x threadIdx.x */
               CC 1.0 :   3 registers; 48 shared,   4 constant, 0 local memory bytes;   100% occupancy
               CC 2.0 : 10 registers;   4 shared, 60 constant, 0 local memory bytes;   100% occupancy

In the output, the compiler explains what it intends to copy between CPU memory and GPU accelerator memory, and where. It reports that the C 'for' loop has no loop iteration dependencies and can be run on the accelerator in parallel. It also indicates the vector length (256, the block size of the work to be done on the GPU). Because the pointers 'a' and 'r' are declared with the 'restrict' qualifier, the compiler may assume that the memory each one points to is not accessed through any other pointer. This assures the compiler that pointer-alias-related loop dependencies cannot occur.

The Portland Group C and Fortran Programming Guides provide a complete description of PGI's accelerator compiler directives programming model [3]. Additional introductory material can be found in four PGI white paper tutorials (part1, part2, part3, part4), here: [4], [5], [6], [7].

Submitting Portland Group, GPU-Parallel Programs Using SLURM

GPU job submission is very much like other batch job submission under SLURM. Here is a SLURM example script that can be used to run the GPU-ready executable created above on PENZIAS:

#!/bin/bash
#SLURM -q production
#SLURM -N pgi_gpu_job
#SLURM -l select=1:ncpus=1:ngpus=1
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# You must explicitly change to the working directory in SLURM
cd $SLURM_O_WORKDIR


echo ">>>> Begin PGI GPU Compiler Directives-based run ..."
echo ""
./vscale.exe
echo ""
echo ">>>> End   PGI GPU Compiler Directives-based run ..."

The only difference from the non-GPU submit script is in the "select" statement. By adding the "ngpus=1" directive, the user instructs SLURM to allocate 1 GPU device per chunk. Altogether, 1 CPU and 1 GPU are requested in the above script. Consider a different script:

#!/bin/bash
#SLURM -q production
#SLURM -N pgi_gpu_job
#SLURM -l select=4:ncpus=4:ngpus=2
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# You must explicitly change to the working directory in SLURM
cd $SLURM_O_WORKDIR


echo ">>>> Begin PGI GPU Compiler Directives-based run ..."
echo ""
./vscale.exe
echo ""
echo ">>>> End   PGI GPU Compiler Directives-based run ..."

Here SLURM is instructed to allocate 4 chunks of resources, with each chunk having 4 CPUs and 2 GPUs, for a total of 16 CPUs and 8 GPUs. Note that the ngpus parameter may only take the values 0, 1, or 2: there are 2 GPUs per compute node, so if more than 2 GPUs per chunk are requested, SLURM will fail to find a compute node that matches the request (SLURM chunks are 'atomic' with respect to actual hardware). This is an important limitation to keep in mind when creating SLURM scripts.

These are the essential SLURM script requirements for submitting any GPU-Device-ready executable. This applies to the one with compiler directives compiled above, but might also be used to run GPU-ready executable code generated from native CUDA C or Fortran code as described in the next example. In the case above, the PGI compiler-directive marked loops will run in parallel on a single NVIDIA GPU after the data in array 'a[]' is copied to it across the PCI-Express bus.

Other variations are possible, including jobs that combine MPI or OpenMP (or even both of these) and GPU parallel programming in a single GPU-SMP-MPI multi-parallel job. There is not enough space to cover these approaches here, but the HPC Center staff has created code examples that illustrate these multi-parallel programming model approaches and will provide them to interested users at the HPC Center.

GPU Parallel Programming with NVIDIA's CUDA C or PGI's CUDA Fortran Programming Models

The previous section described the recent advances in compiler development from PGI that make the data-parallel compute power of the GPU more accessible to C and Fortran programmers. This trend has continued with the definition and adoption of the OpenACC standard by PGI, Cray, and CAPS. OpenACC is an OpenMP-like portable standard for obtaining accelerated performance on GPUs and other accelerators using compiler directives. It is based on the approaches developed by PGI, Cray, and CAPS over the last several years.

Yet, for over 5 years NVIDIA has offered and continued to develop its Compute Unified Device Architecture (CUDA), and its direct, NVIDIA-GPU-specific programming environment for C programmers. More recently, PGI has released CUDA Fortran jointly with NVIDIA offering a second language choice for programming NVIDIA GPUs using CUDA.

In this section, the basics of compiling and running CUDA C and CUDA Fortran applications at the CUNY HPC Center are covered. The current default version of CUDA in use at the CUNY HPC Center as of 11-27-12 is CUDA release 5.0.

CUDA is a complete programming environment that includes:

1.  A modified version of the C or Fortran programming language for programming the GPU Device and
   moving data between the CPU Host and the GPU Device.

2. A runtime environment and translator that generates and runs device-specific, CPU-GPU
  executables from more generic, single, mixed-instruction-set executables.

3. A Software Development Kit (SDK), HPC application-related libraries, and documentation
  to support the development of CUDA applications.

NVIDIA and PGI have put a lot of effort into making CUDA a flexible, full-featured, and high-performance programming environment similar to those in use in HPC to program CPUs. However, CUDA is still a 2-instruction-set, CPU-GPU programming model that must manage two separate memory spaces linked only by the compute node's PCI-Express bus. As such, programming GPUs using CUDA is more complicated than PGI's compiler-directives-based approach presented above, which hides many of these details from the programmer. Still, CUDA's more explicit, close-to-the-hardware approach offers CUDA programmers the chance to get the best possible performance from the GPU for their particular application by carefully controlling SM register use and occupancy.

Adapting a current application or writing a new one for the CUDA CPU-GPU programming model involves dividing that application into those parts that are highly data-parallel and better suited for the GPU Device (the so-called GPU Device code, or device kernel(s)) and those parts that have little or limited data-parallelism and are better suited for execution on the CPU Host (the driver code, or the CPU Host code). In addition, one should inventory the amount of data that must be moved between the CPU Host and GPU Device relative to the amount of GPU computation for each candidate data-parallel GPU kernel. Kernels whose compute-to-communication time ratios are too small should be executed on the CPU.

With the natural GPU-CPU divisions in the application identified, what were once host kernels (usually substantial looping sections in the host code) must be recoded in CUDA C or Fortran for the GPU Device. Also, Host CPU-to-GPU interface code for transferring data to and from the GPU, and for calling the GPU kernel, must be written. Once these steps are completed and the host driver and GPU kernel code are compiled with NVIDIA's 'nvcc' compiler driver (or PGI's CUDA Fortran compiler), the result is a fully executable mixed CPU-GPU binary (single file, dual instruction set) that typically does the following for each GPU kernel it calls:

1.  Allocates memory for required CPU source and destination arrays on the CPU Host.

2.  Allocates memory for GPU input, intermediate, and result arrays on the GPU Device.

3.  Initializes and/or assigns values to these arrays.

4.  Copies any required CPU Host input data to the GPU Device.

5.  Defines the GPU Device grid, block, and thread dimensions for each GPU kernel.

6.  Calls (executes) the GPU Device kernel code from the CPU Host driver code.

7.  Copies the required GPU Device results back to the CPU Host.

8.  Frees (and perhaps zeroes) memory on the CPU Host and GPU Device that is no longer needed.

The details of the actual coding process are beyond the scope of the discussion here, but are treated in depth in NVIDIA's CUDA C Training Class notes, in NVIDIA's CUDA C Programming Guide, and in PGI's CUDA Fortran Programming Guide [8], [9] and in many tutorials and articles on the web [10].

A Sample CUDA GPU Parallel Program Written in NVIDIA's CUDA C

Here, we present a basic example of a CUDA C application that includes code for all the steps outlined above. It fills and then increments a 2D array on the GPU Device and returns the results to the CPU Host for printing. The example code is presented in two parts: the CPU Host setup or driver code, and the GPU Device or kernel code. This example comes from the suite of examples used by NVIDIA in its CUDA Training Class notes. Many more elaborate, HPC-relevant examples (matrixMul, binomialOptions, simpleCUFFT, etc.) are provided in NVIDIA's Software Development Kit (SDK), which any CUDA user may download and install in their home directory on their CUNY HPC Center account.

The basic example's CPU Host CUDA C code or driver, simple3_host.cu, is:

#include <stdio.h>
#include <stdlib.h>   /* malloc, free */

extern __global__ void mykernel(int *d_a, int dimx, int dimy);

int main(int argc, char *argv[])
{
   int dimx = 16;
   int dimy = 16;
   int num_bytes = dimx * dimy * sizeof(int);

   /* Initialize Host and Device Pointers */
   int *d_a = 0, *h_a = 0;

   /* Allocate memory on the Host and Device */
   h_a = (int *) malloc(num_bytes);
   cudaMalloc( (void**) &d_a, num_bytes);

   if( 0 == h_a || 0 == d_a ) {
       printf("couldn't allocate memory\n"); return 1;
   }

   /* Initialize Device memory */
   cudaMemset(d_a, 0, num_bytes);

   /* Define kernel grid and block size */
   dim3 grid, block;
   block.x = 4;
   block.y = 4;
   grid.x = dimx/block.x;
   grid.y = dimy/block.y;

   /* Call Device kernel, asynchronously */
   mykernel<<<grid,block>>>(d_a, dimx, dimy);

   /* Copy results from the Device to the Host*/
   cudaMemcpy(h_a,d_a,num_bytes,cudaMemcpyDeviceToHost);

   /* Print out the results from the Host */
   for(int row = 0; row < dimy; row++) {
      for(int col = 0; col < dimx; col++) {
         printf("%d", h_a[row*dimx+col]);
      }
      printf("\n");
   }

   /* Free the allocated memory on the Device and Host */
   free(h_a);
   cudaFree(d_a);

   return 0;

}

The GPU Device CUDA C kernel code, simple3_device.cu, is:

__global__ void mykernel(int *a, int dimx, int dimy)
{
   int ix = blockIdx.x*blockDim.x + threadIdx.x;
   int iy = blockIdx.y*blockDim.y + threadIdx.y;
   int idx = iy * dimx + ix;

   a[idx] = a[idx] + 1;
}

Using these simple CUDA C routines (or code that you have developed yourself), one can easily create a CPU-GPU executable that is ready to run on one of the CUNY HPC Center's GPU-enabled systems (PENZIAS).

Because of the variety of source and destination code states that the CUDA programming environment must source, generate, and manage, NVIDIA has provided a master program, 'nvcc', called the CUDA compiler driver to handle all of these possible compilation phase translations as well as other compiler driver options. The detailed use of 'nvcc' is documented on "PENZIAS" by 'man nvcc' and also in NVIDIA's Compiler Driver Manual [11]. NOTE: Compiling CUDA Fortran programs can be accomplished using PGI's standard release Fortran compiler making sure that the CUDA Fortran code is marked with the '.CUF' suffix as in 'matmul.CUF'. More on this a bit later.

Among the 'nvcc' command's many groups of options are a series of options that determine what source files 'nvcc' should expect to be offered and what destination files it is expected to produce. A sampling of these compilation phase options includes:

--compile    or -c       ::    Compile whatever input files are offered (.c, .cc, .cpp, .cu) into object files (*.o file).
--ptx      or -ptx     ::    Compile all .gpu or .cu input files into device-only .ptx files.
--link     or -link     ::    Compile whatever input files are offered into an executable (the default).
--lib     or -lib        ::    Compile whatever input files are offered into a library file (*.a file).

For a typical compilation to an executable, the third and default option above (which is to supply nothing or simply the string '-link') is used. There are a multitude of other 'nvcc' options that control file and path specifications for libraries and include files, control and pass options to 'nvcc' companion compilers and linkers (this includes much of the gcc stack, which must be in the user's path for 'nvcc' to work correctly), and for code generation, among other things. For a complete description, please see the manual referred to above or the 'nvcc' man page. All this complexity relates to the fact that with CUDA one is working in a multi-source and meta-code environment.

Our concern here is generating an executable from the simple example files presented above that can be used (like the PGI executables generated in the previous section) in a SLURM batch submission script. First, we will produce object files (*.o files), and then we will link them into a GPU-Device-ready executable. Here are the 'nvcc' commands for generating the object files:

nvcc -c  simple3_host.cu
nvcc -c  simple3_device.cu

The above commands should be familiar to C programmers and will produce 2 object files, simple3_host.o and simple3_device.o in the working directory. Next, the GPU-Device-ready executable is created with:

nvcc -o simple3.exe *.o

Again, this should be very familiar to C programmers. It should be noted that these two steps can be combined as follows:

nvcc -o simple3.exe *.cu

No additional libraries or include files are required for this simple example, but in a more complex case like those provided in the CUDA Software Development Kit (SDK), library paths and libraries might be specified using the '-L' and '-l' options, include file paths with the '-I' option, among others. Again, details are provided in the 'nvcc' man page or NVIDIA Compiler Driver manual.

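We now have an executable code, 'simple3.exe', that can be submitted with SLURM to one of the GPU-enabled compute nodes on PENZIAS and that will create and increment a 2D matrix on the GPU, return the results to the CPU, and print them out. As noted in the previous section, the GPU submit scripts shown there can be used for this, with './simple3.exe' in place of './vscale.exe'. The script that follows is a separate example: it submits an Intel Coarray Fortran (CAF) executable, 'int_PI.exe', rather than the CUDA executable.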

#!/bin/bash
#SLURM -q production
#SLURM -N CAF_example
#SLURM -l select=16:ncpus=1:mem=1920mb
#SLURM -l place=scatter
#SLURM -V

echo ""
echo -n "The primary compute node hostname is: "
hostname
echo ""
echo -n "The location of the SLURM nodefile is: "
echo $SLURM_NODEFILE
echo ""
echo "The contents of the SLURM nodefile are: "
echo ""
cat  $SLURM_NODEFILE
echo ""
NCNT=`uniq $SLURM_NODEFILE | wc -l - | cut -d ' ' -f 1`
echo -n "The node count determined from the nodefile is: "
echo $NCNT
echo ""

# Change to working directory
cd $SLURM_O_WORKDIR

echo "You are using the following 'mpiexec' and 'mpdboot' commands: "
echo ""
type mpiexec
type mpdboot
echo ""

echo "Starting the Intel 'mpdboot' daemon on $NCNT nodes ... "
mpdboot -n $NCNT --verbose --file=$SLURM_NODEFILE -r ssh
echo ""

mpdtrace
echo ""

echo "Starting an Intel CAF job requesting 16 cores ... "

./int_PI.exe

echo "CAF job finished ... "
echo ""

echo "Making sure all mpd daemons are killed ... "
mpdallexit
echo "SLURM CAF script finished ... "
echo ""

Here, the SLURM script requests 16 processors (CAF images). It simply names the executable itself to set up the Intel CAF runtime environment, engage the 16 processors, and initiate execution. This script is more elaborate because it includes the procedure for setting up and tearing down the Intel MPI environment on the nodes that SLURM has selected to run the job.

Available Mathematical Libraries

  • FFTW Scientific Library
  • GNU Scientific Library
  • MKL
  • IMSL
    • Fortran Example
    • C Example

FFTW Scientific Library

FFTW is a C subroutine library for computing the Discrete Fourier Transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).

The library is described in detail at the FFTW home page at http://www.fftw.org. The CUNY HPC Center has installed FFTW versions 2.1.5 (older), 3.2.2 (default), and 3.3.0 (recent release) on ANDY. All versions were built in both 32-bit and 64-bit floating point formats using the latest Intel 12.0 release of the Intel compilers. In addition, versions 2.1.5 and 3.3.0 include an MPI-parallel version of the library. The default version at the CUNY HPC Center is version 3.2.2 (64-bit), located in /share/apps/fftw/default/*.

The reason for the extra versions is that over the course of FFTW's development some changes were made to the API for the MPI-parallel library. Version 2.1.5 supports the older MPI-parallel API, and the recently released version 3.3.0 supports a newer MPI-parallel API. NOTE: The default version (3.2.2) does NOT include an MPI-parallel library, because MPI support was skipped in that generation of FFTW. A threads version of each library was also built.

Please refer to the on-line documentation at the FFTW website for details on using the library (whatever the version). With the calls properly included in your code you can link in the default at compile and link time with:

icc -o my_fftw.exe my_fftw.c -L/share/apps/fftw/default/lib -lfftw3 

(pgcc or gcc would be used in the same way)

For the non-default versions substitute the version directory for the string 'default' above. For example, for the new 3.3 release in 32-bit use:

icc -o my_fftw.exe my_fftw.c -L/share/apps/fftw/3.3_32bit/lib -lfftw3f

For an MPI-parallel, 64-bit version of 3.3 use:

mpicc -o my_mpi_fftw.exe my_mpi_fftw.c -L/share/apps/fftw/3.3_64bit/lib -lfftw3_mpi -lfftw3

The include files for each release are in the 'include' directory alongside the version's lib directory. The names of all available libraries for each release can be found by listing the contents of the appropriate version's lib directory; this is how to find the names of the threads version of each library, for instance.

GNU Scientific Library

The GNU Scientific Library (GSL) is a numerical library for C and C++ programmers. It is free software under the GNU General Public License.

The library provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. There are over 1000 functions in total with an extensive test suite.

Here is an example of code that uses GSL routines:

#include <stdio.h>
#include <gsl/gsl_sf_bessel.h>
 
int main(void)
{
  double x = 5.0;
  double y = gsl_sf_bessel_J0(x);
  printf("J0(%g) = %.18e\n", x, y);
  return 0;
}

The example program has to be linked to the GSL library upon compilation:

gcc $(/share/apps/gsl/default/bin/gsl-config --cflags) test.c $(/share/apps/gsl/default/bin/gsl-config --libs)

The output is shown below, and should be correct to double-precision accuracy:

J0(5) = -1.775967713143382642e-01

Complete GNU Scientific Library documentation may be found on the official website of the project: http://www.gnu.org/software/gsl/

MKL

Documentation to be added.

IMSL

IMSL (International Mathematics and Statistics Library) is a commercial collection of software libraries of numerical analysis functionality that are implemented in the computer programming languages of C, Java, C#.NET, and Fortran by Visual Numerics.

C and Fortran implementations of IMSL are installed on the Bob cluster under

/share/apps/imsl/cnl701 

and

/share/apps/imsl/fnl600

respectively.

Fortran Example

Here is an example of a Fortran program that uses IMSL routines:

! Use files
 
       use rand_gen_int
       use show_int
 
!  Declarations
 
       real (kind(1.e0)), parameter:: zero=0.e0
       real (kind(1.e0)) x(5)
       type (s_options) :: iopti(2)=s_options(0,zero)
       character VERSION*48, LICENSE*48, VERML*48
       external VERML
 
!  Start the random number generator with a known seed.
       iopti(1) = s_options(s_rand_gen_generator_seed,zero)
       iopti(2) = s_options(123,zero)
       call rand_gen(x, iopt=iopti)
 
!     Verify the version of the library we are running
!     by retrieving the version number via verml().
!     Verify correct installation of the license number
!     by retrieving the customer number via verml().
!
      VERSION = VERML(1)
      LICENSE = VERML(4)
      WRITE(*,*) 'Library version:  ', VERSION
      WRITE(*,*) 'Customer number:  ', LICENSE

!  Get the random numbers
       call rand_gen(x)
 
!  Output the random numbers
       call show(x,text='                              X')

! Generate error
       iopti(1) = s_options(15,zero)
       call rand_gen(x, iopt=iopti)
 
       end

To compile this example, use:

 . /share/apps/imsl/imsl/fnl600/rdhin111e64/bin/fnlsetup.sh

ifort -openmp -fp-model precise -I/share/apps/imsl/imsl/fnl600/rdhin111e64/include -o imslmp imslmp.f90 -L/share/apps/imsl/imsl/fnl600/rdhin111e64/lib -Bdynamic -limsl -limslsuperlu -limslscalar -limslblas -limslmpistub -limf -Xlinker -rpath -Xlinker /share/apps/imsl/imsl/fnl600/rdhin111e64/lib


To run it in batch mode, use the standard submit procedure described in the section Program Compilation and Job Submission. A successful run generates the following output:

 Library version:  IMSL Fortran Numerical Library, Version 6.0     
 Customer number:  702815                                          
                               X
     1 -    5   9.320E-01  7.865E-01  5.004E-01  5.535E-01  9.672E-01

 *** TERMINAL ERROR 526 from s_error_post.  s_/rand_gen/ derived type option
 ***          array 'iopt' has undefined option (15) at entry (1).

C Example

Here is a more complicated example in C:

#include <stdio.h>
#include <imsl.h>

int main(void)
{
    int         n = 3;
    float       *x;
    static float        a[] = { 1.0, 3.0, 3.0,
                                1.0, 3.0, 4.0,
                                1.0, 4.0, 3.0 };
    static float        b[] = { 1.0, 4.0, -1.0 };

    /*
     * Verify the version of the library we are running by
     * retrieving the version number via imsl_version().
     * Verify correct installation of the error message file
     * by retrieving the customer number via imsl_version().
     */
    char        *library_version = imsl_version(IMSL_LIBRARY_VERSION);
    char        *customer_number = imsl_version(IMSL_LICENSE_NUMBER);

    printf("Library version:  %s\n", library_version);
    printf("Customer number:  %s\n", customer_number);

                                /* Solve Ax = b for x */
    x = imsl_f_lin_sol_gen(n, a, b, 0);
                                /* Print x */
    imsl_f_write_matrix("Solution, x of Ax = b", 1, n, x, 0);
                               /* Generate Error to access error 
                                  message file */
    n =-10;

    printf ("\nThe next call will generate an error \n");
    x = imsl_f_lin_sol_gen(n, a, b, 0);

    return 0;
}

To compile this example, use:

. /share/apps/imsl/imsl/cnl701/rdhsg111e64/bin/cnlsetup.sh

icc -ansi -I/share/apps/imsl/imsl/cnl701/rdhsg111e64/include -o cmath cmath.c -L/share/apps/imsl/imsl/cnl701/rdhsg111e64/lib -L/share/apps/intel/composerxe-2011.0.084/mkl/lib/em64t -limslcmath -limslcstat -limsllapack -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -lgfortran -i_dynamic -Xlinker -rpath -Xlinker /share/apps/imsl/imsl/cnl701/rdhsg111e64/lib -Xlinker -rpath -Xlinker /share/apps/intel/composerxe-2011.0.084/mkl/lib/em64t

To run the binary in batch mode, use the standard submit procedure described in the section Program Compilation and Job Submission. A successful run generates the following output:

Library version:  IMSL C/Math/Library Version 7.0.1
Customer number:  702815
 
       Solution, x of Ax = b
         1           2           3
        -2          -2           3

The next call will generate an error 

*** TERMINAL Error from imsl_f_lin_sol_gen.  The order of the matrix must be
***          positive while "n" = -10 is given.