CoArray Fortran and Unified Parallel C (PGAS) Program Compilation and SLURM Job Submission

As part of its plan to offer CUNY HPC Center users a unique variety of HPC parallel programming alternatives (beyond even those described above), the HPC Center supports a two-cabinet, 2816-core Cray XE6m system called SALK. This system supports two newer, similar, language-integrated, and highly scalable approaches to parallel programming: CoArray Fortran (CAF) and Unified Parallel C (UPC). Both are extensions of their parent languages, Fortran and C respectively, and offer a symbolically concise alternative to the de facto standard message-passing model, MPI. CAF and UPC are so-called Partitioned Global Address Space (PGAS) parallel programming models. Unlike MPI, CAF and UPC are not based on a subroutine-library call API.

Both MPI and the PGAS approach to parallel programming rely on a Single Program Multiple Data (SPMD) model. In the SPMD parallel programming model, identical collaborating programs (with fully separate memory spaces, or program images) are executed by different processors that may or may not be separated by a network. Each processor-program produces a different part of the result in parallel by working on different data and taking conditionally different paths through the same code. The PGAS approach differs from MPI in that it abstracts away as much as possible, reducing the way that communication is expressed to minimal built-in extensions to the base language, in our case C and Fortran. In large part, CAF and UPC are free of extension-related, explicit library calls. With the underlying communication layer abstracted away, PGAS languages appear to provide a single, global memory space spanning their processes.

In addition, communication among processes in a PGAS program is one-sided in the sense that any process can read and/or write the memory of any other process without informing it of its actions. Such one-sided communication has the advantage of being economical, lowering the latency (first-byte delay) that is part of the cost of communication among different parallel processes. Lower-latency parallel programs are generally more scalable because they waste less time in communication, especially when the data to be moved are small, as in finer-grained communication patterns.
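
In CAF, for example, a one-sided transfer is written as an ordinary assignment in which the square-bracketed image index selects whose copy of a co-array is read or written. The short sketch below is illustrative only (it is not part of the PI example that follows): image 1 writes directly into image 2's copy of x, and image 2 does nothing to receive the data beyond the synchronization that orders the accesses.

!
!  A minimal one-sided 'put' in CAF (illustrative sketch)
!
program one_sided
implicit none
integer :: x[*]                                            ! one integer per image
x = 0
sync all                                                   ! every image has zeroed its copy
if (this_image() == 1 .and. num_images() > 1) x[2] = 42    ! remote write into image 2's memory
sync all                                                   ! order the put before image 2 reads its copy
if (this_image() == 2) print *, 'image 2 sees x =', x
end program one_sided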

Summarizing, PGAS languages such as CAF and UPC offer the following potential advantages over MPI:

1. Explicit communication is abstracted out of the PGAS programming model.

2. Process memory is logically unified into a global address space.

3. Parallel work is economically expressed through simple extensions
    to a base language, rather than through a library-call-based API.

4. Parallel coding is easier and more intuitive.

5. Performance and scalability are better because communication latency is lower.

6. Implementation of fine-grained communication patterns is faster and easier.

The primary drawbacks of PGAS programming models are their much less widespread support than MPI on common HPC system architectures, such as traditional HPC clusters, and their need for special hardware support to get best-case performance. Here at the CUNY HPC Center, the Cray XE6m system, SALK, has a custom interconnect (Gemini) that supports both UPC and CAF. These PGAS languages can also be run on standard clusters, but the performance is typically not as good. The HPC Center supports Berkeley UPC and Intel CAF on top of standard cluster interconnects without the advantage of PGAS hardware support.

An Example CoArray Fortran (CAF) Code

The following simple example program includes some of the essential features of the CoArray Fortran (CAF) programming model: the declaration of co-array variables that span multiple processor images; one-sided data transfer between CAF's memory-space-distinct images via simple assignment statements; and the use of critical regions and synchronization barriers. No attempt is made here to tutor the reader in all of the features of CAF; rather, the goal is to give the reader a feel for the CAF extensions adopted in the Fortran 2008 programming language standard, which now includes CoArrays. This example, which computes PI by numerical integration, can be cut and pasted into a file and run on SALK.

A tutorial on the CAF parallel programming model can be found here [1], a more formal description of the language specifications here [2], and the actual CAF standard document as defined and adopted by the Fortran standard's committee for Fortran 2008 here [3].

! 
!  Computing PI by Numerical Integration in CAF
!

program int_pi
!
implicit none
!
integer :: start, end
integer :: my_image, tot_images
integer :: i = 0, rem = 0, mseg = 0, nseg = 0
!
real :: f, x
!

! Declare two CAF scalar CoArrays, each with one copy per image

real :: local_pi[*], global_pi[*]

! Define integrand with Fortran statement function, set result
! accuracy through the number of segments

f(x) = 1.0/(1.0+x*x)
nseg = 4096

! Find out my image name and the total number of images

my_image   = this_image()
tot_images = num_images()

! Each image initializes its part of the CoArrays to zero

local_pi  = 0.0
global_pi = 0.0

! Make sure every image (in particular image 1) has finished zeroing
! its CoArray copies before any partial sums are added below

sync all

! Partition integrand segments across CAF images (processors)

rem = mod(nseg,tot_images)

mseg  = nseg / tot_images
start = mseg * (my_image - 1)
end   = (mseg * my_image) - 1

if ( my_image .eq. tot_images ) end = end + rem

! Compute local partial sums on each CAF image (processor)

do i = start,end
  local_pi = local_pi + f((.5 + i)/(nseg))

! The above is equivalent to the following more explicit code:
!
! local_pi[my_image]= local_pi[my_image] + f((.5 + i)/(nseg))
!

enddo

local_pi = local_pi * 4.0 / nseg

! Add local, partial sums to single global sum on image 1 only. Use
! critical region to prevent read-before-write race conditions. In such
! a region, only one image at a time may pass.

critical
 global_pi[1] = global_pi[1] + local_pi
end critical

! Ensure all partial sums have been added using CAF 'sync all' barrier
! construct before writing out results

sync all

! Only CAF image 1 prints the global result

if( this_image() == 1) write(*,"('PI = ', f10.6)") global_pi

end program

This sample code computes PI in parallel using a numerical integration scheme. Taking its key CAF-specific features in order, first we find the declaration of two simple scalar co-arrays (local_pi and global_pi) using CAF's square-bracket notation for the co-dimension (e.g. sname[*], vname(1:100)[*], or vname(1:8,1:4)[1:4,*]). The square-bracket notation follows the standard Fortran array notation rules, except that the last co-dimension is always indicated with an asterisk (*), which is expanded so that the number of co-array copies equals the number of images (processes) the application has launched.
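
As a concrete illustration of these declaration forms (the variable names below are arbitrary and not part of the PI example), the following minimal sketch declares a scalar co-array, an array co-array, and an array with a two-dimensional co-dimension:

program caf_declarations
implicit none
real :: sname[*]                  ! scalar co-array: one copy per image
real :: vname(1:100)[*]           ! 100-element array: one copy per image
real :: wname(1:8,1:4)[1:4,*]     ! 8x4 array with a 2-D co-dimension [1:4,*]
sname = 0.0
vname = 0.0
wname = 0.0
end program caf_declarations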

Next, the example uses the this_image() and num_images() intrinsic functions to determine each image's image ID (a number from 1 to the number of processors requested) and the total number of images or processes requested by the job. The values returned by these functions are stored in typical, image-local Fortran integer variables and are used later in the example to partition the work among the processors and define image-specific paths through the code. After the integral segments are partitioned among the CoArray images or processes (using the start and end variables), each image computes its piece of the integral in what is a standard Fortran do loop. However, the variable local_pi, as noted above, is a co-array. Two notations, one implicit and one explicit (but commented out), are presented. The implicit code, with its square-bracket notation dropped, is allowed (and encouraged for optimization reasons) when only the image-local part of a co-array is referenced by a given image. The explicit code makes it clear through the square-bracket extension [my_image] that each image is working with a local element of the local_pi co-array. When the practice of dropping the []s is adopted as a notational convention, all remote co-array references (which are more time-consuming operations) are immediately and visually identifiable by their square-bracket suffixes in the code. Optimal coding practice should seek to minimize the use of square-bracketed references where possible.
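
The local/remote distinction can be seen in isolation in the following short sketch (again illustrative, not part of the PI example): the bracket-free assignment touches only the executing image's copy, while the bracketed reference performs a remote read from another image.

program local_vs_remote
implicit none
real :: a[*]
real :: tmp
a = real(this_image())            ! local write: no brackets, no communication
sync all                          ! make every image's copy visible
if (this_image() == 1) then
   tmp = a[num_images()]          ! remote read from the last image: the brackets flag communication
   print *, 'value held by the last image is', tmp
end if
end program local_vs_remote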

With the local, partial sums computed by each image and placed in its piece of the local_pi[*] co-array, a global sum is then safely computed and written out only on image 1 with the help of a CAF critical region. Within a critical region, only one image (process) may pass at a time. This ensures that global_pi[1] is accurately summed from each local_pi[my_image], avoiding mistakes that might be caused by simultaneous reads of the same, still partially summed global_pi[1] before each image-specific increment was written. Here, we see the variable global_pi[1] with the square-bracket notation, which is a reminder that each image (process) is writing its result into the memory space on image 1. This is a remote write for all images except image 1.

The last section of the code synchronizes (sync all) the images to ensure all partial sums have been added, and then has image 1 write out the global result. Note that, as written here, only image 1 has the global result. For a more detailed treatment of the CoArray Fortran language extension, now part of the Fortran 2008 standard, please see the web references included above.


The CUNY HPC Center supports CoArray Fortran on both its Cray XE6 system, SALK, (which has custom hardware and software support for the UPC and CAF PGAS languages) and on its other systems where the Intel Cluster Studio provides a beta-level implementation of CoArray Fortran layered on top of Intel's MPI library, an approach that offers CAF's coding simplicity, but no performance advantage over MPI.

Here, the process of compiling a CAF program both for Cray's CAF on SALK, and for Intel's CAF on the HPC Center's other systems is described. On the Cray, compiling a CAF program, such as the example above, simply requires adding an option to the Cray Fortran compiler, as follows:

salk:
salk: module load PrgEnv-cray
salk:
salk: ftn -h caf -o int_PI.exe int_PI.f90
salk:
salk: ls
int_PI.exe
salk:

In the sequence above, first the Cray programming environment is loaded using the 'module' command; then the Cray Fortran compiler is invoked with the -h caf option to include the CAF features of the Fortran compiler. The result is a CAF-enabled executable that can be run with Cray's parallel job initiation command 'aprun'. This compilation was done in dynamic mode so that any number of processors (CAF images) can be selected at run time using the -n ## option to Cray's 'aprun' command. The required form of the 'aprun' command is shown below in the section on CAF program job submission using SLURM on the Cray.

To compile for a fixed number of processors (a static compile) or CAF images use the -X ## option on the Cray, as follows:

salk:
salk: ftn -X 32 -h caf -o int_PI_32.exe int_PI.f90
salk:
salk: ls
int_PI_32.exe
salk:

In this example, the PI example program has been compiled for 32 processors or CAF images, and therefore must be invoked with that many processors on the 'aprun' command line:

aprun -n 32 -N 16 ./int_PI_32.exe

On the HPC Center's other systems, compilation is conceptually similar, but uses the Intel Fortran compiler 'ifort' and requires a CAF configuration file to be defined by the user. Here is a typical configuration file to compile statically for 16 CAF images followed by the compilation command. This compilation requests a distributed mode compilation in which distinct CAF images are not expected to be on the same physical node.

andy$ cat cafconf.txt
-rr -envall -n 16 ./int_PI.exe
andy$
andy$ ifort -o int_PI.exe -coarray=distributed -coarray-config-file=cafconf.txt int_PI.f90

The Intel CAF compiler is relatively new and has had limited testing on CUNY HPC systems. It also makes use of Intel's MPI rather than the CUNY HPC Center default, OpenMPI, which means that Intel CAF jobs will not be properly accounted for. As such, we recommend that the Intel CAF compiler be used for development and testing only, and that production CAF codes be run on SALK using Cray's CAF compiler. An upgrade is planned for the Intel Compiler Suite in the near future, and this should improve the performance and functionality of Intel's CAF compiler release. Additional documentation on using Intel CoArray Fortran is available here.

Submitting CoArray Fortran Parallel Programs Using SLURM

Finally, here are two SLURM scripts that will run the above CAF executable. First, one for the Cray XE6 system, SALK:

#!/bin/bash
#SBATCH --partition=production
#SBATCH --job-name=CAF_example
#SBATCH --ntasks=64
#SBATCH --mem-per-cpu=2000M
#SBATCH --output=int_PI.out
#SBATCH --error=int_PI.err
#SBATCH --export=ALL

cd $SLURM_SUBMIT_DIR

aprun -n 64 -N 16 ./int_PI.exe

Above, the dynamically compiled executable is run on 64 of SALK's Cray XE6 cores (-n 64), with 16 cores packed onto each physical node (-N 16). More detail is presented below on SLURM job submission to the Cray and on the use of the Cray's 'aprun' command. On the Cray, 'man aprun' provides an important and detailed account of the 'aprun' command-line options and their function. One cannot fully understand job control and submission on the Cray (SALK) without understanding the 'aprun' command.

A SLURM script for the example code compiled dynamically (or statically) for 16 processors with the Intel compiler (ifort) for execution on one of the HPC Center's more traditional HPC clusters looks like this:

#!/bin/bash
#SBATCH --partition=production
#SBATCH --job-name=CAF_example
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=1920M
#SBATCH --export=ALL

echo ""
echo -n "The primary compute node hostname is: "
hostname
echo ""

# SLURM does not provide a PBS-style nodefile, so build one from the
# job's node list for use with Intel's 'mpdboot' command.
NODEFILE=$SLURM_SUBMIT_DIR/nodefile.$SLURM_JOB_ID
scontrol show hostnames $SLURM_JOB_NODELIST > $NODEFILE

echo -n "The location of the nodefile is: "
echo $NODEFILE
echo ""
echo "The contents of the nodefile are: "
echo ""
cat  $NODEFILE
echo ""
NCNT=`uniq $NODEFILE | wc -l`
echo -n "The node count determined from the nodefile is: "
echo $NCNT
echo ""

# Change to working directory
cd $SLURM_SUBMIT_DIR

echo "You are using the following 'mpiexec' and 'mpdboot' commands: "
echo ""
type mpiexec
type mpdboot
echo ""

echo "Starting the Intel 'mpdboot' daemon on $NCNT nodes ... "
mpdboot -n $NCNT --verbose --file=$NODEFILE -r ssh
echo ""

mpdtrace
echo ""

echo "Starting an Intel CAF job requesting 16 cores ... "

./int_PI.exe

echo "CAF job finished ... "
echo ""

echo "Making sure all mpd daemons are killed ... "
mpdallexit
echo "SLURM CAF script finished ... "
echo ""

Here, the SLURM script requests 16 processors (CAF images). It simply names the executable itself to set up the Intel CAF runtime environment, engage the 16 processors, and initiate execution. This script is more elaborate because it includes the steps needed to build a node list from SLURM's job environment and to set up and tear down the Intel MPI (MPD) environment on the nodes that SLURM has selected to run the job.

An Example Unified Parallel C (UPC) Code

The following simple example program includes the essential features of the Unified Parallel C (UPC) programming model, including shared (globally distributed) variable declaration and blocking, one-sided data transfer between UPC's memory-space-distinct threads via simple assignment statements, and synchronization barriers. No attempt is made here to tutor the reader in all of the features of UPC; rather, the goal is to give the reader a feel for basic UPC extensions to the C programming language. A tutorial on the UPC programming model can be found here [4], a user guide here [5], and a more formal description of the language specifications here [6]. Cray also has its own documentation on UPC [7].

// 
//  Computing PI by Numerical Integration in UPC
//

// Select memory consistency model (default).

#include<upc_relaxed.h> 

#include<math.h>
#include<stdio.h>

// Define integrand with a macro and set result accuracy

#define f(x) (1.0/(1.0+x*x))
#define N 4096

// Declare UPC shared scalar, shared vector array, and UPC lock variable.

shared float global_pi = 0.0;
shared [1] float local_pi[THREADS];
upc_lock_t *lock;

int main(void)
{
   int i;

   // Allocate a single, globally-shared UPC lock. This 
   // function is collective; the initial state is unlocked.

   lock = upc_all_lock_alloc();

   // Each UPC thread initializes its local piece of the
   // shared array.

   local_pi[MYTHREAD] = 0.0;

   // Distribute work across threads using local part of shared
   // array 'local_pi' to compute PI partial sum on thread (processor)

   for(i = 0; i <  N; i++) {
       if(MYTHREAD == i%THREADS) local_pi[MYTHREAD] += (float) f((.5 + i)/(N));
   } 

   local_pi[MYTHREAD] *= (float) (4.0 / N);

   // Compile local, partial sums to single global sum.
   // Use locks to prevent read-before-write race conditions.

   upc_lock(lock);
   global_pi += local_pi[MYTHREAD];
   upc_unlock(lock);

   // Ensure all partial sums have been added with UPC barrier.

   upc_barrier;

   // UPC thread 0 prints the results and frees the lock.

   if(MYTHREAD==0) printf("PI = %f\n",global_pi);
   if(MYTHREAD==0) upc_lock_free(lock);

}

This sample code computes PI in parallel using a numerical integration scheme. Taking the key UPC-specific features present in this example in order, first we find the declaration of the memory consistency model to be used in this code. The default choice is relaxed, which is selected explicitly here. The relaxed choice places the burden of ordering dependent shared-memory operations on the programmer, through the use of barriers, fences, and locks. This code includes explicit locks and barriers to ensure memory operations are complete and the processors have been synchronized.
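
As an illustration of the two mechanisms UPC offers for this, the sketch below (the variable names are invented here and are not part of the PI example) selects relaxed as the file-wide default through a header and then overrides it for a single flag variable with the strict type qualifier; the strict accesses act as the ordering points that make the hand-off of payload safe.

//
//  Relaxed versus strict shared accesses in UPC (illustrative sketch)
//

#include <upc_relaxed.h>            // file-wide default: relaxed ordering
#include <stdio.h>

strict shared int ready_flag = 0;   // per-variable override: accesses are strictly ordered
shared int payload = 0;             // ordinary (relaxed) shared variable

int main(void)
{
   if (MYTHREAD == 0) {
      payload    = 42;              // relaxed write
      ready_flag = 1;               // strict write: cannot appear to move ahead of the
                                    // payload write, so it also publishes it
   }
   if (MYTHREAD == 1) {
      while (!ready_flag)           // strict read: spin until thread 0's write is visible
         ;
      printf("payload = %d\n", payload);
   }
   return 0;
}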

Next, three declarations outside the main body of the application demonstrate the use of UPC's shared type. First, a scalar shared variable global_pi is declared. This variable can be read from and written to by any of the UPC threads (processors) allocated to the application by the runtime environment when it is executed. It will hold the final result of the calculation of PI in this example. Shared scalar variables are singular and always reside in the shared memory of THREAD 0 in UPC.

Next, a shared one dimensional array local_pi with a block size of one (1) and a size of THREADS is declared. The THREADS macro is always set to the number of processors (UPC threads) requested by the job at runtime. All elements in this shared array are accessible by all THREADS allocated to the job. The block size of one means that array elements are distributed, one-per-thread, across the logically Partitioned Global Address Space (PGAS) of this parallel application. One is the default block size for shared arrays, but other sizes are possible.
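
For example (the array names and sizes below are illustrative and not part of the PI example), the layout qualifier can be varied as follows:

//
//  Shared-array blocking factors in UPC (illustrative sketch)
//

#include <upc_relaxed.h>

shared      float a[THREADS];      // default block size 1: element i lives on thread i
shared [4]  float b[4*THREADS];    // blocks of 4: b[0..3] on thread 0, b[4..7] on thread 1, ...
shared []   float c[64];           // indefinite block size: the whole array resides in
                                   // thread 0's portion of the shared space
int main(void)
{
   return 0;
}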

Finally, a pointer to a special shared variable to be used as a lock is declared. Because UPC defines both shared and private memory spaces for each program image or THREAD, it must support four classes of pointers: private pointers to private data, private pointers to shared data, shared pointers to private data, and shared pointers to shared data. The pointer declared here, lock, is a private pointer to a shared object: each thread holds its own copy of the pointer, but the lock it points to lives in shared space and is therefore available to all threads. In the body of the code, the lock's memory is allocated and placed in the unlocked state with the collective call to upc_all_lock_alloc(), which returns the same pointer value to every thread.
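
In code, the four pointer classes look like this (the pointer names below are invented for the sketch):

//
//  The four UPC pointer classes (illustrative sketch)
//

#include <upc_relaxed.h>

int                 *p1;    // private pointer to private data
shared int          *p2;    // private pointer to shared data (the class of 'lock' above)
int        *shared   p3;    // shared pointer to private data (rarely useful: the private
                            // address is meaningful only on the thread that stored it)
shared int *shared   p4;    // shared pointer to shared data

int main(void)
{
   return 0;
}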

Next, each thread initializes its piece of the shared array local_pi to zero with the help of the MYTHREAD macro, which contains the thread identifier of the particular thread that does the assignment. In this case, each UPC thread initializes only the part of the shared array that is in its portion of shared PGAS memory. The standard C for-loop that follows divides the work of integration among the different UPC threads so that each thread works only on its local portion of the shared array local_pi. UPC provides a work-sharing loop construct, upc_forall, that accomplishes the same thing implicitly, as shown in the sketch below.
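
For reference, here is what the implicit form would look like as a drop-in replacement for the for-loop in the example above (a sketch, not the code as given). The fourth expression in upc_forall is the affinity expression; using i there assigns iteration i to thread i % THREADS, so the explicit MYTHREAD test is no longer needed.

   // Work-sharing equivalent of the for-loop above (illustrative sketch)

   upc_forall(i = 0; i < N; i++; i) {
       local_pi[MYTHREAD] += (float) f((.5 + i)/(N));
   }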

Processor-local (UPC thread) partial sums are then summed globally, and in a memory-consistent fashion, with the help of the UPC lock functions upc_lock() and upc_unlock(). Without the explicit locking code here, there would be nothing to prevent two UPC threads from reading the current value of global_pi before it had been updated with the latest partial sum. This would produce an incorrect under-summing of the result. Next, a upc_barrier ensures all the summing is completed before the result is printed and the lock's memory is freed.

This example includes some of the more important UPC PGAS-parallel extensions to the C programming language, but a complete review of the UPC parallel extension to C is provided in the web documentation referenced above.

As suggested above, the CUNY HPC Center supports UPC both on its Cray XE6 system, SALK (which has custom hardware and software support for the UPC and CAF PGAS languages), and on its other systems, where Berkeley UPC is installed and uses the GASNet library to support the PGAS memory abstraction on top of a number of standard underlying cluster interconnects. At the HPC Center this would include Ethernet and/or InfiniBand, depending on the CUNY HPC Center cluster system being used.

Here, the process of compiling a UPC program both for Cray's UPC on SALK, and for Berkeley UPC on the HPC Center's other systems is described. On the Cray, compiling a UPC program, such as the example above, simply requires adding an option to the Cray C compiler, as follows:

salk:
salk: module load PrgEnv-cray
salk:
salk: cc -h upc -o int_PI.exe int_PI.c
salk:
salk: ls
int_PI.exe
salk:

First, the Cray programming environment is loaded using the 'module' command; then the Cray compiler is invoked with the -h upc option to include the UPC elements of the compiler. The result is an executable that can be run with Cray's parallel job initiation command 'aprun'. This compilation was done in dynamic mode so that any number of processors (UPC threads) can be selected at run time using the -n ## option to 'aprun'. The required form of the 'aprun' line is shown below in the section on UPC program SLURM job submission.

To compile for a fixed number of processors (a static compile) or UPC threads use the -X ## option on the Cray, as follows:

salk:
salk: cc -X 32 -h upc -o int_PI_32.exe int_PI.c
salk:
salk: ls
int_PI_32.exe
salk:

In this example, the PI example program has been compiled for 32 processors or UPC threads, and therefore must be invoked with that many processors on the 'aprun' command line:

aprun -n 32 -N 16 ./int_PI_32.exe

On the HPC Center's other systems, compilation is conceptually similar, but uses the Berkeley UPC compiler driver 'upcc'.

andy:
andy: upcc  -o int_PI.exe int_PI.c
andy:
andy: ls
int_PI.exe
andy:

Similarly, the 'upcc' compiler driver from Berkeley allows for static compilations using its -T ## option:

andy:
andy: upcc -T 32  -o int_PI_32.exe int_PI.c
andy:
andy: ls
int_PI_32.exe
andy:

The Berkeley UPC compiler driver has a number of other useful options that are described in its 'man' page. In particular, the -network= option will target the executable for the GASNet communication conduit of the user's choosing on systems that have multiple interconnects (Ethernet and InfiniBand, for instance), or target the default version of MPI as the communication layer. Type 'man upcc' for details.
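
For example, on a cluster with an InfiniBand fabric, a command of roughly the following form would target the InfiniBand Verbs conduit (the conduit name ibv is assumed here; the conduits actually available depend on how Berkeley UPC was built on the system in question):

andy:
andy: upcc -network=ibv -o int_PI_ibv.exe int_PI.c
andy: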

In general, users can expect better performance from Cray's UPC compiler on SALK, but having UPC on the HPC Center's traditional cluster architectures provides another place to develop code and supports the wider use of UPC as an alternative to MPI. In theory, well-written UPC code should perform as well as MPI on a standard cluster, while reducing the number of lines of code needed to achieve that performance. In practice, this is not always the case yet; more development and hardware support is still needed to get the best performance from PGAS languages on commodity cluster environments.

Submitting UPC Parallel Programs Using SLURM

Finally, here are two SLURM scripts that will run the above UPC executable. First, one for the Cray XE6 system, SALK:

#!/bin/bash
#SBATCH --partition=production
#SBATCH --job-name=UPC_example
#SBATCH --ntasks=64
#SBATCH --mem-per-cpu=2000M
#SBATCH --output=int_PI.out
#SBATCH --error=int_PI.err
#SBATCH --export=ALL

cd $SLURM_SUBMIT_DIR

aprun -n 64 -N 16 ./int_PI.exe

Here, the dynamically compiled executable is run on 64 Cray XE6 cores (-n 64), with 16 cores packed onto each physical node (-N 16). More detail is presented below on SLURM job submission on the Cray and the use of the Cray's 'aprun' command. On the Cray, 'man aprun' provides an important and detailed account of the 'aprun' command-line options and their function. One cannot fully understand job control on the Cray (SALK) without understanding 'aprun'.

A similar SLURM script for the example code compiled dynamically (or statically) for 32 processors with the Berkeley UPC compiler (upcc), for execution on one of the HPC Center's more traditional HPC clusters, looks like this:

#!/bin/bash
#SBATCH --partition=production
#SBATCH --job-name=UPC_example
#SBATCH --ntasks=32
#SBATCH --mem-per-cpu=1920M
#SBATCH --output=int_PI.out
#SBATCH --error=int_PI.err
#SBATCH --export=ALL

cd $SLURM_SUBMIT_DIR

upcrun -n 32 ./int_PI.exe

Here, the SLURM script requests 32 processors (UPC threads). It uses the 'upcrun' command to set up the Berkeley UPC runtime environment, engage the 32 processors, and initiate execution. Please type 'man upcrun' for details on the 'upcrun' command and its options.