OpenMP, OpenMP SMP-Parallel Program Compilation, and SLURM Job Submission

All the compute nodes on all the systems at the CUNY HPC Center include at least 2 sockets and multiple cores. Some have 8 cores (ZEUS, BOB, ANDY), and some have 16 (SALK). These multicore, SMP compute nodes offer the CUNY HPC Center user community the option of creating parallel programs using the OpenMP Symmetric Multi-Processing (SMP) parallel programming model. SMP parallel programming with OpenMP (and other shared-memory models) is the original parallel processing model, because the earliest parallel HPC systems were built only with shared memory. The Cray X-MP (circa 1982) was among the first systems in this class. Shared-memory, multi-socket, and multi-core designs are now typical even of today's desktop and portable PC and Mac systems. On the CUNY HPC Center systems, each compute node is similarly a shared-memory, symmetric multi-processing system that can compute in parallel using the OpenMP shared-memory model.

In the SMP model, multiple processors work simultaneously within a single program's memory space (image). This eliminates the need to copy data from one program (process) image to another (required by MPI) and simplifies the parallel run-time environment significantly. As such, writing parallel programs to the OpenMP standard is generally easier and requires many fewer lines of code. However, the size of the problem that can be addressed using OpenMP is limited by the amount of memory on a single compute node, and similarly the parallel performance improvement to be gained is limited by the number of processors (cores) within that single node.
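To illustrate how little code the shared-memory model requires, here is a minimal sketch (not one of the HPC Center's own examples) that sums an array with a single 'parallel for' directive; the names 'a', 'N', and 'sum' are chosen only for illustration. Every thread reads the same shared array directly, so no data has to be copied between process images as it would be with MPI.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main (void) {

  /* 'a' is a shared array visible to every thread; illustrative only */
  static double a[N];
  double sum = 0.0;
  int i;

  for (i = 0; i < N; i++)
     a[i] = 1.0;

  /* Each thread sums its share of the same array in place; the
     reduction clause combines the partial sums when the loop ends. */
#pragma omp parallel for reduction(+:sum)
  for (i = 0; i < N; i++)
     sum += a[i];

  printf("sum = %f\n", sum);

  return 0;
}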

As of Q4 2012 at CUNY's HPC Center, OpenMP applications can run with a maximum of 16 cores (on SALK, the Cray XE6m system). Most of the HPC Center's other systems are limited to 8-core OpenMP parallelism.

Here, a simple OpenMP parallel version of the standard C "Hello, World!" program is set to run on 8 cores:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NPROCS 8

int main (int argc, char *argv[]) {

  int nthreads, num_threads = NPROCS, tid;

  /* Set the number of threads */
  omp_set_num_threads(num_threads);

  /* Fork a team of threads, giving each its own copies of the private variables */
#pragma omp parallel private(nthreads, tid)
  {

    /* Each thread obtains its thread number */
    tid = omp_get_thread_num();

    /* Each thread executes this print */
    printf("Hello World from thread = %d\n", tid);

    /* Only the master thread does this */
    if (tid == 0)
      {
        nthreads = omp_get_num_threads();
        printf("Total number of threads = %d\n", nthreads);
      }

  }  /* All threads join master thread and disband */

  return 0;
}

An excellent and comprehensive tutorial on OpenMP with examples can be found at the Lawrence Livermore National Lab website: https://computing.llnl.gov/tutorials/openMP

Compiling OpenMP Programs Using the Intel Compiler Suite

The Intel C compiler requires the '-openmp' option, as follows:

icc  -o hello_omp.exe -openmp hello_omp.c

When run, the program above produces the following output:

$ ./hello_omp.exe
Hello World from thread = 0
Total number of threads = 8
Hello World from thread = 1
Hello World from thread = 2 
Hello World from thread = 6
Hello World from thread = 4
Hello World from thread = 3
Hello World from thread = 5
Hello World from thread = 7

OpenMP is supported in Intel's C, C++, and Fortran compilers; as such, a Fortran version of the program above could be used to produce similar results. An important feature of OpenMP threads is that they are logical entities that are not by default locked to physical processors. The code above requesting 8 threads would run and produce similar results on a compute node with only 2 or 4 processors, or even 1 processor. In those cases, the program would simply take more wall-clock time to complete.

When more threads are requested than there are physical processors present on the motherboard, the threads simply compete for access to the actual number of physical cores available. Under such circumstances, the maximum program speed-up is limited to the number of unshared physical processors (cores) available to the OpenMP job, less the overhead required to start OpenMP (this ignores Intel's 'hyperthreading', which allows two threads to share sub-resources not in simultaneous use within a single processor).
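If oversubscription is a concern, the thread count can be checked against the core count at run time. The sketch below reuses the NPROCS define from the example above and caps the number of threads at the value reported by omp_get_num_procs(); this is only one way to do it and is not required by OpenMP.

#include <omp.h>
#include <stdio.h>

#define NPROCS 8

int main (void) {

  int nprocs, nthreads;

  /* omp_get_num_procs() reports how many processors (cores) the
     OpenMP runtime can see on this compute node. */
  nprocs = omp_get_num_procs();

  /* Use the smaller of the requested and available counts so that
     threads do not have to share physical cores. */
  nthreads = (NPROCS < nprocs) ? NPROCS : nprocs;
  omp_set_num_threads(nthreads);

  printf("Cores seen: %d, threads used: %d\n", nprocs, nthreads);

#pragma omp parallel
  {
     printf("Hello from thread = %d\n", omp_get_thread_num());
  }

  return 0;
}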

Compiling OpenMP Programs Using the PGI Compiler Suite

The PGI C compiler requires its '-mp' option for OpenMP programs, as follows:

pgcc  -o hello_omp.exe -mp hello_omp.c

When run, this PGI executable produces the same output as shown above, although the order of the print statements cannot be predicted and will not necessarily be the same over repeated runs.

OpenMP is supported in PGI's C, C++, and Fortran compilers; therefore a Fortran version of the program above could be used to produce similar results.

Compiling OpenMP Programs Using the Cray Compiler Suite

The Cray C compiler requires its '-h omp' option for OpenMP programs, as follows:

cc  -o hello_omp.exe -h omp hello_omp.c

The program produces the same output, and again the order of the print statements cannot be predicted and will not necessarily be the same over repeated runs.

OpenMP is supported in Cray's C, C++, and Fortran compilers; therefore a Fortran version of the program above could be used to produce similar results.

Note: As discussed above in the section on serial program compilation, on the Cray the 'cc', 'CC', or 'ftn' compiler wrappers are used (with the compiler-specific OpenMP flags) for each compiler suite after the appropriate programming environment module has been loaded.

Compiling OpenMP Programs Using the GNU Compiler Suite

The GNU C compiler requires its '-fopenmp' option for OpenMP programs, as follows:

gcc  -o hello_omp.exe -fopenmp hello_omp.c

The program produces the same output, and again the order of the print statements cannot be predicted and will not necessarily be the same over repeated runs.

OpenMP is supported in GNU's C, C++, and Fortran compilers; therefore a Fortran version of the program above could be used to produce similar results.

Submitting an OpenMP Program to the SLURM Batch Queueing System

All non-trivial jobs (development or production, parallel or serial) must be submitted to the HPC Center's compute nodes from each system's head (login) node using a SLURM script. Jobs run interactively on a head node that place a significant and sustained load on that node will be terminated. Details on the use of SLURM are presented later in this document; here we present a basic SLURM script ('my_ompjob') that can be used to submit any OpenMP SMP program for batch processing on one of the CUNY HPC Center compute nodes.

#!/bin/bash
#SBATCH --partition=production
#SBATCH --job-name=openMP_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --export=ALL

# SLURM starts the job in the directory it was submitted from;
# the SLURM_SUBMIT_DIR variable holds the path to that directory.

cd $SLURM_SUBMIT_DIR
echo ""

# The SLURM_JOB_NODELIST variable contains the compute nodes assigned
# to the job by SLURM.  The next line prints them to the output file.

echo $SLURM_JOB_NODELIST
echo ""

# It is possible to set the number of threads to be used in
# an OpenMP program using the environment variable OMP_NUM_THREADS.
# This setting is not used here because the number of threads (8)
# was fixed inside the program itself in our example code.
# export OMP_NUM_THREADS=8

./hello_omp.exe

When the script is submitted with 'sbatch my_ompjob', a job ID XXXX is returned and the output is written to the file 'slurm-XXXX.out' in the submission directory, unless it is redirected with the '--output' option.

The key lines in the script are '--nodes=1' and '--cpus-per-task=8'. Together they ask SLURM to allocate 8 cores for a single task on a single compute node, so that all 8 can be used in concert by our OpenMP executable, hello_omp.exe.

Because all 8 cores must come from one node, SLURM will place the job on any compute node it finds with 8 free cores. The job could therefore never run on a system with only 4 cores per compute node, and on systems with exactly 8 cores per node SLURM must find a node with no other jobs running on it. This is exactly what we want for an OpenMP job: a one-to-one mapping of physically free cores to the OpenMP threads requested, with no other jobs scheduled by SLURM (or outside of SLURM's purview) competing for those 8 cores.

Placement on a node with as many free physical cores as OpenMP threads is optimal for OpenMP jobs because each processor assigned to an OpenMP job works within that single program's memory space (image). If some of the processors assigned by SLURM were on another compute node, they would not be usable; if they were assigned to another job on the same compute node, they would not be fully available to the OpenMP program and would delay its completion.

Here, the selection of 8 cores consumes all the cores available on a single compute node on either BOB or ANDY. This forces SLURM to allocate an entire compute node to the OpenMP job. In this case, the OpenMP job also has all of the memory on that compute node at its disposal, because no other jobs will be assigned to the node by SLURM. If fewer cores were selected (say 4), SLURM could place another job on the same BOB or ANDY compute node using as many as 4 cores. That job would compete proportionally for memory resources, but would have its own cores. SLURM offers the '--exclusive' option to force exclusive placement even when the job uses fewer than all the cores on the physical node; one might do this, for example, to run a single-core job that needs all the memory on the compute node.

One thing to keep in mind when defining SLURM resource requirements and when submitting any SLURM script is that a job whose resource request is impossible to fulfill on the system where it is submitted will be queued forever and never run. In our case, we must know that the system we are submitting this job to has at least 8 processors (cores) on a single physical compute node. At the HPC Center this job would run on BOB, ANDY, or SALK, but it would be queued indefinitely on any system with fewer than 8 cores per physical node. This resource mapping requirement applies to any resource that you might request in your SLURM script, not just cores. Resource definition and mapping is discussed in greater detail in the SLURM section later in this document.

Note that on SALK, the Cray XE6m system, the SLURM script requires the use of Cray's compute-node job-launch command 'aprun', as follows:

#!/bin/bash
#SBATCH --partition=production
#SBATCH --job-name=openMP_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=32768M
#SBATCH --output=openMP_job.out
#SBATCH --export=ALL

# SLURM starts the job in the directory it was submitted from;
# the SLURM_SUBMIT_DIR variable holds the path to that directory.

cd $SLURM_SUBMIT_DIR
echo ""

# The SLURM_JOB_NODELIST variable contains the compute nodes assigned
# to the job by SLURM.  The next line prints them to the output file.

echo $SLURM_JOB_NODELIST
echo ""

# It is possible to set the number of threads to be used in
# an OpenMP program using the environment variable OMP_NUM_THREADS.
# This setting is not used here because the number of threads
# was fixed inside the program itself in our example code.
# export OMP_NUM_THREADS=16

aprun -n 1 -d 16 ./hello_omp.exe

Here, 'aprun' requests that one process be allocated to a compute node ('-n 1') and that it be given all 16 cores available on a single SALK compute node ('-d 16'). Because the production queue on SALK accepts no jobs requesting fewer than 16 cores, the SLURM resource request was also changed accordingly. The NPROCS define in the original C source code should also be changed to set the number of OpenMP threads to 16, so that no allocated cores are wasted on the compute node, as in:

#define NPROCS 16
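An alternative to editing the define for each system is to drop the hard-coded thread count altogether and let the OpenMP runtime take it from the OMP_NUM_THREADS environment variable set in the batch script. The sketch below is simply a rearrangement of the example program above, not a separate HPC Center example:

#include <omp.h>
#include <stdio.h>

int main (int argc, char *argv[]) {

  int nthreads, tid;

  /* No omp_set_num_threads() call here: the OpenMP runtime takes the
     thread count from OMP_NUM_THREADS if it is set in the batch script,
     and otherwise defaults to the number of cores it detects. */
#pragma omp parallel private(nthreads, tid)
  {
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);

    if (tid == 0)
      {
        nthreads = omp_get_num_threads();
        printf("Total number of threads = %d\n", nthreads);
      }
  }

  return 0;
}

With this version, the same binary could be run with 'export OMP_NUM_THREADS=8' on BOB or ANDY and 'export OMP_NUM_THREADS=16' on SALK, by uncommenting and adjusting that line in the batch scripts shown above.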