Submitting Jobs


The CUNY HPC Center uses PBS Pro version 12.0 or higher for job submission and queue management. This section provides an overview of the features of PBS Pro and sample job scripts. Additional sample scripts for third-party applications are provided in the section on software environments.

As the number and management needs of its systems have grown, CUNY's HPC Center decided to move to a more fully-featured, commercially supported job queueing and scheduling system (workload manager). The HPC Center selected PBS Pro as a replacement for SGE and has fully transitioned from SGE to PBS Pro on all of its resources. PBS Pro offers numerous features that improve the full, fair, and effective usage of CUNY's HPC resources. These include several distinct approaches to resource allocation and scheduling (priority-formula-based, fair-share, etc.), interfaces to control application license use, multiple methods of partitioning systems and scheduling jobs between them, and a full-featured usage analysis package (PBS Analytics), among other things. As of mid-January 2011, all CUNY HPC Center multi-node, parallel systems are running the PBS Pro batch scheduling system, version 12.

PBS Pro Design and the Cluster Paradigm

PBS Pro places 3 distinct service daemons onto the classic head-node, compute-node(s) cluster architecture. These are the queuing service daemon (known as the Server in PBS Pro), the job scheduling daemon (known as the Scheduler), and the compute or execution host daemon (known as the MOM). The Server and Scheduler typically run on the cluster's head or login node. They receive and distribute jobs submitted by users via the interconnect to the compute nodes in a resource-intelligent fashion. The Server and Scheduler do not run on the compute nodes; only the MOM does. A MOM daemon runs on each of the compute nodes or 'execution hosts', as PBS Pro refers to them. There the MOM accepts, monitors, and manages the work delivered to it by the Scheduler. While possible, a MOM is not typically run on the cluster login node because the head node is not usually tasked for production batch work. A diagram of this basic arrangement is presented here.

[Figure: PBS Daemons.jpg -- the PBS Server and Scheduler run on the head node; a MOM runs on each compute node]

A SPECIAL NOTE ABOUT THE CRAY (SALK):

A significant modification of this arrangement applies on our Cray XE6m system, SALK. On SALK, the PBS Server and Scheduler run on one of several special service nodes (head nodes) referred to as the System DataBase node or the SDB. Users cannot login to this node. On the Cray, the node onto which users login (salk.csi.cuny.edu) runs ONLY the PBS Pro MOM. In the Cray's case, PBS Pro views the login node as a single, very large virtual compute node with 2816 cores (or 704, 4-core, virtual nodes called numa-nodes). The single PBS Pro MOM on the login node starts and tracks all of the work scheduled on this large set of virtual nodes through Cray's Application Level Placement Scheduler (ALPS). It is ALPS that is fully aware of the Cray compute resources and its interconnect, and it is the ALPS daemon that is responsible for the physical placement of each PBS job submitted by the users onto the Cray's compute nodes. The Cray 'aprun' command functions as an intermediary between the PBS 'qsub' command and the PBS resources it requests, on one side, and the ALPS daemon and the physical resources available on the Cray, on the other. The resources requested (or 'claimed') via 'aprun' can never be greater than those reserved by PBS through 'qsub' and the user's PBS script. More detail will be provided on Cray-specific PBS differences later.

PBS Pro Job Submission Modes

The PBS Pro workload management system is designed to serve as a gateway to the compute node resources of each CUNY system. All jobs (both interactive and batch) submitted through PBS Pro are tracked and placed on the system in a way that efficiently utilizes the resources while keeping potentially competing jobs out of each other's way. PBS Pro can make optimal decisions about job placement only if there is no 'back-door' production work submitted to the cluster's compute nodes without its knowledge. When operating as designed, this results in better overall throughput for the job mix and better individual job performance. As such, on CUNY's HPC systems, all application runs (whether interactive or batch, development or production) should be submitted through PBS Pro (again, SALK is a minor exception, where two nodes [32 cores] are provided for PBS-independent interactive job execution via 'aprun'). Furthermore, no jobs should be run outside of PBS on CUNY system head nodes. This leaves only code compilation and basic serial testing for the head node. The CUNY HPC Center staff has designed its PBS queue structure to accommodate interactive, development, production, and other classes of work. Jobs submitted to the compute nodes through other means, or run on the head nodes, will be killed. Login sessions on compute nodes that do not have PBS-scheduled work from the user will be terminated.

Running Batch Jobs with PBS Pro

Two steps should be completed to submit a typical batch job for execution under PBS Pro. A user must create a job submission script that includes the sequence of commands (serial or parallel) that are to be run by the job (this step is not required for an interactive PBS Pro job). The user must also specify the resources (cores, memory, etc.) required by the job. This may be done within the script itself using PBS-specific comment lines, or may be provided as options on PBS's job submission command line, 'qsub'. These command-line options (or submit script #PBS comment-lines) typically include information on the number of cores (cpus) required, the estimated memory and CPU time required, the name of the job, and the queue into which the job is to be submitted, among other things. The submit script is submitted for execution to the PBS Server daemon through the PBS Pro 'qsub' command (e.g. 'qsub job.script'). Jobs targeted for the compute nodes that are not submitted via 'qsub' will be killed.

The submit script can contain numerous options. These are described in detail in the PBS Pro 12 User Guide here, or on-line with 'man qsub'. All options within the submit script to be interpreted by 'qsub' should be placed at the beginning of the script file and must be preceded by the special comment string '#PBS'. Options offered on the 'qsub' command-line override script-defined options. Some of the most important PBS Pro options are presented here:

The option to specify the name that will be given to the job (limited to 15 characters):

#PBS -N job_name

The option to specify the queue that the job will be placed in:

#PBS -q queue_name

A detailed description of the available queues is provided here.

The flag to specify the number and kind of resource chunks required by the job:

#PBS -l select=#:[resource chunk definition]

More detail on this very important option is provided in examples below.

The flag to determine how the job's resource chunks are to be distributed (placed) on the compute nodes:

#PBS -l place=[process placement instructions]

The flag to limit and indicate to PBS what a job's total cpu time requirement will be (useful for short jobs):

#PBS -l cput=HH:MM:SS

The flag to pass the head node environment variables to each compute node process:

#PBS -V 
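
Taken together, a script header combining these options might look like the following sketch (the job name, queue, and limit values here are illustrative placeholders, not recommendations):

#!/bin/bash
#PBS -N my_job
#PBS -q production
#PBS -l select=1:ncpus=1
#PBS -l place=free
#PBS -l cput=00:30:00
#PBS -V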

A SPECIAL NOTE ABOUT THE CRAY (SALK):

The Cray includes an alternative and deprecated, but still functioning set of options for specifying the resources required by the job (the so-called 'mpp' resource options). These will not be covered here in the CUNY HPC Wiki, but can be read about on SALK in the pbs_resources manual pages ('man pbs_resources').

More detailed information on PBS Pro 'qsub' options is available from 'man qsub' on all CUNY HPC Center systems and in the PBS Pro 12 User Guide here.

Submitting Serial (Scalar) Jobs

Serial (scalar) jobs (as opposed to multiprocessor jobs) use only one processor. For example, executing the simple UNIX command 'date' requires only one processor and simply returns the current date and time. While 'date' and most other UNIX commands would not typically be run by themselves in a batch job, one or more longer-running serial HPC applications are often run this way to avoid tying up a local workstation or as part of a parametric study of some large HPC problem space. Preparing and submitting a serial (scalar) job for batch execution requires many of the same steps that are required to submit a more complicated parallel HPC job, and therefore serial jobs serve as a good basic introduction to batch job submission in PBS Pro.

The following steps would typically be required for serial job submission using PBS Pro:

1. Create a new working directory (named serial for example) in your home directory and move to it by executing the commands:

bob$ mkdir serial
bob$ cd serial

2. Use a text editor (CUNY HPCC suggests vi or vim) to create a submit script file (named serial_job.sh for example) and insert the following lines in it:

#!/bin/bash
#
# My serial PBS test job.
#
#PBS -q production
#PBS -N serial_job
#PBS -l select=1:ncpus=1
#PBS -l place=pack
#PBS -V

# Find out name of compute node
hostname

# Change to working directory
cd $PBS_O_WORKDIR

serial_job.exe > serial_job.out 2>&1

Working through the script line by line: the first line selects the shell that will be used to interpret the lines in the script. Everything after a plain # is treated as a regular comment. Everything after a #PBS is an option to be interpreted by the PBS Pro 'qsub' command. The '-l select=1:ncpus=1' option (line) above needs some explanation. In PBS Pro, the '-l select' option specifies the number and kind of resource units to be associated with the job. PBS Pro refers to these resource units as chunks. The number of chunks is defined by '-l select=integer'. Here, 1 chunk has been requested, defined by '-l select=1'. To ask for 2 chunks one would write '-l select=2', and so on.

The particular resources contained in a PBS chunk are specified by the colon-separated list that follows. In this case, only one, 'ncpus', is specified. This is the number of cores (cpus) in this script's chunk. To compute the total quantity of any particular resource requested by the job script, you multiply the number of chunks by the amount of that resource specified per chunk. Here, to compute the total number of cores to be reserved by this job, you multiply the number of chunks (1 in this case) by the number of cores in each chunk (also 1, defined by ncpus=1), or 1 chunk x 1 ncpus = 1. As this is a serial job, this is precisely what is needed; a serial executable cannot take advantage of multiple processors anyway. In this case, the contents of the resource chunk include 1 core (cpu) explicitly defined, but also a set of default resources defined by the PBS administrator because they were left undefined by the user. Other resources like memory, processor time, application licenses, or disk space can also be explicitly requested in a chunk. When resources are not requested explicitly, the job is given the local site's default setting for the unrequested resource. Resource defaults are inherited from those defined for the execution queue that the job is finally placed in, or from the global Server settings. More involved examples of the '-l select' option are given below.

When defining resource chunks several things should be kept in mind. No resource "chunk" should be defined that exceeds any of the component resources (cores, memory, disk, etc.) available on any single, physical compute node on the system to which the job is being submitted. This is because PBS resource chunks are 'atomic', and therefore each must be allocated on the system's compute nodes as a whole. If there are no physical nodes that have the resources requested in a PBS chunk, then PBS will find it impossible to run the job. That job will remain queued forever (the 'Q' state) without any error message to the user. This is one of the most common PBS job submission errors. The number and kind of chunk(s) defined by the user in the '-l select' statement (via its colon-separated resource list) determine what resources PBS Pro allocates to the job, in combination with any PBS defaults.
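
For instance, consider this hypothetical request for two identical chunks of four cores and 4 Gbytes of memory each (values chosen purely for illustration):

#PBS -l select=2:ncpus=4:mem=4gb

# Total cores reserved:  2 chunks x 4 ncpus = 8
# Total memory reserved: 2 chunks x 4gb     = 8gb
# Each individual 4-core, 4gb chunk must still fit whole on a
# single physical compute node.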

Moving on to subsequent lines, the '-l place=pack' option is not strictly required for this serial job, but is included for illustration. It requests that ALL resource chunks (not just a single chunk, which is atomic) specified in the '-l select=integer' line be allocated on a single physical compute node. In this case, because we are asking for only 1 chunk with 1 core and have not specified other resources in our chunk, this placement request is easy to fulfill; but if the '-l select' option had asked for more resources in total than are available on any individual compute node in the cluster, the job would never run, because it would be making a resource request impossible to fulfill. Here again, it would be queued and never run; there would be no way to pack the '-l select' resource request onto a single node.

Unwittingly making either type of impossible-to-fulfill request (packing too many resource chunks, or defining chunks that by themselves are too large) is a common mistake in PBS Pro submit scripts created by beginners. Jobs can also be delayed because the requested resources are temporarily unavailable due to other work on the system. Both possibilities may produce the same "No Resources Available" message at the end of the 'qstat -f JID' output, confusing the user. In the next example, showing a submit script for a symmetric multiprocessing (SMP) parallel job, the issue of proper resource chunk placement comes up again.
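
When a job lingers in the 'Q' state, the scheduler's explanation can usually be read from the job's comment field. A quick way to see just that field (using the job ID 59 from the example below):

bob$ qstat -f 59 | grep -i comment

The comment line will carry a message such as the "No Resources Available" text mentioned above.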

The final PBS option in the script, '-V', instructs PBS to transfer the user's current local environment to the compute nodes where the job will be run. This is important because if the Unix paths to commands and libraries were different on the compute nodes, the script or executables that link in libraries dynamically might fail. This is another error common to users new to PBS. Lastly, in the body of the script, the directory is changed to the current working directory (the directory from which the 'qsub' command was issued) with 'cd $PBS_O_WORKDIR'. In general, when an executable and its input files are referenced without full paths in the PBS script, the user must ensure the PBS batch session is started in the correct directory, in this case the working directory. By default (without 'cd $PBS_O_WORKDIR'), the user is placed in their home directory on the compute nodes, as if they had logged in there interactively. Finally, a serial executable is run. No full path is required because 'serial_job.exe' is expected to be in the $PBS_O_WORKDIR directory. Also, note that the output file has been explicitly named in typical Unix fashion using the greater-than (>) sign:

serial_job.exe > serial_job.out 2>&1

Naming the output file explicitly is a best practice because it ensures that program output, which might be quite large, is placed where it can be stored, rather than in the default location for the PBS output and error files, the system directory /var/spool/PBS, which has limited space. The additional expression at the end of the line (2>&1) simply combines the standard error stream with the named standard output file (serial_job.out). Failing to do this, if your job's output is very large, could fill up the /var/spool/PBS directory and cause your job, and PBS on that compute node, to fail. Always explicitly name the output and error files in your PBS scripts as shown here.

A SPECIAL NOTE ABOUT THE CRAY (SALK):

On the Cray, all jobs queued to the production queue must request at least 16 cores on the '-l select' line to be run. The Cray is intended for those running jobs that can scale up to larger sizes (at least 64 cores). To run this simple, serial PBS script on SALK, one would have to submit it to the development queue by replacing:

#PBS -q production

with:

#PBS -q development

3. Submit the job to PBS Pro by entering the command 'qsub serial_job.sh'. If your submit file is correctly constructed, PBS Pro will respond by reporting the job's request ID (59 in this case) followed by the host name of the system that submitted the job:

bob$ qsub serial_job.sh
59.bob.csi.cuny.edu

A SPECIAL NOTE ABOUT THE CRAY (SALK):

You may find that the PBS Pro commands (qsub, qstat, qdel, etc.) and environment are not active by default on your Cray account. This can be remedied by using the Cray 'module' command, which is used to control environment variable settings such as the $PATH setting on the Cray. You can load the PBS module with:

module load pbs

4. You can check the status of your submitted job with the command 'qstat', which by itself lists all jobs that PBS is managing, whether in the queued (Q), running (R), hold (H), suspended (S), or exiting (E) states. To get a full listing for a particular job you can type 'qstat -f JID'. For more detail on the PBS Pro version of 'qstat', please consult the man page with 'man qstat'. The job request ID "59" above is a unique numerical ID assigned by the PBS Pro Server to your job. JID numbers are assigned in ascending order as each user's job gets submitted to PBS. The output from the 'qstat' job monitoring command always lists jobs in job request ID order. The oldest jobs on the system are always at the top of the 'qstat' output; those most recently submitted are at the bottom.

A SPECIAL NOTE ABOUT THE CRAY (SALK):

Because on the Cray PBS must work through the Cray scheduler ALPS, the 'qstat' command does not provide as much information as it does on the HPC Center's other systems. For instance, the 'Time Used' column in the Cray's 'qstat' output will typically show little or no time accumulated. This is because PBS can track only the time used by the process that submits your job to ALPS, not the time used by the job itself. Fortunately, Cray provides its own command for obtaining such details, 'apstat':

salk$
salk$ qstat

PBS Pro Server salk.csi.cuny.edu at CUNY CSI HPC Center
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
6519.sdb          3D_P30           kconnington       00:00:03 R qlong512        
6831.sdb          Par100           kconnington       00:00:00 R qlong128        
6852.sdb          rat_run64        haralick          00:00:00 R qlong128        
6853.sdb          case6_128        poje              00:00:00 R qlong16         
6854.sdb          case7_128        poje              00:00:00 R qlong16 
        
salk$
salk$
salk$ apstat

Compute node summary
    arch config     up    use   held  avail   down
      XT     80     79     43      3     33      1

No pending applications are present

Total placed applications: 5
Placed  Apid ResId     User   PEs Nodes    Age   State Command
       18255   246 kconning   512    32 161h32m  run   a.out
       18854   589 kconning   128     8   2h01m  run   a.out
       18889   609 haralick    16     1   0h04m  run   mpi_decomposi
       18891   610     poje    16     1   0h04m  run   pvs_128
       18893   611     poje    16     1   0h04m  run   pvs_128

There is also a verbose option 'apstat -avv'. Please consult the 'apstat' man page on the Cray ('man apstat') for details.


5. Once the job is finished you will see the job's output (Unix std.out) in the file 'serial_job.o59', which is your job name followed by the job ID number. Errors will be written to 'serial_job.e59' (Unix std.err) if there are problems with your job. If for some reason these files cannot be written, your account will receive two email messages with their contents included.

Looking at the output of a submitted serial job that ran the 'date' command:

bob$ cat serial_job.o59 
Wed Mar 11 17:15:59 EDT 2011

The output from the 'date' command executed on one of the compute nodes is written there.

Submitting OpenMP Symmetric Multiprocessing (SMP) Parallel Jobs

Symmetric multiprocessing (SMP) requires a multiprocessor (and/or multicore) computer with a unified (physically integrated) memory architecture. SMP programs use two or more processors (cores) to complete parallel work within a single program image within a unified memory space. SMP systems were among the earliest types of multiprocessor HPC architectures (pre-dating the current multicore chips by decades), but the SMP architecture supports a limited number of processors compared to a distributed memory system like an HPC cluster. Yet, CUNY's HPC cluster system compute nodes are each themselves small SMP systems with 4 or 8 processors (cores) that can work together on a program within their node's unified memory space. For example, each of the compute nodes on the SGI system, ANDY, has 8 Nehalem cores (2 sockets with 4 cores each) that share 24 Gbytes of memory. Each node on the newly installed Cray XE6m system, SALK, has 16 AMD Magny-Cours cores (2 sockets with 8 cores each) that share 32 Gbytes of memory. The current trend in microprocessor development away from faster clocks and toward higher on-chip core counts means that the compute nodes of next-generation HPC clusters are likely to have even higher core counts available for SMP parallel operations.

While the core count of SMP systems limits their parallel and peak performance, their integrated memory architecture makes programming them in parallel much simpler. OpenMP (not to be confused with OpenMPI) is a compiler-directive based SMP parallel programming model that is commonly used on SMP systems, and it is supported by CUNY's HPC Center compilers. OpenMP is relatively easy to learn compared to the Message Passing Interface (MPI) parallel programming model, which was designed to work even on distributed memory systems like CUNY's HPC clusters and Cray. Still, some HPC applications, both commercial and researcher-developed, are serial in design. As a first step, they can be re-written in parallel to use the unified memory space of an SMP system or a single cluster compute node.

In an earlier section, an OpenMP parallel version of the standard "Hello World" program was presented. It is a simple matter to incrementally modify the serial PBS Pro submit script presented above to run "Hello World" in SMP-parallel mode on 4 processors (cores) within a single CUNY HPC cluster compute node. The primary differences relate to reserving multiple processors (cores) and then ensuring that they are placed (packed) within a single compute node where OpenMP must function. Here is an example PBS script for an SMP-parallel job, smpjob.sh.

The following steps would typically be required for SMP job submission using PBS Pro:

1. Create a new working directory (named smp for example) in your home directory, copy your program into it, and compile it by executing the following commands:

bob$ mkdir smp
bob$ cp ./hello_omp.c smp
bob$ cd smp
bob$ icc -openmp -o hello_omp.exe hello_omp.c

Note: On the Cray (SALK) using Cray's compilers you would use 'cc' to compile. Also, the Cray compilers interpret OpenMP directives by default (i.e. their '-h omp' flag is on by default).

2. Use a text editor (CUNY HPCC suggests vi or vim) to create a submit script file (named smpjob.sh for example) and insert the following lines in it:

#!/bin/bash
#
# My script to run a 4-processor SMP parallel test job
#
#PBS -q production
#PBS -N parallel_hello
#PBS -l select=1:ncpus=4:mem=7680mb
#PBS -l place=pack
#PBS -V

# Find out name of compute node
hostname

# Find out OMP thread count
echo -n "OMP thread count is: "
echo $OMP_NUM_THREADS

# Change to working directory
cd $PBS_O_WORKDIR

./hello_omp.exe

Most of the options in this SMP submit script are the same as those in the serial job script presented above.

Select the queue into which to place the job:

#PBS -q production

Specify a name for the job:

#PBS -N parallel_hello

Define the number and kind of resource chunks needed by the job:

#PBS -l select=1:ncpus=4:mem=7680mb

Describe how the processes in the job should be distributed (or not) across the compute nodes:

#PBS -l place=pack

Export local environment variables to the compute nodes from the submission (head) node:

#PBS -V 

Check to see what PBS has set the OpenMP thread count to:

echo -n "OMP thread count is: "
echo $OMP_NUM_THREADS

Change to the job's working directory (PBS Pro 'qsub' does not have a 'cwd' option):

cd $PBS_O_WORKDIR

Run the SMP-parallel "Hello World" job on 4 processors (cores):

./hello_omp.exe

In the '#PBS' options header section, the primary change is with the '-l select' option. Here, a single resource chunk is still being requested ('-l select=1'), but it is larger than before and is now defined to include 4 processors (cores) and 7.68 Gbytes of memory. The '-l place=pack' option is the same as it was in the serial script, but here it ensures that the 4 cores in this resource chunk are confined (packed) to a single compute node so that they may be used in the SMP parallel programming style. The effect of these changes is to inform the PBS Pro Server that a resource chunk with 4 processors and 7.68 Gbytes of physical memory (more when virtual memory is counted) should be placed on 1 compute node. A careful reader will note that the '-l place=pack' option is in fact unnecessary here because, as stated earlier, a single resource chunk MUST always fit within a single physical compute node. If we had requested the same total resources, but with '-l select=4:ncpus=1:mem=1920mb' (4 chunks of one fourth the size), then the pack option would be necessary, because each chunk might otherwise be placed on a different physical node. In either case, the PBS Scheduler will attempt to place all the resources requested here on a single physical node. If it cannot do this, the job will not be run due to insufficient resources -- not insufficient across the cluster, but within any single cluster compute node.

The compute-node limits apply to the physical memory requested as well. Here, the memory requested (7680mb) is actually 7680 * 1,048,576 = 8,053,063,680 bytes. This is a good high-water maximum for the 4 processors on a compute node with 8 * 2^30 bytes of memory. Again, the size of a resource chunk should never be greater than the size of the largest physical compute node in the cluster that the job is to be run on. On a Linux system, the files /proc/cpuinfo and /proc/meminfo are good sources for determining a compute node's processor counts and memory size. PBS Pro resource chunk defaults have been configured by CUNY HPC Center staff with these values in mind.
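
On any Linux node, a quick way to check these values directly (standard Linux commands, shown for illustration):

bob$ grep -c "^processor" /proc/cpuinfo    # number of cores the kernel sees
bob$ grep MemTotal /proc/meminfo           # total physical memory in kB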

Finally, OpenMP SMP programs are able to determine the number of processors available to them from environment variables set by PBS Pro, based on the resources requested in the '-l select' option. No additional processor specification is needed on the './hello_omp.exe' command line as it would be with an MPI job. For this OpenMP program, PBS Pro sets the OMP_NUM_THREADS environment variable to 4 from the 'ncpus=4' setting of the '-l select' option. This ensures that 4 OMP threads will be used by the OpenMP executable, one for each core PBS Pro has reserved. Additional examples of the interplay between the '-l select' option, OpenMP, and SMP applications are provided below.
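
If a thread count different from 'ncpus' were ever needed (not typical), the PBS-provided value can be overridden in the script body before the run line, as in this sketch:

# Override the thread count PBS Pro derived from ncpus (use with care;
# more threads than reserved cores will oversubscribe the node)
export OMP_NUM_THREADS=2
./hello_omp.exe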

3. As with the serial job, submit the job to the PBS Pro Server by entering the command 'qsub smpjob.sh'. If your submit file is correctly constructed, PBS Pro will respond by reporting the job's request ID (71 in this case) followed by the host name of the system that submitted the job:

bob$ qsub smpjob.sh
71.bob.csi.cuny.edu

A SPECIAL NOTE ABOUT THE CRAY (SALK):

All jobs run on the Cray's (SALK) compute nodes must be started with Cray's 'aprun' command. This applies to SMP parallel jobs, and in this case requires the use of 'aprun' command-line options that are specific to SMP type work. Mapping PBS Pro resource reservation definitions from the '-l select' line onto 'aprun' command-line options can be confusing. Users of the Cray should read the 'aprun' man page ('man aprun') carefully, paying particular attention to the multiple examples presented near the end. The resources that PBS reserves on the '-l select' line define an upper limit that bounds what can be requested using 'aprun'. When an 'aprun' request exceeds those PBS reservation boundaries, the user will receive a 'claim exceeds' message of the form:

apsched: claim exceeds reservation's node-count

which indicates that what is being requested via 'aprun' exceeds what has been reserved by PBS on the '-l select' line. Any PBS-reserved resource (ncpus, memory, etc.) can potentially generate this 'claim exceeds' error message.

To run this OpenMP job on 4 cores, the correct Cray 'aprun' command would be:

aprun -n 1 -d 4 -N 1 ./hello_omp.exe

Here, the '-n 1' option defines the total number of Cray processing elements (PEs) to be used by the job. This corresponds to the number of PBS chunks specified and cannot exceed that number. The '-d 4' option defines the number of threads to be used per PE. This corresponds to the number of cpus (cores) per PBS chunk and cannot exceed that number. The '-N 1' option defines the number of Cray PEs to be used per Cray physical compute node, in this case 1 PE (whose 4 threads occupy 4 of the 16 cores available per node). This number cannot exceed the number of PEs defined by the '-n' option. OpenMP threads are 'light-weight' in the sense that they do not achieve full process status from an operating system or MPI perspective.

A Cray PE is a software concept, NOT a hardware concept, and corresponds to a distinct Linux process with its own memory space. For instance, an MPI job of rank 8 would have 8 PEs assigned to it because each MPI rank (process) has its own memory space. A thread is also a software concept, NOT a hardware concept, and corresponds to an independent piece of parallel work within a PE or SMP application. These software concepts are mapped to Cray compute node hardware elements by the 'aprun' command. Typically, the total number of cores requested by 'aprun' is the product of the number of PEs requested by the '-n' option and the number of threads requested by the '-d' option. This is sometimes referred to as the 'width-by-depth' of the job. This number cannot exceed the number of PBS chunks multiplied by the number of cpus (cores) per chunk defined in the '-l select' line of your PBS script. In our case here, that total is 4, and the 'aprun' command asks for exactly as many cores (4) as PBS has reserved. More detail on the relationship between PBS resource reservations made with '-l select' and the command-line options to the Cray 'aprun' command is provided below.
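
As a hypothetical illustration of this width-by-depth bookkeeping on SALK (the executable name is a placeholder), a 16-core hybrid MPI-OpenMP reservation and a matching 'aprun' claim might look like:

#PBS -l select=4:ncpus=4

aprun -n 4 -d 4 -N 2 ./hybrid.exe

Here 4 PEs x 4 threads per PE = 16 cores claimed, exactly matching the 4 chunks x 4 cpus reserved by PBS, with 2 PEs (8 cores) placed on each of 2 physical compute nodes.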

Submitting MPI Distributed Memory Parallel Jobs

Taking an incremental approach one step further, just a few modifications to the '#PBS' options in the SMP script above and another to the execution line are required to create a working PBS script for an MPI distributed memory parallel job. Distributed memory parallel programs are by definition designed to make parallel use of an arbitrary collection of interconnected processors (cores) -- whether they happen to be within a single compute node as in the SMP job above or on opposite ends of a cluster's interconnecting switch. Such applications are referred to as distributed because (unlike SMP applications) each cooperating process has its own completely distinct memory space that might reside on any node in a distributed memory system. Message Passing Interface (MPI) communication is accomplished through two-way message passing between two or more processes, each managing its own distinct memory space. Because of its ability to run on almost any parallel computing architecture, MPI has become the de facto standard parallel programming model for distributed memory and other parallel computing architectures. MPI applications have been shown to scale up to thousands of processors.

In an earlier section above, an example MPI distributed-memory parallel version of the standard "Hello World" program was presented. It is a simple matter to incrementally modify the SMP-parallel PBS submit script in the prior section to run this MPI "Hello World" program on 16 processors (cores). The changes required in the distributed memory parallel PBS script relate to reserving 16 PBS resource chunks, each large enough for the needs of each MPI process (rank), but small enough to be placed freely on the physical compute nodes of the cluster with enough unused resources (cores, memory, etc.) to hold them. The notion of 'free' placement in PBS allows for putting resource chunks wherever they will fit, including each chunk on separate nodes or some chunks on the same node. Here is an example PBS script for such a distributed-memory, MPI parallel job, dparallel_job.sh.

The following steps would typically be required to submit an MPI program to the CUNY HPC Center PBS Pro batch scheduling system:

1. Create a new sub-directory (named "dparallel" for example) in your home directory, copy your program into it, and compile it by executing the following commands:

andy$ mkdir dparallel
andy$ cp ./hello_mpi.c dparallel
andy$ cd dparallel
andy$ mpicc -o hello_mpi.exe hello_mpi.c

Note: On the Cray (SALK) you would compile the program using 'cc'. On the Cray the MPI library is linked in by default.

2. Use a text editor (CUNY HPCC suggests vi or vim) to create a PBS Pro submit script file (named dparallel_job.sh for example) and insert the following lines in it:

#!/bin/bash
#
# Simple distributed memory MPI PBS Pro batch job
#
#PBS -q production
#PBS -N dparallel_job
#PBS -l select=16:ncpus=1:mem=1920mb
#PBS -l place=free
#PBS -V

# Find out name of primary compute node
hostname

# Change to working directory
cd $PBS_O_WORKDIR

# Start the job on 16 processors using 'mpirun' (different on Cray)
mpirun -np 16 -machinefile $PBS_NODEFILE ./hello_mpi.exe

3. Submit the job to the PBS Pro Server using 'qsub dparallel_job.sh'.

qsub dparallel_job.sh
65.bob.csi.cuny.edu

Here, 65 is the PBS Pro job request ID and 'bob.csi.cuny.edu' is the name of the system from which the job was submitted. (Note: This is not always the system on which the job is run. PBS Pro can be configured to allow users to queue jobs up on any system in a resource grid. At this time, the CUNY HPC Center has system-local scheduling only.) As with the examples above, this MPI job's status can be checked with the command 'qstat', or for a full listing of job 65 'qstat -f 65'.

4. When the job completes its output will be written to the file 'dparallel_job.o65' (errors will go to dparallel_job.e65):

andy$
andy$ cat dparallel_job.o65
Hello world from process 0 of 16
Hello world from process 1 of 16
Hello world from process 4 of 16
Hello world from process 12 of 16
Hello world from process 8 of 16
Hello world from process 2 of 16
Hello world from process 5 of 16
Hello world from process 13 of 16
Hello world from process 9 of 16
Hello world from process 3 of 16
Hello world from process 6 of 16
Hello world from process 14 of 16
Hello world from process 10 of 16
Hello world from process 7 of 16
Hello world from process 15 of 16
Hello world from process 11 of 16

The output file contains the messages generated by each of the 16 MPI processes requested by the 'mpirun' command.

What are the differences in this MPI distributed memory PBS script? First, in the '-l select' line the number of chunks has been set to the number of PEs (and because in this job there is a chunk-to-core equivalency, the number of cores as well) that will be used, 16 in this case. The composition of each chunk has also changed from the SMP job. The distributed memory script defines chunks of 1 core (ncpus=1) and about 2 GBytes of memory each (mem=1920mb). We leave it up to the reader to compute the exact amount being requested using the 2^20 multiplier. PBS Pro lists those compute nodes it has allocated to the job in a file whose PATH is given in the $PBS_NODEFILE variable. Here it is provided as an argument to the 'mpirun' command's '-machinefile' option and is used by the 'mpirun' command to select the compute nodes PBS has reserved for the job.

The only other real difference in the PBS options section is with the '-l place' option. The placement option here has been set to 'free', which allows the PBS Pro scheduler to place the 16 chunks requested, one for each MPI process, on compute nodes (execution hosts in PBS Pro terms) with the required space and the lowest load. This typically results in a distribution that is not even partially packed, as PBS seeks out one node after another that is least utilized. The '-l place' option could also have been set to 'scatter', which would force placement on physically distinct compute nodes regardless of load, if there are enough with the required resources available. In either case, when requesting 16 chunks with 16 cores in total, using 'pack' here would not work on ANDY, because there are NO compute nodes on ANDY with 16 cores or that much memory. At this point, readers may be asking themselves the question, "How do I get performance-efficient packing of my MPI processes?"

Clearly, having more MPI processes on the same compute node should reduce the time to send messages between at least those processes and reduce total communication time. To get closer to a communication-efficient packing scheme, the '-l place=pack' option should be used in combination with the '-l select' 'mpiprocs=' resource. For example:

#PBS -l select=4:ncpus=4:mem=7680mb:mpiprocs=4
#PBS -l place=pack

This combination asks the scheduler to pack 4 resource chunks of 4 cpus each (16 cpus in total) onto 4 compute nodes. The 'mpiprocs=4' variable requests that each block of 4 processors be scheduled on the same compute node (execution host). This is accomplished by creating a $PBS_NODEFILE file that repeats the name of each node assigned to each of the 4 chunks, 4 times, as follows:

andy$ cat $PBS_NODEFILE
r1i0n3
r1i0n3
r1i0n3
r1i0n3
r1i0n10
r1i0n10
r1i0n10
r1i0n10
r1i1n5
r1i1n5
r1i1n5
r1i1n5
r1i1n6
r1i1n6
r1i1n6
r1i1n6

As assigned, on ANDY this job would run on 4 physical compute nodes (3, 10, 5, and 6) and would use half (4) of the available processors (cores) on each node. Other combinations of chunks and processors are possible and might be preferred on machines with more or fewer cores per node. At the CUNY HPC Center, both ANDY and BOB have 8 processors per compute node. SALK (the Cray) has 16 processors (cores) per node. The chunk count determines the number of resource pieces that PBS Pro must find a place for, and the product of the chunk count and the 'ncpus' resource variable determines the total number of processors (cores) reserved by PBS for the job. How would you change the above 16 processor (core) job to be closely packed (2 chunks of 8 cores each) to be run on ANDY or BOB?
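
One possible answer, assuming the 8-core nodes of ANDY and BOB and the 1920mb-per-core memory convention used above: two chunks of 8 cores each, with 8 MPI ranks grouped per chunk. Because each chunk is atomic and exactly fills a node, each lands whole on a single compute node even with free placement:

#PBS -l select=2:ncpus=8:mem=15360mb:mpiprocs=8
#PBS -l place=free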

While such close packing schemes may be recommended for reducing your job's communication time and speeding its execution once started by PBS, there is a potential downside when the system that you are submitting to is busy. A busy system is unlikely to have completely unassigned compute nodes. In such a case, a job submitted with a 'close packing' approach will be queued until unassigned nodes become available. On a busy system the wait could be a significant amount of time. As a result, more wall-clock time might be wasted waiting in the queue than is saved in processor time by packing the job. In general, submitting your work with '-l place=free' gives the PBS Scheduler the most flexibility in placing your job and is the best choice for moving your job as quickly as possible from the PBS queued state (Q) to the running state (R) on a busy system.

A SPECIAL NOTE ABOUT THE CRAY (SALK):

To run a similar MPI distributed memory job using PBS on the Cray would require several modifications. Here is the script modified for submission on the Cray:

#!/bin/bash
#
# Simple distributed memory MPI PBS Pro batch job
#
#PBS -q production
#PBS -N dparallel_job
#PBS -l select=16:ncpus=1:mem=2048mb
#PBS -l place=free
#PBS -j oe
#PBS -o dparallel.out
#PBS -V

# Find out name of primary compute node
hostname

# Change to working directory
cd $PBS_O_WORKDIR

# Start the job on 16 processors using 'aprun' ('mpirun' is used on non-Cray systems)
aprun -n 16 -d 1 -N 16 ./hw2.exe < ./message.input

First, the PBS-reserved memory per chunk has to be raised because on the Cray 'aprun' requests 2048 Mbytes per core by default, which is also the memory per core on each Cray compute node. Without this change, Cray's 'apsched' daemon produces the following error message:

apsched: claim exceeds reservation's memory

which indicates that 'aprun' is requesting more memory than the script's PBS-defined resource chunks have reserved. There are ways of asking for more or less memory per core using options to the 'aprun' command, but generally on the Cray 2048 Mbytes (2 Gbytes) should be requested on the PBS '-l select' line per cpu (core) in a resource chunk. In this case, there is 1 cpu in our PBS resource chunk and therefore we need only 2048 Mbytes (mem=2048mb) of memory.
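
For instance, 'aprun' accepts a per-PE memory request through its '-m' option; here is a sketch of making the same 2048 Mbyte claim explicit (consult 'man aprun' on SALK for the exact syntax and units):

aprun -n 16 -d 1 -N 16 -m 2048m ./hw2.exe < ./message.input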

In addition, on the Cray the 'aprun' command must be used to submit the job instead of 'mpirun'. Here, 'aprun' requests 16 PEs ('-n 16') with just 1 thread per PE ('-d 1'), which is the default, and requests 16 PEs per Cray compute node ('-N 16'), the maximum on the Cray which yields a fully packed node. This 'aprun' request falls within the boundaries of the resources reserved in the PBS '-l select' line, and therefore will be formally scheduled by the ALPS daemon on the Cray. As defined, this request will fit within a single Cray compute node.

Submitting GPU-Accelerated Data Parallel Jobs

GPU acceleration is supported on the PENZIAS server, which has two NVIDIA Kepler K20 GPUs on each of its 72 nodes. Although each of PENZIAS' 72 nodes has 16 cores, each node is divided into two virtual nodes: 8 cores with no GPUs and 8 cores with 2 GPUs. A single program might combine OpenMP symmetric multiprocessor (SMP) parallelism with GPU parallel acceleration, in which each OpenMP thread controls its own locally attached GPU. Alternatively, users might combine MPI's distributed memory, message-passing CPU parallelism across nodes with GPU parallel acceleration within nodes. Combining all three parallel programming models (MPI, OpenMP, and GPU parallelism) in a single program is even possible, although not often dictated by program requirements.

In a manner similar to OpenMP SMP parallelism, GPU acceleration takes advantage of an application's loop-level data parallelism by creating a separate execution thread for each independent iteration in single or nested looping structures. It then distributes those threads among the GPU's many small-footprint processors. Loops with dependencies can sometimes be restructured to eliminate those dependencies and allow for GPU processing. Many of the old concepts used to optimize code loops for vector computers are directly applicable to GPU data parallel acceleration and GPU programming.

While vector systems process loop iteration data in long, strip-mined, loop-iteration-based 'vectors', GPUs process the same loop iteration data in wide, processor-block-mapped 'warps'. Vector systems execute a vector quantity (32, 64, 128) of loop iterations with a single pipelined vector instruction, while GPUs generate a separate instruction sequence (a thread) for each loop iteration and schedule them in blocks called 'warps'. A GPU 'warp' (32 iterations wide) is analogous to a vector. When multiple GPUs are involved (GPU-MPI or GPU-OpenMP programs), loop iteration data is further divided across a collection of GPUs. Programmers are reminded that using GPUs requires them to negotiate the distribution of their data across another level in the memory hierarchy, because a GPU's memory and processing power are accessible only through the attached GPU's (motherboard) PCI Express bus.

Taking either of the parallel batch submission scripts above as a starting point, just a few modifications to the '#PBS' options and the command sequence are required to create a single-CPU, GPU-accelerated data parallel script for PBS Pro. A few additional changes would be required to create a combined GPU-OpenMP or GPU-MPI parallel PBS script.

Follow the instructions here to submit a basic serial CPU program with GPU acceleration written in CUDA C to the CUNY HPC Center's PBS Pro batch scheduling system:

1. Create a new sub-directory (named "gpuparallel" for example) in your home directory and move into by executing the following commands:

penzias$ mkdir gpuparallel
penzias$ cd gpuparallel

2. Use a text editor (CUNY HPCC suggests vi or vim) to create a file for the following CUDA C Host and Device code (simple3.cu in this example) and cut-and-paste these lines into it:

#include <stdio.h>

extern __global__ void kernel(int *d_a, int dimx, int dimy);

/* -------- CPU or HOST Code --------- */

int main(int argc, char *argv[])
{
   int dimx = 16;
   int dimy = 16;
   int num_bytes = dimx * dimy * sizeof(int);

   int *d_a = 0, *h_a = 0; // device and host pointers

   h_a = (int *) malloc(num_bytes);
   cudaMalloc( (void**) &d_a, num_bytes);

   if( 0 == h_a || 0 == d_a ) {
       printf("couldn't allocate memory\n"); return 1;
   }

   cudaMemset(d_a, 0, num_bytes);

   dim3 grid, block;
   block.x = 4;
   block.y = 4;
   grid.x = dimx/block.x;
   grid.y = dimy/block.y;

   kernel<<<grid,block>>>(d_a, dimx, dimy);

   cudaMemcpy(h_a,d_a,num_bytes,cudaMemcpyDeviceToHost);

   for(int row = 0; row < dimy; row++) {
      for(int col = 0; col < dimx; col++) {
         printf("%d", h_a[row*dimx+col]);
      }
      printf("\n");
   }

   free(h_a);
   cudaFree(d_a);

   return 0;

}

/* --------  GPU  or DEVICE Code -------- */

__global__ void kernel(int *a, int dimx, int dimy)
{
   int ix = blockIdx.x*blockDim.x + threadIdx.x;
   int iy = blockIdx.y*blockDim.y + threadIdx.y;
   int idx = iy * dimx + ix;

   a[idx] = a[idx] + 1;
}

NVIDIA distributes highly optimized libraries which may not be needed in basic compilations, but which may improve performance in production codes or in those developed on site. These high-performance NVIDIA libraries are packaged in the Software Development Kit (SDK). In order to access the CUDA tools and libraries, the user must load the CUDA programming environment via its module file. Please note that the current default version of the CUDA programming environment installed on PENZIAS is version 5.5.

penzias$ module load cuda

3. Compile the CUDA C code given above using 'nvcc', NVIDIA's CUDA C compiler. It is set up by default along with all the standard libraries needed to complete basic compilations.

penzias$ nvcc -o ./simple3.exe ./simple3.cu

4. Use a text editor (CUNY HPCC suggests vi or vim) to create a PBS Pro submit script (named 'gpu_job.sh' for example) and insert the lines below. This script selects 1 cpu and 1 companion GPU, where the CUDA Host (CPU) code and Device (GPU) code listed above each run, respectively. The simple3.exe file is a mixed binary that includes both CPU and GPU code, and everything needed for the CUDA runtime environment to negotiate whatever CPU-GPU cross-bus interaction is required to complete its execution.

#!/bin/bash
#
# Simple 1 CPU, 1 GPU PBS Pro batch job
#
#PBS -q production
#PBS -N gpu_job
#PBS -l select=1:ncpus=1:ngpus=1
#PBS -l place=free
#PBS -V

# Find out which compute node the job is using
hostname

# Change to working directory
cd $PBS_O_WORKDIR

# Run executable on a single node using 1 CPU and 1 GPU.
./simple3.exe


5. Submit the job to the PBS Pro Server using 'qsub gpu_job.sh'. You will then get the message:

penzias$ qsub gpu_job.sh
8606.penziasadmin.csi.cuny.edu

Here, 8606 is the PBS Pro job request ID and 'penzias.csi.cuny.edu' is the name of the system from which the job was submitted. (Note: This is not always the system on which the job is run. PBS Pro can be configured to allow users to queue jobs up on any system in a resource grid. At this time, the CUNY HPC Center has system-local scheduling only.) As with the examples above, this GPU job's status can be checked with the command 'qstat', or for a full listing of job 8606 'qstat -f 8606'.

6. When the job completes its output will be written to the file 'gpu_job.o8606':

penzias$ cat gpu_job.o8606

compute-0-53.local
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111
1111111111111111

Any errors for this job would have been written to 'gpu_job.e8606'.

The output file (transferred to and printed on the Host) contains the integer 1 assigned to each element of the integer array a[]. At the top, it prints the name of the compute node PBS assigned to the job, 'compute-0-53.local'. On PENZIAS, the GPU nodes are each equipped with 2 Kepler K20 GPUs.

Note that on this code's '-l select' line a single chunk is requested with 1 cpu ('ncpus=1') and 1 GPU ('ngpus=1'), and the placement option in the script has been set to 'free', which allows the PBS Pro scheduler to place this CPU-GPU chunk anywhere on the PENZIAS machine. Also note that PENZIAS has 4 GB of memory per core. As such, the maximum usable memory per chunk is 3840mb. However, the HPC Center recommends requesting at least 10% less than the maximum available memory per chunk. Thus the '-l select' line should look like:


#PBS -l select=1:ncpus=1:ngpus=1:mem=3680mb

HPC Center staff has also created example OpenMP-GPU and MPI-GPU codes, makefiles, and PBS scripts that can be requested by sending an email to 'hpchelp@csi.cuny.edu' (very cool stuff)!

Submitting 'Interactive' Batch Jobs

PBS Pro provides a special kind of batch option called interactive-batch. An interactive-batch job is treated just like a regular batch job (in that it is queued up, and has to wait for PBS Pro to provide it with resources before it can run); however, once the resources are provided, the user's terminal input and output are connected to the job in a manner similar to an interactive session. The user is interactively logged into a "master" execution host (compute node), and its resources and the rest of the resources reserved (processors and otherwise) by PBS are held for the interactive job's duration.

Interactive-batch jobs can take a script file like regular batch jobs, but only the '#PBS' options in the header are read. All script-file commands are ignored. It is assumed that the user will supply their commands interactively after the session has started on the assigned execution host. As always, the '#PBS' options can also be supplied on the 'qsub' command line. All PBS Pro interactive-batch jobs must include the '-I' option on the 'qsub' command line. The following example starts a 4 processor interactive-batch session packed onto a single compute node (compute-0-2 here) from which a 4 processor MPI job is run. One should note that while the resources requested by the interactive-batch job are reserved for the duration of the job, the user does not have to use them all with each interactive job submission.

bob$ qsub -I -q interactive -N intjob -V -l select=4:ncpus=1:mem=1920mb -l place=pack
qsub: waiting for job 73.bob.csi.cuny.edu to start
qsub: job 73.bob.csi.cuny.edu ready
compute-0-2$
compute-0-2$ cd dparallel
compute-0-2$ cat $PBS_NODEFILE
compute-0-2
compute-0-2
compute-0-2
compute-0-2
compute-0-2$ hostname
compute-0-2
compute-0-2$ mpirun -np 4 -machinefile $PBS_NODEFILE ./hello_mpi.exe
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
compute-0-2$ CNTRL^D

bob$ hostname
bob.csi.cuny.edu

The 'qsub' options (provided on the command line in this case) define all that is needed for this PBS interactive-batch job. Here the '-I' option must be provided on the command line, and the CUNY HPCC's 'interactive' queue must be selected. This queue has compute nodes dedicated to interactive work that cannot be used by production batch jobs. When the requested resources are found by the PBS Pro Scheduler, the user is logged into one of those compute nodes, the shell prompt returns, and a $PBS_NODEFILE is created. The user must change to the directory from which he wishes to run the job, as he would in a regular batch script. This is all laid out in the session above, where a 4 processor job is started directly with the 'mpirun' command from the shell prompt. It runs and returns its output to the terminal. More such jobs could be run if desired, although there are cpu time and wall-clock time limits imposed on interactive sessions. The defaults are 8 and 16 minutes respectively; the maximums are 16 and 32 minutes. An interactive-batch session is terminated by simply logging out of the execution host that PBS provided for the session, by typing a CNTRL^D. This logs the user out of the compute node and returns them to the head node where the job was submitted.

Through the 'interactive' queue, CUNY HPCC has reserved compute-node resources for interactive-batch jobs only. The 'interactive' queue, along with the 'development' queue described below, has been created to ensure that some system resources are always available for code development. More about these and CUNY HPCC's other PBS Pro queues is provided in a subsequent section.

A SPECIAL NOTE ABOUT THE CRAY (SALK):

The sequence above will work on the Cray if the following changes are made. Change the qsub command to:

qsub -I -q interactive -N intjob -V -l select=4:ncpus=1:mem=2048mb -l place=pack

where the memory requested per cpu (core) is 2048 Mbytes. And replace the 'mpirun' command with:

aprun -n 4 -d 1 -N 4 ./hello_mpi.exe

Because the Cray's login node (salk.csi.cuny.edu) functions as a master compute node or execution host with access to all the compute node resources managed by Cray ALPS daemon, the 'hostname' command will always return the name 'salk.csi.cuny.edu' on the Cray when run from any PBS script. Neither users nor PBS can directly access the Cray's compute nodes.

One last point relating to Cray (SALK) interactive work. On the Cray only, interactive jobs may be run directly from the command line using 'aprun', because the HPC Center has reserved 2 Cray compute nodes (2 x 16 cores) for interactive use completely outside of PBS. This compute-node privilege is not to be confused with running the PBS interactive jobs just discussed, nor with running jobs directly on the Cray head node (a service node), which is forbidden on the Cray as it is on all other CUNY HPC systems (such a forbidden head-node job would be one started without the 'aprun' command).

An 'aprun'-initiated Cray-interactive job would be run from the interactive prompt simply with:

salk$ aprun -n 4 -d 1 -N 4 ./hello_mpi.exe

without ANY interaction with PBS. Such jobs are limited in size (32 cores) and will be killed if they run for more than 30 wall-clock minutes.

More on PBS Pro resource 'chunks' and the '-l select' Option

With examples of options to the 'qsub' command and how to submit jobs to the PBS Pro Server presented above, the more complete description and additional examples provided here will be easier to understand. The general form for specifying PBS resource 'chunks' to be allocated on the compute nodes with the '-l select' option, is as follows:

-l select=[N:]chunk_type1 + [N:]chunk_type2 + ...

Here, the values of N give the number of each type of chunk requested, and each chunk type is defined with a collection of node-specific, resource-attribute assignments using the '=' sign. Each attribute is separated from the next by a colon, as in:

ncpus=2:mem=1920mb:arch=linux:switch=qdr ...

While it was not seen in the examples above, more than one type of PBS resource chunk can be defined within a single '-l select' option. Additional distinct types are appended with the '+' sign as shown just above. Using multiple chunk types should be a relatively infrequent occurrence at CUNY HPCC because our nodes are in general physically uniform. There are many kinds of node resource attributes (for more detail enter 'man pbs_resources'). Many are built into PBS Pro; some may be site-defined. They can have a variety of data types (long, float, boolean, size, etc.), and they can be consumed by running jobs when requested (ncpus, for instance) or just used to define a destination. More detail on the node-specific attributes used to define chunks can also be found with 'man pbs_node_attributes'.

Once the number and type of chunk are defined, the PBS Pro Scheduler maps the requested chunks onto the available system resources. If the system can physically fulfill the request, and no other jobs are already using the resources requested, the job will run. Jobs with resource requests that are physically impossible to fulfill will never run, although they can be queued with no warning to the user about the request's impossibility. Those that cannot be fulfilled because other jobs are using the resources will be queued and eventually run. To determine exactly what resources you have been given, whether your job is running or not, and if not, the reason why, generate your job's full description with the 'qstat -f <job_id>' command. (Note: When no resource is requested by the user, the default values set for the queue in which the user's job ends up are applied to the job. These may not be exactly what is required by the job or wanted by the user.)
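For example, to check on a queued job, one might run the following (the job ID '1234.service0' is purely illustrative; use the ID that 'qsub' returned for your job):

qstat -f 1234.service0 | grep -i comment

The 'comment' attribute in the full 'qstat -f' output typically records the Scheduler's most recent reason for not starting a queued job, making it a good first stop when diagnosing a job that will not run.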

Below are a number of additional examples of '-l select' resource requests with an explanation of what is being requested. These are more complicated, synthetic cases intended to give users an idea of what is possible. They are not tailored to any specific CUNY HPC PBS Pro system and will not necessarily be usable exactly as provided here, but they should give you an idea of the variety of resource requests possible with the '-l select' option. Users are directed to the PBS Pro 12 User Guide here for additional examples and a full description of PBS Pro from the user's perspective.

Example 1:

-l select=2:ncpus=1:mem=10gb:arch=linux+3:ncpus=2:mem=8gb:arch=solaris

This job requests two chunk types, 2 of one type and 3 of the other. The first chunk type is to be placed on Linux compute nodes, and the second type on Solaris compute nodes. The first chunk requires nodes with at least 1 processor and 10 GBytes of memory. The second chunk requires nodes with at least 2 processors and 8 GBytes of memory.

Example 2:

-l select=4:ncpus=4:bigmem=true

This job requests 4 chunks of the same type, each with 4 processors and each with a site-specific boolean attribute of 'bigmem'. Nodes with 4 processors available that have been earmarked by the site as large-memory nodes would be selected. A total of 16 processors would be allocated by PBS.

Example 3:

-l select=4:ncpus=2:lscratch=250gb
-l place=pack

This job also asks for 4 resource chunks of the same type. Each node selected by PBS must have 2 processors available and 250 GBytes of node-local scratch space. The job requests a total of 8 processors and then asks for the resources to be packed onto a single node. Unless the system to which this job is submitted has nodes with 8 cores and a total of 1 TByte of local storage (4 x 250 GBytes), the job will remain queued indefinitely. The 'lscratch' resource attribute is site-local, and its availability would be determined by a run-time script that checks available disk space. The CUNY HPC Center has defined a local scratch disk resource on ANDY.

Example 4:

-l select=3:ncpus=2:mpiprocs=2

This job requests 3 identical resource chunks, each with 2 processors, for 6 in total. The 'mpiprocs' resource attribute affects how the $PBS_NODEFILE is constructed. It ensures that the generated $PBS_NODEFILE includes two entries for each of the three chunks (and probably nodes) allocated, so that two MPI processes will run per node. The $PBS_NODEFILE generated from this job script would contain something like this:

node10
node10
node11
node11
node12
node12

Without the 'mpiprocs' attribute there would be only three entries in the file, one for each execution host.
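In the body of the script, an MPI launch line consuming this file might look like the following sketch (the executable name is illustrative, and some PBS-integrated MPI stacks read $PBS_NODEFILE automatically, making the machinefile flag unnecessary):

mpirun -np 6 -machinefile $PBS_NODEFILE ./my_mpi.exe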

Example 5:

-l select=1:ncpus=1:mem=7680mb
-l place=pack:excl

This job requests just 1 chunk for a serial job that requires a large amount of memory. The '-l place=pack:excl' option ensures that when the chunk is allocated, no other job will be able to allocate resource chunks on that node, even if there are additional resources available. This may idle some processors on that node, but it ensures that the node's memory is entirely available to this job. The 'pack:excl' stands for pack exclusively (i.e., do not allow any other job on that node). By default, CUNY HPC system nodes have been configured to allow their resources to be divided among multiple jobs, which means that a given node may have more than one user job running on it. An exception is the Cray, where this is prevented by Cray's ALPS scheduler in order to reduce interrupt-driven load imbalance for jobs with high core counts.

Example 6:

-l select=4:ncpus=2:ompthreads=4
-l place=pack:excl

This job is configured to run a hybrid MPI and OpenMP code. The number of processors explicitly requested in total is 8 (the chunk count [4] times the ncpus number [2]). The $PBS_NODEFILE would include 4 compute nodes, one for each chunk (also, one for each of just 4 MPI processes). Assigned nodes would need to have at least 2 processors on them, but could have more. If they had, for instance, 4 processors (cores), then OpenMP would start 4 threads, one per physical processor. If they had only 2, then OpenMP would still run 4 threads on each node, but the 4 threads would compete for the 2 physical cores. This set of options would suit a hyper-threaded processor like the Intel Nehalem used on CUNY's ANDY cluster system. (Note: The CUNY HPC Center has limited the number of processes that may run on ANDY's nodes to 8, the same as the number of physical cores.) The '-l place=pack:excl' option again ensures that no other jobs will be placed there to compete with this job's 4 OpenMP (SMP) threads.
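Pulling the pieces of this last example together, a complete hybrid-job batch script might look like the minimal sketch below (the job name, executable, and MPI launch line are illustrative and would need to be adapted to the target system):

#!/bin/bash
#PBS -q production
#PBS -N hybrid_job
#PBS -l select=4:ncpus=2:ompthreads=4
#PBS -l place=pack:excl
#PBS -V

# Change to the directory from which the job was submitted.
cd $PBS_O_WORKDIR

# PBS Pro sets OMP_NUM_THREADS from the 'ompthreads' attribute above;
# start one MPI process per chunk, each spawning 4 OpenMP threads.
mpirun -np 4 -machinefile $PBS_NODEFILE ./hybrid_mpi_omp.exe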

A SPECIAL NOTE ABOUT THE CRAY (SALK):

PBS job submission on the Cray is more complicated because of the requirement to use 'aprun' (a relatively complicated utility) to execute applications and to ensure that the resources 'aprun' claims do not exceed those that PBS has reserved for the job in the '-l select' line. Reading the 'aprun' man page carefully, including the examples at the end, will reduce your frustration in the long run and is highly recommended by HPC Center staff. A few rules of thumb relating the PBS '-l select' options to the 'aprun' command-line options are provided here to help with basic job submission.

Rule 0:

>> Jobs submitted to the production queue ('-q production') cannot request fewer than 16 PBS resource chunks (PEs). <<

Jobs requiring fewer than 16 chunks (PEs) must be submitted to the 'development' queue. If you try to submit such a job to the production queue, you will get the following error message:

qsub: Job rejected by all possible destinations

Rule 1:

>> The number of PEs (often equivalent to cores, but not always) per physical compute node, set with the 'aprun' '-N' option, should never be greater than 16 (the maximum per node on SALK) or greater than the total number of PEs requested by the job via the '-n' option. <<

If you try to submit such a job, you will get the following error message:

apsched: -N value cannot exceed largest node size

or else:

aprun: -N cannot exceed -n

Rule 2:

>> The total number of PEs set by the 'aprun' '-n' option and to be used by the application should never exceed the total number of PBS chunks requested in the '-l select=' statement. <<

If you try to submit such a job, you will get the following error message:

apsched: claim exceeds reservation's node-count

Rule 3:

>> The product of the number of PEs requested by the 'aprun' '-n' option and the number of threads requested with the 'aprun' '-d' options should never exceed the product of the number of PBS chunks, '-l select=' and the number of cores per chunk, 'ncpus='. <<

If you try to submit such a job, you will get the following error message:

apsched: claim exceeds reservation's CPUs

Rule 4:

>> By default, the 'aprun' command requests 2 Gbytes (2048 Mbytes) of memory per cpu (core). You should set your PBS '-l select' per-cpu (core) memory resource to 'mem=2048mb' to match this default. <<

This is a per-cpu (core) requirement. If your PBS resource chunks have multiple cpus (cores), then the memory requested per chunk should be the appropriate multiple. If you set your memory resource in the PBS '-l select' option to less than 2048 Mbytes per cpu (core), you will get the following error message:

apsched: claim exceeds reservation's memory

If you have difficulty, or need to do something special like carefully placing your job for best nearest-neighbor performance on SALK's 2D torus interconnect or using more than 2048 Mbytes of memory per core, you should ask for help via 'hpchelp@csi.cuny.edu' after reading through the 'aprun' man page.
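As a consolidated illustration of Rules 0 through 4, a minimal SALK batch script that satisfies all of them might look like the following sketch (the job name and executable are illustrative):

#!/bin/bash
#PBS -q production
#PBS -N cray_job
#PBS -l select=32:ncpus=1:mem=2048mb
#PBS -V

cd $PBS_O_WORKDIR

# The 32 PEs claimed with '-n' match the 32 chunks reserved above (Rule 2),
# the 16 PEs per node set with '-N' do not exceed SALK's node size (Rule 1),
# '-n' times '-d' does not exceed chunks times ncpus (Rule 3), and the
# per-PE memory reserved above matches aprun's 2048 MByte default (Rule 4).
aprun -n 32 -N 16 -d 1 ./my_app.exe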

CUNY HPC Center PBS Pro Queue Structure

CUNY HPC Center has designed its PBS Pro queue structure to efficiently map our population of batch jobs to HPC Center system resources. PBS Pro has two distinct types of queues, execution queues and routing queues. Routing queues are defined to accept general classes of work (production, development and debugging, etc.). Jobs submitted to routing queues are then directed to their associated execution queues based on the resources requested by the job. Job resource requirements are assigned either explicitly by the user with the '-l select' option as described above, or (when not explicitly indicated by the user) implicitly through pre-defined PBS server and PBS queue resource defaults. Jobs in each general class are sorted first by the routing queue selected and then according to their resource requirements for placement in an execution queue. The execution queue is a PBS job's final destination and is shown by the 'qstat' command.

CUNY HPC Center PBS Pro Routing Queues

The routing queues that CUNY HPC Center users will generally use in their PBS Pro batch submission scripts ('-q' option) include:

Routing Queue Type 1:

interactive          ::  A development and debug queue for small scale, short ''interactive''-batch jobs.

Routing Queue Type 2:

development          ::  A development and debug queue for small scale, short batch ''test'' jobs.

Routing Queue Type 3:

production           ::  A production queue for ''production'' work of any scale and length.

On ANDY, users can submit their jobs to the default production queue described above (production), which offers 360 Intel Nehalem cores attached to a dedicated DDR InfiniBand interconnect, OR they can submit their batch production work to the production_qdr queue described below, which offers 360 identical Intel Nehalem cores attached to a 2x faster, but shared, communication-and-storage QDR InfiniBand interconnect. Please note that there is NO 'production_qdr' queue on PENZIAS; arrange your job script accordingly when moving from ANDY to PENZIAS and vice versa.

QDR Routing Queue Type 3:

production_qdr           ::  A production queue for ''production'' work of any scale and length.
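Selecting this queue is simply a matter of naming it with the '-q' option, either on the command-line or in the script itself (the script name here is illustrative):

qsub -q production_qdr myjob.pbs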

Other routing queues have been defined for reservations, dedicated time, idle cycles, and rush jobs. These are currently disabled, but will be activated as CUNY HPC Center develops its 24 x 7 scheduling policy on each of its HPC systems.

Choosing the right routing queue is important because resources have been reserved for each class of work and limits have been set on the resources available in each queue. For instance, jobs submitted to the interactive routing queue are limited to 4 or fewer processors and can run for a maximum of 32 processor-minutes in total. The production routing queues will accept a job of virtually any size and duration and move it to the appropriate execution queue defined below.

CUNY HPC Center PBS Pro Execution Queues

At the CUNY HPC Center, one of the currently active routing queues described above MUST be the queue name used with the '-q' option to the PBS Pro 'qsub' command, as in '-q production'. As defined at CUNY HPCC, jobs can ONLY be submitted to routing queues. Queues with the '_qdr' extension are valid only on ANDY. From there, CPU-only jobs are routed to one of the following execution queues based on the resources requested:

Execution Queue 1 (not on SALK):

qint4         ::  A queue limited to ''interactive'' work of not more than 4 processors and 16 total processor minutes.

Execution Queue 2 (not on SALK):

qdev8        ::  A queue limited to batch ''development'' work of not more than 8 processors and 60 total processor minutes.

Execution Queue 3 (not on SALK):

qserial(_qdr)       ::  A queue limited to batch ''production'' work of not more than 1 processor (currently no cpu time limit)

Execution Queue 4:

qshort16(_qdr)    ::  A queue limited to batch ''production'' work of between 2 to 16 processors, and fewer than 32 total processor hours

Execution Queue 5:

qlong16(_qdr)     ::  A queue limited to batch ''production'' work of between 2 and 16 processors, and more than 32 total processor hours (currently no cpu time limit)

Execution Queue 6 (not on SALK):

qshort64(_qdr)    ::  A queue limited to batch ''production'' work of between 17 and 64 processors, and fewer than 128 total processor hours

Execution Queue 7 (not on SALK):

qlong64(_qdr)    ::   A queue limited to batch ''production'' work of between 17 and 64 processors, and more than 128 total processor hours (currently no cpu time limit)

Execution Queue 8 (SALK only):

qshort256         ::  A queue limited to batch ''production'' work of between 17 and 256 processors, and fewer than 256 total processor hours

Execution Queue 9 (SALK only):

qlong256         ::   A queue limited to batch ''production'' work of between 17 and 256 processors, and more than 256 total processor hours (currently no cpu time limit)

Execution Queue 10 (SALK only):

qshort1024         ::  A queue limited to batch ''production'' work of between 257 and 1024 processors, and fewer than 1024 total processor hours

Execution Queue 11 (SALK only):

qlong1024         ::   A queue limited to batch ''production'' work of between 257 and 1024 processors, and more than 1024 total processor hours (currently no cpu time limit)

Execution Queue 12:

qmax(_qdr)        ::  A queue limited to batch ''production'' work of between 65 and 132 processors (currently no cpu time minimum or limit) **

(**) On SALK the 'qmax' queue accepts jobs from 1025 to 2048 processors. Special provisions for jobs requiring more resources than are allowed by default in 'qmax' can be made on any system. The smaller size of PENZIAS required that the qshort64 and qlong64 execution queues be replaced by a smaller qmax queue. The parenthetical extensions (_qdr) indicate the name of the QDR-equivalent execution queue on ANDY.
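The current limits and status of the queues on any given system can be inspected directly with PBS Pro's queue-listing commands:

qstat -q         (summary table of all queues with their resource and time limits)
qstat -Qf qmax   (full attribute listing for a single queue, 'qmax' here as an example)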

As you can see from the resource limits, the execution queues are designed to contiguously cover the resource-request space. Jobs submitted to the routing queues will be sorted according to the resources requested in the '-l select' option and placed in the appropriate execution queues. The job's memory requirements are also considered. The entire physical memory of a given system's compute node has been divided proportionately among the cores available on the node. This value is the default requested for each resource chunk unless otherwise specified by the user, and it sets the amount of the job's total memory to be mapped to the node's physical memory space. Jobs that actually need more memory will have pages that spill out onto disk. Each execution queue limits the amount of memory available to this proportional fraction of the node's memory times the processor (core) count of the job, up to the processor (core) limit for the queue.

Each execution queue has its priority set according to the prevailing usage pattern on each system. Currently, this priority scheme slightly favors jobs that are between 8 and 16 processors in size on all systems except the Cray (SALK). On SALK, jobs between 256 and 1024 cores have the highest priority to encourage the execution of large jobs there. Still, a job's priority depends on more than the priority of the execution queue it ends up in. As a job accumulates time in the queued (Q) state, its priority rises, and this new priority is used at the next scheduling cycle (currently every 5 minutes) to decide whether or not to run the job. Furthermore, the current CUNY HPC Center PBS Pro configuration has backfilling enabled, so some smaller, lower-priority jobs may be started when there is not enough space to run larger, higher-priority queued jobs. This 'priority-formula' based approach to job scheduling may be supplanted by a 'fair-share' approach in the future.

The workload at the CUNY HPC Center is varied, which makes it difficult to achieve perfect utilization while maintaining fair access to the resources for all parties. The objective of the queueing structure's design is to strike a balance between high utilization and fair access. The queue limits and priorities have already been refined several times since PBS Pro became CUNY's default batch scheduler to better meet these goals. Your input is invited on this question, and if you find your jobs have remained queued for periods of more than a day, please feel free to make an inquiry on the matter. We have found that the majority of delayed jobs have been delayed unnecessarily due to job-submission script errors (impossible-to-fulfill resource requests) or inefficiencies (where an alternative script would allow the job to start). This is not always the case, and when problems arise their resolution usually leads to a better PBS configuration. The CUNY HPC Center also recommends that users be prepared to run their applications on multiple systems in the event that one system is busy and another is more lightly loaded. From our usage records we know that users who operate in this manner get more processing time overall.

As our familiarity with PBS Pro grows and as the needs of our user community evolve, it is likely that the queue structure will continue to be refined and augmented. There may be a need to create additional queues for specific applications, for instance. User comments regarding the queue structure are welcome and should be sent to hpchelp@mail.csi.cuny.edu.