Running Jobs
__TOC__
== Overview ==
The HPCC resources are grouped into three tiers, free tier (FT), advanced tier (AT), and condo tier (CT), plus the separate server Arrow. In all cases, and regardless of the server used, all jobs at HPCC must:
# Start from the user's directory on the '''scratch''' file system '''- <font face="courier">/scratch/<font color="red"><userid></font></font>'''. <u>Jobs cannot be started from users' home directories -</u> '''/global/u/<font face="courier"><font color="red"><userid></font></font>'''
# Use the SLURM job submission system (job scheduler). All job submission scripts written for other job schedulers (e.g. PBS Pro) must be converted to SLURM syntax.
All users' data must be kept in the user's home directory '''/global/u/<font face="courier"><font color="red"><userid></font></font>'''. Data on /scratch can be purged at any time and is '''<u>NOT protected</u>''' by tape backup. Arrow and the CT servers mount an independent file system (HPFS), so data cannot be shared directly between the AT/FT servers and the Arrow/CT servers; users must explicitly move files between them.
=== Advanced and Free Tier ===
Servers in the FT and AT are Blue Moon, Penzias, CRYO, and Appel. They are attached to two file systems via a 40 Gbps Infiniband interconnect: '''/scratch''' and '''/global/u''' (previously known as DSMS). The former is a separate, small, disk-based parallel file system NFS-mounted on all nodes (compute and login); the latter is a large, slower file system holding all users' home directories ('''/global/u/<font face="courier"><font color="red"><userid></font></font>''') and mounted only on the servers' login nodes. Both file systems have moderate bandwidth of several hundred MB per second. Because the '''scratch''' file system is the one mounted on all compute nodes, all jobs on any of these servers must start from the '''<font face="courier">/scratch/<font color="red"><userid></font></font>''' directory; jobs cannot be started from the user's home directory '''/global/u/<font face="courier"><font color="red"><userid></font></font>'''. Users must preserve valuable files (data, executables, parameters, etc.) in '''/global/u/<font color="red"><userid></font>'''. Every home directory on the free and advanced tier servers has a quota of 50 GB, which can be expanded by submitting a reasoned request to HPCC. The '''/global/u''' file system is backed up, with a backup retention time of 30 days.


=== Condo Tier and Arrow ===
As stated above, all jobs must start from the '''<font face="courier">/scratch/<font color="red"><userid></font></font>''' directory and valuable data must be kept in '''/global/u/<font face="courier"><font color="red"><userid></font></font>'''. For Arrow and the condo servers, '''/scratch''' and '''/global/u''' reside on the same HPFS file system, attached over a 200 Gbps Infiniband interconnect. The system software takes care of optimal placement of the files. Note that '''/global/u is not backed up on tape at this time due to lack of funds.'''


=== Copy/move files from/to server ===
This section is an overview. For details please refer to the section '''"File Transfers"'''.


==== From/to servers in the free and advanced tier ====


Files can be copied or moved to and from the FT and AT servers in one of three ways:
* by using the '''cea''' data transfer node
* by tunneling data (without copying) via the gateway ('''chizen''')
* by using '''Globus Online'''


Copying data from/to a user's computer to/from Chizen itself is discouraged: Chizen has little memory and thus cannot handle large files.
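As an illustration, a single file can be copied from a local machine to the user's home directory through the '''cea''' node with a standard scp command. This is a minimal sketch; the fully qualified host name of cea is an assumption here and may differ on the actual system:<syntaxhighlight lang="bash">
# run on the local machine: copy a file to the user's home directory via cea
# (host name cea.csi.cuny.edu is assumed; check with HPCC for the actual name)
scp ./mydata.dat <userid>@cea.csi.cuny.edu:/global/u/<userid>/
</syntaxhighlight>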


==== From/to Arrow and any of the condo servers ====
Only tunneling (not copying) via the gateway is supported. Note that Globus and cea are not accessible from the CT servers and Arrow.


== Running jobs on servers from the advanced and free tier ==


=== Partitions ===
The main partition, which distributes jobs to the other partitions, is '''production'''. Users must use this partition for all job submissions. The partition currently has a time limit of 120 hours. Note that the time limit, as well as the number of jobs per group and per user, is '''reviewed periodically''' and may change in order to maximize utilization of the resources. In addition, the MHN supports the '''partdev partition''', which has a limit of 2 hours and is dedicated to code development.
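A job submission script for these servers therefore names the production partition explicitly. A minimal sketch (the job name, time request, and executable are illustrative placeholders):<syntaxhighlight lang="bash">
#!/bin/bash
#SBATCH --partition production   # main routing partition on FT/AT servers
#SBATCH -J my_job                # illustrative job name
#SBATCH --time=24:00:00          # must stay within the current 120-hour limit

cd $SLURM_SUBMIT_DIR
./a.out
</syntaxhighlight>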


=== Copy/move files from/to FT and AT servers ===
Before submitting any job to the FT/AT servers, users must prepare/move/copy data into their <font face="courier">'''/scratch/<font color="red"><userid></font>'''</font> directory. Users can transfer data to/from <font face="courier">'''/scratch/<font color="red"><userid></font>'''</font> by using the file transfer node '''cea''' or by using '''Globus Online'''. HPCC recommends transferring files to the user's home directory ('''/global/u/<font color="red"><userid></font>''') first, and then copying the needed files from the home directory to '''/scratch/<font color="red"><userid></font>'''. Note that both '''cea''' and '''Globus Online''' allow the transfer of a user's files directly to '''/global/u/<font color="red"><userid></font>'''. Input data, job scripts, and parameter files can be generated locally with a Unix/Linux text editor such as Vi/Vim, Edit, Pico, or Nano. MS Windows Word is a word processing system and cannot be used to create job submission scripts.
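A typical staging sequence on a login node may look like the following sketch, where <job_name>, <myTask>, and the file names are placeholders chosen by the user:<syntaxhighlight lang="bash">
# stage job files from the home directory to scratch before submission
cd /scratch/<userid>
mkdir <job_name> && cd <job_name>
cp /global/u/<userid>/<myTask>/a.out ./
cp /global/u/<userid>/<myTask>/<mydatafile> ./
</syntaxhighlight>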


=== Set up application environment ===
FT and AT servers use "Modules" to set up the environment. “Modules” makes it easier for users to run a standard or customized application and/or system environment. On the AT and FT servers HPCC uses both the classical TCL UNIX modules and LMOD, an advanced module system. The latter addresses the MODULEPATH hierarchy problem common in UNIX-based "modules" implementations. Application packages can be loaded and unloaded cleanly through the module system using modulefiles. This includes easily adding or removing directories from the PATH environment variable. Modulefiles for library packages provide environment variables that specify where the library and header files can be found. All the popular shells are supported: '''bash, ksh, csh, tcsh, zsh.''' LMOD is also available for Perl and Python. It is important to mention that LMOD can interpret TCL module files. The basic module commands are listed below. Note that almost all applications have a default version and several other versions. The default version is marked with (D). For example:<syntaxhighlight lang="abap">
python/2.7.13_anaconda (D)
</syntaxhighlight>denotes the default version of Python, which can be loaded without explicitly specifying the version of the software: <syntaxhighlight lang="abap">
module load python
</syntaxhighlight>Any other, non-default version of the same software can be loaded by specifying the full name of the module file: <syntaxhighlight lang="abap">
module load python/3.7.6_anaconda
</syntaxhighlight>will load the non-default version 3.7.6 of Python. The module load command can be used to load several application environments at once:
module load package1 package2 ...
For documentation on “Modules”:
man module
For help enter:
module help
To see a list of currently loaded “Modules” run:
module list
To see a complete list of all modules available on the system run:
module avail
To show the content of a module enter:
module show <module_name>
To change from one application to another (for example, between the default versions of the GNU and Intel compilers):
module swap gcc intel
To go back to an initial set of modules:
module reset
=== Using LMOD commands ===
To get a list of all modules available
module spider
To get information about a specific module
module spider python
=== Modules for the advanced user ===
A “Modules” example for advanced users who need to change their environment.
The HPC Center supports a number of different compilers, libraries, and utilities. In addition, at any given time different versions of the software may be installed. “Modules” is employed to define a default environment, which generally satisfies the needs of most users and eliminates the need for the user to create the environment. From time to time, a user may have a specific requirement that differs from the default environment.
In this example, the user wishes to use a version of the NETCDF library on the HPC Center’s Cray Xe6 (SALK) that is compiled with the Portland Group, Inc. (PGI) compiler instead of the installed default version, which was compiled with the Cray compiler. The approach to do this is:
: • Run '''module list''' to see what modules are loaded by default.
: • Determine what modules should be unloaded.
: • Determine what modules should be loaded.
: • Add the needed modules, i.e., '''module load'''
The first step, see what modules are loaded, is shown below.
'''user@SALK:~> module list'''
Currently Loaded Modulefiles:
  1) modules/3.2.6.6
  2) nodestat/2.2-1.0400.31264.2.5.gem
  3) sdb/1.0-1.0400.32124.7.19.gem
  4) MySQL/5.0.64-1.0000.5053.22.1
  5) lustre-cray_gem_s/1.8.6_2.6.32.45_0.3.2_1.0400.6453.5.1-1.0400.32127.1.90
  6) udreg/2.3.1-1.0400.4264.3.1.gem
  7) ugni/2.3-1.0400.4374.4.88.gem
  8) gni-headers/2.1-1.0400.4351.3.1.gem
  9) dmapp/3.2.1-1.0400.4255.2.159.gem
  10) xpmem/0.1-2.0400.31280.3.1.gem
  11) hss-llm/6.0.0
  12) Base-opts/1.0.2-1.0400.31284.2.2.gem
  13) xtpe-network-gemini
  14) cce/8.0.7
  15) acml/5.1.0
  16) xt-libsci/11.1.00
  17) pmi/3.0.0-1.0000.8661.28.2807.gem
  18) rca/1.0.0-2.0400.31553.3.58.gem
  19) xt-asyncpe/5.13
  20) atp/1.5.1
  21) PrgEnv-cray/4.0.46
  22) xtpe-mc8
  23) cray-mpich2/5.5.3
  24) SLURM/11.3.0.121723
From the list, we see that the Cray Programming Environment ('''PrgEnv-cray/4.0.46''') and the Cray Compiler environment are loaded ('''cce/8.0.7''') by default. To unload these Cray modules and load in the PGI equivalents we need to know the names of the PGI modules. The '''module avail''' command shows this.
  '''user@SALK:~> module avail'''
  •
  •
  •
We see that there are several versions of the PGI compilers and two versions of the PGI Programming Environment installed. For this example, we are interested in loading PGI's 12.10 release (not the default, which is '''pgi/12.6''') and the most current release of the PGI programming environment ('''PrgEnv-pgi/4.0.46'''), which is the default.
The following module commands will unload the Cray defaults, load the PGI modules mentioned, and load version 4.2.0 of NETCDF compiled with the PGI compilers.
user@SALK:~> module unload PrgEnv-cray
user@SALK:~> module load PrgEnv-pgi
user@SALK:~> module unload pgi
user@SALK:~> module load pgi/12.10
user@SALK:~>
user@SALK:~> module load netcdf/4.2.0
user@SALK:~>
user@SALK:~> cc -V
/opt/cray/xt-asyncpe/5.13/bin/cc: INFO: Compiling with CRAYPE_COMPILE_TARGET=native.
pgcc 12.10-0 64-bit target on x86-64 Linux
Copyright 1989-2000, The Portland Group, Inc.  All Rights Reserved.
Copyright 2000-2012, STMicroelectronics, Inc.  All Rights Reserved.
A few additional comments:
: • The first three commands do not include version numbers and will therefore load or unload the current default versions.
: • In the third line, we unload the default version of the PGI compiler (version 12.6), which is loaded with the rest of the PGI Programming Environment in the second line. We then load the non-default and more recent release from PGI, version 12.10 in the fourth line.
: • Later, we load NETCDF version 4.2.0 which, because we have already loaded the PGI Programming Environment, will load the version of NETCDF 4.2.0 compiled with the PGI compilers.
: • Finally, we check which compiler the Cray "cc" compiler wrapper actually invokes after this sequence of module commands by running '''cc -V''', as shown above.
== Running jobs on Arrow and condo servers ==
The Arrow server consists of 2 nodes purchased with NSF AOC grant funding. Condo servers are purchased directly by faculty with their own research funds. The Arrow nodes and condo nodes share the same login node and are attached via a 200 Gbps Infiniband interconnect to the same hybrid (NVMe + hard disks) fast file system called '''HPFFS''', which can provide speeds of 25-30 GB/s write and 45-50 GB/s read. '''/scratch''' and '''/global/u''' are part of the same '''HPFFS''' file system, but scratch is optimized for predominant access to the fast NVMe tier of '''HPFFS'''. The underlying system software manages the placement of files to ensure the best possible performance for different file types. All jobs must start from the '''<font face="courier">/scratch/<font color="red"><userid></font></font>''' directory. Jobs <u>cannot be started</u> from the user's home directory '''/global/u/<font face="courier"><font color="red"><userid></font></font>'''. It is important to mention that data in '''/global/u''' on the '''HPFFS''' file system is not backed up, since this equipment is not integrated into the HPCC backup infrastructure. Every user home directory has a quota of 100 GB, which can be expanded by submitting a motivated request to HPCC.


=== Partitions and QOS access ===
The Arrow server has one public partition and '''seven private partitions'''. The public partition is open to all core users of the NSF grant. The private partitions are restricted to the owners of the resources. Access to each partition is controlled by the Quality of Service (QOS) mechanism: only users registered for a particular partition with matching QOS credentials are allowed to run in it. Simply put, any job from an unauthorized user will be rejected. The table below summarizes the information for the partitions:
{| class="wikitable"
|+
!Partition name
!Type
!QOS
!Cores
!GPU
!GPU type
!Users allowed
!Time limits
!Core limits
!Memory limits
!Jobs limits
!Type of jobs
|-
|partnsf
|public
|qosnsf
|256/512
|16
|A100/80 GB
|all core users of the NSF grant
|yes
|128
|8 GB per CPU-core
|30/user; 50 per group
|Serial, OpenMP, MPI
|-
|partmath
|private
|qosmath
|192
|2
|A40/48 GB
|members of Prof. Kuklov's and Prof. Poje's groups
|yes
|no
|no
|no
|Serial, OpenMP, MPI
|-
|partcfd
|private
|high
|128
|2
|A40/48 GB
|members of prof. Poje group
|no
|no
|no
|no
|Serial, OpenMP, MPI
|-
|partphys
|private
|high
|64
|0
|NA
|members of prof. Kuklov group
|no
|no
|no
|no
|Serial, OpenMP, MPI
|-
|partchem
|private
|qoschem
|192
|10
|A30/24GB + A100/40 GB
|members of prof. Loverde group
|no
|no
|no
|no
|Serial, OpenMP, MPI
|-
|partsym
|private
|qossmhigh
|64
|2
|A100/40 GB
|members of prof. Loverde group
|no
|no
|no
|no
|Serial, OpenMP, MPI
|-
|partasrc
|private
|qosasrchigh
|64
|2
|A30/24 GB
|members of ASRC group
|no
|no
|no
|no
|Serial, OpenMP, MPI
|-
|parteng
|private
|qoseng
|128
|2
|A40/48 GB
|members of prof. Vaishampayan group
|no
|no
|no
|no
|Serial, OpenMP, MPI
|}


===<u>Copy files from/to Arrow and condo servers</u>===


Because Arrow and the CT servers are connected only to '''HPFFS''' and are detached from the main HPCC infrastructure, user files can only be '''<u>tunneled</u>''' to Arrow with the use of the <u>'''ssh tunneling mechanism'''</u>. Users <u>cannot use Globus Online and/or cea</u> to transfer files between the new and old file systems, nor can they use cea and Globus Online to transfer files from their local devices to Arrow's file system. However, ssh tunneling offers an alternative way to securely transfer files to Arrow over the Internet, using the ssh protocol and Chizen as the ssh server. Users are encouraged to contact HPCC for further guidance. Here is an example of tunneling via Chizen:<syntaxhighlight lang="abap">
scp -J <user_id>@chizen.csi.cuny.edu <file_to_transfer> <user_id>@arrow:/scratch/<user_id>/.
</syntaxhighlight>
Users must enter their password twice: once for Chizen and once for Arrow. Files are tunneled through Chizen but are not copied to it; note that any files copied directly to Chizen will be removed.
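The same mechanism works in the opposite direction; for example, to pull results from Arrow's scratch back to a local machine (a sketch using the same host names as above, with illustrative file names):<syntaxhighlight lang="bash">
# run on the local machine: copy results from Arrow back through Chizen
scp -J <user_id>@chizen.csi.cuny.edu <user_id>@arrow:/scratch/<user_id>/<job_name>/results.out .
</syntaxhighlight>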


=== <u>Set up execution environment on Arrow and CT servers</u> ===


====Overview of LMOD environment modules system====
Each application, library, and executable requires a specific environment. In addition, many software and/or system packages exist in different versions. To ensure the proper environment for each and every application, library, or piece of system software, CUNY-HPCC uses an environment module system, which allows a quick and easy way to dynamically change a user's environment through modules. Each module is a file which describes the environment needed for a package. Modulefiles may be shared by all users on a system, and users may have their own collections of modulefiles. Note that on the older servers (Penzias, Appel) HPCC utilizes the TCL-based module management system, which has fewer capabilities than LMOD. On Arrow HPCC uses only the LMOD environment management system. The latter is Lua-based and is able to resolve module hierarchies. It is important to mention that LMOD understands and accepts TCL modules; thus a user's modules existing on Appel or Penzias can be transferred and used directly on Arrow. LMOD also allows shortcuts: for instance, '''''ml''''' can be used as a replacement for the command ''module load''. In addition, users may create collections of modules and store them under a particular name; these collections can be used for "fast loading" of needed modules or to supplement or replace the shared modulefiles.
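For example, the following commands are equivalent, and a named collection can be saved and restored later (the module and collection names are illustrative):<syntaxhighlight lang="bash">
module load gromacs    # classical syntax
ml gromacs             # LMOD shortcut with the same effect

# save the currently loaded set of modules under a name,
# then restore it in a later session with a single command
module save mysim
module restore mysim
</syntaxhighlight>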
====Modules categories====
[[File:Screenshot 2023-06-29 at 2.06.29 PM.png|thumb|949x949px|Output of module category Library]]
<syntaxhighlight lang="abap">
module category Library
</syntaxhighlight>Lmod modules are organized in categories. On Arrow the categories are Compilers, Libraries (Libs), Utilities (Util), Applications, Development Environments (DevEnv), and Communication (Net). To check the content of each category, users may use the command ''module category <name of the category>''. The picture above shows the output. In addition, the version of the product is shown in the modulefile name. Thus the line <syntaxhighlight lang="abap">
Compilers/GNU/13.1.0
</syntaxhighlight>
shown in the EPYC directory denotes the modulefile for the GNU (C/C++/Fortran) compiler version 13.1.0, tuned for the AMD architecture.


====List of available modules====
[[File:Screenshot 2023-06-29 at 12.44.04 PM.png|thumb|934x934px|'''Module avail output: list of available modules''']]
To get the list of available modules, users may use the command <syntaxhighlight lang="abap">
module avail
</syntaxhighlight>
   
   
The output of this command for the Arrow server is shown. The (D) after a module's name denotes that this module is the default one. The (L) denotes that the module is already loaded.
 
====Load module(s) and check for loaded modules ====
The command ''module load <name of the module>'' OR ''module add <name of the module>'' loads the requested module. For example, the commands below load the modules for the cmake utility and the network interface. Users may check which modules are already loaded by typing ''module list''. The figure below shows the output of this command.
[[File:Screenshot 2023-06-29 at 1.00.36 PM.png|thumb|428x428px|'''Output of module list command''']]
<syntaxhighlight lang="abap">
module load Utils/Cmake/3.26.4
module add Net/hpcx/2.15
module list
</syntaxhighlight>
 
Another command equivalent to ''module load'' is ''module add'', as shown in the example above.
 
====Module details====
Information about a module is available via the ''whatis'' command, shown here for the swig library:
[[File:Screenshot 2023-06-29 at 1.45.26 PM.png|thumb|791x791px|Output of module whatis command]]
<syntaxhighlight lang="abap">
module whatis Libs/swig
</syntaxhighlight>
 
 
====Searching for modules====
Modules can be searched with the ''module spider'' command. For instance, a search for Python modules gives the following output:
[[File:Screenshot 2023-06-29 at 1.53.41 PM.png|thumb|792x792px|Output of module spider command]]
<syntaxhighlight lang="abap">
module spider Python
</syntaxhighlight>
 
 
 
 
== Compiling user's developed codes on Arrow==
The Arrow login node is an Intel x86_64 server with two K20m GPUs. Codes can be compiled there and the executables can run on the AMD nodes, but only with basic x86_64/AMD compatibility. For better results HPCC recommends to:
 
*compile codes directly on the nodes where the codes will run;
*use AMD-optimized libraries such as ACML and AMD-tuned compilers (AOCC); users should read the AOCC user manual for optimization options;
*note that the GNU compilers can be used as well, but optimal performance on the nodes is not guaranteed.
 
To compile code directly on a node, HPCC recommends submitting a batch job (the alternative is to use an interactive job - see below). Here is an example of compiling a parallel FORTRAN 77 code on a node belonging to a particular partition.<syntaxhighlight lang="abap">
#!/bin/bash
#SBATCH --nodes=1               # request one node
#SBATCH --job-name=<job_name>
#SBATCH --partition=<partition where to compile>  # one of the partitions where the user is registered
#SBATCH --qos=<qos for group e.g. qosmath>
#SBATCH --ntasks=1
#SBATCH --mem=64G

echo $SLURM_CPUS_PER_TASK

module purge
module load Compilers/AOCC/4.0.0    # load the AOCC compiler
module load Net/OpenMPI/4.1.5_aocc  # load the OpenMPI library
mpif77 -o <executable> -O<level> <source>  # compile; add appropriate optimization flags
</syntaxhighlight>
 
==Batch job submission system (SLURM)==
The section below describes the use of the SLURM batch job submission system '''on Arrow'''; however, many of the examples can also be used on older servers like Penzias or Appel. Note that Penzias has outdated K20m GPUs, so pay attention and specify the GPU type correctly in the GPU constraints. '''SLURM''' is the open-source scheduler and batch system implemented at HPCC; it is used on all servers to submit jobs.
 
===SLURM script structure===
A Slurm script must do three things:
 
#prescribe the resource requirements for the job
#set the environment
#specify the work to be carried out in the form of shell commands
A simple SLURM script is given below<syntaxhighlight lang="abap">
#!/bin/bash
#SBATCH --job-name=test_job      # some short name for a job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1              # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core
#SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<valid user email>
 
cd $SLURM_SUBMIT_DIR            # change to directory from where jobs starts
</syntaxhighlight>The first line of the SLURM script above specifies the Linux/Unix shell to be used. This is followed by a series of #SBATCH directives, which set the resource requirements and other parameters of the job. The script above requests 1 CPU-core and 4 GB of memory for 10 minutes of run time. Note that #SBATCH is a directive to SLURM, while a # not followed by SBATCH is interpreted as a comment line. Users can submit two types of jobs - batch jobs and interactive jobs:  <syntaxhighlight lang="abap">
sbatch <name-of-slurm-script> submits job to the scheduler
salloc                         requests an interactive job on compute node(s) (see below)
</syntaxhighlight>
 
===Job(s) execution time===
The total time to solution is the sum of the time the job waits in a SLURM partition (queue) before being executed and the actual running time on the node(s). For parallel codes the queue time (the time a job waits in the partition) increases with increasing requested resources, such as the number of CPU-cores, while the execution time (the time on the nodes) decreases roughly in inverse proportion to the resources. Each job has its own "sweet spot" which minimizes the time to solution. Users are encouraged to do several test runs to figure out what amount of requested resources works best for their job(s).
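The wait and run times of completed jobs can be inspected with SLURM's accounting tools, which helps in locating this "sweet spot"; a sketch, assuming job accounting is enabled on the cluster:<syntaxhighlight lang="bash">
# show queue wait (Submit vs. Start) and run time (Elapsed) of a finished job
sacct -j <jobid> --format=JobID,Submit,Start,End,Elapsed,NCPUS
</syntaxhighlight>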
 
=== Working with QOS and partitions on Arrow===
 
Every job submission script on Arrow must contain the proper QOS and partition description. For instance, all jobs intended to use node n133 must have the following lines:<syntaxhighlight lang="abap">
#SBATCH --qos=qoschem
#SBATCH --partition partchem
</syntaxhighlight> 
 
In a similar way, all jobs intended to use n130 and n131 must have in their job submission script: <syntaxhighlight lang="abap">
#SBATCH --qos=qosnsf
#SBATCH --partition partnsf
</syntaxhighlight>Note that Penzias does not use QOS; users must therefore adapt scripts they copy from the Penzias server to match the QOS requirements on Arrow.
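To check which partitions and QOS values an account is registered for, something like the following may help (a sketch; it assumes regular users may query the SLURM accounting database):<syntaxhighlight lang="bash">
# list the partition/QOS associations of the current user
sacctmgr show assoc user=$USER format=user,partition,qos
</syntaxhighlight>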
 
 
=== Submitting serial (sequential) jobs ===
These jobs utilize only a single CPU-core. Below is a sample SLURM script for a serial job in the partition partchem. '''<u>Users must adjust the lines for QOS and partition as explained above.</u>'''
 
<syntaxhighlight lang="abap">
#!/bin/bash
#SBATCH --job-name=serial_job    # short name for job
#SBATCH --nodes=1                # node count always 1
#SBATCH --ntasks=1               # total number of tasks, always 1
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=8G        # memory per cpu-core 
#SBATCH --qos=qoschem
#SBATCH --partition partchem
 
cd $SLURM_SUBMIT_DIR
 
srun ./myjob
</syntaxhighlight>In the above script the requested resources are:
 
 
*--nodes=1 - specify one node
*--ntasks=1 - claim one task (by default 1 per CPU-core)
 
The job can be submitted for execution with the command:<syntaxhighlight lang="bash">
sbatch <name of the SLURM script>
</syntaxhighlight>For instance, if the above script is saved in a file named serial_j.sh, the command will be:<syntaxhighlight lang="bash">
sbatch serial_j.sh
</syntaxhighlight>
 
===Submitting multithreaded jobs===
Some software, like MATLAB or GROMACS, is able to use multiple CPU-cores via shared-memory parallel programming models such as OpenMP, pthreads, or Intel Threading Building Blocks (TBB). OpenMP programs, for instance, run as multiple "threads" on a single node, with each thread using one CPU-core. The example below shows how to run a thread-parallel job on Arrow. '''<u>Users must add lines for QOS and partition as explained above.</u>'''


<syntaxhighlight lang="abap">
#!/bin/bash
#SBATCH --job-name=multithread   # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core

cd $SLURM_SUBMIT_DIR             # change to the directory from which the job was submitted
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./myjob                     # the multithreaded executable (name is illustrative)
</syntaxhighlight>
In this script the '''cpus-per-task''' setting is mandatory, so that SLURM can run the multithreaded task using four CPU-cores. The correct choice of '''cpus-per-task''' is very important: typically, increasing this parameter decreases the execution time but increases the waiting time in the partition (queue). In addition, these types of jobs rarely scale well beyond 16 cores. The optimal value of cpus-per-task must be determined empirically by conducting several test runs. It is important to remember that the code must '''1. be a multithreaded code and 2. be compiled with a multithreading option''', for instance the -fopenmp flag of the GNU compilers.
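As an illustration, a minimal OpenMP build with the GNU compiler could look like this (the file names are placeholders):<syntaxhighlight lang="bash">
# compile a multithreaded (OpenMP) C code; -fopenmp enables OpenMP support
gcc -fopenmp -O2 -o myjob myjob.c
</syntaxhighlight>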


===Submitting distributed parallel job===


These jobs use the Message Passing Interface (MPI) to realize distributed-memory parallelism across several nodes. The script below demonstrates how to run an MPI parallel job on Arrow. '''<u>Users must add lines for QOS and partition as explained above.</u>'''  <syntaxhighlight lang="abap">
#!/bin/bash
#SBATCH --job-name=MPI_job      # short name for job
#SBATCH --nodes=2                # node count
#SBATCH --ntasks-per-node=32    # number of tasks per node
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=16G        # memory per cpu-core


:'''<font face="courier">-J <font color="red"><job_name></font></font>''' The user must assign a name to each job they run.  Names can be up to 15 alphanumeric characters in length.
cd $SLURM_SUBMIT_DIR
:'''<font face="courier">--ntasks=<font color="red"><cpus></font></font>'''  The number of cpus (or cores) that the user wants to use.


srun ./mycode <args>             # mycode is in the local directory; for other locations provide the full path
</syntaxhighlight>The above script can easily be modified for hybrid (OpenMP+MPI) jobs by changing the cpus-per-task parameter. The optimal values of '''--nodes''' and '''--ntasks''' for a given code must be determined empirically with several test runs. In order to reduce communication, users should try to run large jobs on one whole node rather than as two chunks from two (or more) nodes. In addition, for '''large memory''' jobs users must use '''--mem''' rather than '''--mem-per-cpu'''. Below is a SLURM script example for the submission of a '''large memory''' MPI job with 128 cores on a single node. It is obviously better for this type of job to run on a single node rather than on two times 64 cores from two nodes. To achieve that, users may use the following SLURM prototype script:<syntaxhighlight lang="abap">
#!/bin/bash
#SBATCH --job-name MPI_J_2
#SBATCH --nodes 1
#SBATCH --ntasks 128          # total number of tasks
#SBATCH --mem 40G              # total memory per job
#SBATCH --qos=qoschem
#SBATCH --partition partchem


:'''<font face="courier">--mem <font color="red"><mem> </font></font>'''  This parameter is required. It specifies how much memory is needed per job.
cd $SLURM_SUBMIT_DIR


:'''<font face="courier">--gres <font color="red"><gpu:2></font></font>'''  The number of graphics processing units that the user wants to use on a node (This parameter is only available on PENZIAS).
srun ...
</syntaxhighlight>In the above script the requested resources are 128 cores on one node. Note that unused memory on this node will not be accessible to other jobs. In contrast to the previous script, the memory here is specified as the total memory for the job via the parameter '''--mem'''.




===Submitting Hybrid (OMP+MPI) job on Arrow===
<syntaxhighlight lang="abap">
#!/bin/bash
#SBATCH --job-name=OMP_MPI      # name of the job
#SBATCH --ntasks=24              # total number of tasks aka total # of MPI processes
#SBATCH --nodes=2                # total number of nodes
#SBATCH --tasks-per-node=12      # number of tasks per node
#SBATCH --cpus-per-task=2        # number of OMP threads per MPI process
#SBATCH --mem-per-cpu=16G        # memory per cpu-core 
#SBATCH --partition=partnsf
#SBATCH --qos=qosnsf


cd $SLURM_SUBMIT_DIR


export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK


srun ...


</syntaxhighlight>The above script is a prototype and shows how to allocate 24 MPI tasks with 12 tasks per node, where each MPI task spawns 2 OpenMP threads. For an actual working script, users must adjust the QOS, partition, and memory requirements to their own case.




===GPU jobs ===
On Arrow each node has 8 A100 GPUs with 80 GB of memory on board (see the partition table above). To use GPUs in a job, users must add the --gres option to the #SBATCH lines for CPU resources. The example below demonstrates a GPU-enabled SLURM script. '''<u>Users must add lines for QOS and partition as explained above.</u>'''


<syntaxhighlight lang="abap">
#!/bin/bash
#SBATCH --job-name=GPU_J        # short name for job
#SBATCH --nodes=1                # number of nodes
#SBATCH --ntasks=1              # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=16G        # memory per cpu-core
#SBATCH --gres=gpu:1            # number of gpus per node max 8 for Arrow


cd $SLURM_SUBMIT_DIR


'''<font face="courier">mpirun -np <font color="red"><total tasks or total cpus></font></font>'''.  This script line is only to be used for MPI jobs and defines the total number of cores required for the parallel MPI job.
srun <code> <args>               # <code> is the GPU-enabled executable
</syntaxhighlight>


=== GPU constraints ===


On Arrow the nodes in partnsf, partchem, and partmath have different GPU types (A30, A40, and A100). The type of GPU can be specified in SLURM by using a constraint on the GPU SKU, GPU generation, or GPU compute capability. Here are examples:<syntaxhighlight lang="abap">
#SBATCH --gres=gpu:1 --constraint='gpu_sku:V100'     # allocates one V100 GPU


#SBATCH --gres=gpu:1 --constraint='gpu_gen:Ampere'    # allocates one Ampere GPU (A40 or A100)


#SBATCH --gres=gpu:1 --constraint='gpu_cc:12.0'       # allocates a GPU by compute capability (generation)


#SBATCH --gres=gpu:1 --constraint='gpu_mem:32GB'      # allocates GPU with 32GB memory on board


#SBATCH --gres=gpu:1 --constraint='nvlink:2.0'        # allocates a GPU linked with NVLink
</syntaxhighlight>


===Parametric jobs via Job Array===
Job arrays are used for running the same job multiple times with only slightly different parameters. The script below demonstrates how to run such a job on Arrow; a usage note follows the script. '''<u>Users must add lines for QOS and partition as explained above.</u>''' NB! The array indexes must be less than the maximum number of jobs allowed in the array.


<syntaxhighlight lang="abap">
#!/bin/bash
   
#SBATCH --job-name=Array_J        # short name for job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1                # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=16G        # memory per cpu-core  
#SBATCH --output=slurm-%A.%a.out # stdout file (standard out)
#SBATCH --error=slurm-%A.%a.err  # stderr file (standard error)
#SBATCH --array=0-3              # job array indexes 0, 1, 2, 3
   
   
cd $SLURM_SUBMIT_DIR
<font color="red"></path/to/your_binary></font> >  <font color="red"><my_output></font> 2>&1


<executable>                     # replace with the actual program to run
</syntaxhighlight>
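Inside the script, the index of the current array element is exposed in the SLURM_ARRAY_TASK_ID environment variable and is typically used to pick the input for each run; a sketch with illustrative file names:<syntaxhighlight lang="bash">
# each array element (0..3) processes its own input and writes its own output
./my_program input_${SLURM_ARRAY_TASK_ID}.dat > output_${SLURM_ARRAY_TASK_ID}.log
</syntaxhighlight>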


=== Interactive jobs===




These jobs are useful in the development or test phase and are rarely required in a production workflow. It is not recommended to use interactive jobs as the main type of job, since they consume more resources than regular batch jobs. To set up an interactive job, the user first has to (1) start an interactive shell and (2) "reserve" the resources. The example below shows this. <syntaxhighlight lang="abap">
srun -p interactive --pty /bin/bash    # starts interactive session
</syntaxhighlight>Once the interactive session is running, the user must "reserve" the resources needed for the actual job:<syntaxhighlight lang="abap">
salloc --ntasks=8 --ntasks-per-node=1 --cpus-per-task=2          # allocates resources
salloc: CPU resource required, checking settings/requirements...
salloc: Granted job allocation ....
salloc: Waiting for resource configuration
salloc: Nodes ...                          # system reports back where the resources were allocated
</syntaxhighlight>
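Once the allocation has been granted, commands run on the allocated node(s) via srun, and leaving the shell releases the resources; a short sketch:<syntaxhighlight lang="bash">
srun hostname    # executes on the allocated compute node(s) and prints their names
exit             # leave the shell to release the allocation
</syntaxhighlight>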

Latest revision as of 04:42, 10 October 2023

Overview

The HPCC resources are grouped in 3 tiers: free tier (FT), advanced tier (AT), condo tier (CT) and separate server Arrow. In all cases and despite of used server all jobs at HPCC must:

  1. Start from user's directory on scratch file system - /scratch/<userid> . Jobs cannot be started from users home directories - /global/u/<userid>
  2. Use SLURM job submission system (job scheduler) . All jobs submission scripts written for other job scheduler(s) (i.e. PBS pro) must be converted to SLURM syntax. All users' data must be kept in user home directory /global/u/<userid> . Data on /scratch can be purged at any time nor are protected by tape backup.

All users' data must be kept in user home directory /global/u/<userid> . Data on /scratch can be purged at any time and are NOT protected by tape backup. Arrow and CT servers mount independent file system HPFS and thus data cannot be shared directly between servers in AT and FT and Arrow of CT servers. Users must explicitly move files.

Advanced and Free Tier

Servers in FT and AT are Blue Moon, Penzias, CRYO and Appel. They are attached to separate /scratch and /global/u (previously known as DSMS). via 40Gbps Infiniband Interconnect. The former is a separate small disk based parallel file system NFS mounted on all nodes (compute and login) and the latter is large, slower file system (holding all users' home directories /global/u/<userid>) mounted only on servers' login nodes via 40Gbps Infiniband Interconnect. Both file systems have moderate bandwidth of several hundred MB per second. Every home directory for free and advanced tier servers has a quote of 50GB. The latter can be expanded by submitting argumented request to HPCC. bal/u file s file system is backup-ed with retention time of backup 30 days. Because the scratch filesystem is mounted on all compute nodes all jobs on any server must start /scratch/<userid> directory. Jobs cannot be started from user's home - /global/u/<userid> . Users must preserve valuable files (data, executables, parameters etc) in /global/u/<userid>. Both file systems have moderate bandwidth of several hundred MB per second. Every home directory has a quote of 50GB. The latter can be expanded by submitting request to HPCC stating the reasons for required expansion. Note, that global/u file s file system is backup-ed with retention time of backup 30 days.

Condo Tier and Arrow

At it was stated above all jobs must start from /scratch/<userid> directory and the valuable data must be kept in /global/u/<userid>. For Arrow and condo servers the /scratch and /global/u reside on the same HPFS file system over 200 Gbps Infiniband interconnect. The system software takes cares of optimal placement of the files. Note, that /global/u is not backed on tape at that time due to lack of funds.

Copy/move files from/to server

This section is an overview. For details please refer to a section "File Transfers".

From/to server in free and advanced tier

  • by using cea data transfer node
  • by tunneling data (without copy) via gateway (chizen)
  • use Globus online

Coping data from/to user computer to/from chizen is discouraged. Chizen has small memory and thus cannot handle large fails.

From/to Arrow and any of condo servers

Only tunneling (not copy) via gateway is supported. Note that Globus and cea are not accessible for CT servers and Arrow.

Running jobs on server form advanced and free tier

Partitions

The main partition which distributes jobs on other partitions is production. Users must use this partition for all job submission. The partition has time limit of 120 hours (currently). Note that time limit as well as number of jobs per group and per user are reviewed periodically and may change in order to maximize utilization f the resources. In addition the MHN supports partdev partition which has limit of 2 hours and is dedicated to development of the codes.

Copy/move files from/to FT and AT servers

Before submitting any job to FT/AT servers the users must prepare/move/copy data into their /scratch/<userid> directory. Users can transfer data to/from /scratch/<userid> by using the file transfer node cea or by using GlobusOnline. HPCC recommends a transfer to user's home directory first ( /global/u/<userid> ) before copy the needed files from user's home directory to /scratch/<userid>. Note that both cea and Globus online allows the transfer of user's files directly to /global/u/<userid>. The input data, job scripts and parameter(s) files can be locally generated with use of Unix/Linux text editor such as Vi/Vim, Edit, Pico or Nano. MS Windows Word is a word processing system and cannot be used to create job submission scripts.

Set up application environment

FT and AT servers use "Modules" to set up environment. “Modules” makes it easier for users to run a standard or customized application and/or system environment. On AT and FT the HPCC uses classical TCL UNIX modules and LMOD - an advanced module system. The latter addresses the MODULEPATH hierarchical problem common in UNIX based "modules" implementation. Application packages can be loaded and unloaded cleanly through the module system using modulefiles. This includes easily adding or removing directories to the PATH environment variable. Modulefiles for Library packages provide environment variables that specify where the library and header files can be found. All the popular shells are supported: bash, ksh, csh, tcsh, zsh. LMOD is also available for perl and python. It is important to mention that LMOD can interpret TCL module files. The basic TCL module commands are listed below. Note that almost all applications have default version and several other versions. The default version is marked with (D). For example:

python/2.7.13_anaconda (D)

denotes the default version of Python, which can be loaded without explicit specification of the software version:

module load python

Any other non default version of the same software can be loaded with specification of the full name of the module file.

module load python/3.7.6_anaconda

will load the non-default 3.7.6 version of Python. The module load command can be used to load several application environments at once:

module load package1 package2 ...

For documentation on “Modules”:

man module

For help enter:

module help

To see a list of currently loaded “Modules” run:

module list

To see a complete list of all modules available on the system run:

module avail

To show the content of a module enter:

module show <module_name> 

To change from one application to another (for example, between the default versions of the GNU and Intel compilers):

module swap gcc intel

To go back to an initial set of modules:

module reset

Using LMOD commands

To get a list of all modules available

module spider

To get information about a specific module

module spider python

Modules for the advanced user

A “Modules” example for advanced users who need to change their environment.

The HPC Center supports a number of different compilers, libraries, and utilities. In addition, at any given time different versions of the software may be installed. “Modules” is employed to define a default environment, which generally satisfies the needs of most users and eliminates the need for the user to create the environment. From time to time, a user may have a specific requirement that differs from the default environment.

In this example, the user wishes to use a version of the NETCDF library on the HPC Center's Cray XE6 (SALK) that is compiled with the Portland Group, Inc. (PGI) compiler instead of the installed default version, which was compiled with the Cray compiler. The approach is:

• Run module list to see what modules are loaded by default.
• Determine which modules should be unloaded.
• Determine which modules should be loaded.
• Load the needed modules, i.e., module load.

The first step, see what modules are loaded, is shown below.

user@SALK:~> module list
Currently Loaded Modulefiles:

  1) modules/3.2.6.6
  2) nodestat/2.2-1.0400.31264.2.5.gem
  3) sdb/1.0-1.0400.32124.7.19.gem
  4) MySQL/5.0.64-1.0000.5053.22.1
  5) lustre-cray_gem_s/1.8.6_2.6.32.45_0.3.2_1.0400.6453.5.1-1.0400.32127.1.90
  6) udreg/2.3.1-1.0400.4264.3.1.gem
  7) ugni/2.3-1.0400.4374.4.88.gem
  8) gni-headers/2.1-1.0400.4351.3.1.gem
  9) dmapp/3.2.1-1.0400.4255.2.159.gem
 10) xpmem/0.1-2.0400.31280.3.1.gem
 11) hss-llm/6.0.0
 12) Base-opts/1.0.2-1.0400.31284.2.2.gem
 13) xtpe-network-gemini
 14) cce/8.0.7
 15) acml/5.1.0
 16) xt-libsci/11.1.00
 17) pmi/3.0.0-1.0000.8661.28.2807.gem
 18) rca/1.0.0-2.0400.31553.3.58.gem
 19) xt-asyncpe/5.13
 20) atp/1.5.1
 21) PrgEnv-cray/4.0.46
 22) xtpe-mc8
 23) cray-mpich2/5.5.3
 24) SLURM/11.3.0.121723

From the list, we see that the Cray Programming Environment (PrgEnv-cray/4.0.46) and the Cray compiler environment (cce/8.0.7) are loaded by default. To unload these Cray modules and load the PGI equivalents, we need to know the names of the PGI modules. The module avail command shows this.

 user@SALK:~> module avail
 ... (long list of available modules omitted)

We see that there are several versions of the PGI compilers and two versions of the PGI Programming Environment installed. For this example, we are interested in loading PGI's 12.10 release (not the default, which is pgi/12.6) and the most current release of the PGI programming environment (PrgEnv-pgi/4.0.46), which is the default.

The following module commands will unload the Cray defaults, load the PGI modules mentioned, and load version 4.2.0 of NETCDF compiled with the PGI compilers.

user@SALK:~> module unload PrgEnv-cray
user@SALK:~> module load PrgEnv-pgi
user@SALK:~> module unload pgi
user@SALK:~> module load pgi/12.10
user@SALK:~> 
user@SALK:~> module load netcdf/4.2.0
user@SALK:~>
user@SALK:~> cc -V

/opt/cray/xt-asyncpe/5.13/bin/cc: INFO: Compiling with CRAYPE_COMPILE_TARGET=native.

pgcc 12.10-0 64-bit target on x86-64 Linux 
Copyright 1989-2000, The Portland Group, Inc.  All Rights Reserved.
Copyright 2000-2012, STMicroelectronics, Inc.  All Rights Reserved.

A few additional comments:

• The first three commands do not include version numbers and will therefore load or unload the current default versions.
• In the third line, we unload the default version of the PGI compiler (version 12.6), which is loaded with the rest of the PGI Programming Environment in the second line. We then load the non-default and more recent release from PGI, version 12.10 in the fourth line.
• Later, we load NETCDF version 4.2.0 which, because we have already loaded the PGI Programming Environment, will load the version of NETCDF 4.2.0 compiled with the PGI compilers.
• Finally, we check which compiler the Cray "cc" compiler wrapper actually invokes after this sequence of module commands by entering cc -V.

Running jobs on Arrow and condo servers

The Arrow server consists of 2 nodes purchased with NSF AOC grant funding. Condo servers are purchased directly by faculty with their own research funds. The Arrow nodes and condo nodes share the same login node and are attached via a 200 Gbps Infiniband interconnect to the same hybrid (NVMe + hard disk) fast file system called HPFS, which can provide speeds of 25-30 GB/s on write and 45-50 GB/s on read. The /scratch and /global/u directories are part of the same HPFS file system, but scratch is optimized for predominant access to the fast NVMe tier. The underlying system software manages the placement of files to ensure the best possible performance for different file types. All jobs must start from the /scratch/<userid> directory. Jobs cannot be started from a user's home directory, /global/u/<userid>. It is important to mention that data on /global/u on the HPFS file system are not backed up, since this equipment is not integrated into the HPCC infrastructure. Every user home directory has a quota of 100 GB, which can be expanded by submitting a justified request to HPCC.

Partitions and QOS access

The Arrow server has one public partition and seven private partitions. The public partition is open to all core users of the NSF grant (). The private partitions are restricted to the owners of the resources. Access to each partition is controlled by the Quality of Service (QOS) function: only users registered for a particular partition with matching QOS credentials will be allowed to run. Simply put, any job from an unauthorized user will be rejected. The table below summarizes the partitions:

Partition name | Type    | QOS         | Cores   | GPUs | GPU type               | Users allowed                                     | Time limits | Core limits | Memory limits   | Job limits        | Type of jobs
partnsf        | public  | qosnsf      | 256/512 | 16   | A100/80 GB             | all core users of NSF grant ()                    | yes         | 128         | 8G per CPU-core | 30/user; 50/group | Serial, OpenMP, MPI
partmath       | private | qosmath     | 192     | 2    | A40/48 GB              | members of Prof. Kuklov's and Prof. Poje's groups | yes         | no          | no              | no                | Serial, OpenMP, MPI
partcfd        | private | high        | 128     | 2    | A40/48 GB              | members of Prof. Poje's group                     | no          | no          | no              | no                | Serial, OpenMP, MPI
partphys       | private | high        | 64      | 0    | NA                     | members of Prof. Kuklov's group                   | no          | no          | no              | no                | Serial, OpenMP, MPI
partchem       | private | qoschem     | 192     | 10   | A30/24 GB + A100/40 GB | members of Prof. Loverde's group                  | no          | no          | no              | no                | Serial, OpenMP, MPI
partsym        | private | qossmhigh   | 64      | 2    | A100/40 GB             | members of Prof. Loverde's group                  | no          | no          | no              | no                | Serial, OpenMP, MPI
partasrc       | private | qosasrchigh | 64      | 2    | A30/24 GB              | members of the ASRC group                         | no          | no          | no              | no                | Serial, OpenMP, MPI
parteng        | private | qoseng      | 128     | 2    | A40/48 GB              | members of Prof. Vaishampayan's group             | no          | no          | no              | no                | Serial, OpenMP, MPI

Copy files from/to Arrow and condo servers

Because Arrow and the CT servers are connected only to HPFS and are detached from the main HPC infrastructure, user files can only be moved to Arrow with the ssh tunneling mechanism. Users cannot use Globus Online and/or cea to transfer files between the new and old file systems, nor can they use cea and Globus Online to transfer files from their local devices to Arrow's file system. However, ssh tunneling offers a way to securely transfer files to Arrow over the Internet using the ssh protocol with Chizen as the ssh jump server. Users are encouraged to contact HPCC for further guidance. Here is an example of tunneling via Chizen:

scp -J <user_id>@chizen.csi.cuny.edu <file_to_transfer> <user_id>@arrow:/scratch/<user_id>/.

Users must enter their password twice: once for Chizen and once for Arrow.
Files are tunneled through Chizen but not copied to it. Note that any files copied directly to Chizen will be removed.
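Transfers in the opposite direction work the same way; for example, to retrieve a result file from Arrow to the local machine (file name illustrative):

scp -J <user_id>@chizen.csi.cuny.edu <user_id>@arrow:/scratch/<user_id>/results.tar.gz .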

Set up execution environment on Arrow and CT servers

Overview of LMOD environment modules system

Each application, library and executable requires a specific environment. In addition, many software and/or system packages exist in different versions. To ensure the proper environment for each application, library or piece of system software, CUNY-HPCC applies an environment module system, which allows users to quickly and easily change their environment dynamically through modules. Each module is a file which describes the environment needed for a package. Modulefiles may be shared by all users on a system, and users may have their own collections of module files. Note that on the older servers (Penzias, Appel) HPCC utilizes a TCL-based module management system, which has fewer capabilities than LMOD. On Arrow HPCC uses only the LMOD environment management system. The latter is Lua-based and is able to resolve hierarchies. It is important to mention that LMOD understands and accepts TCL modules; thus a user's modules existing on Appel or Penzias can be transferred and used directly on Arrow. LMOD also allows shortcuts: for instance, ml can be used as a replacement for the command module load. In addition, users may create collections of modules and store them under a particular name; these collections can be used for "fast load" of needed modules, or to supplement or replace the shared modulefiles.
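For example, a frequently used set of modules can be stored as a named collection and reloaded in one step (module and collection names illustrative):

module load Compilers/GNU/13.1.0 Libs/swig   # load the modules you need
module save mystack                          # save the current set as a collection named "mystack"
module restore mystack                       # later: reload the whole collection at once
ml Utils/Cmake/3.26.4                        # ml is the LMOD shortcut for "module load"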

Modules categories

Output of module category Library
module category Library

LMOD modules are organized in categories. On Arrow the categories are Compilers, Libraries (Libs), Utilities (Util), Applications, Development Environments (DevEnv) and Communication (Net). To check the content of a category, users may use the command module category <name of the category>; the picture above shows the output for the Library category. The version of a product is shown in the module file name. Thus the line

Compilers/GNU/13.1.0

shown in the EPYC directory denotes the module file for the GNU (C/C++/Fortran) compiler version 13.1.0, tuned for the AMD architecture.

List of available modules

Module avail output: list of available modules

To get list of available modules the users may use the command

module avail

The output of this command for the Arrow server is shown above. The (D) after a module's name denotes that this module is the default. The (L) denotes that the module is already loaded.

Load module(s) and check for loaded modules

The command module load <name of the module> loads a requested module. For example, the commands below load modules for the cmake utility and the network interface. Users may check which modules are already loaded by typing module list; the figure below shows the output of this command.

Output of module list command
module load Utils/Cmake/3.26.4
module add Net/hpcx/2.15
module list

Another command equivalent to module load is module add, as shown in the example above.

Module details

Information about a module is available via the whatis command, shown here for the swig library:

Output of module whatis command
module whatis Libs/swig


Searching for modules

Modules can be searched with the module spider command. For instance, a search for Python modules gives the following output:

Output of module spider command
module spider Python


Each modulefile holds the information needed to configure the shell environment for a specific software application, or to provide access to specific software tools and libraries.


Compiling user-developed codes on Arrow

The Arrow login node is an Intel x86_64 server with two K20m GPUs. Codes can be compiled there and the executables can run on the AMD nodes, but only with basic x86_64/AMD compatibility. For better results HPCC recommends to:

  • compile codes directly on the nodes where they will run;
  • use AMD-optimized libraries such as ACML and AMD-tuned compilers (AOCC); read the AOCC user manual for optimization options;
  • note that the GNU compilers can be used as well, but optimal performance on the nodes is not guaranteed.

To compile code directly on a node, HPCC recommends submitting a batch job (the alternative is an interactive job; see below). Here is an example of compiling a parallel FORTRAN 77 code on a node belonging to a particular partition.

#!/bin/bash
#SBATCH --nodes=1               # request for one node 
#SBATCH --job-name=<job_name>
#SBATCH --partition=<partition where to compile>  #one of the partitions when the user is registered
#SBATCH --qos=<qos for group e.g. qosmath>
#SBATCH --ntasks=1 
#SBATCH --mem=64G 

echo $SLURM_CPUS_PER_TASK

 module purge
 module load Compilers/AOCC/4.0.0     # load the AOCC compiler
 module load Net/OpenMPI/4.1.5_aocc   # load the OpenMPI library
 mpif77 -o <executable> <source>      # invokes compilation; add appropriate optimization flags (e.g. -O2)

Batch job submission system (SLURM)

The section below describes the use of the SLURM batch job submission system on Arrow; however, many of the examples can also be used on older servers like Penzias and Appel. Note that Penzias has outdated K20m GPUs, so pay attention and specify the GPU type correctly in the GPU constraints. SLURM is the open source scheduler and batch system implemented at HPCC and is used on all servers to submit jobs.

SLURM script structure

A Slurm script must do three things:

  1. prescribe the resource requirements for the job
  2. set the environment
  3. specify the work to be carried out in the form of shell commands

A simple SLURM script is given below:

#!/bin/bash
#SBATCH --job-name=test_job      # some short name for a job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core
#SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<valid user email>

cd $SLURM_SUBMIT_DIR             # change to the directory from which the job was submitted

The first line of the Slurm script above specifies the Linux/Unix shell to be used. This is followed by a series of #SBATCH directives, which set the resource requirements and other parameters of the job. The script above requests 1 CPU-core and 4 GB of memory for 10 minutes of run time. Note that #SBATCH is a command to SLURM, while a # not followed by SBATCH is interpreted as a comment line. Users can submit 2 types of jobs - batch jobs and interactive jobs:

sbatch <name-of-slurm-script>	submits job to the scheduler
salloc	                        requests an interactive job on compute node(s) (see below)

Job(s) execution time

The total turnaround time of a job is the sum of the time the job waits in the SLURM partition (queue) before being executed and the actual running time on the node(s). For parallel codes, the queue time (the time a job waits in the partition) increases as more resources, such as CPU-cores, are requested, while the execution time (the time on the nodes) decreases roughly in inverse proportion to the resources. Each job therefore has its own "sweet spot" which minimizes the time to solution. Users are encouraged to perform several test runs to figure out what amount of requested resources works best for their job(s).
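One convenient way to compare such test runs is SLURM's accounting tool sacct; for example (job ID illustrative):

sacct -j <jobid> -o JobID,Elapsed,TotalCPU,MaxRSS   # wall time, CPU time and peak memory of a completed job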

Working with QOS and partitions on Arrow

Every job submission script on Arrow must hold proper description of QOS and partition. For instance all jobs intended to use node n133 must have the following lines:

#SBATCH --qos=qoschem
#SBATCH --partition partchem

In similar way all jobs intended to use n130 and n131 must have in their job submission script:

#SBATCH --qos=qosnsf
#SBATCH --partition partnsf

Note that Penzias does not use QOS. Thus users must adapt scripts they copy from the Penzias server to match the QOS requirements on Arrow.


Submitting serial (sequential) jobs

These jobs utilize only a single CPU-core. Below is a sample Slurm script for a serial job in the partition partchem; note the QOS and partition lines, added as explained above:

#!/bin/bash
#SBATCH --job-name=serial_job    # short name for job
#SBATCH --nodes=1                # node count always 1
#SBATCH --ntasks=1               # total number of tasks, always 1
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=8G         # memory per cpu-core  
#SBATCH --qos=qoschem
#SBATCH --partition partchem

cd $SLURM_SUBMIT_DIR

srun ./myjob

In the above script the requested resources are:


  • --nodes=1 - specify one node
  • --ntasks=1 - claim one task (by default 1 per CPU-core)

Job can be submitted for execution with command:

sbatch <name of the SLURM script>

For instance, if the above script is saved in a file named serial_j.sh, the command will be:

sbatch serial_j.sh
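After submission, a job can be monitored and, if necessary, cancelled with the standard SLURM commands:

squeue -u <userid>               # show your pending and running jobs
scancel <jobid>                  # cancel a job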

Submitting multithreaded jobs

Some software, like MATLAB or GROMACS, is able to use multiple CPU-cores via shared-memory parallel programming models like OpenMP, pthreads or Intel Threading Building Blocks (TBB). OpenMP programs, for instance, run as multiple "threads" on a single node, with each thread using one CPU-core. The example below shows how to run a thread-parallel job on Arrow. Users must add lines for QOS and partition as explained above.

#!/bin/bash 

#SBATCH --job-name=multithread   # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core 

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./<executable>              # run the multithreaded program

In this script the cpus-per-task option is mandatory so SLURM can run the multithreaded task using four CPU-cores. The correct choice of cpus-per-task is very important: typically, increasing this parameter decreases the execution time but increases the waiting time in the partition (queue). In addition, these types of jobs rarely scale well beyond 16 cores. The optimal value of cpus-per-task must be determined empirically by conducting several test runs. It is important to remember that the code must (1) be a multithreaded code and (2) be compiled with a multithreading option, for instance the -fopenmp flag of the GNU compilers.
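For instance, an OpenMP code could be compiled with the GNU compiler as follows (file names illustrative):

gcc -fopenmp -O2 -o omp_prog omp_prog.c   # -fopenmp enables the OpenMP multithreading support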

Submitting distributed parallel jobs

These jobs use the Message Passing Interface (MPI) to realize distributed-memory parallelism across several nodes. The script below demonstrates how to run an MPI-parallel job on Arrow. Users must add lines for QOS and partition as explained above.

#!/bin/bash
#SBATCH --job-name=MPI_job       # short name for job
#SBATCH --nodes=2                # node count
#SBATCH --ntasks-per-node=32     # number of tasks per node
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=16G        # memory per cpu-core 

cd $SLURM_SUBMIT_DIR

srun ./mycode <args>             # mycode is in the local directory; for other locations provide the full path

The above script can easily be modified for hybrid (OpenMP+MPI) runs by changing the cpus-per-task parameter. The optimal values of --nodes and --ntasks for a given code must be determined empirically with several test runs. In order to decrease communication overhead, users should try to run large jobs on a whole node rather than on chunks from 2 (or more) nodes. In addition, for large-memory jobs users must use --mem rather than --mem-per-cpu. Below is a SLURM script example for the submission of a large-memory MPI job with 128 cores on a single node. It is obviously better for this type of job to run on a single node rather than on two chunks of 64 cores from 2 nodes. To achieve that, users may use the following SLURM prototype script:

#!/bin/bash
#SBATCH --job-name MPI_J_2
#SBATCH --nodes 1
#SBATCH --ntasks 128           # total number of tasks
#SBATCH --mem 40G              # total memory per job
#SBATCH --qos=qoschem
#SBATCH --partition partchem

cd $SLURM_SUBMIT_DIR

srun ...

In the above script the requested resources are 128 cores on one node. Note that unused memory on this node will not be accessible to other jobs. In contrast to the previous script, the memory is specified as total memory for the job via the parameter --mem.


Submitting hybrid (OpenMP+MPI) jobs on Arrow

#!/bin/bash
#SBATCH --job-name=OMP_MPI       # name of the job
#SBATCH --ntasks=24              # total number of tasks aka total # of MPI processes
#SBATCH --nodes=2                # total number of nodes
#SBATCH --tasks-per-node=12      # number of tasks per node
#SBATCH --cpus-per-task=2        # number of OMP threads per MPI process 
#SBATCH --mem-per-cpu=16G        # memory per cpu-core  
#SBATCH --partition=partnsf
#SBATCH --qos=qosnsf

cd $SLURM_SUBMIT_DIR

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun ...

The above script is a prototype showing how to allocate 24 MPI processes, 12 per node on 2 nodes; each MPI process spawns 2 OpenMP threads. For an actual working script, users must adjust the QOS, partition and memory requirements for their case.


GPU jobs

On Arrow each node of the public partition has 8 A100 GPUs with 80 GB of memory on board (the private partitions carry different GPU counts and types; see the table above). To use GPUs in a job, users must add the --gres option to the #SBATCH resource lines. The example below demonstrates a GPU-enabled SLURM script. Users must add lines for QOS and partition as explained above.

#!/bin/bash
#SBATCH --job-name=GPU_J         # short name for job
#SBATCH --nodes=1                # number of nodes
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=16G        # memory per cpu-core 
#SBATCH --gres=gpu:1             # number of gpus per node max 8 for Arrow

cd $SLURM_SUBMIT_DIR

srun ... <code> <args>

GPU constraints

On Arrow the nodes in partnsf, partchem and partmath have different GPU types (A30, A40 and A100). The type of GPU can be specified in SLURM by using a constraint on the GPU SKU, GPU generation, or GPU compute capability. Here are examples:

#SBATCH --gres=gpu:1 --constraint='gpu_sku:V100'      # allocates one V100 GPU

#SBATCH --gres=gpu:1 --constraint='gpu_gen:Ampere'    # allocates one Ampere GPU (A40 or A100)

#SBATCH --gres=gpu:1 --constraint='gpu_cc:12.0'       # allocates a GPU by compute capability (generation)

#SBATCH --gres=gpu:1 --constraint='gpu_mem:32GB'      # allocates a GPU with 32GB of on-board memory

#SBATCH --gres=gpu:1 --constraint='nvlink:2.0'        # allocates a GPU linked with NVLink

Parametric jobs via Job Array

Job arrays are used for running the same job multiple times with only slightly different parameters. The script below demonstrates how to run such a job on Arrow. Users must add lines for QOS and partition as explained above. NB! The array indexes must be less than the maximum number of jobs allowed in the array.

#!/bin/bash
#SBATCH --job-name=Array_J        # short name for job
#SBATCH --nodes=1                 # node count
#SBATCH --ntasks=1                # total number of tasks across all nodes
#SBATCH --cpus-per-task=1         # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=16G         # memory per cpu-core  
#SBATCH --output=slurm-%A.%a.out  # stdout file (standard output)
#SBATCH --error=slurm-%A.%a.err   # stderr file (standard error)
#SBATCH --array=0-3               # job array indexes 0, 1, 2, 3 
 
cd $SLURM_SUBMIT_DIR

<executable>
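Inside the script, each array element typically selects its own input via the SLURM_ARRAY_TASK_ID variable; a sketch (file names illustrative):

srun ./myprog input_${SLURM_ARRAY_TASK_ID}.dat   # task 0 reads input_0.dat, task 1 reads input_1.dat, ...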

Interactive jobs

These jobs are useful in the development or test phase and are rarely required in a production workflow. It is not recommended to use interactive jobs as the main type of job, since they consume more resources than regular batch jobs. To set up an interactive job, users first have to (1) start an interactive shell and (2) "reserve" the resources. The example below describes that.

srun -p interactive --pty /bin/bash    # starts interactive session

Once the interactive session is running, the user must "reserve" the resources needed for the actual job:

salloc --ntasks=8 --ntasks-per-node=1 --cpus-per-task=2            # allocates resources
salloc: CPU resource required, checking settings/requirements...
salloc: Granted job allocation ....
salloc: Waiting for resource configuration
salloc: Nodes ...                          # system reports back where the resources were allocated
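When the work is finished, the user should leave the interactive shell so that the reserved resources are returned to the pool:

exit    # ends the interactive session and releases the allocation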