Running Jobs
Running jobs on any HPCC server - an overview
Running a job on any HPCC production server is a two-step process. In Step 1, users prepare a set of files in their /scratch/<userid> directory and set up the application environment. The needed items include:
- Input file(s) holding the job's input data. Input can be placed in a subdirectory, but the path to it must be specified;
- Parameter file(s) for the job (if applicable; these can also be in a subdirectory with an explicit path);
- A properly set up execution environment, created by loading the appropriate module(s);
- A correct job submission script which holds the computational parameters of the job (e.g. needed number of cores, amount of memory, run time, etc.).
In Step 2, users submit the job via the batch system, which takes the job submission script as its input. As explained below, job submission differs slightly between HPCC servers, due to differences in how /scratch is provided (a separate file system vs. a directory) and in the environment modules system used (TCL modules vs. Lmod).
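As an illustration only, the two steps might look like the following minimal sketch; the directory, file and module names are placeholders rather than actual HPCC values:

# Step 1: prepare a job directory on scratch with the needed files and environment
cd /scratch/<userid>
mkdir <job_name> && cd <job_name>
cp /global/u/<userid>/<job_name>/input.dat .    # input data
cp /global/u/<userid>/<job_name>/params.cfg .   # parameter file (if any)
module load <application_module>                # set up the execution environment
# Step 2: hand the job submission script to the batch system
sbatch job.sh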
Running jobs on Penzias, Appel and Karle
Jobs on these servers can start only from a separate file system called scratch, which is mounted on all production nodes. This file system is not the main file system and does not hold users' home directories. Consequently, users must prepare the set of job-related files in their /scratch/<userid> directory before submitting a job. Users must be aware that scratch is temporary storage, and must save their data and important files (including executables) in their home directory. The minimal set of files needed to submit a job includes:
- Input file(s) for the job holding input data;
- Parameter(s) file(s) for the job (if applicable);
- A properly set up execution environment (loaded modules);
- A correct job submission script.
Input files (Penzias, Appel and Karle)
The input files and parameter files can be generated locally or transferred directly to /scratch/<userid>. HPCC recommends transferring files to the user's home directory (/global/u/<userid>) first and then copying the needed files from home to /scratch/<userid>. These files can be transferred from a user's local storage (e.g. a laptop) to the DSMS (/global/u/<userid>) using cea and/or Globus. The submission script must be created with a Unix/Linux text editor only, such as Vi/Vim, Edit, Pico or Nano. Microsoft Word is a word processing system and cannot be used to create job submission scripts.
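As an illustration, a transfer from a local workstation to the home directory might look like the line below; the transfer node's host name and the file names are placeholders, not confirmed values:

# Run on your local workstation: push files to your DSMS home directory via the cea transfer node
scp input.dat params.cfg <userid>@<cea_hostname>:/global/u/<userid>/myTask/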
Running jobs on Arrow
Arrow is attached to a 2.1 PB hybrid file system holding both users' home directories (/global/u/<userid>) and scratch directories (/scratch/<userid>). The underlying file system manages file placement automatically to ensure the best possible performance for different types of files. To submit a job on Arrow, users must prepare and place the set of job-related files, including a correct job submission file, in /scratch/<userid>. Users must preserve valuable files (data, executables, parameters, etc.) in /global/u/<userid>.
Input files (Arrow)
The input files can be generated locally. The file transfer node cea and Globus Online cannot be used to transfer files to Arrow's storage. Users should consult HPCC about possible options for transferring their files to Arrow's storage.
Set up execution environment on all clusters
All servers at HPCC are shared resources. To ensure a proper environment for every job, HPCC uses an environment modules system, which allows dynamic modification of a user's environment via modulefiles. Each modulefile holds the information needed to configure the shell environment for a specific software application, or to provide access to specific software tools and libraries. Modulefiles may be shared by all users on a system, and users may also have their own collections of modulefiles. These collections may be used for "fast load" of the modules needed for a job, or to supplement or replace the shared modulefiles. HPCC uses two environment modules systems: the traditional Unix Modules system based on TCL, and the Lmod system based on Lua. The latter has a clear advantage when complex hierarchical workflows are needed, since Lmod handles the MODULEPATH hierarchy problem. In addition, Lmod supports TCL modules.
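On systems running Lmod, a personal collection provides the "fast load" mentioned above; a minimal sketch (the module names are placeholders) is:

# Load the modules a job needs, then save them as a named collection
module load <compiler_module> <mpi_module> <application_module>
module save myjob_env
# Later, restore the whole collection in one step (e.g. inside a job script)
module restore myjob_env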
Introduction
SLURM is the open source scheduler and batch system implemented at HPCC. Currently SLURM is used only for Penzias' job management, but its use will be expanded to other servers in the future.
SLURM commands:
SLURM commands resemble the commands used in the Portable Batch System (PBS). The table below compares the most common SLURM and PBS Pro commands.
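The comparison table is not reproduced in this revision; for orientation, the most commonly used equivalents are:

qsub <script>       ->  sbatch <script>              (submit a job)
qstat               ->  squeue                       (show queued and running jobs)
qdel <job_id>       ->  scancel <job_id>             (delete a job)
qstat -f <job_id>   ->  scontrol show job <job_id>   (detailed job information)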
A few examples follow:
If the files are in /global/u
cd /scratch/<userid>
mkdir <job_name> && cd <job_name>
cp /global/u/<userid>/myTask/a.out ./
cp /global/u/<userid>/myTask/<mydatafile> ./
If the files are in SR (cunyZone):
cd /scratch/<userid>
mkdir <job_name> && cd <job_name>
iget myTask/a.out ./
iget myTask/<mydatafile> ./
Set up job environment
Users must load the proper environment before starting any job. The loaded environment will be automatically exported to the compute nodes at execution time. Users must use modules to load the environment. For example, to load the environment for the default version of GROMACS, type:
module load gromacs
The list of available modules can be seen with command
module avail
The list of loaded modules can be seen with command
module list
More information about modules is provided in "Modules and available third party software" section below.
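To inspect what a particular modulefile does before loading it, or to clear previously loaded modules, the standard module commands can be used (the module name below is only an example):

module show gromacs    # display the environment changes the modulefile makes
module purge           # unload all currently loaded modules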
Running jobs on HPC systems running SLURM scheduler
To be able to schedule your job for execution and to actually run your job on one or more compute nodes, SLURM needs to be instructed about your job’s parameters. These instructions are typically stored in a “job submit script”. In this section, we describe the information that needs to be included in a job submit script. The submit script typically includes
- job name
- queue name
- what compute resources (number of nodes, number of cores, the amount of memory, the amount of local scratch disk storage (applies to Andy, Herbert, and Penzias), and the number of GPUs) or other resources the job will need
- packing option
- the actual commands that need to be executed (the binary that needs to be run, input/output redirection, etc.).
A pro forma job submit script is provided below.
#!/bin/bash
#SBATCH --partition <queue_name>
#SBATCH -J <job_name>
#SBATCH --mem <mem>

# change to the working directory
cd $SLURM_SUBMIT_DIR

echo ">>>> Begin <job_name>"

# the actual binary (with IO redirections) and required input
# parameters are given on the next line
mpirun -np <cpus> <Program Name> <input_text_file> > <output_file_name> 2>&1
Note: The #SBATCH string must precede every SLURM parameter.
A # symbol at the beginning of any other line designates a comment line, which is ignored by SLURM.
Explanation of SLURM attributes and parameters:
- --partition <queue_name> The available main queue is "production" unless otherwise instructed.
  • "production" is the normal queue for processing your work on Penzias.
  • "development" is used when you are testing an application. Jobs submitted to this queue cannot request more than 8 cores or use more than 1 hour of total CPU time. If a job exceeds these limits, it will be automatically killed. The "development" queue has higher priority, so jobs in this queue have shorter wait times.
  • "interactive" is used for quick interactive tests. Jobs submitted to this queue run in an interactive terminal session on one of the compute nodes. They cannot use more than 4 cores or more than a total of 15 minutes of compute time.
- -J <job_name> The user must assign a name to each job they run. Names can be up to 15 alphanumeric characters in length.
- --ntasks=<cpus> The number of cpus (or cores) that the user wants to use.
  • Note: SLURM refers to "cores" as "cpus"; currently HPCC clusters map one thread to one core.
- --mem <mem> This parameter is required. It specifies how much memory is needed per job.
- --gres gpu:<number> The number of graphics processing units that the user wants to use on a node (this parameter is only available on Penzias); gpu:2 denotes a request for 2 GPUs.
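Put together, a request combining these attributes could look like the following sketch; the specific values are illustrative only:

#SBATCH --partition production
#SBATCH -J test_job
#SBATCH --ntasks=8
#SBATCH --mem 16G
#SBATCH --gres gpu:2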
Special note for MPI users
How parameters are defined can significantly affect the run time of a job. For example, assume you need to run a job that requires 64 cores. This can be scheduled in a number of different ways. For example,
#SBATCH --nodes 8
#SBATCH --ntasks 64
will place the job as 8 chunks of 8 cpus each on nodes that have 8 cpus free. While this may minimize communication overhead in your MPI job, SLURM will not schedule the job until 8 nodes, each with 8 free cpus, become available. Consequently, the job may wait longer in the input queue before going into execution.
#SBATCH --nodes 32
#SBATCH --ntasks 64
will spread the same 64 tasks across 32 nodes, i.e. 32 chunks of 2 cores each. In this case the job ends up much more sparsely distributed across the system, and the total averaged communication latency may therefore be larger than in the case with --nodes 8, --ntasks 64.
mpirun -np <total tasks or total cpus>. This script line is only to be used for MPI jobs and defines the total number of cores required for the parallel MPI job.
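Inside a submit script, the requested task count can be passed to mpirun without hard-coding it, for example:

# SLURM exports the number of requested tasks as SLURM_NTASKS
mpirun -np $SLURM_NTASKS </path/to/your_binary> > <my_output> 2>&1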
Table 2 below shows the maximum values of the various SLURM parameters by system. Request only the resources you need, as requesting maximal resources will delay your job.
Serial Jobs
For serial jobs, --nodes 1 and --ntasks 1 should be used.
#!/bin/bash
#
# Typical job script to run a serial job in the production queue
#
#SBATCH --partition production
#SBATCH -J <job_name>
#SBATCH --nodes 1
#SBATCH --ntasks 1

# Change to working directory
cd $SLURM_SUBMIT_DIR

# Run my serial job
</path/to/your_binary> > <my_output> 2>&1
OpenMP and Threaded Parallel jobs
OpenMP jobs can run only on a single node. Therefore, for OpenMP jobs, --nodes 1 and --ntasks 1 should be used, and -c (--cpus-per-task) should be set to 2, 3, 4, ..., n, where n must be less than or equal to the number of cores on a compute node.
Typically, OpenMP jobs will use the <mem> parameter and may request up to all the available memory on a node.
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH --partition production
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH -c 4
#SBATCH --mem=<mem>

# Change to working directory
cd $SLURM_SUBMIT_DIR

# Set OMP_NUM_THREADS to the same value as -c, with a fallback in case it isn't set.
# SLURM_CPUS_PER_TASK is set to the value of -c, but only if -c is explicitly set.
if [ -n "$SLURM_CPUS_PER_TASK" ]; then
  omp_threads=$SLURM_CPUS_PER_TASK
else
  omp_threads=1
fi
export OMP_NUM_THREADS=$omp_threads

# Run my OpenMP job
</path/to/your_binary> > <my_output> 2>&1
MPI Distributed Memory Parallel Jobs
For an MPI job, --nodes and --ntasks can each be set to one or more, with the mpirun -np value greater than or equal to 1.
#!/bin/bash
#
# Typical job script to run a distributed memory MPI job in the production queue, requesting 16 cores on 16 nodes.
#
#SBATCH --partition production
#SBATCH -J <job_name>
#SBATCH --ntasks 16
#SBATCH --nodes 16
#SBATCH --mem=<mem>

# Change to working directory
cd $SLURM_SUBMIT_DIR

# Run my 16-core MPI job
mpirun -np 16 </path/to/your_binary> > <my_output> 2>&1
GPU-Accelerated Data Parallel Jobs
#!/bin/bash
#
# Typical job script to run a 1 CPU, 1 GPU batch job in the production queue
#
#SBATCH --partition production
#SBATCH -J <job_name>
#SBATCH --ntasks 1
#SBATCH --gres gpu:1
#SBATCH --mem <mem>

# Find out which compute node the job is using
hostname

# Change to working directory
cd $SLURM_SUBMIT_DIR

# Run my GPU job on a single node using 1 CPU and 1 GPU.
</path/to/your_binary> > <my_output> 2>&1
Submitting jobs for execution
NOTE: We do not allow users to run any production job on the login-node. It is acceptable to do short compiles on the login node, but all other jobs must be run by handing off the “job submit script” to SLURM running on the head-node. SLURM will then allocate resources on the compute-nodes for execution of the job.
The command to submit your “job submit script” (<job.script>) is:
sbatch <job.script>
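After submission, the job can be monitored and, if necessary, cancelled with standard SLURM commands, for example:

sbatch job.sh            # submit; SLURM prints the assigned job ID
squeue -u <userid>       # list your queued and running jobs
scancel <job_id>         # cancel a job that is no longer needed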
This section is in development.
Saving output files and clean-up
Normally you expect certain data in the output files as a result of a job. There are a number of things that you may want to do with these files:
- Check the content of the outputs and discard them. In this case you can simply delete all unwanted data with the rm command.
- Move the output files to your local workstation. You can use scp for small amounts of data and/or Globus Online for larger data transfers.
- Store the outputs on HPCC resources. In this case you can move your outputs either to /global/u or to the SR1 storage resource.
In all cases your /scratch/<userid> directory is expected to be left empty. Output files stored in /scratch/<userid> can be purged at any moment, except for files under a /scratch/<userid>/<job_name> directory that are currently being used by active jobs.
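A typical clean-up after a finished job might look like the following sketch; the directory and file names are placeholders:

# Save the results you want to keep in your home directory
cp /scratch/<userid>/<job_name>/<my_output> /global/u/<userid>/
# Then remove the job directory from scratch
rm -rf /scratch/<userid>/<job_name>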