Submitting Jobs


    Running Jobs

    In this section, we discuss the process for running jobs on an HPC system. Typically, the process involves the following steps:

    • Having input files within your /scratch/<user_name> directory on the HPC system you wish to use.
    • Creating a job submit script that identifies the input files, the application program you wish to use, the compute resources needed to execute the job, and information on where you wish to write your output files.
    • Submitting the job script.
    • Saving output to SR1 using iRODS and cleaning up.

    These steps are explained below.

    Input file on /scratch

    The general case is that you will have one or more input files containing the data on which you wish to operate. These input files must be stored within the /scratch/<user_name> directory hierarchy of the HPC system you wish to use. The files can come from any of the following sources (a brief staging example follows the list):

    • You can create them using a text editor.
    • You can copy them from your directory in /global/u.
    • You can copy them from your SR1 directory
    • You can copy input files from other places (such as your local computer, the web, etc.).
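
    For example, input files might be staged into /scratch as follows (all file and host names below are illustrative placeholders):

    # Copy an input file from your /global/u home area to /scratch
    cp /global/u/<user_name>/<input_text_file> /scratch/<user_name>/

    # Copy an input file from your local workstation to /scratch
    # (run this from the local machine; replace <hpc_system> with the
    # login-node of the HPC system you wish to use)
    scp <input_text_file> <user_name>@<hpc_system>:/scratch/<user_name>/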

    Writing a job submit script

    Today’s typical HPC system consists of a login-node, a head-node, and compute-nodes.

    • The login-node is your interface to the HPC system.
    • The head-node manages the operation of the HPC system. A software package called PBSpro runs on the head-node and serves as the job queuing and management system. PBSpro manages the processing of user jobs and the allocation of the HPC system's resources.
    • The compute-nodes are where your job is executed.

    To schedule your job for execution and to actually run it on one or more compute nodes, PBSpro needs to be instructed about your job's parameters. These instructions are typically stored in a job submit script. In this section, we describe the information that needs to be included in a job submit script. The submit script typically includes:

    • job name
    • queue name
    • the compute resources the job will need (number of nodes, number of cores, amount of memory, amount of scratch disk storage, and number of GPUs) or other resources
    • packing option
    • the actual commands that need to be executed (the binary to run, input/output redirection, etc.)

    Documentation: The PBSpro reference guide can be found at http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12.1.pdf
    A pro forma job submit script is provided below.

    #!/bin/bash
    #PBS -q <queue_name>
    #PBS -N <job_name>
    #PBS -l select=<chunks>:ncpus=<cpus>
    #PBS -l mem=<mem>mb
    #PBS -l place=<placement>
    #PBS -V
    
    # change to the working directory 
    cd $PBS_O_WORKDIR
    
    echo ">>>> Begin <job_name>"
    
    # actual binary (with IO redirections) and required input 
    # parameters is called in the next line
    
    mpirun -np <chunks * ncpus> -machinefile $PBS_NODEFILE <Program Name> <input_text_file> > <output_file_name> 2>&1
    
    echo ">>>> End <job_name> Run ..."
    

    Note: The #PBS string must precede every PBS parameter. A # symbol at the beginning of any other line designates a comment line, which is ignored by PBSpro.
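
    For example (the job name below is illustrative):

    #PBS -N my_job
    # This is an ordinary comment line; PBSpro ignores it.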

    Explanation of PBSpro attributes and parameters:

    -q <queue_name> The three available queues are production, development, and interactive.

    • production is the normal queue for processing your work.
    • development is used when you are testing an application. Jobs submitted to this queue cannot request more than 8 cores or use more than 1 hour of total CPU time. If a job exceeds these limits, it will be automatically killed. The development queue has a higher priority, so jobs in it typically have shorter wait times.
    • interactive is used for quick interactive tests. Jobs submitted to this queue give you an interactive terminal session on one of the compute nodes. They cannot use more than 4 cores or more than a total of 15 minutes of compute time. A sample interactive request is sketched below.
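
    As an illustration, an interactive session is typically requested directly with qsub's -I flag (the resource values here are only a sketch and must stay within the interactive queue's limits):

    qsub -I -q interactive -l select=1:ncpus=2

    When the session starts, you are placed in a shell on a compute node; typing exit ends the session and releases the resources.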

    -N <job_name> The user must assign a name to each job they run. Names can be up to 15 alphanumeric characters in length.
    -l select=<chunks>: A chunk is a collection of resources (cores, memory, disk space etc…).
    -l ncpus=<cpus> The number of cpus (or cores) that the user wants to use on a node.

    • Note: PBSpro refers to cores as cpus; historically the terms were interchangeable, but the processing units on a node are now more commonly called cores.


    -l mem=<mem>mb This parameter is optional. It specifies how much memory is needed per chunk. If not included, PBSpro assumes a default memory size on a per cpu (core) basis.

    -l ngpus=<gpus> The number of graphics processing units that the user wants to use on a node (This parameter is only available on Penzias).

    -l place=<placement> This parameter tells PBSpro how to distribute requested chunks of resources across nodes. placement can take one of three values: free, scatter or pack.

    • If you select free, PBSpro will place your job chunks on any nodes that have the required number of available resources.
    • If you select scatter, PBSpro will schedule your job so that only one chunk is taken from any virtual compute node.
    • If you select pack, PBSpro will only schedule your job to take all the requested chunks from a single node (if no such node is available, the job will remain queued until one is).
    Special note for MPI users 
    
    How the place, select, and ncpus parameters are defined can significantly affect the run time of a job. For example, assume you need to run a job that requires 64 cores. This can be scheduled in a number of different ways. For example,
    
    #PBS -l place=free
    #PBS -l select=8:ncpus=8
     
    will freely place the 8 job chunks on any nodes that have 8 cpus available. While this may minimize communications overhead in your MPI job, PBS will not schedule this job until 8 nodes, each with 8 free cpus, become available. Consequently, the job may wait longer in the input queue before going into execution.
    
    #PBS -l place=free
    #PBS -l select=32:ncpus=2
    
    will freely place 32 chunks of 2 cores each. There will possibly be some nodes with 4 free chunks (8 cores) and some with only 1 free chunk (2 cores). In this case, the job ends up more sparsely distributed across the system, and hence the total average latency may be larger than in the case with select=8:ncpus=8.
    
    In this example, however, it will be much easier for PBS to run the job, since it does not need to wait for 8 completely empty nodes. Therefore, even though select=32:ncpus=2 will probably execute more slowly, it has a higher chance of starting sooner and hence may complete sooner.
    
    If the following parameters are selected:
     
    #PBS -l place=scatter
    #PBS -l select=32:ncpus=2
    
    PBS will distribute the 32 chunks of 2 cpus each across 32 different nodes, one chunk per node.
    

    mpirun -np <chunks * ncpus> This script line is used only for MPI jobs and defines the total number of cores required for the parallel MPI job.
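
    For example, with the following request (the values are illustrative), the job is granted 4 chunks of 8 cores each, so the matching MPI launch uses 4 * 8 = 32 cores:

    #PBS -l place=free
    #PBS -l select=4:ncpus=8

    mpirun -np 32 -machinefile $PBS_NODEFILE </path/to/your_binary> > <my_output> 2>&1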

    The table below shows the maximum values of the various PBS parameters by system. Request only the resources you need as requesting maximal resources will delay your job.

    Maximum PBS settings by system
    
    System     np (1)   ncpus   ngpus   mem per core (MB) (2)   mem per chunk (MB) (3)
    Andy         64       8      NA           2880                    23040
    Bob           1       8      NA           1920                    15360
    Penzias     128       8       2           3800                    30400
    Salk        768       1      NA           1920                     7680
    

    Notes:
    NA = Resource Not Available on this system.
    (1) Largest MPI job allowed on the system.
    (2) The default memory size allocated per core is set to the above maximum value.
    (3) Requesting all of the memory on a node will result in the job staying in the input queue until a node becomes fully available; select must always be set to 1 chunk.

    Serial (Scalar) Jobs

    For serial jobs, select=1 and ncpus=1 should be used.

    
    #!/bin/bash
    #
    # Typical job script to run a serial job in the production queue
    #
    #PBS -q production
    #PBS -N <job_name>
    #PBS -l select=1:ncpus=1
    #PBS -l place=free
    #PBS -V
    
    # Change to working directory
    cd $PBS_O_WORKDIR
    
    # Run my serial job
    </path/to/your_binary> > <my_output> 2>&1
    

    OpenMP Symmetric Multiprocessing (SMP) Parallel Jobs

    SMP jobs can only run on a single virtual node. Therefore, for SMP jobs, place=pack and select=1 should be used; ncpus should be set to [2, 3, 4,… n] where n must be less than or equal to the number of cores on a virtual compute node.

    Typically, SMP jobs will use the <mem> parameter and may request up to all the available memory on a node.

    
    #!/bin/bash
    #
    # Typical job script to run a 4-processor SMP job in 1 chunk in the production queue
    #
    #PBS -q production
    #PBS -N <job_name>
    #PBS -l select=1:ncpus=4:mem=<mem>mb
    #PBS -l place=pack
    #PBS -V
    
    # Change to working directory
    cd $PBS_O_WORKDIR
    
    export OMP_NUM_THREADS=4
    # Run my OpenMP job
    </path/to/your_binary> > <my_output> 2>&1
    
    

    Note for Salk (Cray XE) users: all jobs run on the Cray's compute nodes must be started with Cray's aprun command. In the above script, the last line will need to be modified to:

    aprun -n 4 </path/to/your_binary> > <my_output> 2>&1

    The -n 4 option in the above line specifies the number of cores that aprun will use to start the job. It should not exceed the number of cores requested in the -l select statement.

    MPI Distributed Memory Parallel Jobs

    For an MPI job, select= and ncpus= can each be one or more, with np > 1.

    #!/bin/bash
    #
    # Typical job script to run a distributed memory MPI job in the production queue, requesting 16 chunks each with 1 cpu.
    #
    #PBS -q production
    #PBS -N <job_name>
    #PBS -l select=16:ncpus=1
    #PBS -l place=free
    #PBS -V
    
    # Change to working directory
    cd $PBS_O_WORKDIR
    
    # Run my 16-core MPI job
    
    mpirun -np 16 </path/to/your_binary> > <my_output> 2>&1
    

    Note for Salk (Cray XE) users: all jobs run on the Cray's compute nodes must be started with Cray's aprun command. In the above script, the last line will need to be modified to:

    aprun -n 16 </path/to/your_binary> > <my_output> 2>&1

    The -n 16 option in the above line specifies the number of cores that aprun will use to start the job. It should not exceed the number of cores requested in the -l select statement.

    GPU-Accelerated Data Parallel Jobs

    For GPU jobs, request GPUs with the ngpus parameter in the select statement (GPUs are available only on Penzias).
    
    #!/bin/bash
    #
    # Typical job script to run a 1 CPU, 1 GPU batch job in the production queue
    #
    #PBS -q production
    #PBS -N <job_name>
    #PBS -l select=1:ncpus=1:ngpus=1
    #PBS -l place=free
    #PBS -V
    
    # Find out which compute node the job is using
    hostname
    
    # Change to working directory
    cd $PBS_O_WORKDIR
    
    # Run my GPU job on a single node using 1 CPU and 1 GPU.
    </path/to/your_binary> >  <my_output> 2>&1
    
    

    Submitting jobs for execution

    NOTE: We do not allow users to run batch jobs on the login-node. It is acceptable to do short compiles on the login node, but all other jobs must be run by handing off the job submit script to PBSpro running on the head-node. PBSpro will then allocate resources on the compute-nodes for execution of the job. The command to submit your job submit script (job.script) is:

    qsub <job.script>
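
    For example (the script name is illustrative):

    qsub my_job.script

    On success, PBSpro prints the identifier assigned to the job (of the form <job_number>.<server_name>). The job's progress can then be followed with qstat:

    qstat -u <user_name>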

    Saving output files and clean-up

    Normally you expect certain data in the output files as a result of a job. There are a number of things that you may want to do with these files:

    • Check the content of the outputs and discard them. In that case, you can simply delete all unwanted data with the rm command.
    • Move output files to your local workstation. You can use scp for small amounts of data and/or GlobusOnline for larger data transfers.
    • You may also want to store the outputs on HPCC resources. In this case, you can move your outputs either to /global/u or to the SR1 storage resource.

    In all cases, your /scratch/<user_name> directory is expected to be left empty. Output files stored inside /scratch/<user_name> can be purged at any moment (except for files that are currently being used by active PBS jobs).
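
    A brief sketch of these clean-up steps (all file and host names are illustrative placeholders):

    # From your local workstation: copy a small output file back with scp
    scp <user_name>@<hpc_system>:/scratch/<user_name>/<output_file_name> .

    # Or keep the results on HPCC storage by moving them to /global/u
    mv /scratch/<user_name>/<output_file_name> /global/u/<user_name>/

    # Finally, remove anything left in /scratch that is no longer needed
    rm /scratch/<user_name>/<unwanted_file>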