BOWTIE2

From HPCC Wiki
Revision as of 20:13, 27 October 2022 by James

At the CUNY HPC Center BOWTIE2 is installed on ANDY and PENZIAS. BOWTIE2 is a parallel threaded code (pthreads) that takes its input from a simple text file provided on the command line. Below is an example SLURM script that will run the lambda virus test case provided with the BOWTIE2 distribution, which can be copied from the local installation directory to your current location as follows:

cp /share/apps/bowtie2/default/examples/reference/lambda_virus.fa .
cp /share/apps/bowtie2/default/examples/reads/reads_1.fq . 

To include all required environmental variables and the path to the BOWTIE2 executable, run the module load command (the module utility is discussed in detail above).

module load bowtie2

Running 'bowtie2' from the interactive prompt without any options will provide a brief description of the form of the command-line arguments and options. Here is a SLURM batch script that builds the lambda virus index and aligns the sequences in serial mode:

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name BOWTIE2_Serial
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=2880

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# You must explicitly change to the working directory in SLURM
cd $SLURM_SUBMIT_DIR

# Point to the execution directory to run
echo ">>>> Begin BOWTIE2 Serial Run ..."
echo ""
echo ">>>> Build Index ..."
bowtie2-build lambda_virus.fa lambda_virus > lambda_virus_index.out 2>&1
echo ""
echo ">>>> Align Sequence ..."
bowtie2 -x lambda_virus -U reads_1.fq -S eg1.sam > lambda_virus_align.out 2>&1
echo ""
echo ">>>> End   BOWTIE2 Serial Run ..."

This script can be dropped into a file (say bowtie2.job) and started with the command:

sbatch bowtie2.job

Running the lambda virus test case should take less than 2 minutes and will produce SLURM output and error files beginning with the job name 'BOWTIE2_Serial'. The primary BOWTIE2 application results will be written into the user-specified files at the end of each BOWTIE2 command line after the greater-than sign; here they are named 'lambda_virus_index.out' and 'lambda_virus_align.out'. The expression '2>&1' at the end of the command line combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the SLURM spool directory on /var.
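The effect of the '> file 2>&1' redirection used above can be seen with a small stand-alone sketch (the file name 'combined.out' is arbitrary): '> combined.out' captures standard output, and '2>&1' folds standard error into the same file.

```shell
# One message goes to stdout, one to stderr; both land in combined.out.
( echo "to stdout"; echo "to stderr" 1>&2 ) > combined.out 2>&1

# Both lines are now in the one file.
cat combined.out
```

Without the trailing '2>&1', any error messages would instead end up in the SLURM error file.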

Details on the meaning of the SLURM script are covered below in the SLURM section. The most important lines are '#SBATCH --nodes=1', '#SBATCH --ntasks=1', and '#SBATCH --mem=2880'. Together these instruct SLURM to select 1 node with 1 processor (core) and 2,880 MBs of memory for the job; SLURM then places the job wherever the least-used resources that satisfy the request are found. The master compute node that SLURM finally selects to run your job will be printed in the SLURM output file by the 'hostname' command.
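Because the '#SBATCH' directives begin with '#', they are ordinary shell comments: SLURM reads them at submission time, while the shell ignores them when the script body executes. A small stand-alone sketch illustrates this (the file name 'demo.job' is illustrative):

```shell
# Write a minimal batch script containing SLURM directives.
cat > demo.job <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
echo "directives above are comments to the shell"
EOF

# Running it directly with the shell (no SLURM) still works: the
# '#SBATCH' lines are skipped as comments and only the body runs.
bash demo.job
```

This is why the same file can be tested interactively before being handed to sbatch.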

To run BOWTIE2 in parallel-threads mode several changes to the script are required. Here is a modified script that shows how to run BOWTIE2 using two threads. ANDY has as many as 8 physical compute cores per compute node and therefore as many as 8 threads might be chosen, but the larger the number of cores-threads requested the longer the job may wait to start as SLURM looks for a compute node with the free resources requested.

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name BOWTIE2_threads
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --mem=5760

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# You must explicitly change to the working directory in SLURM
cd $SLURM_SUBMIT_DIR

# Point to the execution directory to run
echo ">>>> Begin BOWTIE2 Threaded Run ..."
echo ""
echo ">>>> Build Index ..."
bowtie2-build lambda_virus.fa lambda_virus > lambda_virus_index.out 2>&1
echo ""
echo ">>>> Align Sequence ..."
bowtie2 -p 2 -x lambda_virus -U reads_1.fq -S eg1.sam > lambda_virus_align2.out 2>&1
echo ""
echo ">>>> End   BOWTIE2 Threaded Run ..."

Notice the difference in the '#SBATCH --ntasks' line, where the resource 'chunk' now includes 2 cores (ntasks=2) and requests twice as much memory as before. Also, notice that the BOWTIE2 command line now includes the '-p 2' option to run the code with 2 threads working in parallel. Perfectly or 'embarrassingly' parallel workloads can run close to 2, 4, or more times as fast as the same workload in serial mode, depending on the number of threads requested, but workloads cannot be counted on to be perfectly parallel.

The speed-ups that you observe will typically be less than perfect and will diminish as you ask for more cores-threads. Larger jobs will typically scale more efficiently as cores-threads are added, but users should take note of the performance gains that they see as cores-threads are added and select a core-thread count that provides efficient scaling and avoids diminishing returns.
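One simple way to gauge scaling is to time the same workload at increasing thread counts and watch for diminishing returns. The sketch below uses a placeholder command so it can run anywhere; in practice you would substitute the real alignment, e.g. 'bowtie2 -p $p -x lambda_virus -U reads_1.fq -S eg1.sam'.

```shell
# Hedged sketch of a scaling check: run the workload once per thread
# count and report elapsed wall time in milliseconds.
for p in 1 2 4; do
  start=$(date +%s%N)
  seq 1 200000 > /dev/null   # placeholder workload; substitute the real bowtie2 run
  end=$(date +%s%N)
  echo "threads=$p elapsed_ms=$(( (end - start) / 1000000 ))"
done
```

If doubling the thread count stops roughly halving the elapsed time, the extra cores are being wasted and a smaller request will start sooner in the queue.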