TOPHAT

The other tools in this collection, BOWTIE, CUFFLINKS, and SAMTOOLS, are also installed at the CUNY HPC Center. Additional information can be found at the TOPHAT home page [1].

At the CUNY HPC Center, TOPHAT is installed on ANDY. TOPHAT is a parallel threaded code (pthreads) that takes its input from simple text files named on the command line. The example SLURM scripts below run the mRNA test case provided with the distribution. The test case files can be copied from the local installation directory to your current location as follows:

cp /share/apps/tophat/default/examples/* .

To set all required environment variables and add the TOPHAT executable to your path, run the module load command (the modules utility is discussed in detail above):

module load tophat
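
After loading the module, it can be worth confirming that the executables are actually reachable on your PATH before submitting a job. The following is a minimal sketch. The helper name 'check_tool' is hypothetical (it is not part of TOPHAT or the modules utility), and the list of tools checked is an assumption based on the collection named above.

```shell
#!/bin/bash
# Hypothetical helper: report whether a command is reachable on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# After 'module load tophat' one would expect these to be found.
# The tool names are assumptions; adjust them to the local installation.
for tool in tophat bowtie samtools; do
    check_tool "$tool" || true
done
```

A "missing" line for any tool suggests the module was not loaded in the current shell.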

Running 'tophat' from the interactive prompt without any options prints a brief description of the command-line arguments and options. Here is a SLURM batch script that builds the index and aligns the sequences in serial mode:

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name TOPHAT_serial
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=2880

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# Change to the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR

# Point to the execution directory to run
echo ">>>> Begin TOPHAT Serial Run ..."
tophat -r 20 test_ref reads_1.fq reads_2.fq > tophat_mrna.out 2>&1
echo ">>>> End   TOPHAT Serial Run ..."

This script can be saved to a file (say tophat_ser.job) and submitted with the command:

sbatch tophat_ser.job

Running the mRNA test case should take less than 2 minutes and will produce a SLURM output file. Because the script does not set '--output', SLURM uses its default name, 'slurm-<jobid>.out', and writes both standard output and standard error to it. The messages that TOPHAT itself prints are written into the user-specified file at the end of the TOPHAT command line after the greater-than sign. Here it is named 'tophat_mrna.out'. The alignment results go to TOPHAT's output directory ('tophat_out' by default, or the directory given with '-o'). The expression '2>&1' at the end of the command line combines Unix standard error from the program with its standard output. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the SLURM spool directory on /var.
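
The effect of '> file 2>&1' can be seen without running TOPHAT at all. The sketch below uses a hypothetical stand-in function, 'noisy', that writes one line to each stream. It is for illustration only and is not part of the TOPHAT test case.

```shell
#!/bin/bash
# Stand-in for a program that writes to both streams (hypothetical, not TOPHAT).
noisy() {
    echo "result line"          # goes to standard output
    echo "warning line" >&2     # goes to standard error
}

# Temporary file playing the role of tophat_mrna.out
log=$(mktemp)

# '> file' redirects standard output to the file; '2>&1' then points
# standard error at the same place, so both lines end up in the file.
noisy > "$log" 2>&1

cat "$log"
```

The order matters: writing '2>&1' before '> file' would attach standard error to the terminal (or the SLURM output file) rather than to the named file.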

Details on the meaning of the SLURM script are covered below in the SLURM section. The most important lines are '#SBATCH --nodes=1', '#SBATCH --ntasks=1', and '#SBATCH --mem=2880'. Together they instruct SLURM to allocate 1 processor (core) and 2,880 MB of memory on a single node for the job. Because the script sets no placement constraint beyond the partition, SLURM is free to run the job on whichever node has those resources available. The master compute node that SLURM finally selects to run your job will be printed in the SLURM output file by the 'hostname' command.

To run TOPHAT in parallel-threads mode, several changes to the script are required. Here is a modified script that shows how to run TOPHAT using two threads. ANDY has as many as 8 physical compute cores per compute node, and therefore as many as 8 cores-threads might be chosen. Once a parallel job starts it will generally (not always) complete in less time, but jobs requesting a larger number of cores-threads or more memory per node may wait longer to start on a busy system as SLURM looks for a compute node with all the resources requested.

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name TOPHAT_threads
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --mem=5760

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# Change to the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR

# Point to the execution directory to run
echo ">>>> Begin TOPHAT Threaded Run ..."
tophat -p 2 -r 20 test_ref reads_1.fq reads_2.fq > tophat_thrds.out 2>&1
echo ">>>> End   TOPHAT Threaded Run ..."

Notice the difference in the '#SBATCH --ntasks' and '#SBATCH --mem' lines, where the job now requests 2 cores (ntasks=2) and twice as much memory as before. Also, notice that the TOPHAT command line now includes the '-p 2' option to run the code with 2 threads working in parallel. Perfectly or 'embarrassingly' parallel workloads can run close to 2, 4, or more times as fast as the same workload in serial mode depending on the number of threads requested, but workloads cannot be counted on to be perfectly parallel.
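
One way to keep the '-p' value and the '--ntasks' request from drifting apart is to derive the thread count from the 'SLURM_NTASKS' environment variable, which SLURM sets inside a job when '--ntasks' is given. The following is a sketch under that assumption. The function name 'threads_for_job' is hypothetical, and the TOPHAT line is shown only as a comment so the snippet can be tried outside a job.

```shell
#!/bin/bash
# Use the task count SLURM granted; fall back to 1 thread outside a job.
threads_for_job() {
    echo "${SLURM_NTASKS:-1}"
}

THREADS=$(threads_for_job)
echo ">>>> TOPHAT would run with ${THREADS} thread(s)"

# In the batch script the TOPHAT line would then read (not executed here):
# tophat -p "$THREADS" -r 20 test_ref reads_1.fq reads_2.fq > tophat_thrds.out 2>&1
```

With this approach only the '#SBATCH --ntasks' line (and the memory request) needs editing when the core count changes.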

The speedups that you observe will typically be less than perfect and will diminish as you ask for more cores-threads. Large data jobs will typically scale more efficiently as you add cores-threads, but users should take note of the performance gains that they see as cores-threads are added and select a core-thread count that provides efficient scaling and avoids diminishing returns.
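
To judge where returns start to diminish, it can help to turn wall-clock timings into a speedup (serial time divided by parallel time) and a parallel efficiency (speedup divided by thread count). The sketch below does this with awk. The function name 'scaling' is hypothetical and the timings are made-up illustrative values, not measurements from ANDY.

```shell
#!/bin/bash
# speedup    = serial_time / parallel_time
# efficiency = speedup / threads
scaling() {
    # $1 = serial seconds, $2 = parallel seconds, $3 = thread count
    awk -v ts="$1" -v tp="$2" -v n="$3" \
        'BEGIN { s = ts / tp; printf "speedup=%.2f efficiency=%.2f\n", s, s / n }'
}

# Illustrative timings only (invented for this example).
scaling 100 60 2
scaling 100 50 4
```

An efficiency close to 1 means the extra cores are being used well. A value that drops sharply as threads are added suggests the smaller core count is the better choice.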