SAMTOOLS


SAMTOOLS is part of a sequence alignment and analysis tool chain developed at Johns Hopkins University, the University of California at Berkeley, and Harvard, and distributed through the Center for Bioinformatics and Computational Biology. The other tools in this collection, BOWTIE, CUFFLINKS, and TOPHAT, are also installed at the CUNY HPC Center. Additional information can be found at the SAMTOOLS home page [1].

At the CUNY HPC Center, SAMTOOLS is installed on ANDY. SAMTOOLS is a collection of utilities for extracting, reformatting, and displaying nucleotide sequences. The primary tool is called 'samtools' and offers a large number of command-line options. For smaller tasks, SAMTOOLS can be run interactively, but it should be run in SLURM batch mode when larger, longer tasks are anticipated. Note that display tasks cannot be run in pure SLURM batch mode because their output is displayed on the terminal. Larger display tasks should be run in SLURM interactive mode as described in the SLURM section elsewhere in this document.

Below is an example SLURM script that will convert the 'toy.sam' file provided with the distribution from SAM to BAM format. This and all the example files can be copied from the local installation directory to your current location as follows:

cp /share/apps/samtools/default/examples/* .

To set all required environment variables and add the SAMTOOLS executables to your path, run the module load command (the modules utility is discussed in detail above):

module load samtools
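A quick way to confirm that the module load took effect is to check whether the 'samtools' executable is now on the search path. The sketch below uses only standard shell built-ins; the variable name samtools_status is purely illustrative and not part of SAMTOOLS or the modules utility.

```shell
#!/bin/bash
# Check whether the samtools executable can be found on PATH.
# 'samtools_status' is an illustrative variable name, nothing more.
if command -v samtools >/dev/null 2>&1; then
    samtools_status="found"
    echo "samtools found at: $(command -v samtools)"
else
    samtools_status="missing"
    echo "samtools is not on PATH; try 'module load samtools' first"
fi
```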

Running 'samtools' from the interactive prompt without any options prints a brief description of the command-line arguments and options. Here is a SLURM batch script that does a short format conversion in batch mode:

#!/bin/bash
#SLURM -q production
#SLURM -N SAMTLS_serial
#SLURM -l select=1:ncpus=1:mem=2880mb
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# You must explicitly change to the working directory in SLURM
cd $SLURM_O_WORKDIR

# Convert toy.sam from SAM to BAM; standard error goes to toy.err
echo ">>>> Begin SAMTLS Serial Run ..."
samtools view -bS toy.sam > toy.bam 2> toy.err
echo ">>>> End   SAMTLS Serial Run ..."

This script can be saved in a file (say samtools_ser.job) and started with the command:

qsub samtools_ser.job

Running this conversion test case should take less than 1 minute and will produce SLURM output and error files beginning with the job name 'SAMTLS_serial'. The primary SAMTOOLS results will be written into the user-specified file named after the greater-than sign on the 'samtools' command line; here it is named 'toy.bam'. The expression '2> toy.err' at the end of the command line (written with no space between the '2' and the greater-than sign) directs Unix standard error to the file 'toy.err'. Users should always explicitly specify the names of the application's output files to ensure that they are written directly into the user's working directory, which has much more disk space than the SLURM spool directory on /var.
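The spacing in the redirection matters. With a space between the '2' and the greater-than sign, the shell passes '2' to the command as an ordinary argument and redirects standard output a second time, so the first output file is left empty. The short sketch below illustrates both forms with a hypothetical stand-in function, emit_both, which is not part of SAMTOOLS; the file names demo.out, demo.err, bad.out, and bad.err are likewise only for illustration.

```shell
#!/bin/bash
# Hypothetical stand-in for a program that writes to both streams.
emit_both() {
    echo "result data"            # written to standard output
    echo "warning message" >&2    # written to standard error
}

# Correct form: '2>' (no space) sends standard error to demo.err.
emit_both > demo.out 2> demo.err

# Faulty form: '2 >' passes "2" as an argument and redirects standard
# output again, so bad.out stays empty and bad.err receives the results.
# Standard error is not redirected at all and goes to the terminal.
emit_both > bad.out 2 > bad.err

echo "demo.out: $(cat demo.out)"
echo "demo.err: $(cat demo.err)"
echo "bad.out size in bytes: $(wc -c < bad.out | tr -d ' ')"
echo "bad.err: $(cat bad.err)"
```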

Details on the meaning of the SLURM script are covered below in the SLURM section. The most important lines are the '#SLURM -l select=1:ncpus=1:mem=2880mb' and the '#SLURM -l place=free' lines. The first instructs SLURM to select 1 resource 'chunk' with 1 processor (core) and 2,880 MB of memory for the job. The second instructs SLURM to place the job freely, on the least busy node where the requested resources can be found. The master compute node that SLURM finally selects to run your job will be printed in the SLURM output file by the 'hostname' command.
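For comparison, on a stock SLURM installation batch directives are normally written with the '#SBATCH' prefix, jobs are submitted with 'sbatch', and the submission directory is available in the SLURM_SUBMIT_DIR environment variable. The sketch below restates the same single-core, 2,880 MB request in that form. It assumes a partition named 'production' (taken from the queue name in the script above), and the fallback message for a missing 'samtools' or 'toy.sam' is added only so the sketch runs harmlessly outside the cluster; local site policy may differ.

```shell
#!/bin/bash
#SBATCH --partition=production     # assumed partition name, from '-q production'
#SBATCH --job-name=SAMTLS_serial
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2880M
#SBATCH --export=ALL               # sbatch default, shown to mirror '-V'

# Report the compute node that runs the job
echo ">>>> SLURM Master compute node is: $(hostname)"

# SLURM starts the job in the submission directory by default;
# the explicit cd is only a safeguard.
cd "${SLURM_SUBMIT_DIR:-$PWD}" || exit 1

echo ">>>> Begin SAMTLS Serial Run ..."
if command -v samtools >/dev/null 2>&1 && [ -f toy.sam ]; then
    # Convert SAM to BAM; standard error goes to toy.err
    samtools view -bS toy.sam > toy.bam 2> toy.err
else
    echo "samtools or toy.sam not found; nothing converted" >&2
fi
echo ">>>> End   SAMTLS Serial Run ..."
```

Saved under a hypothetical name such as samtools_sbatch.job, this would be submitted with 'sbatch samtools_sbatch.job'. Unless an output file is requested explicitly, SLURM writes the job's standard output and error to a file named slurm-<jobid>.out in the submission directory.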