SAMTOOLS: Difference between revisions

From HPCC Wiki
(Created page with "SAMTOOLS is part of a sequence alignment and analysis tool chain developed at John Hopkins, University of California at Berkeley, and Harvard, and distributed through the Center for Bioinformatics and Computational Biology. The other tools in this collection, BOWTIE, CUFFLINKS, and TOPHAT are also installed at the CUNY HPC Center. Additional information can be found at the SAMTOOLS home page here [http://samtools.sourceforge.net/index.shtml]. At the CUNY HPC Center S...")
 
m (Text replacement - "[pP][bB][sS]" to "SLURM")
 

Latest revision as of 20:36, 20 October 2022

SAMTOOLS is part of a sequence alignment and analysis tool chain developed at Johns Hopkins, the University of California at Berkeley, and Harvard, and distributed through the Center for Bioinformatics and Computational Biology. The other tools in this collection, BOWTIE, CUFFLINKS, and TOPHAT, are also installed at the CUNY HPC Center. Additional information can be found at the SAMTOOLS home page [http://samtools.sourceforge.net/index.shtml].

At the CUNY HPC Center, SAMTOOLS is installed on ANDY. SAMTOOLS is a collection of utilities for extracting, reformatting, and displaying nucleotide sequences. The primary tool is called 'samtools' and offers a large number of command-line options. For smaller tasks, SAMTOOLS can be run interactively, but it should be run in SLURM batch mode when larger, longer tasks are anticipated. Note that display tasks cannot be run in pure SLURM batch mode because their output is displayed on screen. Larger display tasks should be run in SLURM interactive mode as described in the SLURM section elsewhere in this document.

Below is an example SLURM script that will convert the 'toy.sam' file provided with the distribution from SAM to BAM format. This and all the example files can be copied from the local installation directory to your current location as follows:

cp /share/apps/samtools/default/examples/* .

To set all required environment variables and add the SAMTOOLS executables to your path, run the module load command (the modules utility is discussed in detail above):

module load samtools
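
Before running anything, it can help to confirm that the copy and the module load both took effect. The following is a minimal sketch, not part of the SAMTOOLS distribution: the function name 'preflight' is hypothetical, and it assumes the example files were copied into the current directory.

```shell
#!/bin/bash
# Pre-flight check (hypothetical helper, not part of SAMTOOLS):
# confirm that the executable and the example input are in place.
preflight() {
    local missing=0
    # 'command -v' succeeds only if samtools is reachable on PATH
    if ! command -v samtools >/dev/null 2>&1; then
        echo "samtools not on PATH; run 'module load samtools'" >&2
        missing=$((missing + 1))
    fi
    # The conversion example expects toy.sam in the current directory
    if [ ! -f toy.sam ]; then
        echo "toy.sam not found in $(pwd); copy the example files first" >&2
        missing=$((missing + 1))
    fi
    # Return status is the number of missing items (0 means ready)
    return "$missing"
}

if preflight; then
    echo "ready to run samtools"
else
    echo "environment incomplete"
fi
```

A return status of 0 means both checks passed; any other value is the number of items still missing.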

Running 'samtools' from the interactive prompt without any options prints a brief summary of its command-line syntax and options. Here is a SLURM batch script that does a short format conversion in batch mode:

#!/bin/bash
#SBATCH --partition=production
#SBATCH --job-name=SAMTLS_serial
#SBATCH --ntasks=1
#SBATCH --mem=2880M
#SBATCH --output=SAMTLS_serial.o%j
#SBATCH --error=SAMTLS_serial.e%j
#SBATCH --export=ALL

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# SLURM starts the job in the submission directory by default;
# the explicit change is kept as a safeguard
cd "$SLURM_SUBMIT_DIR"

# Run the SAM to BAM conversion
echo ">>>> Begin SAMTLS Serial Run ..."
samtools view -bS toy.sam > toy.bam 2> toy.err
echo ">>>> End   SAMTLS Serial Run ..."

This script can be saved to a file (say samtools_ser.job) and submitted with the command:

sbatch samtools_ser.job

Running this conversion test case should take less than 1 minute and will produce SLURM output and error files beginning with the job name 'SAMTLS_serial'. The primary SAMTOOLS application results are written to the user-specified file after the greater-than sign on the samtools command line; here it is named 'toy.bam'. The expression '2> toy.err' at the end of the command line directs Unix standard error to the file 'toy.err'. There must be no space between the '2' and the '>': written as '2 > toy.err', the shell passes '2' to samtools as an extra argument and redirects standard output to 'toy.err' instead. Users should always explicitly specify the names of the application's output files to ensure that they are written directly into the user's working directory, which has much more disk space than the SLURM spool directory on /var.

Details on the meaning of the SLURM script are covered below in the SLURM section. Only lines beginning with '#SBATCH' are read by SLURM as directives. The most important lines are '#SBATCH --ntasks=1' and '#SBATCH --mem=2880M'. The first requests 1 task, which by default is allocated 1 processor (core), and the second requests 2,880 MB of memory per node for the job. No placement directive is needed: SLURM places the job on a node in the partition where the requested resources are free. The partition name 'production' is carried over from the earlier PBS queue name and should be confirmed against the site's current partitions (for example with 'sinfo'). The master compute node that SLURM finally selects to run your job will be printed in the SLURM output file by the 'hostname' command.
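
Returning to the standard-error redirection on the samtools command line: the shell treats '2> file' and '2 > file' very differently, and this can be checked without SAMTOOLS installed. The sketch below is illustrative only. The function 'fake_tool' is a hypothetical stand-in for 'samtools view' that echoes its arguments to standard output and writes one line to standard error; the file names are likewise made up for the demonstration.

```shell
#!/bin/bash
# Stand-in for 'samtools view' (hypothetical): echoes its arguments
# to standard output and writes a diagnostic to standard error.
fake_tool() {
    echo "args: $*"
    echo "diagnostic" >&2
}

# Work in a scratch directory so no existing files are overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Correct form: '2>' with no space redirects standard error.
fake_tool input > good.out 2> good.err

# Faulty form: '2' becomes an extra argument, and the second '>'
# redirects standard output again, so bad.out is left empty and the
# diagnostic is not captured in any file.
fake_tool input > bad.out 2 > bad.err

echo "good.out holds: $(cat good.out)"
echo "good.err holds: $(cat good.err)"
if [ ! -s bad.out ]; then
    echo "bad.out is empty"
fi
echo "bad.err holds: $(cat bad.err)"
```

With the faulty form, the converted data would end up in the file intended for error messages, and the intended output file would be empty.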