PHRAP-PHRED

From HPCC Wiki

PHRAP is a program for shotgun sequence assembly, but it can also be used for small sequence assemblies. Its key features include its use of data quality information, both direct (from phred trace analysis) and indirect (from pairwise read comparisons), to delineate the likely accurate base calls in each read. This helps discriminate repeats, permits the use of the full reads in assembly, and allows a highly accurate consensus sequence to be generated. A probability of error is computed for each consensus sequence position, which can be used to focus human editing on particular regions. This helps to automate decision-making about where additional data are needed and provides users of the final sequence with information about local variations in quality. The PHRAP documentation is available here [1].

PHRED reads DNA sequencer trace data, calls bases, assigns quality values to the bases, and writes the base calls and quality values to output files. PHRED can read trace data from chromatogram files in the SCF, ABI, and ESD formats. It automatically determines the file format and whether the chromatogram file was compressed using gzip, bzip2, or UNIX compress. After calling bases, PHRED writes the sequences to files in FASTA format, the format suitable for XBAP, PHD format, or SCF format. Quality values for the bases are written to FASTA format files or PHD files, which the PHRAP sequence assembly program can use to increase the accuracy of the assembled sequence. The PHRED documentation is available here [2].
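
As a small illustration of what a quality value means, the sketch below converts a quality value Q into the estimated probability that the base call is wrong, using the standard phred definition Q = -10 * log10(P). The function name 'phred_error_prob' is made up for this example and is not part of PHRED.

```shell
#!/bin/bash
# Convert a phred quality value Q to the estimated probability that the
# base call is wrong, using the standard definition Q = -10 * log10(P).
# phred_error_prob is an illustrative name, not a PHRED command.
phred_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_error_prob 20   # quality 20: about 1 wrong call in 100
phred_error_prob 30   # quality 30: about 1 wrong call in 1000
```

Higher quality values therefore mark the base calls that an assembler can trust most.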

All the tools referenced above are installed at the CUNY HPC Center on both KARLE and ANDY. On KARLE they may be run directly: in command-line interactive mode, in the background (Unix batch), or within the CONSED GUI framework using the 'phredPhrap' scripting tool. The run times are generally short. On ANDY, they should be run from within the CUNY HPC Center SLURM batch processing framework if the jobs will take more than a minute or two of wall-clock time.

Below is a sample SLURM batch script for ANDY that reproduces each step the CONSED 'phredPhrap' script completes when it is run on KARLE. This script is meant to give you an idea of how any of these tools can be run in batch mode on ANDY. Not all of these steps are always required. SLURM job scripts that run only one or two of the tools present in this example can also be constructed. Details on the command-line options for each tool can be found in the manuals pointed to above.

Prior to running this example, a directory with example starting input data and the environment for each tool must be set up. One can obtain the standard test case from the PHRED installation tree on ANDY as follows:

$mkdir mytest
$
$cd mytest
$
$tar -xvf /share/apps/phred/default/data/STD.tar
$

This will create a collection of directories, some with input files, that will be referenced by the SLURM batch script. These directories are listed here:

[richard.walsh@andy standard]$ls -l 
total 28
drwx------ 2 richard.walsh hpcadmin 4096 2012-12-28 17:37 chromat_dir
drwx------ 2 richard.walsh hpcadmin 4096 2012-12-28 17:37 chromats_to_add
drwx------ 2 richard.walsh hpcadmin 4096 2013-01-02 12:56 edit_dir
drwx------ 2 richard.walsh hpcadmin 4096 2013-01-25 13:51 phdball_dir
drwx------ 3 richard.walsh hpcadmin 4096 2013-02-27 12:38 phd_dir
drwx------ 2 richard.walsh hpcadmin 4096 2013-01-25 13:51 sff_dir
drwx------ 2 richard.walsh hpcadmin 4096 2013-01-25 13:51 solexa_dir
[richard.walsh@andy standard]$
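
Before submitting a job, it can help to confirm that the two directories the batch script on this page refers to ('chromat_dir' for input and 'phd_dir' for output) exist where the job will run. The following is a minimal sketch; 'check_run_dirs' is a hypothetical helper name, not part of PHRED or CONSED.

```shell
#!/bin/bash
# Report which of the directories used by the batch script are missing
# under a given base directory (default: the current directory).
# check_run_dirs is a hypothetical helper, not part of PHRED or CONSED.
check_run_dirs() {
    local base="${1:-.}"
    local missing=0
    local d
    for d in chromat_dir phd_dir; do
        if [ ! -d "$base/$d" ]; then
            echo "missing: $base/$d"
            missing=1
        fi
    done
    return "$missing"
}

check_run_dirs . || echo "Change to the directory that holds the unpacked test data."
```

The function returns a non-zero status when anything is missing, so it can also be used as a guard at the top of a job script.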

Next, the environment for each of the required tools must be loaded using the modules command.

$
$module load phred
$module load phrap
$module load consed
$
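
A quick way to confirm that the modules did their job is to check that the four programs called by the batch script are now on the PATH and that CONSED_HOME, which the script uses to locate the screen files, is set. This is a sketch only; 'missing_tools' is a hypothetical helper, and it assumes the modules work by extending PATH and exporting CONSED_HOME.

```shell
#!/bin/bash
# Print the name of every listed program that is not found on PATH.
# missing_tools is a hypothetical helper, not part of any of the packages.
missing_tools() {
    local t
    local rc=0
    for t in "$@"; do
        if ! command -v "$t" >/dev/null 2>&1; then
            echo "not found on PATH: $t"
            rc=1
        fi
    done
    return "$rc"
}

# The four programs called by the batch script on this page.
missing_tools phred phd2fasta cross_match phrap \
    || echo "Load the phred, phrap, and consed modules and try again."

# The batch script builds SCREEN_PATH from this variable (assumed to be
# exported by the consed module).
[ -n "${CONSED_HOME:-}" ] || echo "CONSED_HOME is not set"
```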

Although CONSED is not used directly in this SLURM script, files in its installation tree are referenced and its module must therefore be loaded. With the above steps completed, the following SLURM batch script can be run on ANDY:

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name PHRED_PHRAP.job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=2880

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# You must explicitly change to the working directory in SLURM
cd $SLURM_SUBMIT_DIR

# Echoing the location of the phred_phrap parameter file
echo ""
echo "Using parameter file: $PHRED_PARAMETER_FILE"
echo ""

# Define the location of the consed screen files for cross_match
export SCREEN_PATH=${CONSED_HOME}/lib/screenLibs

# Just point to the serial executable to run
echo ">>>> Begin PHRED-PHRAP Batch Serial Run ..."
echo ""
echo ">>>> Running phred ... "
phred -id chromat_dir -pd phd_dir > phred.out 2>&1
echo "Done ..."
echo ">>>> Running phd2fasta ... "
phd2fasta -id phd_dir -os seqs_fasta -oq seqs_fasta.screen.qual > phd2fasta.out 2>&1
echo "Done ..."
echo ">>>> Running cross_match ... "
cross_match seqs_fasta ${SCREEN_PATH}/vector.seq -minmatch 12 -minscore 20 -screen > cross_match.out 2>&1
echo "Done ..."
echo ">>>> Running phrap ... "
phrap seqs_fasta.screen -new_ace > phrap.out 2>&1
echo "Done ..."
echo ""
echo ">>>> End   PHRED-PHRAP Batch Serial Run ..."

This script should be copied into a file in the same directory in which you 'untar-ed' the files above (here the name is 'mytest'). This would typically be done in an editor like 'vi' or 'emacs'. Assuming that the name given to this SLURM script file is 'phred_phrap.job', the SLURM job can be submitted with the following command:

sbatch phred_phrap.job

This script walks the original sequence data found in 'chromat_dir' through all of the steps that the 'phredPhrap' script would complete interactively on KARLE. Notice that four distinct programs are run, each with its own set of options. They produce all the 'seqs_fasta' files required for viewing in CONSED. Users may wish to run only one of the tools, in which case only one execution line, for perhaps 'phred' or 'phd2fasta', would be required in the script.
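
When a job runs only a subset of the tools, the repeated banner, command, and redirection pattern can be wrapped in a small shell function so that each step becomes one line. The sketch below is one possible way to do this; 'run_step' is a hypothetical helper and is not part of PHRED, PHRAP, or CONSED.

```shell
#!/bin/bash
# run_step NAME COMMAND [ARGS...]
# Prints the same style of banner as the batch script, runs the command,
# and sends its standard output and standard error to NAME.out.
# run_step is a hypothetical helper, not part of PHRED, PHRAP, or CONSED.
run_step() {
    local name="$1"
    shift
    echo ">>>> Running $name ... "
    if "$@" > "$name.out" 2>&1; then
        echo "Done ..."
    else
        echo "FAILED: $name (see $name.out)"
        return 1
    fi
}

# Example: a job that only calls bases would need just this one line.
# run_step phred phred -id chromat_dir -pd phd_dir
```

Unlike the plain script, this version also reports when a step exits with a non-zero status, which makes it easier to see where a chain of steps stopped.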

The job should take only a short time to run and will produce SLURM output and error files beginning with the job name 'PHRED_PHRAP', along with a number of tool-specific output files. The primary application results will be written into the user-specified file at the end of each command line after the greater-than sign. Here, four executables are run and each writes an output file named 'XXX.out'. The expression '2>&1' combines Unix standard output from the program with Unix standard error. Users should always explicitly specify the name of the application's output file in this way to ensure that it is written directly into the user's working directory, which has much more disk space than the SLURM spool directory on /var.
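
The effect of '2>&1' can be seen with a stand-in program that writes one line to each stream. This is a demonstration only; 'noisy_tool' is a made-up function and the files are written to a temporary directory.

```shell
#!/bin/bash
# A stand-in for a tool that writes to both output streams.
# noisy_tool is hypothetical and exists only for this demonstration.
noisy_tool() {
    echo "result line"          # goes to standard output
    echo "warning line" >&2     # goes to standard error
}

demo_dir=$(mktemp -d)

# Only standard output is captured; the warning is not written to the file.
noisy_tool > "$demo_dir/stdout_only.out"

# Both streams are captured in one file, as in the batch script.
noisy_tool > "$demo_dir/both.out" 2>&1

wc -l < "$demo_dir/stdout_only.out"   # 1 line
wc -l < "$demo_dir/both.out"          # 2 lines
```

Note that the order matters: '2>&1' must come after the '> file' redirection for standard error to follow standard output into the file.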

Details on the meaning of the SLURM script options are covered above in the SLURM section. The most important lines are '#SBATCH --nodes=1', '#SBATCH --ntasks=1', and '#SBATCH --mem=2880'. Together they instruct SLURM to allocate 1 task (one processor core) on 1 node with 2,880 MB of memory, and SLURM places the job on any compute node with the required resources available. All the jobs run with this script are assumed by SLURM to be serial jobs.
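
Because a mistyped directive line is easy to overlook, it can be useful to list the '#SBATCH' lines of a job file before submitting it. The following is a minimal sketch using standard Unix tools; 'list_sbatch_directives' is a hypothetical helper name.

```shell
#!/bin/bash
# Print the options given on the #SBATCH lines of a job script, one per line.
# list_sbatch_directives is a hypothetical helper name.
list_sbatch_directives() {
    grep '^#SBATCH' "$1" | sed 's/^#SBATCH[[:space:]]*//'
}

# Example use on the job file from this page:
# list_sbatch_directives phred_phrap.job
```

For the script on this page, the listing should show the partition, the job name, and the three resource options discussed above.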