QIIME


QIIME is installed in a Python/Anaconda environment. There are two builds of Python/Anaconda - one with the Python 2.7.13 interpreter and one with the Python 3.6.0 interpreter. The text below refers to Python 2.7.13. In order to use QIIME, users must load the QIIME module and activate the environment. The following two lines will do the job:

module load qiime/1.9.1_FULL_P2.7
source activate qiime1
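
To confirm that the environment is active, a quick check is to see which Python interpreter is now first on the PATH; it should match the Python executable reported by print_qiime_config.py further down this page:

(qiime1) [user_id@penzias ~]$ which python
/share/usr/compilers/python/miniconda2/envs/qiime1/bin/python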

The P2.7 above indicates that the Python interpreter used is Python 2.7.X. Because QIIME is a pipeline that relies heavily upon external applications, it has a variety of ways in which it "parallelizes" tasks for multiprocessor computation. These include:

1. Multi-threading on a single node 
2. Auto-generating serial jobs to the cluster, each working on a separate subtask ("Workers")
3. Actual MPI

Depending upon which subset of QIIME's functionality is needed, the user may use one or more of the above forms of parallelization. During the initial start-up phase QIIME concentrates on "denoising", which can take significant resources. Past this phase, QIIME generates the user-specified number of threads (single-node use) or cluster worker jobs. QIIME subdivides denoising into subtasks that are handled at various stages by individual threads or worker jobs (depending upon the setup). The documentation on QIIME parallelization is not very detailed, but it is clear that the user can utilize EITHER auto-job generation OR node threading, NEVER BOTH at the same time. QIIME supplies a few scripts for determining how subtasks are to be processed - as worker jobs (cluster) or as threads (single node). The relevant scripts are as follows:

start_parallel_jobs.py        - used for threading on a single node
start_parallel_jobs_torque.py - used for '''auto-generating''' single-node, single-core cluster worker jobs.
                                Please note that each worker job utilizes 1 node and 1 core.
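
Which of the two scripts QIIME actually calls is controlled by the cluster_jobs_fp setting in the user's qiime.config file, described below. A quick way to see the value currently in effect is:

print_qiime_config.py | grep cluster_jobs_fp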

When invoking QIIME for the first time it is a good idea to check that all modules are properly connected. Thus, on a first run users should run the command:

print_qiime_config.py -t

The above command will print QIIME's initial set-up and run its internal pass/fail tests:

(qiime1) [user_id@penzias ~]$ print_qiime_config.py -t

System information
==================
         Platform:	linux2
   Python version:	2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:09:15)  [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Python executable:	/share/usr/compilers/python/miniconda2/envs/qiime1/bin/python

QIIME default reference information
===================================
For details on what files are used as QIIME's default references, see here:
 https://github.com/biocore/qiime-default-reference/releases/tag/0.1.3

Dependency versions
===================
          QIIME library version:	1.9.1
           QIIME script version:	1.9.1
qiime-default-reference version:	0.1.3
                  NumPy version:	1.10.4
                  SciPy version:	0.17.1
                 pandas version:	0.18.1
             matplotlib version:	1.4.3
            biom-format version:	2.1.5
                   h5py version:	2.6.0 (HDF5 version: 1.8.16)
                   qcli version:	0.1.1
                   pyqi version:	0.3.2
             scikit-bio version:	0.2.3
                 PyNAST version:	1.2.2
                Emperor version:	0.9.51
                burrito version:	0.9.1
       burrito-fillings version:	0.1.1
              sortmerna version:	SortMeRNA version 2.0, 29/11/2014
              sumaclust version:	SUMACLUST Version 1.0.00
                  swarm version:	Swarm 1.2.19 [Mar  5 2016 16:56:02]
                          gdata:	Installed.

QIIME config values
===================
For definitions of these settings and to learn how to configure QIIME, see here:
 http://qiime.org/install/qiime_config.html
 http://qiime.org/tutorials/parallel_qiime.html

                     blastmat_dir:	None
      pick_otus_reference_seqs_fp:	/share/usr/compilers/python/miniconda2/envs/qiime1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
                         sc_queue:	all.q
      topiaryexplorer_project_dir:	None
     pynast_template_alignment_fp:	/share/usr/compilers/python/miniconda2/envs/qiime1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta
                  cluster_jobs_fp:	start_parallel_jobs.py
pynast_template_alignment_blastdb:	None
assign_taxonomy_reference_seqs_fp:	/share/usr/compilers/python/miniconda2/envs/qiime1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
                     torque_queue:	production
                    jobs_to_start:	4
                       slurm_time:	None
            denoiser_min_per_core:	50
assign_taxonomy_id_to_taxonomy_fp:	/share/usr/compilers/python/miniconda2/envs/qiime1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt
                         temp_dir:	/tmp/
                     slurm_memory:	None
                      slurm_queue:	None
                      blastall_fp:	blastall
                 seconds_to_sleep:	1

QIIME base install test results
===============================
.........
----------------------------------------------------------------------
Ran 9 tests in 0.021s

OK

Each user must create the following qiime.config file in his/her own home directory. The file (for the cluster environment) is provided below, so users can copy and paste it.

cluster_jobs_fp /share/usr/compilers/python/miniconda2/envs/qiime1/bin/start_parallel_jobs_torque.py
python_exe_fp /share/usr/compilers/python/miniconda2/envs/qiime1/bin/python
working_dir  $HOME
blastmat_dir $HOME
blastall_fp blastall
pynast_template_alignment_fp
pynast_template_alignment_blastdb
template_alignment_lanemask_fp
jobs_to_start 4
seconds_to_sleep 60
qiime_scripts_dir /share/usr/compilers/python/miniconda2/envs/qiime1/bin
temp_dir /tmp/
denoiser_min_per_core 50
cloud_environment False
topiaryexplorer_project_dir
torque_queue main
assign_taxonomy_reference_seqs_fp
assign_taxonomy_id_to_taxonomy_fp
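
Once filled in, the file has to be placed where QIIME will look for it. Below is a minimal sketch of installing and verifying it, assuming the upstream QIIME 1 convention of reading $HOME/.qiime_config (the QIIME_CONFIG_FP environment variable can also point to an explicit path):

cp qiime.config $HOME/.qiime_config         # default per-user location searched by QIIME 1
export QIIME_CONFIG_FP=$HOME/.qiime_config  # alternative: point QIIME at an explicit path
print_qiime_config.py -t                    # the "QIIME config values" section should now reflect the edits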

Lines which may be altered by the user are:

cluster_jobs_fp (see the example after this list)
   - Set to "start_parallel_jobs_torque.py" for cluster worker jobs
   - Set to "start_parallel_jobs.py" for single-node threading
working_dir   
blastmat_dir  (blast matrices location)
jobs_to_start  (the MAXIMUM number of jobs to auto-spawn)
seconds_to_sleep (change is NOT recommended)
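
For example, a hypothetical pair of one-liners for switching cluster_jobs_fp between the two modes (assuming the file was installed as $HOME/.qiime_config as sketched above):

# auto-generated cluster worker jobs
sed -i 's|^cluster_jobs_fp.*|cluster_jobs_fp /share/usr/compilers/python/miniconda2/envs/qiime1/bin/start_parallel_jobs_torque.py|' $HOME/.qiime_config
# single-node threading
sed -i 's|^cluster_jobs_fp.*|cluster_jobs_fp /share/usr/compilers/python/miniconda2/envs/qiime1/bin/start_parallel_jobs.py|' $HOME/.qiime_config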

Lines which the user must check carefully:

qiime_scripts_dir - must always point to /share/usr/compilers/python/miniconda2/envs/qiime1/bin
torque_queue      - must be set to "production", which is the main queue on PENZIAS.

The meaning of each of the lines of qiime.config is:

cluster_jobs_fp : path to your cluster jobs file.  

python_exe_fp : path to python executable. Just use python. 
working_dir : a directory where work should be performed when running in parallel. USUALLY $HOME

blastmat_dir : directory where BLAST substitution matrices are stored

blastall_fp : path to blastall executable

pynast_template_alignment_fp : default template alignment to use with PyNAST as a fasta file

pynast_template_alignment_blastdb : default template alignment to use with PyNAST as a pre-formatted BLAST database

template_alignment_lanemask_fp : default alignment lanemask to use with filter_alignment.py

jobs_to_start : default number of jobs to start when running QIIME in parallel. 
                      don’t make this more than the available cores/processors on your system

seconds_to_sleep : number of seconds to wait when checking whether parallel jobs have completed

qiime_scripts_dir : directory where QIIME scripts can be found

temp_dir : directory for storing temporary files created by QIIME scripts. when a script completes successfully, any 
                 temporary files that it created are cleaned up.

denoiser_min_per_core : minimum number of flowgrams to denoise per core in parallel denoiser runs

cloud_environment : used only by the n3phele system. you should not need to modify this value

topiaryexplorer_project_dir : directory where TopiaryExplorer is installed

torque_queue : default queue to submit jobs to when using parallel QIIME with torque or SLURM. DEFAULT is production

assign_taxonomy_reference_seqs_fp : default reference database to use with assign_taxonomy.py (and parallel versions)

assign_taxonomy_id_to_taxonomy_fp : default id-to-taxonomy map to use with assign_taxonomy.py (and parallel versions)

sc_queue : default queue to submit jobs to when running parallel QIIME on StarCluster

In the above file the jobs_to_start value must match the number of cores requested in the SLURM script. For example, a request for 4 cores on a node (e.g. #SBATCH --nodes=1 --ntasks=1 --cpus-per-task=4) requires jobs_to_start to also be 4; see the sketch after the list below. By default, QIIME is configured to run parallel jobs on systems without a queueing system (e.g. a laptop, desktop, or single AWS instance), where the user tells a parallel QIIME script how many jobs should be submitted. On HPC systems, however, QIIME should be submitted via a SLURM batch script. In this case there are 2 scenarios:

     1. Jobs submitted for multithreading on a single node
     2. Jobs submitted as single-threaded, single-core workers across several nodes
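
For instance, a resource request for the first scenario that is consistent with jobs_to_start 4 might look like the following sketch (partition and memory values are taken from the batch scripts shown later on this page):

#!/bin/bash
#SBATCH --partition production
#SBATCH --nodes=1
#SBATCH --ntasks=1
# request 4 cores on the node: this must match jobs_to_start 4 in qiime.config
#SBATCH --cpus-per-task=4
#SBATCH --mem=2880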

To make the QIIME parallelization scheme useful, the user must create a separate file which holds the list of "tasks" (commands) to be started on the cluster. This is a plain text file, so it must be created with a text editor such as vi in the Linux environment. Please do not use a word processor to create this file. For example:

pick_otus.py -i inseqs_file1.fasta
pick_otus.py -i inseqs_file2.fasta
pick_otus.py -i inseqs_file3.fasta
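
The same file can also be created directly from the shell, for example with a here-document:

cat > test_jobs.txt <<'EOF'
pick_otus.py -i inseqs_file1.fasta
pick_otus.py -i inseqs_file2.fasta
pick_otus.py -i inseqs_file3.fasta
EOF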

The file here is named test_jobs.txt. In order to be used it has to be passed to one of the cluster job scripts described above. If passed to a cluster jobs script, the above 3 lines will start three separate jobs, one for each command. The name of the cluster job script is defined by the cluster_jobs_fp variable in the qiime.config file. Remember that every user must have the qiime.config file in his/her home directory and that the latter must be properly edited. The general syntax for passing a job list file such as test_jobs.txt to the cluster job script is:

CLUSTER_JOBS_FP -ms job_list.txt JOB_ID

Here CLUSTER_JOBS_FP is the path to the cluster job script (start_parallel_jobs.py OR start_parallel_jobs_torque.py). A SLURM script to start a multithreaded job on a node on PENZIAS then looks like:

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name qiime_test_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=2880

/share/usr/compilers/python/miniconda2/envs/qiime1/bin/start_parallel_jobs.py -ms test_jobs.txt  JOB_ID  -q production
echo "QIIME  job is done."

Please note that the number of threads is generated automatically but is limited from above by the jobs_to_start value in qiime.config. In the above script JOB_ID is a prefix which will be added to each of the jobs. -ms combines two options of start_parallel_jobs.py: -m means "make" the jobs and -s means "submit" them. The same JOB_ID is also used by the QIIME parallel scripts when creating names for temporary files and directories, but the user script does not necessarily need to do anything with this information. The parallel variants of the QIIME scripts use the same parameters as the serial versions, with some additional options.

The next option is 1-node, 1-core parallelization across several nodes. To do so, the other script, start_parallel_jobs_torque.py, must be used. This script takes the following parameters/options:

Syntax: start_parallel_jobs_torque.py [options]

Input Arguments: [OPTIONAL]

-m, --make_jobs      Make the job files [default: None]
-s, --submit_jobs    Submit the job files [default: None]
-q, --queue          Name of queue to submit to [default: friendlyq]
-j, --job_dir        Directory to store the jobs [default: jobs/]
-w, --max_walltime   Maximum time in hours the job will run for [default: 72]
-c, --cpus           Number of CPUs to use [default: 1]
-n, --nodes          Number of nodes to use [default: 1]
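
For illustration, a hypothetical invocation that combines the options above with the -ms syntax used throughout this page (the values are examples only) could be:

start_parallel_jobs_torque.py -ms test_jobs.txt -q production -w 24 -c 1 -n 4 -j jobs/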

Let's say the user wants to submit a parallel job on PENZIAS. First the user must create a cluster job file similar to test_jobs.txt, holding the programs to be executed - one per line, just as shown above. Then the following script can be used to run on 4 nodes:


#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name qiime_test_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=2880


start_parallel_jobs_torque.py -ms test_jobs.txt -c 1 -n 4 -q production
echo "QIIME  job is done."

Finally, below is an example of how to run the QIIME script align_seqs.py in SERIAL on PENZIAS:

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name qiime_test_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=2880

align_seqs.py -i input.fasta -m muscle
echo "QIIME  job is done."

And in PARALLEL (here the commands to run in parallel are the ones listed in test_jobs.txt):

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name qiime_test_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=2880

start_parallel_jobs_torque.py -ms test_jobs.txt -c 4 -n 4 -q production
echo "QIIME  job is done."

Users are strongly encouraged to familiarize themselves with the QIIME tutorials available at: http://qiime.org/tutorials/index.html