Applications Environment
Using Modules to Run your Applications
Modules is a software package that provides for the fast and convenient management of the components of a user's environment via modulefiles. When executed by the module command each module file fully configures the environment for its associated application or application group.
The modules configuration language allows for the management of applications environment conflicts and dependencies as well. The modules software allows users to load (and unload and reload) an application and/or system environment that is specific to their needs and avoids the need to set and manage a large, one-size-fits-all, generic environment for everyone at login.
Modules is the default approach to managing the user applications environment. Module version 3.2.9 is the default on the CUNY HPC Center systems. The one exception is the CUNY HPC Center system BOB, which is currently used almost entirely for Gaussian jobs and will NOT be reconfigured with the modules software.
- Modules, Learning by Example
- Example 1, Basic Non-Cray System
- Example 2, Less Basic From SALK (Cray System)
Using the module package, users can easily set a collection of environment variables that are specific to their compilation, parallel programming, and/or application requirements on the HPC Center's systems. The modules system also makes it convenient to advance or regress compiler, parallel programming, or application versions when defaults are found to have bugs or performance issues. Whatever the task, the modules package adjusts the environment in an orderly way, altering or setting environment variables such as PATH, MANPATH, and LD_LIBRARY_PATH. It also provides basic descriptive information about the application version being loaded and the purpose of the modulefile through the module help facility.
In addition to each application-specific modulefile, the module package functions through the use of a collection of sub-commands given after the initial module command itself, as in "module list" for instance. All these module sub-commands are described in detail in the module man page ("man module"), but a list of some of the more important and commonly used sub-commands is provided here:
Module sub-commands: list load unload switch avail show help purge
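As a quick orientation, the sub-commands listed above might be used as follows. This is only a sketch, assuming the Modules 3.2.x command set described in "man module"; "mathematica" is the modulefile used in Example 1, and the fallback function at the top is a stand-in (not part of Modules) so that the lines can be tried on a machine where the module command is not installed.

```shell
# Dry-run fallback (an assumption for illustration): if the module command
# is missing, define a stand-in that only echoes what would have been run.
if ! type module >/dev/null 2>&1; then
    module() { echo "module $*"; }
fi

module avail                  # list every modulefile installed on this system
module list                   # list the modulefiles loaded in this shell
module help mathematica       # print the help text written into the modulefile
module show mathematica       # show the environment changes the modulefile would make
module load mathematica       # apply those changes (default version)
module switch mathematica mathematica/8.0.4   # unload the first, load the second
module unload mathematica     # undo the changes made by the load
# module purge                # would unload every loaded modulefile; left commented out
```

The purge line is commented out on purpose: on a system with many default modules, unloading everything is rarely what a new user wants.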
Modules, Learning by Example
The best way to explain how to use the modules package and its sub-commands is to consider some simple examples of typical workflows that involve modules. Here are two examples. Again, for a more complete description of the modules package please refer to "man module".
Example 1, Basic Non-Cray System
Without any custom or local environment path settings, the default PATH at login would look something like this, with no compiler, parallel programming model, or application-specific information in it:
username@service0:~> echo $PATH | tr -s ':' '\n'
/scratch/username/bin
/usr/local/bin
/usr/bin
/bin
/usr/bin/X11
/usr/X11R6/bin
/usr/games
/opt/c3/bin
We take note that there appears to be no path to the application that we are interested in running which is Wolfram's Mathematica in this example. Typing "which math" to find Mathematica ("math" is the command-line name for Mathematica) at the terminal yields:
username@service0:~> which math
which: no math in (/scratch/username/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/c3/bin)
The Mathematica executable "math" is not found in the default PATH variable defined by the system at login. Modules can be used to remedy this problem by adding the required path. To check which module files (if any) are already loaded into our environment, we can type the "module list" sub-command at the terminal prompt:
username@service0:~> module list
No Modulefiles Currently Loaded.
username@service0:~>
No modules loaded. So the module file for Mathematica has not been loaded and it is no surprise that the Mathematica command-line "math" was not found. The next question is has the HPC Center installed Mathematica on this system and created a module file for it? To find this out we use the "module avail" sub-command:
username@service0:~> module avail

---------------- /share/apps/modules/default/modulefiles_UserApplications ----------------
adf/2012.01(default)     cesm/1.0.3                    hoomd/0.9.2(default)        ncar/5.2.0_NCL(default)       pgi/12.3(default)
auto3dem/4.02(default)   cesm/1.0.4(default)           intel/12.1.3.293(default)   nwchem/6.1.1(default)         phoenics/2009(default)
autodock/4.2.3(default)  cuda/5.0(default)             ls-dyna/6.0.0(default)      octopus/4.0.0(default)        r/2.14.1(default)
beagle/0.2(default)      gromacs/4.5.5_32bit           mathematica/8.0.4(default)  openmpi/1.5.5_intel(default)  wrf/3.4.0(default)
best/2.2L(default)       gromacs/4.5.5_64bit(default)  matlab/R2012a(default)      openmpi/1.5.5_pgi

---------------- /share/apps/modules/default/modulefiles_System ----------------
module-info  modules  version/3.2.9
The listing shows all available module files on this system, both those that are user-application related and those that are more system related. As shown in the output, these two types of module files are stored in different directories. Looking through the application list, there is a module for Mathematica version 8.0.4, which also happens to be the default. On this system, the modules package has only just been installed, and therefore most applications have only a single version adapted to the module system, and that version is the default.
The module file that is responsible for setting up the correct environment needed to run Mathematica can now be loaded:
module load mathematica
Because there is only one version and it is the default, there is no need to include the version-specific extension to load it. To explicitly load version 8.0.4 (or any other specific and non-default version) one would use:
module load mathematica/8.0.4
After loading, the environmental PATH variable includes the path to Mathematica:
username@service0:~> echo $PATH | tr -s ':' '\n'
/scratch/username/bin
/usr/local/bin
/usr/bin
/bin
/usr/bin/X11
/usr/X11R6/bin
/usr/games
/opt/c3/bin
/share/apps/mathematica/8.0.4/Executables
This can be verified by rerunning the "which math" command:
username@service0:~> which math
/share/apps/mathematica/8.0.4/Executables/math
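The effect of loading and unloading on PATH can be imitated in plain shell. The sketch below is not how the Modules package is implemented; "path_append" and "path_remove" are hypothetical helper names that mimic adding a directory on load and dropping it on unload, using the Mathematica directory from the output above and a shortened sample PATH.

```shell
# Hypothetical helpers that imitate what a modulefile does to PATH.

path_append() {
    # $1 = colon-separated list, $2 = directory to add (skipped if already present)
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "${1:+$1:}$2" ;;
    esac
}

path_remove() {
    # $1 = colon-separated list, $2 = directory to drop (exact match only)
    printf '%s\n' "$1" | tr ':' '\n' | grep -Fxv -- "$2" | paste -sd: -
}

demo_path="/usr/local/bin:/usr/bin:/bin"
mathematica_dir="/share/apps/mathematica/8.0.4/Executables"

# Imitate "module load mathematica"
loaded=$(path_append "$demo_path" "$mathematica_dir")
echo "$loaded" | tr -s ':' '\n'

# Imitate "module unload mathematica"
unloaded=$(path_remove "$loaded" "$mathematica_dir")
echo "$unloaded" | tr -s ':' '\n'
```

The helpers work on a copy of the list rather than on PATH itself, so running them does not disturb the current shell environment.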
Once the head or login node environment variables are properly set, one can create a SLURM script to run a Mathematica job on a compute node and ensure that the head or login node environment just set is passed on to the compute nodes by using the "#SLURM -V" option inside the SLURM script:
#!/bin/bash
#SLURM -N mmat8_serial1
#SLURM -q production
#SLURM -l select=1:ncpus=1:mem=1920mb
#SLURM -l place=free
#SLURM -V

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
hostname

# You must explicitly change to the working directory in SLURM
cd $SLURM_O_WORKDIR

# Just point to the serial executable to run
echo ">>>> Begin Mathematica Serial Run ..."
echo ""
math -run <test_run.nb > output
echo ""
echo ">>>> End Mathematica Serial Run ..."
Since the PATH variable in the login environment now includes the location of the Mathematica executable and the "#SLURM -V" option ensures that this is passed to the compute node that the job is run on, the "math" command in the SLURM script will be executed without environment-related problems.
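Under a stock Slurm installation, the sbatch command reads its directives from lines beginning with "#SBATCH", so a site that does not accept the "#SLURM" form shown above would need something like the following. This is a minimal sketch under stated assumptions: the partition name "production" is carried over from the script above, the file name "mmat8_serial1.sbatch" is hypothetical, and the actual settings on the CUNY HPC Center systems may differ, so check with HPC Center staff before relying on it.

```shell
# Write the job file; the quoted 'EOF' keeps $SLURM_SUBMIT_DIR unexpanded here.
cat > mmat8_serial1.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=mmat8_serial1
#SBATCH --partition=production
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1920M
#SBATCH --export=ALL

# Report the compute node the job landed on
echo -n ">>>> Slurm compute node is: "
hostname

# Slurm normally starts the job in the submission directory;
# the cd only makes that explicit.
cd "$SLURM_SUBMIT_DIR"

echo ">>>> Begin Mathematica Serial Run ..."
math -run < test_run.nb > output
echo ">>>> End Mathematica Serial Run ..."
EOF

# Submit from the login node after "module load mathematica":
#   sbatch mmat8_serial1.sbatch
```

In this form "--export=ALL" plays the role of passing the login environment, including the PATH set by the module, to the job. It is also sbatch's default behavior, so the line mainly documents the intent.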
Example 2, Less Basic From SALK (Cray System)
models, libraries, and applications. In addition, SALK uses a custom high-performance interconnect with its own libraries, has its own compiler suite and compiling system, and many other custom libraries. Setting up and/or tearing down a given environment that makes all this work correctly is more complicated than it is on the other systems at the HPC Center. Modules simplifies this process tremendously for the user.
Here is an example of how to swap out the default Cray compiler environment on SALK and swap in the compiler suite from the Portland Group, including all the right MPI libraries from Cray. In this case, we swap in a new release of the Portland Group compilers, not the current default on the Cray, and the version of the NETCDF library that has been compiled with the Portland Group compilers.
Having logged into SALK, we determine what modules have been loaded by default with "module list":
user@salk:~> module list
Currently Loaded Modulefiles:
  1) modules/3.2.6.6
  2) nodestat/2.2-1.0400.31264.2.5.gem
  3) sdb/1.0-1.0400.32124.7.19.gem
  4) MySQL/5.0.64-1.0000.5053.22.1
  5) lustre-cray_gem_s/1.8.6_2.6.32.45_0.3.2_1.0400.6453.5.1-1.0400.32127.1.90
  6) udreg/2.3.1-1.0400.4264.3.1.gem
  7) ugni/2.3-1.0400.4374.4.88.gem
  8) gni-headers/2.1-1.0400.4351.3.1.gem
  9) dmapp/3.2.1-1.0400.4255.2.159.gem
 10) xpmem/0.1-2.0400.31280.3.1.gem
 11) hss-llm/6.0.0
 12) Base-opts/1.0.2-1.0400.31284.2.2.gem
 13) xtpe-network-gemini
 14) cce/8.0.7
 15) acml/5.1.0
 16) xt-libsci/11.1.00
 17) pmi/3.0.0-1.0000.8661.28.2807.gem
 18) rca/1.0.0-2.0400.31553.3.58.gem
 19) xt-asyncpe/5.13
 20) atp/1.5.1
 21) PrgEnv-cray/4.0.46
 22) xtpe-mc8
 23) cray-mpich2/5.5.3
 24) SLURM/11.3.0.121723
From the list, we see that the Cray Programming Environment ("PrgEnv-cray/4.0.46") and the Cray Compiler environment are loaded ("cce/8.0.7") by default among other things (SLURM, MPICH, etc.). To unload these Cray modules and load in the Portland Group (PGI) equivalents we need to know the names of the PGI modules. The "module avail" command will tell us this:
user@salk:~> module avail
.
.
(several sections of output removed)
.
.
------------------------------------------------ /opt/modulefiles -----------------------------------------------------
Base-opts/1.0.2-1.0400.31284.2.2.gem(default)      gcc/4.1.2                    SLURM/11.2.0.113417
PrgEnv-cray/3.1.61                                 gcc/4.2.4                    SLURM/11.3.0.121723(default)
PrgEnv-cray/4.0.46(default)                        gcc/4.4.2                    petsc/3.1.08
PrgEnv-gnu/3.1.61                                  gcc/4.4.4                    petsc/3.1.09
PrgEnv-gnu/4.0.46(default)                         gcc/4.5.1                    petsc-complex/3.1.08
PrgEnv-intel/3.1.61                                gcc/4.5.2                    petsc-complex/3.1.09
PrgEnv-intel/4.0.46(default)                       gcc/4.5.3                    pgi/12.10
PrgEnv-pathscale/3.1.61                            gcc/4.6.1                    pgi/12.3
PrgEnv-pathscale/4.0.46(default)                   gcc/4.7.1(default)           pgi/12.6(default)
PrgEnv-pgi/3.1.61                                  hss-llm/6.0.0(default)       pgi/12.8
PrgEnv-pgi/4.0.46(default)                         intel/12.1.1.256             wrf/3.3.0
acml/4.4.0                                         intel/12.1.4.319(default)    wrf/3.4.0(default)
acml/5.1.0(default)                                intel/12.1.5.339             xt-asyncpe/5.01
admin-modules/1.0.2-1.0400.31284.2.2.gem(default)  java/jdk1.6.0_24             xt-asyncpe/5.05
amber/12(default)                                  java/jdk1.7.0_03(default)    xt-asyncpe/5.13(default)
cce/8.0.7(default)                                 mazama/6.0.0(default)        xt-libsci/11.0.00
chapel/1.4.0                                       modules/3.2.6.6(default)     xt-libsci/11.0.04
chapel/1.5.0(default)                              mrnet/3.0.0(default)         xt-libsci/11.1.00(default)
fftw/2.1.5.3                                       pathscale/4.0.12.1(default)  xt-papi/4.2.0
fftw/3.2.2.1(default)                              pathscale/4.0.9              xt-papi/4.3.0(default)
fftw/3.3.0.1                                       SLURM/11.1.0.111761
There are several versions of the PGI compilers and two versions of the PGI Programming Environment for the Cray (SALK). We are interested in loading PGI's 12.10 release (not the default, which is "pgi/12.6") and the most current release of the PGI programming environment ("PrgEnv-pgi/4.0.46"), which is the default. The PGI programming environment for the Cray provides all the environment settings required to use the PGI compilers on the Cray, which include a number of custom libraries.
Here is a series of module commands to unload the Cray defaults, load the PGI modules mentioned, and load version 4.2.0 of NETCDF compiled with the PGI compilers.
user@salk:~> module unload PrgEnv-cray
user@salk:~> module load PrgEnv-pgi
user@salk:~> module unload pgi
user@salk:~> module load pgi/12.10
user@salk:~>
user@salk:~> module load netcdf/4.2.0
user@salk:~>
user@salk:~> cc -V
/opt/cray/xt-asyncpe/5.13/bin/cc: INFO: Compiling with CRAYPE_COMPILE_TARGET=native.

pgcc 12.10-0 64-bit target on x86-64 Linux
Copyright 1989-2000, The Portland Group, Inc.  All Rights Reserved.
Copyright 2000-2012, STMicroelectronics, Inc.  All Rights Reserved.
Several comments about this series of commands are perhaps useful. First, the first three commands do not include version numbers and will therefore load or unload the current default versions. In the third line, we unload the default version of the PGI compiler (version 12.6), which is loaded with the rest of the PGI Programming Environment in the second line. We then load the non-default and more recent release from PGI, version 12.10, in the fourth line. Later, we load NETCDF version 4.2.0 which, because we have already loaded the PGI Programming Environment, will load the version of NETCDF 4.2.0 compiled with the PGI compilers. Finally, we check to see which compiler the Cray "cc" compiler wrapper actually invokes after this sequence of module commands. We see that indeed "pgcc" version 12.10 is being used.
We can confirm all this by again entering "module list". Notice that the Cray-related compiler modules have been replaced by those from PGI and that NETCDF version 4.2.0 has been loaded. We are ready to use the new PGI compiler suite based environment. It is left as an exercise to the reader to figure out how the series of commands listed above could have been shortened by using the "module swap" sub-command.
user@salk:~> module list
Currently Loaded Modulefiles:
  1) modules/3.2.6.6
  2) nodestat/2.2-1.0400.31264.2.5.gem
  3) sdb/1.0-1.0400.32124.7.19.gem
  4) MySQL/5.0.64-1.0000.5053.22.1
  5) lustre-cray_gem_s/1.8.6_2.6.32.45_0.3.2_1.0400.6453.5.1-1.0400.32127.1.90
  6) udreg/2.3.1-1.0400.4264.3.1.gem
  7) ugni/2.3-1.0400.4374.4.88.gem
  8) gni-headers/2.1-1.0400.4351.3.1.gem
  9) dmapp/3.2.1-1.0400.4255.2.159.gem
 10) xpmem/0.1-2.0400.31280.3.1.gem
 11) hss-llm/6.0.0
 12) Base-opts/1.0.2-1.0400.31284.2.2.gem
 13) xtpe-network-gemini
 14) xtpe-mc8
 15) cray-mpich2/5.5.3
 16) SLURM/11.3.0.121723
 17) xt-libsci/11.1.00
 18) pmi/3.0.0-1.0000.8661.28.2807.gem
 19) xt-asyncpe/5.13
 20) atp/1.5.1
 21) PrgEnv-pgi/4.0.46
 22) pgi/12.10
 23) hdf5/1.8.8
 24) netcdf/4.2.0
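One possible way to shorten the sequence of unload and load commands, sketched under the assumption that the "swap" sub-command (a synonym for "switch") behaves as described in "man module": each unload/load pair collapses into a single command. The fallback function at the top is a stand-in, not part of Modules, so that the lines can be tried on a machine without the module command.

```shell
# Dry-run fallback (an assumption for illustration): echo the commands
# when the module command is not available.
if ! type module >/dev/null 2>&1; then
    module() { echo "module $*"; }
fi

module swap PrgEnv-cray PrgEnv-pgi   # replaces "unload PrgEnv-cray" + "load PrgEnv-pgi"
module swap pgi pgi/12.10            # replaces "unload pgi" + "load pgi/12.10"
module load netcdf/4.2.0             # unchanged: nothing is being replaced here

# On SALK, "cc -V" and "module list" can then be used to confirm the result.
```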
Applications
This is an overview of the user-level HPC applications supported by the HPC Center staff for the benefit of the entire CUNY HPC user community. A user can choose to install any application that they are licensed for in their own account, or appeal (based on general interest) to have it installed by HPC Center staff in the shared system directory (usually /share/apps).
Not every user-level application is installed on every system. This is because system architectural differences, load-balancing considerations, licensing limitations, the time required to maintain them, and other factors, sometimes dictate otherwise. Here, we present the current CUNY HPC Center user-level application catalogue and note the system on which each application is installed and licensed to run.
We encourage the CUNY HPC community to help the HPC Center staff create an applications catalogue that is closely tuned to the needs of the community. As such, we hope that users will solicit staff help in growing our application install base to suit the needs of the community, whatever the application discipline (CAE, CFD, COMPCHEM, QCD, BIOINFORMATICS, etc.). The CUNY HPC Center will do its best to satisfy reasonable software requests.
Software requests must be submitted by Supervisors and/or PI's only. Users can install applications in their own home directory as needed.
Unless otherwise noted, all applications built locally were built using our default Intel-OpenMPI applications stack. Furthermore, the SLURM job submission scripts below were working at the time this section of the Wiki was written, but the number of processors (cores), the memory, and the process placement defined in the example scripts are not necessarily optimal for wall-clock or CPU-time performance. Users should draw on their knowledge of the application and the system, and on their own experience, to choose the best combination of processors and memory for their scripts. Details on how to make full use of the SLURM job submission options are covered in the SLURM section below.
ADCIRC
ADCIRC is a system of programs for solving time-dependent, free-surface, circulation and transport problems in two and three dimensions.
AMBER (Assisted Model Building with Energy Refinement)
Amber is the collective name for a suite of programs for classical bio-molecular simulations. The name "Amber" also denotes the family of potentials (force fields) used with Amber software. Here we discuss only simulation packages, but not the force fields or free tools available via AmberTools package. Details and submission scripts can be found at: http://wiki.csi.cuny.edu/cunyhpc/index.php/Applications_Environment/amber
ANVIO
Anvio is an analysis and visualization platform for genomics data. It allows various types of workflows to be established. More information can be found here: ANVIO
AUGUSTUS
AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences. It is gene-finding software based on Hidden Markov Models (HMMs), described in papers by Stanke and Waack (2003) and Stanke et al. (2006, 2006b, 2008). The local version of the program is installed on Penzias. More information can be found here: AUGUSTUS
AUTODOCK
AutoDock is a suite of automated docking tools.
BAMOVA
Bamova is a package used to do genetic analysis of a wide range of organisms on the basis of next-generation sequence data. The software implements Bayesian Analysis of Molecular Variance and different likelihood models for three different types of molecular data (including two models for high throughput sequence data). For more detail on BAMOVA please visit the BAMOVA web site [6] and manual here [7]. Further information can also be found here BAMOVA.
BAYESCAN
BAYESCAN is a population genomics software package. It identifies outlier loci from genetic data, using differences in allele frequencies between populations, and is applicable to both dominant and codominant data.
BayeScan is based on the multinomial-Dirichlet model. One of the scenarios covered consists of an island model in which subpopulation allele frequencies are correlated through a common migrant gene pool from which they differ in varying degrees. The difference in allele frequency between this common gene pool and each subpopulation is measured by a subpopulation-specific FST coefficient. Therefore, this formulation can consider realistic ecological scenarios where the effective size and the immigration rate may differ among subpopulations.
More detailed information on Bayescan can be found at the web site here [8] and in the manual here [9]. More information about our installation can be found here BAYESCAN.
BEAST
BEAST is a powerful and flexible evolutionary analysis package for molecular sequence variation.
The package implements a family of Markov chain Monte Carlo (MCMC) algorithms for Bayesian phylogenetic inference, divergence time dating, coalescent analysis, phylogeography and related molecular evolutionary analyses. It is a cross-platform Java program for Bayesian MCMC analysis of molecular sequences. It is entirely orientated towards rooted, time-measured phylogenies inferred using strict or relaxed molecular clock models. It can be used as a method of reconstructing phylogenies, but is also a framework for testing evolutionary hypotheses without conditioning on a single tree topology. BEAST uses MCMC to average over tree space, so that each tree is weighted proportional to its posterior probability. The distribution includes a simple-to-use user-interface program called 'BEAUti' for setting up standard analyses and a suite of programs for analysing the results. For more detail on BEAST (and BEAUti) please visit the BEAST web site [10]. More information about our installation can be found here BEAST.
BEST
BEST is an application aimed to estimate gene trees and the species tree from multilocus sequences.
The program uses information from multiple gene trees and performs a Bayesian analysis to estimate the topology of the species tree, divergence times and population sizes.
It provides a new approach for estimating the mutation-rate-based phylogenetic relationships among species. Its method accounts for deep coalescence, but not for other complicating issues such as horizontal transfer or gene duplication. The program works in conjunction with the popular Bayesian phylogenetics package, MrBayes (Ronquist and Huelsenbeck, Bioinformatics, 2003). BEST's parameters are defined using the 'prset' command from MrBayes. Details on BEST's capabilities and options are available at the BEST web site here [11]. More information about our installation is available here BEST.
BOWTIE2
BOWTIE2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes.
BOWTIE2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. BOWTIE2 supports gapped, local, and paired-end alignment modes. BOWTIE2 is part of a sequence alignment and analysis tool chain developed at Johns Hopkins, University of California at Berkeley, and Harvard, and distributed through the Center for Bioinformatics and Computational Biology. The other tools in this collection, CUFFLINKS, SAMTOOLS, and TOPHAT, are also installed at the CUNY HPC Center. Additional information can be found at the BOWTIE2 home page here [12]. Information about our installation can be found here BOWTIE2.
BPP2
BPP2 uses a Bayesian modeling approach to generate the posterior probabilities of species assignments taking into account uncertainties due to unknown gene trees and the ancestral coalescent process. For tractability, it relies on a user-specified guide tree to avoid integrating over all possible species delimitations.
BROWNIE
BROWNIE is a program for analyzing rates of continuous character evolution and looking for substantial rate differences in different parts of a tree using likelihood ratio tests and Akaike Information Criterion (AIC) statistics. It now also implements many other methods for examining trait evolution and methods for doing species delimitation. More information about our installation can be found here BROWNIE.
CGAL
The Computational Geometry Algorithms Library (CGAL) offers data structures and algorithms.
Examples of these are triangulations (2D constrained triangulations, and Delaunay triangulations and periodic triangulations in 2D and 3D), Voronoi diagrams (for 2D and 3D points, 2D additively weighted Voronoi diagrams, and segment Voronoi diagrams), polygons (Boolean operations, offsets, straight skeleton), polyhedra (Boolean operations), arrangements of curves and their applications (2D and 3D envelopes, Minkowski sums), mesh generation (2D Delaunay mesh generation and 3D surface and volume mesh generation, skin surfaces), geometry processing (surface mesh simplification, subdivision and parameterization, as well as estimation of local differential properties, and approximation of ridges and umbilics), alpha shapes, convex hull algorithms (in 2D, 3D and dD), search structures (kd trees for nearest neighbor search, and range and segment trees), interpolation (natural neighbor interpolation and placement of streamlines), shape analysis, fitting, and distances (smallest enclosing sphere of points or spheres, smallest enclosing ellipsoid of points, principal component analysis), and kinetic data structures.
The library is installed on PENZIAS.
More information can be found here http://wiki.csi.cuny.edu/cunyhpc/index.php/Applications_Environment/CGAL.
CONSED
CONSED is a DNA sequence analysis finishing tool that provides sequence viewing, editing, alignment, and assembly capabilities from a X Windows graphical user interface (GUI).
It makes extensive use of other non-graphical and underlying sequence analysis tools including PHRED, PHRAP, and CROSSMATCH that may also be used separately and are described elsewhere in this document. It also includes a viewer called BAMVIEW. The CONSED tool chain is developed and maintained at the University of Washington and is described more completely here [14]. CONSED is provided at the CUNY HPC Center under an academic license that allows use, but not the copying or outbound transfer of any of the executables or files distributed under this academic license. The license is not transferable in any way, and users wishing to run the application at their own site must acquire a license directly from the authors.
The CUNY HPC Center supports CONSED version 23.0 for interactive use on KARLE. CONSED 23.0 and the tool chain described above are also installed on ANDY to allow for the batch use of the underlying support tools mentioned above and described in detail below. In general, running GUI-based applications on ANDY's login node is discouraged. There should be little need to do this, as KARLE is on the periphery of the CUNY HPC network, making login there direct, and KARLE shares its HOME directory file system with ANDY, making files created on either system immediately available on the other.
Rather than rewrite portions of the CONSED manual here, users are directed to the manual's "Quick Tour" section here [15] and asked to walk through some of the exercises after logging into KARLE. If problems or questions come up, please post them to "hpchelp@csi.cuny.edu". The CONSED 23.0 distribution is installed on KARLE in the following directory:
/share/apps/consed/default
All the files in the distribution can be found there.
CP2K
CP2K is a program to perform atomistic and molecular simulations of solid state, liquid, molecular, and biological systems.
It provides a general framework for different methods, such as density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW) and classical pair and many-body potentials. CP2K provides state-of-the-art methods for efficient and accurate atomistic simulations. More information about our installation can be found here CP2K.
CUFFLINKS
CUFFLINKS assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples.
It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. CUFFLINKS then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols. CUFFLINKS is part of a sequence alignment and analysis tool chain developed at Johns Hopkins, University of California at Berkeley, and Harvard, and distributed through the Center for Bioinformatics and Computational Biology. The other tools in this collection, BOWTIE, SAMTOOLS, and TOPHAT, are also installed at the CUNY HPC Center. Additional information can be found at the CUFFLINKS home page here [16]. More information about our installation can be found here CUFFLINKS.
DL_POLY
DL_POLY is a general purpose molecular dynamics simulation package developed at Daresbury Laboratory by W. Smith, T.R. Forester and I.T. Todorov.
Both serial and parallel versions are available. The original package was developed by the Molecular Simulation Group (now part of the Computational Chemistry Group, MSG) at Daresbury Laboratory under the auspices of the Engineering and Physical Sciences Research Council (EPSRC) for the EPSRC's Collaborative Computational Project for the Computer Simulation of Condensed Phases (CCP5). Later developments were also supported by the Natural Environment Research Council through the eMinerals project. The package is the property of the Central Laboratory of the Research Councils, UK. More information about our installation and use can be found here DL_POLY.
ExaML
ExaML (Exascale Maximum Likelihood) is a code for phylogenetic inference using MPI.
The code is installed only on Penzias and implements the popular RAxML search algorithm for maximum likelihood based inference of phylogenetic trees.
It uses a radically new MPI parallelization approach that yields improved parallel efficiency, in particular on partitioned multi-gene or whole-genome datasets.
When using ExaML please cite the following paper:
Alexey M. Kozlov, Andre J. Aberer, Alexandros Stamatakis: "ExaML Version 3: A Tool for Phylogenomic Analyses on Supercomputers." Bioinformatics (2015) 31 (15): 2577-2579.
It is up to 4 times faster than RAxML-Light [1].
Like RAxML-Light, ExaML also implements checkpointing, SSE3 and AVX vectorization, and memory saving techniques.
[1] A. Stamatakis, A.J. Aberer, C. Goll, S.A. Smith, S.A. Berger, F. Izquierdo-Carrasco: "RAxML-Light: A Tool for computing TeraByte Phylogenies", Bioinformatics 2012; doi: 10.1093/bioinformatics/bts309.
The run script for a parallel job is analogous to the one for running RAxML on Penzias and Andy.
ExaBayes
ExaBayes is a software package for Bayesian tree inference. It is particularly suitable for large-scale analyses on computer clusters. The MPI-parallel version of the package is installed on the Penzias server at the HPC Center.
Availability: PENZIAS
Module file: exabayes
Citation:
Fredrik Ronquist, Maxim Teslenko, Paul van der Mark, Daniel L Ayres, Aaron Darling, Sebastian Höhna, Bret Larget, Liang Liu, Marc A Suchard, and John P Huelsenbeck. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Systematic biology, 61(3):539--42, May 2012.
Alexei J Drummond, Marc A Suchard, Dong Xie, and Andrew Rambaut. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular biology and evolution, 29(8):1969--73, August 2012.
Clemens Lakner, Paul van der Mark, John P Huelsenbeck, Bret Larget, and Fredrik Ronquist. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Systematic biology, 57(1):86--103, February 2008.
Use: An example SLURM script to run ExaBayes on PENZIAS is given below:
#!/bin/bash
#SLURM -q production
#SLURM -N <name_of_job>
#SLURM -l select=1:ncpus=2
#SLURM -l place=free
#SLURM -V

# You must explicitly change to the working directory in SLURM
cd $SLURM_O_WORKDIR

mpirun -np 2 exabayes <input_file> > output_file
More information about the application, along with sample workflows, is available on the ExaBayes web site:
http://sco.h-its.org/exelixis/web/software/exabayes/manual/index.html#sec-11
FDPPDIV
FDPPDiv is a program for estimating divergence times on a fixed, rooted tree topology.
FDPPDiv offers two alternative approaches to divergence time estimation. The DPPDiv part refers to the Dirichlet Process Prior (DPP) model for divergence time estimation, and the F prefix (for Fossil) refers to the new Fossil Birth-Death approach. More information about our installation can be found here FDPPDIV.
GAMESS-US
GAMESS is a program for ab initio molecular quantum chemistry.
Briefly, GAMESS can compute SCF wavefunctions ranging from RHF, ROHF, UHF, GVB, and MCSCF. Correlation corrections to these SCF wavefunctions include Configuration Interaction, second order perturbation Theory, and Coupled-Cluster approaches, as well as the Density Functional Theory approximation. Excited states can be computed by CI, EOM, or TD-DFT procedures. Nuclear gradients are available, for automatic geometry optimization, transition state searches, or reaction path following. Computation of the energy hessian permits prediction of vibrational frequencies, with IR or Raman intensities. Solvent effects may be modeled by the discrete Effective Fragment potentials, or continuum models such as the Polarizable Continuum Model. Numerous relativistic computations are available, including infinite order two component scalar corrections, with various spin-orbit coupling options. The Fragment Molecular Orbital method permits use of many of these sophisticated treatments to be used on very large systems, by dividing the computation into small fragments. Nuclear wavefunctions can also be computed, in VSCF, or with explicit treatment of nuclear orbitals by the NEO code. More information, including code, can be found here GAMESS-US.
GARLI
GARLI is a program that performs phylogenetic inference using the maximum-likelihood criterion.
Several sequence types are supported, including nucleotide, amino acid and codon. Version 2.0 adds support for partitioned models and morphology-like data types. It is usable on all operating systems, and is written and maintained by Derrick Zwickl at the University of Texas at Austin. Additional information can be found on the GARLI Wiki here [17]. More information about our installation can be found here GARLI.
GAUSS
An easy-to-use data analysis, mathematical and statistical environment based on the powerful, fast and efficient GAUSS Matrix Programming Language.
GAUSS is used to solve real world problems and data analysis problems of exceptionally large scale. GAUSS is currently available on ANDY and BOB. At the CUNY HPC Center GAUSS is typically run in serial mode. (Note: GAUSS should not be confused with the computational chemistry application Gaussian.) More information about our installation can be found here GAUSS.
Gaussian09
Gaussian09 is third-party, commercially licensed software from Gaussian, Inc. It is a set of programs for calculating electronic structure.
Gaussian09 is available for general use on Andy. The Gaussian User Guide can be found here [18]. More information about our installation can be found here GAUSSIAN09.
GMP
GMP is a library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating-point numbers. There is no practical limit to the precision except the ones implied by the available memory in the machine GMP runs on. GMP has a rich set of functions, and the functions have a regular interface. The library is installed on PENZIAS.
Gnuplot
Gnuplot is a portable command-line driven graphing utility. It is installed on the following systems:
- Karle under /usr/bin/gnuplot
- Andy under /share/apps/gnuplot/default/bin/gnuplot
- Bob under /share/apps/gnuplot/default/bin/gnuplot
Extensive documentation of gnuplot is available at the gnuplot homepage.
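Because gnuplot is command-line driven, plots can be produced non-interactively from a command file, which suits batch jobs. A minimal sketch, assuming the png terminal is available in the installed build; the helper name write_sine_plot and the file names are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: write a small gnuplot command file that renders sin(x) to a PNG.
# Arguments: path of the command file to create, path of the image to produce.
write_sine_plot() {
    script_path=$1
    image_path=$2
    cat > "$script_path" <<EOF
set terminal png size 800,600
set output '${image_path}'
set xlabel 'x'
set ylabel 'sin(x)'
plot sin(x) title 'sine'
EOF
}

write_sine_plot sine.gp sine.png

# Render only if gnuplot is on the PATH (it may not be on every system)
if command -v gnuplot >/dev/null 2>&1; then
    gnuplot sine.gp
fi
```

On Andy or Bob the full path /share/apps/gnuplot/default/bin/gnuplot listed above could be used in place of the bare gnuplot command.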
GENOMEPOP2
GenomePop2 is a newer and specialized version of the older program GenomePop.
GenomePop2 is designed to manage SNPs under more flexible and useful settings that are controlled by the user. If you need models with more than 2 alleles you should use the older GenomePop version of the program.
GenomePop2 allows the forward simulation of sequences of biallelic positions. As in the previous version, a number of evolutionary and demographic settings are allowed. Several populations under any migration model can be implemented. Each population consists of a number N of individuals. Each individual is represented by one (haploids) or two (diploids) chromosomes with constant or variable (hotspots) recombination between binary sites. The fitness model is multiplicative, with each derived allele having a multiplicative effect of (1 - s*h - E) on the global fitness value. By default E=0 and h=0.5 in diploids, but h=1 in homozygotes or in haploids. Selective nucleotide sites undergoing directional selection (positive or negative) in different populations can be defined. In addition, bottlenecks and/or population expansion scenarios can be set by the user for a desired number of generations. Several runs can be executed, and a sample of user-defined size is obtained for each run and population. For more detail on how to use GenomePop2, please visit the web site here [19]. More information about our installation can be found here GENOMEPOP2.
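As a worked illustration of the multiplicative fitness model, read the per-allele effect as 1 - s*h - E and write W for the global fitness. The symbol W, the site index i, and the numeric values below are illustrative assumptions, not taken from the GenomePop2 documentation. With E = 0 and s = 0.1, one heterozygous site (h = 0.5) and one homozygous site (h = 1) would combine as:

```latex
W = \prod_{i} \left(1 - s_i h_i - E\right), \qquad
W_{\text{example}} = (1 - 0.1 \cdot 0.5)\,(1 - 0.1 \cdot 1) = 0.95 \times 0.90 = 0.855
```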
GROMACS
GROMACS (Groningen Machine for Chemical Simulations)
GROMACS is a full-featured suite of free software, licensed under the GNU General Public License to perform molecular dynamics simulations -- in other words, to simulate the behavior of molecular systems with hundreds to millions of particles using Newton's equations of motion. It is primarily used for research on proteins, lipids, and polymers, but can be applied to a wide variety of chemical and biological research questions.
Details and submission scripts for production runs can be found at: http://wiki.csi.cuny.edu/cunyhpc/index.php/Applications_Environment/gromacs Please note that molecular systems cannot be prepared for simulation with the GROMACS tools on the login node. Users must instead use their own workstation or the interactive or development queues.
GPAW
GPAW is a density-functional theory (DFT) Python code based on the projector-augmented wave (PAW) method and the atomic simulation environment (ASE).
It uses real-space uniform grids and multigrid methods, atom-centered basis-functions or plane-waves. GPAW calculations are controlled through scripts written in the programming language Python. GPAW relies on the Atomic Simulation Environment (ASE), which is a Python package that helps to describe atoms. The ASE package also handles molecular dynamics, analysis, visualization, geometry optimization and more. More information about our installation can be found here GPAW.
Hapsembler
Hapsembler is a haplotype-specific genome assembly toolkit that is designed for genomes that are rich in SNPs and other types of polymorphism. Hapsembler can be used to assemble reads from a variety of platforms including Illumina and Roche/454.
Hapsembler is currently installed on the Appel system. To access the Hapsembler binaries, load the hapsembler module with
module load hapsembler
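In batch scripts it can help to fail early if the modules environment has not been initialized, since a non-interactive shell does not always define the module command. A minimal sketch; the function name load_module_checked is hypothetical, and only the module load sub-command shown above is used:

```shell
#!/bin/bash
# Hypothetical helper: load a module only if the "module" command is defined,
# and stop with a clear message otherwise.
load_module_checked() {
    name=$1
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not available; cannot load ${name}" >&2
        return 1
    fi
    module load "$name"
}

# Example: load Hapsembler, or report that it could not be loaded
load_module_checked hapsembler || echo "hapsembler not loaded"
```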
HOOMD
Performs general purpose particle dynamics simulations, taking advantage of NVIDIA GPUs to attain a level of performance equivalent to many processor cores on a fast cluster.
Unlike some other applications in the particle and molecular dynamics space, HOOMD developers have worked to implement all of the code's computationally intensive kernels on the GPU, although currently only single node, single-GPU or OpenMP-GPU runs are possible. There is no MPI-GPU or distributed parallel GPU version available at this time.
HOOMD's object-oriented design patterns make it both versatile and expandable. Various types of potentials, integration methods and file formats are currently supported, and more are added with each release. The code is available and open source, so anyone can write a plugin or change the source to add additional functionality. Simulations are configured and run using simple python scripts, allowing complete control over the force field choice, integrator, all parameters, how many time steps are run, etc. The scripting system is designed to be as simple as possible to the non-programmer.
The HOOMD development effort is led by the Glotzer group at the University of Michigan, but many groups from different universities have contributed code that is now part of the HOOMD main package, see the credits page for the full list. The HOOMD website and documentation are available here [20]. More information about our installation can be found here HOOMD.
HOPSPACK
HOPSPACK stands for Hybrid Optimization Parallel Search Package. It is designed to help users solve a wide range of derivative-free optimization problems.
The first two constraints specify linear inequalities and equalities with coefficient matrices AI and AE. The next two constraints describe nonlinear inequalities and equalities captured in functions cI(x) and cE(x). The final constraints denote lower and upper bounds on the variables. HOPSPACK allows variables to be continuous or integer-valued and has provisions for multi-objective optimization problems. In general, functions f(x), cI(x), and cE(x) can be noisy and nonsmooth, although most algorithms perform best on determinate functions with continuous derivatives.
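The constraint list above refers to a problem statement that is not reproduced on this page. The following sketches the general form it appears to describe, using the symbols named in the text. The right-hand-side vectors b_I and b_E, the bound vectors l and u, the directions of the inequalities, and the choice of minimization are assumptions and should be checked against the HOPSPACK manual:

```latex
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{subject to} \quad & A_I x \ge b_I, \\
& A_E x = b_E, \\
& c_I(x) \ge 0, \\
& c_E(x) = 0, \\
& l \le x \le u .
\end{aligned}
```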
Users can design and implement their own solver, either by writing their own code or by building on existing solvers already in the framework. Because all solvers (called citizens) are members of the same global class, they can share assigned resources. The main features of the package are:
- Only function values are required for the optimization.
- The user must provide a separate program that can evaluate the objective and nonlinear constraint functions at a given point.
- A robust implementation of the Generating Set Search (GSS) solver is supplied, including the capability to handle linear constraints.
- Multiple solvers can run simultaneously and are easily configured to share information.
- Solvers may share a cache of computed function and constraint evaluations to eliminate duplicate work.
- Solvers can initiate and control sub-problems.
Continuation -> HOPSACK.
HUMAnN2
HUMAnN is a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads). HUMAnN2 is the next generation of HUMAnN (HMP Unified Metabolic Analysis Network).
Details and submission scripts can be found at: http://wiki.csi.cuny.edu/cunyhpc/index.php/Applications_Environment/humann2
IMa2
The IMa2 application performs basic ‘Isolation with Migration’ calculations using Bayesian inference and Markov chain Monte Carlo methods.
The only major conceptual addition to IMa2 that makes it different from the original IMa program is that it can handle data from multiple populations. This requires that the user specify a phylogenetic tree. Importantly, the tree must be rooted, and the sequence in time of internal nodes must be known and specified. More information on IMa2 and IMa can be found in the user manual here [21]. Information about our installation can be found here IMA2.
I-TASSER
I-TASSER is a platform for protein structure and function predictions. 3D models are built based on multiple-threading alignments by LOMETS and iterative template fragment assembly simulations; function insights are derived by matching the 3D models with the BioLiP protein function database.
Details and submission scripts can be found at: http://wiki.csi.cuny.edu/cunyhpc/index.php/Applications_Environment/itasser
JULIA
Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments.
Julia is installed on Penzias.
HONDO PLUS
Hondo Plus is a versatile electronic structure code that combines work from the original Hondo application developed by Harry King in the lab of Michel Dupuis and John Rys, and that of numerous subsequent contributors.
It is currently distributed from the research lab of Dr. Donald Truhlar at the University of Minnesota. Part of the advantage of Hondo Plus is the availability of source implementations of a wide variety of model chemistries developed over its lifetime that researchers can adapt to their particular needs. The license to use the code requires a literature citation, which is documented in the Hondo Plus 5.1 manual found at:
http://comp.chem.umn.edu/hondoplus/HONDOPLUS_Manual_v5.1.2007.2.17.pdf
More information about our installation can be found here HONDO PLUS.
LAMARC
LAMARC is a program which estimates population-genetic parameters such as population size, population growth rate, recombination rate, and migration rates.
It approximates a summation over all possible genealogies that could explain the observed sample, which may be sequence, SNP, microsatellite, or electrophoretic data. LAMARC and its sister program MIGRATE are successor programs to the older programs Coalesce, Fluctuate, and Recombine, which are no longer being supported. These programs are memory-intensive, but can run effectively on workstations. They are supported on a variety of operating systems. For more detail on LAMARC please visit the website here [22], read this paper [23], and look at the documentation here [24]. More information about our installation can be found here LAMARC.
LAMMPS
LAMMPS is a classical molecular dynamics code that models an ensemble of particles in a liquid, solid, or gaseous state.
It can model atomic, polymeric, biological, metallic, granular, and coarse-grained systems using a variety of force fields and boundary conditions. LAMMPS runs efficiently on single-processor desktop or laptop machines, but is also designed for parallel computers, including clusters with and without GPUs. It will run on any parallel machine that compiles C++ and supports the MPI message-passing library. This includes distributed- or shared-memory parallel machines and Beowulf-style clusters. LAMMPS can model systems with only a few particles up to millions or billions. LAMMPS is a freely-available open-source code, distributed under the terms of the GNU Public License, which means you can use or modify the code however you wish. LAMMPS is designed to be easy to modify or extend with new capabilities, such as new force fields, atom types, boundary conditions, or diagnostics. A complete description of LAMMPS can be found in its on-line manual here [25] or from the full PDF manual here [26]. Information about our installation can be found here LAMMPS.
LS-DYNA
From its early development in the 1970s, LS-DYNA has evolved into a general purpose material stress, collision, and crash analysis program with many built-in material and structural element models.
In recent years, the code has also been adapted for both OpenMP and MPI parallel execution on a variety of platforms. The most recent version, LS-DYNA 7.1.2, is installed on ANDY at the CUNY HPC Center under an academic license held by the City College of New York. Use of this license for work that is commercial in any way is prohibited.
Details on LS-DYNA's use, input deck construction, and execution options can be found in the LS-DYNA manual here [27]. All files related to the HPC Center installation of version 971 (executables and example inputs) are located in:
/share/apps/lsdyna/default/[bin,examples]
More information about our installation can be found here LSDYNA.
MAGMA
MAGMA is a library similar to LAPACK but for hybrid architectures. MAGMA provides implementations for CUDA, Intel Xeon Phi, and OpenCL. On CUNY HPCC systems, MAGMA is installed in its CUDA variant only on Penzias.
MATHEMATICA
Mathematica is a fully integrated technical computing system that combines fast, high-precision numerical and symbolic computation with data visualization and programming capabilities. Mathematica version 10.0 is currently installed on the CUNY HPC Center's ANDY cluster (andy.csi.cuny.edu) and KARLE standalone server (karle.csi.cuny.edu). The basics of running Mathematica on CUNY HPC systems are presented here. Additional information on how to use Mathematica can be found at http://www.wolfram.com/learningcenter/
More information about our installation can be found here MATHEMATICA.
MATLAB
The MATLAB high-performance language for technical computing integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.
Typical uses include:
- Math and computation
- Algorithm development
- Data acquisition
- Modeling, simulation, and prototyping
- Data analysis, exploration, and visualization
- Scientific and engineering graphics
- Application development, including graphical user interface building
More information about our installation can be found here MATLAB
Migrate
Migrate estimates population parameters, effective population sizes and migration rates of n populations, using genetic data. It uses a coalescent theory approach taking into account the history of mutations and the uncertainty of the genealogy.
Parameter estimates are obtained by either a maximum likelihood (ML) approach or Bayesian inference (BI). Migrate's output is presented in a text file and in a PDF file. The PDF file eventually will contain all possible analyses including histograms of posterior distributions. More information about our installation can be found here MIGRATE
MPFR
The MPFR library is a C library for multiple-precision floating-point computations with correct rounding. MPFR has continuously been supported by the INRIA and the current main authors come from the Caramel and AriC project-teams at Loria (Nancy, France) and LIP (Lyon, France) respectively; see more on the credit page.
MPFR is based on the GMP multiple-precision library. The main goal of MPFR is to provide a library for multiple-precision floating-point computation that is both efficient and has well-defined semantics. It copies the good ideas from the ANSI/IEEE-754 standard for double-precision floating-point arithmetic (53-bit significand). The library is installed on PENZIAS.
MRBAYES
MrBayes is a program for the Bayesian estimation of phylogeny. Bayesian inference of phylogeny is based upon a quantity called the posterior probability distribution of trees, which is the probability of a tree conditioned on certain observations.
The conditioning is accomplished using Bayes's theorem. The posterior probability distribution of trees is impossible to calculate analytically; instead, MrBayes uses a simulation technique called Markov chain Monte Carlo (or MCMC) to approximate the posterior probabilities of trees. More information about our installation can be found here MRBAYES
msABC
msABC is a program for simulating various neutral evolutionary demographic scenarios based on the software ms (Hudson 2002). msABC extends ms, calculating a multitude of summary statistics.
Therefore, msABC is suitable for performing the sampling step of an Approximate Bayesian Computation (ABC) analysis under various neutral demographic models. The main advantages of msABC are (i) use of various prior distributions, such as uniform, Gaussian, log-normal, and gamma, (ii) implementation of a multitude of summary statistics for one or more populations, (iii) efficient implementation, which allows the analysis of hundreds of loci and chromosomes even on a single computer, and (iv) extended flexibility, such as simulation of loci of variable size and simulation of missing data. More information about our installation can be found here msABC
MSMS
MSMS is a tool to generate sequence samples under both neutral models and single locus selection models. MSMS permits the full range of demographic models provided by its relative MS (Hudson, 2002).
In particular, it allows for multiple demes with arbitrary migration patterns, population growth and decay in each deme, and for population splits and mergers. Selection (including dominance) can depend on the deme and also change with time. More information about our installation can be found here MSMS
NAMD
NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems [28].
The main server for molecular dynamics calculations is PENZIAS, which supports both GPU and non-GPU versions of NAMD. The MPI-only (no GPU support) parallel versions of NAMD are also installed on SALK and ANDY. More information about our installation can be found here NAMD
Network Simulator-2 (NS2)
NS2 is a discrete event simulator targeted at networking research. NS2 provides substantial support for simulation of TCP, routing, and multicast protocols over wired and wireless (local and satellite) networks.
NWChem
NWChem is an ab initio computational chemistry software package which also includes molecular dynamics (MM, MD) and coupled, quantum mechanical and molecular dynamics functionality (QM-MD).
NWChem has been developed by the Molecular Sciences Software group at the Department of Energy's EMSL. The software is available on PENZIAS and ANDY. More information about our installation can be found here NWChem
Octopus
Octopus is a pseudopotential real-space package aimed at the simulation of the electron-ion dynamics of one-, two-, and three-dimensional finite systems subject to time-dependent electromagnetic fields.
The program is based on time-dependent density-functional theory (TDDFT) in the Kohn-Sham scheme. All quantities are expanded in a regular mesh in real space, and the simulations are performed in real time. The program has been successfully used to calculate linear and non-linear absorption spectra, harmonic spectra, laser induced fragmentation, etc. of a variety of systems. More information about our installation can be found here OCTOPUS
OpenMM
OpenMM is both a library and a stand-alone application which provides tools for modern molecular modeling simulation. As a library it can be hooked into any code, allowing that code to do molecular modeling with minimal extra coding.
Moreover, OpenMM has a strong emphasis on hardware acceleration via GPU, thus providing not just a consistent API, but much greater performance than what one could get from just about any other code available. OpenMM was developed as part of the Physics-Based Simulation project, led by Prof. Pande.
OpenFOAM
OpenFOAM is first and foremost a library that users may incorporate into their own code. OpenFOAM is installed on PENZIAS.
More information about our installation can be found here OpenFOAM
OpenSees
OpenSees, the Open System for Earthquake Engineering Simulation, is an object-oriented, open source software framework.
It allows users to create both serial and parallel finite element computer applications for simulating the response of structural and geotechnical systems subjected to earthquakes and other hazards. OpenSees is primarily written in C++ and uses several Fortran and C numerical libraries for linear equation solving, and material and element routines. The software is installed on PENZIAS.
ORCA
ORCA is an electronic structure program capable of carrying out geometry optimizations and predicting a large number of spectroscopic parameters at different levels of theory.
Besides Hartree-Fock theory, density functional theory (DFT), and semiempirical methods, high-level ab initio quantum chemical methods based on the configuration interaction and coupled cluster methods are included in ORCA to an increasing degree. More information about our installation can be found here ORCA
ParGAP
ParGAP is built on top of the GAP system. The latter is a system for computational discrete algebra, with particular emphasis on Computational Group Theory. GAP provides a programming language, a library of thousands of functions implementing algebraic algorithms written in the GAP language, as well as large data libraries of algebraic objects.
The ParGAP (Parallel GAP) package itself provides a way of writing parallel programs using the GAP language. Former names of the package were ParGAP/MPI and GAP/MPI; the word MPI refers to Message Passing Interface, a well-known standard for parallelism. ParGAP is based on the MPI standard, and this distribution includes a subset implementation of MPI, to provide a portable layer with a high level interface to BSD sockets. More information about our installation can be found here ParGAP
POPABC
PopABC is a computer package to estimate historical demographic parameters of closely related species/populations (e.g. population size, migration rate, mutation rate, recombination rate, splitting events) within an Isolation with Migration model.
The software performs coalescent simulation in the framework of approximate Bayesian computation (ABC, Beaumont et al, 2002). PopABC can also be used to perform Bayesian model choice to discriminate between different demographic scenarios. The program can be used either for research or for education and teaching purposes. Further details and a manual can be found at the POPABC website here [29]. More information about our installation can be found here POPABC
PHOENICS
PHOENICS is an integrated Computational Fluid Dynamics (CFD) package for the preparation, simulation, and visualization of processes involving fluid flow, heat or mass transfer, chemical reaction, and/or combustion in engineering equipment, building design, and the environment. More detail is available at the CHAM website, here http://www.cham.co.uk.
Although we expect most users to pre- and post-process their jobs on office-local clients, the CUNY HPC Center has installed the Unix version of the entire PHOENICS package on ANDY. PHOENICS is installed in /share/apps/phoenics/default where all the standard PHOENICS directories are located (d_allpro, d_earth, d_enviro, d_photo, d_priv1, d_satell, etc.). Of particular interest on ANDY is the MPI parallel version of the 'earth' executable 'parexe', which makes full use of the parallel processing power of the ANDY cluster for larger individual jobs. While the parallel scaling properties of PHOENICS jobs will vary depending on the job size, processor type, and the cluster interconnect, larger workloads will generally scale and run efficiently on 8 to 32 processors, while smaller problems will scale efficiently only up to about 4 processors. More detail on parallel PHOENICS is available at http://www.cham.co.uk/products/parallel.php. Aside from the tightly coupled MPI parallelism of 'parexe', users can run multiple instances of the non-parallel modules on ANDY (including the serial 'earexe' module) when a parametric approach can be used to solve their problems. More information about our installation can be found here PHOENICS
PHRAP-PHRED
PHRAP and PHRED are part of the DNA sequence analysis tool set that also includes the programs CROSSMATCH and SWAT. These tools are described in detail here [30], but a brief description of both, extracted from their manuals, follows.
PHRED and PHRAP (along with CONSED) can be used for both small sequence assemblies and larger shotgun analyses. This makes the tools a perhaps under-utilized set for smaller non-genomic groups. Some variables may need to be adjusted, particularly in CONSED, but researchers that have multiple sequences from a small locus can use the suite, starting from their chromatogram files. More information about our installation can be found here PHRAP-PHRED
PyRAD
Reduced-representation genomic sequence data (e.g., RADseq, GBS, ddRAD) are commonly used to study population-level research questions and consequently most software packages for assembling or analyzing such data are designed for sequences with little variation across samples.
Phylogenetic analyses typically include species with deeper divergence times (more variable loci across samples) and thus a different approach to clustering and identifying orthologs will perform better. pyRAD is intended for use with any type of restriction-site associated DNA. It currently supports RAD, ddRAD, PE-ddRAD, GBS, PE-GBS, EzRAD, PE-EzRAD, 2B-RAD, nextRAD, and can be extended to other types. More information about our installation can be found here PyRAD
Python
Python is a programming language that lets you work more quickly and integrate your systems more effectively. You can learn to use Python and see almost immediate gains in productivity and lower maintenance costs. [31]
There are two supported versions installed on the Andy system:
- Python 3.1.3 located under /share/apps/python/3.1.3/bin
- Python 2.7.3 located under /share/apps/epd/7.3-2/bin
More information about our installation can be found here PYTHON
Installing Python packages
Users may install Python packages/modules in their own space. Many packages available in Python repositories can be installed easily with the pip package manager, which is available in all of the Anaconda and Miniconda builds.
Note that running pip without first loading a python module installs packages against the system Python, which exists on the login node only. The Python interpreter available on all nodes (after loading the module) is installed under /share/usr/compilers/python. When installing packages in user space it is therefore very important to follow the procedure outlined below. The example demonstrates how to install the package "guppy" in user space:
For Python 2.7.13 in Anaconda build:
module load python/2.7.13_anaconda
pip install guppy --user
For Python 3.6.0 in Anaconda build:
module load python/3.6.0_anaconda
pip install guppy --user
For Python 2.7.13 in Miniconda:
module load python/miniconda2
pip install guppy --user
For Python 3.6.0 in Miniconda 3:
module load python/miniconda3
pip install guppy --user
To check if the package is properly installed type:
pip list | grep guppy
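The warning above comes down to making sure that pip and python resolve to the same installation before anything is installed. A minimal sketch of such a check; the function name same_bin_dir is hypothetical, and it assumes that a matching pip and python sit in the same bin directory, as they do in typical Anaconda and Miniconda layouts:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if two executables sit in the same directory.
# Assumption: a matching pip and python share one bin directory.
same_bin_dir() {
    [ "$(dirname "$1")" = "$(dirname "$2")" ]
}

python_path=$(command -v python || true)
pip_path=$(command -v pip || true)

if [ -n "$python_path" ] && [ -n "$pip_path" ] && same_bin_dir "$python_path" "$pip_path"; then
    echo "pip matches python: $python_path"
else
    echo "pip and python differ or are missing; load the python module first" >&2
fi
```

Running this after module load should report a match. A mismatch suggests that pip is still the system copy from the login node.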
QIIME
QIIME (pronounced "chime") stands for Quantitative Insights Into Microbial Ecology. QIIME is a pipeline application that uses numerous third-party applications.
QIIME takes users from their raw sequencing output through initial analyses such as OTU picking, taxonomic assignment, and construction of phylogenetic trees from representative sequences of OTUs, and through downstream statistical analysis, visualization, and production of publication-quality graphics. More information about our installation can be found here QIIME
R
R is a free software environment for statistical computing and graphics.
General Notes
The R language has become a de facto standard among statisticians for the development of statistical software and for data analysis. R is available on the following HPCC servers: Karle, Penzias, Appel and Andy. Karle is the only machine where R can be used without submitting jobs to the SLURM manager. On all other systems users must submit their R jobs via the SLURM batch scheduler. More information about our installation can be found here R
RAXML
Randomized Axelerated Maximum Likelihood (RAxML) is a program for sequential and parallel maximum likelihood based inference of large phylogenetic trees. It is a descendant of fastDNAml, which in turn was derived from Joe Felsenstein’s DNAml, which is part of the PHYLIP package.
RAxML is installed at the CUNY HPC Center on ANDY. Multiple versions are available. RAxML is available in both serial and MPI parallel versions. The MPI-parallel version should be run on four or more cores. The MPI-parallel version of RAxML is installed on Penzias. More information about our installation can be found here RAXML
SAGE
Sage can be used to study elementary and advanced, pure and applied mathematics.
This includes a huge range of mathematics, including basic algebra, calculus, elementary to very advanced number theory, cryptography, numerical computation, commutative algebra, group theory, combinatorics, graph theory, exact linear algebra and much more. More information about our installation can be found here SAGE
SAMTOOLS
SAMTOOLS provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM is a compact format that:
- Is flexible enough to store all the alignment information generated by various alignment programs;
- Is simple enough to be easily generated by alignment programs or converted from existing formats;
- Allows most operations on the alignment to work without loading the whole alignment into memory;
- Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
More information about our installation can be found here SAMTOOLS
SAS
SAS (pronounced "sass", originally Statistical Analysis System) is an integrated system of software products provided by SAS Institute Inc.
It enables the programmer to perform:
- data entry, retrieval, management, and mining
- report writing and graphics
- statistical analysis
- business planning, forecasting, and decision support
- operations research and project management
- quality improvement
- applications development
- data warehousing (extract, transform, load)
- platform independent and remote computing
More information about our installation can be found here SAS
Stata/MP
Stata is a complete, integrated statistical package that provides tools for data analysis, data management, and graphics. Stata/MP takes advantage of multiprocessor computers. CUNY HPC Center is licensed to use Stata on up to 8 cores.
Currently Stata/MP is available for users on Karle (karle.csi.cuny.edu). More information about our installation can be found here STATA
Structurama
Structurama is a program for inferring population structure from genetic data. The program assumes that the sampled loci are in linkage equilibrium and that the allele frequencies for each population are drawn from a Dirichlet probability distribution. Two different models for population structure are implemented.
First, Structurama offers the method of Pritchard et al. (2000) in which the number of populations is considered fixed. The program also allows the number of populations to be a random variable following a Dirichlet process prior (Pella and Masuda, 2006; Huelsenbeck and Andolfatto, 2007). More information about our installation can be found here STRUCTURAMA
Structure
The program Structure is a free software package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed.
More information about our installation can be found here STRUCTURE
Thrust Library (CUDA)
Thrust is a C++ template library for CUDA based on the Standard Template Library (STL). Thrust allows you to implement high performance parallel applications with minimal programming effort through a high-level interface that is fully interoperable with CUDA C.
Thrust has been integrated into the default CUDA distribution. The HPC Center is currently running CUDA as the default on PENZIAS, which includes the Thrust library. More information about our installation can be found here THRUST
TOPHAT
TOPHAT is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
TOPHAT is part of a sequence alignment and analysis tool chain developed at Johns Hopkins, University of California at Berkeley, and Harvard, and distributed through the Center for Bioinformatics and Computational Biology. More information about our installation can be found here TOPHAT
Trinity
Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data.
Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. More information about our installation can be found here TRINITY
USEARCH
USEARCH is a unique sequence analysis tool with thousands of users world-wide.
USEARCH offers search and clustering algorithms that are often orders of magnitude faster than BLAST. More information about our installation can be found here USEARCH
VELVET
Velvet is a set of algorithms for de novo short read assembly using de Bruijn graphs. It was developed at the European Bioinformatics Institute, Cambridge, UK.
More information about our installation can be found here VELVET
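For readers unfamiliar with de Bruijn graphs, the following is a minimal Python sketch of how such a graph can be built from short reads: every k-mer in a read contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. This is a textbook illustration only, not Velvet's implementation; the function name build_de_bruijn and the toy reads are hypothetical, and real assemblers add error correction, reverse complements, and graph simplification on top of this idea.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a dict mapping each (k-1)-mer to the list of (k-1)-mers it points to.

    Each k-mer found in a read adds one edge: prefix (first k-1 bases)
    -> suffix (last k-1 bases). Repeated k-mers add repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical toy reads that overlap by four bases.
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Walking the edges from AC through CG, GT and TC to CA spells out ACGTCA, the sequence that both toy reads were taken from.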
VSEARCH
VSEARCH is an open-source alternative to USEARCH.
VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This usually results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps.
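As an illustration of what "full dynamic programming" means here, the following is a minimal Python sketch of Needleman-Wunsch global alignment. It is a plain textbook version, not VSEARCH's vectorized implementation, and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for simplicity rather than VSEARCH's defaults.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # (mis)match
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

Every cell of the table is filled in, so the result is guaranteed to be optimal under the chosen scores. A seed-and-extend heuristic only explores regions around short exact matches, which is faster but can miss the best alignment when gaps are present.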
Additional details on VSEARCH can be found at: this link
VSEARCH is installed on the Penzias HPC cluster. To start using VSEARCH, load the corresponding module first:
module load vsearch
VMD
VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting.
It was developed by The Theoretical and Computational Biophysics Group at the University of Illinois. It is documented on the TCB's homepage.
VMD is installed on Karle. To use it from the command-line interface, log in to Karle as usual and start VMD by typing "vmd" followed by return. Alternatively, use the full path: "/share/apps/vmd/default/bin/vmd"
To use VMD in GUI mode, log in to Karle with the -X option (see this article for details) and start VMD as described above.
WRF
The Weather Research and Forecasting (WRF) model is a specific computer program with dual use for both weather forecasting and weather research.
It was created through a partnership that includes the National Oceanic and Atmospheric Administration (NOAA), the National Center for Atmospheric Research (NCAR), and more than 150 other organizations and universities in the United States and abroad. WRF is the latest numerical model and application to be adopted by NOAA's National Weather Service as well as the U.S. military and private meteorological services. It is also being adopted by government and private meteorological services worldwide. More information about our installation can be found here WRF
Xmgrace
Grace is a WYSIWYG 2D plotting tool for the X Window System and M*tif. Xmgrace is developed at Plasma Laboratory, Weizmann Institute of Science. More information about its capabilities can be found at the web page http://plasma-gate.weizmann.ac.il/Grace/
Grace is installed on Karle. To use it from the command-line interface, log in to Karle as usual and start Grace by typing "xmgrace" followed by return. Alternatively, use the full path: "/share/apps/xmgrace/default/grace/bin/xmgrace"
To use Grace in GUI mode, log in to Karle with the -X option (see this article for details) and start Xmgrace as described above.
MET (Model Evaluation Tools)
MET was developed by the National Center for Atmospheric Research (NCAR) Developmental Testbed Center (DTC) through the generous support of the U.S. Air Force Weather Agency (AFWA) and the National Oceanic and Atmospheric Administration (NOAA).
MET provides a variety of verification techniques, including:
- Standard verification scores comparing gridded model data to point-based observations
- Standard verification scores comparing gridded model data to gridded observations
- Spatial verification methods comparing gridded model data to gridded observations using neighborhood, object-based, and intensity-scale decomposition approaches
- Probabilistic verification methods comparing gridded model data to point-based or gridded observations
More information about use and set-up can be found here MET