Copyright (C) 2006-2009 by Bui Quang Minh, Le Sy Vinh, Heiko A. Schmidt, and Arndt von Haeseler
Copyright (C) 2003-2005 by Le Sy Vinh and Arndt von Haeseler
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
IQPNNI is a computer program to reconstruct the evolutionary relationships among contemporary species based on DNA, protein, or protein-coding sequences. In the case of protein-coding sequences, several codon models are implemented for inferring positive selection.
IQPNNI is a command-line and menu-driven program which allows users to specify the parameter values or let the program estimate them from the input data (a nucleotide or amino acid alignment in PHYLIP format). The options are classified into four main groups: general options, IQP options, substitution process options, and rate heterogeneity options.
IQPNNI is available free of charge from
http://www.cibiv.at/software/iqpnni/
IQPNNI is written in C++. It will run on most personal computers and workstations if compiled by a C++ compiler. Please read the Installation section (Section 7) for more details. We suggest reading this documentation before using IQPNNI for the first time. To find out what's new in the current version, please read the Version History section (Section 8).
Since version 3.2, the option ``Number of iterations'' has been changed to ``Minimum number of iterations'', meaning that the program will run at least the specified number of iterations, whether or not the stopping rule is applied. This prevents IQPNNI from stopping so early that it is unlikely to have found a good tree. In addition, an option for the maximum number of iterations has been added to avoid cases where IQPNNI runs ``forever'' because the stopping rule suggests too many iterations. For more details see Sections 4 and 5.
Since version 3.1, IQPNNI has been extended to work on protein-coding sequences. In such cases, it first treats the data as DNA and reconstructs a tree based on the HKY85 model. IQPNNI then turns the alignment into codon frames and estimates the codon model parameters based on the reconstructed tree. Finally, it infers sites under positive selection using Yang's empirical Bayesian method. For more details see Section 8.
To cite the program please use the following papers:
Since version 3.0, users can specify parameters through a set of command-line options, which are extremely useful to start a batch job. Run `iqpnni -h' to print out a short description of available options:
WELCOME TO IQPNNI 3.3 (sequential version)

Syntax: iqpnni [OPTIONS] [Filename]

GENERAL OPTIONS:
  -h, -?                 print this help dialog
  -n <min_iterations>    make the main loop to at least min_iterations
  -s <stopping_rule>     either on, off or max <max_iterations>; defaut is off
  -u <user_tree>         read the starting tree from user_tree file
  -bs                    construct a bootstrap tree by resampling the alignment
  -prefix <prefix_out>   set prefix of output files, default is aln name
  -sfc                   start from scratch, don't load the check point file
  -ni                    don't prompt for user option
IQP OPTIONS:
  -p <probability>       set the probability of deleting a sequence
  -k <representatives>   set the number of representatives
MODEL OPTIONS:
  -m <model>             set the model type for:
                           Nucleotides: JC69, K2P, F81, HKY85, TN93, GTR
                           Amino acids: WAG, Dayhoff, JTT, VT, mtREV, rtREV, Blosum
                           Protein-coding DNA: GY94, YN98, NY98, CP98, CGTR, CPR
                           Otherwise: Name of file containing user protein model
  -w <rate_type>         either uniform, gamma, igamma or sitespec
  -c <num_rate>          number of rate categories, for gamma and igamma only
OTHER OPTIONS:
  -param <pam_file>      use <pam_file> for parameter input (instead of stdin)
  -seed <number>         set random number generator seed to <number>
  -wsl                   write site log-likelihood to .sitelh (PHYLIP-like)
  -wpl                   write pattern log-likelihood to .patlh
  -con                   turn on writing .treels, off by default

You can specify some options on the command line first and then change them again using the text-menu interface. IQPNNI starts as follows. First, the `input_file.iqpnni.checkpoint' file is read if it is available and the `-sfc' option is NOT specified. If the last run on this alignment was NOT finished, the parameters recorded in the checkpoint file are loaded and all command-line options are ignored. In this case, you will see a printout like:
The program was not done from the last run! Load parameters from the checkpoint file...
IQPNNI now displays the menu and waits for user input if option `-ni' is not specified, otherwise it starts the computation directly.
The options `-n min_iterations' and `-s stopping_rule' are not independent unless you specify `-s off'. In any case, IQPNNI will run at least min_iterations iterations. If you set `-s on', the program will automatically estimate the number of iterations required to ensure, with 95% confidence, that further search will not find a better tree. If you set `-s max max_iterations', IQPNNI will always stop after max_iterations, even if the stopping rule suggests more iterations. With `-s max 0', max_iterations is set to 10 times min_iterations.
If `-n 0' is specified, IQPNNI will only estimate the ML branch lengths of the starting tree (either the BioNJ tree or the user tree); no topology rearrangement is performed.
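The interplay of `-n' and `-s' can be summarized in a small shell function. This is only a sketch: the function name iteration_cap is hypothetical and not part of IQPNNI, and the value printed for `-s off' assumes that the run stops after exactly min_iterations, as the menu text ``No, stop after 200 iterations'' suggests.

```bash
# Hypothetical helper (not part of IQPNNI): print the upper bound on the
# number of iterations implied by -n and -s.
#   $1 = min_iterations (-n)
#   $2 = stopping rule  (off | on | max), default off
#   $3 = max_iterations (only used with "max"), default 0
iteration_cap() {
    min="$1"
    rule="${2:-off}"
    max="${3:-0}"
    case "$rule" in
        off)
            # Assumption: without the stopping rule the run ends after min_iterations.
            echo "$min" ;;
        on)
            # The stopping rule alone decides, after at least min_iterations.
            echo "unbounded" ;;
        max)
            if [ "$max" -eq 0 ]; then
                # `-s max 0' means 10 times min_iterations.
                echo $((10 * min))
            else
                echo "$max"
            fi ;;
        *)
            echo "unknown stopping rule: $rule" >&2
            return 1 ;;
    esac
}
```

For example, `iteration_cap 200 max 0' corresponds to running `iqpnni -n 200 -s max 0'.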
-u user_tree
Instead of starting the search from the BioNJ tree, IQPNNI will use the tree from the user_tree file in Newick format. The branch lengths of this tree will be ignored, but the topology will be used to estimate the model parameters and to re-estimate the branch lengths.
-bs
The original alignment will be randomly resampled once by non-parametric bootstrap, and the tree will be reconstructed from this resampled alignment. Note that this is NOT a full bootstrap analysis. You will have to run IQPNNI n times with -bs and -prefix prefix_out (see below) to obtain n bootstrap trees. Then use another program such as TREE-PUZZLE to construct a consensus tree from these n bootstrap trees.
-prefix prefix_out
All output file names will use prefix_out instead of the default alignment name as their prefix. This option is very handy when combined with -bs to construct several resampled trees for a bootstrap analysis, because the output files will not be overwritten. The following is a small bash script for Linux that performs a full bootstrap analysis using IQPNNI and TREE-PUZZLE (the script should be adapted before real usage):

#!/bin/bash
n=100
filename=alignment.phy
# first, run iqpnni n times
for ((i=1; i<=n; i++)); do
    iqpnni "$filename" -ni -bs -prefix "bs-$i"
done
# concatenate all resulting trees into a big file
cat bs-*.treefile > "$filename.bstrees"
# now call TREE-PUZZLE to construct a consensus tree
puzzle -consmrel "$filename" "$filename.bstrees"
# NOTE: Choose the option to build a consensus tree from the puzzle menu
-sfc
This tells the program not to load the checkpoint file, so that IQPNNI starts from scratch instead of recovering from an interruption.
-ni
This is helpful to start a batch job. The parameters will be displayed again but the program will not prompt for user input and just start the computation directly.
The options `-p probability' and `-k representatives' concern the original IQP algorithm; see Minh et al. (2005) and Vinh and von Haeseler (2004) for more details. In short, IQPNNI iterates through a number of steps to search the tree space. In each step, several taxa are randomly pruned from the current best tree. The proportion of deleted leaves is determined by the option `-p probability'. These leaves are then reinserted into the tree in random order following the IQP algorithm, which takes the `-k representatives' parameter into account. The resulting full tree is rearranged according to the NNI algorithm, yielding an intermediate tree. If this intermediate tree has a better likelihood, it replaces the current best tree. This completes one iteration of the IQPNNI algorithm.
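The pruning step can be illustrated with a short shell sketch. It is not taken from the IQPNNI sources: the function name pick_deleted_leaves is hypothetical, and the sketch assumes that each leaf is drawn for deletion independently with the probability given by `-p'. Reinsertion by IQP and the NNI rearrangement are not shown.

```bash
# Hypothetical sketch of the pruning step (not IQPNNI code).
#   $1 = deletion probability (as given to -p)
#   $2 = random seed
#   remaining arguments = leaf (taxon) names
# Assumption: every leaf is deleted independently with probability p.
pick_deleted_leaves() {
    p="$1"
    seed="$2"
    shift 2
    printf '%s\n' "$@" | awk -v p="$p" -v seed="$seed" '
        BEGIN { srand(seed) }
        # rand() is uniform on [0,1); print the leaf if it is drawn for deletion
        rand() < p { print }
    '
}
```

A call such as `pick_deleted_leaves 0.5 42 A B C D E' prints the leaves chosen for deletion, one per line. With p = 1 every leaf is printed and with p = 0 none is.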
-m model
For DNA alignments the following models are implemented: JC69, K2P, F81, HKY85, TN93, and GTR.
For protein alignments: WAG, Dayhoff, JTT, VT, mtREV, rtREV, and Blosum.
Note that the BLOSUM62 matrix should preferably not be used for phylogenetic reconstruction, because it was constructed for database searches and does not reflect an evolutionary process.
For codon models: GY94, YN98, NY98, CP98, CGTR, and CPR.
-w rate_type
Note that with the `-w sitespec' option, the tree is first reconstructed based on the uniform rate model. In the second phase, this tree topology is used to infer site-specific rates until convergence. The procedure is described in Meyer and von Haeseler (2003).
-c num_rate
The number of gamma rate categories if `-w gamma' or `-w igamma' is specified. The default value is 4.
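When preparing batch jobs it can help to assemble the model options in one place. The following sketch prints a command line without running it. The function name iqpnni_cmd is hypothetical; only the flags -ni, -m, -w, and -c from the help text are used, and file names containing spaces are not handled.

```bash
# Hypothetical helper: print (do not run) an iqpnni command line.
#   $1 = alignment file
#   $2 = model name for -m
#   $3 = rate type for -w (default: uniform)
#   $4 = number of rate categories for -c (optional)
iqpnni_cmd() {
    aln="$1"
    model="$2"
    rate="${3:-uniform}"
    ncat="$4"
    case "$rate" in
        uniform|gamma|igamma|sitespec) ;;
        *) echo "unknown rate type: $rate" >&2; return 1 ;;
    esac
    cmd="iqpnni -ni -m $model -w $rate"
    if [ -n "$ncat" ]; then
        case "$rate" in
            # -c is meaningful for gamma and igamma only
            gamma|igamma) cmd="$cmd -c $ncat" ;;
            *) echo "-c is dropped for rate type $rate" >&2 ;;
        esac
    fi
    echo "$cmd $aln"
}
```

For instance, `iqpnni_cmd alignment.phy HKY85 gamma 4' prints a command that can be pasted into a job script.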
A user-defined protein model can be given with `-m filename'. An example file which defines the cpREV model (Adachi et al., 2000) is:

105
227 357
175 43 4435
669 823 538 10
157 1745 768 400 10
499 152 1055 3691 10 3122
665 243 653 431 303 133 379
66 715 1405 331 441 1269 162 19
145 136 168 10 280 92 148 40 29
197 203 113 10 396 286 82 20 66 1745
236 4482 2430 412 48 3313 2629 263 305 345 218
185 125 61 47 159 202 113 21 10 1772 1351 193
68 53 97 22 726 10 145 25 127 454 1268 72 327
490 87 173 170 285 323 185 28 152 117 219 302 100 43
2440 385 2085 590 2331 396 568 691 303 216 516 868 93 487 1202
1340 314 1393 266 576 241 369 92 32 1040 156 918 645 148 260 2151
14 230 40 18 435 53 63 82 69 42 159 10 86 468 49 73 29
56 323 754 281 1466 391 142 10 1971 89 189 247 215 2370 97 522 71 346
968 92 83 75 592 54 200 91 25 4797 865 249 475 317 122 167 760 10 119

0.0755 0.0621 0.0410 0.0371 0.0091 0.0382 0.0495 0.0838 0.0246 0.0806
0.1011 0.0504 0.0220 0.0506 0.0431 0.0622 0.0543 0.0181 0.0307 0.0660

The format is as follows. The first 19 lines describe the lower triangle of the amino acid replacement matrix. Then comes a list of 20 amino acid frequencies. The rest of the file will be ignored. The order of amino acids is:

A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
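A quick sanity check of a user model file can catch layout mistakes before a long run. The sketch below is not part of IQPNNI: the function name check_protein_model is hypothetical, and it assumes that the 19 matrix lines come first without blank lines in between, and that the 20 frequencies may be spread over several lines. It only counts values; it does not judge whether the rates are sensible.

```bash
# Hypothetical checker for a user protein model file (not part of IQPNNI).
#   $1 = model file: 19 lower-triangle lines, then 20 frequencies
check_protein_model() {
    awk '
        NR <= 19 {
            # line i of the lower triangle must hold exactly i rates
            if (NF != NR) {
                printf "line %d: expected %d values, found %d\n", NR, NR, NF
                bad = 1
                exit
            }
            next
        }
        nfreq < 20 {
            # collect the frequencies; anything after the 20th is ignored
            for (i = 1; i <= NF && nfreq < 20; i++) { sum += $i; nfreq++ }
        }
        END {
            if (bad) exit 1
            if (NR < 19) { print "fewer than 19 matrix lines"; exit 1 }
            if (nfreq != 20) { printf "expected 20 frequencies, found %d\n", nfreq; exit 1 }
            printf "ok: 190 rates, 20 frequencies (sum %.4f)\n", sum
        }
    ' "$1"
}
```

The function returns a non-zero status when the layout does not match, so it can be used as a guard in a script before `iqpnni -m filename' is started.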
GENERAL OPTIONS
 z  Construct a sample tree by bootstrap? No
 o  Display as outgroup? FL-1-103
 n  Minimum number of iterations? 200
 s  Stopping rule? No, stop after 200 iterations
IQP OPTIONS
 p  Probability of deleting a sequence? 0.5
 k  Number representatives? 4
SUBSTITUTION PROCESS
 d  Type of sequence input data? Nucleotides
 m  Model of substitution? HKY85 (Hasegawa et al. 1985)
 t  Ts/Tv ratio (0.5 for JC69)? Estimate from data
 f  Base frequencies? Estimate from data
RATE HETEROGENEITY
 r  Model of rate heterogeneity? Uniform rate

quit [q], confirm [y], or change [menu] settings:
In the following the available options will be briefly introduced.
See Section for more details of these parameters.
The subsequent options depend on the type of data and model selected. For DNA models the following options are available:
For protein models:
For codon models:
The run results as well as the input parameters are summarized in PREFIX.iqpnni. PREFIX is by default the input alignment file name. However, if the -prefix <prefix_out> option is specified, PREFIX is set to <prefix_out>.
The resulting tree will be written to PREFIX.iqpnni.treefile in Newick format.
If Gamma, Gamma+I, or Meyer and von Haeseler's site-specific model is used, the rates for each alignment position will be written to PREFIX.iqpnni.rate.
IQPNNI will also create several files:
PREFIX.iqpnni.bionj - BioNJ tree, in Newick format.
PREFIX.iqpnni.treels - List of all intermediate trees, if option -con is specified.
PREFIX.iqpnni.dist - Maximum likelihood distance matrix based on the specified model, in Phylip format.
PREFIX.iqpnni.sitelh - Site likelihood, if option -wsl is specified.
PREFIX.iqpnni.patlh - Pattern frequency and likelihood, if option -wpl is specified.
PREFIX.iqpnni.checkpoint - the program's current parameters, which will be loaded in case of a crash or interruption.
PREFIX.iqpnni.prediction - is used internally by the stopping rule. This file is necessary for recovering from crash or interruption.
PREFIX.iqpnni.bootsample - the bootstrap alignment resampled from the original alignment, if option -bs is specified. This file is also necessary for recovering from crash or interruption.
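To see at a glance which of these files a run has produced, a small helper can walk through the names listed above. This is a sketch only; the function name iqpnni_outputs is hypothetical, and which files exist depends on the options used (for example -con, -wsl, -wpl, -bs).

```bash
# Hypothetical helper: list the output files of a run and flag which exist.
#   $1 = PREFIX (alignment file name, or the value given to -prefix)
iqpnni_outputs() {
    prefix="$1"
    # "" stands for the summary file PREFIX.iqpnni itself
    for ext in "" .treefile .rate .bionj .treels .dist .sitelh .patlh \
               .checkpoint .prediction .bootsample; do
        f="$prefix.iqpnni$ext"
        if [ -e "$f" ]; then
            echo "present  $f"
        else
            echo "missing  $f"
        fi
    done
}
```

For example, `iqpnni_outputs alignment.phy' reports on a default run, and `iqpnni_outputs bs-1' on a run started with `-prefix bs-1'.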
See below for information how to install/build the different versions of the IQPNNI software. Executable versions of the sequential, that is, non-parallel program are intended for a number of operating systems. The parallel program (pIQPNNI) has to be built from the sources, as it depends on the MPI library locally installed in your system.
If you encounter problems, please ask your local administrator for help.
To build IQPNNI from the sources you need a C++ compiler installed. This is usually the case on UNIX/Linux systems. For Windows you might want to obtain Cygwin, MinGW, or MS Visual C++; for Mac OS X, Xcode. Then you can follow the procedure below:
./configure
This should configure the package for the build. You might also refer to the INSTALL file for more (general) details.
make
This compiles and builds the executable iqpnni (or iqpnni.exe on Windows systems), which can be found in the src directory. This executable can be copied to a directory in your system's search path so that it is found by your system, or it can be installed to the default destination (e.g., /usr/local/bin on UNIX/Linux) using
make install
If you encounter problems, please ask your local administrator for help.
There will be no binary version of the parallel program because it depends on the MPI library you have installed locally.
To build the MPI-parallel version of IQPNNI (pIQPNNI) you need a functional C++ compiler installed. This is usually the case on UNIX/Linux systems. For Windows you might want to obtain Cygwin; for Mac OS X, Xcode. In addition, you have to install an implementation of the MPI (Message Passing Interface) library. A list of (free) implementations is available at http://www.lammpi.org/mpi/implementations/.
Then you can follow the procedure below:
env CXX=mpiCC CXXFLAGS="-DPARALLEL -O2" ./configure
This should configure the package for the build using mpiCC as the C++ compiler. You might also want to refer to the INSTALL file for more (general) details.
make
This compiles and builds the executable iqpnni (or iqpnni.exe on Windows systems), which can be found in the src directory. This executable should be renamed to piqpnni and copied to a directory in your system's search path so that it is found by your system.
If you encounter problems, please ask your local administrator for help.
- Zero state frequencies: they are now replaced by a very small number.
- Checkpoint: now correctly recovered from stopped point.
- Restriction on number of sites: from limit 100,000 to unlimited now.
- Parallel version on Infiniband system under MPICH.
Some parts of the code were taken from the TREE-PUZZLE package (Schmidt et al., 2002). The source code to construct the BIONJ tree was taken from the BIONJ software (Gascuel, 1997).
Financial support from the Wiener Wissenschafts-, Forschungs- and Technologiefonds (WWTF) is greatly appreciated.