Copyright 2006-2007 by | Bui Quang Minh, Steffen Kläre, and Arndt von Haeseler |
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
Given n taxa (e.g. species) connected in a phylogenetic tree or a split network, this software computes a subset of k taxa with maximal phylogenetic diversity (PD). For trees the PD of any subset of taxa is defined as the sum of the branch-lengths of the minimal subtree connecting the taxa in the subset (Faith, 1992). For split networks we extend the PD notion as the sum of the weights of all the splits which separate at least two taxa in the subset.
For trees the program implemented two efficient algorithms (gPDA, pPDA; Minh et al., 2006) based on the greedy strategy which always finds the maximal PD set of a given size k (Steel, 2005; Pardi and Goldman, 2005). For split networks we have recently proposed a dynamic programming algorithm for so-called circular split networks, a commonly used class of networks. Such a network can be reconstructed by e.g. the Neighbor-net method (Bryant and Moulton, 2004). For general networks the software uses an exhaustive search which is only applicable if the subset size k is small.
PDA is available free of charge from
http://www.cibiv.at/software/pda/
PDA is written in C++ using Standard Template Library (STL). It will run on most personal computers and workstations if compiled by an appropriate C++ compiler. Please read the Installation section 9 for more details.
We suggest that this documentation should be read before using PDA the first time. To find out what's new in the current version please read the Version History section 10.
The methods are described in detail in the following papers:
Minh, B. Q., Klaere, S. and von Haeseler, A. (2006) Phylogenetic diversity within seconds. Syst. Biol., 55, 769-773.
A manuscript describing the method for split networks is in preparation.
Usage: ./pda [OPTIONS] <file_name> BASIC OPTIONS: -h Print this help dialog. <file_name> User tree in NEWICK format or split network in NEXUS format. -k <num_taxa> Number of taxa to be preserved. -o <taxon> Root name to compute optimal rooted PD. If not specified, consider tree/network as unrooted. -i <file> File containing taxa to be included into PD set. -e <file> File containing branch/split coefficient and taxa weights. OPTIONS FOR TREE: -g, --greedy Run greedy algorithm only. -p, --pruning Run pruning algorithm only. -b, --both Run both algorithms. NOTE: by default, the program automatically chooses suitable algorithm. OPTIONS FOR SPLIT NETWORK: -mk <min_taxa> Find all PD sets with size from <min_taxa> to <num_taxa>. -all Identify multiple optimal PD sets. -exhaust Force to use exhaustive search. NOTE: by default, the program applies dynamic programming algorithm on circular networks and exhaustive search on general networks.
./pda test.tree -k 4
Infer the maximal PD-tree of 4 taxa from the tree in test.tree (in NEWICK format). gPDA or pPDA algorithm (Minh et al., 2006) will be determined automatically. Resulting tree will be written to test.tree.4.pdtree. Resulting taxa set will be printed to test.tree.4.pdtaxa.
NOTE: The program will automatically detect the type of the input file (either NEWICK or NEXUS) to apply appropriate PDA algorithms. It should not depend on the file name (.tree or .nex does not matter).
./pda test.tree -k 4 -o c
Compute the ``rooted PD", the tree is rooted at taxon 'c'. 'c' will be included into the final PD-set. See section 5.2 for more detail.
./pda test.tree -k 4 -g
Same as the first command, but only apply the gPDA algorithm.
./pda test.tree -k 4 -b
Run both algorithms. Resulting trees will be written into test.tree.4.greedy and test.tree.4.pruning.
./pda test.tree -k 4 -e test.pam
Read the weight information from test.pam file (more detail in section 5.3) and integrate this into the tree in test.tree. Then run the program as the first example command.
./pda test.tree -k 4 -i test.taxa
Include the ``favourite" taxa listed in test.taxa (more detail in section 5.4) into the final PD-set.
./pda test.tree -k 4 -e test.pam -i test.taxa
Combining both features of the above two example commands.
./pda 1000.tree -r 1000
Generate a 1000-taxa random tree under Yule Harding Model. Write resulting tree into 1000.tree file under NEWICK format.
./pda test.nex -k 3
Find the maximal PD3 set of the split network in test.nex (in NEXUS format, as produced by e.g., SplitsTree 4 program (Huson and Bryant, 2006)). PDA will detect whether the input split system is circular or not. If yes, apply the dynamic programming algorithm, otherwise, use exhaustive search. Resulting taxa set will be printed to test.nex.3.pdtaxa.
./pda test.nex -k 3 -o 2
Compute the ``rooted PD", the split system is rooted at taxon '2'. '2' will be included into the final PD-set. See section 5.2 for more detail.
NOTE: With this option, the program will normal perform much faster (the time-complexity reduces by a factor of n, where n is the number of taxa). So always specify -o <taxon_name> if you are sure that some taxon must be present in the final PD set (e.g. the taxon with a very long terminal split).
Other basic options (-i, -e <file>) should also work fine with split network.
./pda test.nex -k 4 -mk 2
Identify all optimal PD sets containing 2 to 4 taxa. The resulting PD sets will be printed to test.nex.2.pdtaxa, test.nex.3.pdtaxa, test.nex.4.pdtaxa. The PD scores are written to test.nex.score containing several lines. Each line as <sub_size> <corresponding_score>, where <sub_size> should go from 2 to 4.
./pda test.nex -k 3 -all
Find all multiple optimal PD3 sets: if there are more than 1 optimal 3-set, all of them will be printed. The second optimal set will be in test.nex.3.pdtaxa.1, the third in test.nex.3.pdtaxa.2, etc.
NOTE: This optimal might lead to exponential computing time, as it actually depends on the number of optimal PD sets!
./pda test.nex -k 4 -mk 2 -all
Combine the features of the two previous commands.
More information on NEWICK tree format can be found at http://evolution.genetics.washington.edu/phylip/newicktree.html.
More information on NEXUS file format can be found in the article Maddison et al. (1997) or at http://awcmee.massey.ac.nz/spectronet/nexus.html.
Please note that the additional coefficient/weights will be incoporated into the resulting PD tree / sub-split system, i.e., the final tree / split system will also reflect the coefficient and weights in its branch lengths / split weights.
More information on those additional parameters can be found in Steel (2005).
This option might be handy in comparative genomics when you have already sequenced several species and have to make a decision what species to be sequenced next. Then the species names, which were already sequenced, can be listed in this file. See Pardi and Goldman (2005) for a discussion.
For split network, if the option -all is specified and multiple optimal is observed, subsequent optimal taxa sets will be written into: <file_name>.<k>.pdtaxa.1, <file_name>.<k>.pdtaxa.2, ... The score is printed to <file_name>.score.
For tree, resulting sub-trees are written into:
NOTE: for the case of split network, no resulting sub-network is written.
If you choose option to generate a random tree/network, it will be written to the <file_name>.
There could be several blocks in mynet.nex input file. However, PDA only cares for TAXA and SPLITS blocks. All other will be ignored, including CHARACTERS, DISTANCES, NETWORKS, ST_ASSUMPTIONS, etc. You can also prepare your own split network. A simple input file is inside the src/ folder under the name test.nex.
NOTE: The algorithm for circular network is very fast. So always specify the CYCLE command inside the SPLITS block of the NEXUS file. Otherwise, an exhaustive search will be applied and very slow.
Simply use -i <file> option together with -k <num_taxa> where <num_taxa> is equal to the number of taxa listed in the <file>.
You might recognize that this feature is not available for trees, but only for networks. So here is a simple way. Open the tree file, e.g., test.tree with SplitsTree 4, then save it to another file, e.g., test.tree.nex. Then run PDA on test.tree.nex and PDA will consider it as a circular network! Now you can use -all option in the command line.
To build PDA from the sources you need a functional C++ compiler installed (This is usually the case on UNIX/Linux systems. For Windows you might want to obtain CygWin or XCode for MacOSX). Then you can follow the procedure below:
./configure
This should configure the package for the build. You might also want to refer to the INSTALL file for more (general) details.
make
This compiles and builds the executable 'pda' (or 'pda.exe' on Windows systems) to be found in the 'src' directory. This executable can copied to your system's search path such that it is found by your system or it can be installed to the default destination (e.g., /usr/local/bin on UNIX/Linux) using
make install
If you encounter problems, please ask your local administrator for help.
The parser for NEXUS file format is derived from the Nexus Class Library (Lewis, 2003).
The authors would like to thank Dan Faith for helpful suggestions on the subject and Heiko Schmidt and Tanja Gesell for fruitful discussions. We also thank Mike Steel for constructive comments. Financial support from the Wiener Wissenschafts-, Forschungs- and Technologiefonds (WWTF) is greatly appreciated.