Copyright 2006-2008 by Bui Quang Minh, Steffen Kläre, and Arndt von Haeseler |
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
Phylogenetic Diversity (PD), coined by Faith (1992), is a qualitative measure to assess the biodiversity of species based on a phylogeny. Given taxa (or species) connected in a phylogenetic tree, the PD of any subset of taxa is defined as the sum of the branch-lengths of the minimal subtree connecting these taxa. We have recently proposed an extension of PD for split networks (or split systems). Given a split system of
taxa, the PD of any subset of taxa is equal to the sum of the weights of all the splits separating at least two taxa in the subset. This definition coincides with the tree-based PD when the underlying split system turns out to be a tree.
Phylogenetic Diversity Analyzer (PDA) is a software tool to analyze various aspects of PD and PD-related measures based on both phylogenetic trees and networks. The major features include:
PDA was previously named ``Phylogenetic Diversity Algorithm''. Since version 0.5 we decided to change it to Phylogenetic Diversity Analyzer due to its extended functionalities. An easy-to-use web-interface is now made available at
http://www.cibiv.at/software/pda/web-pda/.
Source code is available free of charge under GNU GPL license from
http://www.cibiv.at/software/pda/
PDA is written in C++ using Standard Template Library (STL). It will run on most personal computers and workstations if compiled by an appropriate C++ compiler. Please read the Installation section 6 for more details.
We suggest that this documentation should be read before using PDA the first time. To find out what's new in the current version please read the Version History section 7.
The methods are described in detail in the following papers:
Usage: pda [OPTIONS] <file_name> [<output_file>] GENERAL OPTIONS: -h Print this help dialog. Use -hh to display all options. <file_name> User tree in NEWICK format or split network in NEXUS format. <output_file> Output file to store results, default is '<file_name>.pda'. -k <num_taxa> Find optimal PD set of size <num_taxa>. -k <min>:<max> Find optimal PD sets of size from <min> to <max>. -k <min>:<max>:<step> Find optimal PD sets of size <min>, <min>+<step>, <min>+2*<step>,... -o <taxon> Root name to compute rooted PD, default is unrooted. -i <file> File containing taxa to be included into PD set. -e <file> File containing branch/split scale and taxa weights. -a, -all Identify multiple optimal PD sets. -lim <max_limit> The maximum number of PD sets for each k if -a is specified. -min Compute minimal PD sets, default is maximal PD sets. -1out Also print taxa sets and scores to separate files. -oldout Also print output compatible with version 0.3. -v Verbose mode. OPTIONS FOR TREE: -root Make the tree ROOTED, default is unrooted. NOTE: this option and -o <taxon> cannot be both specified. -g, --greedy Run greedy algorithm only. -p, --pruning Run pruning algorithm only. NOTE: by default, the program automatically chooses suitable algorithm. OPTIONS FOR SPLIT-NETWORK: -exhaust Force to use exhaustive search. NOTE: by default, the program applies dynamic programming algorithm on circular networks and exhaustive search on general networks. OPTIONS FOR BUDGET-CONSTRAINT: -u <file> File containing total budget and taxa preservation costs. -b <budget> Total budget to conserve taxa. -b <min>:<max> Find all PD sets with budget from <min> to <max>. -b <min>:<max>:<step> Find optimal PD sets with budget <min>, <min>+<step>, <min>+2*<step>,... OPTIONS FOR AREA ANALYSIS: -s <taxa_file> Compute PD of areas (user-defined sets) in <taxa_file>. -excl Compute area exclusive PD. -endem Compute area endemic PD. -compl <areas> Compute area PD-complementarity given the listed <areas>.
More information on NEWICK tree format can be found at http://evolution.genetics.washington.edu/phylip/newicktree.html.
More information on NEXUS file format can be found in the article Maddison et al. (1997) or at http://awcmee.massey.ac.nz/spectronet/nexus.html.
This is to set the output file name instead of the default
<file_name>.pda where
<file_name> is the input file name defined above.
Please note that the additional coefficient/weights will be incoporated into the resulting PD tree / sub-split system, i.e., the final tree / split system will also reflect the coefficient and weights in its branch lengths / split weights.
More information on those additional parameters can be found in Steel (2005).
This option might be handy in comparative genomics when you have already sequenced several species and have to make a decision what species to be sequenced next. Then the species names, which were already sequenced, can be listed in this file. See Pardi and Goldman (2005) for a discussion.
This option allows you to identify multiple optimal sets for a specific
.
This is useful in case there are more than one optimal sets with the same
score. Note that if you specify -k <min>:<max>, PDA will
handle correctly for each
. The -a option can be used in conjunction
with -lim <max_limit> option (see the next sub-section).
When you set -a option, the number of multiple optimal sets for each
to be reported will be limited to at most <max_limit>. The default
limit is 100 if you don't specify. This is simply to avoid PDA from memory overflow
if millions of such optimal sets exist.
This option tells PDA to find the minimal sets instead of the
default maximal ones. Note that algorithmically, on trees the greedy
algorithm does not work anymore for
minimization. However, the
dynamic programming algorithm presented in Minh et al. (2007) can be easily
adapted for this case by negating all the branch lengths!
This tells PDA to write the list of taxa sets into *.pdtaxa
file and the scores into *.score file. See Section
3 for more details.
This is for compatibility reason only since by default, version 0.5 only produces the output file *.pda which contains more information than only optimal sets, scores, and sub-trees. So this option tells PDA to write extra resulting files as outputed in version 0.3. See Section 3 for more details.
From version 0.5 PDA is extended to cope with budget constraints.
The extended problem is formulated as follows. Given a tree or split network,
integer preservation costs for each taxon
and a total integer budget
. Find a
subset
of taxa to maximize
such that the total cost do not exceed
the given budget:
. The restriction to integer numbers is not limitation
since budgets are normally expressed in integer. If not, it can be easily
transformed into integer. This problem is not solvable by a greedy strategy but
by a dynamic programming algorithm. A paper describing this is still in preparation.
The list of areas is given in <taxa_file>. This file can be in one of the two formats: simple text file or NEXUS format. PDA will automatically detect the type of this file.
#nexus begin sets; taxset 'a1' = 'a' 'b' 'c'; taxset a2 = c d g n; taxset a3 = h i; taxset a4 = j k; taxset a5 = l m; taxset a6 = h i; taxset a7 = j k; taxset a8 = l m; end; [sets]
This tells PDA to compute the exclusive of all areas as well.
In short, given
the set of all taxa and
an area,
the exclusive
of
is simply:
.
See Lewis and Lewis (2005) for the original description of this measure.
This tells PDA to compute the endemic of all areas as well.
Given
the set of all taxa and
all areas
you have. Let
be the union taxon set of all areas.
Then the endemic
of a particular area
is:
.
See Faith et al. (2004) for more details and interpretation.
This tells PDA to compute the -complementarity
of all areas given the list <areas>. The list can contain
one area name or several area names separated by commas.
Let
the union taxon set of all given areas. Then the
-complementarity of a particular area
is:
.
See Faith et al. (2004) for more details and interpretation.
If you run 'pda -hh', PDA will display some more features:
GENERATING RANDOM TREES: -r <num_taxa> Create a random tree under Yule-Harding model. -ru <num_taxa> Create a random tree under Uniform model. -rcat <num_taxa> Create a random caterpillar tree. -rbal <num_taxa> Create a random balanced tree. -rcsg <num_taxa> Create a random circular split network. -rlen <min_len> <mean_len> <max_len> minimum, mean, and maximum branch lengths of the random trees. MISCELLANEOUS: -d <sample_size> Compute PD distribution of random sets of size k. -dist <outfile> Calculate the distance matrix inferred from tree. -seed <number> Set the seed for random number generator. -con <treefile> Compute an extended majority-rule consensus tree. -conet <treefile> Compute a consensus network. -sup <treefile> Assign support values of each node in input tree.
The usage is explained there. Feel free to use!
If you specify -1out, all the taxa sets are additionally written
to <file_name>.pdtaxa and the scores are written
to <file_name>.score.
If you specify -oldout, additional files are written as of version 0.3 as follows.
Resulting PD taxa set will be written into: <file_name>.<k>.pdtaxa
If the option -a or -all is specified and multiple optimal
is observed, subsequent optimal taxa sets will be written into:
<file_name>.<k>.pdtaxa.1,
<file_name>.<k>.pdtaxa.2,...
The score is printed to <file_name>.score.
For tree, resulting sub-trees are written into:
NOTE:
./pda test.tree -k 4
Infer the maximal PD-tree of 4 taxa from the tree in test.tree (in NEWICK
format). or
algorithm (Minh et al., 2006) will be determined automatically.
Resulting tree will be written to test.tree.4.pdtree. Resulting taxa set
will be printed to test.tree.4.pdtaxa.
NOTE: The program will automatically detect the type of the input file (either NEWICK or NEXUS) to apply appropriate PDA algorithms. It should not depend on the file name (.tree or .nex does not matter).
./pda test.tree -k 4 -o c
Compute the ``rooted PD", the tree is rooted at taxon 'c'. 'c' will be included into the final PD-set. See section 2.1.4 for more detail.
./pda test.tree -k 4 -g
Same as the first command, but only apply the algorithm.
./pda test.tree -k 4 -b
Run both algorithms. Resulting trees will be written into
test.tree.4.greedy and test.tree.4.pruning.
./pda test.tree -k 4 -e test.pam
Read the weight information from test.pam file (more detail in section 2.1.5) and integrate this into the tree in test.tree. Then run the program as the first example command.
./pda test.tree -k 4 -i test.taxa
Include the ``favourite" taxa listed in test.taxa (more detail in section 2.1.6) into the final PD-set.
./pda test.tree -k 4 -e test.pam -i test.taxa
Combining both features of the above two example commands.
./pda 1000.tree -r 1000
Generate a 1000-taxa random tree under Yule Harding Model. Write resulting tree into 1000.tree file under NEWICK format.
./pda test.nex -k 3
Find the maximal set of the split network in test.nex (in NEXUS format,
as produced by e.g., SplitsTree 4 program (Huson and Bryant, 2006)). PDA will detect whether the
input split system is circular or not. If yes, apply the dynamic programming
algorithm, otherwise, use exhaustive search. Resulting taxa set will be printed
to test.nex.3.pdtaxa.
./pda test.nex -k 3 -o 2
Compute the ``rooted PD", the split system is rooted at taxon '2'. '2' will be included into the final PD-set. See section 2.1.4 for more detail.
NOTE: With this option, the program will normal perform much faster (the
time-complexity reduces by a factor of , where
is the number of taxa).
So always specify -o <taxon_name> if you are sure that some taxon must be
present in the final PD set (e.g. the taxon with a very long terminal split).
Other basic options (-i, -e <file>) should also work fine with split network.
./pda test.nex -k 4 -mk 2
Identify all optimal PD sets containing 2 to 4 taxa. The resulting PD sets
will be printed to test.nex.2.pdtaxa, test.nex.3.pdtaxa, test.nex.4.pdtaxa.
The PD scores are written to test.nex.score containing several lines. Each
line as
<sub_size> <corresponding_score>, where <sub_size> should go
from 2 to 4.
./pda test.nex -k 3 -all
Find all multiple optimal sets: if there are more than 1 optimal 3-set,
all of them will be printed. The second optimal set will be in
test.nex.3.pdtaxa.1, the third in test.nex.3.pdtaxa.2, etc.
NOTE: This optimal might lead to exponential computing time, as it actually depends on the number of optimal PD sets!
./pda test.nex -k 4 -mk 2 -all
Combine the features of the two previous commands.
There could be several blocks in mynet.nex input file. However, PDA only cares for TAXA and SPLITS blocks. All other will be ignored, including CHARACTERS, DISTANCES, NETWORKS, ST_ASSUMPTIONS, etc. You can also prepare your own split network. A simple input file is inside the src/ folder under the name test.nex.
NOTE: The algorithm for circular network is very fast. So always specify the CYCLE command inside the SPLITS block of the NEXUS file. Otherwise, an exhaustive search will be applied and very slow.
To build PDA from the sources you need a functional C++ compiler installed (This is usually the case on UNIX/Linux systems. For Windows you might want to obtain CygWin or XCode for MacOSX). Then you can follow the procedure below:
./configure
This should configure the package for the build. You might also want to refer to the INSTALL file for more (general) details.
make
This compiles and builds the executable 'pda' (or 'pda.exe' on Windows systems) to be found in the 'src' directory. This executable can copied to your system's search path such that it is found by your system or it can be installed to the default destination (e.g., /usr/local/bin on UNIX/Linux) using
make install
If you encounter problems, please ask your local administrator for help.
The parser for NEXUS file format is derived from the Nexus Class Library (Lewis, 2003).
The authors would like to thank Dan Faith for helpful suggestions on the subject and Heiko Schmidt and Tanja Gesell for fruitful discussions. We also thank Mike Steel for constructive comments. Financial support from the Wiener Wissenschafts-, Forschungs- and Technologiefonds (WWTF) is greatly appreciated.