ML and Topology Testing session (Advanced)

ML and Topology Testing practical session for the 14th International Bioinformatics Workshop on Virus Evolution and Molecular Epidemiology, Cape Town, 2008

http://www.cibiv.at/~hschmidt/veme/ML-test

We will use the following software in the exercise:

TREE-PUZZLE, to compute the site likelihoods. (Version 5.3.rc1 should already be installed. You can also get it here: treepuzzle.exe / treepuzzle-clicky.exe.)
IQPNNI, a Maximum Likelihood program. (Version 5.3.rc1 should already be installed. You can also get it here: iqpnni.exe.)
Consel, to do further tests. (Since the original software does not work on the workshop machines, please download the recompiled versions of the following programs: makermt.exe, consel.exe, catpv.exe.)

and the following dataset files:

Create a new directory (e.g. 'ML-test') on your desktop and save the treepuzzle and iqpnni executables and the datasets there.

The exercise:

We will analyze a dataset of HIV1 group M sequences of types A, B, C, D, and G and some outgroup sequences (HIV1 group O and SIV).

Check for phylogenetic content
To do so, start tree-puzzle-clicky.exe in the same directory as the dataset and enter the dataset name (hivALN.phy) when prompted. To measure the phylogenetic content of a dataset, switch the type of analysis to likelihood mapping ('b') and do not group the sequences. Switch on rate heterogeneity ('w') and set the number of categories ('c') to 4. Start the mapping analysis by typing 'y'.
Examine the results for the dataset in hivALN.phy.puzzle with a text editor, and view the likelihood mapping diagram dataset-name.eps with a PostScript viewer such as gsview (should be installed).
Is the dataset suitable for phylogenetic analysis?
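To make the likelihood mapping diagram easier to read, here is a minimal Python sketch of the idea behind it. It is not TREE-PUZZLE's code. It assumes that each quartet contributes the log-likelihoods of its three possible topologies, that these are converted into weights summing to 1, and that each point in the triangle is assigned to the nearest of seven attractors (three corners for tree-like, three edge midpoints for net-like, the centre for star-like). The function names and region labels are hypothetical.

```python
import math

# Assumed attractors of the seven regions of the likelihood-mapping triangle.
ATTRACTORS = {
    "tree 1": (1.0, 0.0, 0.0),
    "tree 2": (0.0, 1.0, 0.0),
    "tree 3": (0.0, 0.0, 1.0),
    "net 1-2": (0.5, 0.5, 0.0),
    "net 1-3": (0.5, 0.0, 0.5),
    "net 2-3": (0.0, 0.5, 0.5),
    "star": (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0),
}


def posterior_weights(log_likelihoods):
    """Convert three quartet log-likelihoods into weights that sum to 1."""
    largest = max(log_likelihoods)
    # Subtract the maximum before exponentiating to avoid underflow.
    scaled = [math.exp(value - largest) for value in log_likelihoods]
    total = sum(scaled)
    return tuple(value / total for value in scaled)


def classify(weights):
    """Return the label of the attractor closest to the weight vector."""
    def squared_distance(name):
        return sum((w - a) ** 2 for w, a in zip(weights, ATTRACTORS[name]))
    return min(ATTRACTORS, key=squared_distance)


if __name__ == "__main__":
    # Hypothetical log-likelihoods for the three topologies of one quartet.
    weights = posterior_weights((-1000.0, -1010.0, -1010.0))
    print(weights, classify(weights))
```

A dataset with good phylogenetic content has most quartets in the three tree-like corner regions and few in the star-like centre.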
Reconstruct the Maximum Likelihood tree with IQPNNI.
Start iqpnni.exe in the same directory as the dataset with a double-click. Enter the dataset name (hivALN.phy) when prompted. Switch the model to incorporate rate heterogeneity ('r'). Start the analysis by typing 'y'.
The results will be found in hivALN.phy.iqpnni, and the ML tree in hivALN.phy.iqpnni.treefile.
View the estimated tree with a tree viewer like FigTree (should be installed).
Can we be sure about the groupings? Do they fit your view of HIV1 evolution?
Reconstruct a tree with TREE-PUZZLE to get support values.
Start tree-puzzle-clicky.exe with a double-click in the same directory as the dataset. Enter the dataset name (hivALN.phy) when prompted. Use option 'w' to switch rate heterogeneity on and option 'c' to set the number of Gamma rate categories to 4. Start the analysis by typing 'y'.
The results will be found in hivALN.phy.puzzle, the tree in hivALN.phy.eps.
View the two estimated trees with a tree viewer like FigTree (should be installed). The numbers are PUZZLE support values. They behave similarly to bootstrap values, but they are NOT the same.
What is different compared to the ML tree? What do the support values tell you?
Evaluate and test the trees with TREE-PUZZLE
An (unrooted) multifurcation with 5 branches can be resolved in 15 different ways. The file hivALN.15trees contains all 15 resolutions of the TREE-PUZZLE tree.
We now want to compute the ML value for each of them and compare these to see whether we can still find a best tree.
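The count of 15 follows from the standard formula for the number of unrooted binary trees on n leaves, (2n-5)!! = 3 x 5 x ... x (2n-5), applied to the 5 branches of the multifurcation. A small Python sketch (the function name is hypothetical):

```python
def count_unrooted_binary_trees(n_leaves):
    """Number of unrooted, fully resolved trees on n_leaves leaves: (2n-5)!!."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for factor in range(3, 2 * n_leaves - 4, 2):
        count *= factor
    return count


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(n, count_unrooted_binary_trees(n))
```

The number grows quickly with n, which is why exhaustively testing all resolutions is only practical for small multifurcations like this one.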
Start tree-puzzle-clicky.exe with a double-click in the same directory as the dataset. When asked whether you want to add command-line options, type 'y' and add the parameter '-wsl' to write the site likelihoods to a file ('*.sitelh'). Enter the dataset name (hivALN.phy) when prompted for a sequence file and hivALN.15trees when prompted for a tree file.
To test trees on a dataset, switch the tree search procedure to evaluate user-defined trees ('k'). Change the option for parameter estimation to use the neighbor-joining tree ('x'). Use option 'w' to switch rate heterogeneity on and option 'c' to set the number of Gamma rate categories to 4. Start the analysis by typing 'y'.
Examine the results in hivALN.15trees.puzzle with a text editor. At the end is a table containing all the trees with the results of three different tests (Kishino-Hasegawa test 1pKH, Shimodaira-Hasegawa test SH, Expected Likelihood Weights ELW). Trees marked with '-' are considered significantly worse than the best tree, while those marked with '+' are not.
Is there a clear preference for a single tree? If not, think of explanations why this might be the case.
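To see why site likelihoods are what these tests need, here is a minimal Python sketch of a one-sided Kishino-Hasegawa-style comparison of two trees using the normal approximation. It is an illustration under stated assumptions, not the procedure implemented in TREE-PUZZLE or Consel (which may use resampling instead). The function name and the example numbers are hypothetical.

```python
import math


def kh_test(site_lnl_best, site_lnl_other):
    """Compare two trees from their per-site log-likelihoods.

    Returns (delta, p) where delta is the total log-likelihood difference
    (best minus other) and p is a one-sided p-value from a normal
    approximation of the distribution of delta.
    """
    n = len(site_lnl_best)
    if n != len(site_lnl_other) or n < 2:
        raise ValueError("need two equally long lists with at least 2 sites")
    diffs = [a - b for a, b in zip(site_lnl_best, site_lnl_other)]
    delta = sum(diffs)
    mean = delta / n
    # Variance of the summed difference, estimated from the per-site spread.
    variance = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if variance == 0.0:
        raise ValueError("per-site differences are all identical")
    z = delta / math.sqrt(variance)
    # Upper tail of the standard normal distribution: P(Z >= z).
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))
    return delta, p_one_sided


if __name__ == "__main__":
    # Hypothetical per-site log-likelihoods for two trees over four sites.
    best = [-1.0, -2.0, -1.5, -1.2]
    other = [-1.5, -2.1, -2.5, -1.3]
    print(kh_test(best, other))
```

When the per-site differences are small relative to their spread, the p-value stays large and the second tree cannot be rejected, which is one reason several of the 15 trees may remain marked '+'.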
Use more tests implemented in the Consel program
Here we need the following programs from the Consel package: makermt.exe, consel.exe, catpv.exe. You have to download them here because the official (and pre-installed) executables do not work on these computers.
These programs have to be run in a DOS window. To get a DOS prompt, open the Command Prompt program from All Programs, then the Accessories directory. Change to the Desktop by typing cd Desktop, and then cd to your data directory. You can inspect the filenames in your directory with the command dir.
Consel has a bug in handling files with more than one dot in their names. Hence we want to rename the file hivALN.15trees.sitelh to hivALN.sitelh. You can use the file manager to do that, or the command copy hivALN.15trees.sitelh hivALN.sitelh. (Hint: you can use the tab key to auto-complete names. You have to hit it several times to cycle through the different choices one after the other.)
To do the analysis with Consel we have to run a few commands. First, Consel has to draw bootstrap samples of different sizes for all tests including the AU test: run 'makermt.exe --puzzle hivALN.sitelh'. Second, Consel performs all the different tests (KH, SH, wKH, wSH, and AU): 'consel.exe hivALN'. Finally, we view all p-values using 'catpv.exe -s 1 hivALN'.
The option '-s 1' causes the lines to be sorted according to the order in the tree file hivALN.15trees; otherwise the lines would be sorted by p-value.
We can also use 'catpv.exe -s 1 hivALN > somename.txt' to write the p-values to a file, which we can view with a text program like WordPad, Notepad, or Word.
The output contains the index (rank, which would differ when ordered by likelihood), the tree number in the input (item), and the log-likelihood difference to the best tree (obs); for the best tree itself, obs shows the negative difference to the second best. Among other statistics and information values, it prints the p-values of the AU test (au), the KH test (kh), the SH test (sh), and the weighted variants of the latter two (wkh and wsh).
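The Consel steps above can be collected in a small Python helper. This is only a sketch: it derives the single-dot base name that Consel needs and builds the three command lines quoted in the text, without running them. The function names are hypothetical, and the executables are assumed to sit in the current directory.

```python
from pathlib import Path


def consel_basename(sitelh_name):
    """Keep only the part before the first dot: 'hivALN.15trees.sitelh' -> 'hivALN'."""
    return Path(sitelh_name).name.split(".")[0]


def consel_commands(sitelh_name):
    """Build the three Consel command lines for a TREE-PUZZLE site-likelihood file.

    The file is assumed to have been copied to '<base>.sitelh' beforehand,
    because Consel cannot handle names with more than one dot.
    """
    base = consel_basename(sitelh_name)
    return [
        ["makermt.exe", "--puzzle", f"{base}.sitelh"],  # draw bootstrap samples
        ["consel.exe", base],                           # run KH, SH, wKH, wSH, AU
        ["catpv.exe", "-s", "1", base],                 # print p-values in tree-file order
    ]


if __name__ == "__main__":
    for command in consel_commands("hivALN.15trees.sitelh"):
        print(" ".join(command))
```

Each inner list could be passed to subprocess.run in the data directory if you prefer to script the analysis rather than type the commands at the DOS prompt.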

Resolution and References
There was no way to obtain a fully resolved tree. After earlier speculation that there might have been inter-subtype recombination, it has recently been shown that subtype G is not a pure subtype but rather a recombinant form (Abecasis et al., 2007).
- Abecasis, A. B., Lemey, P., Vidal, N., de Oliveira, T., Peeters, M., Camacho, R., Shapiro, B., Rambaut, A. and Vandamme, A.-M. (2007) Recombination is confounding the early evolutionary history of HIV-1: subtype G is a circulating recombinant form. J. Virol., 81, 8543-8551. DOI:10.1128/JVI.00463-07, PMID:17553886.
- Sanderson, M. J. and Shaffer, H. B. (2002) Troubleshooting Molecular Phylogenetic Analyses. Annu. Rev. Ecol. Syst., 33, 49-72. DOI:10.1146/annurev.ecolsys.33.010802.150509.
