Practical ML session for the 12th International Workshop on Virus Evolution and Molecular Epidemiology, Athens, 2006

    http://www.cibiv.at/~hschmidt/veme/


    Slides and Software

    • Slides of the PhyloInfo session.
    • The software can be downloaded from here.

    The exercise:


    1. Use likelihood mapping to find out whether sequence 1 (SouthCarolina1918) clusters with human (h*), swine (s*), or avian (a*) viruses. Use either of these datasets:
      flu-a-full.phy or
      flu-a-1000.phy

      Hint: We will examine this partial dataset on the Spanish Flu (South Carolina, 1918) from Worobey et al. (2002; DOI: 10.1126/science.296.5566.211a) using a 4-cluster likelihood mapping. Start tree-puzzle in the same directory as the dataset and enter the dataset name when prompted. Switch the type of analysis to likelihood mapping ('b') and group the sequences into 4 clusters ('g'):

      • avian/bird virus sequences (starting with 'a') to cluster A
      • swine/pig virus sequences (starting with 's') to cluster B
      • human virus sequences (starting with 'h') to cluster C
      • Spanish Flu virus (SouthCarolina1918) to cluster D
      • (sequences can be excluded from analysis by assigning them to X - not needed here!)
      Start the analysis by typing 'y'.
      The results are written to flu-a-full.phy.puzzle or flu-a-1000.phy.puzzle, and the likelihood mapping plot to flu-a-full.phy.eps or flu-a-1000.phy.eps.
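
      The grouping rule above can be written down as a small lookup, which is handy for checking cluster assignments before typing them into tree-puzzle. This is only a sketch: the helper names, the example sequence names, and the assumption that long names may be cut to 10 characters in the PHYLIP file are illustrative and not taken from the datasets.

```python
# Hypothetical helper: map a sequence name to a likelihood-mapping cluster.
# Assumption: the Spanish Flu name may be truncated to 10 characters in the file.
SPANISH_FLU_PREFIX = "SouthCarol"


def assign_cluster(name: str) -> str:
    """Return 'A', 'B', 'C', 'D', or 'X' (excluded) for one sequence name."""
    name = name.strip()
    if name.startswith(SPANISH_FLU_PREFIX):
        return "D"  # Spanish Flu virus (SouthCarolina1918)
    prefix_to_cluster = {"a": "A", "s": "B", "h": "C"}  # avian, swine, human
    # Case-sensitive lookup on the first character; anything else is excluded.
    return prefix_to_cluster.get(name[:1], "X")


def group_names(names):
    """Collect sequence names per cluster, keeping the input order."""
    clusters = {}
    for name in names:
        clusters.setdefault(assign_cluster(name), []).append(name)
    return clusters


# Made-up example names, only to show the call.
example = ["SouthCarolina1918", "aDuck", "sIowa", "hPuertoRico"]
for cluster, members in sorted(group_names(example).items()):
    print(cluster, members)
```

      The Spanish Flu name is checked first and the prefix lookup is case-sensitive, so the capital 'S' of SouthCarolina1918 is not mistaken for the swine prefix 's'.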

    2. Run a tree reconstruction of flu-a-full-test.phy. Identify the outlier, remove it from the dataset, and re-analyse the reduced dataset with IQPNNI and TREE-PUZZLE.

      Hint:

      • Start tree-puzzle in the same directory as the dataset and enter the dataset name when prompted. (You might change parameters such as the model of evolution.) Examine the output in flu-a-full-test.phy.puzzle with a text editor. In the QUARTET STATISTICS part in particular, look for the species with the most unresolved or partly resolved quartets and remove it from the dataset. Save the new alignment as flu-a-new.phy.
      • Start IQPNNI in the same directory as the dataset, enter the dataset name flu-a-new.phy when prompted, and reconstruct an ML tree. The tree is written to flu-a-new.phy.iqpnni.treefile. Examine it with treeview. Note that this tree has no support values.
      • Start tree-puzzle in the same directory as the dataset, enter the dataset name flu-a-new.phy when prompted, and redo the analysis. Compare this tree and its support values with the tree reconstructed earlier with IQPNNI. Are there groups that are not found in both trees? Which groups have tree-puzzle support values of less than 70%?
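
      Removing the outlier in the first step can be done in a text editor, but remember to lower the sequence count in the first line of the file as well. The sketch below does both. It assumes a sequential PHYLIP layout with one line per sequence and whitespace between name and sequence. The real files may be interleaved or use fixed 10-character names, in which case this helper (whose name is hypothetical) would need adapting.

```python
def remove_taxon(phylip_text: str, taxon: str) -> str:
    """Drop one sequence from a sequential PHYLIP alignment and fix the header.

    Assumption: header line "<n_taxa> <n_sites>", then one line per sequence
    with the name separated from the sequence by whitespace.
    """
    lines = [ln for ln in phylip_text.splitlines() if ln.strip()]
    n_taxa, n_sites = (int(x) for x in lines[0].split()[:2])
    records = lines[1:]
    if len(records) != n_taxa:
        raise ValueError("expected one line per sequence (sequential PHYLIP)")
    kept = [ln for ln in records if ln.split()[0] != taxon]
    if len(kept) != n_taxa - 1:
        raise ValueError(f"taxon {taxon!r} not found exactly once")
    # Write the reduced sequence count back into the header line.
    return "\n".join([f"{len(kept)} {n_sites}"] + kept) + "\n"


# Toy alignment with made-up names, only to show the call.
toy = "3 4\nalpha ACGT\nbeta  ACGA\ngamma ACGG\n"
print(remove_taxon(toy, "beta"))
```

      For the exercise, one would read flu-a-full-test.phy, pass the name of the outlier found in the QUARTET STATISTICS part, and write the returned text to flu-a-new.phy.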

    3. Evaluate and test the trees in flu-a-full.trees (with the alignment file flu-a-full.phy or flu-a-1000.phy). Is there a clearly supported tree?

      Hint: Start tree-puzzle in the same directory as the dataset and enter the dataset name when prompted; you will be asked for the tree file later. To test trees on a dataset, switch the tree search procedure to evaluate user-defined trees ('k'). Change the option for parameter estimation to use a neighbor-joining tree ('x'). Start the analysis by typing 'y'.

      Examine the results in flu-a-full.phy.puzzle or flu-a-1000.phy.puzzle with a text editor. At the end is a table of results from three different tests (Kishino-Hasegawa test, Shimodaira-Hasegawa test, Expected Likelihood Weights) covering all the trees. Trees marked with '-' are significantly worse than the best tree, while those marked with '+' are not.
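
      To get a feeling for what the Expected Likelihood Weights column expresses, here is a minimal sketch of the general idea: each tree's likelihood is normalised against all candidate trees, the weights are averaged over resampled datasets, and trees are collected in order of decreasing weight until a confidence level is reached. The function names, the 0.95 level, and the input layout are assumptions for illustration; how tree-puzzle resamples and computes its values internally is not covered here and may differ.

```python
import math


def likelihood_weights(log_likelihoods):
    """Normalise the likelihoods of competing trees so that they sum to 1."""
    best = max(log_likelihoods)
    # Subtract the best value before exponentiating to avoid underflow.
    relative = [math.exp(lnl - best) for lnl in log_likelihoods]
    total = sum(relative)
    return [r / total for r in relative]


def expected_likelihood_weights(replicate_log_likelihoods):
    """Average the likelihood weights over replicates (one row per replicate)."""
    n_trees = len(replicate_log_likelihoods[0])
    sums = [0.0] * n_trees
    for row in replicate_log_likelihoods:
        for i, weight in enumerate(likelihood_weights(row)):
            sums[i] += weight
    n_replicates = len(replicate_log_likelihoods)
    return [s / n_replicates for s in sums]


def confidence_set(weights, level=0.95):
    """Indices of trees, by decreasing weight, until the level is reached."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += weights[i]
        if cumulative >= level:
            break
    return chosen
```

      In this picture, trees inside the confidence set correspond to the '+' entries of the table and trees outside it to the '-' entries.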

      There is a good overview and discussion of testing trees by Goldman et al. (2000; DOI: 10.1080/106351500750049752); free copies can be found via Google.