A Short Course in Comparative Analysis of Molecular Sequence Data.
James McInerney,
Department of Biology,
National University of Ireland Maynooth,
Co. Kildare,
Ireland.
http://bioinf.may.ie/
Index:
Exercise  Name
A         PAUP
B         Trees
C         Characters
D         Reconstructions
E         Treefit
F         Heuristic Searches
G         Exhaustive Searches
H         Random
I         Bootstrap
Requirements:
Software - RETREE (from the PHYLIP package), PAUP (www.sinauer.com).
Operating System - Can be any operating system.
A
PAUP
Aim:
To familiarise the user with the PAUP program.
PAUP is one of the most versatile programs for comparative evolutionary
analysis of molecular sequences. In this exercise, you will become familiar
with how to use the software. PAUP has hundreds of commands and options;
you will use a small subset of them in this practical. To illustrate some of
the commands, we shall use a dummy dataset of protein sequences.
1. Read the file 'garfield.nex' (use the unix command 'more' to read the
file). The datafile is heavily commented. Note the structure of the data
file. This file is in NEXUS format, one of the most popular formats for
comparative datasets. The NEXUS data format consists of a series of
'blocks'. Each block begins with the word 'begin' and ends with the word
'end;'. NEXUS files always begin with the hash mark followed by the word
NEXUS.
2. Note the line that begins:
Begin data;
This line indicates that the 'data' block is to follow. The statement ends
in a semi-colon, which is the standard way to end a statement in the NEXUS
format.
3. The next three lines indicate the size of the dataset:
Dimensions
ntax=5
nchar=21;
ntax means "number of taxa"
nchar means "number of characters"
NOTE 1: This statement takes the format:
Keyword
option=value
option=value;
NOTE 2: The statement ends in a semi-colon.
4. Information concerning the format of the dataset follows:
Format
datatype=protein
gap=-
missing=?
matchchar=.
interleave;
Once again, the information is in the same format and the statement ends in
a semi-colon.
5. The next part of the file includes a statement instructing the program to
treat the INDELs in the data matrix as though they were an additional
character state (a 21st amino acid):
Options
gapmode=newstate;
6. The next part of the file contains the sequence data. In this case, the
data are protein sequences. Note that once again this part of the file ends
with a semi-colon:
Matrix
Chick     GARFIELDTHELAZY------
Kangaroo  GARFIELDTHELAZY---CAT
Human     GARFIELDTHE----FATCAT
Chimp     GARFIELD-------FATCAT
Dog       GARFIELDTHELAZYFATCAT
;
These sequences are protein and use the single-letter amino acid code. The
sequences vary in length, so it has been necessary to introduce INDELs
(shown as '-') in order to construct a sensible alignment. As a result, all
the sequences in this matrix are now of the same length and homologous
positions have been aligned to each other. If they were not of the same
length, PAUP would complain and would not read the sequences into memory.
The INDELs in these sequences will be treated as though they were a 21st
amino acid.
NOTE: Because of evolutionary changes, not all organisms have the same
sequences. The Chick sequence, for instance, is missing both the FAT and the
CAT domains.
7. The last part of the file contains the word:
end;
This indicates that the data block has ended.
8. Read the data into the program's memory. This can be achieved in one of
two ways. You can either specify the file name at the command line:
paup garfield.nex
or you can start the program (type paup) and then, when the program has
started, type:
execute garfield.nex
9. Note what has been printed to screen. Does this make sense? If not,
discuss it with your demonstrator.
10. You should now be presented with the 'paup prompt':
paup>
This indicates that you are now working within the PAUP environment.
11. The most important command within this environment is the help command.
This can be used in two ways. You can either issue the help command without
any trailing qualifiers:
paup> help;
or you can type:
paup> help commands;
which prints a one-line description of each command. Try out both of these
options now.
NOTE: It is good practice to finish each command with a semi-colon. This
tells the PAUP software that you have finished issuing a command. It also
means that you can issue a number of commands on one line, each one
terminated by a semi-colon, e.g.:
paup> help; help commands;
12. There is a general format for commands. You can see this format if you
type a question mark (?) after a command name. Try this with the command for
logging all output to a file.
paup> log ?;
You should see the following:
Usage: Log [options...] ;
Available options:
Keyword ---- Option type ------------------------ Current default setting --
File         <log-file-name>                      garfield.log
Replace      No|Yes                               *No
Append       No|Yes                               *No
Start        No|Yes                               *No
Stop         No|Yes                               *No
FlushLog     No|Yes                               No
*Option is nonpersistent
The first column contains the keywords that can be used to modify the way in
which the command 'log' works.
The second column contains the possible options for this keyword.
The third column contains the current setting for this option.
You can start logging your paup session to a file called 'blah.log' by
issuing the command:
paup> log File = blah.log start = yes;
Everything you type from here on and everything that appears on the screen
will be simultaneously printed to a file.
The format of commands is:
paup> command keyword = option;
13. Can you figure out which command you can use to 'show the character-data
matrix'? If you find the correct command, it will print the data matrix to
the screen.
14. To remove the chicken sequence from the analysis, first type:
paup> delete ?;
After looking at the options, type:
paup> delete Chick;
You can now look at the data matrix to check that the sequence has been
successfully removed.
15. To restore this sequence, type:
paup> help;
and figure out the correct command to restore the sequence.
16. To exclude the parsimony-uninformative sites, type:
paup> exclude uninf;
Look at the data matrix now and write down the new data matrix that results
from issuing this command. Can you see why the remaining sites are
parsimony-informative?
17. Can you figure out how to include those sites you have previously
excluded? Write the command into your lab book.
TEST
1. Ask paup for the time. Write the exact answer in your lab book.
2. Ask paup for the current character status. Write down the output.
3. Ask paup for the current file status for the data, log and tree files.
Quit the program.
Examine the log file.
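Step 16 asks why the sites that survive 'exclude uninf' are
parsimony-informative. The usual definition is that a site is informative
when at least two different states each occur in at least two taxa. The
following is a minimal Python sketch of that definition, not PAUP's own code.
It assumes the matrix from step 6 typed in by hand, all five taxa included,
and gaps counted as a state (as gapmode=newstate requests). The function name
informative_sites is made up for illustration.

```python
from collections import Counter

# The 'garfield.nex' matrix from step 6, typed in by hand for illustration.
MATRIX = {
    "Chick":    "GARFIELDTHELAZY------",
    "Kangaroo": "GARFIELDTHELAZY---CAT",
    "Human":    "GARFIELDTHE----FATCAT",
    "Chimp":    "GARFIELD-------FATCAT",
    "Dog":      "GARFIELDTHELAZYFATCAT",
}


def informative_sites(matrix):
    """Return the 1-based positions of parsimony-informative columns.

    A column counts as informative when at least two different states each
    occur in at least two taxa. The gap '-' is treated as a state of its own.
    """
    seqs = list(matrix.values())
    # An alignment must be rectangular, otherwise columns are meaningless.
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences are not aligned to the same length")
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        shared_states = [state for state, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            sites.append(pos)
    return sites


if __name__ == "__main__":
    print(informative_sites(MATRIX))
```

Comparing the printed positions with the matrix that PAUP shows after
'exclude uninf' is one way to check your answer to step 16.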
B
Trees
Aim:
This practical session reviews the concepts of trees and tree topologies.
You will use the program 'retree' from the PHYLIP package of programs. The
PHYLIP package was one of the first packages of programs for performing
phylogeny reconstruction. It was written by Dr. Joe Felsenstein in Seattle,
Washington.
Retree can be used to read a file containing the 'nested parentheses'
description of a tree.
The following is an example of a nested parentheses tree file:
(Cow,((Mouse,Rat),(Chimp,Human)));
This tree indicates that the Human and Chimp are each other's closest
relatives, the Mouse and Rat are each other's closest relatives, and the Cow
is not specifically related to any one of the other terminal taxa.
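To make the nested parentheses notation concrete, here is a small Python
sketch that turns such a string into nested tuples and lists the groups it
defines. It is an illustration only: it handles plain names, commas and
parentheses, and does not support branch lengths, quoted labels or comments.
The function names parse_newick and clades are assumptions of this sketch and
have nothing to do with how retree reads the file.

```python
def parse_newick(text):
    """Parse a simple nested-parentheses tree into nested tuples."""
    text = text.strip().rstrip(";").replace(" ", "")
    pos = 0

    def peek():
        # Current character, or '' once the end of the string is reached.
        return text[pos] if pos < len(text) else ""

    def subtree():
        nonlocal pos
        if peek() == "(":
            pos += 1  # skip '('
            children = [subtree()]
            while peek() == ",":
                pos += 1  # skip ','
                children.append(subtree())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip ')'
            return tuple(children)
        start = pos
        while peek() and peek() not in "(),":
            pos += 1
        if pos == start:
            raise ValueError("expected a taxon name at position %d" % pos)
        return text[start:pos]

    tree = subtree()
    if pos != len(text):
        raise ValueError("unexpected text after the tree")
    return tree


def clades(tree):
    """Return (tips, groups): all tips below a node and every group of tips."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    tips = frozenset()
    groups = []
    for child in tree:
        child_tips, child_groups = clades(child)
        tips |= child_tips
        groups.extend(child_groups)
    groups.append(tips)
    return tips, groups


if __name__ == "__main__":
    tree = parse_newick("(Cow,((Mouse,Rat),(Chimp,Human)));")
    print(tree)
    for group in clades(tree)[1]:
        print(sorted(group))
```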
Task 1: Read the treefile into memory.
1. Start the program retree by typing its name:
linux$ retree
2. You should see a screen that looks something like this:
--------------------------------------------------------------------------------
Tree Rearrangement, version 3.6a2.1
Settings for this run:
  U          Initial tree (arbitrary, user, specify)?  User tree from tree file
  N   Format to write out trees (PHYLIP, Nexus, XML)?  PHYLIP
  0                     Graphics type (IBM PC, ANSI)?  (none)
  W       Width of terminal screen, of plotting area?  80, 80
  L                        Number of lines on screen?  24
Are these settings correct? (type Y or the letter for one to change)
--------------------------------------------------------------------------------
3. We shall accept all the default options. You can do this by typing the
letter 'y'. If you wanted to change any of the options, you could do so by
typing the character that appears in the leftmost column (U,N,0,W,L).
4. The program now asks you for the name of the tree file. We have placed a
treefile in this directory. Its name is Mammal.tre and it contains the tree
detailed above. You should now be seeing the following:
--------------------------------------------------------------------------------
Reading tree file ...
retree: can't find input tree file "intree"
Please enter a new file name>
--------------------------------------------------------------------------------
Enter the name of the treefile (Mammal.tre).
5. You should now see an ASCII-character tree and a series of options like
this:
--------------------------------------------------------------------------------
  ,>>>>>>>>>>>1:Cow
  !
--6        ,>>2:Mouse
  !  ,>>>>>8
  !  !     `>>3:Rat
  `>>7
     !     ,>>4:Chimp
     `>>>>>9
           `>>5:Human
NEXT? (Options: R . U W O T F D B N H J K L C + ? X Q) (? for Help)
--------------------------------------------------------------------------------
6. The program has read the nested parentheses file and has represented this
tree on the screen. You can see from this tree that the Chimp and Human are
joined to each other through a single node (node number 9). Can you describe
the rest of the tree and the nodes that join various groups?
7. In order to see the options that are available to us, you can type the
question mark. You should then see the following:
--------------------------------------------------------------------------------
. Redisplay the same tree again
U Undo the most recent change in the tree
W Write tree to a file
+ Read next tree from file (may blow up if none is there)
R Rearrange a tree by moving a node or group
O select an Outgroup for the tree
T Transpose immediate branches at a node
F Flip (rotate) subtree at a node
D Delete or restore nodes
B Change or specify the length of a branch
N Change or specify the name(s) of tip(s)
H Move viewing window to the left
J Move viewing window downward
K Move viewing window upward
L Move viewing window to the right
C show only one Clade (subtree) (might be useful if tree is too big)
? Help (this screen)
Q (Quit) Exit from program
X Exit from program
TO CONTINUE, PRESS ON THE Return OR Enter KEY
--------------------------------------------------------------------------------
8. Type 'F' and flip the branches at node 7. Draw the resulting tree and
comment on whether or not it has the same meaning as the original tree.
9. Type 'F' again and this time flip node 8. Draw the tree again and once
again comment on what it now means.
10. Specify the mouse as the outgroup of the tree. How does this affect the
tree? Does it make sense?
11. Experiment with options R, T, D, C and N.
12. Quit when you are finished.
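One way to check your comments for steps 8 and 9 is to compare the groups of
taxa that a tree defines before and after a flip. The Python sketch below
does this for the Mammal.tre tree written as nested tuples. It assumes that a
flip only reverses the drawing order of the branches below a node; the
functions groups and flip are illustrative names and do not reproduce
retree's internals.

```python
def groups(tree):
    """Return the set of tip groups (frozensets of names) defined by a tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(walk(child) for child in node))
        found.add(tips)
        return tips

    walk(tree)
    return found


def flip(node):
    """Reverse the left-to-right order of everything below a node."""
    if isinstance(node, str):
        return node
    return tuple(flip(child) for child in reversed(node))


# (Cow,((Mouse,Rat),(Chimp,Human))) as nested tuples.
ORIGINAL = ("Cow", (("Mouse", "Rat"), ("Chimp", "Human")))
# Flip the subtree that retree labels node 7 (everything except the Cow).
FLIPPED = (ORIGINAL[0], flip(ORIGINAL[1]))

if __name__ == "__main__":
    print(FLIPPED)
    print(groups(ORIGINAL) == groups(FLIPPED))
```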
C
Characters
Aim:
Tracing characters on phylogenetic trees.
In this practical we shall look at how different characters require
different numbers of steps on different trees. We shall use the program
PAUP*4.0. This program is being developed by Dr. David Swofford, originally
at the Smithsonian Institution, Washington DC and now at Florida State
University, Tallahassee, Florida.
The dataset we shall use is a vertebrate morphological dataset. We shall
use only four animals in order to show that characters do not always agree
with one another. In the end, we shall use one of Charles Darwin's dictums,
"...the aggregate of characters", to decide which tree topology (branching
order) is the preferred one.
In this folder, you will find a dataset in "NEXUS" format. This file has a
number of parts that are all equally relevant. The first line of the file
begins with:
#NEXUS
This tells the program that the file is in NEXUS format. A file in this
format must conform to a very strict set of guidelines. Specifically, the
file must be organised into discrete "blocks" of information.
The next thing we encounter in the file is a short commentary. In the same
way as comments are entered into programming code, we can enter some
information in an input file:
-------------------------------------------------------------------------------
[!Data from: Lizard, Dog, Human and Frog ]
[!This practical shows how data sometimes contain homoplastic characters ]
[!Data are not always completely congruent ]
[!However, the aggregate of characters is usually used to infer relationships ]
-------------------------------------------------------------------------------
The square brackets are used to indicate a comment; the exclamation mark (!)
indicates that this comment will be printed to the screen by the PAUP
program.
The next part of the file is the taxa block:
-------------------------------------------------------------------------------
begin taxa;
dimensions ntax=4;
taxlabels
Lizard Dog Human Frog
;
end;
-------------------------------------------------------------------------------
This block indicates that there are 4 taxa and it also indicates their names.
Like all NEXUS blocks, the block begins with the word 'begin' and ends with
the word 'end'. All statements end in a semi-colon.
The next block is the 'characters' block.
-------------------------------------------------------------------------------
begin characters;
dimensions nchar=6;
format symbols = "01";
charlabels
AMNION HAIR LACTATION TAIL ONE_JAWBONE PLACENTA
;
[!
The first character is the Amnion
The second character is hair
The third character is lactation
The fourth character is tail
The fifth character is single bone in the lower jaw
The sixth character is the placenta
]
matrix
Lizard 0 0 0 1 0 0
Dog    1 1 1 1 1 1
Human  1 1 1 0 1 1
Frog   0 0 0 0 0 0
;
end;
-------------------------------------------------------------------------------
Again, the block begins with the word 'begin' and ends with the word 'end'.
There is a short commentary, telling the user what each of the characters
means. Then we see the word 'matrix', which indicates that the data are
about to follow. The data are arranged with all homologous characters in the
same column.
As the commentary says, the first column represents the observations
regarding the amnion. As we can see, this character is absent (indicated by
a zero) in the Lizard and the Frog. It is present in the Human and Dog. Can
you figure out the distribution of character states among the rest of the
characters?
The last part of the file contains three alternative tree topologies:
-------------------------------------------------------------------------------
begin trees;
tree best = [&U] (1,((2,3),4));
tree second = [&U] (1,2,(3,4));
tree worst = [&U] ((1,3),(2,4));
;
end;
-------------------------------------------------------------------------------
The trees have been given three different names - 'best', 'second', 'worst'.
The four numbers that are used in the nested parentheses treefiles indicate
the four taxa in the order in which they are represented in the data matrix
(1 = Lizard, 2 = Dog, 3 = Human, 4 = Frog). [&U] means that the trees are
not rooted; they are Unrooted.
EXERCISE
1. Start the PAUP program. This can be done in two different ways. You can
either type the program name followed by the NEXUS file:
linux$ paup Hair_tail.nex
or alternatively, you can start the program by typing paup and then reading
the datafile into memory using the 'execute' command:
linux$ paup
paup> execute Hair_tail.nex;
2. When the PAUP program starts, you will see a 'splash page' that looks
something like this:
-------------------------------------------------------------------------------
P A U P *
Portable version 4.0b10 for Unix
Sat Oct 5 20:05:34 2002
-----------------------------NOTICE-----------------------------
This is a beta-test version. Please report any crashes,
apparent calculation errors, or other anomalous results.
There are no restrictions on publication of results obtained
with this version, but you should check the WWW site
frequently for bug announcements and/or updated versions. See
the README file on the distribution media for details.
----------------------------------------------------------------
paup>
-------------------------------------------------------------------------------
If you read a datafile into memory at the same time as starting the program,
you should see a little more information:
-------------------------------------------------------------------------------
Processing of file "Hair_tail.nex" begins...
Data from: Lizard, Dog, Human and Frog
This practical shows how data sometimes contain homoplastic characters
Data are not always completely congruent
However, the aggregate of characters is usually used to infer relationships
The first character is the Amnion
The second character is hair
The third character is lactation
The fourth character is tail
The fifth character is single bone in the lower jaw
The sixth character is the placenta
Data matrix has 4 taxa, 6 characters
Valid character-state symbols: 01
Missing data identified by '?'
3 trees read from TREES block
Time used = <1 sec (CPU time = 0.00 sec)
Processing of file "Hair_tail.nex" completed.
paup>
-------------------------------------------------------------------------------
3. Read the data into memory now.
4. The most important command for PAUP is the 'help' command. Type this
command now. You should see something like the following list of available
commands:
-------------------------------------------------------------------------------
The following commands are always available:
   !             Edit          Help          Quit
   CD            Execute       Leave         Set
   Defaults      Factory       Log           Time
   DSet          FStatus       LSet          ToNEXUS
The following commands require data from a DATA (or TAXA and CHARACTERS) or
DISTANCES block (* = requires only TAXA block):
  *Agree          *DerootTrees    *LoadConstr      Reweight        SurfCheck
   AllTrees       *DescribeTrees   LScores        *RootTrees      *TaxPartition
   AncStates       DScores        *MatrixRep       SaveAssum      *TaxSet
   Assume          Exclude         MPRSets         SaveDist       *TreeDist
   BandB           Export          NJ             *SaveTrees      *TreeInfo
   BaseFreqs       ExSet          *Outgroup        ShowAnc        *TreeWts
   Bootstrap      *Filter          PairDiff       *ShowConstr     *TStatus
   CharPartition   GammaPlot       Permute         ShowDist        TypeSet
   CharSet        *GenerateTrees   PScores         ShowMatrix     *Undelete
  *ClearTrees     *GetTrees        PSet            ShowCharParts   UPGMA
   Condense        HomPart         Puzzle          ShowRateSets    UserType
  *Constraints     HSearch         RandTrees       ShowTaxParts    Weights
  *ConTree         Include        *RateSet        *ShowTrees       Wts
   CStatus        *Ingroup         Reconstruct     ShowUserTypes   WtSet
   CType           Jackknife      *Restore        *SortTrees
  *Delete          Lake            RevFilter       StarDecomp
Type "HELP COMMANDS" or "HELP CMDS" for a one-line description of each
command.
Type "<cmdname> ?" to see brief usage and current default settings.
-------------------------------------------------------------------------------
5. The first command we shall use is the 'showmatrix' command. Type this
command now. You should see a column containing the names of the taxa and a
column containing the data. If you do not see this, then the data have not
been successfully read into memory. Modify the showmatrix command so that
you can see the "Character Matrix Labels" and so that the width of each
column in the character matrix is 5 spaces. Re-issue the command with the
modifiers.
HINT: type the showmatrix command followed by a question mark.
6. The next command is the 'showtrees' command. This command prints trees to
the screen. It needs to know which trees to print (by default it just prints
the first one). You could type 'showtrees 1' if you wanted to see the first
tree. However, in this case, we wish to see all the trees, so you should
type 'showtrees all'.
NOTE: The showtrees command takes a slightly different format to other
commands. Because we wish to find out some information relating to a tree in
memory, we must specify which tree we are interested in. As a result, the
format of the command is:
Usage: ShowTrees [tree-list] [/ options...] ;
e.g. showtrees 1 3-7 9 / showtaxnum=yes;
7. Draw each of these trees in your lab notebook. These trees are drawn as
though they were rooted using an outgroup. In reality, they are unrooted
trees; they have just been drawn in this way for simplicity.
NOTE: In some circumstances, PAUP recognises the word 'all' in place of a
list.
8. We would now like to see the score each of these trees would receive
using the parsimony criterion for evaluating trees. The command for printing
the parsimony scores to the screen is 'pscores'. Can you figure out how to
use this command?
NOTE: the 'pscores' command has the same format as the 'showtrees' command
above.
If you have successfully issued the pscores command, paup will calculate the
fit of each character to the tree, measured as the number of steps the
character requires. It will then add these scores together and give you the
'tree length'.
9. Write down the tree length for each tree.
10. We would like to see the parsimony score for each individual character.
This can be achieved using the command:
pscores all /single=all;
Type this now and record the answers. In your practical book, explain why
you see the results on this screen.
11. We need to find the parsimony-uninformative characters. Often it is
useful to exclude these characters from a dataset, since they contribute
equally to the tree score for all possible trees. You can use the exclude
command to remove the uninformative sites. This can be achieved using the
command:
paup> exclude uninf;
How many sites were removed? Why?
Questions:
1. Which is the preferred tree using the parsimony criterion?
2. How many steps are required to describe the character 'amnion' on the
first tree?
3. Which character requires the most steps on the first tree?
When you are finished, you may quit the program.
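The 'tree length' described in step 8 can be reproduced by hand with Fitch's
counting method for unordered characters: work from the tips towards an
arbitrary root and count one step whenever the state sets of two sister
branches do not overlap. The Python sketch below applies this to the
Hair_tail.nex matrix and trees. It is an illustration under stated
assumptions, not PAUP's implementation: the trees are written as binary
nested tuples, so the unrooted tree 'second', (1,2,(3,4)), has been given an
arbitrary root as ((1,2),(3,4)), which does not change the step count for
unordered characters. All function names are made up for this sketch.

```python
# Character matrix from Hair_tail.nex, keyed by taxon number
# (1 = Lizard, 2 = Dog, 3 = Human, 4 = Frog).
MATRIX = {
    1: "000100",
    2: "111111",
    3: "111011",
    4: "000000",
}
CHARACTERS = ["AMNION", "HAIR", "LACTATION", "TAIL", "ONE_JAWBONE", "PLACENTA"]

# The three trees, arbitrarily rooted so that every node has two branches.
TREES = {
    "best":   (1, ((2, 3), 4)),
    "second": ((1, 2), (3, 4)),
    "worst":  ((1, 3), (2, 4)),
}


def fitch_steps(tree, states):
    """Minimum number of changes for one unordered character on a binary tree.

    'states' maps a taxon number to that taxon's state for the character.
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, int):
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        steps += 1  # the two branches disagree, so one change is needed
        return left | right

    state_set(tree)
    return steps


def character_steps(tree):
    """Steps required by each character, in matrix column order."""
    nchar = len(CHARACTERS)
    return [
        fitch_steps(tree, {taxon: row[i] for taxon, row in MATRIX.items()})
        for i in range(nchar)
    ]


def tree_length(tree):
    """Tree length: the sum of the steps over all characters."""
    return sum(character_steps(tree))


if __name__ == "__main__":
    for name, tree in TREES.items():
        print(name, character_steps(tree), tree_length(tree))
```

Running the sketch prints the per-character steps and the length of each
tree, which can be compared with the output of 'pscores all /single=all'.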
D
Reconstructions
Aim: To examine different characters on trees.
In this directory you will find two files, each with the same dataset and
trees.
1. Examine these files. You will see that each file consists of three
blocks. The first block is the data block, the second is the trees block and
the third is a block of commands that the paup software will read and
execute. There are two commands in the input files: one to print the matrix
to the screen and the other to reconstruct the characters on the various
trees. In the file Vert_Hair.nex the program is instructed to reconstruct
the evolution of the character "Hair" on each of the two trees. In the file
Vert_Tails.nex the program is instructed to reconstruct the evolution of the
character "Tails" on each of the two trees.
2. Read the file Vert_Hair.nex into paup. Examine the results as they are
printed to the screen. Write these results into your notebook.
3. Read the file Vert_Tails.nex into paup. Examine the results as they are
printed to the screen. Write these results into your notebook.
The program quits automatically each time.
E
Treefit
Aim: To determine the fits of characters on alternative trees.
1. Read the input file 'data.nex' using the unix command 'more'. Note that
there is a data block and a trees block. The trees block contains fifteen
trees. We shall examine the fits of characters to these trees.
2. Read the data file into paup's memory. Use the showmatrix command to
ensure that the dataset has been successfully read by the program.
3. Which characters are parsimony-uninformative?
4. Determine the parsimony scores for each tree. You can find the options by
typing the command followed by a question mark.
paup> pscores ?;
*Note that this command has two discrete parts. Immediately following the
command, you must supply a tree or a list of trees, so that paup can return
the figures that are relevant for these trees. If you want to modify the
default output using the options of the 'pscores' command, you must first
type a slash (/). In this case, you might issue a command like:
paup> pscores 1 /total=no;
or
paup> pscores 5 /total=yes;
Try these options now. You can find out the parsimony scores for all trees
using the command:
paup> pscores all;
5. Exclude the parsimony-uninformative characters and determine the scores
again. Why are there differences?
6. Which tree is the most parsimonious?
7. Using the 'pscores' command, find out the character statistics for every
'single' character for tree number 1.
8. Reconstruct the evolution of character 8 on tree number 11. Draw the
answer in your lab book. *Note that the reconstruct command also has an
unusual syntax.
Quit the program.
F
Heuristic searches.
Aim:
To search alternative trees using heuristic methods.
When the number of sequences to be evaluated is larger than about 10, it is
necessary to use approximate methods in order to evaluate alternative tree
topologies. These methods are called heuristic searches. There is a wide
variety of tree searches and we will use only a few of these. The usual
approach is to generate a tree (in some fast way) and then swap branches on
this tree, looking for trees with better scores.
The dataset we shall use is a set of primate mitochondrial genes. This
dataset was collected in order to solve the age-old question concerning the
relationship between humans, chimpanzees and gorillas. These are the real
data used in the original 1988 publication.
1. Take a look at the dataset. Its format is slightly different, as we are
using a dot "." to indicate that a nucleotide has the same character state
as the first sequence in the dataset (the lemur). There is also a numbering
system on top of the data matrix that indicates the position in the
alignment.
2. Read the dataset into the program's memory.
3. Look at the options for a heuristic search (command 'hs'). The first of
the options concerns the kind of heuristic search that can be carried out.
The choice is between a "Nearest Neighbor Interchange" (NNI), a "Sub-tree
Pruning and Regrafting" (SPR) or a "Tree-bisection and Reconnection" (TBR)
search. These three kinds of search are listed in increasing order of
thoroughness.
4. Perform a search of trees using the NNI procedure. Record the number of
trees searched and the score of the best tree.
5. Use the showtrees command to see the best tree(s).
6. Perform a search of trees using the SPR procedure. Record the number of
trees searched and the score of the best tree.
7. Use the showtrees command to see the best tree(s).
8. Perform a search of trees using the TBR procedure. Record the number of
trees searched and the score of the best tree.
9. Use the showtrees command to see the best tree(s).
10. Write in your copybook your interpretation of these trees. What does
this mean for the relationships between these primate species?
11. Use the 'contree' command to generate a consensus tree. What part of the
tree is 'collapsed' into a trichotomy? Why?
12. Quit the program.
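Of the three swapping methods named in step 3, the Nearest Neighbor
Interchange is the simplest to write down. Every internal branch of a
bifurcating tree joins two pairs of subtrees, (a,b) and (c,d), and an NNI
swaps one subtree from each side. The Python sketch below shows only this
single move and the resulting count of neighbouring trees; it does not
attempt SPR or TBR and is not PAUP's code. The function names and the taxon
labels in the example are illustrative assumptions.

```python
def nni_neighbours(a, b, c, d):
    """Return the two NNI rearrangements around one internal branch.

    The branch is assumed to join the subtree pair (a, b) to the pair (c, d).
    """
    return [((a, c), (b, d)), ((a, d), (b, c))]


def nni_neighbour_count(n_tips):
    """Number of NNI neighbours of an unrooted bifurcating tree.

    Such a tree has n_tips - 3 internal branches and each branch allows two
    rearrangements.
    """
    if n_tips < 3:
        raise ValueError("an unrooted tree needs at least three tips")
    return 2 * (n_tips - 3)


if __name__ == "__main__":
    # Hypothetical labels; the fourth 'subtree' stands for all other taxa.
    for tree in nni_neighbours("Human", "Chimp", "Gorilla", "Others"):
        print(tree)
    print(nni_neighbour_count(10))
```

Because each NNI step only looks at this small neighbourhood, the search is
fast but can stop at a tree that the more thorough SPR and TBR searches
would improve on.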
G
Exhaustive Searches
Aim: To search through all possible tree topologies in order to find the
most parsimonious.
In this practical, we shall examine all possible tree topologies for a
collection of sequences. Exhaustive searching of all tree topologies is
guaranteed to find the most parsimonious solutions. However, for large
datasets, it is impractical. The number of possible unrooted, bifurcating
tree topologies for n sequences is:
$$T = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$
The growth in tree numbers proceeds as follows:
n     T
4     3
5     15
6     105
7     945
8     10,395
9     135,135
10    2,027,025
However, for small numbers of sequences, we can search through all tree
topologies relatively easily.
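The table can be checked with a few lines of Python. This sketch simply
evaluates the formula for the number of unrooted, bifurcating trees,
(2n-5)! divided by 2 to the power (n-3) times (n-3)!; the function name
unrooted_trees is an illustrative choice.

```python
from math import factorial


def unrooted_trees(n):
    """Number of unrooted, bifurcating tree topologies for n taxa (n >= 3)."""
    if n < 3:
        raise ValueError("the formula applies to three or more taxa")
    # The division is exact, so integer division loses nothing.
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in range(4, 11):
        print(n, unrooted_trees(n))
```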
1. Take a look at the data file called BigHIV.nex. This file contains a
total of eight sequences of HIV viruses isolated from patients in a variety
of different countries. Read the datafile into paup's memory.
2. Confirm that you have successfully read the file into memory by printing
the matrix to the screen.
3. Take a look at the options for searching through "all trees". You can ask
the program to search through all these trees by simply issuing the relevant
command without any options. Do this now.
4. Note the number of trees that were evaluated and the scores of the
shortest and the longest trees that were encountered.
5. There will be a print out of the tree scores that were encountered during
the search. It should look something like this:
================================================================================
Frequency distribution of tree scores:
mean=24.572872 sd=2.034705 g1=-0.787999 g2=0.198534
   /---------------------------------------------------------------------
16 | (5)
17 | (13)
18 |# (34)
19 |##### (151)
20 |####### (205)
21 |############### (449)
22 |################################# (978)
23 |########################## (771)
24 |##################################################################### (2056)
25 |##################################################### (1595)
26 |##################################################################### (2070)
27 |##################################################################### (2068)
   \---------------------------------------------------------------------
================================================================================
This frequency distribution indicates the number of times a tree of a
particular length was encountered during a tree search. We expect that
well-structured data with low levels of homoplasy and strong phylogenetic
signal will produce a frequency distribution where the largest portion of
trees are much longer than the most parsimonious solutions, so that the
distribution has a long tail towards the short trees.
6. Run the exhaustive search again, this time changing the Frequency
Distribution display to 'Histogram' and printing the output to a file called
"BigHIV.freq". Take a look at the output file and note the number of trees.
7. Quit the program.
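The summary line above the histogram in step 5 can be recomputed from the
counts in brackets. The Python sketch below does so for the mean, the
standard deviation and the skewness statistic g1, taking g1 to be the third
central moment divided by the 1.5th power of the second; that definition is
an assumption of this sketch, since the handout does not say how PAUP
calculates it. A negative g1 corresponds to a long tail towards the short
trees.

```python
from math import sqrt

# Tree length -> number of trees, copied from the example output in step 5.
HISTOGRAM = {
    16: 5, 17: 13, 18: 34, 19: 151, 20: 205, 21: 449,
    22: 978, 23: 771, 24: 2056, 25: 1595, 26: 2070, 27: 2068,
}


def summarise(histogram):
    """Return (number of trees, mean, sd, g1) for a length histogram."""
    n = sum(histogram.values())
    mean = sum(length * count for length, count in histogram.items()) / n
    # Second and third central moments, dividing by n.
    m2 = sum(count * (length - mean) ** 2 for length, count in histogram.items()) / n
    m3 = sum(count * (length - mean) ** 3 for length, count in histogram.items()) / n
    return n, mean, sqrt(m2), m3 / m2 ** 1.5


if __name__ == "__main__":
    n, mean, sd, g1 = summarise(HISTOGRAM)
    print("trees=%d mean=%.6f sd=%.6f g1=%.6f" % (n, mean, sd, g1))
```

The number of trees recovered from the counts should equal the number of
unrooted topologies for eight sequences.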
H
Random
Aim:
To perform a search of tree space using a dataset that has random
sequences. We will demonstrate what "tree space" looks like for this
dataset.
1. Take a look at the dataset and then read this dataset into paup's memory.
2. Perform an exhaustive search of tree space.
3. Note the distribution of tree lengths.
4. Record in your lab books how this distribution differs from the
distribution in the previous exercise. This distribution is indicative of a
dataset that does not contain well-structured and phylogenetically robust
data.
5. Quit the program.
I
Bootstrap
Aim:
To evaluate the level of support for internal branches in a phylogenetic
tree. We shall demonstrate the differences we might see in "good" and "poor"
data sets.
We use a resampling procedure such as bootstrapping in an effort to evaluate
the relative levels of support and conflict for various sets of
relationships. The characters (columns) of the original data matrix are
sampled with replacement, and the resulting pseudo-replicates of the
original matrix are used to construct trees. This resampling strategy is
repeated a large number of times and at the end the trees are summarised
using a majority-rule consensus procedure.
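The resampling step itself is short enough to sketch in Python. The example
below builds one pseudo-replicate by drawing column numbers with replacement
and copying those columns for every taxon. It covers only the resampling,
not the tree building or the consensus, and it is not PAUP's code. The toy
alignment and the function name bootstrap_replicate are hypothetical.

```python
import random

# A hypothetical toy alignment, used only to show the mechanics.
TOY = {
    "Seq1": "ACGTACGT",
    "Seq2": "ACGTACGA",
    "Seq3": "ACCTACGA",
    "Seq4": "TCCTAGGA",
}


def bootstrap_replicate(matrix, rng):
    """Return one pseudo-replicate with the same dimensions as 'matrix'.

    Columns are drawn with replacement, so some columns appear more than
    once and others are left out. The same columns are used for every taxon.
    """
    names = list(matrix)
    nchar = len(matrix[names[0]])
    columns = [rng.randrange(nchar) for _ in range(nchar)]
    return {name: "".join(matrix[name][i] for i in columns) for name in names}


if __name__ == "__main__":
    rng = random.Random(1)  # a fixed seed makes the run repeatable
    for name, seq in bootstrap_replicate(TOY, rng).items():
        print(name, seq)
```

Passing in a seeded random.Random object plays the same role as a seed in a
bootstrap program: the same seed reproduces the same set of replicates.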
1. Read the HIV.nex dataset into paup's memory. Take a look at the options
of the bootstrap command. The first option is BSeed. This is a seed number
that computers need in order to generate a sequence of random numbers. PAUP
will usually use the clock time as the seed.
2. Use the 'bootstrap' command to perform a bootstrap analysis using the
default parameters. Record the random 'starting seed' number. Draw the
resulting tree into your lab book with the support values associated with
the internal branches.
3. You will also see a 'partition table' that details the number of times a
particular internal branch was seen during the resampling procedure. You
must remember that during bootstrap resampling, it is likely that trees will
be produced that do not look like the final consensus tree.
This is an example of a partition table:
12345678 Freq
------------------
.***.... 79.18
..**.... 76.76
.******. 72.31
.....**. 52.28
....***. 50.51
....**** 23.69
.**..... 16.47
.***.**. 15.65
.*..***. 6.43
This indicates that during the analysis a branch that separated sequences
2, 3 and 4 (marked with asterisks) from sequences 1, 5, 6, 7 and 8 was seen
79.18% of the time. Can you work out how many times a branch was seen that
separated sequences 2 and 3 from the rest? Would this branch be seen in the
majority-rule consensus tree?
*NOTE:
The following notations can be used to describe the same tree:
Tree diagram -
  ,>>>>>>>>>>>1:Cow
  !
--6        ,>>2:Mouse
  !  ,>>>>>8
  !  !     `>>3:Rat
  `>>7
     !     ,>>4:Chimp
     `>>>>>9
           `>>5:Human
Nested parentheses -
((Human, Chimp),(Mouse, Rat), Cow);
Partition table -
12345
...**
.**..
.****
4. Read the Random.nex file into paup's memory. Perform the same bootstrap
analysis on this data set. Record the results and indicate how the trees
differ in terms of the degree of support seen throughout the tree. Bear in
mind that the Random.nex file contains sequences that we have randomised, so
there is no real set of phylogenetic relationships.
5. Perform the bootstrap analyses again on both datasets, this time altering
the number of replicates so that you perform 1,000 resampling iterations.
6. Record the results and quit the program.
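The note in step 3 gives three notations for the same tree. The Python
sketch below derives the partition-table rows from the tree written as
nested tuples, using the taxon numbering of the diagram (1 = Cow, 2 = Mouse,
3 = Rat, 4 = Chimp, 5 = Human). It is an illustration only: the function
name partition_rows is made up, and the tree is rooted as retree draws it,
which is why the row separating the Cow from everything else is included.

```python
TAXA = ["Cow", "Mouse", "Rat", "Chimp", "Human"]
TREE = ("Cow", (("Mouse", "Rat"), ("Chimp", "Human")))


def partition_rows(tree, taxa):
    """Return one row of '.' and '*' for every internal group of the tree.

    A '*' marks a taxon inside the group. Single tips and the group that
    contains every taxon are skipped because they carry no information.
    """
    rows = []

    def walk(node):
        if isinstance(node, str):
            return {node}
        tips = set()
        for child in node:
            tips |= walk(child)
        if 1 < len(tips) < len(taxa):
            rows.append("".join("*" if name in tips else "." for name in taxa))
        return tips

    walk(tree)
    return rows


if __name__ == "__main__":
    print("".join(str(i + 1) for i in range(len(TAXA))))
    for row in partition_rows(TREE, TAXA):
        print(row)
```

In a bootstrap analysis, rows like these are tallied over all the replicate
trees, and a group appears in the majority-rule consensus when it is found in
more than half of them.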