
Long branch attraction illustrated on 4 skippers (Hesperiidae)

© Nick V. Grishin

Phenotypes of organisms are complex, as reflected in the variety of their shapes, textures and colors. With proper experience, it is relatively easy to see differences between organisms and to hypothesize about relationships between them. However, these hypotheses are necessarily subjective. DNA sequences bring an impression of objectivity to the analysis. Indeed, once they have been obtained experimentally, DNA sequences can be fed to a computer program that outputs a phylogenetic tree and some values indicating how reliable the tree is. The researcher, with all of his or her feelings, beliefs, intuition, preconceptions and other equally subjective components, can be completely eliminated from an analysis streamlined as "DNA – computer – tree". In other words, this procedure sounds like science in its essence. Can it go wrong?

3trees.gif

One species is a node (=leaf), not a tree; two species give a single edge between them; three species give a single unrooted tree with three edges "growing" from a single internal node and ending in leaves that represent these species (see image on the right). Although the 2-species and 3-species cases pose interesting problems of their own, such as estimation of branch lengths, there is just a single tree topology for 2 species and a single tree topology for 3 species.

Four species is the minimal number to play with, as there are 3 possibilities of grouping these 4 species in an unrooted tree: Tree 1 groups species 1 with species 2, which implies that species 3 is grouped with species 4; Tree 2 groups species 1 with species 3, which implies that species 2 is grouped with species 4; and finally Tree 3 groups species 1 with species 4, which implies that species 2 is grouped with species 3. These three tree topologies are illustrated on an example of 4 skippers: a Giant-Skipper (Agathymus mariae), a grass skipper (Ancyloxypha numitor) and two spread-wing skippers (Thorybes pylades and Pyrrhopyge zenodorus). Which of the three trees do you think represents reality?
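The count of three possible groupings can be checked by brute force. The sketch below is only an illustration and is not part of the original analysis; the function name `quartet_topologies` is hypothetical, and the species are numbered in the order used in the alignments further down (1 = Agathymus, 2 = Pyrrhopyge, 3 = Ancyloxypha, 4 = Thorybes).

```python
def quartet_topologies(taxa):
    """Return the 3 ways to split 4 taxa into two pairs (unrooted topologies)."""
    if len(taxa) != 4:
        raise ValueError("exactly 4 taxa are required")
    first, *rest = taxa
    trees = []
    for partner in rest:
        # Pairing the first taxon with `partner` forces the other two together.
        others = tuple(t for t in rest if t != partner)
        trees.append(((first, partner), others))
    return trees


SPECIES = ["Agathymus mariae", "Pyrrhopyge zenodorus",
           "Ancyloxypha numitor", "Thorybes pylades"]

for number, (pair_a, pair_b) in enumerate(quartet_topologies(SPECIES), start=1):
    print(f"Tree {number}: {' + '.join(pair_a)} | {' + '.join(pair_b)}")
```

Fixing the partner of species 1 is enough to define the tree, which is why there are exactly as many topologies as there are possible partners.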

According to recent studies (Warren et al. 2008, Warren et al. 2009) and general knowledge, tree 2 is correct, and Agathymus mariae is a neighbor of Ancyloxypha numitor. Here we illustrate an interesting phenomenon in phylogeny reconstruction called "long branch attraction". It may result in an incorrect tree topology, frequently derived with high confidence. We take partial DNA sequences of the Elongation factor-1 α (EF1-a) gene from these 4 skipper species, obtain the most parsimonious tree, and see that it is likely to be incorrect. The reasons behind this are discussed below.


Methods: All sequences used in this study were obtained from the GenBank database as submitted by Warren et al. (2008). Sequences were aligned using the MUSCLE server at EBI. The trees were reconstructed using 4 phylogenetic methods as implemented by the phylogeny.fr server with default parameters, and visualized with TreeView and ATV. A tutorial on how to perform these procedures is available from here.

Data: DNA sequences of the Elongation factor-1 α (EF1-a) gene for 4 skippers, their alignment, and trees built by TNT, BioNJ, PhyML and MrBayes can be downloaded as text files using the links in this sentence.

Discussion: The default method for phylogenetic reconstruction among taxonomists is maximum parsimony. This method is quite easy to understand; it operates on DNA sequences the same way as on coded strings of morphological and biological characters, and gives an impression of being devoid of assumptions (not true!). The most parsimonious tree of the 4 skipper species is shown on the right. We see that Agathymus and Pyrrhopyge are grouped together, against common sense. The reasons for such a grouping become apparent after a careful look at the sequences. The available EF1-a partial DNA sequences comprise 731 nucleotides at positions present in all 4 species. The multiple alignment is shown below. In most positions, the same nucleotide is present in each of the 4 sequences, e.g. "A" in position 2, "T" in position 3, "G" in position 5, etc. These invariant positions are marked by asterisks below the alignment:

CLUSTAL W (1.81) multiple sequence alignment
1.Agathymus_mariae      AATTGGTTATAACCCAGCTGCCGTTGCTTTCGTACCCATTTCTGGCTGGCACGGAGATAACATGCTGGAGCCATCTACCAAGATGCCTTGGTTCAAGGGATGGAATGTTGAGCGTAAGGAAGGTAAGGCTGAAGGGAAATGCCTTATTGAGGCTCTTGATGCCATCCTACCCCCAGCTCGTCCCACAGACAAAGCTCTCCGACTTCCTCTGCAAGACGTCTACAAAATCGGTGGTATTGGTACAGTGCCTGTAGGCAGAGTAGAAACTGGAATCTTGAAGCCCGGTACCATTGTGGTATTTGCTCCTGCTAATATCACTACTGAAGTAAAATCTGTGGAGATGCACCACGAGGCTCTCCAAGAGGCTGTGCCTGGAGACAATGTAGGATTCAACGTAAAGAACGTATCTGTCAAAGAATTGCGTCGTGGATATGTCGCCGGTGACTCCAAAAACAACCCACCAAAGGGTGCCGCTGACTTCACAGCACAAGTAATTGTGCTCAACCATCCTGGTCAAATTTCTAATGGTTATACACCAGTGCTGGATTGCCACACTGCTCACATTGCCTGCAAATTTGCAGAGATTAAAGAGAAGGTTGATCGTCGTACTGGTAAATCAACTGAGGAGAATCCTAAATCTATCAAGTCTGGTGATGCTGCCATTGTTAACTTAGTACCTTCTAAGCCTCTATGTGTAGAATCCTTCCAGGAGTTTCCACCCCTCGGTCGTT
2.Pyrrhopyge_zenodorus  GATTGGTTACAATCCGGCTGCCGTCGCTTTCGTACCCATTTCTGGCTGGCACGGAGACAATATGTTGGAGCCATCAACCAAAATGCCTTGGTTCAAGGGATGGGCTGTTGACCGTAAAGAAGGTAAGGCTGAAGGCAAGTGCCTGATTGAGGCCTTGGACGCCATCCTGCCTCCAGCTCGCCCTACCGACAAGCCCCTTCGCCTTCCCCTGCAAGACGTCTACAAAATTGGTGGTATTGGAACAGTGCCCGTAGGCAGAGTGGAGACTGGTATCTTGAAGCCTGGTACCATTGTTGTATTTGCTCCTGCTAACATCACAACTGAAGTAAAATCTGTGGAGATGCACCATGAAGCTCTTCAAGAGGCTGTGCCTGGAGACAATGTTGGTTTCAACGTAAAGAACGTGTCTGTGAAGGAATTGCGCCGTGGATACGTTGCAGGTGACTCCAAGAACAACCCCCCTAAGGGTGCTGCTGACTTCACTGCACAAGTCATTGTGCTTAACCACCCTGGTCAAATTTCCAATGGTTACACACCTGTATTGGATTGCCACACAGCTCACATTGCTTGTAAATTCGCAGAAATTAAGGAGAAGGTTGACCGTCGTACTGGCAAATCAACAGAACAAAATCCCAATGCAATCAAGTCTGGTGACGCTGCCATTGTTAACTTAGTACCTTCCAAGCCTCTGTGTGTGGAGTCCTTCCAAGAATTCCCACCCCTTGGTCGTT
3.Ancyloxypha_numitor   GATCGGTTACAACCCAGCTGCCGTCGCTTTCGTACCCATTTCTGGCTGGCACGGAGACAACATGTTGGAGCCATCCACTAAGATGCCCTGGTTCAAGGGATGGAATGTCGAGCGTAAGGAAGGTAAGGCTGAAGGCAAATGCCTCATTGAGGCCCTCGACGCCATCCTGCCTCCCGCTCGTCCCACAGACAAAGCCCTCCGTCTTCCCCTGCAGGACGTCTACAAAATCGGCGGTATTGGTACAGTGCCCGTAGGCAGAGTGGAGACAGGTATCCTGAAGCCCGGTACCATTGTTGTATTCGCCCCTGCTAACATCACCACTGAAGTAAAATCGGTGGAGATGCACCACGAAGCTCTCCAAGAGGCTGTGCCCGGAGACAACGTAGGTTTCAACGTAAAGAACGTTTCTGTCAAGGAATTGCGTCGTGGCTACGTCGCTGGTGACTCCAAGAACAACCCACCCAAGGGTGCTGCTGACTTCACCGCACAAGTCATTGTGCTCAACCACCCTGGTCAAATTTCCAACGGTTACACGCCTGTGTTGGATTGCCACACTGCTCACATTGCCTGCAAATTCGCAGAAATCAAAGAGAAGGTAGACCGTCGTACTGGTAAATCTACTGAAGATAATCCTAAATCCATCAAGTCTGGAGATGCTGCTATCGTTAACCTAGTACCTTCCAAGCCTCTTTGTGTGGAGTCCTTCCAGGAGTTCCCACCCCTCGGTCGTT
4.Thorybes_pylades      AATCGGTTACAACCCAGCTGCCGTCGCTTTCGTACCCATTTCTGGCTGGCACGGAGACAACATGTTGGAGCCATCCACCAAAATGCCCTGGTTCAAGGGATGGAACGTTGAGCGTAAGGAAGGTAAGGCTGAAGGCAAGTGCCTCATTGAGGCTTTGGACGCCATCCTGCCTCCCGCTCGTCCCACAGACAAGGCCCTGCGTCTTCCCCTACAGGACGTCTACAAAATCGGTGGTATTGGTACAGTGCCCGTAGGCCGTGTCGAAACTGGCATCCTTAAGCCTGGTACCATTGTTGTATTCGCCCCCGCTAACATCACCACTGAAGTTAAGTCAGTGGAGATGCACCACGAAGCTCTCCAAGAGGCTGTGCCTGGAGACAATGTTGGTTTCAACGTAAAGAACGTCTCTGTTAAGGAATTGCGTCGTGGCTACGTTGCCGGTGACTCCAAGAACAACCCACCCAAGGGTGCCGCCGACTTCACCGCACAGGTCATCGTGCTCAACCACCCCGGTCAAATTTCCAACGGTTACACACCCGTTTTGGATTGCCACACTGCTCACATTGCCTGCAAATTCGCAGAAATCAAAGAGAAGGTTGACCGTCGTACTGGTAAATCAACTGAAGAGAACCCGAAATCAATCAAGTCTGGTGATGCTGCCATTGTTAACCTAGTACCCTCCAAGCCTCTGTGTGTAGAGTCCTTCCAGGAGTTCCCACCCTTCGGTCGTT
                         ** ***** ** ** ******** ******************************** ** *** ********** ** ** ***** ***************   ** ** ***** ***************** ** ***** ********  * ** ******** ** ** ***** ** ** *****  * ** ** ***** ** ** ************** ** ******** ******** ****** * ** ** ** ** *** * ***** *********** ***** ** ** ***** ***** ******** ** ** ************** ** ***** ************** ******** ** ** ***************** ***** ** ******** ***** ** ** ** *********** ******** ** ******** ** ******** ***** ** ** ***** ***** ** *********** ** ***** ** ** **  ************* *********** ** ***** ***** ** ** ******** ** *********** ***** ** **  * ** ** **  * *********** ** ***** ** ****** ******* ** ******** ***** ** ******** ** ** ****** * *******

Since positions with identical nucleotides do not directly tell us which sequence is a neighbor of which (there are no changes in such positions), these positions are not considered in parsimony analysis and can be removed from the alignment. As a result, we get the following alignment of 131 positions:

1.Agathymus_mariae     30  ATTCATTCCTCGTAATTGGGATTCTTACATCAAGTCATGACTTTAAAATATGCGTTTTTAATCGCTTAAACATATCCAAACTAAATCTTTTTAAGCTCCTGTATTTATGGGTTATTTTCTTTTAAAGGTCC  30  Agathymus_mariae      
2.Pyrrhopyge_zenodorus 31  GTCTGCCTTACATGCTTCACGGCTGCGTACTCGCCTCCGATTACAAGGTTTGTTTTTCAAATTATTTTTGGGCACTAGCTTTTACTTCTCTCATATATTCATGTCCAAACATCTGATCCTTTCGGGAACCT  31  Pyrrhopyge_zenodorus  
3.Ancyloxypha_numitor  12  GCCCACCCTCTGCAATCGGCACCCCCGTCTCAAGCCTCGGCCTCAAGGATCGCTCCTCCAAGCACCCATTCGTCCCTGACTTCACTCCTCCCGTGTTCCCACAACTTTAGTTTATCATTCCTCTGGGGCCC  12  Ancyloxypha_numitor   
4.Thorybes_pylades     15  ACCCACCCTCCACAACTGGCGCTTGCGTCTCAGGCGTCAGCTTCCTCATCCTTTCCCCCTGACACTTTTCTGTCCTCGACCCCGCCCCCCCCACTTTCCCACATCTATAGGCGATATTCTCCCGAGGGCTC  15  Thorybes_pylades      
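The removal of invariant positions described above amounts to dropping every alignment column in which all four sequences agree. The following is a minimal sketch of that step, not the procedure actually used to prepare the data; `drop_invariant_columns` is a hypothetical helper name, and only the first 12 positions of the full alignment are used to keep the example short.

```python
def drop_invariant_columns(sequences):
    """Keep only alignment columns in which at least two sequences differ."""
    columns = zip(*sequences)  # one tuple of nucleotides per position
    variable = [col for col in columns if len(set(col)) > 1]
    # Transpose back so that each row is again one sequence.
    return ["".join(row) for row in zip(*variable)]


# First 12 positions of the EF1-a alignment, sequences in the order 1-4.
first_12 = [
    "AATTGGTTATAA",  # 1. Agathymus mariae
    "GATTGGTTACAA",  # 2. Pyrrhopyge zenodorus
    "GATCGGTTACAA",  # 3. Ancyloxypha numitor
    "AATCGGTTACAA",  # 4. Thorybes pylades
]

print(drop_invariant_columns(first_12))
# ['ATT', 'GTC', 'GCC', 'ACC']
```

Only positions 1, 4 and 10 survive, and the resulting strings are the first three columns of the 131-position alignment shown above.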

Nucleotides unique to one sequence (i.e. the position has the same nucleotide in 3 sequences and a different one in the 4th; a semi-invariant position) are highlighted in gray, and their number is shown before (and after) each sequence on a yellow background. In the simplest incarnation of the maximum parsimony method, we need to find the tree that can be explained by the smallest number of nucleotide substitutions. A position with a nucleotide unique to one sequence can be explained by a single substitution: from the nucleotide present in the other 3 sequences to the nucleotide present in the 4th sequence. E.g. the 3rd position {T,C,C,C} can be explained by a single substitution of C to T. Since T is present in only one sequence, this substitution should be assigned to (=presumably happened on) the tree edge (=branch) leading to the sequence with T as a leaf, i.e. the C → T substitution happened on the terminal branch leading to Agathymus mariae. Because the gray-highlighted positions can be explained with a single substitution for all 3 tree topologies, they do not directly assist us in topology selection. Removing semi-invariant positions from the alignment, we get an even shorter alignment of only 43 positions:

1.Agathymus_mariae      ATTGTATTCTAACAAAAATCTTTTAACACCACATAGTGTTTAA      Agathymus_mariae      
2.Pyrrhopyge_zenodorus  GTAATGGCTGAGTCAGGTTTTTATTGGATATTTTTATACATGG  11  Pyrrhopyge_zenodorus  
3.Ancyloxypha_numitor   GCCGCACCCCCACTGGGTCCCCCGATCCCTCTCCTGCTTCCTG   7  Ancyloxypha_numitor   
4.Thorybes_pylades      ACCACGCTTGCGGTGCACCTCCCATCTCTCCCCCCTCGGACGA   5  Thorybes_pylades      
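The semi-invariant filtering step can be sketched in the same spirit. The code below is an illustrative assumption about how one might count such positions per sequence and strip them out; the helper names `singleton_owner` and `split_singletons` are hypothetical, and the input is just the first 10 columns of the 131-position alignment.

```python
from collections import Counter


def singleton_owner(column):
    """Return the index of the sequence holding the unique nucleotide, or None.

    A column is semi-invariant when three sequences share one nucleotide
    and the fourth carries a different one.
    """
    counts = Counter(column)
    if sorted(counts.values()) != [1, 3]:
        return None
    unique = min(counts, key=counts.get)  # the nucleotide seen only once
    return column.index(unique)


def split_singletons(sequences):
    """Count singletons per sequence and return the alignment without them."""
    per_sequence = [0] * len(sequences)
    kept = []
    for column in zip(*sequences):
        owner = singleton_owner(column)
        if owner is None:
            kept.append(column)
        else:
            per_sequence[owner] += 1
    return per_sequence, ["".join(row) for row in zip(*kept)]


first_10 = [
    "ATTCATTCCT",  # 1. Agathymus mariae
    "GTCTGCCTTA",  # 2. Pyrrhopyge zenodorus
    "GCCCACCCTC",  # 3. Ancyloxypha numitor
    "ACCCACCCTC",  # 4. Thorybes pylades
]

counts, remaining = split_singletons(first_10)
print(counts)     # [4, 3, 0, 0]
print(remaining)  # ['ATT', 'GTA', 'GCC', 'ACC']
```

Even in this short fragment, all seven semi-invariant positions fall on the Agathymus and Pyrrhopyge sequences, and the three remaining columns are the first three columns of the 43-position alignment.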

Next, we highlight "2+2" positions, i.e. positions with the same nucleotide in 2 sequences and a different nucleotide common to the other two sequences. There are 3 types of such positions: type 1 (highlighted red) has the same nucleotide in sequences 1 and 2, type 2 (highlighted green) has the same nucleotide in sequences 1 and 3, and type 3 (highlighted blue) has the same nucleotide in sequences 1 and 4. The number of these positions is shown highlighted yellow to the right of the sequences: type 1 after sequence 2, type 2 after sequence 3 and type 3 after sequence 4. These positions can also be explained by a single substitution, but on the middle tree branch, and only if the middle branch separates the two species pairs that each share a nucleotide. For example, the first position in the alignment {A,G,G,A} can be explained by a single change (either A → G or G → A) if tree 3 (A.mariae joined with T.pylades) is correct. For the other two tree topologies (A.mariae joined with P.zenodorus, or A.mariae joined with A.numitor), a minimum of 2 changes is necessary. E.g. for tree 1 (A.mariae joined with P.zenodorus) we need a change G → A in A.mariae and another change G → A in T.pylades, which makes 2 changes. Alternatively, this position can be explained by a change A → G in P.zenodorus and another change A → G in A.numitor, which is also 2 changes. Thus the position {A,G,G,A} votes for the grouping of A.mariae with T.pylades, as tree 3 needs the smallest number of changes to explain it. Positions of type 1 (red) support tree 1 (A.mariae joined with P.zenodorus), positions of type 2 (green) support tree 2 (A.mariae joined with A.numitor), and positions of type 3 (blue) support tree 3 (A.mariae joined with T.pylades). These "2+2" positions are called parsimony-informative, as the smallest number of changes assigned to them depends on the tree topology.
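
The change counting in the paragraph above can be reproduced with the standard Fitch procedure, specialized here to four sequences. This is a sketch for illustration only, not the algorithm as implemented in TNT; the names `TREES` and `min_changes` are hypothetical.

```python
# Each tree is given as two pairs of sequence indices (0-based).
TREES = {
    1: ((0, 1), (2, 3)),  # tree 1: sequences 1+2 | 3+4
    2: ((0, 2), (1, 3)),  # tree 2: sequences 1+3 | 2+4
    3: ((0, 3), (1, 2)),  # tree 3: sequences 1+4 | 2+3
}


def min_changes(column, tree):
    """Smallest number of substitutions explaining one column on a given tree."""
    changes = 0
    state_sets = []
    for i, j in TREES[tree]:
        a, b = column[i], column[j]
        if a == b:
            state_sets.append({a})
        else:
            # The two paired leaves differ: one change on a terminal branch.
            state_sets.append({a, b})
            changes += 1
    # No shared state between the two pairs: one change on the middle branch.
    if not (state_sets[0] & state_sets[1]):
        changes += 1
    return changes


for column in ("AGGA", "TACC", "TCCC"):
    print(column, [min_changes(column, tree) for tree in (1, 2, 3)])
# AGGA [2, 2, 1]
# TACC [2, 2, 2]
# TCCC [1, 1, 1]
```

Only the "2+2" column {A,G,G,A} gives different counts for different trees; the other two columns cost the same on every topology.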
The position type with the most votes wins under parsimony. In this case 11 > 7 > 5, thus red positions (type 1) win, and the most parsimonious tree is tree 1, in which A.mariae is joined with P.zenodorus. Note that the smallest number of changes assigned to the positions without highlight in the alignment (e.g. {T,A,C,C}) does not depend on tree topology (for {T,A,C,C} it is always 2), so these positions can be ignored in this method.
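
The vote can be recounted directly from the 43-position alignment given above. The sketch below assumes the sequences are listed in the order 1 to 4 used throughout; `site_type` and `count_votes` are hypothetical helper names introduced for this illustration.

```python
ALIGNMENT = [
    "ATTGTATTCTAACAAAAATCTTTTAACACCACATAGTGTTTAA",  # 1. Agathymus mariae
    "GTAATGGCTGAGTCAGGTTTTTATTGGATATTTTTATACATGG",  # 2. Pyrrhopyge zenodorus
    "GCCGCACCCCCACTGGGTCCCCCGATCCCTCTCCTGCTTCCTG",  # 3. Ancyloxypha numitor
    "ACCACGCTTGCGGTGCACCTCCCATCTCTCCCCCCTCGGACGA",  # 4. Thorybes pylades
]


def site_type(column):
    """Return 1, 2 or 3 for a "2+2" column, or None for any other column.

    The type is the number of the tree the column supports, i.e. whether
    sequence 1 shares its nucleotide with sequence 2, 3 or 4.
    """
    s1, s2, s3, s4 = column
    if s1 == s2 and s3 == s4 and s1 != s3:
        return 1
    if s1 == s3 and s2 == s4 and s1 != s2:
        return 2
    if s1 == s4 and s2 == s3 and s1 != s2:
        return 3
    return None


def count_votes(sequences):
    """Count parsimony-informative columns supporting each of the 3 trees."""
    votes = {1: 0, 2: 0, 3: 0}
    for column in zip(*sequences):
        kind = site_type(column)
        if kind is not None:
            votes[kind] += 1
    return votes


votes = count_votes(ALIGNMENT)
print(votes)                                      # {1: 11, 2: 7, 3: 5}
print("winning tree:", max(votes, key=votes.get)) # winning tree: 1
```

The counts agree with the 11, 7 and 5 positions reported next to the alignment; the remaining 20 columns are not of the "2+2" kind and cast no vote.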

We see that tree 1 is supported by an excess of 11 − 7 = 4 positions, which may not be many, but these positions are the reason for the parsimony result. What is the relevance of the term "long branch attraction"? The number of gray-highlighted nucleotides in the alignment of 131 positions (above) indicates the number of changes on terminal branches. The expected number of changes is used as a measure of the branch "length". We see that the A.mariae and P.zenodorus sequences have at least twice as many changes as the A.numitor and T.pylades sequences (30 and 31 vs. 12 and 15 changes). Thus, A.mariae and P.zenodorus are the leaves on two long branches. Parsimony groups these two long branches together: the long branches "attract". Is the "attraction" for the wrong reasons?

To address this question, we use other phylogeny reconstruction methods. A commonly used approach, largely due to its speed and ability to handle alignments of very many taxa, is distance-based. Pairwise "distances" (defined as expected numbers of changes) are estimated for each pair of sequences. Algebraic manipulations with the matrix of these distances result in the selection of a tree topology. The BioNJ implementation of the distance method produces the same tree as maximum parsimony. The branches are shown to scale and reflect the lengths. It is apparent that the A.mariae and P.zenodorus branches are long, and the other two branches are shorter. The tree is statistically supported by a reasonably strong bootstrap value of 80%. So the distance tree still contradicts the result of Warren et al. (2008). Here is the distance matrix the tree was based on. The shortest distance is between Ancyloxypha and Thorybes (0.076660); the longest one is, as expected, between Agathymus and Pyrrhopyge (0.138806), which is almost twice as long.

             Agathymus  Pyrrhopyge  Ancyloxypha  Thorybes
Agathymus    0.000000   0.138806    0.109057     0.120632
Pyrrhopyge   0.138806   0.000000    0.116435     0.117912
Ancyloxypha  0.109057   0.116435    0.000000     0.076660
Thorybes     0.120632   0.117912    0.076660     0.000000
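
To see how such a matrix can select a topology, one simple option for four taxa is the four-point criterion: for each of the three pairings, add the two within-pair distances and prefer the pairing with the smallest sum. For four taxa this picks the same pair as the neighbor-joining selection step, but the sketch below is not the BioNJ implementation used for the tree, and `PAIRINGS` and `pairing_sums` are hypothetical names.

```python
NAMES = ["Agathymus", "Pyrrhopyge", "Ancyloxypha", "Thorybes"]

DIST = [
    [0.000000, 0.138806, 0.109057, 0.120632],
    [0.138806, 0.000000, 0.116435, 0.117912],
    [0.109057, 0.116435, 0.000000, 0.076660],
    [0.120632, 0.117912, 0.076660, 0.000000],
]

# The three unrooted topologies as pairs of 0-based indices into NAMES.
PAIRINGS = {
    1: ((0, 1), (2, 3)),
    2: ((0, 2), (1, 3)),
    3: ((0, 3), (1, 2)),
}


def pairing_sums(dist):
    """Sum of the two within-pair distances for each candidate tree."""
    return {
        tree: dist[a][b] + dist[c][d]
        for tree, ((a, b), (c, d)) in PAIRINGS.items()
    }


sums = pairing_sums(DIST)
for tree, total in sums.items():
    (a, b), (c, d) = PAIRINGS[tree]
    print(f"tree {tree}: {NAMES[a]}+{NAMES[b]} | {NAMES[c]}+{NAMES[d]}  sum = {total:.6f}")
print("preferred tree:", min(sums, key=sums.get))
```

The sums are approximately 0.215466, 0.226969 and 0.237067 for trees 1, 2 and 3, so this criterion also prefers tree 1. The very short Ancyloxypha to Thorybes distance pulls those two together and leaves the two long branches paired by default.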

Another method to try is Maximum Likelihood. This method attempts to model the substitution process statistically and uses all the positions (even the invariant positions that are ignored by parsimony) to deduce the tree. The PhyML implementation of the method produces a tree that is exactly the same as the tree reported by Warren et al. (2008). Interestingly, we see that the A.mariae and P.zenodorus branches are still long, but they are not grouped with each other. Instead, each is grouped with a shorter branch. Statistical support for this tree is only moderate, though, at about 60%. It is not uncommon to obtain trees with different topologies using different methods. Which one is correct?

Recently, Bayesian methods for phylogeny reconstruction have been gaining popularity. The best-known program is MrBayes, and it leads to the same tree as the one obtained by Maximum Likelihood. The two methods are somewhat close in their strategy, as both use statistical modeling of the substitution process and attempt to utilize all the positions in the calculation, so no signal is lost. Why do Bayesian and Maximum Likelihood reconstructions lead to a tree different from the parsimony tree? Apparently, these methods explain the excess of "2+2" positions voting for the {A.mariae, P.zenodorus} pair differently. Since A.mariae and P.zenodorus are each placed on a long branch, the same nucleotide in these two sequences is likely to be the result of independent substitutions that happened on each branch. As there are only 4 nucleotides, it makes sense to think that about 1/4 of all independent substitutions in 2 sequences result in the same nucleotide being present in both. Thus, if there are more substitutions, there are more nucleotides of the same type generated in the two sequences by these substitutions. Instead of indicating a "closer" relationship, identical nucleotide pairs may be the result of a larger number of substitutions. Maximum Likelihood and Bayesian Inference attempt to model these processes statistically, and apparently the larger number of sites voting for the grouping of {A.mariae, P.zenodorus} is better explained by the larger number of substitutions in each of these sequences than by a closer relationship of the two species. The tree 2 grouping of A.mariae with A.numitor is consistent with morphology (Warren et al. 2009) and is more likely to be correct.
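
The "about 1/4" argument can be made concrete with a small calculation. The sketch below is a deliberately crude model and not what PhyML or MrBayes compute: it assumes that substitutions on the two branches are independent and that every resulting nucleotide is equally likely. The function names and the numbers in the usage lines (1000 sites, per-site change probabilities of 1/10 and 1/5) are hypothetical and are not taken from the skipper data.

```python
from fractions import Fraction
from itertools import product

NUCLEOTIDES = "ACGT"


def chance_of_match():
    """Fraction of equally likely nucleotide pairs that are identical."""
    pairs = list(product(NUCLEOTIDES, repeat=2))
    same = sum(1 for x, y in pairs if x == y)
    return Fraction(same, len(pairs))


def expected_chance_matches(sites, p_a, p_b):
    """Expected number of sites changed on both branches that end up identical.

    p_a and p_b are the (assumed) per-site probabilities of a substitution
    on each of the two branches.
    """
    return sites * p_a * p_b * chance_of_match()


print(chance_of_match())  # 1/4

# Hypothetical short branches versus branches that are twice as long.
short = expected_chance_matches(1000, Fraction(1, 10), Fraction(1, 10))
long_ = expected_chance_matches(1000, Fraction(1, 5), Fraction(1, 5))
print(short, long_, long_ / short)  # 5/2 10 4
```

Under these assumptions, doubling the length of both branches quadruples the expected number of coincidental matches, which is the kind of spurious shared signal that parsimony reads as support for grouping the two long branches.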

Conclusions:

17-Aug-2009 © Nick V. Grishin

