©1996-2019
All Rights Reserved. Online
Journal of Bioinformatics . You may not store these pages in any form
except for your own personal use. All other usage or distribution is illegal
under international copyright treaties. Permission to use any of these pages in
any other way besides the before mentioned must
be gained in writing from the publisher. This article is exclusively
copyrighted in its entirety to OJB publications. This article may be copied
once but may not be, reproduced or re-transmitted
without the express permission of the editors. This journal satisfies the refereeing requirements
(DEST) for the Higher Education Research Data Collection (Australia). Linking:To link to this page or
any pages linking to this page you must link directly to this page only here
rather than put up your own page.
OJBTM
Online Journal of
Bioinformatics©
Volume
8 (2):165-178, 2007.
Constructing Phylogeny
from Whole Genome Alignment
Chan
PY, Lam TW, Liu CM, Yiu SM
Department of
Computer Science, The
ABSTRACT
Chan
PY, Lam TW, Liu CM, Yiu SM., Constructing
Phylogeny from Whole Genome Alignment, Onl J Bioinform., 8 (2):165-178, 2007. To
reconstruct a phylogeny for a given set of species, most of the previous
approaches are based on the similarity information derived from a subset of
conserved regions (or genes) in the corresponding genomes. In some cases, the
regions chosen may not reflect the evolutionary history of the species and may
be too restricted to differentiate the species. It is generally believed that
the inference could be more accurate if whole genomes are being considered. The
best existing solution that makes use of complete genomes was proposed by Henz et al.[14] They can construct a phylogeny for 91
prokaryotic genomes in 170 CPU hours with an accuracy of about 70% (based on
the measurement of non-trivial splits) while other approaches that use whole
genomes can only deal with no more than 20 species. Note that Henz et al. measure the distance between the species using
BLASTN which is not primarily designed for whole genome alignment. Also, their
approach is not scalable, for example, it probably takes over 1000 CPU hours to
construct a phylogeny for all 230 prokaryotic genomes published by NCBI. In
addition, we found that non-trivial splits is only a
rough indicator of the accuracy of the phylogeny. In this paper, we propose the
followings. (1) To evaluate the quality of a phylogeny with respect to a model
answer, we suggest using the concept of the maximum agreement subtree as it can
capture the structure of the phylogeny. (2) We propose to use whole genome
alignment software (such as MUMmer) to measure the
distances between the species and derive an efficient approach to generate
these distances. From the experiments on real data sets, we found that our
approach is more accurate and more scalable than Henz
et al.’s approach. We can construct a phylogenetic tree for the same set of 91
genomes with an accuracy more than 20% higher (with respect to both evaluation
measures) in 2 CPU hours (more than 80 times faster than their approach). Also,
our approach is scalable and can construct a phylogeny for 230 prokaryotic
genomes with accuracy as high as 85% in only 9.5 CPU hours.
Keywords: whole genome alignment,
phylogeny, maximum compatible subtree, maximum agreement subtree.
FULL-TEXT (SUBSCRIBE OR PURCHASE TITLE $25USD)