©1996-2019 All
Rights Reserved.
Online Journal of Bioinformatics. You may not store
these pages in any form except for your own personal use. All other usage or
distribution is illegal under international copyright treaties. Permission to use any of these pages in any other way besides the
before mentioned must be gained in writing from the publisher. This
article is exclusively copyrighted in its entirety to OJB publications. This
article may be copied once but may not be reproduced or re-transmitted without
the express permission of the editors.
OJB©
Online Journal of Bioinformatics©
Volume 6
(2) 162-173, 2005.
The Effect of Inverse Document Frequency
Weights on DNA Sequence Retrieval
O'Kane KC
Department of Computer Science, The University of Northern Iowa, Cedar Falls, Iowa
50613-0507, USA.
ABSTRACT
O'Kane KC, The Effect of Inverse Document
Frequency Weights on DNA Sequence Retrieval, Onl J Bioinform., 6 (2):162-173, 2005. This paper presents a method to identify
weighted n-gram sequence fragments in large genomic databases whose indexing
characteristics permits the construction of fast, indexed, sequence retrieval
programs where query processing time is determined mainly by the size of the
query and number of sequences retrieved rather than, as is the case in
sequential scan based systems such as BLAST, FASTA, and Smith-Waterman, the
size of the database. The weighting scheme is based on the inverse document frequency (IDF)
method, a weighting formula that calculates the relative importance of indexing
terms based on term distribution. In the experiments, the relative IDF
weights of all segmented, overlapping, fixed n-grams of length eleven in the
NCBI “nt” and other databases were calculated and the
resulting n-grams ranked and used to create an inverted index into the sequence
file. The system was evaluated on test cases constructed from randomly
selected known sequences which were randomly fragmented and mutated and the
results compared with BLAST and MegaBlast for
accuracy and speed. Due to the speed of query processing, the system is
also capable of database sequence clustering, examples of which are given
KEY WORDS: inverse document
frequency; sequence retrieval; sequence clustering; n-grams; inverted index.
FULL-TEXT (SUBSCRIBE OR PURCHASE TITLE $25USD)