©2023-2035 All Rights Reserved. Online Journal of Bioinformatics.
You may not store these pages in any form except for your own personal use. All
other usage or distribution is illegal under international copyright treaties. Permission to use any of these pages in any way other than the
aforementioned must be obtained in writing from the publisher. This
article is exclusively copyrighted in its entirety to OJB publications. This
article may be copied once but may not be reproduced or re-transmitted without
the express permission of the editors.
OJB©
Online Journal of Bioinformatics©
Onl J Bioinform©
Established 1995
ISSN 1443-2250
Volume 25 (1) : 26-35, 2024.
Inverse document frequency weights and DNA sequence retrieval.
O'Kane KC
Department of Computer Science, The University of Northern Iowa, Cedar Falls, Iowa 50613-0507,
USA
ABSTRACT
O'Kane KC, Inverse document frequency weights and
DNA sequence retrieval, Onl J Bioinform., 25 (1) : 26-35, 2024. The author describes a
method for weighting n-gram sequence fragments in genomic databases so that the
databases can be indexed for sequence retrieval programs in which query processing
time is determined by the size of the query and the number of sequences. The
weighting is based on inverse document frequency (IDF), a formula that calculates
the relative importance of indexing terms from their distribution. IDF weights
were calculated for segmented, overlapping, fixed-length n-grams in NCBI databases
and used to create an inverted index into the sequence file. The system was
evaluated on test cases built from randomly selected known sequences that were
fragmented and mutated, and the results were compared with BLAST and MegaBlast.
Because query processing is fast, the system can also be used for database
sequence clustering, and examples are given.
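The indexing idea summarized in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the toy sequences, the n-gram length of 4, the function names, and the textbook IDF form log2(N / df) are all assumptions made for illustration, and the weighting formula actually used in the article may differ.

```python
import math
from collections import defaultdict


def ngrams(seq, n):
    """Return all overlapping n-grams of length n in seq."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def build_index(sequences, n):
    """Build an inverted index: n-gram -> set of ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for gram in ngrams(seq, n):
            index[gram].add(seq_id)
    return index


def idf_weights(index, total):
    """Textbook IDF, log2(N / df); an assumed form, not taken from the article."""
    return {gram: math.log2(total / len(ids)) for gram, ids in index.items()}


def search(query, index, weights, n):
    """Score each indexed sequence by the summed IDF of the query n-grams it shares."""
    scores = defaultdict(float)
    for gram in set(ngrams(query, n)):
        for seq_id in index.get(gram, ()):
            scores[seq_id] += weights[gram]
    # Highest score first; ties broken by sequence id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy database; real use would read sequences from a FASTA file.
sequences = {
    "s1": "ACGTACGTGA",
    "s2": "TTGACCGTAA",
    "s3": "ACGTTTGACC",
}
N = 4  # illustrative n-gram length only

index = build_index(sequences, N)
weights = idf_weights(index, len(sequences))
ranked = search("CGTACGTG", index, weights, N)
for seq_id, score in ranked:
    print(seq_id, round(score, 3))
```

In this sketch the work done per query depends on the number of query n-grams and the length of their posting lists, not on a scan of every stored sequence, which is the property the abstract attributes to indexed retrieval. N-grams found in few sequences receive higher weights, so they contribute most to the ranking.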
KEY WORDS: inverse document frequency; sequence retrieval; sequence
clustering; n-grams; inverted index.