©2023-2035 All Rights Reserved. Online Journal of Bioinformatics. You may not store these pages in any form except for your own personal use. All other usage or distribution is illegal under international copyright treaties. Permission to use any of these pages in any other way besides the before mentioned must be gained in writing from the publisher. This article is exclusively copyrighted in its entirety to OJB publications. This article may be copied once but may not be reproduced or re-transmitted without the express permission of the editors.


 

OJB©

Online Journal of Bioinformatics©

Onl J Bioinform©


Established 1995

ISSN  1443-2250

 

Volume 25 (1) : 26-35, 2024.


Inverse document frequency weights and DNA sequence retrieval.

 

O'Kane KC

 Department of Computer Science, The University of Northern Iowa, Cedar Falls, Iowa 50613-0507, USA

 

ABSTRACT

 

O'Kane KC, Inverse document frequency weights and DNA sequence retrieval, Onl J Bioinform., 25 (1) : 26-35, 2024. The author describes the use of n-gram sequence fragments, weighted by inverse document frequency (IDF), to index genomic databases for sequence retrieval programs in which query processing time is determined by the size of the query and the number of sequences. The IDF formula calculates the relative importance of indexing terms based on their distribution. IDF weights were calculated for overlapping, fixed-length n-grams segmented from NCBI sequences and used to create an inverted index into the sequence file. The system was evaluated on test cases built from randomly selected known sequences that were fragmented and mutated, and its results were compared with those of BLAST and MegaBlast. Because query processing is fast, the system is also capable of database sequence clustering, and examples are given.

KEY WORDS: inverse document frequency; sequence retrieval; sequence clustering; n-grams; inverted index.
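
The indexing idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not give the n-gram length, the exact IDF formula, or the scoring rule, so the values and names below (n = 4, the common form log2(N / df), summing the weights of shared n-grams, and the functions `ngrams`, `build_index`, `idf_weights` and `search`) are assumptions chosen for illustration only.

```python
import math
from collections import defaultdict


def ngrams(seq, n):
    """Return the overlapping n-grams of seq, in order of position."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def build_index(sequences, n):
    """Inverted index: map each n-gram to the set of sequence ids containing it."""
    index = defaultdict(set)
    for sid, seq in sequences.items():
        for gram in ngrams(seq, n):
            index[gram].add(sid)
    return index


def idf_weights(index, total):
    """Assumed IDF form: log2(N / df), where df is the number of sequences
    containing the n-gram. N-grams found in every sequence get weight 0."""
    return {gram: math.log2(total / len(ids)) for gram, ids in index.items()}


def search(query, index, weights, n):
    """Score each candidate by the summed IDF weight of the distinct n-grams
    it shares with the query (an assumed scoring rule); best match first."""
    scores = defaultdict(float)
    for gram in set(ngrams(query, n)):
        for sid in index.get(gram, ()):
            scores[sid] += weights[gram]
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, hypothetical and far shorter than real database sequences.
sequences = {"s1": "ACGTACGT", "s2": "ACGTTTTT", "s3": "GGGGACGA"}
index = build_index(sequences, 4)
weights = idf_weights(index, len(sequences))
print(search("CGTACG", index, weights, 4))
```

In this sketch only the posting lists of the query's own n-grams are visited, which is one way query cost can depend on query size rather than on a scan of every sequence. Rare n-grams contribute more to a score than common ones.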

