On Computing Average Common Substring Over Run Length Encoded Sequences

Hooshmand, S.; Abedin, P.; Tavakoli, N.; Thankachan, S. V.

doi:10.3233/FI-2018-1743

Artykuł - szczegóły

Tytuł artykułu

On Computing Average Common Substring Over Run Length Encoded Sequences

Autorzy

Hooshmand S. , Abedin P. , Tavakoli N. , Thankachan S. V.

Wybrane pełne teksty z tego czasopisma

https://fi.episciences.org/

Identyfikatory

DOI

10.3233/FI-2018-1743

Warianty tytułu

Języki publikacji

Abstrakty

The Average Common Substring (ACS) is a popular alignment-free distance measure for phylogeny reconstruction. The ACS of a sequence X[1; x] w.r.t. another sequence Y[1; y] is ACS(X;Y) =[formula] The lcp(., .) of two input sequences is the length of their longest common prefix. The ACS can be computed in O(n) space and time, where n = x + y is the input size. The compressed string matching is the study of string matching problems with the following twist: the input data is in a compressed format and the underling task must be performed with little or no decompression. In this paper, we revisit the ACS problem under this paradigm where the input sequences are given in their run-length encoded format. We present an algorithm to compute ACS(X,Y) in O(N log N) time using O(N) space, where N is the total length of sequences after run-length encoding.

Słowa kluczowe

compression RL encoding string algorithms suffix tree

Wydawca

IOS Press

Czasopismo

Fundamenta Informaticae

Rocznik

2018

Tom

Vol. 163, nr 3

Strony

267--273

Opis fizyczny

Bibliogr. 19 poz.

Twórcy

autor

Hooshmand S.

sahar@cs.ucf.edu

Department of Computer Science, University of Central Florida, Orlando, USA

autor

Abedin P.

paniz@cs.ucf.edu

Department of Computer Science, University of Central Florida, Orlando, USA

autor

Tavakoli N.

neda.tavakoli@gatech.edu

School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, USA

autor

Thankachan S. V.

sharma.thankachan@ucf.edu

Department of Computer Science, University of Central Florida, Orlando, USA

Bibliografia

[1] Burstein, D., Ulitsky, I., Tuller, T., Chor, B.: Information theoretic approaches to whole genome phylogenies, Proceedings of the 9th Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2005. doi:10.1007/11415770_22.
[2] Aluru, S., Apostolico, A., Thankachan, S. V.: Efficient alignment free sequence comparison with bounded mismatches, Proceedings of the 19th Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2015. doi:10.1007/978-3-319-16706-0_1.
[3] Apostolico, A., Guerra, C., Landau, G. M., Pizzi, C.: Sequence similarity measures based on bounded hamming distance, Theoretical Computer Science, 638, 2016, 76-90. doi:10.1016/j.tcs.2016.01.023.
[4] Leimeister, C.-A., Morgenstern, B.: kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison, Bioinformatics, 30(14), 2014, 2000-2008. doi:10.1093/bioinformatics/btu331.
[5] Manzini, G.: Longest Common Prefix with Mismatches, Proceedings of the 22nd International Symposium on String Processing and Information Retrieval (SPIRE), Springer, 2015. doi:10.1007/978-3-319-23826-5_29.
[6] Thankachan, S. V., Apostolico, A., Aluru, S.: A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem, Journal of Computational Biology, 23(6), 2016, 472-482. doi:10.1089/cmb.2015.0235.
[7] Thankachan, S. V., Chockalingam, S. P., Liu, Y., Apostolico, A., Aluru, S.: ALFRED: a practical method for alignment-free distance computation, Journal of Computational Biology, 23(6), 2016, 452-460. doi:10.1089/cmb.2015.0217.
[8] Thankachan, S. V., Chockalingam, S. P., Liu, Y., Krishnan, A., Aluru, S.: A greedy alignment-free distance estimator for phylogenetic inference, Proceedings of 5th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), 2015. doi:10.1109/ICCABS.2015.7344711.
[9] Thankachan, S. V., Aluru, C., Chockalingam, S. P., Aluru, S.: Algorithmic Framework for Approximate Matching Under Bounded Edits with Applications to Sequence Analysis, Proceedings of the 922nd Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2018. doi:10.1007/978-3-319-89929-9_14.
[10] Apostolico, A.: Maximal words in sequence comparisons based on subword composition, in: Algorithms and Applications, Springer, 2010, 34-44. doi:10.1007/978-3-642-12476-1_2.
[11] Bonham-Carter, O., Steele, J., Bastola, D.: Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis, Briefings in bioinformatics, 2013, bbt052. doi:10.1093/bib/bbt052.
[12] Chang, G., Wang, T.: Phylogenetic analysis of protein sequences based on distribution of length about common substring, The Protein Journal, 30(3), 2011, 167-172. doi:10.1007/s10930-011-9318-0.
[13] Comin, M., Verzotto, D.: Alignment-free phylogeny of whole genomes using underlying subwords, Algorithms for Molecular Biology, 7(1), 2012, 1. doi:10.1186/1748-7188-7-34.
[14] Domazet-Lošo, M., Haubold, B.: Efficient estimation of pairwise distances between genomes, Bioinformatics, 25(24), 2009, 3221-3227. doi:10.1093/bioinformatics/btp590.
[15] Guyon, F., Brochier-Armanet, C., Guénoche, A.: Comparison of alignment free string distances for complete genome phylogeny, Advances in Data Analysis and Classification, 3(2), 2009, 95-108. doi:10.1007/s11634-009-0041-z.
[16] Weiner, P.: Linear pattern matching algorithms, Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory (SWAT), 1973. doi:10.1109/SWAT.1973.13.
[17] McCreight, E. M.: A space-economical suffix tree construction algorithm, Journal of the ACM (JACM), 23(2), 1976, 262-272. doi:10.1145/321941.321946.
[18] Farach, M.: Optimal Suffix Tree Construction with Large Alphabets, 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, 1997. doi:10.1109/SFCS.1997.646102.
[19] Bender, M. A., Farach-Colton, M.: The LCA Problem Revisited, LATIN 2000: Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, April 10-14, 2000, Proceedings, 2000. doi:10.1007/10719839_9.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-6fbc1b0e-1dc6-4dc3-aa76-2e9b0bca3d1a