Self-Indexed Grammar-Based Compression

Claude, F.; Navarro, G.

Nowa wersja platformy, zawierająca wyłącznie zasoby pełnotekstowe, jest już dostępna.
Przejdź na https://bibliotekanauki.pl

Artykuł - szczegóły

Czasopismo

Fundamenta Informaticae

2011 | Vol. 111, nr 3 | 313-337

Tytuł artykułu

Self-Indexed Grammar-Based Compression

Autorzy

Claude, F. , Navarro, G.

Wybrane pełne teksty z tego czasopisma

https://fi.episciences.org/

Warianty tytułu

Języki publikacji

Abstrakty

Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammars. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log2 n bits. For that same SLP, our self-index takes O(n log n) + n log2 u bits. It extracts any text substring of length m in time O((m+ h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log2 n(1+o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar compressors that reduce T to a sequence of terminals and nonterminals, such as Re-Pair and LZ78.

Słowa kluczowe

grammar-based compression straight-line programs self-indexes compressed text databases highly repetitive sequences pattern matching data structures

Wydawca

Czasopismo

Fundamenta Informaticae

Rocznik

2011

Tom

Vol. 111, nr 3

Strony

313-337

Opis fizyczny

Bibliogr. 50 poz., wykr.

Twórcy

autor

Claude, F.

autor

Navarro, G.

David R. Cheriton School of Computer Science, University of Waterloo. 200 University Avenue West, Waterloo, Ontario, Canada. N2L 3G1, fclaude@cs.uwaterloo.ca

Bibliografia

[1] Amir, A., Benson, G.: Efficient two-dimensional compressed matching, Proc. 2nd Data Compression Conference (DCC), 1992.
[2] Apostolico, A., Lonardi, S.: Some theory and practice of greedy off-line textual substitution, Proc. 8th Data Compression Conference (DCC), 1998.
[3] Arora, S., Hazan, E., Kale, S.: O(√log n) approximation to SPARSEST CUT O(n2) in time, Proc. 45th Annual Symposium on Foundations of Computer Science (FOCS), 2004.
[4] Arroyuelo, D., Navarro, G., Sadakane, K.: Reducing the space requirement of LZ-index, Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 4009, 2006.
[5] Barbay, J., Claude, F., Navarro, G.: Compact rich-functional binary relation representations, Proc. 9th Latin American Symposium on Theoretical Informatics (LATIN), LNCS 6034, 2010.
[6] Barbay, J., Golynski, A., Munro, I., Rao, S. S.: Adaptive searching in succinctly encoded binary relations and tree-structured documents, Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 4009, 2006.
[7] Bender, M., Farach-Colton, M.: The level ancestor problem simplified, Theoretical Computer Science, 321(1), 2004, 5-12.
[8] Bille, P., Landau, G., Weimann, O.: Random access to grammar compressed strings, 2010, ArXiv preprint 1001.1565.
[9] Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem, IEEE Transactions on Information Theory, 51(7), 2005, 2554-2576.
[10] Clark, D.: Compact Pat Trees, Ph.D. Thesis, University of Waterloo, 1996.
[11] Claude, F., Farina, A., Mart´ınez-Prieto, M., Navarro, G.: Compressed q-gram indexing for highly repetitive biological sequences, Proc. 10th IEEE Conference on Bioinformatics and Bioengineering (BIBE), 2010.
[12] D. Bentley et al.: Accurate whole human genome sequencing using reversible terminator chemistry, Nature, 456(7218), 2008, 53-59.
[13] Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction, Journal of the ACM, 47(6), 2000, 987-1011.
[14] Ferragina, P., Manzini, G.: Indexing compressed texts, Journal of the ACM, 52(4), 2005, 552-581.
[15] Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes, ACM Transactions on Algorithms, 3(2), 2007, article 20.
[16] Fischer, J.: Optimal Succinctness for Range Minimum Queries, Proc. 9th Symposium on Latin American Theoretical Informatics (LATIN), LNCS 6034, 2010.
[17] Gasieniec, L., Kolpakov, R., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files, Proc. 15th Data Compression Conference (DCC), 2005.
[18] Gasieniec, L., Potapov, I.: Time/space efficient compressed pattern matching, Fundamenta Informaticae, 56(1-2), 2003, 137-154.
[19] Golynski, A., Munro, I., Rao, S.: Rank/select operations on large alphabets: a tool for text indexing, Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006.
[20] Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes, Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2003.
[21] Hon, W. K., Sadakane, K., Sung, W. K.: Breaking a time-and-space barrier in constructing full-text indices, Proc. 44th Annual Symposium on Foundations of Computer Science (FOCS), 2003.
[22] Inenaga, S., Bannai, H.: Finding characteristic substrings from compressed texts, Proc. 14th Prague Stringology Conference, Czech Technical University in Prague, 2009.
[23] Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction, Proc. 30th International Colloquium on Automata, Languages, and Programming (ICALP), LNCS 2719, 2003.
[24] Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching, Proc. 3rd South American Workshop on String Processing (WSP), Carleton University Press, 1996.
[25] Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions, Nordic Journal of Computing, 4(2), 1997, 172-186.
[26] Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications, Proc. 12th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 2089, 2001.
[27] Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching, Theoretical Computer Science, 298(1), 2003, 253-272.
[28] Kieffer, J., Yang, E.-H.: Grammar-based codes: A new class of universal lossless source codes, IEEE Transactions on Information Theory, 46(3), 2000, 737-754.
[29] Knuth, D., Morris, J., Pratt, V.: Fast pattern matching in strings, SIAM Journal on Computing, 6(1), 1977, 323-350.
[30] Kreft, S., Navarro, G.: LZ77-like compression with fast random access, Proc. 20th Data Compression Conference (DCC), 2010.
[31] Kuruppu, S., Beresford-Smith, B., Conway, T., Zobel, J.: Repetition-based compression of large DNA datasets, Proc. 13th Annual International Conference on Computational Molecular Biology (RECOMB), 2009, Poster.
[32] Larsson, J., Moffat, A.: Off-line dictionary-based compression, Proceedings of the IEEE, 88(11), 2000, 1722-1732.
[33] Mäkinen, V., Navarro, G.: Rank and select revisited and extended, Theoretical Computer Science, 387(3), 2007, 332-347.
[34] Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches, SIAM Journal of Computing, 22(5), 1993, 935-948.
[35] Morrison, D.: PATRICIA - practical algorithm to retrieve information coded in alphanumeric, Journal of the ACM, 15(4), 1968, 514-534.
[36] Munro, J., Raman, R., Raman, V., Rao, S. S.: Succinct representations of permutations, Proc. 30th International Colloquium on Automata, Languages, and Programming (ICALP), LNCS 2719, 2003.
[37] Navarro, G.: Indexing text using the Ziv-Lempel trie, Journal of Discrete Algorithms, 2(1), 2004, 87-114.
[38] Navarro, G., Mäkinen, V.: Compressed full-text indexes, ACM Computing Surveys, 39(1), 2007, article 2.
[39] Nevill-Manning, C., Witten, I., Maulsby, D.: Compression by induction of hierarchical grammars, Proc. 4th Data Compression Conference (DCC), 1994.
[40] Raman, R., Raman, V., Rao, S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets, Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002.
[41] Russo, L., Oliveira, A.: A compressed self-index using a Ziv-Lempel dictionary, Information Retrieval, 11(4), 2008, 359-388.
[42] Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression, Theoretical Computer Science, 302(1-3), 2003, 211-222.
[43] Sadakane, K.: New text indexing functionalities of the compressed suffix arrays, Journal of Algorithms, 48(2), 2003, 294-313.
[44] Sadakane, K.: Compressed suffix trees with full functionality, Theory of Computing Systems, 41(4), 2007, 589-607.
[45] Sadakane, K., Navarro, G.: Fully-functional succinct trees, Proc. 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2010.
[46] Sakamoto, H.: A fully linear-time approximation algorithm for grammar-based compression, Journal of Discrete Algorithms, 3, 2005, 416-430.
[47] Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections, Proc. 15th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5280, 2008.
[48] Storer, J., Szymanski, T.: Data compression via textual substitution, Journal of the ACM, 29(4), 1982, 928-951.
[49] Ziv, J., Lempel, A.: A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23(3), 1977, 337-343.
[50] Ziv, J., Lempel, A.: Compression of individual sequences via variable length coding, IEEE Transactions on Information Theory, 24(5), 1978, 530-536.

Typ dokumentu

Bibliografia

Identyfikatory

Identyfikator YADDA

bwmeta1.element.baztech-article-BUS8-0021-0009