Nowa wersja platformy, zawierająca wyłącznie zasoby pełnotekstowe, jest już dostępna.
Przejdź na https://bibliotekanauki.pl

PL EN


Preferencje help
Widoczny [Schowaj] Abstrakt
Liczba wyników
2011 | Vol. 111, nr 3 | 313-337
Tytuł artykułu

Self-Indexed Grammar-Based Compression

Wybrane pełne teksty z tego czasopisma
Warianty tytułu
Języki publikacji
EN
Abstrakty
EN
Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable of fully exploiting the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammars. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log2 n bits. For that same SLP, our self-index takes O(n log n) + n log2 u bits. It extracts any text substring of length m in time O((m+ h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log2 n(1+o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar compressors that reduce T to a sequence of terminals and nonterminals, such as Re-Pair and LZ78.
Wydawca

Rocznik
Strony
313-337
Opis fizyczny
Bibliogr. 50 poz., wykr.
Twórcy
autor
autor
  • David R. Cheriton School of Computer Science, University of Waterloo. 200 University Avenue West, Waterloo, Ontario, Canada. N2L 3G1, fclaude@cs.uwaterloo.ca
Bibliografia
  • [1] Amir, A., Benson, G.: Efficient two-dimensional compressed matching, Proc. 2nd Data Compression Conference (DCC), 1992.
  • [2] Apostolico, A., Lonardi, S.: Some theory and practice of greedy off-line textual substitution, Proc. 8th Data Compression Conference (DCC), 1998.
  • [3] Arora, S., Hazan, E., Kale, S.: O(√log n) approximation to SPARSEST CUT O(n2) in time, Proc. 45th Annual Symposium on Foundations of Computer Science (FOCS), 2004.
  • [4] Arroyuelo, D., Navarro, G., Sadakane, K.: Reducing the space requirement of LZ-index, Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 4009, 2006.
  • [5] Barbay, J., Claude, F., Navarro, G.: Compact rich-functional binary relation representations, Proc. 9th Latin American Symposium on Theoretical Informatics (LATIN), LNCS 6034, 2010.
  • [6] Barbay, J., Golynski, A., Munro, I., Rao, S. S.: Adaptive searching in succinctly encoded binary relations and tree-structured documents, Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 4009, 2006.
  • [7] Bender, M., Farach-Colton, M.: The level ancestor problem simplified, Theoretical Computer Science, 321(1), 2004, 5-12.
  • [8] Bille, P., Landau, G., Weimann, O.: Random access to grammar compressed strings, 2010, ArXiv preprint 1001.1565.
  • [9] Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem, IEEE Transactions on Information Theory, 51(7), 2005, 2554-2576.
  • [10] Clark, D.: Compact Pat Trees, Ph.D. Thesis, University of Waterloo, 1996.
  • [11] Claude, F., Farina, A., Mart´ınez-Prieto, M., Navarro, G.: Compressed q-gram indexing for highly repetitive biological sequences, Proc. 10th IEEE Conference on Bioinformatics and Bioengineering (BIBE), 2010.
  • [12] D. Bentley et al.: Accurate whole human genome sequencing using reversible terminator chemistry, Nature, 456(7218), 2008, 53-59.
  • [13] Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction, Journal of the ACM, 47(6), 2000, 987-1011.
  • [14] Ferragina, P., Manzini, G.: Indexing compressed texts, Journal of the ACM, 52(4), 2005, 552-581.
  • [15] Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes, ACM Transactions on Algorithms, 3(2), 2007, article 20.
  • [16] Fischer, J.: Optimal Succinctness for Range Minimum Queries, Proc. 9th Symposium on Latin American Theoretical Informatics (LATIN), LNCS 6034, 2010.
  • [17] Gasieniec, L., Kolpakov, R., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files, Proc. 15th Data Compression Conference (DCC), 2005.
  • [18] Gasieniec, L., Potapov, I.: Time/space efficient compressed pattern matching, Fundamenta Informaticae, 56(1-2), 2003, 137-154.
  • [19] Golynski, A., Munro, I., Rao, S.: Rank/select operations on large alphabets: a tool for text indexing, Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006.
  • [20] Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes, Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2003.
  • [21] Hon, W. K., Sadakane, K., Sung, W. K.: Breaking a time-and-space barrier in constructing full-text indices, Proc. 44th Annual Symposium on Foundations of Computer Science (FOCS), 2003.
  • [22] Inenaga, S., Bannai, H.: Finding characteristic substrings from compressed texts, Proc. 14th Prague Stringology Conference, Czech Technical University in Prague, 2009.
  • [23] Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction, Proc. 30th International Colloquium on Automata, Languages, and Programming (ICALP), LNCS 2719, 2003.
  • [24] Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching, Proc. 3rd South American Workshop on String Processing (WSP), Carleton University Press, 1996.
  • [25] Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions, Nordic Journal of Computing, 4(2), 1997, 172-186.
  • [26] Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications, Proc. 12th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 2089, 2001.
  • [27] Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching, Theoretical Computer Science, 298(1), 2003, 253-272.
  • [28] Kieffer, J., Yang, E.-H.: Grammar-based codes: A new class of universal lossless source codes, IEEE Transactions on Information Theory, 46(3), 2000, 737-754.
  • [29] Knuth, D., Morris, J., Pratt, V.: Fast pattern matching in strings, SIAM Journal on Computing, 6(1), 1977, 323-350.
  • [30] Kreft, S., Navarro, G.: LZ77-like compression with fast random access, Proc. 20th Data Compression Conference (DCC), 2010.
  • [31] Kuruppu, S., Beresford-Smith, B., Conway, T., Zobel, J.: Repetition-based compression of large DNA datasets, Proc. 13th Annual International Conference on Computational Molecular Biology (RECOMB), 2009, Poster.
  • [32] Larsson, J., Moffat, A.: Off-line dictionary-based compression, Proceedings of the IEEE, 88(11), 2000, 1722-1732.
  • [33] Mäkinen, V., Navarro, G.: Rank and select revisited and extended, Theoretical Computer Science, 387(3), 2007, 332-347.
  • [34] Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches, SIAM Journal of Computing, 22(5), 1993, 935-948.
  • [35] Morrison, D.: PATRICIA - practical algorithm to retrieve information coded in alphanumeric, Journal of the ACM, 15(4), 1968, 514-534.
  • [36] Munro, J., Raman, R., Raman, V., Rao, S. S.: Succinct representations of permutations, Proc. 30th International Colloquium on Automata, Languages, and Programming (ICALP), LNCS 2719, 2003.
  • [37] Navarro, G.: Indexing text using the Ziv-Lempel trie, Journal of Discrete Algorithms, 2(1), 2004, 87-114.
  • [38] Navarro, G., Mäkinen, V.: Compressed full-text indexes, ACM Computing Surveys, 39(1), 2007, article 2.
  • [39] Nevill-Manning, C., Witten, I., Maulsby, D.: Compression by induction of hierarchical grammars, Proc. 4th Data Compression Conference (DCC), 1994.
  • [40] Raman, R., Raman, V., Rao, S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets, Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002.
  • [41] Russo, L., Oliveira, A.: A compressed self-index using a Ziv-Lempel dictionary, Information Retrieval, 11(4), 2008, 359-388.
  • [42] Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression, Theoretical Computer Science, 302(1-3), 2003, 211-222.
  • [43] Sadakane, K.: New text indexing functionalities of the compressed suffix arrays, Journal of Algorithms, 48(2), 2003, 294-313.
  • [44] Sadakane, K.: Compressed suffix trees with full functionality, Theory of Computing Systems, 41(4), 2007, 589-607.
  • [45] Sadakane, K., Navarro, G.: Fully-functional succinct trees, Proc. 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2010.
  • [46] Sakamoto, H.: A fully linear-time approximation algorithm for grammar-based compression, Journal of Discrete Algorithms, 3, 2005, 416-430.
  • [47] Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections, Proc. 15th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5280, 2008.
  • [48] Storer, J., Szymanski, T.: Data compression via textual substitution, Journal of the ACM, 29(4), 1982, 928-951.
  • [49] Ziv, J., Lempel, A.: A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23(3), 1977, 337-343.
  • [50] Ziv, J., Lempel, A.: Compression of individual sequences via variable length coding, IEEE Transactions on Information Theory, 24(5), 1978, 530-536.
Typ dokumentu
Bibliografia
Identyfikatory
Identyfikator YADDA
bwmeta1.element.baztech-article-BUS8-0021-0009
JavaScript jest wyłączony w Twojej przeglądarce internetowej. Włącz go, a następnie odśwież stronę, aby móc w pełni z niej korzystać.