Compact Suffix Array - A Space-Efficient Full-Text Index

Mäkinen, V.

Artykuł - szczegóły

Tytuł artykułu

Compact Suffix Array - A Space-Efficient Full-Text Index

Autorzy

Mäkinen V.

Wybrane pełne teksty z tego czasopisma

https://fi.episciences.org/

Identyfikatory

Warianty tytułu

Języki publikacji

Abstrakty

Suffix array is a widely used full-text index that allows fast searches on the text. It is constructed by sorting all suffixes of the text in the lexicographic order and storing pointers to the suffixes in this order. Binary search is used for fast searches on the suffix array. Compact suffix array is a compressed form of the suffix array that still allows binary searches, but the search times are also dependent on the compression. In this paper, we give efficient methods for constructing and querying compact suffix arrays. We also study practical issues, such as the trade off between compression and search times, and show how to reduce the space requirement of the construction. Experimental results are provided in comparison with other search methods. With a large text corpora, the index took 1.6 times the size of the text, while the searches were only two times slower than from a suffix array.

Słowa kluczowe

suffix array index compression full-text index string matching

Wydawca

IOS Press

Czasopismo

Fundamenta Informaticae

Rocznik

2003

Tom

Vol. 56, nr 1,2

Strony

191--210

Opis fizyczny

tab., bibliogr. 27 poz.

Twórcy

autor

Mäkinen V.

Department of Computer Science, P.P.Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland, vmakinen@cs.helsinki.fi

Bibliografia

[1] A. Blumer, J. Blumer, D. Haussler, and R. McConnell. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM, 34:3, pp. 578-595, 1987.
[2] A. Blumer, D. Haussler, and A. Ehrenfeucht. Average sizes of suffix trees and dawgs. Discrete Applied Mathematics, 24, pp. 37-45, 1989.
[3] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. DEC SRC Research Report 124, 1994.
[4] M. Crochemore. Transducers and repetitions. Theor. Comp. Sci., 45, pp. 63-86, 1986.
[5] M. Crochemore and Renaud Vérin. Direct Construction of Compact Directed Acyclic Word Graphs. In Proc. 8th Annual Symposium on Combinatorial Pattern Matching (CPM 1997), LNCS 1264, pp. 116-129, 1997.
[6] P. Ferragina and G. Manzini. Opportunistic Data Structures with Applications. In Proc. IEEE Symposium on Foundations of Computer Science (FOCS 2000), pp. 390-398, 2000.
[7] P. Ferragina and G. Manzini. An Experimental Study of an Opportunistic Index. In Proc. 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2001), pp. 269-278, 2001.
[8] R. Giegerich, S. Kurtz, and J. Stoye. Efficient Implementation of Lazy Suffix Trees. In Proc. Third Workshop on Algorithmic Engineering (WAE99), LNCS 1668, Springer Verlag, pp. 30-42, 1999.
[9] G. H. Gonnet and R. Baeza-Yates. Handbook of Algorithms and Data Structures - In Pascal and C. Addison-Wesley, Wokingham, UK, 1991. (second edition).
[10] G. H. Gonnet, R. A. Baeza-Yates, and T. Snider. Lexicographical indices for text: Inverted files vs. PAT trees. Technical Report OED-91-01, Centre for the New OED, University of Waterloo, 1991.
[11] R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2003), pp. 841-850, 2003.
[12] R. Grossi and J. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proc. 32nd ACM Symposium on Theory of Computing (STOC 2000), pp. 397-406, 2000.
[13] R. N. Horspool. Practical fast searching in strings. Soft. Pract. and Exp., 10, pp. 501-506, 1980.
[14] T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In Proc. 12th Annual Symposium on Combinatorial Pattern Matching (CPM 2001), LNCS 2089, pp. 181-192, 2001.
[15] J. Kärkkäinen. Repetition-Based Text Indexes, PhD Thesis, Report A-1999-4, Department of Computer Science, University of Helsinki, Finland, 1999.
[16] J. Kärkkäinen and E. Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing (WSP’96), pp. 141-155. Carleton University Press, 1996.
[17] N. Jesper Larsson and K. Sadakane. Faster Suffix Sorting. Technical Report, number LU-CS-TR:99-214, Department of Computer Science, Lund University, Sweden, 1999.
[18] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22, pp. 935-948, 1993.
[19] E. M. McCreight. A space economical suffix tree construction algorithm. Journal of the ACM, 23, pp. 262-272, 1976.
[20] V. Mäkinen. Compact Suffix Array. In Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM 2000), LNCS 1848, pp. 305-319, 2000.
[21] V. Mäkinen. Trade Off Between Compression and Search Times in Compact Suffix Array. In Proc. 3rd Workshop on Algorithm Engineering and Experimentation (ALENEX 2001), LNCS 2153, pp. 189-201, 2001.
[22] G. Navarro. Indexing Text using the Ziv-Lempel Trie. In Proc. 9th Intl. Symp. String Processing and Information Retrieval (SPIRE 2002), LNCS 2476, pp. 325-336, 2002.
[23] G. Navarro. The LZ-index: A Text Index Based on the Ziv-Lempel Trie. Technical Report R/DCC-2003-1, Univ. of Chile, Department of Computer Science, January 2003.
[24] K. Sadakane. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proc. 11th Intl. Symp. Algorithms and Computation (ISAAC’00), LNCS 1969, pp. 410-421, 2000.
[25] S. Srinivasa Rao. Time-space trade-offs for compressed suffix arrays. Information Processing Letters, Vol. 82 (6), pp. 307-311, 2002.
[26] E. Ukkonen. On-line construction of suffix-trees. Algorithmica, 14, pp. 249-260, 1995.
[27] P. Weiner. Linear pattern matching algorithms. In Proc. IEEE 14th Annual Symposium on Switching and Automata Theory, pp. 1-11, 1973.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-article-BUS2-0004-0131