Efficient String Matching in Huffman Compressed Texts

Fredriksson, K.; Tarhio, J.

Artykuł - szczegóły

Tytuł artykułu

Efficient String Matching in Huffman Compressed Texts

Autorzy

Fredriksson K. , Tarhio J.

Wybrane pełne teksty z tego czasopisma

https://fi.episciences.org/

Identyfikatory

Warianty tytułu

Języki publikacji

Abstrakty

We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in O(n[(log₂σ)/ b]) time, where n is the size of the compressed text in bytes, s is the size of the alphabet, and b is a user specified parameter. The method uses a variable size super-alphabet, with an average size of O([b/( H log₂σ)]) characters, where H is the entropy of the text. Each super-character is processed in O(1) time. The algorithm uses O(2^b) space and O(b2^b) preprocessing time. The method can be easily augmented by auxiliary functions, which can e.g. decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time O(n [(log₂σ)/ Hω]), where w is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time O(n[(log₂ s)/ b]+t), where t is the number of occurrences reported; and a shift-or string matching algorithm that works in time O(n [(log₂σ)/ b] é (m+s-1)/ω] u`+t), where m is the length of the pattern and s depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable length moves, i.e. it processes variable number of states at each step. The shift-or algorithm makes variable length shifts, effectively also processing variable number of states at each step. The number of states processed in O(1) time is O([b/( H log₂σ)]). The method can be applied to several other algorithms as well. Finally, we apply the methods to natural language taking the words (vocabulary) as the alphabet. This improves the compression ratio and allows more complex search problems to be solved efficiently. We conclude with some experimental results.

Słowa kluczowe

Huffman compression string matching natural language processing

Wydawca

IOS Press

Czasopismo

Fundamenta Informaticae

Rocznik

2004

Tom

Vol. 63, nr 1

Strony

1--16

Opis fizyczny

Bibliogr. 19 poz., wykr.

Twórcy

autor

Fredriksson K.

kfredrik@cs.joensuu.fi

Department of Computer Science, University of Joensuu, PO Box 111, 80101 Joensuu, Finland

autor

Tarhio J.

tarhio@cs.hia.fi

Department of CSE , Helsinki University of Technology, PO Box 5400. 02015 Hut. Finland

Bibliografia

[1] Aho, A. V.. Corasick, M. J.: Efficient string matching: an aid to bibliographic search. Commun. ACM. 18(6), 1975,333-340.
[2] Amir, A., Benson, G., Farach, M.: Let Sleeping Files Lie: Pattern Matching in Z-Conipressed Files, /Comput. Syst. Sci..52{2h 1996,299-307.
[3] Amir, A., Benson, G., Farach, M.: Optimal two-dimensional compressed matching, J. Algorithms, 24(2), 1997,354-379.
[4] Baeza-Yates, R. A., Gonnet, G. H.: A new approach to text searching. Commun. ACM, 35( 10), 1992, 74-82.
[5] Baeza-Yates. R. A., Navarro, G.: Multiple Approximate String Matching, Proceedings of the 5th Workshop on Algorithms and Data Structures (F. K. H. A. Dehne, A. Rau-Chaplin, J.-R. Sack, R. Tamassia, Eds.), number 1272 in Lecture Notes in Computer Science. Springer-Verlag, Berlin, Halifax, Nova Scotia, Canada, 1997.
[6] Choueka, Y, Klein, S., Perl, Y: Efficient variants of Huffman codes in high-level languages, Proceedings of SJG!R'85, 8th Annual International Conference of Research and Development in Information Retrieval, ACM. 1985.
[7] Fredriksson, K.: Faster String Matching with Super-Alphabets, Proceedings of the 9th International Symposium Symposium on Siring Processing and Information Retrieval (SPIRE'2002), LNCS 2476, Springer-Verlag, 2002.
[8] Heaps, H. S.: Information retrieval: theoretical and computational aspects. Academic Press, New York, NY. 1978.
[9] Huffman. D. A.: A method for the construction of minimum redundancy codes, Proc. I.H.E., 40, 1951, 1098-1101.
[10] Klein, S., Shapira. D.: Pattern matching in Huffman encoded texts. Proceedings of Uth IEEE Data Compression Conference (DCC'OJ), IEFF Computer Society Press, 2001.
[11] Kytöjoki. J., Salmela, L., Tarhio, J.: Tuning string matching for huge pattern sets. Proceedings of Cominatorial Pattern Matching (CPMV3), LNCS 2676. Springer-Verlag, 2003.
[12] Moura, E., Navarro. G., Ziviani, N., Baeza-Yates, R.: Fast and Flexible Word Searching on Compressed Text, ACM Transactions on Information Systems (TOIS), 18(2), 2000,113-139.
[13] Myers, G.: A fast bit-vector algorithm for approximate string matching based on dynamic programming, J.Assoc. Comput. Mach., 46(3). 1999,395-415.
[14] Navarro, G.. Kida, T, Takeda, M.. Shinohara. A., Arikawa. S.: Faster Approximate String Matching over Compressed Text, Proc. 11th IEEE Data Compression Conference (DCC'Ol), 2001.
[15] Takeda, M., Miyamoto, S., Kida. T, Shinohara, A., Fukamachi, S., Shinohara. T, Arikawa, S.: Processing text files as is: Pattern matching over compressed texts, multi-byte character texts, and semi-structured texts, Proceedings of String Processing and Information Retrieval (SPIRE '02), LNCS 2476. Springer-Verlag, 2002.
[16] Takeda, M.. Shibata, Y.. Matsumoto. T. Kida. T, Shinohara. A., Fukamachi. S.. Shinohara. T, Arikawa, S.: Speeding up string pattern matching by text compression: The dawn of a new era, Transactions of Information Processing Society of Japan, 42(3), 2001, 370-384.
[17] Wu, S., Manber, U.: Fast text searching allowing errors, Commun. ACM, 35(10), 1992, 83-9L
[18] Ziv, J., Lempel, A.: A universal algorithm for sequential data compression, IEEE Trans. Inf. Theory, 23, 1977,337-343.
[19] Ziv, J., Lempel. A.: Compression of individual sequences via variable length coding, IEEE Trans. Inf. Theory, 24, 1978,530-536.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-article-BUS2-0005-0089