Given a sequence S of n symbols over some alphabet Σ of size σ, we develop new compression methods that are (i) very simple to implement; (ii) provide O(1) time random access to any symbol (or short substring) of the original sequence. Our simplest solution uses at most 2h+o(h) bits of space, where h = n(H_o(S)+1), and H_o(S) is the zeroth-order empirical entropy of S. We discuss a number of improvements and trade-offs over the basic method. For example, we can achieve n(H_k(S)+1)+o(n(H_k(S)+1)) bits of space, for k = o(logσ(n)). Several applications are discussed, including text compression, (compressed) full-text indexing and string matching.
2
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
We consider approximate pattern matching in natural language text. We use the words of the text as the alphabet, instead of the characters as in traditional string matching approaches. Hence our pattern consists of a sequence of words. >From the algorithmic point of view this has several advantages: (i) the number of words is much less than the number of characters, which in effect means shorter text (less possible matching positions); (ii) the pattern effectively becomes shorter, so bit-parallel techniques become more applicable; (iii) the alphabet size becomes much larger, so the probability that two symbols (in this case, words) match is reduced. We extend several known approximate string matching algorithms for this scenario, allowing \kW insertions, deletions or substitutions of symbols (natural language words). We further extend the algorithms to allow \kC errors inside the pattern symbols (words) as well. The two error thresholds \kW and \kC can be applied simultaneously and independently. Hence we have in effect two alphabets, and perform approximate matching in both levels. >From the application point of view the advantage is that the method is flexible, allowing simple solutions to problems that are hard to solve with traditional approaches. Finally, we extend the algorithms to handle multiple patterns at the same time. Depending on the search parameters, we obtain algorithms that run in linear or sublinear time and that perform the optimal number of word comparisons on average, We conclude with experimental results showing that the methods work well in practice.
3
Dostęp do pełnego tekstu na zewnętrznej witrynie WWW
We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in O(n[(log2σ)/ b]) time, where n is the size of the compressed text in bytes, s is the size of the alphabet, and b is a user specified parameter. The method uses a variable size super-alphabet, with an average size of O([b/( H log2σ)]) characters, where H is the entropy of the text. Each super-character is processed in O(1) time. The algorithm uses O(2b) space and O(b2b) preprocessing time. The method can be easily augmented by auxiliary functions, which can e.g. decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time O(n [(log2σ)/ Hω]), where w is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time O(n[(log2 s)/ b]+t), where t is the number of occurrences reported; and a shift-or string matching algorithm that works in time O(n [(log2σ)/ b] é (m+s-1)/ω] u`+t), where m is the length of the pattern and s depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable length moves, i.e. it processes variable number of states at each step. The shift-or algorithm makes variable length shifts, effectively also processing variable number of states at each step. The number of states processed in O(1) time is O([b/( H log2σ)]). The method can be applied to several other algorithms as well. Finally, we apply the methods to natural language taking the words (vocabulary) as the alphabet. This improves the compression ratio and allows more complex search problems to be solved efficiently. We conclude with some experimental results.
JavaScript jest wyłączony w Twojej przeglądarce internetowej. Włącz go, a następnie odśwież stronę, aby móc w pełni z niej korzystać.