Wyniki wyszukiwania - BazTech

Ograniczanie wyników

Znaleziono wyników: 2

Liczba wyników na stronie

Wyniki wyszukiwania

Wyszukiwano:
w słowach kluczowych: compressed pattern matching

Sortuj według:

Ogranicz wyniki do:

Making dense codes even denser

Grabowski S.

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie

2008

T. 12, z. 3

769-779

Dense byte oriented compression codes are a useful tool for compressing textual databases over a large alphabet. The requirement for large alphabet is naturally fulfilled for most human languages, where the symbols can be words, but also non segmented texts can be handled similarly, using q-grams. Recently, several interesting schemes, combining speed, high compression ratios, fast search support and simplicity, have been presented. In this work, we show a couple of simple ideas increasing slightly the compression ratios of common byte codes, like (s,c)-DC or tagged Huffman, assuming the text is static. Preliminary experimental results with one of those techniques show that it is more efficient with q-gram compression, and the compression ratio improves in those cases often by 1% or more, without compromising the search or decoding efficiency and simplicity.

Gęste kody bajtowe są użytecznym narzędziem kompresji tekstowych baz danych, przy założeniu, że alfabet użytych symboli jest duży. To założenie jest w prosty sposób spełnione dla większości języków naturalnych, gdzie symbolami są słowa; jednakże również teksty bez segmentacji mogą być obsługiwane w podobny sposób, przy użyciu q-gramów. W ostatnich latach zostało przedstawionych w literaturze kilka interesujących algorytmów kodowania dla dużych alfabetów, łączących szybkość kompresji i dekompresji, wysokie stopnie kompresji, wsparcie dla szybkiego wyszukiwania wzorca bezpośrednio w tekście, a przy tym cechujących się prostotą. W niniejszej pracy przedstawiamy kilka prostych idei zwiększających nieco stopień kompresji popularnych kodów bajtowych, takich jak (s,c)-DC oraz otagowany Huffman, przy założeniu, że kompresowany tekst jest statyczny. Wstępne eksperymenty z użyciem jednej z zaproponowanych technik pokazują, iż jest ona bardziej efektywna przy kompresji na bazie q-gramów niż na bazie słów, a stopień kompresji poprawia się w tych przypadkach często o 1% lub więcej, bez uszczerbku szybkości wyszukiwania wzorca czy dekompresji, oraz przy zachowaniu prostoty oryginalnych idei.

Time/Space Efficient Compressed Pattern Matching

Gąsieniec L., Potapov I.

Fundamenta Informaticae

2003

Vol. 56, nr 1,2

137-154

An exact pattern matching problem is to find all occurrences of a pattern p in a text t. We say that the pattern matching algorithm is optimal if its running time is linear in the sizes of t and p, i.e., O(t+p). Perhaps one of the most interesting settings of the pattern matching problem is when one has to design an efficient algorithm with a help of a small extra space. In this paper we explore this setting to the extreme. We work under an assumption that the text t is available only in a compressed form, represented by a straight-line program. The compression methods based on efficient construction of straight-line programs are as competitive as the compression standards, including the Lempel-Ziv compression scheme and recently intensively studied text compression via block sorting, due to Burrows and Wheeler. Our main result is an algorithm that solves the compressed string matching problem in an optimal linear time, with a help of a constant extra space. We also discuss an efficient implementation of a version our algorithm showing that the new concept may have also some interesting real applications. Our result is in contrast with many other compressed pattern matching algorithms where the goal is to find all pattern occurrences in time related to the size of the compressed text. However one must remember that all previous algorithms used at least a linear (in a compressed text, a dictionary, or a pattern) extra memory while our algorithm can be implemented in a constant size extra space. Also from the practical point of view, when the compression ratio is constant (very rarely smaller than 25%), there is no dramatic difference between the running time based on the size of the compressed text and the size of the original text, while an extra space resources might be strictly limited.