Making dense codes even denser

Grabowski, S.

Artykuł - szczegóły

Tytuł artykułu

Making dense codes even denser

Autorzy

Grabowski S.

Treść / Zawartość

Pełne teksty:

Pobierz

Identyfikatory

Warianty tytułu

Jak zagęścić kody gęste

Języki publikacji

Abstrakty

Dense byte oriented compression codes are a useful tool for compressing textual databases over a large alphabet. The requirement for large alphabet is naturally fulfilled for most human languages, where the symbols can be words, but also non segmented texts can be handled similarly, using q-grams. Recently, several interesting schemes, combining speed, high compression ratios, fast search support and simplicity, have been presented. In this work, we show a couple of simple ideas increasing slightly the compression ratios of common byte codes, like (s,c)-DC or tagged Huffman, assuming the text is static. Preliminary experimental results with one of those techniques show that it is more efficient with q-gram compression, and the compression ratio improves in those cases often by 1% or more, without compromising the search or decoding efficiency and simplicity.

Gęste kody bajtowe są użytecznym narzędziem kompresji tekstowych baz danych, przy założeniu, że alfabet użytych symboli jest duży. To założenie jest w prosty sposób spełnione dla większości języków naturalnych, gdzie symbolami są słowa; jednakże również teksty bez segmentacji mogą być obsługiwane w podobny sposób, przy użyciu q-gramów. W ostatnich latach zostało przedstawionych w literaturze kilka interesujących algorytmów kodowania dla dużych alfabetów, łączących szybkość kompresji i dekompresji, wysokie stopnie kompresji, wsparcie dla szybkiego wyszukiwania wzorca bezpośrednio w tekście, a przy tym cechujących się prostotą. W niniejszej pracy przedstawiamy kilka prostych idei zwiększających nieco stopień kompresji popularnych kodów bajtowych, takich jak (s,c)-DC oraz otagowany Huffman, przy założeniu, że kompresowany tekst jest statyczny. Wstępne eksperymenty z użyciem jednej z zaproponowanych technik pokazują, iż jest ona bardziej efektywna przy kompresji na bazie q-gramów niż na bazie słów, a stopień kompresji poprawia się w tych przypadkach często o 1% lub więcej, bez uszczerbku szybkości wyszukiwania wzorca czy dekompresji, oraz przy zachowaniu prostoty oryginalnych idei.

Słowa kluczowe

compressed pattern matching word-based models q-grams byte codes

wyszukiwanie w tekście skompresowanym modele oparte na słowach q-gramy kody bajtowe

Wydawca

Wydawnictwa AGH

Czasopismo

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie

Rocznik

2008

Tom

T. 12, z. 3

Strony

769--779

Opis fizyczny

Bibliogr. 24 poz., tab.

Twórcy

autor

Grabowski S.

Katedra Automatyki Stosowanej, Politechnika Łódzka

Bibliografia

[1] Baeza-Yates R.A., Gonnet G.H., A new approach to text searching. Commun. ACM, 35(10), 1992, 74-82.
[2] Biskup M.T., Guaranteed Synchronization of Huffman Codes. Proc. Data Compression Conference (DCC'08), 2008. IEEE Computer Society Press, 462-471.
[3] Brisaboa N., Iglesias E., Navarro G., Parama J., An Efficient Compression Code for Text Databases. Proc. ECIR'03, LNCS 2633, 2003, 468-481.
[4] Brisaboa N., Farińa A., Navarro G., Esteller M., (s.c)-dense coding: An optimized Compression code for natural language text databases. [w:] Proc. lOth International Symposium on String Processing and Information Retrieval (SPIRE 2003), LNCS 2857, Springer-Verlag, 2003, 122-136.
[5] Brisaboa N., Farińa A., Navarro G., Parama J., Simple, fast, and efficient natural language adaptive Compression. [w:] Proc. llth International Symposium on String Processing and Information Retrieval (SPIRE 2004), LNCS 3246, Springer-Verlag, 2004, 230-241.
[6] Brisaboa N., Farińa A., Navarro G., Parama J., lmproving semistatic Compression via pair-based coding. [w:] Proc. 6th International Conference on Perspectives of System Informatics (PSF06), LNCS 4378, 2006, 124-134.
[7] Brisaboa N., Farińa A., Navarro G., Parama J., Lightweight natural language text compression. Information Retrieval, 10, 2007, 1-33.
[8] Culpepper J.S., Moffat A., Enhanced byte codes with restricted prefix properties. [w:] Proc. 12th International Symposium on String Processing and Information Retrieval (SPIRE 2005), LCCS 3772, Springer-Verlag, 2005, 1-12.
[9] Culpepper J.S., Moffat A., Phrase-based pattern matching in compressed text. [w:] Proc. 13th International Symposium on String Processing and Information Retrieval (SPIRE 2006), Springer--Verlag, 2006, 337-345.
[10] Fredriksson K., Shift-or string matching with super-alphabets. Information Processing Letters, 7(4), 2003, 201-204.
[11] Fredriksson K., Private communication, Feb. 2007
[12] Fredriksson K., Grabowski S., A general compression algorithm that supports fast searching. Information Processing Letters, 100(6), 2006, 226-232.
[13] Fredriksson K., Nikitin F., Simple compression code supporting rondom access and fast string matching. [w:] Proc. 6th Workshop on Efficient and Experimental Algorithms (WEA'07), LNCS 4525, Springer-Verlag, 2007, 203-216.
[14] Horspool R.N., Practical fast searching in strings. Software-Practice and Experience, 10(6), 1980, 501-506.
[15] Klein S.T., Kopel Ben-Nissan M., Using Fibonacci compression codes as alternatives to dense coding. [w:] Proc. Data Compression Conference (DCC'08), IEEE Computer Society Press, 2008, 472-481.
[16] Moffat A., Word-based text compression. Software-Practice and Experience, 19(2), 1989, 185-198.
[17] de Moura E., Navarro G., Ziviani N., Baeza-Yates R., Fast and flexible word searching on compressed text. ACM Transactions on Information Systems (TOIS), 18(2), 2000, 113-139.
[18] Pizza & Chili Corpus. Maintained by P. Ferragina and G. Navarro. 2005-2007. http://pizzachi-li.dcc.uchile.cl/index.html.
[19] Rautio J., Tanninen J., Tarhio J., String matching with stopper encoding and code splitting. [w:] Proc. of Combinatorial Pattern Matching (CPM), LNCS 2373, Springer-Verlag, 2002, 42-52.
[20] Shkarin D., PPM: One step to practicality. [w:] Proc. Data Compression Conference (DCC'02), IEEE Computer Society Press, 2002, 202-211.
[21] Skibiński P, Grabowski S., Deorowicz S., Revisiting dictionary-based compression. Software-Practice and Experience, 35(15), 2005, 1455-1476.
[22] Skibiński P, Grabowski S., Swacha J., Effective asymmetric XML compression. Software-Practice and Experience, 2008, accepted.
[23] Skibiński P., Swacha J., Grabowski S., A Highly Efficient XML Compression Scheme for the Web. [w:] Proc. 34th International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM'08), LNCS 4910, 2008, 766-777.
[24] Sun W., Zhang N., Mukherjee A., A Dictionary-Based Multi-Corpora Text Compression System. [w:] Proc. of Data Compression Conference (DCC'03), IEEE Computer Society Press, 2003, 448.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-article-AGH1-0017-0052