Compact representation of URL collections with fast access

Grabowski, S.; Swacha, J.

Artykuł - szczegóły

Tytuł artykułu

Compact representation of URL collections with fast access

Autorzy

Grabowski S. , Swacha J.

Treść / Zawartość

Pełne teksty:

Pobierz

Identyfikatory

Warianty tytułu

Oszczędna reprezentacja kolekcji URL z szybkim dostępem do danych

Języki publikacji

Abstrakty

Efficient representation of a string dictionary is a well-known problem with applications e.g. in Web searchers and spellchecking. Traditionally, the dictionary is relatively minor compared to the text from which the terms (words) are collected, but in several applications the number of dictionary items is huge, making a compressed format highly desirable. One of those cases are document addresses on the Internet, i.e., their URLs. Large collections of URLs are useful e.g. in analyses of (possibly large portions of) the Web graph. In this work we present an efficient compression algorithm for lexicogra-phically ordered collections of URLs, supporting extract queries.

Efektywna reprezentacja słownika fraz tekstowych jest klasycznym problemem mającym zastosowania m.in. w wyszukiwarkach internetowych i kontroli pisowni. Zazwyczaj słownik jest stosunkowo mały w stosunku do tekstu, z którego zebrano kolekcję fraz (słów), jednak w niektórych zastosowaniach liczba fraz może być ogromna, co praktycznie zmusza do wykorzystania kompresji. Jednym z takich przykładów są kolekcje adresów dokumentów internetowych, tj. kolekcje URL. Duże kolekcje URL wykorzystywane są np. w analizie dużych wycinków tzw. grafu webowego. W niniejszej pracy proponujemy efektywny algorytm kompresji ułożonych leksykograficznie kolekcji URL, z obsługą zapytań typu extract.

Słowa kluczowe

compressed dictionaries URL compression random access web graph

słowniki skompresowane kompresja URL swobodny dostęp graf webowy

Wydawca

Wydawnictwa AGH

Czasopismo

Automatyka / Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie

Rocznik

2011

Tom

T. 15, z. 3

Strony

349--355

Opis fizyczny

Bibliogr. 16 poz., tab.

Twórcy

autor

Grabowski S.

Technical University of Lodz, Computer Engineering Dept., Lodz, Poland

autor

Swacha J.

University of Szczecin, Institute of Information Technology in Management, Szczecin, Poland

Bibliografia

[1] Ahmad F., Kondrak G., Learning a spelling error model from search ąuery logs. Proc. Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, 2005, 955-962.
[2] Boldi P., Vigna S., The webgraph framework I: compression techniąues. WWW 2004, 595-602.
[3] Apostolico A., Drovandi G., Graph compression byBFS. Algorithms, 2(3), 2009, 1031-1044.
[4] Claude R, Navarro G., Fast and compact web graph representations. ACM Transactions on the Web (TWEB), 4(4):article 16, 2010.
[5] Grabowski Sz., Bieniecki W., Merging adjacency lists for efficient Web graph compression. Accepted to ICMMI, 2011.
[6] Askitis N., Efficient Data Structures for Cache Architectures. Ph.D. dissertation, RMIT University, Melbourne, Victoria, Australia, 2007.
[7] Ciura M.G., Deorowicz S., How to sąueeze a lexicon. Software—Practice and Experience 31(11), 2001, 1077-1090.
[8] Skibiński R, Grabowski Sz., Deorowicz S., Revisiting dictionary-based compression. Software-Practice and Experience, 35(15), 2005, 1455-1476.
[9] Heinz S., Zobel J., Williams H. E., Burst tries: afast, efficient data structurefor string keys. ACM Transactions on Information Systems, 20(2), 2002, 192-223.
[10] Askitis N., Sinha R., HAT-trie: A Cache-conscious Trie-based Data Structurefor Strings. ACSC 2007: 97-105.
[11] Belazzougui D., Boldi P., Pagh R., Vigna S., Monotone minimal perfect hashing: searching a sorted table with 0(1) accesses. SODA, 2009, 785-794.
[12] Belazzougui D., Boldi R, Pagh R., Vigna S., Theory and Practise of Monotone Minimal Perfect Hashing. ALENEX, 2009, 132-144.
[13] Hu T., Tucker A., Optimal computer search trees and variable-length alphabetical codes. SIAMJournal on Applied Mafhematics, 21(4), 1971, 514-532.
[14] Brisaboa N., Canovas R., Claude F., Martinez-Prieto M.A., Navarro G., Compressed String Dictionaries. SEA, 2011, 136-147.
[15] Larsson N.J., Moffat J.A., Offline dictionary-based compression. Proc. of the IEEE, 88, 2000, 1722-1732.
[16] Boldi R, Codenotti B., Santini M., Vigna S., UbiCrawler: A scalable fiilly distributed web crawler. Software-Practice and Experience, 34(8), 2004, 711-726.

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-article-AGH1-0028-0113