Usage of dedicated data structures for URL databases in a large-scale crawling

Dorosz, K.

Title variants

Zastosowanie dedykowanych struktur danych w bazach adresów URL crawlingu dużej skali

Abstracts

The article discusses the use of Berkeley DB data structures, such as hash tables and B-trees, to implement a high-performance URL database. It presents a formal model of a data-structure-oriented URL database that can serve as an alternative to a relational URL database.

The article discusses the use of data structures from the Berkeley DB package, such as hash tables and B-trees, for implementing high-performance URL databases. It presents a formal model of a database oriented toward memory structures, which can be an alternative to a relationally oriented database of URL links.

Keywords

crawling, crawler, large-scale, Berkeley DB, URL database, URL repository, data structures

web crawling, web robot, Berkeley DB, URL database, URL repository, data structures

Publisher

Wydawnictwa AGH

Journal

Computer Science

Year

2009

Volume

Vol. 10

Pages

7–17

Physical description

Bibliography: 14 items, figures.

Contributors

author

Dorosz K.

Institute of Computer Sciences, AGH University of Science and Technology, Krakow, Poland, dorosz@agh.edu.pl

Bibliography

[1] S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proc. WWW, pp. 107–117, 1998
[2] S. Brin, L. Page, R. Motwani, T. Winograd: The PageRank citation ranking: bringing order to the web. Proceedings of ASIS’98, 1998
[3] A. Heydon, M. Najork: Mercator: A Scalable, Extensible Web Crawler. World Wide Web, vol. 2, no. 4, pp. 219–229, 1999
[4] M. Najork, A. Heydon: High-Performance Web Crawling. World Wide Web, vol. 2, no. 4, pp. 219–229, 2001
[5] V. Shkapenyuk, T. Suel: Design and Implementation of a High-Performance Distributed Web Crawler. In Proc. IEEE ICDE, pp. 357–368, 2002
[6] J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. R. G. Wesley: Stanford WebBase Components and Applications. ACM Transactions on Internet Technology, vol. 6, no. 2, pp. 153–186, 2006
[7] H. T. Lee, D. Leonard, X. Wang, D. Loguinov: IRLbot: Scaling to 6 Billion Pages and Beyond. Texas A&M University, Tech. Rep. 2008-2-2, 2008
[8] C. Olston, S. Pandey: Recrawl scheduling based on information longevity. In Proc. WWW, pp. 437–446, 2008
[9] J. Cho, H. Garcia-Molina: Effective Page Refresh Policies for Web Crawlers. ACM Transactions on Database Systems, 28 (4), 2003
[10] E. Coffman, Z. Liu, R. R. Weber: Optimal robot scheduling for web search engines. Journal of Scheduling, 1, 1998
[11] J. Edwards, K. S. McCurley, J. A. Tomlin: An Adaptive Model for Optimizing Performance of an Incremental Web Crawler. In Proc. WWW, 2001
[12] S. Pandey, C. Olston: User-centric web crawling. In Proc. WWW, 2005
[13] J. Wolf, M. Squillante, P. S. Yu, J. Sethuraman, L. Ozsen: Optimal Crawling Strategies for Web Search Engines. In Proc. WWW, 2002
[14] W. Litwin: Linear Hashing: A New Tool for File and Table Addressing. Proceedings of the 6th International Conference on Very Large Databases (VLDB), 1980

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-AGH1-0023-0084