Article title

Cassiopeia – Towards a Distributed and Composable Crawling Platform

Publication languages
EN
Abstracts
EN
When designing and implementing crawling systems or Internet robots, it is of the utmost importance to first address efficiency and scalability issues (from a technical and architectural point of view), due to the enormous size and structural complexity of the World Wide Web. There are, however, a significant number of users for whom flexibility and ease of execution are as important as efficiency. Running, defining, and composing Internet robots and crawlers according to dynamically changing requirements and use cases in the easiest possible way (e.g. in a graphical, drag & drop manner) is especially necessary for criminal analysts. The goal of this paper is to present the idea, design, crucial architectural elements, Proof-of-Concept (PoC) implementation, and preliminary experimental assessment of the Cassiopeia framework, i.e. an all-in-one studio addressing both of the above-mentioned aspects.
Year
Volume
Pages
79--89
Physical description
Bibliography: 22 items, figures, tables.
Authors
author
  • AGH University of Science and Technology, Department of Computer Science, Kraków, Poland
author
  • AGH University of Science and Technology, Department of Computer Science, Kraków, Poland
  • AGH University of Science and Technology, Department of Computer Science, Kraków, Poland
Bibliography
  • [1] F. Maghoul et al., “Graph structure in the Web”, in Proc. 9th Int. World Wide Web Conf., Amsterdam, The Netherlands, 2000, pp. 309–320.
  • [2] H. Garcia-Molina, A. Paepcke, A. Arasu, J. Cho, and S. Raghavan, “Searching the Web”, ACM Trans. Internet Technol., vol. 1, no. 1, pp. 2–43, 2001.
  • [3] A. Gulli and A. Signorini, “The indexable Web is more than 11.5 billion pages”, in Proc. 14th Int. World Wide Web Conf., Chiba, Japan, 2005, pp. 902–903.
  • [4] K. Bharat and A. Broder, “A technique for measuring the relative size and overlap of public search engines”, in Proc. 7th Int. World Wide Web Conf., Brisbane, Australia, 1998, pp. 379–388.
  • [5] A. Singh, M. Srivatsa, L. Liu, and T. Miller, “Apoidea: A decentralized peer-to-peer architecture for crawling the World-Wide-Web”, in Proc. SIGIR Worksh. Distrib. Inform. Retrieval, Toronto, Canada, 2003.
  • [6] J. Cho and H. Garcia-Molina, “Parallel crawlers”, in Proc. 11th Int. World Wide Web Conf., Honolulu, Hawaii, 2002, pp. 124–135.
  • [7] P. Boldi, B. Codenotti, M. Santini, and S. Vigna, “Ubicrawler: A scalable fully distributed web crawler”, Software: Practice and Experience, vol. 34, no. 8, pp. 711–726, 2004.
  • [8] V. Shkapenyuk and T. Suel, “Design and implementation of a high performance distributed Web crawler”, in Proc. 18th IEEE Int. Conf. Data Engin., San Jose, CA, USA, 2002.
  • [9] M. Santini, P. Boldi, B. Codenotti, and S. Vigna, “Ubicrawler: A scalable fully distributed web crawler”, in Proc. 8th Australian World Wide Web Conf., Sunshine Coast, Queensland, Australia, 2002.
  • [10] F. Menczer, G. Pant, P. Srinivasan, and M. E. Ruiz, “Evaluating topic-driven Web-crawlers”, in Proc. 24th Ann. Int. Conf. Res. Develop. Inform. Retriev., New York, USA, 2001, pp. 241–249.
  • [11] K. Wlodarczyk, “Kassiopeia – distributed and pluginnable crawling system”, Master thesis, Department of Computer Science, University of Science and Technology, Kraków, 2011.
  • [12] K. Donald, C. Sampaleanu, R. Johnson, and J. Hoeller, “Spring framework reference documentation” [Online]. Available: http://docs.spring.io/spring/docs/3.0.x/spring-frameworkreference/html/
  • [13] M. Welsh, D. Culler, E. Brewer, and E. Gribble, “SEDA: An architecture for Well-Conditioned scalable internet services”, Harvard University, 2001 [Online]. Available: http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf
  • [14] B. M. Michelson, “Event-Driven Architecture Overview”, Patricia Seybold Group, Boston, USA, 2006 [Online]. Available: http://www.omg.org/soa/Uploaded%20Docs/EDA/bda2-2-06cc.pdf
  • [15] D. Lewin, D. Karger, T. Leighton, and A. Sherman, “Web caching with consistent hashing”, in Proc. 8th Int. World Wide Web Conf., Toronto, Canada, 1999.
  • [16] D. Lewin, M. Lehman, D. Karger, T. Leighton, and R. Panigrahy, “Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web”, in Proc. 8th Int. World Wide Web Conf., Toronto, Canada, 1999.
  • [17] V. S. Pai, P. Druschel, and W. Zwaenpoel, “Flash: An Efficient and Portable Web Server”, Ann. Tech. Conf., Monterey, CA, USA, 1999 [Online]. Available: http://static.usenix.org/event/usenix99/full_papers/pai/pai.pdf
  • [18] D. Pariag et al., “Comparing the performance of Web server architectures”, in Proc. 2nd ACM SIGOPS/EuroSys European Conf. Comp. Sys., Lisbon, Portugal, 2007, pp. 231–243.
  • [19] M. Welsh, “A Retrospective on SEDA”, 2010 [Online]. Available: http://matt-welsh.blogspot.com/2010/07/retrospective-on-seda.html
  • [20] M. Kluczny, “SEDA as an architecture for efficient, distributed and concurrent systems for Web crawling purposes”, Master thesis, Department of Computer Science, University of Science and Technology, Kraków, 2012.
  • [21] L. Siwik, K. Wlodarczyk, and M. Kluczny, “Staged event-driven architecture as a micro-architecture of distributed and pluginable crawling platform”, Comp. Science, vol. 14, no. 4, pp. 645–665, 2013.
  • [22] Cassiopeia Web Crawler [Online]. Available: http://home.agh.edu.pl/siwik/crawler/
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-0183de3f-8652-45ab-b056-a58a65d8f3dc