Towards Finding Scholarly Articles in Internet Using Hadoop MapReduce with Oozie Workflow

Jurkiewicz, J.; Nowiński, A.

Abstract

This article focuses on new methods for the automatic processing and analysis of scientific papers. It covers the very first part of this task: the discovery and harvesting of scientific publications from the Internet. The focus is on discovering and analysing HTML documents in order to identify publication resources. Using data from the Common Crawl project makes it possible to operate on a large subset of web pages without performing an expensive crawl of the WWW. We present methods for the automatic identification of pages that describe scholarly documents on the WWW using HTML meta headers. The presented set of rules, applied to the data, achieves reasonable quality. A system based on these tools is also presented. It is easy to operate and transfers its output to the COntent ANalysis SYStem (CoAnSys), a processing and analysis system developed at ICM. To achieve this goal, a set of MapReduce tasks running on Hadoop with Oozie has been used. The quality and efficiency of the described rules are discussed. Finally, future challenges for our system are presented.
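The abstract does not list the actual rules, so the following is only a minimal sketch of what meta-header-based identification could look like. It assumes that a page describing a scholarly document exposes meta tags from commonly used bibliographic families (for example Highwire Press "citation_*" or Dublin Core "DC.*" names). The prefix list, the function names and the threshold of two tags are hypothetical choices made for illustration, not the authors' rule set.

```python
from html.parser import HTMLParser

# Hypothetical prefixes of meta tag names often used for bibliographic
# metadata. This list is an assumption, not the rule set from the article.
SCHOLARLY_PREFIXES = ("citation_", "dc.", "dcterms.", "prism.", "eprints.")


class MetaTagCollector(HTMLParser):
    """Collects (name, content) pairs from <meta> tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").strip().lower()
        content = (attr.get("content") or "").strip()
        if name and content:
            self.meta.append((name, content))


def scholarly_meta(html):
    """Returns the meta tags whose name starts with a scholarly prefix."""
    parser = MetaTagCollector()
    parser.feed(html)
    return [(n, c) for n, c in parser.meta if n.startswith(SCHOLARLY_PREFIXES)]


def looks_scholarly(html, min_tags=2):
    """Flags a page that carries at least min_tags scholarly meta tags."""
    return len(scholarly_meta(html)) >= min_tags
```

In a MapReduce setting, a function like looks_scholarly could plausibly serve as the per-page test inside a map task that reads Common Crawl records and emits the URLs of matching pages. How the original system wires this into Hadoop and Oozie is not described in this record.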

Keywords

Hadoop; web mining; scientific content finding; web page classification

Publisher

Foundation for Young Scientists

Journal

Challenges of Modern Technology

Year

2013

Volume

Vol. 4, no. 4

Pages

3-6

Physical description

Bibliography: 8 items; tables.

Authors

author

Jurkiewicz J.

J.Jurkiewicz@icm.edu.pl

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland

author

Nowiński A.

A.Nowinski@icm.edu.pl

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland

References

[1] Dendek, P. J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P. & Bolikowski, L. (2013). Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop, 12. Information Retrieval; Digital Libraries. Retrieved from http://arxiv.org/abs/1303.5367
[2] Zamlynska, K., Bolikowski, L. & Rosiek, T. (2008). Migration of the Mathematical Collection of Polish Virtual Library of Science to the YADDA platform. Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008, 127-130.
[3] Kosala, R. & Blockeel, H. (2000). Web mining research. ACM SIGKDD Explorations Newsletter, 2(1), 1–15. doi:10.1145/360402.360406
[4] Turner, T. P. & Brackbill, L. (1998). Rising to the Top: Evaluating the Use of the HTML META Tag to Improve Retrieval of World Wide Web Documents through Internet Search Engines. Library Resources & Technical Services, 42(4). Retrieved July 5, 2013, from http://alcts.metapress.com/content/gq8151m1l8515845/
[5] Gyongyi, Z. & Garcia-Molina, H. (2005, April 1). Web Spam Taxonomy. First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005). Retrieved from http://ilpubs.stanford.edu:8090/771/1/2005-9.pdf
[6] Beel, J. & Gipp, B. (2010). On the Robustness of Google Scholar Against Spam. Proceedings of the 21st ACM conference on Hypertext and hypermedia - HT ’10 (p. 297). New York, New York, USA: ACM Press. doi:10.1145/1810617.1810683
[7] Ardö, A. (2010). Can We Trust Web Page Metadata? Journal of Library Metadata, 10(1), 58–74. doi:10.1080/19386380903547008
[8] Dean, J. & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113. doi:10.1145/1327452.1327492

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-be830447-fd97-4173-aacc-e21081d5a442