Identifiers
Title variants
Publication languages
Abstracts
This article presents new methods for the automatic processing and analysis of scientific papers. It covers the first stage of that task: the discovery and harvesting of scientific publications from the internet. The focus is on discovering and analysing HTML documents in order to identify publication resources. Using data from the Common Crawl project makes it possible to operate on a large subset of web pages without performing an expensive crawl of the WWW. We present methods for automatically identifying pages that describe scholarly documents on the WWW using HTML meta headers. The presented set of rules, applied to this data, achieves reasonable quality. A system based on these tools is also presented. It allows easy operation and transfer of its output to the COntent ANalysis SYStem (CoAnSys), a processing and analysis system developed at ICM. To achieve this, a set of MapReduce tasks running on Hadoop and Oozie has been used. The quality and efficiency of the described rules are discussed. Finally, future challenges for our system are presented.
Keywords
Publisher
Journal
Year
Volume
Pages
3–6
Physical description
Bibliography: 8 items, tables.
Authors
author
- Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland
author
- Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland
Bibliography
- [1] Dendek, P. J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P. & Bolikowski, L. (2013). Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop, 12. Information Retrieval; Digital Libraries. Retrieved from http://arxiv.org/abs/1303.5367
- [2] Zamlynska, K., Bolikowski, L. & Rosiek, T. (2008). Migration of the Mathematical Collection of Polish Virtual Library of Science to the YADDA platform. Towards Digital Mathematics Library. Birmingham, United Kingdom, July 27th, 2008, 127-130.
- [3] Kosala, R. & Blockeel, H. (2000). Web mining research. ACM SIGKDD Explorations Newsletter, 2(1), 1–15. doi:10.1145/360402.360406
- [4] Turner, T. P. & Brackbill, L. (1998). Rising to the Top: Evaluating the Use of the HTML META Tag to Improve Retrieval of World Wide Web Documents through Internet Search Engines. Library Resources & Technical Services, 42(4). Retrieved July 5, 2013, from http://alcts.metapress.com/content/gq8151m1l8515845/
- [5] Gyongyi, Z. & Garcia-Molina, H. (2005, April 1). Web Spam Taxonomy. First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005). Retrieved from http://ilpubs.stanford.edu:8090/771/1/2005-9.pdf
- [6] Beel, J. & Gipp, B. (2010). On the Robustness of Google Scholar Against Spam. Proceedings of the 21st ACM conference on Hypertext and hypermedia - HT ’10 (p. 297). New York, New York, USA: ACM Press. doi:10.1145/1810617.1810683
- [7] Ardö, A. (2010). Can We Trust Web Page Metadata? Journal of Library Metadata, 10(1), 58–74. doi:10.1080/19386380903547008
- [8] Dean, J. & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107–113. doi:10.1145/1327452.1327492
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-be830447-fd97-4173-aacc-e21081d5a442