Search results - BazTech

Searched for:
in keywords: web page classification

Towards Finding Scholarly Articles in Internet Using Hadoop MapReduce with Oozie Workflow

Jurkiewicz J., Nowiński A.

Challenges of Modern Technology

2013

Vol. 4, no. 4

pp. 3-6

This article focuses on new methods for the automatic processing and analysis of scientific papers. It covers the very first part of this task: the discovery and harvesting of scientific publications from the internet. The article concentrates on discovering and analysing HTML documents in order to identify publication resources. Using data from the Common Crawl project makes it possible to operate on a large subset of web pages without performing an expensive crawl of the WWW. We present methods for automatically identifying pages that describe scholarly documents on the WWW using HTML meta headers. The presented set of rules, applied to the data, achieves reasonable quality. A system based on these tools is also presented. It allows easy operation and transfer of the output to the COntent ANalysis SYStem (CoAnSys), a processing and analysis system developed at ICM. To achieve this goal, a set of MapReduce tasks running on Hadoop with Oozie was used. The quality and efficiency of the described rules are discussed. Finally, future challenges for our system are presented.
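
The abstract refers to rules based on HTML meta headers but does not list them. As a minimal sketch of what one such rule might look like, the Python snippet below flags a page as probably describing a scholarly document when it carries several bibliographic meta tags. The prefixes `citation_` (Highwire Press style) and `dc.` (Dublin Core), the threshold `min_hits`, and all function and class names are assumptions made for illustration; they are not taken from the article.

```python
from html.parser import HTMLParser

# Assumed rule set (not from the article): meta-tag name prefixes that
# commonly carry bibliographic metadata on publisher pages.
SCHOLARLY_PREFIXES = ("citation_", "dc.")


class MetaNameCollector(HTMLParser):
    """Collects the lowercased `name` attribute of every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        name = dict(attrs).get("name")
        if name:  # skip <meta> tags without a usable name attribute
            self.names.append(name.lower())


def looks_scholarly(html, min_hits=2):
    """Return True if the page has at least `min_hits` bibliographic meta tags.

    `min_hits` is a hypothetical threshold chosen for this sketch.
    """
    collector = MetaNameCollector()
    collector.feed(html)
    hits = [n for n in collector.names if n.startswith(SCHOLARLY_PREFIXES)]
    return len(hits) >= min_hits


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<meta name="citation_title" content="Some paper">'
        '<meta name="citation_author" content="Some author">'
        '</head><body></body></html>'
    )
    print(looks_scholarly(page))
```

In a MapReduce setting, a check of this kind would plausibly run in the map stage, once per crawled page, with only the matching pages passed on for further processing.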