Article title

Distributed web-scale infrastructure for crawling, indexing and search with semantic support

Publication languages
EN
Abstracts
EN
In this paper, we describe our work in progress on web-scale information extraction and information retrieval using distributed computing. We present a distributed architecture, built on top of the MapReduce paradigm, for information retrieval, information processing and intelligent search supported by spatial capabilities. The proposed architecture focuses on crawling documents in several different formats, information extraction, lightweight semantic annotation of the extracted information, indexing of the extracted information and, finally, indexing of documents based on the geo-spatial information found in each document. We demonstrate the architecture on two use cases: search in job offers retrieved from the LinkedIn portal, and search in BBC news feeds. We also discuss several problems we faced during the implementation. Finally, we discuss spatial search applications for both use cases, because both LinkedIn job offer pages and BBC news feeds contain a large amount of spatial information to extract and process.
Publisher
Journal
Year
Pages
5–19
Physical description
Bibliography: 30 items, figures, tables.
Authors
  • Štefan Dlugolinský, Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia
  • Martin Šeleng, Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia
  • Michal Laclavík, Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia
  • Ladislav Hluchý, Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia
Bibliography
  • [1] Chang F., Dean J., Ghemawat S., Hsieh W. C., Wallach D. A., Burrows M., Chandra T., Fikes A., Gruber R. E.: Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26:4:1–4:26, June 2008.
  • [2] Ciglan M., Babik M., Šeleng M., Laclavík M., Hluchý L.: Running MapReduce type jobs in grid infrastructure. In Cracow ’08 Grid Workshop: proceedings, 2009.
  • [3] Dean J., Ghemawat S.: MapReduce: simplified data processing on large clusters. In Proc. of the 6th conference on Symposium on Operating Systems Design & Implementation, vol. 6, pp. 10–10, Berkeley, CA, USA, 2004. USENIX Association.
  • [4] Dlugolinsky S., Laclavik M., Hluchy L.: Towards a search system for the Web exploiting spatial data of a web document. In Proc. of the 2010 Workshops on Database and Expert Systems Applications, DEXA ’10, pp. 27–31, Washington, DC, USA, 2010. IEEE Computer Society.
  • [5] Dlugolinsky S., Laclavík M., Šeleng M.: Vyhľadávanie informácií na webe podľa vzdialenosti. In Proc. of the 4th Workshop on Intelligent and Knowledge oriented Technologies, WIKT 2009, Košice, Slovakia, November 2009. Equilibria.
  • [6] Gatial E., Balogh Z.: Identifying, retrieving and determining relevance of heterogenous internet resources. In P. Návrat et al., ed., Tools for Acquisition, Organisation and Presenting of Information and Knowledge, Research Project Workshop (NAZOU), in conjunction with ITAT 2006, pp. 15–21, Bystrá dolina, Nízke Tatry, Slovakia, September 2006. Slovak University of Technology Bratislava.
  • [7] Google: The Google Geocoding API. http://developers.google.com/maps/documentation/geocoding/, May 2012.
  • [8] Google: The Google Places Autocomplete API (Experimental). http://developers.google.com/maps/documentation/places/autocomplete, May 2012.
  • [9] Habala O., Hluchy L., Tran V., Krammer P., Seleng M.: Using advanced data mining and integration in environmental prediction scenarios. Computer Science, 13(1):5–16, 2012.
  • [10] Hahn R., Bizer C., Sahnwaldt C., Herta C., Robinson S., Burgle M., Duwiger H., Scheel U.: Faceted wikipedia search. In W. Abramowicz, R. Tolksdorf, W. Aalst, J. Mylopoulos, M. Rosemann, M. J. Shaw, C. Szyperski, ed., Business Information Systems, vol. 47 of Lecture Notes in Business Information Processing, pp. 1–11. Springer, Berlin Heidelberg, 2010.
  • [11] Hearst M. A.: Design recommendations for hierarchical faceted search interfaces. In SIGIR, Workshop on Faceted Search, 2006.
  • [12] Kunszt P. Z., Szalay A. S., Thakar A. R.: The hierarchical triangular mesh. In A. J. Banday, S. Zaroubi, M. Bartelmann, ed., Mining the Sky: Proc. of the MPA/ESO/MPE Workshop Held at Garching, Germany, July 31 – August 4, 2000, volume XV of ESO Astrophysics Symposia, pp. 631–637, Springer-Verlag, Berlin Heidelberg, 2001.
  • [13] Laclavík M., Dlugolinsky V., Seleng M., Ciglan M., Hluchy L.: Emails as graph: relation discovery in email archive. In Proceedings of the 21st international conference companion on World Wide Web, WWW ’12 Companion, pp. 841–846, New York, NY, USA, 2012. ACM.
  • [14] Laclavík M., Seleng M., Ciglan M., Hluchy L.: Supporting collaboration by large scale email analysis. In M. Bubak, M. Turala, K. Wiatr, ed., Cracow ’08 Grid Workshop, pp. 382–387, Krakow, 2009. Academic Computer Centre CYFRONET AGH.
  • [15] Laclavík M., Seleng M., Hluchy L.: Towards large scale semantic annotation built on MapReduce architecture. In Proc. of the 8th international conference on Computational Science, Part III, ICCS ’08, pp. 331–338, Springer-Verlag, Berlin, Heidelberg, 2008.
  • [16] LinkedIn Corporation: LinkedIn website. http://www.linkedin.com/, May 2012.
  • [17] Monster: Monster website. http://www.monster.com/, May 2012.
  • [18] Seleng M.: Distribuované spracovanie dát nad mapreduce architektúrou (hadoop a hive). In Proc. of the 5th Workshop on Intelligent and Knowledge oriented Technologies, WIKT 2010, p. 141, Bratislava, Institute of Informatics SAS, November 2010.
  • [19] Slovak University of Technology in Bratislava, Institute of Informatics SAS, Pavol Jozef Šafárik University in Košice, Softec, Ltd.: NAZOU website. http://nazou.fiit.stuba.sk, May 2012.
  • [20] Szalay A., Gray J., Fekete G., Kunszt P. Z., Kukol P., Thakar A.: Indexing the sphere with the hierarchical triangular mesh. Technical Report MSR-TR-2005-123, Microsoft Research Advanced Technology Division, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, 2005.
  • [21] The Apache Software Foundation: Apache Lucene website. http://lucene.apache.org, May 2012.
  • [22] The Apache Software Foundation: Apache Solr indexing and searching framework. http://lucene.apache.org/solr/, May 2012.
  • [23] The Apache Software Foundation: Apache Velocity template language. http://velocity.apache.org/, May 2012.
  • [24] The Apache Software Foundation: Hadoop Distributed File System site. http://hadoop.apache.org/hdfs/, May 2012.
  • [25] The Apache Software Foundation: Hadoop site. http://hadoop.apache.org/, May 2012.
  • [26] The Apache Software Foundation: HBase site. http://hbase.apache.org/, May 2012.
  • [27] The Apache Software Foundation: Nutch site. http://nutch.apache.org/, May 2012.
  • [28] The Apache Software Foundation: Spatial search in Lucene. http://wiki.apache.org/lucene-java/SpatialSearch, May 2012.
  • [29] W3Techs: World wide web technology surveys. http://w3techs.com/, October 2011.
  • [30] Yahoo! Inc.: Yahoo! PlaceFinder. http://developer.yahoo.com/geo/placefinder/, May 2012.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-article-AGH1-0032-0045