Cluo: web-scale text mining system for open source intelligence purposes

Maciołek, P.; Dobrowolski, G.

Article details

Abstract

The amount of textual information published on the Internet is estimated to run into billions of web pages, blog posts, comments, social media updates, and other items. Analyzing such quantities of data requires a high level of distribution of both data and computation. This is especially true for the complex algorithms often used in text mining tasks. The paper presents a prototype implementation of CLUO, an Open Source Intelligence (OSINT) system that extracts and analyzes significant quantities of openly available information.

Keywords

text mining, big data, OSINT, natural language processing, monitoring

Publisher

Wydawnictwa AGH

Journal

Computer Science

Year

2013

Volume

Vol. 14 (1)

Pages

45–62

Physical description

Bibliography: 21 items, figures, tables.

Contributors

author

Maciołek P.

pmaciolek@luminis-research.com

Luminis Research Sp. z o.o., Rzeszów, Poland
AGH University of Science and Technology, Krakow, Poland

author

Dobrowolski G.

grzela@agh.edu.pl

AGH University of Science and Technology, Krakow, Poland

Bibliography

[1] NATO Open Source Intelligence Handbook. NATO, 2001.
[2] NATO Intelligence Exploitation of the Internet. NATO, 2002.
[3] National Defense Authorization Act for Fiscal Year 2006. 2006.
[4] Berger A. L., Pietra V. J. D., Pietra S. A. D.: A maximum entropy approach to natural language processing. Comput. Linguist., 22(1):39–71, March 1996.
[5] Cover T., Thomas J.: Elements of Information Theory. Wiley, 1991.
[6] Damianos L. E., Ponte J. M., Wohlever S., Reeder F., Wilson D. G., Hirschman L.: MiTAP, text and audio processing for bio-security: A case study. In National Conference on Artificial Intelligence, pp. 807–814, 2002.
[7] Dean J., Ghemawat S.: MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
[8] Dean J., Ghemawat S.: MapReduce: simplified data processing on large clusters. Commun. ACM, 51:107–113, January 2008.
[9] Fellbaum C.: WordNet – An Electronic Lexical Database. The MIT Press, 1998.
[10] Fielding R. T.: Architectural styles and the design of network-based software architectures. PhD thesis, 2000.
[11] Jurafsky D., Martin J. H.: Speech and Language Processing. Prentice Hall, 2 ed., 2008.
[12] Leskovec J., Backstrom L., Kleinberg J.: Meme-tracking and the dynamics of the news cycle. In Proc. of the 15th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 497–506, New York, NY, USA, 2009. ACM.
[13] Lubaszewski W., Gajęcki M.: Automatic extraction of semantic association from Polish text. Computer Science, 4:119–130, 2002.
[14] Maciołek P., Dobrowolski G.: Is shallow semantic analysis really that shallow? A study on improving text classification performance. In IMCSIT, pp. 455–460, 2010.
[15] Manning C., Raghavan P., Schütze H.: Introduction to Information Retrieval. Cambridge University Press, 1 ed., 2008.
[16] Maziarz M., Piasecki M., Szpakowicz S.: Approaching plWordNet 2.0. In Proc. of the 6th Global Wordnet Conference, Matsue, Japan, January 2012.
[17] Piasecki M., Szpakowicz S., Broda B.: A Wordnet from the Ground Up. Oficyna Wydawnicza Politechniki Wrocławskiej, Wrocław, 2009.
[18] Porter M. F.: An algorithm for suffix stripping. Program, 1980.
[19] Przepiórkowski A., Bańko M., Górski R. L., Lewandowska-Tomaszczyk B., eds.: Narodowy Korpus Języka Polskiego [Eng.: National Corpus of Polish]. Wydawnictwo Naukowe PWN, Warsaw, 2012.
[20] Toutanova K., Klein D., Manning C., Singer Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. of HLT-NAACL 2003, 2003.
[21] Toutanova K., Manning C. D.: Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proc. of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 2000.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-2f1552a4-d79d-47c6-8ef4-96e85296406e