A novel text classification problem and its solution

Zadrożny, S.; Kacprzyk, J.; Gajewski, M.; Wysocki, M.

Selected full texts from this journal

http://repozytorium.biblos.pk.edu.pl/resources/35433

Title variants

O pewnym zadaniu klasyfikacji dokumentów tekstowych i jego rozwiązaniach (Polish title; in English: "On a certain text document classification problem and its solutions")

Abstract

A new text categorization problem is introduced. As in the classical problem, there is a set of documents and a set of categories. However, in addition to being assigned to a category, each document belongs to a sequence of documents within that category, referred to as a case. All documents in the same case are assumed to belong to the same category. An example is a collection of news articles whose categories are sport, politics, entertainment, etc. Each category contains cases, i.e., sequences of articles describing, for example, how some event unfolds. The problem is to assign a document both to the proper category and to the proper case within that category. In the paper we formalize the problem and discuss two approaches to its solution.
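The problem setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the keywords point to sequence mining and hidden Markov models, whereas this sketch swaps in a plain bag-of-words cosine similarity and ignores document order. The names `Case`, `bag_of_words`, `cosine`, `classify`, and the toy news data are all hypothetical and exist only to show the two-level output (category plus case).

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Case:
    """A sequence of documents; all of them share one category."""
    case_id: str
    category: str
    documents: List[str] = field(default_factory=list)  # in order of arrival


def bag_of_words(text: str) -> Counter:
    # Naive tokenization: lowercase and split on whitespace.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc: str, cases: List[Case]) -> Tuple[str, str]:
    """Return (category, case_id) of the case most similar to doc.

    Because every case belongs to exactly one category, choosing the
    case also fixes the category.
    """
    query = bag_of_words(doc)

    def score(case: Case) -> float:
        profile = Counter()
        for d in case.documents:
            profile.update(bag_of_words(d))
        return cosine(query, profile)

    best = max(cases, key=score)
    return best.category, best.case_id


# Toy data (hypothetical): two cases in two categories.
cases = [
    Case("cup-final", "sport",
         ["team wins semifinal match", "team prepares for cup final"]),
    Case("election", "politics",
         ["parliament calls early election", "parties launch election campaign"]),
]

print(classify("fans celebrate after the cup final", cases))
```

Picking the case first and reading the category off it is only one possible design; a real system would also need to decide when a document starts a new case, which this sketch does not attempt.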

Keywords

text categorization; sequences of documents; sequence mining; hidden Markov models

Publisher

Wydawnictwo Politechniki Krakowskiej im. Tadeusza Kościuszki

Journal

Czasopismo Techniczne. Automatyka

Year

2013

Volume

Year 110, issue 4-AC

Pages

7-16

Physical description

Bibliography: 15 items; formulas; tables.

Authors

author

Zadrożny S.

slawomir.zadrozny@ibspan.waw.pl

Systems Research Institute, Polish Academy of Sciences, Warsaw; Warsaw School of Information Technology

author

Kacprzyk J.

Systems Research Institute, Polish Academy of Sciences, Warsaw
Department of Automatic Control and Information Technology, Faculty of Electrical and Computer Engineering, Cracow University of Technology

author

Gajewski M.

Systems Research Institute, Polish Academy of Sciences, Warsaw

author

Wysocki M.

Warsaw School of Information Technology

References

[1] Agrawal R., Srikant R., Mining Sequential Patterns, 11th International Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995, 3-14.
[2] Allan J. (Ed.), Topic Detection and Tracking: Event-based Information Organization, Kluwer Academic Publishers, 2002.
[3] Baeza-Yates R., Ribeiro-Neto B., Modern Information Retrieval, Addison-Wesley, New York 1999.
[4] Bird S., Dale R., Dorr B., Gibson B., Joseph M., Kan M.-Y., Lee D., Powley B., Radev D., Tan Y.F., The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics, Language Resources and Evaluation Conference (LREC 08), Marrakesh, Morocco, May 2008, 1755-1759.
[5] Blei D.M., Ng A.Y., Jordan M.I., Latent Dirichlet Allocation, Journal of Machine Learning Research, 3, 2003, 993-1022.
[6] McCallum A., Nigam K., A comparison of event models for Naive Bayes text classification, AAAI-98 Workshop on Learning for Text Categorization, 1998.
[7] Rabiner L., A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, 77 (2), 1989, 257-286.
[8] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna 2013.
[9] Salton G., Buckley Ch., Term-Weighting Approaches in Automatic Text Retrieval, Information Processing & Management, 24 (5), 1988, 513-523.
[10] Sebastiani F., A tutorial on automated text categorisation, [in:] A. Amandi, A. Zunino (Eds.), Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, Buenos Aires 1999, 7-35.
[11] Sebastiani F., Text Categorization, [in:] L.C. Rivero, J.H. Doorn, V.E. Ferraggine (Eds.) Encyclopedia of Database Technologies and Applications, Idea Group, 2005, 683-687.
[12] Wysocki M., Classification of text documents based on the sequence patterns, Master thesis, Warsaw School of Information Technology, 2013 (in Polish).
[13] Yang Y., Liu X., A re-examination of text categorization methods, 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), ACM, New York, NY, USA, 1999, 42-49.
[14] Zadrożny S., Kacprzyk J., Computing with words for text processing: An approach to the text categorization, Information Sciences, 176 (4), 2006, 415-437.
[15] Zaki M.J., SPADE: An Efficient Algorithm for Mining Frequent Sequences, Machine Learning, 42 (1-2), 2001, 31-60.

Notes

This research was partially supported by the National Research Centre (contract No. UMO-2011/01/B/ST6/06908).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-9bc508aa-ba72-4f1d-9346-39fc66db6282