Abstract
The article discusses the use of word sequences in text classification. Unlike n-grams, word sequences do not have a fixed length, which gives the classifier the flexibility it needs to operate on documents collected from various sources. The presented classifier is built on the suffix tree structure, which allows word sequences to take part in the classification process. During classification, both single words and longer sequences are taken into account, and each influences the category assignment according to its frequency and length. The Suffix Tree Classifier is compared with the well-known Naive Bayes Classifier, and the properties of both are discussed. The results show that incorporating word sequences into text classification can increase accuracy, and they reveal a relation between the maximal length of the sequences used and the classifier's error rate.
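The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' Suffix Tree Classifier: a dictionary keyed by word tuples stands in for the suffix tree, and the scoring rule (sequence length times the logarithm of one plus its frequency) is an assumption chosen only to show how frequency and length could both contribute. The names `SequenceCounts`, `train`, `classify`, and the parameter `max_len` are hypothetical.

```python
import math
from collections import defaultdict


class SequenceCounts:
    """Counts every word sequence of length 1..max_len in one category.

    Assumption: a plain dictionary replaces the suffix tree; it stores the
    same prefixes of word suffixes, truncated at max_len words.
    """

    def __init__(self, max_len):
        self.max_len = max_len
        self.counts = defaultdict(int)  # word tuple -> number of occurrences

    def add(self, words):
        # Insert each suffix of the document, cut off at max_len words.
        for i in range(len(words)):
            end = min(i + self.max_len, len(words))
            for j in range(i + 1, end + 1):
                self.counts[tuple(words[i:j])] += 1

    def score(self, words):
        # Assumed scoring rule: longer and more frequent matches weigh more.
        total = 0.0
        for i in range(len(words)):
            end = min(i + self.max_len, len(words))
            for j in range(i + 1, end + 1):
                freq = self.counts.get(tuple(words[i:j]), 0)
                if freq == 0:
                    break  # no longer sequence starting at i can match
                total += (j - i) * math.log(1 + freq)
        return total


def train(labeled_docs, max_len=3):
    """Build one SequenceCounts model per category label."""
    models = {}
    for label, text in labeled_docs:
        model = models.setdefault(label, SequenceCounts(max_len))
        model.add(text.lower().split())
    return models


def classify(models, text):
    """Return the label whose model gives the highest score."""
    words = text.lower().split()
    return max(models, key=lambda label: models[label].score(words))


docs = [
    ("spam", "buy cheap pills now"),
    ("spam", "cheap pills for sale"),
    ("ham", "meeting at noon tomorrow"),
    ("ham", "lunch at noon today"),
]
models = train(docs, max_len=3)
print(classify(models, "cheap pills today"))  # prints "spam"
```

Setting `max_len=1` reduces the sketch to single-word matching, so varying `max_len` is one way to explore how the maximal sequence length affects the error rate on a labeled test set.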
Pages
75--85
Physical description
Bibliography: 9 items, charts.
Authors
author
- Institute of Electronic Systems, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland
Bibliography
- 1. Grossi R., Italiano G. F. (1993). Suffix trees and their applications in string algorithms. In: 1st South American Workshop on String Processing. 57-76.
- 2. Larsson J. (1998). The Context Trees of Block Sorting Compression. In: Proceedings of the IEEE Data Compression Conference. 189-198.
- 3. Na J. C., Apostolico A., Iliopoulos C.S., Park K. (2003). Truncated suffix trees and their application to data compression. In: Theoretical Computer Science. Vol. 304, No. 1-3, 87-101.
- 4. Pampapathi R. M., Mirkin B., Levene M. (2005). A Suffix Tree Approach to Email Filtering. Technical report.
- 5. Salton G., Buckley C. (1987). Term Weighting Approaches In Automatic Text Retrieval. Technical report.
- 6. Salton G., Wong A., Yang C. S. (1975). A Vector Space Model for Automatic Indexing. In: Communications of the ACM. Vol. 18, No. 11, 613-620.
- 7. Ukkonen E. (1995). On-Line Construction of Suffix Trees. In: Algorithmica. Vol. 14, No. 3, 249-260.
- 8. Weiner P. (1973). Linear Pattern Matching Algorithms. In: Proceedings of 14th Annual Symposium on Switching and Automata Theory. 1-11.
- 9. Yang Y., Liu X. (1999). A re-examination of text categorization methods. In: Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval. 42-49.
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-c8c92751-5719-4b4c-9c7f-57666c140c82