Analyzing the effect of dimensionality reduction in document categorization for Basque

Zelaia, A.; Alegria, I.; Arregi, O.; Sierra, B.

Article details

Selected full texts from this journal

https://journals.pan.pl/acs/

Conference

Human Language Technologies as a Challenge for Computer Science and Linguistics (2nd; 21-23 April 2005; Poznań, Poland)

Abstract

This paper analyzes the effect that dimensionality reduction techniques have on the categorization of documents written in Basque. The classification techniques selected are Naive Bayes, Winnow, SVMs and k-NN. Singular Value Decomposition (SVD) is used for dimensionality reduction, together with lemmatization and noun selection. The results show that the approach combining SVD and k-NN on a lemmatized corpus gives the best accuracy rates of all, by a remarkable margin.
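The combination the abstract singles out (SVD followed by k-NN) can be illustrated with a minimal sketch. This is not the authors' implementation: the six-term vocabulary, the toy counts, the category labels, the rank of 2, the cosine similarity measure and the helper names `fit_lsi`, `project` and `knn_predict` are all assumptions made for illustration, and no lemmatization or noun selection is performed.

```python
from collections import Counter

import numpy as np


def fit_lsi(term_doc, rank):
    """Truncated SVD of a term-by-document matrix (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k = U[:, :rank], s[:rank]
    # Row i holds training document i in the reduced space (equals U_k^T d_i).
    doc_vectors = Vt[:rank, :].T * s_k
    return U_k, doc_vectors


def project(U_k, term_counts):
    """Fold a new term-count vector into the same reduced space."""
    return term_counts @ U_k


def knn_predict(doc_vectors, labels, query_vec, k=3):
    """Majority vote among the k training documents closest by cosine similarity."""
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vec)
    sims = (doc_vectors @ query_vec) / np.where(norms == 0, 1.0, norms)
    nearest = np.argsort(-sims)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (assumed). Rows are the terms goal, match, team, vote, party,
# election; columns are six training documents.
term_doc = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [1, 0, 2, 0, 0, 0],
    [0, 0, 0, 3, 1, 1],
    [0, 0, 0, 1, 2, 0],
    [0, 0, 0, 1, 0, 2],
], dtype=float)
labels = ["sports", "sports", "sports", "politics", "politics", "politics"]

U_k, doc_vectors = fit_lsi(term_doc, rank=2)

# Two unseen documents: one mentions goal and match, the other vote and election.
query_sports = project(U_k, np.array([1, 1, 0, 0, 0, 0], dtype=float))
query_politics = project(U_k, np.array([0, 0, 0, 1, 0, 1], dtype=float))

predicted_sports = knn_predict(doc_vectors, labels, query_sports)
predicted_politics = knn_predict(doc_vectors, labels, query_politics)
print(predicted_sports, predicted_politics)
```

Training documents and new documents are mapped with the same matrix `U_k`, so the sign ambiguity of singular vectors does not affect the similarity ranking.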

Keywords

text categorization; singular value decomposition; supervised classification

Publisher

Polish Academy of Sciences, Committee of Automatic Control and Robotics

Journal

Archives of Control Sciences

Year

2005

Volume

Vol. 15, no. 4

Pages

703-710

Physical description

Bibliography: 19 items, tables.

Contributors

author

Zelaia A.

ccpzejaa@si.ehu.es

University of the Basque Country, UPV-EHU, Computer Science Faculty, 649 postakutxa, 20.080 Donostia, Gipuzkoa, Euskal-Heria, Spain

author

Alegria I.

acpalloi@si.ehu.es

University of the Basque Country, UPV-EHU, Computer Science Faculty, 649 postakutxa, 20.080 Donostia, Gipuzkoa, Euskal-Heria, Spain

author

Arregi O.

acparuro@si.ehu.es

University of the Basque Country, UPV-EHU, Computer Science Faculty, 649 postakutxa, 20.080 Donostia, Gipuzkoa, Euskal-Heria, Spain

author

Sierra B.

ccpsiarb@si.ehu.es

University of the Basque Country, UPV-EHU, Computer Science Faculty, 649 postakutxa, 20.080 Donostia, Gipuzkoa, Euskal-Heria, Spain

References

[1] I. Alegria, X. Artola, K. Sarasola and M. Urkia: Automatic Morphological Analysis of Basque. Literary & Linguistic Computing. 11 (1996).
[2] M. W. Berry and M. Browne: Understanding Search Engines: Mathematical Modeling and Text Retrieval. Society for Industrial and Applied Mathematics. ISBN: 0-89871-437-0. Philadelphia, 1999.
[3] M. W. Berry, S. T. Dumais and G. W. O'Brien: Using Linear Algebra For Intelligent Information Retrieval. SIAM Review, 37(4), (1995), 573-595.
[4] A. J. Carlson, C. M. Cumby, J. L. Rosen and D. Roth: SNoW. UIUC Tech Report UIUC-DCS-R-99-210, University of Illinois, 1999.
[5] I. Dagan, Y. Karov and D. Roth: Mistake-Driven Learning in Text Categorization. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, pages 55-63, 1997.
[6] B. V. Dasarathy: Nearest neighbor (NN) norms: NN pattern recognition classification techniques. IEEE Computer Society Press, 1991.
[7] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman: Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391-407, 1990.
[8] R. Dolin, J. Pierre, M. Butler and R. Avedon: Practical evaluation of IR within automated classification systems. Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 322-329, November 1999.
[9] S. Dumais: Latent semantic analysis. ARIST (Annual Review of Information Science and Technology), 38:189-230, 2004.
[10] S. T. Dumais: Using LSI for information filtering: TREC-3 experiments. In D. Harman, editor, Third Text REtrieval Conference (TREC3), pages 219-230, 1995.
[11] N. Ezeiza, I. Aduriz, I. Alegria, J. M. Arriola and R. Urizar: Combining stochastic and rule-based methods for disambiguation in agglutinative languages. COLING-ACL'98, 1998.
[12] I. Inza, P. Larrañaga, R. Etxeberria and B. Sierra: Feature subset selection by Bayesian network-based optimization. Artificial Intelligence, 123:157-184, 2000.
[13] T. Joachims: Transductive inference for text classification using support vector machines. Proceedings of ICML-99, 16th International Conference on Machine Learning, pages 200-209, 1999.
[14] M. Minsky: Steps toward artificial intelligence. In Proceedings of the Institute of Radio Engineers, volume 49, pages 8-30.
[15] P. Nakov, E. Valchanova and G. Angelova: Towards deeper understanding of the LSA performance. In Proc. of the Int. Conference RANLP-03 "Recent Advances in Natural Language Processing", pages 311-318. Bulgaria, 2003.
[16] F. Sebastiani: Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, March 2002.
[17] I. H. Witten and E. Frank: Data mining, practical machine learning tools and techniques with Java implementations. Morgan Kaufmann Publishers. 1999.
[18] D. Wolpert: Stacked generalization. Neural Networks. 5:241-259. 1992.
[19] Y. Yang and J. O. Pedersen: A comparative study on feature selection in text categorization. In Morgan Kaufmann, editor, Proceedings of the Fourteenth International Conference on Machine Learning, ICML'97, pages 412-420, 1997.

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-article-BSW3-0021-0026