Tackling the Problem of Class Imbalance in Multi-class Sentiment Classification: An Experimental Study

Lango, Mateusz

doi:10.2478/fcds-2019-0009

Abstract

Sentiment classification is an important task that has gained extensive attention in both academia and industry. Many issues related to this task, such as the handling of negation or sarcastic utterances, have been analyzed and addressed in previous work. However, the issue of class imbalance, which often compromises the predictive capabilities of learning algorithms, has scarcely been studied. In this work, we aim to bridge the gap between imbalanced learning and sentiment analysis. An experimental study covering twelve imbalanced learning preprocessing methods, four feature representations, and a dozen datasets is carried out to analyze the usefulness of imbalanced learning methods for sentiment classification. Moreover, the data difficulty factors commonly studied in imbalanced learning are investigated on sentiment corpora to evaluate the impact of class imbalance.

Keywords

sentiment analysis; imbalanced data; multi-class learning; data difficulty factor; text classification

Publisher

Wydawnictwo Politechniki Poznańskiej

Journal

Foundations of Computing and Decision Sciences

Year

2019

Volume

Vol. 44, No. 2

Pages

151-178

Physical description

Bibliography: 75 items, tables.

Contributors

author

Lango Mateusz

mateusz.lango@cs.put.poznan.pl

Institute of Computing Sciences, Poznan University of Technology, Poznań, Poland

Bibliography

[1] Abbasi, A., France, S., Zhang, Z., Chen, H.: Selecting Attributes for Sentiment Classification Using Feature Relation Networks. IEEE Transactions on Knowledge and Data Engineering, 23 (3), 447-462 (2011).
[2] Baccianella, S., Esuli, A., Sebastiani, F.: Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proc. of the Int. Conference on Language Resources and Evaluation (2010).
[3] Blagus, R., Lusa, L.: SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics, 14 (1), 1471-2105 (2013).
[4] Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL-2007), 440-447 (2007).
[5] Błaszczyński, J., Stefanowski, J.: Neighbourhood sampling in bagging for imbalanced data. Neurocomputing, 150 A, 184-203 (2015).
[6] Błaszczyński, J., Stefanowski, J.: Local data characteristics in learning classifiers from imbalanced data. In Advances in Data Analysis with Computational Intelligence Methods, 51-85, Springer (2018).
[7] Brzezinski, D. and Stefanowski, J.: Stream Classification. Encyclopedia of Machine Learning and Data Mining, Springer (2017).
[8] Burns N., Bi Y., Wang H., Anderson T.: Sentiment Analysis of Customer Reviews: Balanced versus Unbalanced Datasets. In: Konig A., Dengel A., Hinkelmann K., Kise K., Howlett R.J., Jain L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems, LNCS, 6881, 161-170 (2011).
[9] Chawla, N.: Data mining for imbalanced datasets: An overview. In Maimon O., Rokach L. (eds): The Data Mining and Knowledge Discovery Handbook, Springer, 853-867 (2005).
[10] Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: Synthetic Minority Over-sampling Technique. J. of Artificial Intelligence Research, 16, 321-357 (2002).
[11] Das, S. R., Chen, M. Y.: Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science, 53(9), 1375-1388 (2007).
[12] Fernández A., García S., Galar M., Prati R., Krawczyk B., Herrera F.: Learning from Imbalanced Data Sets. Springer (2018).
[13] Fernández, A., García, S., Herrera, F., Chawla, N.V.: SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 61, 863-905 (2018).
[14] Fernández, A., López, V., Galar, M., del Jesus, M.J., Herrera, F.: Analysing the classification of imbalanced data sets with multiple classes: binarization techniques and ad-hoc approaches. Knowledge-Based Systems, 42, 97-110 (2013).
[15] Fernández-Navarro, F., Hervás-Martínez, C., Gutiérrez, P.A.: A dynamic over-sampling procedure based on sensitivity for multi-class problems. Pattern Recognition, 44, 1821-1833 (2011).
[16] Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463-484 (2012).
[17] Ganu, G., Elhadad, N., Marian, A.: Beyond the stars: improving rating predictions using review text content. In Proc. of 12th Int. Workshop on the Web and Databases, 9, 1-6 (2009).
[18] Garcia, V., Sanchez, J.S., Mollineda, R.A.: An empirical study of the behaviour of classifiers on imbalanced and overlapped data sets. In Proc. of Progress in Pattern Recognition, Image Analysis and Applications, LNCS, 4756, 397-406 (2007).
[19] Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Advances in Intelligent Computing, 878-887 (2005).
[20] He, H., Yang, B., Garcia, E.A., Li, S.: ADASYN: Adaptive synthetic sampling approach for imbalanced learning. IEEE Int. Joint Conference on Neural Networks, 1322-1328 (2008).
[21] He H., Garcia E.: Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21 (9), 1263-1284 (2009).
[22] He, H. and Ma, Y.: Imbalanced learning: foundations, algorithms, and applications, Wiley (2013).
[23] Hido, S., Kashima, H.: Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining, 2 (5-6), 412-426 (2009).
[24] Hu, M., Liu, B.: Mining and summarizing customer reviews. In Proc. of the 10th ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining, 168-177 (2004).
[25] Japkowicz, N., Stephen, S.: Class imbalance problem: a systematic study. Intelligent Data Analysis Journal, 6 (5), 429-450 (2002).
[26] Jo, T., Japkowicz, N.: Class Imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6 (1), 40-49 (2004).
[27] Kiritchenko, S., Zhu, X., Mohammad, S.M.: Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50, 723-762 (2014).
[28] Koppel, M, Schler, J.: The Importance of Neutral Examples for Learning Sentiment. Computational Intelligence, 22, 100-109 (2006).
[29] Krawczyk B., McInnes B.T., Cano A.: Sentiment Classification from Multi-class Imbalanced Twitter Data Using Binarization. In: Martínez de Pisón F., Urraca R., Quintián H., Corchado E. (eds) Hybrid Artificial Intelligent Systems, LNCS, 10334, 26-37 (2017).
[30] Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In Proc. of the 14th Int. Conf. on Machine Learning ICML-97, 179-186 (1997).
[31] Kuncheva, L. I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley (2004).
[32] Lango M., Brzeziński D., Firlik S., Stefanowski J.: Discovering Minority Subclusters and Local Difficulty Factors from Imbalanced Data. In Proc. of the 20th Int. Conference on Discovery Science (2017).
[33] Lango M., Brzeziński D., Stefanowski J.: PUT at SemEval-2016 Task 4: The ABC of Twitter Sentiment Analysis, In Proc. of the 10th Int. Workshop on Semantic Evaluation (2016).
[34] Lango, M., Napierala, K., Stefanowski, J.: Evaluating Difficulty of Multi-class Imbalanced Data. In Proc. of 23rd Int. Symposium on Methodologies for Intelligent Systems, 312-322 (2017).
[35] Lango M., Stefanowski J.: Multi-class and Feature Selection Extensions of Roughly Balanced Bagging for Imbalanced Data. Journal of Intelligent Information Systems (2018).
[36] Lemaître G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research, 18 (17), 1-5 (2017).
[37] Li, S., Ju, S., Zhou, G., Li, X.: Active learning for imbalanced sentiment classification. In Proc. of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 139-148 (2012).
[38] Li, S., Wang, Z., Zhou, G., Lee, S. Y. M.: Semi-supervised learning for imbalanced sentiment classification. In Proc. of Int. Joint Conference on Artificial Intelligence, 22 (3), 1826-1831 (2011).
[39] Li, T., Zhang, Y., Sindhwani, V.: A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In Proc. of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th Int. Joint Conference on Natural Language Processing of the AFNLP, 1, 244-252 (2009).
[40] Li, S., Zhou, G., Wang, Z., Lee, S. Y. M., Wang, R.: Imbalanced sentiment classification. In Proc. of the 20th ACM Int. Conference on Information and Knowledge Management, 2469-2472 (2011).
[41] Liu, B.: Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies, Morgan & Claypool (2012).
[42] Loper, E., Bird, S.: NLTK: The natural language toolkit. In Proc. of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 1, 63-70 (2002).
[43] Mathioudakis, M., Koudas, N.: Twitter-monitor: Trend detection over the twitter stream. In Proc. of the 2010 ACM SIGMOD Int. Conference on Management of Data, 1155-1158 (2010).
[44] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. In Proc. of Neural Information Processing Systems (2013).
[45] Mohammad, S., Turney, P.D.: Crowd-sourcing a word-emotion association lexicon. Computational Intelligence, 29 (3), 436-465 (2013).
[46] Mountassir, A., Benbrahim, H., Berrada, I.: An empirical study to address the problem of Unbalanced Data Sets in sentiment classification. IEEE Int. Conference on Systems, Man, and Cybernetics (SMC), 3298-3303 (2012).
[47] Nakov, P., Ritter, A., Rosenthal, S., Stoyanov, V., Sebastiani, F.: SemEval-2016 task 4: Sentiment analysis in Twitter. In Proc. of the 10th Int. Workshop on Semantic Evaluation (2016).
[48] Napierala, K., Stefanowski, J.: The influence of minority class distribution on learning from imbalanced data. In Proc. of the 7th Int. Conference on Hybrid Artificial Intelligent Systems, LNAI, 7209, 139-150 (2012).
[49] Napierala, K., Stefanowski, J.: BRACID: a comprehensive approach to learning rules from imbalanced data. Journal of Intelligent Information Systems, 39 (2), 335-373 (2012).
[50] Napierala, K., Stefanowski, J.: Types of minority class examples and their influence on learning classifiers from imbalanced data. Journal of Intelligent Information Systems, 46(3), 563-597 (2016).
[51] Jakob, N., Weber, S.H., Müller, M.C., Gurevych, I.: Beyond the stars: exploiting free-text user reviews to improve the accuracy of movie recommendations. In Proc. of the 1st Int. Workshop on Topic-sentiment analysis for mass opinion (2009).
[52] Ohana, B., Tierney, B., Delany, S. J.: Domain independent sentiment classification with many lexicons. In 4th Int. Symposium on Mining and Web at 25th Int. Conference on Advanced Information Networking and Applications (AINA), 632-637 (2011).
[53] Pang, B., Lee, L.: A Sentimental Education: Sentiment Analysis using subjectivity summarization based on minimum cuts. In: 42nd Annual Meeting on Association for Computational Linguistics, 271-278 (2004).
[54] Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment Classification using Machine Learning Techniques. In: Conference on Empirical Methods in Natural Language Processing, 10, 79-86 (2002).
[55] Pedregosa et al.: Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830 (2011).
[56] Prati, R., Batista, G., Monard, M.: Class imbalance versus class overlapping: an analysis of a learning system behavior. In Proc. of 3rd Mexican Int. Conf. on Artificial Intelligence, 312-321 (2004).
[57] Remus, R.: Modeling and representing negation in data-driven machine learning-based sentiment analysis. In Proc. of 1st Int. Workshop on Emotion and Sentiment in Social and Expressive Media (ESSEM 2013), 22-33 (2013).
[58] Schütze, H., Manning, C.D.: Foundations of Statistical Natural Language Processing. MIT Press (1999).
[59] Song, K., Feng, S., Gao, W., Wang, D., Yu, G., Wong, K. F.: Personalized Sentiment Classification Based on Latent Individuality of Microblog Users. In Proc. of Int. Joint Conferences on Artificial Intelligence, 2277-2283 (2015).
[60] Stefanowski, J.: Overlapping, rare examples and class decomposition in learning classifiers from imbalanced data. In Ramanna, S., Jain, L.C., Howlett, R.J. (eds), Emerging Paradigms in Machine Learning, 277-306 (2013).
[61] Stefanowski, J.: Dealing with Data Difficulty Factors while Learning from Imbalanced Data. In S. Matwin and J. Mielniczuk (eds), Challenges in Computational Statistics and Data Mining, Studies in Computational Intelligence, 605, 333-363 (2016).
[62] Stefanowski, J., Wilk, S.: Selective pre-processing of imbalanced data for improving classification performance. In Song, I.-Y., Eder, J., Nguyen, T.M. (eds) Data Warehousing and Knowledge Discovery, LNCS, 5182, 283-292 (2008).
[63] Tomek, I.: Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6, 769-772 (1976).
[64] Turney, P.D.: Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL-2002) (2002).
[65] Wallace, B.C., Small, K., Brodley, C.E., Trikalinos, T.A.: Class Imbalance, Redux. In Proc. of IEEE 11th Int. Conference on Data Mining, 754-763 (2011).
[66] Wang, S., Yao, X.: Multiclass imbalance problems: analysis and potential solutions. IEEE Trans. System Man Cybern., Part B. 42 (4), 1119-1130 (2012).
[67] Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble models. IEEE Symp. Comput. Intell. Data Mining, 324-331 (2009).
[68] Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: a rating regression approach. In Proc. of the 16th ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining, 783-792 (2010).
[69] Wiebe, J., Wilson, T., Cardie, C.: Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39 (2-3), 165-210 (2005).
[70] Wilson, D.: Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics, 2 (3), 408-421 (1972).
[71] Wilson D.R., Martinez T.R.: Improved heterogeneous distance functions. J. Artificial Intelligence Research, 6, 1-34 (1997).
[72] Wojciechowski, S., Wilk, S., Stefanowski, J.: An algorithm for selective preprocessing of multiclass imbalanced data. In Proc. of Int. Conference on Computer Recognition Systems, CORES 2017, 238-247 (2017).
[73] Wojciechowski, S., Wilk, S.: Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data, Foundations of Computing and Decision Sciences, 42(2), 149-176 (2017).
[74] Xu, R., Chen, T., Xia, Y., Lu, Q., Liu, B., Wang, X.: Word Embedding Composition for Data Imbalances in Sentiment and Emotion Classification. Cogn Comput, 7, 226 (2015).
[75] Zhou, Z. H., Liu, X.Y.: On multi-class cost sensitive learning. Computational Intelligence, 26 (3), 232-257 (2010).

Notes

Record prepared under agreement 509/P-DUN/2018, financed from funds of the Ministry of Science and Higher Education (MNiSW) earmarked for science dissemination activities (2019).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-0241b5b4-8558-4250-83d8-f7ca279c74e2