Article title

Bag of words and embedding text representation methods for medical article classification

Authors
Content
Identifiers
Title variants
Languages of publication
EN
Abstracts
EN
Text classification has become a standard component of automated systematic literature review (SLR) solutions, where articles are classified as relevant or irrelevant to a particular literature study topic. Conventional machine learning algorithms for tabular data, which can learn quickly and with low computational demands from not necessarily large and usually imbalanced data, are well suited to this application, but they require that the text data be transformed into a vector representation. This work investigates the utility of different types of text representations for this purpose. Experiments are presented using the bag of words representation and selected representations based on word or text embeddings: word2vec, doc2vec, GloVe, fastText, Flair, and BioBERT. Four classification algorithms are used with these representations: a naive Bayes classifier, logistic regression, support vector machines, and random forest. They are applied to datasets consisting of scientific article abstracts from systematic literature review studies in the medical domain and compared with the pre-trained BioBERT model fine-tuned for classification. The obtained results confirm that the choice of text representation is essential for successful text classification. It turns out that, while the standard bag of words representation is hard to beat, fastText word embeddings make it possible to achieve roughly the same level of classification quality with the added benefits of much lower dimensionality and the capability of handling out-of-vocabulary words. More refined embedding methods based on deep neural networks, while much more demanding computationally, do not appear to offer substantial advantages for the classification task. The fine-tuned BioBERT classification model performs on par with conventional algorithms coupled with their best text representation methods.
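A minimal sketch of the pipeline the abstract describes, assuming the scikit-learn [47] and fastText [8] libraries cited in the bibliography below; the toy documents, labels, and hyperparameters are illustrative placeholders, not the paper's data or settings. The first part pairs the bag of words (TF-IDF) representation with the four conventional classifiers named in the abstract:

# Sketch of the setup described in the abstract, assuming scikit-learn [47];
# the data and settings below are placeholders, not the paper's.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-ins for article abstracts labeled relevant (1) / irrelevant (0).
texts = [
    "randomized trial of statin therapy for cardiovascular outcomes",
    "cohort study of antibiotic resistance in hospital patients",
    "meta-analysis of vaccine efficacy against influenza",
    "screening accuracy of imaging for early tumor detection",
    "finite element simulation of bridge load distribution",
    "compiler optimizations for embedded real-time systems",
    "soil erosion modeling in mountainous watersheds",
    "consumer sentiment analysis of retail product reviews",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Bag of words with TF-IDF weighting: one dimension per vocabulary term.
bow = TfidfVectorizer(lowercase=True, stop_words="english")
X_bow = bow.fit_transform(texts)

# The four conventional classifiers named in the abstract.
classifiers = {
    "naive Bayes": MultinomialNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X_bow, labels, cv=2, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {auc.mean():.2f}")

For the dense fastText alternative, get_sentence_vector averages normalized word vectors, with subword n-grams providing vectors even for out-of-vocabulary words. Continuing the sketch above, and assuming a pre-trained English model file (e.g. cc.en.300.bin) downloaded from https://fasttext.cc:

# Dense 300-dimensional document vectors from pre-trained fastText
# embeddings [8]; the model file name is an assumed download.
import fasttext

ft = fasttext.load_model("cc.en.300.bin")

# get_sentence_vector averages L2-normalized word vectors; newlines must
# be stripped because fastText treats them as document separators.
X_ft = np.vstack([ft.get_sentence_vector(t.replace("\n", " ")) for t in texts])

# Dense vectors can contain negative values, so MultinomialNB no longer
# applies; the other three classifiers can be reused unchanged, e.g.:
print(cross_val_score(LogisticRegression(max_iter=1000), X_ft, labels,
                      cv=2, scoring="roc_auc").mean())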
Year
Pages
603-621
Physical description
Bibliography: 65 items, figures, tables, charts.
Creators
  • Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland
Bibliography
  • [1] Aggarwal, C.C. and Zhai, C.-X. (Eds) (2012). Mining Text Data, Springer, New York.
  • [2] Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S. and Vollgraf, R. (2019). FLAIR: An easy-to-use framework for state-of-the-art NLP, Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Stroudsburg, USA, pp. 54-59.
  • [3] Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S. and Vollgraf, R. (2021). Flair: A Very Simple Framework for State-of-the-Art NLP, Version 0.10, https://github.com/flairNLP/flair.
  • [4] Akbik, A., Blythe, D. and Vollgraf, R. (2018). Contextual string embeddings for sequence labeling, Proceedings of the 27th International Conference on Computational Linguistics, COLING-2018, Santa Fe, USA, pp. 1638-1649.
  • [5] Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection, Statistics Surveys 4: 40-79.
  • [6] Babić, K., Martinčić-Ipšić, S. and Meštrović, A. (2020). Survey of neural text representation models, Information 11(11): 511.
  • [7] Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. (2016). Enriching word vectors with subword information, arXiv: 1607.04606.
  • [8] Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. (2020). fastText: Library for Efficient Text Classification and Representation Learning, Version 0.9.2, https://fasttext.cc.
  • [9] Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M. and Kasneci, G. (2022). Deep neural networks and tabular data: A survey, arXiv: 2110.01889.
  • [10] Breiman, L. (2001). Random forests, Machine Learning 45(1): 5-32.
  • [11] Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P. (2002). SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16: 321-357.
  • [12] Cichosz, P. (2018). A case study in text mining of discussion forum posts: Classification with bag of words and global vectors, International Journal of Applied Mathematics and Computer Science 28(4): 787-801, DOI: 10.2478/amcs-2018-0060.
  • [13] Cohen, A.M., Hersh, W.R., Peterson, K. and Yen, P.-Y. (2006). Reducing workload in systematic review preparation using automated citation classification, Journal of the American Medical Informatics Association 13(2): 206-219.
  • [14] Cohn, D., Atlas, L. and Ladner, R. (1994). Improving generalization with active learning, Machine Learning 15(2): 201-221.
  • [15] Cortes, C. and Vapnik, V.N. (1995). Support-vector networks, Machine Learning 20(3): 273-297.
  • [16] Dařena, F. and Žižka, J. (2017). Ensembles of classifiers for parallel categorization of large number of text documents expressing opinions, Journal of Applied Economic Sciences 12(1): 25-35.
  • [17] Deb, S. and Chanda, A.K. (2022). Comparative analysis of contextual and context-free embeddings in disaster prediction from Twitter data, Machine Learning with Applications 7: 100253.
  • [18] Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding, Proceedings of the 17th Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT-2019, Minneapolis, USA, pp. 4171-4186.
  • [19] Dumais, S.T., Platt, J.C., Heckerman, D. and Sahami, M. (1998). Inductive learning algorithms and representations for text categorization, Proceedings of the 7th International Conference on Information and Knowledge Management, CIKM-98, Bethesda, USA, pp. 148-155.
  • [20] Egan, J.P. (1975). Signal Detection Theory and ROC Analysis, Academic Press, New York.
  • [21] Fawcett, T. (2006). An introduction to ROC analysis, Pattern Recognition Letters 27(8): 861-874.
  • [22] Forman, G. (2003). An extensive empirical study of feature selection measures for text classification, Journal of Machine Learning Research 3: 1289-1305.
  • [23] Forman, G. and Scholz, M. (2010). Apples-to-apples in cross-validation studies: Pitfalls in classifier performance measurement, ACM SIGKDD Explorations Newsletter 12(1): 49-57.
  • [24] García Adeva, J.J., Pikatza Atxa, J.M., Ubeda Carrillo, M. and Ansuategi Zengotitabengoa, E. (2014). Automatic text classification to support systematic reviews in medicine, Expert Systems with Applications 41(4): 1498-1508.
  • [25] Graves, A. (2013). Generating sequences with recurrent neural networks, arXiv: 1308.0850.
  • [26] Hamel, L.H. (2009). Knowledge Discovery with Support Vector Machines, Wiley, Hoboken.
  • [27] Hassan, S., Mihalcea, R. and Banea, C. (2007). Random-walk term weighting for improved text classification, Proceedings of the 1st IEEE International Conference on Semantic Computing, ICSC-2007, Irvine, USA, pp. 53-60.
  • [28] Helaskar, M.N. and Sonawane, S.S. (2019). Text classification using word embeddings, Proceedings of the 5th International Conference on Computing, Communication, Control, and Automation, ICCUBEA-2019, New York, USA, pp. 1-4.
  • [29] Hilbe, J.M. (2009). Logistic Regression Models, Chapman and Hall, Boca Raton.
  • [30] Honnibal, M., Montani, I., Van Landeghem, S. and Boyd, A. (2021). spaCy: Industrial-Strength Natural Language Processing in Python, http://spacy.io.
  • [31] Ji, X., Ritter, A. and Yen, P.-Y. (2017). Using ontology-based semantic similarity to facilitate the article screening process for systematic reviews, Journal of Biomedical Informatics 69: 33-42.
  • [32] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features, Proceedings of the 10th European Conference on Machine Learning, ECML-98, Chemnitz, Germany, pp. 137-142.
  • [33] Joachims, T. (2002). Learning to Classify Text by Support Vector Machines: Methods, Theory, and Algorithms, Springer, New York.
  • [34] Jonnalagadda, S. and Petitti, D. (2013). A new iterative method to reduce workload in systematic review process, International Journal of Computational Biology and Drug Design 6(1-2): 5-17.
  • [35] Kaibi, I., Nfaoui, E.H. and Satori, H. (2019). A comparative evaluation of word embeddings techniques for Twitter sentiment analysis, Proceedings of the 2019 International Conference on Wireless Technologies, Embedded and Intelligent Systems, WITS-2019, Fez, Morocco, pp. 1-4.
  • [36] Khabsa, M., Elmagarmid, A., Ilyas, I., Hammady, H. and Ouzzani, M. (2016). Learning to identify relevant studies for systematic reviews using random forest and external information, Machine Learning 102(3): 465-482.
  • [37] Koprinska, I., Poon, J., Clark, J. and Chan, J. (2007). Learning to classify e-mail, Information Sciences: An International Journal 177(10): 2167-2187.
  • [38] Koziarski, M. and Woźniak, M. (2017). CCR: A combined cleaning and resampling algorithm for imbalanced data classification, International Journal of Applied Mathematics and Computer Science 27(4): 727-736, DOI: 10.1515/amcs-2017-0050.
  • [39] Le, Q.V. and Mikolov, T. (2014). Distributed representations of sentences and documents, Proceedings of the 31st International Conference on Machine Learning, ICML-2014, Beijing, China, pp. 1188-1196.
  • [40] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H. and Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36(4): 1234-1240.
  • [41] Lewis, D.D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval, Proceedings of the 10th European Conference on Machine Learning, ECML-98, Chemnitz, Germany, pp. 4-15.
  • [42] Matwin, S., Kouznetsov, A., Inkpen, D., Frunza, O. and O’Blenis, P. (2010). A new algorithm for reducing the workload of experts in performing systematic reviews, Journal of the American Medical Informatics Association 17(4): 446-453.
  • [43] McCallum, A. and Nigam, K. (1998). A comparison of event models for naive Bayes text classification, Proceedings of the AAAI/ICML-98 Workshop on Learning for Text Categorization, Madison, USA, pp. 41-48.
  • [44] Menardi, G. and Torelli, N. (2014). Training and assessing classification rules with imbalanced data, Data Mining and Knowledge Discovery 28(1): 92-122.
  • [45] Mikolov, T., Chen, K., Corrado, G.S. and Dean, J. (2013). Efficient estimation of word representations in vector space, arXiv: 1301.3781.
  • [46] Mitchell, J. and Lapata, M. (2010). Composition in distributional models of semantics, Cognitive Science 34(8): 1388-1429.
  • [47] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12: 2825-2830.
  • [48] Pennington, J., Socher, R. and Manning, C.D. (2014). GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP-2014, Doha, Qatar, pp. 1532-1543.
  • [49] Platt, J.C. (1998). Fast training of support vector machines using sequential minimal optimization, in B. Schölkopf et al. (Eds), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, pp. 185-208.
  • [50] Radovanović, M. and Ivanović, M. (2008). Text mining: Approaches and applications, Novi Sad Journal of Mathematics 38(3): 227-234.
  • [51] Řehůřek, R. (2021). Gensim: Topic Modeling for Humans, Version 4.0.1, https://radimrehurek.com/gensim.
  • [52] Řehůřek, R. and Sojka, P. (2010). Software framework for topic modelling with large corpora, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45-50.
  • [53] Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B.B., Chen, X. and Wang, X. (2020). A survey of deep active learning, ACM Computing Surveys 54(9): 1-40.
  • [54] Rios, G. and Zha, H. (2004). Exploring support vector machines and random forests for spam detection, Proceedings of the 1st Conference on Email and Anti-Spam, CEAS-2004, Mountain View, USA, pp. 284-292.
  • [55] Salton, G. and Buckley, C. (1988). Term weighting approaches in automatic text retrieval, Information Processing and Management 24(5): 513-523.
  • [56] Szymański, J. (2014). Comparative analysis of text representation methods using classification, Cybernetics and Systems 45(2): 180-199.
  • [57] van den Bulk, L.M., Bouzembrak, Y., Gavai, A., Liu, N., van den Heuvel, L.J. and Marvin, H.J.P. (2022). Automatic classification of literature in systematic reviews on food safety using machine learning, Current Research in Food Science 5: 84-95.
  • [58] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need, Advances in Neural Information Processing Systems, NIPS-2017, Long Beach, USA, pp. 6000-6010.
  • [59] Wang, C., Nulty, P. and Lillis, D. (2020). A comparative study on word embeddings in deep learning for text classification, Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, NLPIR-2020, Seoul, Korea, pp. 37-46.
  • [60] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q. and Rush, A.M. (2020). Transformers: State-of-the-art natural language processing, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, online, pp. 38-45.
  • [61] Xue, D. and Li, F. (2015). Research of text categorization model based on random forests, 2015 IEEE International Conference on Computational Intelligence and Communication Technology, CICT-2015, Ghaziabad, India, pp. 173-176.
  • [62] Yang, Y. and Pedersen, J. (1997). A comparative study on feature selection in text categorization, Proceedings of the 14th International Conference on Machine Learning, ICML-97, Nashville, USA, pp. 412-420.
  • [63] Yessenalina, A. and Cardie, C. (2011). Compositional matrix-space models for sentiment analysis, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP-2011, Edinburgh, UK, pp. 172-182.
  • [64] Zhu, X. and Goldberg, A. (2009). Introduction to Semi-Supervised Learning, Morgan & Claypool, San Rafael.
  • [65] Zymkowski, T., Szymański, J., Sobecki, A., Drozda, P., Szałapak, K., Komar-Komarowski, K. and Scherer, R. (2022). Short texts representations for legal domain classification, Proceedings of the 21st International Conference on Artificial Intelligence and Soft Computing, ICAISC-2022, Zakopane, Poland, pp. 105-114.
Document type
YADDA identifier
bwmeta1.element.baztech-5a1538e6-c9d5-423a-9048-b04619a55166