

Article title

Experimental Comparison of Pre-Trained Word Embedding Vectors of Word2Vec, Glove, FastText for Word Level Semantic Text Similarity Measurement in Turkish

Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
This study experimentally evaluates the word vectors produced by three widely used embedding methods for word-level semantic text similarity in Turkish. Three benchmark datasets, SimTurk, AnlamVer, and RG65_Turkce, are used to evaluate the word embedding vectors produced by three different methods, namely Word2Vec, GloVe, and FastText. The comparative analysis shows that the Turkish word vectors produced with GloVe and FastText achieved higher correlation with human judgments of word-level semantic similarity. It is also found that the Turkish word coverage of FastText exceeds that of the other two methods, since only a limited number of out-of-vocabulary (OOV) words was observed in the experiments conducted with FastText. Another observation is that FastText and GloVe vectors achieved high Spearman correlation values on the SimTurk and AnlamVer datasets, both of which were prepared and evaluated entirely by native Turkish speakers. This is a further indicator that these datasets better represent the Turkish language in terms of morphology and inflection.
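The evaluation described in the abstract can be sketched in a few lines: score each word pair by the cosine similarity of its two embedding vectors, then compute the Spearman rank correlation between the model scores and the human similarity judgments. The snippet below is a minimal, self-contained illustration of that procedure; the embeddings, word pairs, and gold scores are toy placeholders, not the actual SimTurk/AnlamVer data or real Word2Vec/GloVe/FastText vectors.

```python
# Sketch of word-level similarity evaluation: cosine similarity of
# embedding vectors vs. human judgments, compared via Spearman's rho.
# All data below is hypothetical, for illustration only.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ranks(xs):
    """Ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mx) ** 2 for a in rx) *
               sum((b - my) ** 2 for b in ry))
    return num / den

# Toy 2-d "embeddings" (hypothetical; real vectors are 100-300 dims).
emb = {
    "kedi": [1.0, 0.1], "köpek": [0.9, 0.2],
    "araba": [0.1, 1.0], "otomobil": [0.15, 0.95],
}
pairs = [("kedi", "köpek"), ("kedi", "araba"), ("araba", "otomobil")]
human = [8.5, 1.2, 9.0]  # hypothetical gold similarity scores

model = [cosine(emb[a], emb[b]) for a, b in pairs]
rho = spearman(model, human)
```

In the study itself, this correlation would be computed per dataset and per embedding method, with OOV pairs either skipped or handled by FastText's subword vectors.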
Authors
  • Software Engineering Department, Adana Alparslan Turkes Science and Technology University, Balcalı, Çatalan Cd., 01250 Adana, Turkey
Bibliography
  • 1. Sammut, C., Webb, G.I. TF–IDF. Springer US, 2010. https://doi.org/10.1007/978-0-387-30164-8_832.
  • 2. Dumais, S.T. Latent Semantic Analysis. Annual Review of Information Science and Technology. 2005; 38: 188–230. https://doi.org/10.1002/aris.1440380105.
  • 3. Mikolov, T., Chen, K., Corrado, G., Dean, J. Efficient Estimation of Word Representations in Vector Space; 2013.
  • 4. Pennington, J., Socher, R., Manning, C.D. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 2014; 1532–1543.
  • 5. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T. Enriching Word Vectors with Subword Information. 2016. arXiv preprint arXiv:1607.04606.
  • 6. Harris, Z. 1954. Distributional structure. Word, 10(23), 146–162.
  • 7. Firth, J.R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis. Oxford: Philological Society. Reprinted in F.R. Palmer (ed.), Selected Papers of J.R. Firth 1952–1959, London: Longman, 1968; 1–32.
  • 8. Christiane Fellbaum (1998, ed.) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
  • 9. Brin, S., Page, L. The anatomy of a large-scale hypertextual Web search engine (PDF). Computer Networks and ISDN Systems. 1998; 30 (1–7): 107–117. https://doi.org/10.1016/S0169-7552(98)00110-X.
  • 10. Budanitsky, A., Hirst, G. Evaluating WordNet-based measures of semantic distance. Comput. Linguistics. 2006; 32(1): 13–47.
  • 11. Wu Z., Palmer, M. 1994. Verb Semantics and Lexical Selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics.
  • 12. Aydogan, M., Karci, A. Kelime Temsil Yöntemleri ile Kelime Benzerliklerinin İncelenmesi. 2019; 34, 181–195. https://doi.org/10.21605/cukurovaummfd.609119.
  • 13. Amasyalı, M.F., Balcı, S., Mete, E., Varlı, E.N. 2012. Türkçe Metinlerin Sınıflandırılmasında Metin Temsil Yöntemlerinin Performans Karşılaştırılması EMO Bilimsel Dergi, 2(4), 95–104.
  • 14. Arabacı, M.A., Esen, E., Atar, M.S., Yılmaz, E., Kaltalıoğlu, B. 2018. Detecting Similar Sentences Using Word Embedding. In 2018 26th Signal Processing and Communications Applications Conference.
  • 15. Esen, E., Özkan, S. Analysis of Turkish Parliament Records in Terms of Party Coherence. In 2017 25th Signal Processing and Communications Applications Conference (SIU) 1–4. IEEE 2017.
  • 16. Sopaoglu, U., Ercan, G. 2016. Evaluation of Semantic Relatedness Measures for Turkish Language. In Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016), Konya, Turkey 2016.
  • 17. Rubenstein, H., Goodenough J.B. Contextual correlates of synonymy. Communications of the ACM. 1965; 8(10): 627–633.
  • 18. Camacho-Collados J., Pilehvar, T. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research. 2018; 63: 743–788.
  • 19. Ercan, G., Yıldız, O.T. AnlamVer: Semantic Model Evaluation Dataset for Turkish – Word Similarity and Relatedness. In Proceedings of the 27th International Conference on Computational Linguistics, 2018, 3819–3836.
  • 20. Dündar, E.B., Alpaydın, E. Learning Word Representations with Deep Neural Networks for Turkish. 27th Signal Processing and Communications Applications Conference (SIU), 2019, 1–4. https://doi.org/10.1109/SIU.2019.8806491.
  • 21. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee K., Zettlemoyer, L. Deep contextualized word representations. In Proc. of NAACL, 2018.
  • 22. Morin, F., Bengio, Y. 2005. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, AIST-ATS’05, 2005, 246–252.
  • 23. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E. Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems. 2002; 20(1): 116–131.
  • 24. Hill, F., Reichart, R., Korhonen, A. SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation. Computational Linguistics. 2015; 41(4): 665–695. https://doi.org/10.1162/COLI_a_002.
  • 25. https://github.com/cagataytulu/TurkishWordSimilarity/.
Notes
Record created with funds from the Ministry of Education and Science (MEiN), agreement no. SONP/SP/546092/2022, under the programme "Social Responsibility of Science" – module: Popularisation of science and promotion of sport (2022–2023).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-b81722cc-136d-493e-9b7f-052ce57af670