

Article title

Experimental Comparison of Pre-Trained Word Embedding Vectors of Word2Vec, Glove, FastText for Word Level Semantic Text Similarity Measurement in Turkish

Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
This study experimentally evaluates the word vectors produced by three widely used embedding methods for word-level semantic text similarity in Turkish. Three benchmark datasets, SimTurk, AnlamVer, and RG65_Turkce, are used to evaluate the word embedding vectors produced by three different methods, namely Word2Vec, GloVe, and FastText. The comparative analysis shows that the Turkish word vectors produced with GloVe and FastText achieved higher correlation with human judgments of word-level semantic similarity. It is also found that the Turkish word coverage of FastText exceeds that of the other two methods, since only a limited number of out-of-vocabulary (OOV) words was observed in the experiments conducted with FastText. Another observation is that FastText and GloVe vectors achieved high Spearman correlation values on the SimTurk and AnlamVer datasets, both of which were prepared and evaluated entirely by native Turkish speakers. This is a further indicator that these datasets better represent the Turkish language in terms of morphology and inflection.
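The evaluation described in the abstract can be sketched in a few lines: score each word pair by the cosine similarity of its two embedding vectors, then compute the Spearman rank correlation between the model scores and the human similarity judgments. The snippet below is a minimal, self-contained illustration of that procedure; the embeddings, word pairs, and gold scores are toy placeholders, not the actual SimTurk/AnlamVer data or real Word2Vec/GloVe/FastText vectors.

```python
# Sketch of word-level similarity evaluation: cosine similarity of
# embedding vectors vs. human judgments, compared via Spearman's rho.
# All data below is hypothetical, for illustration only.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ranks(xs):
    """Ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mx) ** 2 for a in rx) *
               sum((b - my) ** 2 for b in ry))
    return num / den

# Toy 2-d "embeddings" (hypothetical; real vectors are 100-300 dims).
emb = {
    "kedi": [1.0, 0.1], "köpek": [0.9, 0.2],
    "araba": [0.1, 1.0], "otomobil": [0.15, 0.95],
}
pairs = [("kedi", "köpek"), ("kedi", "araba"), ("araba", "otomobil")]
human = [8.5, 1.2, 9.0]  # hypothetical gold similarity scores

model = [cosine(emb[a], emb[b]) for a, b in pairs]
rho = spearman(model, human)
```

In the study itself, this correlation would be computed per dataset and per embedding method, with OOV pairs either skipped or handled by FastText's subword vectors.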
Authors
  • Software Engineering Department, Adana Alparslan Turkes Science and Technology University, Balcalı, Çatalan Cd., 01250 Adana, Turkey
Bibliography
  • 1. Sammut, C., Webb, G.I. TF–IDF. Springer US, 2010. https://doi.org/10.1007/978-0-387-30164-8_832.
  • 2. Dumais, S.T. Latent Semantic Analysis. Annual Review of Information Science and Technology. 2005; 38: 188–230. https://doi.org/10.1002/aris.1440380105.
  • 3. Mikolov, T., Chen, K., Corrado, G., Dean, J. Efficient Estimation of Word Representations in Vector Space; 2013.
  • 4. Pennington, J., Socher, R., Manning, C.D. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 2014; 1532–1543.
  • 5. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T. Enriching Word Vectors with Subword Information. 2016. arXiv preprint arXiv:1607.04606.
  • 6. Harris, Z. 1954. Distributional structure. Word, 10(23), 146–162.
  • 7. Firth, J.R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis. Oxford: Philological Society. Reprinted in F.R. Palmer (ed.), Selected Papers of J.R. Firth 1952–1959, London: Longman, 1968; 1–32.
  • 8. Christiane Fellbaum (1998, ed.) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
  • 9. Brin, S., Page, L. The anatomy of a large-scale hypertextual Web search engine (PDF). Computer Networks and ISDN Systems. 1998; 30 (1–7): 107–117. https://doi.org/10.1016/S0169-7552(98)00110-X.
  • 10. Budanitsky, A., Hirst, G. Evaluating WordNet-based measures of semantic distance. Comput. Linguistics. 2006; 32(1): 13–47.
  • 11. Wu Z., Palmer, M. 1994. Verb Semantics and Lexical Selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics.
  • 12. Aydogan, M., Karci, A. Kelime Temsil Yöntemleri ile Kelime Benzerliklerinin İncelenmesi. 2019; 34, 181–195. https://doi.org/10.21605/cukurovaummfd.609119.
  • 13. Amasyalı, M.F., Balcı, S., Mete, E., Varlı, E.N. 2012. Türkçe Metinlerin Sınıflandırılmasında Metin Temsil Yöntemlerinin Performans Karşılaştırılması EMO Bilimsel Dergi, 2(4), 95–104.
  • 14. Arabacı, M.A., Esen, E., Atar, M.S., Yılmaz, E., Kaltalıoğlu, B. 2018. Detecting Similar Sentences Using Word Embedding. In 2018 26th Signal Processing and Communications Applications Conference.
  • 15. Esen, E., Özkan, S. Analysis of Turkish Parliament Records in Terms of Party Coherence. In 2017 25th Signal Processing and Communications Applications Conference (SIU) 1–4. IEEE 2017.
  • 16. Sopaoglu, U., Ercan, G. 2016. Evaluation of Semantic Relatedness Measures for Turkish Language. In Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016), Konya, Turkey 2016.
  • 17. Rubenstein, H., Goodenough J.B. Contextual correlates of synonymy. Communications of the ACM. 1965; 8(10): 627–633.
  • 18. Camacho-Collados J., Pilehvar, T. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research. 2018; 63: 743–788.
  • 19. Ercan, G., Yıldız, O.T. AnlamVer: Semantic Model Evaluation Dataset for Turkish – Word Similarity and Relatedness. In Proceedings of the 27th International Conference on Computational Linguistics, 2018, 3819–3836.
  • 20. Dündar, E.B., Alpaydın, E. Learning Word Representations with Deep Neural Networks for Turkish. 27th Signal Processing and Communications Applications Conference (SIU), 2019, 1–4. https://doi.org/10.1109/SIU.2019.8806491.
  • 21. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee K., Zettlemoyer, L. Deep contextualized word representations. In Proc. of NAACL, 2018.
  • 22. Morin, F., Bengio, Y. 2005. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, AIST-ATS’05, 2005, 246–252.
  • 23. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E. Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems. 2002; 20(1): 116–131.
  • 24. Hill, F., Reichart, R., Korhonen, A. SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation. Computational Linguistics. 2015; 41(4): 665–695. https://doi.org/10.1162/COLI_a_002.
  • 25. https://github.com/cagataytulu/TurkishWordSimilarity/.
Notes
Record created with funds from the Ministry of Education and Science (MEiN), agreement no. SONP/SP/546092/2022, under the programme "Social Responsibility of Science" – module: Popularisation of science and promotion of sport (2022–2023).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-b81722cc-136d-493e-9b7f-052ce57af670