Article title

Topic Modeling of the SrpELTeC Corpus: A Comparison of NMF, LDA, and BERTopic

Identifiers
Title variants
Conference
Federated Conference on Computer Science and Information Systems (19; 08-11.09.2024; Belgrade, Serbia)
Publication languages
EN
Abstracts
EN
Topic modeling is an effective way to gain insight into large amounts of data. Some of the most widely used topic models are Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). However, with the rise of self-attention models and pre-trained language models, new ways to mine topics have emerged. BERTopic represents the current state of the art in topic modeling. In this paper, we compared the performance of LDA, NMF, and BERTopic on literary texts in Serbian by measuring Topic Coherence (TC) and Topic Diversity (TD), as well as by evaluating the topics qualitatively. For BERTopic, we compared multilingual sentence-transformer embeddings to the monolingual Jerteh-355 embeddings for Serbian. NMF yielded the best TC, while BERTopic with Jerteh-355 embeddings gave the best TD. Jerteh-355 also outperformed the sentence-transformer embeddings in both TC and TD.
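The abstract outlines the whole evaluation pipeline: fit each topic model, extract the top words per topic, then score those word lists with TC and TD. Below is a minimal Python sketch of that loop for the two BERTopic variants. It is not the authors' code: the Hugging Face checkpoint ID for Jerteh-355, the top_n value, and all hyperparameters are assumptions, and TC/TD are computed here as gensim's c_v coherence and the unique-word ratio of Dieng et al., which may differ from the paper's exact setup.

# Minimal evaluation sketch (not the authors' code): fit BERTopic with two
# embedding backbones and score the topics with Topic Coherence (gensim c_v)
# and Topic Diversity (share of unique words across topics).
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def topic_words(model: BERTopic, top_n: int = 10) -> list[list[str]]:
    """Top-n words per topic, skipping BERTopic's -1 outlier topic."""
    return [[word for word, _ in model.get_topic(t)[:top_n]]
            for t in model.get_topics() if t != -1]

def topic_diversity(topics: list[list[str]]) -> float:
    """TD: fraction of unique words across all topic word lists."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words)

def topic_coherence(topics, tokenized_docs) -> float:
    """TC: c_v coherence of the topic word lists over the corpus.
    Assumes every topic word occurs in the tokenized corpus."""
    return CoherenceModel(topics=topics, texts=tokenized_docs,
                          dictionary=Dictionary(tokenized_docs),
                          coherence="c_v").get_coherence()

docs: list[str] = []          # load preprocessed SrpELTeC chunks here
tokenized = [d.split() for d in docs]

for name in ("paraphrase-multilingual-MiniLM-L12-v2",  # multilingual ST model
             "jerteh/Jerteh-355"):                      # assumed HF model ID
    topic_model = BERTopic(embedding_model=SentenceTransformer(name))
    topic_model.fit(docs)
    topics = topic_words(topic_model)
    print(f"{name}: TC={topic_coherence(topics, tokenized):.3f} "
          f"TD={topic_diversity(topics):.3f}")

The same topic_coherence and topic_diversity functions apply unchanged to the top-word lists produced by LDA (e.g., gensim's LdaModel) or NMF (e.g., scikit-learn's NMF), which is what makes the two metrics a common yardstick across otherwise very different models.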
Year
2024
Volume
Pages
649–653
Physical description
Bibliography: 21 items, tables
Authors
  • Association for Language Resources and Technologies, ul. Studentski trg 3, Belgrade, Serbia
  • University of Belgrade, Faculty of Philology, ul. Studentski trg 3, Belgrade, Serbia
  • University of Belgrade, Faculty of Mining and Geology, ul. Đušina 7, Belgrade, Serbia
  • University of Belgrade, Faculty of Mining and Geology, ul. Đušina 7, Belgrade, Serbia
Bibliography
  • 1. I. Uglanova and E. Gius, “The order of things: A study on topic modelling of literary texts,” in Proceedings of the Workshop on Computational Humanities Research (CHR 2020), 2020.
  • 2. K. E. Chu, P. Keikhosrokiani, and M. P. Asl, “A topic modeling and sentiment analysis model for detection and visualization of themes in literary texts,” Pertanika Journal of Science & Technology, vol. 30, no. 4, pp. 2535–2561, 2022, https://doi.org/10.47836/pjst.30.4.14.
  • 3. R. Stanković, C. Krstev, B. Š. Todorović, D. Vitas, M. Škorić, and M. I. Nešić, “Distant reading in digital humanities: Case study on the Serbian part of the ELTeC collection,” in Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3337–3345. [Online]. Available: https://aclanthology.org/2022.lrec-1.356
  • 4. C. Schöch, T. Erjavec, R. Patras, and D. Santos, “Creating the European Literary Text Collection (ELTeC): Challenges and perspectives,” Modern Languages Open, 2021, http://doi.org/10.3828/mlo.v0i0.364.
  • 5. D. Medvecki, B. Bašaragin, A. Ljajić, and N. Milošević, “Multilingual transformer and BERTopic for short text topic modeling: The case of Serbian,” in Conference on Information Technology and its Applications. Springer, 2024, pp. 161–173, https://doi.org/10.1007/978-3-031-50755-7_16.
  • 6. D. Vrandečić and M. Krötzsch, “Wikidata: A free collaborative knowledgebase,” Communications of the ACM, vol. 57, no. 10, pp. 78–85, 2014, https://doi.org/10.1145/2629489.
  • 7. C. Schöch, M. Hinzmann, J. Röttgermann, K. Dietz, and A. Klee, “Smart modelling for literary history,” International Journal of Humanities and Arts Computing, vol. 16, no. 1, pp. 78–93, 2022, https://doi.org/10.3366/ijhac.2022.0278.
  • 8. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
  • 9. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” Advances in Neural Information Processing Systems, vol. 13, 2000. [Online]. Available: https://api.semanticscholar.org/CorpusID:2095855
  • 10. R. Egger and J. Yu, “A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts,” Frontiers in Sociology, vol. 7, p. 886498, 2022.
  • 11. R. Egger and J. Yu, “Identifying hidden semantic structures in Instagram data: A topic modelling comparison,” Tourism Review, vol. 77, no. 4, pp. 1234–1246, 2021.
  • 12. M. Švaňa, “Social media, topic modeling and sentiment analysis in municipal decision support,” in 2023 18th Conference on Computer Science and Intelligence Systems (FedCSIS). IEEE, 2023, pp. 1235–1239, http://dx.doi.org/10.15439/2023F1479.
  • 13. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017, https://doi.org/10.48550/arXiv.1706.03762.
  • 14. M. Grootendorst, “BERTopic: Neural topic modeling with a class-based TF-IDF procedure,” arXiv preprint arXiv:2203.05794, 2022, https://doi.org/10.48550/arXiv.2203.05794.
  • 15. R. Stanković, C. Krstev, B. Šandrih Todorović, and M. Škorić, “Annotation of the Serbian ELTeC collection,” Infotheca - Journal for Digital Humanities, vol. 21, no. 2, pp. 43–59, 2022. [Online]. Available: https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/2021.21.2.3_en
  • 16. N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” arXiv preprint arXiv:1908.10084, 2019, https://doi.org/10.48550/arXiv.1908.10084.
  • 17. M. Škorić, “Novi jezički modeli za srpski jezik” [New language models for the Serbian language], Infoteka, vol. 24, 2024, https://doi.org/10.48550/arXiv.2402.14379. [Online]. Available: https://arxiv.org/abs/2402.14379
  • 18. A. B. Dieng, F. J. Ruiz, and D. M. Blei, “Topic modeling in embedding spaces,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 439–453, 2020, https://doi.org/10.48550/arXiv.1907.04907.
  • 19. D. Newman, J. H. Lau, K. Grieser, and T. Baldwin, “Automatic evaluation of topic coherence,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 100–108.
  • 20. M. Röder, A. Both, and A. Hinneburg, “Exploring the space of topic coherence measures,” in Proceedings of the eighth ACM international conference on Web search and data mining, 2015, pp. 399–408, https://doi.org/10.1145/2684822.2685324.
  • 21. N. Ljubešić and D. Lauc, “BERTić - the transformer language model for Bosnian, Croatian, Montenegrin and Serbian,” in Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing. Kyiv, Ukraine: Association for Computational Linguistics, Apr. 2021, pp. 37–42, https://doi.org/10.48550/arXiv.2104.09243. [Online]. Available: https://www.aclweb.org/anthology/2021.bsnlp-1.5
Notes
1. This research was supported by the Science Fund of the Republic of Serbia, #7276, Text Embeddings - Serbian Language Applications - TESLA.
2. Thematic Sessions: Short Papers
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-e5edbdfc-a6c1-4850-9fe4-b1368b253f1b