Topic modeling is an effective way to gain insight into large amounts of data. Some of the most widely used topic models are Latent Dirichlet Allocation (LDA) and Nonnegative Matrix Factorization (NMF). However, with the rise of self-attention models and pre-trained language models, new ways to mine topics have emerged, and BERTopic represents the current state of the art in topic modeling. In this paper, we compared the performance of LDA, NMF, and BERTopic on literary texts in Serbian by measuring Topic Coherence (TC) and Topic Diversity (TD), as well as by qualitatively evaluating the topics. For BERTopic, we compared multilingual sentence-transformer embeddings with monolingual Jerteh-355 embeddings for Serbian. NMF yielded the best TC, while BERTopic with Jerteh-355 embeddings achieved the best TD; Jerteh-355 also outperformed the sentence-transformer embeddings on both metrics.
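A minimal sketch of the BERTopic setup described above, assuming Python with the bertopic and sentence-transformers packages. The HuggingFace identifier "jerteh/Jerteh-355" and the mean-pooling loading path are assumptions for illustration, and the topic-diversity function follows the common definition (Dieng et al., 2020): the fraction of unique words among the top-k words of all topics.

    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    docs = ["..."]  # literary texts in Serbian (placeholder)

    # Assumed HuggingFace identifier for the monolingual Serbian encoder;
    # sentence-transformers falls back to mean pooling for plain transformer models.
    embedding_model = SentenceTransformer("jerteh/Jerteh-355")
    topic_model = BERTopic(embedding_model=embedding_model)
    topics, probs = topic_model.fit_transform(docs)

    def topic_diversity(model, top_k=10):
        """Fraction of unique words across the top-k words of every topic."""
        words = [
            w
            for t in model.get_topics()
            if t != -1  # skip BERTopic's outlier topic
            for w, _ in model.get_topic(t)[:top_k]
        ]
        return len(set(words)) / len(words) if words else 0.0

    print(topic_diversity(topic_model))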
This paper presents the development of a Named Entity Linking (NEL) model for the Serbian language, named SrpCNNeL, which links entities to the Wikidata knowledge base. The model was trained to recognize and link seven named entity types (persons, locations, organisations, professions, events, demonyms, and works of art) on a dataset containing sentences from novels and legal documents, as well as sentences generated from the Wikidata knowledge base and the Leximirka lexical database. The resulting model demonstrated robust performance, achieving an F1 score of 0.8 on the test set. Since locations are the entity type most frequently linked to the knowledge base in the dataset, an additional evaluation was conducted on an independent dataset for locations only, comparing the model to the baseline spaCy Entity Linker.
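A minimal sketch of the baseline side of that comparison, assuming the spacy-entity-linker package, which resolves spaCy entity mentions to Wikidata items; the pipeline name, example sentence, and printed fields are illustrative, not taken from the paper's evaluation data.

    import spacy

    # Baseline: a spaCy pipeline extended with the spacy-entity-linker component
    # (pip install spacy-entity-linker), which links mentions to Wikidata.
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("entityLinker", last=True)

    doc = nlp("Belgrade is the capital of Serbia.")
    for ent in doc._.linkedEntities:
        # Each linked entity exposes its Wikidata identifier and label.
        print(ent.get_span(), ent.get_id(), ent.get_label())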