Extending Word2Vec with domain-specific labels

Švaňa, Miloš

doi:10.15439/2022F37

Artykuł - szczegóły

Tytuł artykułu

Extending Word2Vec with domain-specific labels

Autorzy

Švaňa Miloš

Wybrane pełne teksty z tego czasopisma

http://annals-csis.org

Identyfikatory

DOI

10.15439/2022F37

Warianty tytułu

Konferencja

Federated Conference on Computer Science and Information Systems (17 ; 04-07.09.2022 ; Sofia, Bulgaria)

Języki publikacji

Abstrakty

Choosing a proper representation of textual data is an important part of natural language processing. One option is using Word2Vec embeddings, i.e., dense vectors whose properties can to a degree capture the “meaning” of each word. One of the main disadvantages of Word2Vec is its inability to distinguish between antonyms. Motivated by this deficiency, this paper presents a Word2Vec extension for incorporating domain-specific labels. The goal is to improve the ability to differentiate between embeddings of words associated with different document labels or classes. This improvement is demonstrated on word embeddings derived from tweets related to a publicly traded company. Each tweet is given a label depending on whether its publication coincides with a stock price increase or decrease. The extended Word2Vec model then takes this label into account. The user can also set the weight of this label in the embedding creation process. Experiment results show that increasing this weight leads to a gradual decrease in cosine similarity between embeddings of words associated with different labels. This decrease in similarity can be interpreted as an improvement of the ability to distinguish between these words.

Słowa kluczowe

social networking computational modeling company production predictive model natural language processing thesaurus

sieci społecznościowe modelowanie obliczeniowe firma produkcja model predykcyjny przetwarzanie języka naturalnego tezaurus

Wydawca

Polskie Towarzystwo Informatyczne

Czasopismo

Annals of Computer Science and Information Systems

Rocznik

2022

Tom

Vol. 30

Strony

157--160

Opis fizyczny

Bibliogr. 14 poz., wz., tab.

Twórcy

autor

Švaňa Miloš

milos.svana@vsb.cz

VSB - Technical University of Ostrava Department of Systems Engineering 17. listopadu 2172/15, 708 00 Ostrava-Poruba, Czechia

Bibliografia

1. M. G. Agudo. An analysis of word embedding spaces and regularities. 2019.
2. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, and T. Henighan. Language models are few-shot learners. page 25, 2020.
3. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North, pages 4171–4186. Association for Computational Linguistics, 2019. http://dx.doi.org/10.18653/v1/N19-1423. URL http://aclweb.org/anthology/N19-1423.
4. Z. Dou, W. Wei, and X. Wan. Improving word embeddings for antonym detection using thesauri and SentiWordNet. In M. Zhang, V. Ng, D. Zhao, S. Li, and H. Zan, editors, Natural Language Processing and Chinese Computing, volume 11109, pages 67–79. Springer International Publishing, 2018. ISBN 978-3-319-99500-7 978-3-319-99501-4. http://dx.doi.org/10.1007/978-3-319-99501-4_6. URL http://link.springer.com/10. 1007/978-3-319-99501-4_6. Series Title: Lecture Notes in Computer Science.
5. A. Handler. An empirical study of semantic similarity in WordNet and word2vec. page 23, 2014.
6. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. 2013. URL http://arxiv.org/abs/1301.3781.
7. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. 2013. URL http://arxiv.org/abs/1310.4546.
8. M. Ono, M. Miwa, and Y. Sasaki. Word embedding-based antonym detection using thesauri and distributional information. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 984–989. Association for Computational Linguistics, 2015. http://dx.doi.org/10.3115/v1/N15-1100. URL http://aclweb.org/anthology/N15-1100.
9. M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. 2018. URL http://arxiv.org/abs/1802.05365.
10. I. Samenko, A. Tikhonov, and I. P. Yamshchikov. Intuitive contrasting map for antonym embeddings. 2021. URL http://arxiv.org/abs/2004.12835.
11. Y. Shao, S. Taylor, N. Marshall, C. Morioka, and Q. Zeng-Treitler. Clinical text classification with word embedding features vs. bag-of-words features. In 2018 IEEE International Conference on Big Data (Big Data), pages 2874–2878. IEEE, 2018. ISBN 978-1-5386-5035-6. http://dx.doi.org/10.1109/BigData.2018.8622345. URL https: //ieeexplore.ieee.org/document/8622345/.
12. M. R. Vargas, B. S. L. P. de Lima, and A. G. Evsukoff. Deep learning for stock market prediction from financial news articles. In 2017 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA), pages 60–65. IEEE, 2017. ISBN 978-1-5090-4253-1. http://dx.doi.org/10.1109/CIVEMSA.2017.7995302. URL http://ieeexplore.ieee.org/document/7995302/.
13. H.-Y. Yeh, Y.-C. Yeh, and D.-B. Shen. Word vector models approach to text regression of financial risk prediction. 12(1):89, 2020. ISSN 2073-8994. http://dx.doi.org/10.3390/sym12010089. URL https://www.mdpi.com/ 2073-8994/12/1/89.
14. L. Zhang, J. Li, and C. Wang. Automatic synonym extraction using word2vec and spectral clustering. In 2017 36th Chinese Control Conference (CCC), pages 5629–5632. IEEE, 2017. ISBN 978-988-15639-3-4. http://dx.doi.org/10.23919/ChiCC.2017.8028251. URL http://ieeexplore.ieee.org/document/8028251/.

Uwagi

1. This paper was supported by the SGS project No. SP2022/113. This support is gratefully acknowledged.

2. Short article

3. Track 1: 17th International Symposium on Advanced Artificial Intelligence in Applications

4. Opracowanie rekordu ze środków MEiN, umowa nr SONP/SP/546092/2022 w ramach programu "Społeczna odpowiedzialność nauki" - moduł: Popularyzacja nauki i promocja sportu (2022-2023).

Typ dokumentu

Bibliografia

Identyfikator YADDA

bwmeta1.element.baztech-67712cb8-4389-43a6-b1ac-567b195c4cc7