Using Transformer models for gender attribution in Polish

Kaczmarek, Karol; Pokrywka, Jakub; Graliński, Filip

doi:10.15439/2022F197

Article details

Selected full texts from this journal

http://annals-csis.org

Abstract

Gender identification is the task of predicting the gender of the author of a given text. Some languages, including Polish, exhibit gender-revealing syntactic expressions. In this paper, we investigate machine learning methods for gender identification in Polish. For the evaluation, we use the large (780M words) corpus “He said she said”, which was created by grepping for gender-revealing syntactic expressions (to identify the author's gender) and normalizing all of these expressions to the masculine form (to prevent classifiers from relying on syntactic features). In this work, we evaluate TF-IDF-based, fastText, LSTM, and RoBERTa models, distinguishing between self-contained and non-self-contained approaches. We also provide a human baseline. We report large improvements using pre-trained RoBERTa models and discuss the possible contamination of test data for the best pre-trained model.
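The corpus construction described in the abstract (grep for a gender-revealing form, then rewrite it to the masculine form) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `label_and_normalize` and the single rule used here (first-person singular past-tense endings, feminine "-łam" versus masculine "-łem") are assumptions made for illustration only. The rule is deliberately naive: it ignores irregular stems (for example "poszłam" should become "poszedłem") and unrelated words that happen to end in these letters.

```python
import re

# Hypothetical heuristic, NOT the authors' rules: first-person singular
# past-tense verbs end in "-łam" (feminine) or "-łem" (masculine).
FEMININE = re.compile(r"\b(\w+)łam\b")
MASCULINE = re.compile(r"\b(\w+)łem\b")


def label_and_normalize(text):
    """Return (label, normalized_text), or None if no clear gender cue.

    label is "F" or "M"; normalized_text has every feminine form
    rewritten to the masculine ending, so a classifier trained on it
    cannot rely on the ending itself.
    """
    has_feminine = bool(FEMININE.search(text))
    has_masculine = bool(MASCULINE.search(text))
    # Skip texts with no cue or with conflicting cues.
    if has_feminine == has_masculine:
        return None
    label = "F" if has_feminine else "M"
    normalized = FEMININE.sub(r"\1łem", text)
    return label, normalized


if __name__ == "__main__":
    print(label_and_normalize("Wczoraj byłam w kinie i kupiłam bilet."))
    print(label_and_normalize("To jest dom."))
```

The label is derived from the surface form before normalization, and only the normalized text would be passed to a classifier, which mirrors the separation between labeling and feature removal that the abstract describes.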

Publisher

Polskie Towarzystwo Informatyczne

Journal

Annals of Computer Science and Information Systems

Year

2022

Volume

Vol. 30

Pages

73–77

Physical description

Bibliography: 29 items.

Contributors

author

Kaczmarek Karol

author

Pokrywka Jakub

author

Graliński Filip

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-564d052c-d29d-4b4e-833e-7302c3be8190