Using Transformer models for gender attribution in Polish

Kaczmarek, Karol; Pokrywka, Jakub; Graliński, Filip

doi:10.15439/2022F197

Article - details

Article title

Using Transformer models for gender attribution in Polish

Authors

Kaczmarek Karol, Pokrywka Jakub, Graliński Filip

Selected full texts from this journal

http://annals-csis.org

Identifiers

DOI

10.15439/2022F197

Conference

Federated Conference on Computer Science and Information Systems (17 ; 04-07.09.2022 ; Sofia, Bulgaria)

Abstract

Gender identification is the task of predicting the gender of the author of a given text. Some languages, including Polish, exhibit gender-revealing syntactic expressions. In this paper, we investigate machine learning methods for gender identification in Polish. For the evaluation, we use the large (780M words) corpus “He Said She Said”, created by grepping for gender-revealing syntactic expressions (to identify the author's gender) and then normalizing all of these expressions to the masculine form (to prevent classifiers from using syntactic features). We evaluate TF-IDF-based, fastText, LSTM, and RoBERTa models, distinguishing between self-contained and non-self-contained approaches. We also provide a human baseline. We report large improvements using pre-trained RoBERTa models and discuss the possible contamination of test data for the best pre-trained model.
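
The corpus construction described in the abstract (grep for a gender-revealing form, then normalize it to the masculine form) can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes only one marker type, first-person singular past-tense verbs ending in "-łam" (feminine) or "-łem" (masculine), and the names `FEMININE`, `MASCULINE` and `label_and_normalize` are hypothetical.

```python
import re

# Assumption: a deliberately simplified marker set. First-person singular
# past-tense verbs end in "-łam" (feminine) or "-łem" (masculine).
# Irregular pairs (e.g. "poszłam" / "poszedłem") and non-verb words that
# happen to end in "łam" are NOT handled by this sketch.
FEMININE = re.compile(r"\b(\w+)łam\b")
MASCULINE = re.compile(r"\b(\w+)łem\b")


def label_and_normalize(text):
    """Return (label, normalized_text), or None if there is no usable marker.

    The label comes from the gender-revealing verb form; the returned text
    has feminine forms rewritten to masculine so that a classifier cannot
    simply read the label off the verb ending.
    """
    has_feminine = FEMININE.search(text) is not None
    has_masculine = MASCULINE.search(text) is not None
    if has_feminine == has_masculine:
        # Either no marker at all or conflicting markers: skip the text.
        return None
    label = "F" if has_feminine else "M"
    normalized = FEMININE.sub(r"\1łem", text)
    return label, normalized


if __name__ == "__main__":
    print(label_and_normalize("Wczoraj byłam w kinie i kupiłam bilet."))
```

After normalization, a feminine and a masculine text differ only in content and style, which is the property the abstract relies on when it says that classifiers are prevented from using syntactic features.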

Keywords

training, computer science, code, machine learning, syntactics, transformer, data model


Publisher

Polskie Towarzystwo Informatyczne

Journal

Annals of Computer Science and Information Systems

Year

2022

Volume

Vol. 30

Pages

73–77

Physical description

Bibliography: 29 items, tables.

Contributors

author

Kaczmarek Karol

karol.kaczmarek@amu.edu.pl

Adam Mickiewicz University, Faculty of Mathematics and Computer Science
Applica.ai Sp. z o.o.

author

Pokrywka Jakub

jakub.pokrywka@amu.edu.pl

Adam Mickiewicz University, Faculty of Mathematics and Computer Science, Uniwersytetu Poznańskiego 4, 61-614 Poznań, Poland

author

Graliński Filip

filip.gralinski@amu.edu.pl

Adam Mickiewicz University, Faculty of Mathematics and Computer Science, Uniwersytetu Poznańskiego 4, 61-614 Poznań, Poland

Bibliography

1. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
2. B. Bsir and M. Zrigui. Bidirectional LSTM for author gender identification. In International Conference on Computational Collective Intelligence, pages 393–402. Springer, 2018.
3. T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM.
4. A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint https://arxiv.org/abs/1911.02116, 2019.
5. C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
6. S. Dadas, M. Perełkiewicz, and R. Poświata. Pre-training Polish Transformer-Based language models at scale. In L. Rutkowski, R. Scherer, M. Korytkowski, W. Pedrycz, R. Tadeusiewicz, and J. M. Zurada, editors, Artificial Intelligence and Soft Computing, pages 301–314, Cham, 2020. Springer International Publishing.
7. J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805, 2018.
8. Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. Proceedings of The 33rd International Conference on Machine Learning, 06 2015.
9. A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision, 2009.
10. F. Graliński, Ł. Borchmann, and P. Wierzchoń. “He Said She Said” — a male/female corpus of Polish. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, 2016. European Language Resources Association (ELRA).
11. F. Graliński, R. Jaworski, Ł. Borchmann, and P. Wierzchoń. Gonito.net – open platform for research competition, cooperation and reproducibility. In A. Branco, N. Calzolari, and K. Choukri, editors, Proceedings of the 4REAL Workshop, pages 13–20. 2016.
12. F. Graliński, R. Jaworski, Ł. Borchmann, and P. Wierzchoń. Vive la petite différence! Exploiting small differences for gender attribution of short texts. Lecture Notes in Artificial Intelligence, 9924:54–61, 2016.
13. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
14. S. Hussein, M. Farouk, and E. Hemayed. Gender identification of Egyptian dialect in Twitter. Egyptian Informatics Journal, 20(2):109–116, 2019.
15. A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. arXiv preprint https://arxiv.org/abs/1607.01759, 2016.
16. D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.
17. D. Kodiyan, F. Hardegger, S. Neuhaus, and M. Cieliebak. Author Profiling with bidirectional RNNs using Attention with GRUs: Notebook for PAN at CLEF 2017. In CLEF 2017 Evaluation Labs and Workshop–Working Notes Papers, Dublin, Ireland, 11-14 September 2017, volume 1866. RWTH Aachen, 2017.
18. S. Krüger and B. Hermann. Can an online service predict gender? On the state-of-the-art in gender identification from texts. In 2019 IEEE/ACM 2nd International Workshop on Gender Equality in Software Engineering (GE), pages 13–16. IEEE, 2019.
19. T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In E. Blanco and W. Lu, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 66–71. Association for Computational Linguistics, 2018.
20. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692, 2019.
21. M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011. Association for Computational Linguistics, 2009.
22. M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
23. J. Read. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL student research workshop, pages 43–48, 2005.
24. A. Siewierska. Gender distinctions in independent personal pronouns. In M. S. Dryer and M. Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig, 2013.
25. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, Jan. 2014.
26. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
27. R. Veenhoven, S. Snijders, D. van der Hall, and R. van Noord. Using translated data to improve deep learning author profiling models. In Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), volume 2125, 2018.
28. A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint https://arxiv.org/abs/1905.00537, 2019.
29. A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR, 2019.

Notes

1. Short article

2. Track 1: 17th International Symposium on Advanced Artificial Intelligence in Applications

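3. Record prepared with funding from MEiN (the Polish Ministry of Education and Science), agreement no. SONP/SP/546092/2022, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: Popularisation of science and promotion of sport (2022-2023).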

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-564d052c-d29d-4b4e-833e-7302c3be8190