Article title

Applying SoftTriple loss for supervised language model fine tuning

Conference
Federated Conference on Computer Science and Information Systems (17 ; 04-07.09.2022 ; Sofia, Bulgaria)
Publication languages
EN
Abstracts
EN
We introduce TripleEntropy, a new loss function based on cross-entropy and SoftTriple loss, to improve classification performance when fine-tuning general-knowledge pre-trained language models. This loss function improves the robust RoBERTa baseline model fine-tuned with cross-entropy loss by about 0.02-2.29 percentage points. Thorough tests on popular datasets using our loss function indicate a steady gain. The fewer samples in the training dataset, the higher the gain: about 0.71 percentage points for small-sized datasets, 0.86 for medium-sized, 0.20 for large, and 0.04 for extra-large ones.
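The abstract describes TripleEntropy as a combination of cross-entropy on classifier logits with a SoftTriple term computed on the encoder's sentence embeddings. A minimal PyTorch sketch of one way such a combined objective could look is given below; the class names, the mixing weight beta, and all hyperparameter defaults are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftTripleLoss(nn.Module):
    # Simplified SoftTriple loss (Qian et al., 2019): K learnable centers per
    # class; the original regularizer on center similarity is omitted for brevity.
    def __init__(self, dim, n_classes, k=10, scale=20.0, gamma=0.1, margin=0.01):
        super().__init__()
        self.n_classes, self.k = n_classes, k
        self.scale, self.gamma, self.margin = scale, gamma, margin
        self.centers = nn.Parameter(torch.empty(dim, n_classes * k))
        nn.init.kaiming_uniform_(self.centers)

    def forward(self, embeddings, labels):
        # Cosine similarity between each embedding and every class center.
        x = F.normalize(embeddings, dim=1)
        c = F.normalize(self.centers, dim=0)
        sim = (x @ c).view(-1, self.n_classes, self.k)
        # Soft assignment over the K centers of each class.
        weights = F.softmax(sim / self.gamma, dim=2)
        class_sim = (weights * sim).sum(dim=2)  # (batch, n_classes)
        # Subtract a small margin from the similarity to the true class.
        delta = torch.zeros_like(class_sim).scatter_(1, labels.unsqueeze(1), self.margin)
        return F.cross_entropy(self.scale * (class_sim - delta), labels)

class TripleEntropyLoss(nn.Module):
    # Weighted sum of standard cross-entropy on classifier logits and the
    # SoftTriple term on pooled encoder embeddings; beta is an assumed mixing weight.
    def __init__(self, dim, n_classes, beta=0.5):
        super().__init__()
        self.soft_triple = SoftTripleLoss(dim, n_classes)
        self.beta = beta

    def forward(self, logits, embeddings, labels):
        ce = F.cross_entropy(logits, labels)
        st = self.soft_triple(embeddings, labels)
        return (1.0 - self.beta) * ce + self.beta * st

During fine-tuning, the classifier logits and the pooled encoder embedding for each batch would be passed to TripleEntropyLoss together with the labels, and the returned scalar back-propagated as usual.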
Pages
141-147
Physical description
Bibliography: 30 items, formulas, tables, illustrations, charts.
Authors
  • Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  • Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Warsaw, Poland
Bibliography
  • 1. L. White, R. Togneri, W. Liu, and M. Bennamoun, “How well sentence embeddings capture meaning,” in Proceedings of the 20th Australasian document computing symposium, 2015, pp. 1–8. [Online]. Available: https://doi.org/10.1145/2838931.2838932
  • 2. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • 3. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018. [Online]. Available: https://doi.org/10.48550/arXiv.1810.04805
  • 4. Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin, “SoftTriple loss: Deep metric learning without triplet sampling,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6450–6458. [Online]. Available: https://doi.org/10.48550/arXiv.1909.05235
  • 5. C. Parsing, “Speech and language processing,” 2009.
  • 6. T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013. [Online]. Available: https://doi.org/10.48550/arXiv.1301.3781
  • 7. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1907.11692
  • 8. S. Dadas, M. Perełkiewicz, and R. Poświata, “Pre-training polish transformer-based language models at scale,” in International Conference on Artificial Intelligence and Soft Computing. Springer, 2020, pp. 301–314. [Online]. Available: https://doi.org/10.1007/978-3-030-61534-5_27
  • 9. K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification.” Journal of machine learning research, vol. 10, no. 2, 2009.
  • 10. E. Xing, M. Jordan, S. J. Russell, and A. Ng, “Distance metric learning with application to clustering with side-information,” Advances in neural information processing systems, vol. 15, pp. 521–528, 2002.
  • 11. S. Wu, X. Feng, and F. Zhou, “Metric learning by similarity network for deep semi-supervised learning,” in Developments of Artificial Intelligence Technologies in Computation and Robotics: Proceedings of the 14th International FLINS Conference (FLINS 2020). World Scientific, 2020, pp. 995–1002. [Online]. Available: https://doi.org/10.48550/arXiv.2004.14227
  • 12. Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh, “No fuss distance metric learning using proxies,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 360–368. [Online]. Available: https://doi.org/10.48550/arXiv.1703.07464
  • 13. Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision. Springer, 2016, pp. 499–515. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-46478-7_31
  • 14. R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1735–1742. [Online]. Available: https://doi.org/10.1109/CVPR.2006.100
  • 15. F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823. [Online]. Available: https://doi.org/10.1109/CVPR.2015.7298682
  • 16. T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607. [Online]. Available: https://doi.org/10.48550/arXiv.2002.05709
  • 17. A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017.
  • 18. B. Skuczyńska, S. Shaar, J. Spenader, and P. Nakov, “BeaSku at CheckThat! 2021: Fine-tuning sentence BERT with triplet loss and limited data,” Faggioli et al., 2021.
  • 19. I. Malkiel, D. Ginzburg, O. Barkan, A. Caciularu, Y. Weill, and N. Koenigstein, “MetricBERT: Text representation learning via self-supervised triplet training,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 1–5. [Online]. Available: https://doi.org/10.1109/ICASSP43922.2022.9746018
  • 20. M. Lennox, N. Robertson, and B. Devereux, “Deep learning proteins using a triplet-BERT network,” in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2021, pp. 4341–4347. [Online]. Available: https://doi.org/10.1109/embc46164.2021.9630387
  • 21. B. Gunel, J. Du, A. Conneau, and V. Stoyanov, “Supervised contrastive learning for pre-trained language model fine-tuning,” arXiv preprint arXiv:2011.01403, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2011.01403
  • 22. A. Conneau and D. Kiela, “SentEval: An evaluation toolkit for universal sentence representations,” arXiv preprint arXiv:1803.05449, 2018. [Online]. Available: https://doi.org/10.48550/arXiv.1803.05449
  • 23. A. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, 2011, pp. 142–150.
  • 24. R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1631–1642.
  • 25. B. Pang and L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales,” arXiv preprint cs/0506075, 2005. [Online]. Available: http://dx.doi.org/10.3115/1219840.1219855
  • 26. J. Wiebe, T. Wilson, and C. Cardie, “Annotating expressions of opinions and emotions in language,” Language resources and evaluation, vol. 39, no. 2, pp. 165–210, 2005. [Online]. Available: https://doi.org/10.1007/s10579-005-7880-9
  • 27. B. Pang and L. Lee, “A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts,” arXiv preprint cs/0409058, 2004. [Online]. Available: https://doi.org/10.3115/1218955.1218990
  • 28. M. Hu and B. Liu, “Mining and summarizing customer reviews,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004, pp. 168–177. [Online]. Available: https://doi.org/10.1145/1014052.1014073
  • 29. W. B. Dolan, C. Quirk, and C. Brockett, “Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources,” 2004.
  • 30. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. [Online]. Available: https://doi.org/10.48550/arXiv.1412.6980
Notes
1. The research was funded by the Centre for Priority Research Area Artificial Intelligence and Robotics of Warsaw University of Technology within the Excellence Initiative: Research University (IDUB) programme (grant no. 1820/27/Z01/POB2/2021).
2. Track 1: 17th International Symposium on Advanced Artificial Intelligence in Applications
3. Record prepared with funds from the Ministry of Education and Science (MEiN), agreement no. SONP/SP/546092/2022, under the "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) programme - module: Popularisation of science and promotion of sport (2022-2023).
YADDA identifier
bwmeta1.element.baztech-eb181a51-8c39-4993-9907-548178a43cc3