Article title

Applying SoftTriple loss for supervised language model fine tuning

Conference
Federated Conference on Computer Science and Information Systems (17 ; 04-07.09.2022 ; Sofia, Bulgaria)
Publication languages
EN
Abstracts
EN
We introduce TripleEntropy, a new loss function based on cross-entropy and SoftTriple loss, to improve classification performance when fine-tuning general-knowledge pre-trained language models. This loss function improves the robust RoBERTa baseline model fine-tuned with cross-entropy loss by about 0.02-2.29 percentage points. Thorough tests on popular datasets using our loss function indicate a steady gain. The fewer samples in the training dataset, the higher the gain: about 0.71 percentage points for small-sized datasets, 0.86 for medium-sized, 0.20 for large, and 0.04 for extra-large ones.
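The abstract describes TripleEntropy as a combination of cross-entropy on classifier logits with a SoftTriple term computed on the encoder's sentence embeddings. A minimal PyTorch sketch of one way such a combined objective could look is given below; the class names, the mixing weight beta, and all hyperparameter defaults are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftTripleLoss(nn.Module):
    # Simplified SoftTriple loss (Qian et al., 2019): K learnable centers per
    # class; the original regularizer on center similarity is omitted for brevity.
    def __init__(self, dim, n_classes, k=10, scale=20.0, gamma=0.1, margin=0.01):
        super().__init__()
        self.n_classes, self.k = n_classes, k
        self.scale, self.gamma, self.margin = scale, gamma, margin
        self.centers = nn.Parameter(torch.empty(dim, n_classes * k))
        nn.init.kaiming_uniform_(self.centers)

    def forward(self, embeddings, labels):
        # Cosine similarity between each embedding and every class center.
        x = F.normalize(embeddings, dim=1)
        c = F.normalize(self.centers, dim=0)
        sim = (x @ c).view(-1, self.n_classes, self.k)
        # Soft assignment over the K centers of each class.
        weights = F.softmax(sim / self.gamma, dim=2)
        class_sim = (weights * sim).sum(dim=2)  # (batch, n_classes)
        # Subtract a small margin from the similarity to the true class.
        delta = torch.zeros_like(class_sim).scatter_(1, labels.unsqueeze(1), self.margin)
        return F.cross_entropy(self.scale * (class_sim - delta), labels)

class TripleEntropyLoss(nn.Module):
    # Weighted sum of standard cross-entropy on classifier logits and the
    # SoftTriple term on pooled encoder embeddings; beta is an assumed mixing weight.
    def __init__(self, dim, n_classes, beta=0.5):
        super().__init__()
        self.soft_triple = SoftTripleLoss(dim, n_classes)
        self.beta = beta

    def forward(self, logits, embeddings, labels):
        ce = F.cross_entropy(logits, labels)
        st = self.soft_triple(embeddings, labels)
        return (1.0 - self.beta) * ce + self.beta * st

During fine-tuning, the classifier logits and the pooled encoder embedding for each batch would be passed to TripleEntropyLoss together with the labels, and the returned scalar back-propagated as usual.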
Pages
141-147
Physical description
Bibliography: 30 items, formulas, tables, illustrations, charts.
Authors
  • Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  • Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  • Faculty of Electronics and Information Technology, Warsaw University of Technology, Warsaw, Poland
Bibliography
  • 1. L. White, R. Togneri, W. Liu, and M. Bennamoun, “How well sentence embeddings capture meaning,” in Proceedings of the 20th Australasian document computing symposium, 2015, pp. 1–8. [Online]. Available: https://doi.org/10.1145/2838931.2838932
  • 2. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • 3. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018. [Online]. Available: https://doi.org/10.48550/arXiv.1810.04805
  • 4. Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin, “SoftTriple loss: Deep metric learning without triplet sampling,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6450–6458. [Online]. Available: https://doi.org/10.48550/arXiv.1909.05235
  • 5. C. Parsing, “Speech and language processing,” 2009.
  • 6. T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013. [Online]. Available: https://doi.org/10.48550/arXiv.1301.3781
  • 7. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019. [Online]. Available: https://doi.org/10.48550/arXiv.1907.11692
  • 8. S. Dadas, M. Perełkiewicz, and R. Poświata, “Pre-training polish transformer-based language models at scale,” in International Conference on Artificial Intelligence and Soft Computing. Springer, 2020, pp. 301–314. [Online]. Available: https://doi.org/10.1007/978-3-030-61534-5_27
  • 9. K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification.” Journal of machine learning research, vol. 10, no. 2, 2009.
  • 10. E. Xing, M. Jordan, S. J. Russell, and A. Ng, “Distance metric learning with application to clustering with side-information,” Advances in neural information processing systems, vol. 15, pp. 521–528, 2002.
  • 11. S. Wu, X. Feng, and F. Zhou, “Metric learning by similarity network for deep semi-supervised learning,” in Developments of Artificial Intelligence Technologies in Computation and Robotics: Proceedings of the 14th International FLINS Conference (FLINS 2020). World Scientific, 2020, pp. 995–1002. [Online]. Available: https://doi.org/10.48550/arXiv.2004.14227
  • 12. Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh, “No fuss distance metric learning using proxies,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 360–368. [Online]. Available: https://doi.org/10.48550/arXiv.1703.07464
  • 13. Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European conference on computer vision. Springer, 2016, pp. 499–515. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-46478-7_31
  • 14. R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1735–1742. [Online]. Available: https://doi.org/10.1109/CVPR.2006.100
  • 15. F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823. [Online]. Available: https://doi.org/10.1109/CVPR.2015.7298682
  • 16. T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607. [Online]. Available: https://doi.org/10.48550/arXiv.2002.05709
  • 17. A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017.
  • 18. B. Skuczyńska, S. Shaar, J. Spenader, and P. Nakov, “BeaSku at CheckThat! 2021: Fine-tuning sentence BERT with triplet loss and limited data,” Faggioli et al., 2021.
  • 19. I. Malkiel, D. Ginzburg, O. Barkan, A. Caciularu, Y. Weill, and N. Koenigstein, “MetricBERT: Text representation learning via self-supervised triplet training,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 1–5. [Online]. Available: https://doi.org/10.1109/ICASSP43922.2022.9746018
  • 20. M. Lennox, N. Robertson, and B. Devereux, “Deep learning proteins using a triplet-BERT network,” in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2021, pp. 4341–4347. [Online]. Available: https://doi.org/10.1109/embc46164.2021.9630387
  • 21. B. Gunel, J. Du, A. Conneau, and V. Stoyanov, “Supervised contrastive learning for pre-trained language model fine-tuning,” arXiv preprint arXiv:2011.01403, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2011.01403
  • 22. A. Conneau and D. Kiela, “SentEval: An evaluation toolkit for universal sentence representations,” arXiv preprint arXiv:1803.05449, 2018. [Online]. Available: https://doi.org/10.48550/arXiv.1803.05449
  • 23. A. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, 2011, pp. 142–150.
  • 24. R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1631–1642.
  • 25. B. Pang and L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales,” arXiv preprint cs/0506075, 2005. [Online]. Available: http://dx.doi.org/10.3115/1219840.1219855
  • 26. J. Wiebe, T. Wilson, and C. Cardie, “Annotating expressions of opinions and emotions in language,” Language resources and evaluation, vol. 39, no. 2, pp. 165–210, 2005. [Online]. Available: https://doi.org/10.1007/s10579-005-7880-9
  • 27. B. Pang and L. Lee, “A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts,” arXiv preprint cs/0409058, 2004. [Online]. Available: https://doi.org/10.3115/1218955.1218990
  • 28. M. Hu and B. Liu, “Mining and summarizing customer reviews,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004, pp. 168–177. [Online]. Available: https://doi.org/10.1145/1014052.1014073
  • 29. W. B. Dolan, C. Quirk, and C. Brockett, “Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources,” 2004.
  • 30. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. [Online]. Available: https://doi.org/10.48550/arXiv.1412.6980
Notes
1. The research was funded by the Centre for Priority Research Area Artificial Intelligence and Robotics of Warsaw University of Technology within the Excellence Initiative: Research University (IDUB) programme (grant no. 1820/27/Z01/POB2/2021).
2. Track 1: 17th International Symposium on Advanced Artificial Intelligence in Applications
3. Record prepared with funds from the Ministry of Education and Science (MEiN), agreement no. SONP/SP/546092/2022, under the "Społeczna odpowiedzialność nauki" (Social Responsibility of Science) programme - module: Popularisation of science and promotion of sport (2022-2023).
YADDA identifier
bwmeta1.element.baztech-eb181a51-8c39-4993-9907-548178a43cc3