Image caption generation using transfer learning

Kopiński, Radosław; Antczak, Karol

doi:10.5604/01.3001.0053.9697

Article - details

Article title

Image caption generation using transfer learning

Authors

Kopiński Radosław , Antczak Karol

Content

Full texts:

R. KOPIŃSKI, K. ANTCZAK_image_csmm_15_16_2022.pdf

Identifiers

DOI

10.5604/01.3001.0053.9697

Title variants

Generowanie podpisów na podstawie zdjęć z użyciem uczenia transferowego

Publication languages

Abstracts

This paper describes an image caption generation system based on deep neural networks. The model is trained to maximize the probability of the generated sentence given the image. It uses transfer learning in the form of pretrained convolutional neural networks to preprocess the image data. Each dataset consists of still photographs, each paired with five captions in English. The constructed model is compared with other similarly constructed models using the BLEU score, and ways to further improve its performance are proposed.

W tym artykule opisano system generujący podpisy do zdjęć z wykorzystaniem głębokich sieci neuronowych. Model jest trenowany pod kątem maksymalizacji prawdopodobieństwa wygenerowanego zdania, dla zadanego obrazu. Model wykorzystuje uczenie transferowe w postaci wytrenowanych wstępnie neuronowych sieci konwolucyjnych. Zbiory danych wykorzystane do trenowania modelu składają się z fotografii, oraz przypisanych do niej pięciu zdań w języku angielskim. Skonstruowany model jest potem porównany z innymi modelami o podobnej konstrukcji z wykorzystaniem punktacji BLEU.
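The abstract says the model is evaluated with the BLEU score against several reference captions per image. As a rough illustration of what that metric measures, here is a minimal sketch of sentence-level BLEU in plain Python. It assumes whitespace tokenization, uniform n-gram weights and no smoothing. The function names (`ngrams`, `modified_precision`, `bleu`) are hypothetical and are not taken from the paper, whose evaluation code is not shown in this record.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # Each n-gram is credited at most as often as it appears
    # in the single reference that contains it most often.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing (assumption)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: compare against the reference length closest
    # to the candidate length (ties go to the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)


# Hypothetical captions, five references per image as in the described datasets.
references = [
    "a dog runs on the grass".split(),
    "a brown dog is running outside".split(),
    "the dog runs across a field".split(),
    "a dog running through green grass".split(),
    "a puppy plays on the lawn".split(),
]
candidate = "a dog runs on the green grass".split()
score = bleu(candidate, references)
```

Published results are usually corpus-level BLEU-1 to BLEU-4 computed over a whole test set, often with an established library implementation, so scores from this sentence-level sketch are not directly comparable to figures reported in the literature.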

Keywords

neural networks, NLP, caption generation, machine learning, computer vision, deep learning, transfer learning

sieci neuronowe, generowanie podpisów, uczenie maszynowe, widzenie komputerowe, głębokie uczenie, uczenie transferowe

Publisher

Institute of Computer and Information Systems, Faculty of Cybernetics, Military University of Technology

Journal

Computer Science and Mathematical Modelling

Year

2022

Volume

No. 15-16

Pages

7-12

Physical description

Bibliography: 13 items, illustrations, figures, tables.

Contributors

author

Kopiński Radosław

radoslaw.kopinski@student.wat.edu.pl

student, Military University of Technology, Faculty of Cybernetics, Institute of Computer and Information Systems, Kaliskiego St. 2, 00-908 Warsaw, Poland

author

Antczak Karol

karol.antczak@wat.edu.pl

Military University of Technology, Faculty of Cybernetics, Institute of Computer and Information Systems, Kaliskiego St. 2, 00-908 Warsaw, Poland

https://orcid.org/0000-0002-1455-6347

Bibliography

[1] Farhadi A., et al., „Every picture tells a story: Generating sentences from images”, Computer Vision - ECCV 2010, LNCS 6314, pp. 15-29, Springer 2010.
[2] Mitchell M., et al., „Midge: Generating image descriptions from computer vision detections”, in: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 747-756, April 2012.
[3] Bai S., An S., „A survey on automatic image caption generation”, Neurocomputing, Vol. 311, 291-304 (2018).
[4] Mikolov T., Chen K., Corrado G., Dean J., „Efficient estimation of word representations in vector space”, arXiv preprint arXiv:1301.3781, September 2013.
[5] Tanti M., et al., „Where to put the image in an image caption generator”, Natural Language Engineering, Vol. 24(3), 467-489 (2018).
[6] Simonyan K., et al., „Very deep convolutional networks for large-scale image recognition”, CoRR, abs/1409.1556v6 (2014).
[7] Kingma D.P., Ba J., „Adam: A method for stochastic optimization”, CoRR, abs/1412.6980v9 (2017).
[8] Abadi M., et al., „TensorFlow: Large-scale machine learning on heterogeneous systems”, November 2015, Software available from www.tensorflow.org.
[9] Chollet F., et al., Keras, 2015, https://keras.io.
[10] Karpathy A., Fei-Fei L., „Deep Visual-Semantic Alignments for Generating Image Descriptions”, CoRR, abs/1412.2306 (2014).
[11] Vinyals O., et al., „Show and Tell: A Neural Image Caption Generator”, CoRR, abs/1411.4555 (2014).
[12] Xu K., et al., „Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, CoRR, abs/1502.03044 (2015).
[13] Liu Z., et al., „A ConvNet for the 2020s”, CoRR, abs/2201.03545 (2022).

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-c654344f-52f6-454c-871f-2f55d0c682d0