Training CNN classifiers solely on webly data

Lewy, Dominik; Mandziuk, Jacek

doi:10.2478/jaiscr-2023-0005

Abstract

Real-life applications of deep learning (DL) are often limited by the lack of expert-labeled data required to train DL models effectively. Creating such data usually requires a substantial amount of time for manual categorization, which is costly and is considered one of the major impediments to the development of DL methods in many areas. This work proposes a classification approach that completely removes the need for costly expert-labeled data and instead uses noisy web data created by users who are not subject matter experts. The experiments are performed with two well-known Convolutional Neural Network (CNN) architectures, VGG16 and ResNet50, trained on three randomly collected Instagram-based sets of images from three distinct domains: metropolitan cities, popular food, and common objects. The last two sets were compiled by the authors and made freely available to the research community. The dataset containing common objects is a webly counterpart of the PascalVOC2007 set. It is demonstrated that, despite a significant amount of label noise in the training data, the proposed approach paired with a standard CNN training protocol leads to high classification accuracy on representative data in all three above-mentioned domains. Additionally, two straightforward procedures for automatically cleaning the data before its use in the training process are proposed. Data cleaning does not appear to improve the results, which suggests that the presence of noise in webly data is actually helpful in learning meaningful and robust class representations. Manual inspection of a subset of web-based test data shows that the labels assigned to many images are ambiguous even for humans. It is our conclusion that, for the datasets and CNN architectures used in this paper, when training with webly data, a major factor contributing to the final classification accuracy is the representativeness of the test data rather than the application of data cleaning procedures.

Keywords

classification; webly data; InstaFood1M; InstaCities1M; InstaPascal2M

Publisher

University of Social Sciences

Journal

Journal of Artificial Intelligence and Soft Computing Research

Year

2023

Volume

Vol. 13, No. 1

Pages

75–92

Physical description

Bibliography: 56 items; figures.

Authors

author

Lewy Dominik

Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland

https://orcid.org/0000-0003-2107-4909

author

Mandziuk Jacek

mandziuk@mini.pw.edu.pl

Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland

https://orcid.org/0000-0003-0947-028X

Bibliography

[1] J. A. Aghamaleki and S. M. Baharlou. Transfer learning approach for classification and noise reduction on noisy web data. Expert Syst. Appl., 105:221–232, 2018.
[2] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 819–826. IEEE Computer Society, 2013.
[3] S. Bai and S. An. A survey on automatic image caption generation. Neurocomputing, 311:291–304, 2018.
[4] A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada, pages 181–189. Curran Associates, Inc., 2010.
[5] J. Bohlke, D. Korsch, P. Bodesheim, and J. Denzler. Lightweight filtering of noisy web data: Augmenting fine-grained datasets with selected internet images. In G. M. Farinella, P. Radeva, J. Braz, and K. Bouatouch, editors, Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2021, Volume 5: VISAPP, Online Streaming, February 8-10, 2021, pages 466–477. SCITEPRESS, 2021.
[6] L. Bossard, M. Guillaumin, and L. Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.
[7] X. Chen and A. Gupta. Webly supervised learning of convolutional networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1431–1439. IEEE Computer Society, 2015.
[8] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2758–2766. IEEE Computer Society, 2015.
[9] T. Durand, N. Thome, and M. Cord. WELDON: weakly supervised learning of deep convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 4743–4752. IEEE Computer Society, 2016.
[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
[11] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2121–2129, 2013.
[12] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 580–587. IEEE Computer Society, 2014.
[13] R. Gomez. Instacities1m, https://gombru.github.io/2018/08/01/InstaCities1M/, 2018.
[14] R. Gomez, L. Gomez, J. Gibert, and D. Karatzas. Learning to learn from web data through deep semantic embeddings. In L. Leal-Taixe and S. Roth, editors, Computer Vision - ECCV 2018 Workshops - Munich, Germany, September 8-14, 2018, Proceedings, Part VI, volume 11134 of Lecture Notes in Computer Science, pages 514–529. Springer, 2018.
[15] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2315–2324. IEEE Computer Society, 2016.
[16] K. He, G. Gkioxari, P. Dollar, and R. B. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2980–2988. IEEE Computer Society, 2017.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778. IEEE Computer Society, 2016.
[18] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montreal, Canada, pages 10477–10486, 2018.
[19] H. Izadinia, B. C. Russell, A. Farhadi, M. D. Hoffman, and A. Hertzmann. Deep classifiers from image tags in the wild. In G. Friedland, C. Ngo, and D. A. Shamma, editors, Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions, MMCommons 2015, Brisbane, Australia, October 30, 2015, pages 13–18. ACM, 2015.
[20] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis., 116(1):1–20, 2016.
[21] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, volume 9911 of Lecture Notes in Computer Science, pages 67–84. Springer, 2016.
[22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[23] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, volume 9907 of Lecture Notes in Computer Science, pages 301–320. Springer, 2016.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 1106–1114, 2012.
[25] D. Lewy and J. Mandziuk. An overview of mixing augmentation methods and augmentation strategies. Artificial Intelligence Review, 2022.
[26] D. Lewy and J. Mandziuk. Instafood1m, https://szefkuchni.github.io/InstaFood1M/, 2019.
[27] D. Lewy and J. Mandziuk. Instapascal2m, https://szefkuchni.github.io/InstaPascal2M/, 2019.
[28] J. Li, Y. Song, J. Zhu, L. Cheng, Y. Su, L. Ye, P. Yuan, and S. Han. Learning from large-scale noisy web data with ubiquitous reweighting for image classification. IEEE Trans. Pattern Anal. Mach. Intell., 43(5):1808–1814, 2021.
[29] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: common objects in context. In D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of Lecture Notes in Computer Science, pages 740–755. Springer, 2014.
[30] D. Mahajan, R. B. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, volume 11206 of Lecture Notes in Computer Science, pages 185–201. Springer, 2018.
[31] L. Niu, W. Li, D. Xu, and J. Cai. Visual recognition by learning from web data via weakly supervised domain generalization. IEEE Trans. Neural Networks Learn. Syst., 28(9):1985–1999, 2017.
[32] L. Niu, Q. Tang, A. Veeraraghavan, and A. Sabharwal. Learning from noisy web data with category-level supervision. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 7689–7698. Computer Vision Foundation / IEEE Computer Society, 2018.
[33] L. Niu, A. Veeraraghavan, and A. Sabharwal. Webly supervised learning meets zero-shot learning: A hybrid approach for fine-grained classification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 7171–7180. Computer Vision Foundation / IEEE Computer Society, 2018.
[34] G. Patrini, A. Rozza, A. K. Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2233–2241. IEEE Computer Society, 2017.
[35] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
[36] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 779–788. IEEE Computer Society, 2016.
[37] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.
[38] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, D. Shen, and R. Rastogi, editors, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144. ACM, 2016.
[39] S. Ruder. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.
[40] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis., 115(3):211–252, 2015.
[41] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2017.
[42] A. Shrivastava, A. Gupta, and R. B. Girshick. Training region-based object detectors with online hard example mining. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 761–769. IEEE Computer Society, 2016.
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[44] E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in NLP. In A. Korhonen, D. R. Traum, and L. Marquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 3645–3650. Association for Computational Linguistics, 2019.
[45] H. Su, S. Gong, and X. Zhu. Weblogo-2m: Scalable logo detection by deep learning from the web. In 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017, pages 270–279. IEEE Computer Society, 2017.
[46] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 843–852. IEEE Computer Society, 2017.
[47] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1–9. IEEE Computer Society, 2015.
[48] A. Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5596–5605, 2017.
[49] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. J. Belongie. Learning from noisy large-scale datasets with minimal supervision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6575–6583. IEEE Computer Society, 2017.
[50] Y. Xian, B. Schiele, and Z. Akata. Zero-shot learning - the good, the bad and the ugly. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3077–3086. IEEE Computer Society, 2017.
[51] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 2691–2699. IEEE Computer Society, 2015.
[52] J. Yang, X. Sun, Y. Lai, L. Zheng, and M. Cheng. Recognition from web data: A progressive filtering approach. IEEE Trans. Image Process., 27(11):5303–5315, 2018.
[53] I. Yildirim, T. Kulkarni, W. A. Freiwald, and J. B. Tenenbaum. Efficient and robust analysis-by-synthesis in vision: A computational framework, behavioral tests, and modeling neuronal representations. In Annual Conference of the Cognitive Science Society, 2015.
[54] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3320–3328, 2014.
[55] C. Zhang, Y. Yao, X. Xu, J. Shao, J. Song, Z. Li, and Z. Tang. Extracting useful knowledge from noisy web images via data purification for fine-grained recognition. In H. T. Shen, Y. Zhuang, J. R. Smith, Y. Yang, P. Cesar, F. Metze, and B. Prabhakaran, editors, MM ’21: ACM Multimedia Conference, Virtual Event, China, October 20 - 24, 2021, pages 4063–4072. ACM, 2021.
[56] B. Zhuang, L. Liu, Y. Li, C. Shen, and I. D. Reid. Attend in groups: A weakly-supervised deep learning framework for learning from web data. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2915–2924. IEEE Computer Society, 2017.

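Notes

Record prepared with funds from MEiN (the Polish Ministry of Education and Science), agreement no. SONP/SP/546092/2022, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: Popularisation of science and promotion of sport (2022-2023).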

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-ec514410-ae39-47e8-a099-caa7bbee4d24