Combined YOLOv5 and HRNet for high accuracy 2D keypoint and human pose estimation

Nguyen, Hung-Cuong; Nguyen, Thi-Hao; Nowak, Jakub; Byrski, Aleksander; Siwocha, Agnieszka; Le, Van-Hung

doi:10.2478/jaiscr-2022-0019

Abstract

Two-dimensional human pose estimation has been widely applied in real-world applications such as sports analysis, medical fall detection, and human-robot interaction, with many positive results obtained using Convolutional Neural Networks (CNNs). Li et al. (CVPR 2020) achieved high accuracy in 2D keypoint estimation/2D human pose estimation. However, that study performed estimation only on cropped human image data. In this research, we propose a method for automatically detecting humans and estimating their poses in photos using a combination of YOLOv5 + CC (Contextual Constraints) and HRNet. Our approach inherits the speed of YOLOv5 for detecting humans and the efficiency of HRNet for estimating 2D keypoints/2D human pose in the images. We also annotated the humans in the images of the Human 3.6M dataset (Protocol #1) with bounding boxes for human detection evaluation. Our approach obtained high detection results, and the processing speed is 55 FPS on the Human 3.6M dataset (Protocol #1). The mean error distance is 5.14 pixels on the full-size image (1000×1002). In particular, the average results of 2D human pose estimation/2D keypoint estimation are 94.8% PCK and 99.2% PDJ@0.4 (head joint). The results are available.
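The abstract reports a mean pixel error on the full-size image and threshold-based scores (PCK, PDJ@0.4) for keypoints estimated inside detected person boxes. The sketch below is only an illustration of how such numbers can be computed, not the authors' implementation. The function names, the box layout `(x_min, y_min, x_max, y_max)`, and the choice of reference length are assumptions. It also assumes the crop is not resized before pose estimation; a real pipeline would additionally rescale the coordinates.

```python
import math

def crop_to_image(keypoints, box):
    """Shift keypoints predicted inside a person box back to full-image pixels.

    Assumes box = (x_min, y_min, x_max, y_max) and an unresized crop.
    """
    x0, y0 = box[0], box[1]
    return [(x + x0, y + y0) for x, y in keypoints]

def mean_error(pred, truth):
    """Mean Euclidean distance in pixels between predicted and true keypoints."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists)

def fraction_within(pred, truth, ref_length, alpha):
    """Fraction of keypoints whose error is at most alpha * ref_length.

    PCK- and PDJ-style scores share this form and differ in the reference
    length (for example a head segment or a torso diameter); which one is
    used here is left to the caller.
    """
    limit = alpha * ref_length
    hits = sum(1 for p, t in zip(pred, truth) if math.dist(p, t) <= limit)
    return hits / len(pred)

# Hypothetical values, for illustration only.
box = (100, 200, 300, 600)
pred = crop_to_image([(10, 20), (50, 80)], box)
truth = [(113, 224), (150, 300)]
print(mean_error(pred, truth))
print(fraction_within(pred, truth, ref_length=25, alpha=0.4))
```

With the hypothetical values above, the two keypoints are off by 5 and 20 pixels, so only the first falls within the 10-pixel limit given by `alpha=0.4` and `ref_length=25`.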

Keywords

YOLOv5; HRNet; 2D key points estimation; 2D human pose estimation

Publisher

University of Social Sciences

Journal

Journal of Artificial Intelligence and Soft Computing Research

Year

2022

Volume

Vol. 12, No. 4

Pages

281–298

Physical description

Bibliography: 63 items, figures.

Creators

author

Nguyen Hung-Cuong

Faculty of Engineering Technology, Hung Vuong University, Vietnam

author

Nguyen Thi-Hao

Faculty of Engineering Technology, Hung Vuong University, Vietnam

author

Nowak Jakub

Czestochowa University of Technology, Czestochowa, Poland

author

Byrski Aleksander

AGH University of Science and Technology, Institute of Computer Science, 30-059 Kraków, Poland

author

Siwocha Agnieszka

University of Social Sciences, Institute of Information Technologies, 9 Sienkiewicza Street, 90-113 Łódź, Poland

author

Le Van-Hung

van-hung.le@mica.edu.vn

Tan Trao University, Vietnam

Bibliography

[1] Ssd mobilenet v1 architecture (2018). [Accessed 22 Dec 2021]
[2] Abdulla, W.: Mask r-cnn for object detection and instance segmentation on keras and tensorflow. https://github.com/matterport/Mask_RCNN (2017). [Accessed 20 Dec 2021]
[3] Babu, S.C.: A 2019 guide to human pose estimation with deep learning. https://nanonets.com/blog/human-poseestimation-2d-guide/. [Online: Accessed 5 December 2021]
[4] Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv (2020)
[5] Burrus, N.: Kinect calibration. http://nicolas.burrus.name/index.php/Research/KinectCalibration
[6] Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: IEEE Conference on CVPR, vol. 2017-Janua, pp. 1302–1310 (2017). DOI 10.1109/CVPR.2017.143
[7] Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: CVPR (2017)
[8] Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human pose estimation with iterative error feedback. CoRR abs/1507.06550 (2015)
[9] Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded Pyramid Network for Multiperson Pose Estimation. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 7103–7112 (2018). DOI 10.1109/CVPR.2018.00742
[10] Dai, J., Li, Y., He, K., Sun, J.: R-FCN: Object detection via region-based fully convolutional networks. Advances in Neural Information Processing Systems pp. 379–387 (2016)
[11] Dai, J., Li, Y., He, K., Sun, J.: R-fcn: Object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems, vol. 29. Curran Associates, Inc. (2016). https://proceedings.neurips.cc/paper/2016/file/577ef1154f3240ad5b9b413aa7346a1e-Paper.pdf
[12] Dang, Q., Yin, J., Wang, B., Zheng, W.: Deep learning based 2D human pose estimation: A survey. TPAMI 24(6), 663–676 (2021). DOI 10.26599/TST.2018.9010100
[13] Gao, H.: Single shot multibox detector implementation in pytorch. https://github.com/qfgaohao/pytorch-ssd (2020). [Accessed 20 Dec 2021]
[14] Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 Inter, pp. 1440–1448 (2015). DOI 10.1109/ICCV.2015.169
[15] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014). DOI 10.1109/CVPR.2014.81
[16] Glen, S.: "Jaccard index/similarity coefficient" from statisticshowto.com: Elementary statistics for the rest of us! https://www.statisticshowto.com/jaccard-index/. Online; accessed 6 December 2021
[17] Haque, M.F., Lim, H.y., Kang, D.s.: Object Detection Based on VGG with ResNet Network. In: 2019 International Conference on Electronics, Information, and Communication (ICEIC), pp. 1–3. Institute of electronics and information engineers (IEIE)
[18] He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
[19] He, K., Zhang, X., Ren, S., Sun, J.: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9), 1904–1916 (2015). DOI 10.1109/TPAMI.2015.2389824
[20] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on CVPR, vol. 2016-Decem, pp. 770–778 (2016). DOI 10.1109/CVPR.2016.90
[21] Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., Murphy, K.: Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 3296–3305 (2017). DOI 10.1109/CVPR.2017.351
[22] Hung, G.L., Sahimi, M.S.B., Samma, H., Almohamad, T.A., Lahasan, B.: Faster R-CNN Deep Learning Model for Pedestrian Detection from Drone Images. In: SN Computer Science, vol. 1, pp. 1–9. Springer Singapore (2020). DOI 10.1007/s42979-020-00125-y. https://doi.org/10.1007/s42979-020-00125-y
[23] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI 36(7), 1325–1339 (2014)
[24] Jocher, G.R.: Head and person detection model. https://github.com/deepakcrk/yolov5-crowdhuman. Online; accessed 6 December 2021
[25] Jocher, G.R.: Yolov5 tutorials. https://github.com/ultralytics/yolov5. Online; accessed 6 December 2021
[26] Jonathan, H.: Object detection: speed and accuracy comparison (faster r-cnn, r-fcn, ssd, fpn, retinanet and yolov3) (2018). [Accessed 18 Dec 2021]
[27] Krishnan, S.: Person-detection. https://github.com/SusmithKrishnan/person-detection (2021). [Accessed 20 Dec 2021]
[28] Li, N.: Evoskeleton, cascaded 2d-to-3d lifting. https://github.com/Nicholasli1995/EvoSkeleton. Online; accessed 25 December 2021
[29] Li, S., Ke, L., Pratama, K., Tai, Y.W., Tang, C.K., Cheng, K.T.: Cascaded deep monocular 3d human pose estimation with evolutionary training data. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
[30] Liang, S., Sun, X., Wei, Y.: Compositional Human Pose Regression. In: ICCV, vol. 176-177, pp. 1–8 (2017). DOI 10.1016/j.cviu.2018.10.006
[31] Lin, T.Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C.L., Dollár, P.: Microsoft coco: Common objects in context (2014). http://arxiv.org/abs/1405.0312
[32] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: European Conference on Computer Vision, vol. 9905 LNCS, pp. 21–37 (2016). DOI 10.1007/978-3-319-46448-0_2
[33] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: B. Leibe, J. Matas, N. Sebe, M. Welling (eds.) ECCV (1), Lecture Notes in Computer Science, vol. 9905, pp. 21–37. Springer (2016). http://dblp.uni-trier.de/db/conf/eccv/eccv2016-1.htmlLiuAESRFB16
[34] Luvizon, D.C., Tabia, H., Picard, D.: Human pose regression by combining indirect part detection and contextual information. Computers and Graphics (Pergamon) 85, 15–22 (2019). DOI 10.1016/j.cag.2019.09.002
[35] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Computer Vision – ECCV 2016, pp. 483–499. Springer International Publishing (2016)
[36] Newell, A., Yang, K., Deng, J.: Stacked Hourglass Networks for Human Pose Estimation. In: ECCV (2016)
[37] openpose: openpose. https://github.com/CMUPerceptual-Computing-Lab/openpose (2019). [Accessed 23 April 2019]
[38] Ramanan, D.: Learning to parse images of articulated bodies. In: NIPS (2006)
[39] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2016-Decem, pp. 779–788 (2016). DOI 10.1109/CVPR.2016.91
[40] Redmon, J., Farhadi, A.: Yolo9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
[41] Redmon, J., Farhadi, A.: YOLO9000: Better, faster, stronger. In: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-Janua, pp. 6517–6525 (2017). DOI 10.1109/CVPR.2017.690
[42] Redmon, J., Farhadi, A.: Yolov3 an incremental improvement (2018). http://arxiv.org/abs/1804.02767. [Accessed 18 April 2021]
[43] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28, pp. 91–99 (2015)
[44] Ren, S., He, K., Girshick, R., Sun, J.: Faster RCNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6), 1137–1149 (2017). DOI 10.1109/TPAMI.2016.2577031
[45] Sapp, B., Taskar, B.: In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. DOI 10.1109/CVPR.2013.471
[46] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, pp. 1–14 (2015)
[47] Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: CVPR (2019)
[48] Tan, D.: Image geometric transformation in numpy and opencv. https://towardsdatascience.com/imagegeometric-transformation-in-numpy-and-opencv936f5cd1d315 (2019). Online; accessed 6 December 2021
[49] Thanh, N.T., Hùng, L.V., Công, P.T.: An Evaluation of Pose Estimation in Video of Traditional Martial Arts Presentation. Journal of Research and Development on Information and Communication Technology 2019(2), 114–126 (2019). DOI 10.32913/mic-ict-research.v2019.n2.864
[50] Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient object localization using convolutional networks. In: CVPR, pp. 648–656. IEEE Computer Society (2015)
[51] Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural networks. CoRR abs/1312.4659 (2013). http://dblp.unitrier.de/db/journals/corr/corr1312.htmlToshevS13
[52] Toshev, A., Szegedy, C.: DeepPose: Human Pose Estimation via Deep Neural Networks. In: IEEE Conference on CVPR (2014)
[53] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., Xiao, B.: Deep high-resolution representation learning for visual recognition. TPAMI
[54] Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: CVPR (2016)
[55] Weiming Chen, Zijie Jiang, H.G., Ni, X.: Fall Detection Based on Key Points of human-skeleton using openpose. Symmetry (2020)
[56] Willett, N.S., Shin, H.V., Jin, Z., Li, W., Finkelstein, A.: Pose2Pose: Pose Selection and Transfer for 2D Character Animation. In: International Conference on Intelligent User Interfaces, Proceedings IUI, pp. 88–99 (2020). DOI 10.1145/3377325.3377505
[57] Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: European Conference on Computer Vision (ECCV) (2018)
[58] Yang, W.: Human Pose Estimation 101. https://github.com/cbsudux/Human-PoseEstimation-101percentage-of-correct-keypoints—pck (2019). [Accessed 18 April 2021]
[59] Yang, W., Ouyang, W., Li, H., Wang, X.: End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In: CVPR (2016)
[60] Yang, W., Ouyang, W., Li, H., Wang, X.: End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. https://github.com/bearpaw/eval_pose (2016). Online; accessed 20 December 2021
[61] Zhang, H., Sciutto, C., Agrawala, M., Fatahalian, K.: Vid2Player: Controllable Video Sprites That Behave and Appear Like Professional Tennis Players. ACM Transactions on Graphics 40(3), 1–16 (2021). DOI 10.1145/3448978
[62] Zhang, X., Zou, J., He, K., Sun, J.: Accelerating Very Deep Convolutional Networks for Classification and Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10), 1943–1955 (2016). DOI 10.1109/TPAMI.2015.2502579
[63] Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Towards 3d human pose estimation in the wild: A weakly-supervised approach. In: The IEEE International Conference on Computer Vision (ICCV) (2017)

Document type

Bibliography

YADDA identifier

bwmeta1.element.baztech-b90ac873-a7bf-4d76-9cc5-be438e1eb35b