Article title

Skeleton-based human action/interaction classification in sparse image sequences

Full text / Content
Identifiers
Title variants
Publication languages
EN
Abstracts
EN
Research results on human activity classification in video are described, based on an initial human skeleton estimation in selected video frames. Simple, homogeneous activities, limited to single-person actions and two-person interactions, are considered. The initial skeleton data is estimated in selected video frames by software tools such as “OpenPose” or “HRNet”. The main contributions of the presented work are the steps of “skeleton tracking and correcting” and “relational feature extraction”. It is shown that this feature engineering step significantly increases the classification accuracy compared with processing raw skeleton data. For the final neural network encoder-classifier, two different architectures are designed and evaluated. The first solution is a lightweight multilayer perceptron (MLP) network implementing the idea of a “mixture of pose experts”: several pose classifiers (experts) are trained on different time periods (snapshots) of visual actions/interactions, while the final classification is a time-related pooling of weighted expert classifications; all pose experts share a common deep encoding network. The second (middleweight) solution is based on a “long short-term memory” (LSTM) network. Both solutions are trained and tested on the well-known NTU RGB+D dataset, although only 2D data are used. Our results show performance comparable with some of the best reported LSTM-, Graph Convolutional Network (GCN)-, and Convolutional Neural Network (CNN)-based classifiers for this dataset. We conclude that, by reducing the noise of skeleton data, highly successful lightweight and midweight models for the recognition of brief activities in image sequences can be achieved.
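As a rough illustration of the “mixture of pose experts” architecture outlined in the abstract, the minimal Keras sketch below (Keras is listed as reference [45]) shows one plausible realization: a shared MLP encoder processes the relational feature vector of each selected snapshot, a softmax pose expert scores every snapshot, and the final label is a weighted, time-related pooling of the expert outputs. The snapshot count T, feature length D, layer widths, and the gating scheme are illustrative assumptions, not the authors’ implementation.

# Hypothetical sketch of a "mixture of pose experts" classifier (not the paper's code).
import tensorflow as tf
from tensorflow.keras import layers, models

T = 8             # number of selected snapshots per sequence (assumed)
D = 128           # length of the relational feature vector per snapshot (assumed)
NUM_CLASSES = 60  # e.g. the NTU RGB+D action classes

inputs = layers.Input(shape=(T, D))                       # (batch, T, D)

# Shared deep encoder, applied independently to every snapshot.
encoder = models.Sequential([
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
])
encoded = layers.TimeDistributed(encoder)(inputs)         # (batch, T, 128)

# Per-snapshot "pose expert": class scores for each time period.
expert_logits = layers.TimeDistributed(
    layers.Dense(NUM_CLASSES))(encoded)                   # (batch, T, C)

# Learned per-snapshot weights (gating), normalized over the time axis.
gate = layers.TimeDistributed(layers.Dense(1))(encoded)   # (batch, T, 1)
gate = layers.Softmax(axis=1)(gate)

# Time-related pooling: weighted sum of the expert predictions.
pooled = layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([expert_logits, gate])
outputs = layers.Softmax()(pooled)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])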
Authors
  • Warsaw University of Technology, Institute of Control and Computation Engineering, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland, www.ia.pw.edu.pl/~wkasprza
  • Warsaw University of Technology, Institute of Control and Computation Engineering, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
Bibliography
  • [1] C. Coppola, S. Cosar, D. R. Faria, and N. Bellotto. “Automatic detection of human interactions from RGB‐D data for social activity classification,” 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Lisbon, 2017, pp. 871–876; doi: 10.1109/ROMAN.2017.8172405.
  • [2] A. M. Zanchettin, A. Casalino, L. Piroddi, and P. Rocco. “Prediction of Human Activity Patterns for Human–Robot Collaborative Assembly Tasks,” IEEE Transactions on Industrial Informatics, vol. 15(2019), no. 7, pp. 3934–3942; doi: 10.1109/TII.2018.2882741.
  • [3] Z. Zhang, G. Peng, W. Wang, Y. Chen, Y. Jia, and S. Liu. “Prediction‐Based Human‐Robot Collaboration in Assembly Tasks Using a Learning from Demonstration Model,” Sensors, 2022, no. 22(11):4279; doi: 10.3390/s22114279.
  • [4] M. S. Ryoo. “Human activity prediction: Early Recognition of Ongoing Activities from Streaming Videos,” 2011 International Conference on Computer Vision, Barcelona, Spain, 2011, pp. 1036–1043; doi: 10.1109/ICCV.2011.6126349.
  • [5] K. Viard, M. P. Fanti, G. Faraut, and J.‐J. Lesage. “Human Activity Discovery and Recognition Using Probabilistic Finite‐State Automata,” IEEE Transactions on Automation Science and Engineering, vol. 17 (2020), no. 4, pp. 2085–2096; doi: 10.1109/TASE.2020.2989226.
  • [6] S. Zhang, Z. Wei, J. Nie, L. Huang, S. Wang, and Z. Li. “A review on human activity recognition using vision‐based method,” Journal of Healthcare Engineering, Hindawi, vol. 2017, Article ID 3090343; doi: 10.1155/2017/3090343.
  • [7] A. Stergiou and R. Poppe. “Analyzing human‐human interactions: a survey,” Computer Vision and Image Understanding, Elsevier, vol. 188 (2019), 102799; doi: 10.1016/j.cviu.2019.102799.
  • [8] A. Bevilacqua, K. MacDonald, A. Rangarej, V. Widjaya, B. Caulfield, and T. Kechadi. “Human Activity Recognition with Convolutional Neural Networks,” Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2018), LNAI vol. 11053, Springer, Cham, Switzerland, 2019, pp. 541–552; doi: 10.1007/978‐3‐030‐10997‐4_33.
  • [9] M. Liu, and J. Yuan. “Recognizing Human Actions as the Evolution of Pose Estimation Maps,” 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, June 18‐22, 2018, pp. 1159–1168; doi: 10.1109/CVPR.2018.00127.
  • [10] E. Cippitelli, E. Gambi, S. Spinsante, and F. Florez‐Revuelta. “Evaluation of a skeleton‐based method for human activity recognition on a large‐scale RGB‐D dataset,” 2nd IET International Conference on Technologies for Active and Assisted Living (TechAAL 2016), London, UK, 2016; doi: 10.1049/IC.2016.0063.
  • [11] Z. Cao, G. Hidalgo, T. Simon, S.‐E. Wei, and Y. Sheikh, ”OpenPose: Realtime Multi‐Person 2D Pose Estimation Using Part Affinity Fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):172–186, 2021; doi: 10.1109/TPAMI.2019.2929257.
  • [12] A. Toshev, and C. Szegedy. “DeepPose: Human Pose Estimation via Deep Neural Networks,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1653–1660; doi: 10.1109/CVPR.2014.214.
  • [13] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. “Deepercut: a deeper, stronger, and faster multi‐person pose estimation model,” Computer Vision – ECCV 2016, LNCS vol. 9907, Springer, Cham, Switzerland, 2016, pp. 34–50; doi: 10.1007/978‐3‐319‐46466‐4_3.
  • [14] [Online]. NTU RGB+D 120 Dataset. Papers With Code. Available online: https://paperswithcode.com/dataset/ntu‐rgb‐d‐120 (accessed on 30 June 2022).
  • [15] M. Perez, J. Liu, and A.C. Kot, “Interaction Relational Network for Mutual Action Recognition,” arXiv:1910.04963 [cs.CV], 2019; https://arxiv.org/abs/1910.04963 (accessed on 15.07.2022).
  • [16] L.‐P. Zhu, B. Wan, C.‐Y. Li, G. Tian, Y. Hou, and K. Yuan. “Dyadic relational graph convolutional networks for skeleton‐based human interaction recognition,” Pattern Recognition, Elsevier, vol. 115, 2021, p. 107920; doi: 10.1016/j.patcog.2021.107920.
  • [17] R.‐A. Jacobs, M.‐I. Jordan, S.‐J. Nowlan, and G.‐E. Hinton. “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
  • [18] S. Puchała, W. Kasprzak, and P. Piwowarski. “Feature engineering techniques for skeleton‐based two‐person interaction classification in video,” 17th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, 2022, IEEE Xplore, pp. 66–71; doi: 10.1109/ICARCV57592.2022.10004329.
  • [19] P.‐F. Felzenszwalb, R.‐B. Girshick, D. McAllester, and D. Ramanan. “Object detection with discriminatively trained part‐based models,” IEEE Trans. Pattern Anal. Mach. Intell., 2010, vol. 32, no. 9, pp. 1627–1645; doi: 10.1109/TPAMI.2009.167.
  • [20] A. Krizhevsky, I. Sutskever, and G.‐E. Hinton. “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, 2017, vol. 60(6), pp. 84–90; doi: 10.1145/3065386.
  • [21] K. Simonyan, and A. Zisserman. “Very Deep Convolutional Networks for Large‐Scale Image Recognition,” arXiv, 2015, arXiv:1409.1556; https://arxiv.org/abs/1409.1556.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. “Deep Residual Learning for Image Recognition,” Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778; doi:10.1109/CVPR.2016.90.
  • [23] T.‐L. Munea, Y.‐Z. Jembre, H.‐T. Weldegebriel, L. Chen, C. Huang, and C. Yang. “The Progress of Human Pose Estimation: A Survey and Taxonomy of Models Applied in 2D Human Pose Estimation,” IEEE Access, 2020, vol. 8, pp. 133330–133348; doi: 10.1109/ACCESS.2020.3010248.
  • [24] K. Wei, and X. Zhao. “Multiple‐Branches Faster RCNN for Human Parts Detection and Pose Estimation,” Computer Vision – ACCV 2016 Workshops, Lecture Notes in Computer Science, vol. 10118, Springer, Cham, 2017; doi: 10.1007/978‐3‐319‐54526‐4.
  • [25] Z. Su, M. Ye, G. Zhang, L. Dai, and J. Sheng. “Cascade feature aggregation for human pose estimation,” arXiv, 2019, arXiv:1902.07837; https://arxiv.org/abs/1902.07837.
  • [26] H. Meng, M. Freeman, N. Pears, and C. Bailey. “Real‐time human action recognition on an embedded, reconfigurable video processing architecture,” J. Real-Time Image Proc., vol. 3, 2008, no. 3, pp. 163–176; doi: 10.1007/s11554‐008‐0073‐1.
  • [27] K.‐G. Manosha Chathuramali, and R. Rodrigo. “Faster human activity recognition with SVM,” International Conference on Advances in ICT for Emerging Regions (ICTer2012), Colombo, Sri Lanka, 12–15 December 2012, IEEE, 2012, pp. 197–203; doi: 10.1109/icter.2012.6421415.
  • [28] X. Yan, and Y. Luo. “Recognizing human actions using a new descriptor based on spatial–temporal interest points and weighted‐output classifier,” Neurocomputing, Elsevier, vol. 87, 2012, pp. 51–61, 15 June 2012; doi: 10.1016/j.neucom.2012.02.002.
  • [29] R. Vemulapalli, F. Arrate, and R. Chellappa. “Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, 23–28 June 2014, Columbus, OH, USA, IEEE, pp. 588–595; doi: 10.1109/cvpr.2014.82.
  • [30] J. Liu, A. Shahroudy, D. Xu, and G. Wang. “Spatio‐Temporal LSTM with Trust Gates for 3D Human Action Recognition,” Computer Vision – ECCV 2016, Lecture Notes in Computer Science, vol. 9907, Springer, Cham, Switzerland, 2016, pp. 816–833; doi: 10.1007/978‐3‐319‐46487‐9_50.
  • [31] A. Shahroudy, J. Liu, T.‐T. Ng, and G. Wang. “NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis,” arXiv:1604.02808 [cs.CV], 2016; https://arxiv.org/abs/1604.02808.
  • [32] C. Li, Q. Zhong, D. Xie, and S. Pu. “Skeleton‐based Action Recognition with Convolutional Neural Networks,” 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 10–14 July 2017, Hong Kong, pp. 597–600; doi: 10.1109/ICMEW.2017.8026285.
  • [33] D. Liang, G. Fan, G. Lin, W. Chen, X. Pan, and H. Zhu. “Three‐Stream Convolutional Neural Network With Multi‐Task and Ensemble Learning for 3D Action Recognition,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 16–17 June 2019, Long Beach, CA, USA, IEEE, pp. 934–940; doi: 10.1109/cvprw.2019.00123.
  • [34] S. Yan, Y. Xiong, and D. Lin. “Spatial Temporal Graph Convolutional Networks for Skeleton‐Based Action Recognition,” arXiv:1801.07455 [cs.CV], 2018; https://arxiv.org/abs/1801.07455 (accessed on 15.07.2022).
  • [35] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, ”Actional‐Structural Graph Convolutional Networks for Skeleton‐Based Action Recognition,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019, pp. 3590–3598; doi: 10.1109/CVPR.2019.00371.
  • [36] L. Shi, Y. Zhang, J. Cheng, and H.‐Q. Lu. “Two‐Stream Adaptive Graph Convolutional Networks for Skeleton‐Based Action Recognition,” arXiv:1805.07694v3 [cs.CV], 10 July 2019; doi: 10.48550/ARXIV.1805.07694.
  • [37] L. Shi, Y. Zhang, J. Cheng, and H.‐Q. Lu. “Skeleton‐based action recognition with multi‐stream adaptive graph convolutional networks,” IEEE Transactions on Image Processing, vol. 29, October 2020, pp. 9532–9545; doi: 10.1109/TIP.2020.3028207.
  • [38] H. Duan, Y. Zhao, K. Chen, D. Shao, D. Lin, and B. Dai. “Revisiting Skeleton‐based Action Recognition,” arXiv, 2021, arXiv:2104.13586; https://arxiv.org/abs/2104.13586.
  • [39] H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai. “Revisiting Skeleton‐based Action Recognition,” arXiv:2104.13586v2 [cs.CV], 2 Apr 2022; https://arxiv.org/abs/2104.13586v2.
  • [40] J. Liu, G. Wang, P. Hu, L.‐Y. Duan, and A. C. Kot. “Global Context‐Aware Attention LSTM Networks for 3D Action Recognition,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21‐26 July 2017, pp. 3671–3680; doi: 10.1109/CVPR.2017.391.
  • [41] J. Liu, G. Wang, L.‐Y. Duan, K. Abdiyeva, and A. C. Kot. “Skeleton‐Based Human Action Recognition with Global Context‐Aware Attention LSTM Networks,” IEEE Transactions on Image Processing (TIP), 27(4):1586–1599, 2018; doi: 10.1109/TIP.2017.2785279.
  • [42] J. Liu, A. Shahroudy, G. Wang, L.‐Y. Duan, and A. C. Kot. “Skeleton‐Based Online Action Prediction Using Scale Selection Network,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(6):1453–1467, 2019; doi: 10.1109/TPAMI.2019.2898954.
  • [43] T. Yu, and H. Zhu. “Hyper‐Parameter Optimization: A Review of Algorithms and Applications,” arXiv:2003.05689 [cs, stat], 2020; https://arxiv.org/abs/2003.05689.
  • [44] [Online]. “openpose”, CMU‐Perceptual‐Computing‐Lab, 2021; https://github.com/CMU‐Perceptual‐Computing‐Lab/openpose/.
  • [45] [Online]. “Keras: the Python deep learning API,” https://keras.io/.
  • [46] [Online]. “UTKinect‐3D Database,” Available online: http://cvrc.ece.utexas.edu/KinectDatasets/HOJ3D.html (accessed on 30 June 2022).
  • [47] Kiwon Yun. “Two‐person Interaction Detection Using Body‐Pose Features and Multiple Instance Learning,” https://www3.cs.stonybrook.edu/~kyun/research/kinect_interaction/index.html.
Remarks
The record was developed with MNiSW funds, agreement no. SONP/SP/546092/2022, under the programme "Społeczna odpowiedzialność nauki" (Social Responsibility of Science), module: Popularisation of science and promotion of sport (2024).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-53e442a4-8dfc-4814-b9ea-5ca9a6e1bb4f