Article title

Mixture of pose experts for interaction classification in video

Identifiers
Title variants
Publication languages
EN
Abstracts
EN
We detect and classify two-person interactions in multiple frames of a video, based on skeleton data. The solution follows the idea of a "mixture of experts", but here the experts are distributed over a time sequence of video frames. Every expert is trained independently to classify a particular time-indexed snapshot of a visual action, while the overall classification result is a weighted combination of all the experts' results. The training videos require no extra labeling effort, as the relevant frames are automatically assigned relative time indices. In general, the initial skeleton data is extracted from video frames by OpenPose [2] or by another dedicated method. An algorithm for the merging-elimination and normalization of image joints is also developed, which improves the quality of the skeleton data. Hyper-parameter optimization techniques are employed during training of the classifiers. The solution is trained and tested on the interaction subset of the well-known NTU RGB+D dataset [14], [13]; only 2D data are used. Two baseline classifiers, implementing the pose classification experts, are compared: a kernel SVM and a neural MLP. Our results show performance comparable with some of the best reported LSTM- and CNN-based classifiers for this dataset [20]. We conclude that by reducing the noise of skeleton data and using a time-distributed mixture of simple experts, a highly successful lightweight approach to visual interaction recognition can be achieved.
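For illustration only, a minimal Python sketch of the time-distributed weighted combination described in the abstract, assuming T independently fitted per-snapshot classifiers in the scikit-learn style [24] (exposing predict_proba) and a fixed weight vector. The paper's actual expert form and weighting scheme are not specified here, so all names below are hypothetical.

    import numpy as np

    def classify_interaction(frame_features, experts, weights):
        """Weighted 'mixture of pose experts' over T time-indexed snapshots.

        frame_features: list of T per-frame skeleton feature vectors
        experts:        list of T fitted classifiers, one per snapshot
                        (e.g. sklearn SVC(probability=True) or MLPClassifier)
        weights:        array of T non-negative mixing weights summing to 1
        """
        # Each expert scores only its own time-indexed snapshot of the action.
        per_expert = np.stack([
            clf.predict_proba(np.asarray(x).reshape(1, -1))[0]
            for clf, x in zip(experts, frame_features)
        ])                                   # shape: (T, n_classes)
        # The overall result is the weighted combination of all experts' outputs.
        combined = weights @ per_expert      # shape: (n_classes,)
        return int(np.argmax(combined))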
Year
Pages
5–16
Physical description
Bibliography: 25 items; figures, tables, graphs.
Authors
  • Warsaw University of Technology, Institute of Control and Computation Eng., Poland
author
  • Warsaw University of Technology, Institute of Control and Computation Eng., Poland
Bibliography
  • [1] A. Bevilacqua, K. MacDonald, A. Rangarej, V. Widjaya, B. Caulfield, and T. Kechadi. Human Activity Recognition with Convolutional Neural Networks. In: Machine Learning and Knowledge Discovery in Databases, LNAI vol. 11053, Springer, Cham, Switzerland, 2019, pp. 541-552.
  • [2] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):172–186, 2021.
  • [3] E. Cippitelli, E. Gambi, S. Spinsante, and F. Florez-Revuelta. Evaluation of a skeleton-based method for human activity recognition on a large-scale RGB-D dataset. In: 2nd IET International Conference on Technologies for Active and Assisted Living (TechAAL 2016), London, UK, 2016.
  • [4] C. Coppola, S. Cosar, D. R. Faria, and N. Bellotto. Automatic detection of human interactions from RGB-D data for social activity classification. In: 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Lisbon, 2017, pp. 871-876.
  • [5] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model. In: Computer Vision – ECCV 2016, LNCS vol. 9907, Springer, Cham, Switzerland, 2016, pp. 34-50.
  • [6] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Comput., 3(1):79–87, 1991.
  • [7] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21-26 July 2017, pp. 3671–3680.
  • [8] J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot. Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks. IEEE Transactions on Image Processing (TIP), 27(4):1586–1599, 2018.
  • [9] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian. Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15-20 June 2019, pp. 3590-3598.
  • [10] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In: Computer Vision – ECCV 2016, LNCS vol. 9907, Springer, Cham, Switzerland, 2016, pp. 816–833.
  • [11] M. Liu and J. Yuan. Recognizing Human Actions as the Evolution of Pose Estimation Maps. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, June 18-22, 2018, pp. 1159-1168.
  • [12] J. Liu, A. Shahroudy, G. Wang, L.-Y. Duan, and A. C. Kot. Skeleton-Based Online Action Prediction Using Scale Selection Network. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(6):1453–1467, 2019.
  • [13] M. Perez, J. Liu, and A.C. Kot. Interaction Relational Network for Mutual Action Recognition. arXiv:1910.04963 [cs.CV], 2019.
  • [14] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. arXiv:1604.02808 [cs.CV], 2016.
  • [15] A. Stergiou and R. Poppe. Analyzing human-human interactions: a survey. Computer Vision and Image Understanding, 188:102799, 2019.
  • [16] A. Toshev and C. Szegedy. DeepPose: Human Pose Estimation via Deep Neural Networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 2014, pp. 1653-1660.
  • [17] S. Yan, Y. Xiong, and D. Lin. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. arXiv:1801.07455 [cs.CV], 2018.
  • [18] T. Yu and H. Zhu. Hyper-Parameter Optimization: A Review of Algorithms and Applications. arXiv:2003.05689 [cs, stat], 2020.
  • [19] S. Zhang, Z. Wei, J. Nie, L. Huang, S. Wang, and Z. Li. A review on human activity recognition using vision-based method. Journal of Healthcare Engineering, vol. 2017, Article ID 3090343, 2017.
  • [20] [Online]. NTU RGB+D 120 Dataset. Papers With Code. https://paperswithcode.com/dataset/ntu-rgb-d-120
  • [21] [Online]. openpose. CMU-Perceptual-Computing-Lab, 2021. https://github.com/CMU-Perceptual-Computing-Lab/openpose/
  • [22] [Online]. Keras: the Python deep learning API. https://keras.io/
  • [23] [Online]. Keras Tuner. https://keras-team.github.io/keras-tuner/
  • [24] [Online]. Scikit-learn: machine learning in Python. https://scikit-learn.org/
  • [25] [Online]. Google Colaboratory. https://colab.research.google.com/
Notes
Record created with funding from the Ministry of Education and Science (MEiN), agreement no. SONP/SP/546092/2022, under the "Social Responsibility of Science" programme, module: Popularisation of Science and Promotion of Sport (2024).
Document type
Bibliography
YADDA identifier
bwmeta1.element.baztech-23ef1edb-391c-4ece-a259-6a2761c3f23a