We detect and classify two-person interactions in multiple frames of a video, based on skeleton data. The solution follows the idea of a "mixture of experts", but with the experts distributed over a time sequence of video frames. Every expert is trained independently to classify a particular time-indexed snapshot of a visual action, while the overall classification result is a weighted combination of all the experts' results. The training videos require no extra labeling effort, as the particular frames are automatically aligned via relative time indices. In general, the initial skeleton data is extracted from the video frames by OpenPose [2] or by another dedicated method. An algorithm for merging, elimination, and normalization of image joints is also developed, which improves the quality of the skeleton data. During training of the classifiers, hyper-parameter optimization techniques are employed. The solution is trained and tested on the interaction subset of the well-known NTU-RGB-D dataset [14], [13]; only 2D data are used. Two baseline classifiers, implementing the pose classification experts, are compared: a kernel SVM and a neural MLP. Our results show performance comparable with some of the best reported STM- and CNN-based classifiers for this dataset [20]. We conclude that by reducing the noise of skeleton data and using a time-distributed mixture of simple experts, a highly successful lightweight approach to visual interaction recognition can be achieved.
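The core fusion step described above — a weighted combination of per-frame expert outputs — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the assumption that each expert emits a class-probability vector, and the uniform treatment of the weights are all ours.

```python
import numpy as np

def combine_expert_scores(expert_probs, weights):
    """Fuse the outputs of time-indexed pose-classification experts.

    expert_probs : (T, C) array - class probabilities from T experts,
        one per relative time index, over C interaction classes.
    weights : (T,) array - per-expert weights (normalized internally).

    Returns the fused (C,) probability vector and the winning class.
    """
    expert_probs = np.asarray(expert_probs, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()              # normalize so the fused scores sum to 1
    fused = w @ expert_probs     # weighted sum over the time axis
    return fused, int(np.argmax(fused))
```

For example, three experts voting over two classes with weights `[1, 1, 2]` yield a fused score that favors the class preferred by the more heavily weighted expert.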