Human action recognition is one of the most active research areas in the computer vision community. Existing action recognition datasets mainly focus on atomic actions, which limits the performance of video-based human activity recognition. Recently, researchers have turned to person-person and person-object interaction actions. However, current interaction action datasets include only a limited number of action categories and video samples. Meanwhile, with the recent advancement of cost-effective depth sensors, there is significant interest in using depth data for various computer vision tasks. We therefore contribute a public Multi-modal & Multi-view & Interactive (M2I) dataset.
The Multi-modal & Multi-view & Interactive (M2I) dataset provides person-person interaction actions and person-object interaction actions. Two static Kinect depth sensors were used to simultaneously capture the RGB images (320 × 240), depth images (320 × 240) and skeleton data (3D coordinates of 20 joints per frame) from both the front and side views. The dataset was recorded at 30 frames per second.
It consists of 22 action categories and a total of 22 unique individuals. Each action is performed twice by 20 groups (two persons per group). In total, the M2I dataset contains 1760 samples (22 actions × 20 groups × 2 views × 2 runs). All RGB images, depth data, and skeleton data were preprocessed to remove noise. Furthermore, we implemented background modeling and foreground extraction to provide masks for individual frames.
In total, the M2I dataset contains the following data: RGB data (image sequence sample: 6.79G; video sample: 19.2G); depth data (image sequence sample: 49.4G); masks (image sequence sample: 613M); 3D skeleton data (53.9M).
For evaluation, all samples were divided by group into a training set (8 groups), a validation set (6 groups) and a test set (6 groups). The classifiers are trained on the training set, while the validation set is used to optimize the parameters. The final action recognition results are obtained on the test set.
Training: person 1 4 6 9 10 13 14 15
Validation: person 2 3 7 8 11 12
Test: person 5 16 17 18 19 20
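As a minimal sketch, the split above can be written down in Python and checked against the sample count stated earlier. It treats the listed IDs as group IDs, which is an assumption (the lists label them "person" while the text splits by group). The names `SPLITS`, `split_of` and `samples_per_split` are hypothetical and not part of any released tool.

```python
# IDs per split, copied from the lists above (assumed to be group IDs).
SPLITS = {
    "train": [1, 4, 6, 9, 10, 13, 14, 15],
    "val": [2, 3, 7, 8, 11, 12],
    "test": [5, 16, 17, 18, 19, 20],
}

NUM_ACTIONS = 22  # action categories
NUM_VIEWS = 2     # front view and side view
NUM_RUNS = 2      # each action is performed twice


def split_of(group_id):
    """Return the name of the split that a group ID belongs to."""
    for name, groups in SPLITS.items():
        if group_id in groups:
            return name
    raise ValueError(f"unknown group id: {group_id}")


def samples_per_split():
    """Count samples per split as groups x actions x views x runs."""
    return {
        name: len(groups) * NUM_ACTIONS * NUM_VIEWS * NUM_RUNS
        for name, groups in SPLITS.items()
    }
```

Under these assumptions the three splits are disjoint, cover IDs 1 to 20, and hold 704, 528 and 528 samples, which sum to the 1760 samples of the dataset.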
The dataset contains 22 action categories.
Table 1: List of action categories of M2I dataset.
(A) denotes the person-person interaction actions. (B) and (C) belong to the person-object interaction actions.
Multi-person interactive actions (A):
1 Walk Together
7 High Five
Multi-person with object interactive actions (B):
1 Play Football
3 Pass Basketball
4 Carry Box
Person-object interactive actions (C):
1 Throw Basketball
2 Bounce Basketball
3 Hula Hoop
5 Tennis Swing
6 Call Cellphone
8 Take Photo
9 Sweep Floor
10 Clean Desk
11 Play Guitar
D stands for multi-person interactive action and S stands for single person interactive action.
T01 stands for the first sample and T02 stands for the second sample.
“A02_D01_T01” means the second multi-person interactive action (Crossing) by the first group of actors for the first time.
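The naming convention above can be decoded with a small parser. The following is a sketch only: the single confirmed example is "A02_D01_T01", so the two-digit fields and the use of the same pattern for categories B and C are assumptions, and `SampleName` and `parse_sample_name` are hypothetical names.

```python
import re
from typing import NamedTuple

# Assumed pattern: <category><action no.>_<D or S><performer no.>_T<run no.>
_NAME_RE = re.compile(r"^([ABC])(\d{2})_([DS])(\d{2})_T(\d{2})$")


class SampleName(NamedTuple):
    category: str       # "A", "B" or "C", as in Table 1
    action: int         # index of the action within its category
    multi_person: bool  # True for "D", False for "S"
    performer: int      # index of the group (or single actor)
    run: int            # 1 for T01, 2 for T02


def parse_sample_name(name):
    """Split a sample name such as 'A02_D01_T01' into its fields."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected sample name: {name!r}")
    category, action, kind, performer, run = match.groups()
    return SampleName(category, int(action), kind == "D", int(performer), int(run))
```

For example, `parse_sample_name("A02_D01_T01")` yields category "A", action 2, a multi-person action, performer 1 and run 1, matching the description above.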
Figure 1: The recording scene of M2I dataset.
Figure 2: Camera configuration of the M2I dataset.
Figure 3: Image samples of selected action in the M2I dataset. Each cell shows the front-view and side-view samples (RGB, depth, skeleton) of actions.
Because of the recording environment, some errors occurred when the skeleton data was captured. We applied post-processing to correct them. Here we list the types of errors and the solutions.
If you have a better method for handling the abnormal data, you are welcome to contact us (email@example.com).
1. Single-View Task
We evaluated the BoW+SVM framework separately on the front view and the side view of the M2I dataset, in both the RGB and depth modalities. For multi-class classification, we applied the one-against-rest approach and selected the optimal parameters by cross-validation. The model was trained on the training+validation data and tested on the test data.
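To illustrate the two generic ingredients named above, the sketch below shows a bag-of-words encoding step and the label relabeling used by a one-against-rest scheme. It is not the pipeline used for the reported results: the local descriptors, the codebook (for instance from k-means) and its size are not specified here and are assumed to be given, and the SVM training itself is omitted. All function names are hypothetical.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bow_histogram(descriptors, codebook):
    """L1-normalized histogram of codeword assignments for one video."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class problem as +1 (positive class) vs -1 (rest)."""
    return [1 if label == positive_class else -1 for label in labels]
```

In a one-against-rest setup, one binary classifier would be trained per action category on the histograms with the relabeled targets, and a test video would be assigned to the category whose classifier gives the highest score.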
The category-wise comparison in Figure 4 shows that, with its richer visual information, the RGB modality outperforms the depth modality on most categories, and that the two modalities are complementary.
Figure 4: Category-wise accuracy and average accuracy on the RGB and Depth data under the single-view scenario. (SV: side view; FV: front view)
2. Cross-View Task
We implemented the transferable dictionary pair learning method in both supervised (shared actions in both views are labeled) and unsupervised (shared actions in both views are not labeled) settings to transfer sparse feature representations of videos from the source view to the target view, on the RGB and depth data of the M2I dataset separately.
Figure 5 shows that, in the RGB modality, learning in the side view and testing in the front view usually outperforms learning in the front view and testing in the side view. In the depth modality the opposite holds, since the front view usually has more significant depth variation.
Figure 5: Category-wise accuracy and average accuracy on the RGB and Depth data under the cross-view scenario. The top bar figure corresponds to the unsupervised approach and the bottom bar figure corresponds to the supervised approach.
3. Cross-Domain Task
We selected one view as the target domain and the other as the auxiliary domain and evaluated the Adaptive Multiple Kernel Learning (AMKL) method against SVM-T, DTSVM, MKL, FR and SVM-AT. Two experimental settings were used:
Case 1: all the training+validation data of 14 persons in the auxiliary domain plus the data of N persons belonging to the training+validation part in the target domain;
Case 2: all the training+validation data of 14 persons in the target domain plus the data of N persons belonging to the training+validation part in the auxiliary domain.
This experiment further demonstrated that the dataset is extremely challenging with significant view differences.
Figure 6: Overall accuracy on the RGB and Depth data under the cross-domain scenario. (a-d) Case 1; (e-h) Case 2. (T: target domain; A: auxiliary domain)
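The composition of the training data in Case 1 and Case 2 can be sketched as follows. The 14 IDs are the training and validation IDs listed earlier. How the N persons are chosen is not stated, so taking the first N IDs is an assumption, as are the view names "front" and "side" as defaults and the function name `cross_domain_training_set`.

```python
# The 14 training+validation IDs from the evaluation split.
TRAIN_VAL_GROUPS = sorted([1, 4, 6, 9, 10, 13, 14, 15] + [2, 3, 7, 8, 11, 12])


def cross_domain_training_set(case, n, target="front", auxiliary="side"):
    """Return the (view, id) pairs used for training in Case 1 or Case 2."""
    if case not in (1, 2):
        raise ValueError("case must be 1 or 2")
    if not 0 <= n <= len(TRAIN_VAL_GROUPS):
        raise ValueError("n must be between 0 and 14")
    # Case 1 uses all auxiliary-domain data; Case 2 uses all target-domain data.
    if case == 1:
        full_view, partial_view = auxiliary, target
    else:
        full_view, partial_view = target, auxiliary
    pairs = [(full_view, group) for group in TRAIN_VAL_GROUPS]
    # Assumption: the N extra persons are simply the first N IDs.
    pairs += [(partial_view, group) for group in TRAIN_VAL_GROUPS[:n]]
    return pairs
```

For instance, `cross_domain_training_set(1, 3)` returns the 14 side-view entries followed by three front-view entries, so the amount of target-domain data grows with N.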
For each category, we provide RGB data (image sequence sample: 6.79G; video sample: 19.2G); depth data (image sequence sample: 49.4G); masks (image sequence sample: 613M); 3D skeleton data (53.9M). We have set up an FTP server for downloading the dataset. To register, send an email to firstname.lastname@example.org or email@example.com, and we will send you a username and password for downloading the dataset.
Note: Participants also need to download and fill in the Agreement and Disclaimer Form and send it back to us with the registration email. We will then email you the instructions for downloading the dataset.
Please cite the following paper if using this dataset in your publications:
author = "Ning Xu and Anan Liu and Weizhi Nie and Yongkang Wong and Fuwu Li and Yuting Su",
title = "Multi-modal \& Multi-view \& Interactive Benchmark Dataset for Human Action Recognition",
booktitle = "Proceedings of the 23rd International Conference on Multimedia 2015,
Brisbane, Queensland, Australia, October 26-30, 2015",
year = "2015"