Actions ~ Transformations

What defines an action like "kicking ball"? We argue that the true meaning of an action lies in the change or transformation it brings to the environment. In this paper, we propose a novel representation for actions by modeling an action as a transformation that changes the state of the environment before the action (precondition) into the state after the action (effect). Motivated by recent advances in video representation using deep learning, we design a Siamese network that models the action as a transformation on a high-level feature space. We show that our model yields improvements on standard action recognition datasets, including UCF101 and HMDB51. More importantly, our approach is able to generalize beyond the learned action categories and shows significant improvement on cross-category generalization on our new ACT dataset.
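The precondition/effect idea can be sketched numerically. The following is a minimal, illustrative NumPy sketch, not the paper's implementation: a shared encoder (standing in for the Siamese towers) embeds precondition and effect features, a hypothetical per-action linear map `T[a]` plays the role of the learned transformation, and an action is scored by how well the transformed precondition matches the effect. Dimensions and the cosine-similarity scoring are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): input feature dim, embedding dim,
# and the 43 action classes of the ACT dataset.
FEAT_DIM, EMB_DIM, N_ACTIONS = 128, 64, 43

# Shared "Siamese" encoder: the same projection embeds both the precondition
# and the effect features into a common high-level space.
W_enc = rng.standard_normal((EMB_DIM, FEAT_DIM)) * 0.01

def encode(x):
    return np.tanh(W_enc @ x)

# One linear transformation per action: T[a] maps a precondition embedding
# toward the effect embedding that action should produce.
T = rng.standard_normal((N_ACTIONS, EMB_DIM, EMB_DIM)) * 0.01

def score_actions(precondition_feat, effect_feat):
    """Score each action by how well T[a](precondition) matches the effect."""
    z_p, z_e = encode(precondition_feat), encode(effect_feat)
    scores = np.empty(N_ACTIONS)
    for a in range(N_ACTIONS):
        pred = T[a] @ z_p
        # Cosine similarity between transformed precondition and effect.
        scores[a] = pred @ z_e / (np.linalg.norm(pred) * np.linalg.norm(z_e) + 1e-8)
    return scores

x_pre = rng.standard_normal(FEAT_DIM)   # features of frames before the action
x_eff = rng.standard_normal(FEAT_DIM)   # features of frames after the action
predicted_action = int(np.argmax(score_actions(x_pre, x_eff)))
```

In the paper the encoder and transformations are learned end-to-end from video; this sketch only shows the scoring structure at inference time.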



Xiaolong Wang, Ali Farhadi and Abhinav Gupta
Actions ~ Transformations
Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016


@inproceedings{wang2016actions,
    Author = {Xiaolong Wang and Ali Farhadi and Abhinav Gupta},
    Title = {Actions {\textasciitilde} Transformations},
    Booktitle = {CVPR},
    Year = {2016},
}

ACT Dataset

Our ACT dataset consists of 11,234 high-quality video clips spanning 43 classes. These 43 classes can be further grouped into 16 super-classes. For example, kicking bag and kicking people both belong to the super-class kicking; swinging baseball, swinging golf, and swinging tennis are grouped under swinging.

We propose two tasks for our ACT dataset:
  • Task 1: Standard action classification over 43 categories.
  • Task 2: Cross-category generalization. For each of the 16 super-classes, one of its sub-categories is held out for testing and the remaining sub-categories are used for training.
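The Task 2 split can be sketched as follows. This is an illustrative Python snippet with a hypothetical (partial) class grouping, not the official split or the full 43-class list: for each super-class, one sub-category is held out for testing and the rest go to training.

```python
# Hypothetical grouping for illustration only (the real dataset has 16
# super-classes over 43 sub-categories).
super_classes = {
    "kicking": ["kicking bag", "kicking people"],
    "swinging": ["swinging baseball", "swinging golf", "swinging tennis"],
}

def cross_category_split(super_classes, held_out_index=0):
    """Hold out one sub-category per super-class for testing;
    the remaining sub-categories form the training set."""
    train, test = [], []
    for subs in super_classes.values():
        held_out = subs[held_out_index]
        test.append(held_out)
        train.extend(s for s in subs if s != held_out)
    return train, test

train, test = cross_category_split(super_classes)
# With held_out_index=0: test = ["kicking bag", "swinging baseball"],
# and the remaining three sub-categories form the training set.
```

Repeating this with each sub-category held out in turn gives the per-super-class generalization scores averaged in the evaluation.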

  • The ACT dataset can also be downloaded from Dropbox.
    [Video Frames (jpg in 34GB)] [Videos (avi in 14GB)] [Labels]


    This work was partially supported by ONR N000141310720, NSF IIS-1338054, ONR MURI N000141010934, and ONR MURI N000141612007. This work was also supported by the Allen Distinguished Investigator Award and a gift from Google. The authors would like to thank Yahoo! and Nvidia for the compute cluster and GPU donations, respectively. The authors would also like to thank Datatang for labeling the ACT dataset.