In this paper, entitled "Faster and Accurate Compressed Video Action Recognition Straight from the Frequency Domain", we present a deep neural network for human action recognition that learns straight from compressed video. Our network is a two-stream CNN integrating frequency (i.e., transform coefficients) and temporal (i.e., motion vectors) information, both of which can be extracted by parsing and entropy decoding the stream of encoded video data.
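To make the two-stream idea concrete, the sketch below shows late fusion of per-stream class scores with NumPy. It is a toy illustration, not the paper's implementation: the logits stand in for the outputs of the frequency (DCT-coefficient) stream and the motion-vector stream, and the four-class setup is hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy per-stream class scores for 4 hypothetical action classes; in the
# real network each stream is a CNN over its own modality.
freq_logits = np.array([2.0, 0.5, 0.1, -1.0])    # frequency (DCT) stream
motion_logits = np.array([1.5, 1.0, -0.5, 0.0])  # motion-vector stream

# Late fusion: average the streams' scores, then normalize to a distribution.
fused = softmax((freq_logits + motion_logits) / 2.0)
pred = int(np.argmax(fused))  # index of the predicted action class
```

Averaging scores before the softmax is one common fusion choice; weighted averaging or learned fusion layers are equally valid variants.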
The starting point for our proposal is the CoViAR [1] approach. In essence, CoViAR extends TSN [2] to exploit three types of information available in MPEG-4 compressed streams: (1) RGB images encoded in I-frames, (2) motion vectors, and (3) residuals encoded in P-frames. Although CoViAR has been designed to operate with video data in the compressed domain, it still demands a preliminary decoding step, since the frequency domain representation (i.e., DCT coefficients) used to encode the pictures in I-frames and the residuals in P-frames needs to be decoded to the spatial domain.
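The decoding step above can be illustrated with a minimal SciPy example: an I-frame block is stored as 2-D DCT coefficients, so recovering spatial-domain pixels requires an inverse DCT per block. The 8x8 block below is synthetic; real codecs additionally quantize the coefficients, which this sketch omits.

```python
import numpy as np
from scipy.fft import dctn, idctn

# Hypothetical 8x8 luma block of an I-frame (pixel values in [0, 255]).
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8)).astype(np.float64)

# The encoder stores the block as its 2-D DCT coefficients.
coeffs = dctn(block, norm='ortho')

# A CoViAR-style pipeline must apply the inverse DCT to every block to
# recover spatial-domain pixels before feeding them to the CNN.
recovered = idctn(coeffs, norm='ortho')

assert np.allclose(recovered, block)
```

Learning straight from `coeffs` instead of `recovered` is precisely the inverse-transform cost that a frequency-domain network avoids.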