Spatio-temporal representations for human action recognition from digital video

Author:
  1. AGUSTI BALLESTER, PAU
Supervised by:
  1. Filiberto Pla Bañón (Supervisor)
  2. V. Javier Traver (Co-supervisor)

Defending university: Universitat Jaume I

Date of defense: 3 December 2015

Examination committee:
  1. Francesc Josep Ferri Rabasa (Chair)
  2. Ramón Alberto Mollineda Cardenas (Secretary)
  3. Plinio Moreno López (Member)

Type: Thesis

Teseo: 395016

Abstract

Over the last few decades, human action recognition from video has been one of the computer vision research topics attracting most attention. The main reason is its immediate application to surveillance, human-machine interaction, care of the elderly, and related domains. Despite the progress made, the problem is very challenging and, consequently, robust, efficient, general and scalable solutions remain elusive.

One open issue in human action recognition is how to represent the action itself. A common approach is to take a region of interest (e.g. a bounding box around the performing actor) and divide it into a number of cells according to some grid geometry; features are then extracted from each of these cells. The most frequently used grid geometry is the Cartesian one, and alternatives are scarcely explored. This thesis compares the rectangular and the polar geometries and shows that, in some cases, the rectangular geometry may not be the best choice to describe certain human actions.

Histograms are commonly used to summarize information within a spatio-temporal region. However, spatial information is lost, since the spatial origin of the bin counts is disregarded. To address this issue and evaluate its impact on action characterization, spatiograms, which are histograms enriched with spatial information, are tested as an alternative to classical histograms.

Another relevant open issue in action characterization is how to represent temporal information properly and compactly. Some solutions involve accumulating local histograms over time, extracting short time series of a few still snapshots of representative poses, or decomposing actions into sequences of "actoms" (key atomic action units) and weighting visual features by their temporal distance to these actoms. In this thesis, instead of merely accumulating information over time, Recurrence Matrices (RMs) are explored so that temporal relationships between visual descriptors can be captured in much finer detail.

Many techniques in the literature rely on background segmentation, which is a difficult problem in itself and is only feasible in simple, unrealistic scenarios. This issue has commonly and successfully been addressed by detecting local spatio-temporal interest points (STIPs) and deriving visual descriptors from them. Many such descriptors rely on the optical flow, but only its gradient and its magnitude are used. In contrast, this thesis proposes alternative descriptors based on higher-order properties of the optical flow. One of them builds on the kinematic properties of the flow, inspired by fluid theory. The other reuses the concept of RMs to provide potentially richer temporal information around the interest point. The RM concept is used again in the thesis as a temporally holistic descriptor computed from frame-based descriptors. The combination of these descriptors is studied within a framework that includes early and late fusion mechanisms.

The Bag-of-Words (BoW) model is a well-known and widely used approach to create a fixed-length feature vector from arbitrary-size sets of descriptors (e.g. those coming from the STIPs detected in an action video). While BoW is conceptually very simple and practically effective, it suffers from some drawbacks. One of them is the heavy computational burden of the clustering procedure (usually, K-means) used to build the codebook; another is that it ignores the spatial and temporal relationships between visual words.
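Two of the representations mentioned above, spatiograms and recurrence matrices, can be illustrated with small sketches. The first shows how a second-order spatiogram could be computed: each bin stores not only a count but also the mean and covariance of the spatial coordinates of the samples falling into it. This is a minimal sketch rather than the formulation used in the thesis; the function name, the pre-quantised feature values and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def spatiogram(bin_ids, coords, n_bins):
    """Second-order spatiogram (illustrative sketch): per bin, the count plus
    the mean and covariance of the coordinates of the contributing samples."""
    bin_ids = np.asarray(bin_ids)             # (N,) feature values already quantised into [0, n_bins)
    coords = np.asarray(coords, dtype=float)  # (N, 2) pixel coordinates (x, y)
    counts = np.zeros(n_bins)
    means = np.zeros((n_bins, 2))
    covs = np.zeros((n_bins, 2, 2))
    for b in range(n_bins):
        pts = coords[bin_ids == b]
        counts[b] = len(pts)
        if len(pts) > 1:
            means[b] = pts.mean(axis=0)
            covs[b] = np.cov(pts, rowvar=False)
        elif len(pts) == 1:
            means[b] = pts[0]
    return counts, means, covs
```

Similarly, a recurrence matrix over a sequence of per-frame descriptors can be sketched as the matrix of pairwise distances, optionally thresholded into a binary recurrence plot. Again, this is only a plausible reading of the idea described above; the Euclidean distance, the threshold `epsilon` and the descriptor dimensionality in the example are assumptions, not the thesis's actual choices.

```python
import numpy as np

def recurrence_matrix(descriptors, epsilon=None):
    """Recurrence matrix of a descriptor sequence (illustrative sketch).

    descriptors : (T, D) array, one D-dimensional descriptor per frame.
    epsilon     : optional distance threshold; if given, the matrix is
                  binarised (1 = the two frames are considered recurrent).
    """
    X = np.asarray(descriptors, dtype=float)
    diff = X[:, None, :] - X[None, :, :]       # pairwise differences, (T, T, D)
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # Euclidean distances, (T, T)
    if epsilon is None:
        return dist                            # real-valued recurrence matrix
    return (dist <= epsilon).astype(np.uint8)  # classical binary recurrence plot

# Usage example: 100 frames, each described by a 72-bin orientation histogram.
rm = recurrence_matrix(np.random.rand(100, 72), epsilon=0.5)
```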
This thesis proposes two mechanisms to address these limitations of BoW. Regarding the computational requirements, a random-projection-based approach has been studied to reach a trade-off between computation time and recognition accuracy. As for representing spatio-temporal relationships among visual words, a simple, novel variation of BoW, t-BoW, is proposed; it captures temporal relationships between pairs of words in an aggregated way by counting co-occurrences at several temporal differences.

Although many issues in action recognition certainly remain open, this thesis has proposed novel representations and methodologies that bring further insight into this challenging problem. In particular, two contributions are worth stressing. First, the concept of recurrence matrices is found to be flexible and powerful. Second, simple and effective means to overcome some limitations of the BoW formulation, such as t-BoW, seem promising. These ideas therefore lend themselves to further exploration, and to being compared with or combined with other state-of-the-art approaches.
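To make the t-BoW idea above concrete, the following sketch counts co-occurrences of visual-word pairs at several temporal differences and flattens the counts into a fixed-length vector. It assumes each detected STIP has already been assigned a visual word and a frame index; the chosen set of temporal lags, the function name and the dense pair-count representation are illustrative assumptions rather than the thesis's exact formulation.

```python
import numpy as np

def t_bow(words, frames, vocab_size, lags=(1, 5, 10)):
    """Temporal co-occurrence bag of words (illustrative sketch).

    words  : visual-word index of each interest point.
    frames : frame index of each interest point (same length as words).
    lags   : temporal differences (in frames) at which co-occurrences are counted.
    """
    words = np.asarray(words)
    frames = np.asarray(frames)
    feat = np.zeros((len(lags), vocab_size, vocab_size))
    for k, lag in enumerate(lags):
        for i in range(len(words)):
            # Pair point i with every point occurring exactly `lag` frames later.
            for j in np.where(frames == frames[i] + lag)[0]:
                feat[k, words[i], words[j]] += 1
    return feat.ravel()  # fixed-length vector, ready for a standard classifier

# Usage example: 200 interest points, a 50-word vocabulary, random assignments.
rng = np.random.default_rng(0)
x = t_bow(rng.integers(0, 50, 200), rng.integers(0, 80, 200), vocab_size=50)
```

A dense lags × vocab_size × vocab_size tensor is the simplest choice for a sketch; with larger vocabularies, a sparse accumulation or a coarser binning of temporal differences would be the natural refinement.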