Hierarchical representations for spatio-temporal visual attention modeling and understanding

  1. Fernández Torres, Miguel Ángel

Supervised by:
  1. Iván González Díaz (Supervisor)
  2. Fernando Díaz de María (Co-supervisor)

University of defense: Universidad Carlos III de Madrid

Date of defense: 15 February 2019

Committee:
  1. Luis Salgado Álvarez de Sotomayor (Chair)
  2. Ascensión Gallardo Antolín (Secretary)
  3. Jennifer Benoit-Bryan (Member)

Type: Thesis

Abstract

This technical abstract is organized as follows. First, we introduce the main focus of the PhD thesis: the study and development of hierarchical representations for spatio-temporal visual attention modeling and understanding. Then, we describe the objectives and contributions of the dissertation. Finally, we summarize the conclusions drawn from its main contributions, which serve to outline future lines of research.

1. Goals and context of the thesis

Within the framework of Artificial Intelligence, Computer Vision emerged in the late 1960s with the objective of automatically simulating the functions of the Human Visual System (HVS). Drawing from the visual information captured in digital images and video sequences, this interdisciplinary field seeks to discover good representations of the real world in order to carry out particular tasks such as object location and recognition, event detection or visual tracking. In spite of the wide variety of systems that are continuously released and improved to solve these tasks, some of them truly effective, they still need to process large amounts of visual information to achieve high performance, which dramatically impacts their efficiency. Human beings, however, inherently select the most important elements of the context they interact with and, moreover, are rapidly attracted by striking stimuli. This is thanks to the visual attention function of the HVS, which can be understood as an optimization process for visual cognition and perception. If we were able to design image-understanding algorithms that accomplish this operation, we could use them to reduce computational cost. At the same time, we would help users and experts deal with applications and complex scenarios that require processing large amounts of information simultaneously, such as driving, aviation and video surveillance, reducing the probability of human error and speeding up decision-making processes.

Visual attention can be readily identified in two different domains, spatial and temporal, which leads to three types of computational models of visual attention: spatial, spatio-temporal and temporal. Most existing models consider a spatial component to guide information processing towards conspicuous locations or areas of particular interest in a scene. Moreover, visual information in the real world is dynamic, so it is equally important to model how it changes over time: updating spatial attention based on previously selected locations allows modeling visual attention in a spatio-temporal manner, as well as selecting time segments of special importance. It is also common to distinguish between two families of visual attention models: stimulus-driven Bottom-Up (BU) models, which are based on visual features of the scene, and goal-driven Top-Down (TD) approaches, which take into account prior knowledge or advanced indications.

The main focus of this thesis thus concerns the study and development of hierarchical representations for spatio-temporal visual attention modeling and understanding. Specifically, the thesis makes the following two main contributions towards our goals:

- We introduce a hierarchical generative probabilistic model for context-aware visual attention modeling and understanding. Our first approach, which we have called the visual Attention TOpic Model (ATOM), models visual attention in the spatio-temporal domain by considering the concurrence of BU and TD factors.
- We develop a deep network architecture for visual attention modeling, oriented towards application in a video surveillance scenario. We have named our second proposal the Spatio-Temporal to Temporal visual ATtention NETwork (ST-T-ATTEN). It first estimates TD spatio-temporal visual attention, which ultimately serves to model visual attention in the temporal domain.

2. Objectives and contributions of the thesis

As introduced above, we have developed two computational systems for visual attention, which constitute the main contributions of this thesis.

The first part of the thesis introduces our first proposal: a generative probabilistic framework for spatio-temporal visual attention modeling and understanding. The proposed model, which we have called ATOM, is generic, independent of the application scenario and founded on the most outstanding psychological studies about attention. Drawing on the well-known Latent Dirichlet Allocation (LDA) method (David Blei et al., 2003) for the analysis of large corpora of data, and on some of its supervised extensions, our approach defines task- or context-driven visual attention in video as a mixture of latent sub-tasks, which are in turn represented as combinations of low-, mid- and high-level spatio-temporal features. In particular, we make the following contributions with our first approach:

- We introduce feature engineering for visual attention guidance, providing a wide set of handcrafted features, which are later used in our experiments. Starting from basic and novel spatio-temporal low-level features, such as color, intensity, orientation or motion, we move on to describe and model some mid- and high-level features related to camera motion estimation and object detection.
- Then, our algorithm incorporates an intermediate level formed by latent sub-tasks, which bridges the gap between features and visual attention and enables more comprehensible interpretations of attention guidance.
- Moreover, we generate a categorical binary response for each spatial location to model visual attention. This makes it possible to automatically align the discovered sub-tasks with a binary response by means of a logistic regression, which fully corresponds to the definition of human fixations.

The experiments related to our first approach provide an in-depth analysis of ATOM. For that purpose, our model is used for context-driven visual attention modeling and understanding on two large-scale video databases annotated with eye fixations: CRCNS-ORIG (L. Itti and R. Carmi, 2009) and DIEM (Parag K. Mital et al., 2011). We illustrate how our approach successfully learns hierarchical guiding representations adapted to several contexts. Moreover, we analyze the models obtained and perform a comparison with quite a few state-of-the-art methods.

The second part of the thesis describes our second proposal: a deep network architecture that goes from spatio-temporal visual attention prediction to attention estimation in the temporal domain. The proposed system, which we have named ST-T-ATTEN, models visual attention over time as a fixation-based response. First, we introduce the fundamental hypothesis of the second part of the thesis: attention in the temporal domain can be predicted using the dispersion of gaze locations recorded from several subjects. Indeed, visual attention in the temporal domain can be understood as a filtering mechanism that selects time segments of special importance in video sequences.
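As an illustration of this hypothesis, the following minimal sketch derives a frame-level temporal attention signal from multi-observer gaze data. The abstract does not specify the exact dispersion measure, so the mean distance to the gaze centroid, inverted and min-max normalized, is our assumption, and all names and dimensions are hypothetical:

```python
import numpy as np

def temporal_attention_from_gaze(gaze, eps=1e-8):
    """Frame-level temporal attention from multi-observer gaze.

    gaze: array of shape (T, N, 2) holding (x, y) fixation locations
    for T frames and N observers (assumed complete here; real
    eye-tracking data has gaps that would need masking).
    Returns an array of shape (T,) with values in [0, 1], where low
    gaze dispersion (inter-observer agreement) maps to high attention.
    """
    centroid = gaze.mean(axis=1, keepdims=True)                        # (T, 1, 2)
    dispersion = np.linalg.norm(gaze - centroid, axis=2).mean(axis=1)  # (T,)
    # Invert and min-max normalize: agreement -> attention close to 1.
    span = dispersion.max() - dispersion.min() + eps
    return 1.0 - (dispersion - dispersion.min()) / span

# Toy usage: 100 frames, 8 observers watching a 320x240 video.
rng = np.random.default_rng(0)
gaze = rng.uniform([0.0, 0.0], [320.0, 240.0], size=(100, 8, 2))
gaze[40:50] = 0.1 * gaze[40:50] + [160.0, 120.0]  # gazes converge: an "event"
print(temporal_attention_from_gaze(gaze)[35:55].round(2))
```

Under this convention, frames on which observers' gazes converge to the same region receive responses close to 1, marking candidate time segments of special importance.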
Such a filtering mechanism could hence be used to prevent human errors and speed up decision-making processes in real applications that require watching large amounts of visual information, such as video surveillance. We make the following particular contributions with this second approach:

- We describe three feature learning architectures for visual attention guidance, which provide input feature maps to our system: RGB-based spatial, motion-based and objectness-based networks.
- We propose a frame-level fixation-based temporal ground truth, computed from the dispersion of fixation locations across several subjects. Furthermore, we validate the fundamental hypothesis introduced above. We use this variable to train our models to estimate attention in the temporal domain.
- Our proposed ST-T-ATTEN is built on the combination of two modules: 1) a Spatio-Temporal visual ATtention NETwork (ST-ATTEN) for spatio-temporal visual attention estimation, which consists of a Convolutional Encoder-Decoder (CED) network (Fu Jie Huang et al., 2007); and 2) a Temporal ATtention NETwork (T-ATTEN) for modeling visual attention in the temporal domain, based on Long Short-Term Memory (LSTM) units (Sepp Hochreiter and Jürgen Schmidhuber, 1997), widely used for time series forecasting.

Finally, we describe the experiments conducted to validate the different configurations proposed for the ST-T-ATTEN modules. We make use of the BOSS database (BOSS European project, http://www.multitel.be/image/research-development/research-projects/boss.php. Accessed: 2016-09-30), which contains videos recorded in a railway transport context with different anomalous events, with the aim of determining the optimal configuration for the whole proposed ST-T-ATTEN, as well as motivating its use as an information filtering mechanism in a video surveillance application.

3. Conclusions

In this thesis we have proposed two hierarchical frameworks for visual attention modeling in video sequences. Visual attention can be modeled in two different domains, spatial and temporal, which leads to three types of computational models: spatial, spatio-temporal and temporal. First, spatial models highlight locations of particular interest on a frame-by-frame basis. Second, modeling attention in the temporal domain allows either updating spatial attention based on previously selected locations (spatio-temporal) or selecting time segments of special importance in a video (temporal).

The first part of the thesis introduces our first approach, called ATOM: a hierarchical generative probabilistic model for spatio-temporal visual attention prediction and understanding. The definition of the proposed system is generic and independent of the application scenario. Moreover, it is founded on the most outstanding psychological studies about attention (A. M. Treisman and G. Gelade, 1980; Jeremy M. Wolfe, 1994), which hold that attention guidance is not based directly on the information provided by early visual processes, but on a contextual representation arising from them. Relying on the well-known LDA and its supervised extensions, ATOM defines task- or context-driven visual attention in video as a mixture of several sub-tasks which, in turn, can be represented as a combination of low-, mid- and high-level spatio-temporal features obtained from video frames. Therefore, given a video frame, the algorithm receives a set of visual feature maps (color, intensity, motion, object-based, etc.) as input. Then, an intermediate level of latent sub-tasks between feature extraction and visual attention modeling is introduced. Finally, latent sub-tasks are aligned with the information drawn from human fixations by means of a categorical response variable, generated by a logistic regression model over the sub-task proportions.
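To make this generative story concrete, the following minimal sampling sketch follows the spirit of supervised LDA with a logistic link. All dimensions, priors and the per-location treatment of the response are illustrative assumptions rather than the exact formulation of the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K latent sub-tasks, V quantized feature "words"
# (discretized low-/mid-/high-level feature responses), L spatial
# locations per frame and W feature words observed at each location.
K, V, L, W = 5, 50, 64, 10

alpha = np.full(K, 0.5)                     # Dirichlet prior over sub-tasks
phi = rng.dirichlet(np.ones(V), size=K)     # per-sub-task feature distributions
eta = rng.normal(size=K)                    # logistic regression weights

def generate_frame():
    """Sample one frame under a supervised-LDA-style generative story:
    frame-level sub-task mixture -> feature words per location ->
    binary fixation response via logistic regression on the empirical
    sub-task proportions at that location."""
    theta = rng.dirichlet(alpha)            # sub-task proportions for the frame
    fixations = np.zeros(L, dtype=int)
    for loc in range(L):
        z = rng.choice(K, size=W, p=theta)  # sub-task behind each feature word
        words = [rng.choice(V, p=phi[k]) for k in z]  # observed features
        zbar = np.bincount(z, minlength=K) / W        # empirical proportions
        p_fix = 1.0 / (1.0 + np.exp(-eta @ zbar))     # logistic link
        fixations[loc] = rng.random() < p_fix         # binary fixation response
    return fixations

print(generate_frame().reshape(8, 8))
```

At inference time the roles reverse: the feature words are observed, and the latent sub-tasks and regression weights are estimated so that the predicted binary responses match recorded human fixations.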
The experiments related to our first approach have demonstrated its ability to successfully learn hierarchical representations of visual attention specifically adapted to diverse contexts (outdoors, video games, sports, TV news, etc.), on the basis of a wide set of features. For that purpose, we have made use of the well-known large-scale CRCNS-ORIG (L. Itti and R. Carmi, 2009) and DIEM (Parag K. Mital et al., 2011) databases. Experiments have shown the advantage of our comprehensible guiding representations based on handcrafted features for understanding how visual attention works in different scenarios. In addition, by modeling simple eye-catching elements, such as faces or text, through spatial discrete distributions, as well as considering object-based representations learned by recently adopted Convolutional Neural Networks (CNNs), our proposal significantly outperforms quite a few competent methods in the literature when estimating visual attention.

The second part of the thesis presents our second proposal, named ST-T-ATTEN. This second approach takes a step further and goes from spatio-temporal visual attention estimation to attention estimation in the temporal domain. The model is fundamentally supported by the assumption that a measurement of task-driven visual attention in the temporal domain can be drawn from the dispersion of fixation locations recorded from several observers. First, to demonstrate this hypothesis, we have measured the correlation between the eye fixation sequences of different viewers when an important or anomalous event happens in the BOSS database. Although this temporal level of attention constitutes a useful clue for detecting important events in crowded and complex scenarios, attention in the temporal domain should always be considered an early filtering mechanism, which selects candidate time segments that may contain suspicious events and therefore reduces the later processing devoted to the anomaly detection task.

Based on this hypothesis, we have developed ST-T-ATTEN, which attempts to model attention in the temporal domain from estimations of spatio-temporal visual attention. Inspired by the recent success of CNNs for learning deep hierarchical representations and of LSTM units for time series forecasting, the proposed ST-T-ATTEN is composed of two stages. The first stage, denoted ST-ATTEN, consists of a CED network that receives at its input three high-level feature maps for visual attention guidance (RGB-based, motion and objectness), all of them computed by deep CNNs. Then, through an encoding-decoding architecture, the network concurrently estimates spatio-temporal visual attention maps and extracts latent representations of visual attention. We have proposed two configurations for this module of the system, which differ in the outer layers of the encoder and the decoder: convolutional in the first approach and convolutional LSTM in the second one. The second stage of ST-T-ATTEN, called T-ATTEN, involves an LSTM-based architecture that estimates, for each frame in a video sequence, a temporal attention response. We have also distinguished between two versions of T-ATTEN, depending on the input variable: either the visual attention map at the output of ST-ATTEN or the latent representations generated by its encoder.
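As a rough illustration of this two-stage design, here is a minimal PyTorch sketch of the map-fed variant. Every class name, layer count and channel size is a hypothetical placeholder, not a configuration evaluated in the thesis:

```python
import torch
import torch.nn as nn

class STATTEN(nn.Module):
    """Sketch of the ST-ATTEN stage: a convolutional encoder-decoder
    that fuses three guidance maps (RGB-based, motion, objectness)
    into a spatio-temporal attention map plus a latent code."""
    def __init__(self, latent_ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),  # 3 stacked guidance maps
            nn.Conv2d(16, latent_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),  # attention map
        )

    def forward(self, x):                     # x: (B, 3, H, W)
        latent = self.encoder(x)              # latent representation
        return self.decoder(latent), latent

class TATTEN(nn.Module):
    """Sketch of the T-ATTEN stage: an LSTM mapping a per-frame
    feature vector to a scalar temporal attention response."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, seq):                   # seq: (B, T, in_dim)
        out, _ = self.lstm(seq)
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (B, T)

# Toy forward pass: 2 clips of 8 frames, each a 3-channel 64x64 input.
st, frames = STATTEN(), torch.rand(2 * 8, 3, 64, 64)
maps, latent = st(frames)                     # (16, 1, 64, 64), (16, 32, 16, 16)
seq = maps.flatten(1).view(2, 8, -1)          # map-fed variant: (B=2, T=8, 4096)
print(TATTEN(in_dim=64 * 64)(seq).shape)      # torch.Size([2, 8])
```

In the latent-fed variant, the sequence passed to TATTEN would instead be built by flattening the latent code, trading the full-resolution map for a more compact learned representation.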
The proposed ST-T-ATTEN architecture has been evaluated in a video surveillance scenario defined by the BOSS database, which contains video sequences recorded in a railway transport context with different types of suspicious or anomalous events (the harassment of several women, a cell phone theft, a fight between passengers, etc.). The main purpose of our experiments has been to assess the various architectures of our proposal. Experiments have concluded that the best performing architecture is composed of a convolutional ST-ATTEN stage, which successfully fuses the information provided by the three input feature maps. Then, feeding T-ATTEN with either the visual attention map obtained at the output of the ST-ATTEN decoder or the latent representations extracted by its associated encoder has resulted in similar performance in terms of the Pearson Correlation Coefficient (PCC) score.

Finally, we have also discussed two potential end-user applications of our proposal. On the one hand, given a video surveillance scenario, the temporal attention response could be applied to select in real time the most outstanding screens of the monitoring array, thus driving the operator's attention to scenes that potentially show anomalies or suspicious events. On the other hand, the estimated response could also be applied in off-line tasks that imply reviewing many hours of surveillance recordings, reducing the information to be processed by the operator. With some adjustments, our system might be able to provide CCTV operators with a complete experience of visual attention, not only highlighting the most conspicuous locations in a scene, but also selecting the most relevant time segments, according to both previous events in the scene and events happening in different camera views at the same time.

4. Future lines of research

Lastly, we conclude this abstract by identifying and discussing potential future lines of research related to our contributions. At this point, there is no doubt about the great benefits of visual attention modeling in the framework of Artificial Intelligence, nor about the countless possibilities that such an abstract concept opens up for the processing and understanding of this big data world. Despite the wide variety of computational models of visual attention in the literature, much remains to be done, not only to arrive at a system that automatically addresses this cognitive function, but also to understand how the HVS carries out this optimization process. Referring to the two representation learning paradigms popular nowadays, Deep Learning (DL) and Probabilistic Graphical Models (PGM), our contributions have shown the importance of both the task of seeing, performed by DL representations, and the ability of thinking, characteristic of PGMs, for visual attention modeling and understanding. First, it is important to achieve good representations of the world that surrounds us for attention guidance, and it is here where Deep Neural Network (DNN) architectures and, in particular, CNNs play an essential role in machine perception.
In addition, given that visual attention involves not one but several complex tasks, it is paramount to understand how computational visual attention deals with the hierarchical representations provided by DNNs, through probabilistic methods that explain the relationships between the observed variables. This direction, recently set by Bayesian Deep Learning (BDL), is the one we plan to follow in our future research, paying special attention to BDL for topic models, which constitutes a revision of probabilistic Latent Topic Models (LTM), on the basis of which our hierarchical ATOM framework for visual attention understanding has been built. Discovering sub-tasks not only over space but also along time will allow establishing relationships between recognized concepts in one or multiple video sequences, whether in the same scene or in different ones.

Secondly, in the latter part of the thesis, we have demonstrated the major advantages of modeling visual attention in the temporal domain, selecting video segments of special importance, which helps to reduce the computational burden of subsequent end-user applications. Visual attention has barely been tackled from this perspective in the literature to date, in spite of its usefulness for the processing and analysis of vast amounts of visual information in applications such as anomaly detection. One interesting research line we have not covered in this thesis is the interpretation of eye movement sequences, establishing relationships between the content of fixated locations. This would allow the development of more comprehensible and valuable systems for estimating the variation of visual attention over time. Reinforcement learning methods seem a promising way of addressing this challenge.

Finally, we are highly motivated to model spatio-temporal visual attention, as well as attention in the temporal domain, given multiple video sequences played at the same time, with the aim of assisting experts in crowded and complex scenarios. For that purpose, we will soon proceed to annotate large-scale video surveillance databases with human fixations, which will serve for further analysis and for improving the proposed deep ST-T-ATTEN architecture.