Epistemological Reasoning for the Interpretation of coNtext and securitY Events for Surveillance (2010-2012)

This proposal describes the ERINYES project, which aims to facilitate automatic, high-level and semantically meaningful interpretations of scenes in surveillance video. The focus of the ERINYES approach is on the interrelated cognitive topics that are required for content-based access to video in repositories of surveillance footage. These topics include behavior analysis and representation, semantic analysis and access, learning from and mining of motion data, and interaction with and visualization of video data. Inferring the semantics of human behaviors in video and determining indices for content-based video retrieval and analysis are very difficult problems. Pursuit of the goals of ERINYES will broaden expertise in human behavior understanding and advance the state of the art in content-based retrieval of surveillance data based on natural-language interactions. Our experience in previous projects and the current trends in the research community indicate two fundamental directions to be pursued. First, the utility of imagery provided by active cameras can be significantly improved by enabling them to focus attention on semantic concepts of human behavior. Second, contextual semantics have been identified as essential to rich, meaningful scene interpretation. Exploiting the knowledge inferred from scene context can help to robustly interpret what is happening and why it is happening. The novelty of the ERINYES project is its focus on integrating these two aspects crucial to scene interpretation. We will introduce the concept of semantic foveation: the direction of active sensors' attention on the basis of semantics. This will enable systems responsible for recognizing and reasoning about contextual semantics to request changes in attentional focus that provide the information most useful to completing their interpretation of an unfolding scene.
ERINYES will ensure that such requests can be expressed using the rich vocabulary of contextual semantics. We will also explore techniques for automatically learning context from example video. To accomplish this, we will investigate methods for identifying regions of interest where humans are likely to appear, for extracting semantic knowledge from such regions, and for reasoning about specific human behaviors and the objects with which they are expected to interact. This integrative approach will close the loop between image sensors and the semantic interpretation of scenes in terms of human-understandable concepts. We aim to design an Artificial Cognitive System that extracts descriptions of interpreted human behaviors and complex situations from video sequences recorded by a network of active cameras. In doing so, we will put forward new computational models and advance the field of content-based retrieval of surveillance data based on natural-language interactions. Context will provide the semantic vocabulary, and semantic foveation will allow reasoning systems to articulate requests for attentional direction in meaningful terms.
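To make the semantic-foveation loop concrete, the sketch below shows one possible shape for it: a reasoning module phrases attention requests in semantic terms (a concept and a named scene region learned as context), and a controller translates the highest-priority request into a command for an active camera. This is purely illustrative; the proposal does not specify an API, and every class, method, and preset name here is a hypothetical assumption.

```python
# Illustrative sketch only: all names are hypothetical, invented to show
# how a reasoning system might articulate attention requests in the
# semantic vocabulary rather than in low-level camera coordinates.
from dataclasses import dataclass

@dataclass
class AttentionRequest:
    """A request expressed in semantic terms, not in pixels."""
    concept: str     # e.g. "person loitering"
    region: str      # a named scene region learned from context
    priority: float  # usefulness to the unfolding interpretation

class ActiveCamera:
    """Stand-in for a pan-tilt-zoom camera controller."""
    def __init__(self):
        self.target = None

    def foveate(self, preset: str) -> None:
        # A real system would issue pan/tilt/zoom commands here.
        self.target = preset

class SemanticFoveationLoop:
    """Closes the loop between semantic reasoning and sensor control."""
    def __init__(self, camera: ActiveCamera, region_map: dict):
        # region_map: semantic region name -> camera preset (hypothetical)
        self.camera = camera
        self.region_map = region_map

    def handle(self, requests: list) -> str:
        # Serve the request most useful to completing the interpretation.
        best = max(requests, key=lambda r: r.priority)
        preset = self.region_map[best.region]
        self.camera.foveate(preset)
        return preset

loop = SemanticFoveationLoop(
    ActiveCamera(),
    region_map={"entrance": "preset_3", "parking_lot": "preset_7"},
)
chosen = loop.handle([
    AttentionRequest("person loitering", "entrance", priority=0.9),
    AttentionRequest("vehicle stopping", "parking_lot", priority=0.4),
])
print(chosen)  # the camera is steered toward the entrance region
```

The key design point this sketch illustrates is that the reasoning side never manipulates the camera directly; it only names concepts and regions from the learned contextual vocabulary, leaving the mapping to sensor commands behind an interface.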