One aim is the generation of textual descriptions explaining observed human behaviours. Semantic descriptions will support conceptual abstraction, thereby facilitating the communication of short messages or essential details. Another goal is explanation using visual languages, in our case a virtual environment for generating synthetic behaviours that reproduce conceptual descriptions.
Towards this end, we have modelled human behaviours by taking into account that our information source has to be compatible with the output of a tracking system. The framework is based on the Fuzzy Metric Temporal Horn Logic (FMTHL) and Situation Graph Tree (SGT) formalisms, which represent the knowledge required for human behaviour modelling; here, behaviour is defined as agent trajectories that acquire a meaning in a specific context.
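To make this representation concrete, the following is a minimal sketch of how a fuzzy metric-temporal fact derived from tracking output might be encoded. The predicate name, attribute terms, and the trapezoidal membership function are illustrative assumptions, not the actual FMTHL implementation.

```python
from dataclasses import dataclass

@dataclass
class FuzzyTemporalFact:
    """A fact that holds over a frame interval with a degree of validity in [0, 1]."""
    predicate: str          # e.g. "has_speed"
    arguments: tuple        # e.g. ("agent_1", "high")
    start_frame: int
    end_frame: int
    degree: float           # fuzzy degree of validity

def degree_of_validity(value, a, b, c, d):
    """Trapezoidal membership function: a common way to attach fuzzy degrees
    to numeric attributes (e.g. mapping a speed in m/s to the term 'high')."""
    if value <= a or value >= d:
        return 0.0
    if b <= value <= c:
        return 1.0
    if a < value < b:
        return (value - a) / (b - a)
    return (d - value) / (d - c)

# Example: a tracked agent whose speed of 1.6 m/s is 'high' to degree 0.6
speed_fact = FuzzyTemporalFact(
    predicate="has_speed",
    arguments=("agent_1", "high"),
    start_frame=120,
    end_frame=180,
    degree=degree_of_validity(1.6, a=1.0, b=2.0, c=4.0, d=6.0),
)
print(speed_fact)
```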
Generation of natural-language texts
Discourse Representation Theory (DRT) has proven to be of particular interest for the generation of natural-language texts, since it provides algorithms for translating coherent NL textual statements into computer-internal representations by means of logical, language-independent predicates.
In essence, Discourse Representation Structures (DRSs) establish the basic link between semantics and syntax in multiple languages, thus allowing the derivation of NL textual descriptions that best match the corpus provided by native users. Further language-specific morphological considerations for Catalan, English, and Spanish have added flexibility and naturalness to the resulting texts.
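As an illustration, the sketch below encodes a toy DRS for a single statement as discourse referents plus conditions; the referent names, condition strings, and per-language templates are placeholders for the actual DRT machinery and surface realizers.

```python
from dataclasses import dataclass, field

@dataclass
class DRS:
    """A Discourse Representation Structure: discourse referents plus
    logical conditions over them (kept here as simple strings)."""
    referents: list = field(default_factory=list)
    conditions: list = field(default_factory=list)

# Illustrative DRS for the statement "A pedestrian crosses the street"
drs = DRS(
    referents=["x", "y", "e"],
    conditions=["pedestrian(x)", "street(y)", "cross(e, x, y)"],
)

# A language-specific surface realizer would map the same DRS to sentences
# in different languages; fixed templates stand in for that step here.
templates = {
    "english": "A pedestrian crosses the street.",
    "spanish": "Un peatón cruza la calle.",
    "catalan": "Un vianant creua el carrer.",
}
print(drs)
print(templates["english"])
```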
The resulting NL texts can be interpreted as semantic tags that provide content-based segmentation of image sequences. Thus, video sequences are segmented into time intervals showing cohesive information in order to extract a set of semantic shots.
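A minimal sketch of this segmentation step, assuming per-frame behaviour tags are already available from the interpretation stage: consecutive frames sharing the same tag are grouped into candidate semantic shots.

```python
def semantic_shots(frame_tags):
    """Group consecutive frames sharing the same semantic tag into
    (start_frame, end_frame, tag) intervals, i.e. candidate semantic shots."""
    shots = []
    start = 0
    for i in range(1, len(frame_tags) + 1):
        if i == len(frame_tags) or frame_tags[i] != frame_tags[start]:
            shots.append((start, i - 1, frame_tags[start]))
            start = i
    return shots

# Per-frame behaviour tags as they might be produced by the interpretation stage
tags = ["walking"] * 4 + ["waiting"] * 3 + ["crossing"] * 5
print(semantic_shots(tags))
# [(0, 3, 'walking'), (4, 6, 'waiting'), (7, 11, 'crossing')]
```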
This way, we aim to include motion patterns in video-annotation standards such as MPEG-7, thus allowing high-level indexing related to human behaviour concepts. In addition, we are currently studying how to connect a user-interaction stage that accepts NL-based queries. This is the starting point for a search engine capable of retrieving video sequences from a large database based on human behaviour content.
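As a sketch of such a retrieval stage, the toy index and keyword matcher below stand in for the actual annotation schema and NL query parsing; the field names and entries are invented for illustration and do not follow the MPEG-7 format.

```python
# A toy annotation index: each entry describes one semantic shot of a sequence.
annotations = [
    {"video": "seq01.mpg", "start": 0,   "end": 230, "behaviour": "pedestrian waiting at crosswalk"},
    {"video": "seq01.mpg", "start": 231, "end": 410, "behaviour": "pedestrian crossing the street"},
    {"video": "seq02.mpg", "start": 0,   "end": 180, "behaviour": "pedestrian running on sidewalk"},
]

def retrieve(query, index):
    """Return the shots whose behaviour description contains every query word.
    A real engine would parse the NL query into the same conceptual predicates
    used for annotation; plain keyword matching stands in for that step here."""
    words = query.lower().split()
    return [a for a in index
            if all(w in a["behaviour"].lower() for w in words)]

print(retrieve("pedestrian crossing", annotations))
```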
We have used Xface, an open-source, platform-independent toolkit for developing a 3D talking head, which relies on the MPEG-4 standard for facial animation. Based on this toolkit, the NL texts reported via synthetic speech will include knowledge about the agent and any relevant objects, paths, locations, manners, and purposes.
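The conceptual content attached to each reported sentence could be organized as a case frame over those roles; the sketch below is an illustrative data structure with a naive template realization, not the system's actual DRS-based generation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReportedEvent:
    """Conceptual content of one reported sentence; field names follow the
    roles mentioned in the text (agent, object, path, location, manner, purpose)."""
    agent: str
    action: str
    obj: Optional[str] = None
    path: Optional[str] = None
    location: Optional[str] = None
    manner: Optional[str] = None
    purpose: Optional[str] = None

event = ReportedEvent(
    agent="the pedestrian",
    action="walks",
    path="along the sidewalk",
    location="in front of the bank",
    manner="slowly",
    purpose="to reach the crosswalk",
)

# A naive template realization; the real system derives the sentence from a DRS.
parts = [event.agent, event.action, event.obj, event.path,
         event.location, event.manner, event.purpose]
print(" ".join(p for p in parts if p) + ".")
```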
We also plan to embed facial expressions, since emotions increase the believability of and interest in virtual humans. Towards this end, NL will be used to parameterize emotions.
Thus, linguistic features will determine how an explanation should be conveyed by virtual reporters with an emotional personality, i.e., a first step towards the creation of consistent individuality.
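As a rough illustration of such a parameterization, the mapping below associates cue words in the generated text with an emotion label and intensity; the cue words, labels, and intensities are assumptions and are unrelated to the Xface API.

```python
# Illustrative mapping from linguistic cues found in the generated text to
# a coarse emotion label and an intensity used to drive facial animation.
EMOTION_CUES = {
    "abandoned": ("concern", 0.8),
    "running":   ("alert",   0.6),
    "danger":    ("fear",    0.9),
    "waiting":   ("neutral", 0.2),
}

def emotion_for_sentence(sentence, default=("neutral", 0.1)):
    """Pick the strongest emotion triggered by any cue word in the sentence."""
    hits = [EMOTION_CUES[w] for w in sentence.lower().split() if w in EMOTION_CUES]
    return max(hits, key=lambda e: e[1]) if hits else default

print(emotion_for_sentence("A pedestrian is running towards the danger zone"))
```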
Generation of virtual environments
We have developed a framework to automatically generate synthetic image sequences by designing and simulating complex human behaviours in virtual environments based on interpreted behaviours. At present, a simulation process based on the SGT formalism generates synthetic states by means of precomputed human action models, so that synthetic behaviours take into account the relationship of each synthetic agent with respect to its environment at each frame step.
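The sketch below mimics this frame-by-frame loop under simplifying assumptions: a stand-in situation test selects the next action, and a stand-in action model produces the next synthetic state; neither corresponds to the actual SGT traversal or precomputed action models.

```python
import math
from dataclasses import dataclass

@dataclass
class AgentState:
    x: float
    y: float
    action: str

def next_action(state, goal):
    """Stand-in for traversing an SGT: the situation the agent is in
    (here, only its distance to the goal) selects the next action."""
    dist = math.hypot(goal[0] - state.x, goal[1] - state.y)
    return "stand" if dist < 0.5 else "walk"

def apply_action_model(state, goal, action, step=0.4):
    """Stand-in for a precomputed human action model that produces the
    synthetic state of the next frame."""
    if action == "stand":
        return AgentState(state.x, state.y, action)
    dist = math.hypot(goal[0] - state.x, goal[1] - state.y)
    return AgentState(state.x + step * (goal[0] - state.x) / dist,
                      state.y + step * (goal[1] - state.y) / dist,
                      action)

# Frame-by-frame simulation of one synthetic agent walking towards a goal
state, goal = AgentState(0.0, 0.0, "stand"), (3.0, 2.0)
for frame in range(12):
    action = next_action(state, goal)
    state = apply_action_model(state, goal, action)
    print(f"frame {frame:02d}: ({state.x:.2f}, {state.y:.2f}) {state.action}")
```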
The resulting behaviour is visualized within a virtual environment using a 3D graphics engine. The results obtained have proven very helpful for re-visualizing recognized human behaviours and for automatically generating synthetic sequences.
Generation of augmented reality
In order to gradually increase the complexity of an image sequence previously recorded in a real scenario, the number of moving targets should be increased: the more agents involved in the scenario, the more complex the image sequence becomes. However, the initial conditions of the scenario must be preserved in order to avoid distortions caused by, e.g., illumination changes or alterations of the background configuration.
To this end, a set of virtual agents is modelled, either by assigning predefined trajectories or by simulating behaviour models, and these agents are rendered into a synthetic image sequence. The recorded and synthetic image sequences are then fused into a new image sequence containing both real and virtual agents in the real scenario. The resulting image sequence increases complexity in terms of occlusions, camouflages, and events.
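A minimal sketch of the fusion step, assuming the renderer provides, for each frame, the synthetic image and a mask marking the pixels covered by virtual agents; the real compositing pipeline may differ.

```python
import numpy as np

def fuse_frames(real_frame, synthetic_frame, agent_mask):
    """Composite rendered virtual agents onto a recorded frame.
    agent_mask is 1.0 where a virtual agent was rendered and 0.0 elsewhere;
    intermediate values blend the two sources at soft edges."""
    mask = agent_mask[..., np.newaxis]          # broadcast over RGB channels
    fused = mask * synthetic_frame + (1.0 - mask) * real_frame
    return fused.astype(real_frame.dtype)

# Toy 4x4 example: a recorded frame, a rendering of one virtual agent,
# and the binary mask marking the pixels covered by that agent.
real = np.full((4, 4, 3), 100, dtype=np.uint8)
synthetic = np.zeros((4, 4, 3), dtype=np.uint8)
synthetic[1:3, 1:3] = (255, 0, 0)               # the rendered agent
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0

fused = fuse_frames(real, synthetic, mask)
print(fused[:, :, 0])                            # red channel: agent pixels are 255
```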