Interactive Semantic Video Search with a Large Thesaurus of Machine-Learned Audio-Visual Concepts (2007-2010)

Video is vital to society and the economy. It plays a key role in news, cultural-heritage documentaries, and surveillance, and it will soon be a natural form of communication on the Internet and mobile phones. Digital video will bring more formats and opportunities, and both consumers and professionals will need advanced storage and search technology to manage large-scale video assets. This project takes on the challenge of creating substantially enhanced semantic access to video, implemented in a search engine. Video search engines are the product of progress in many technologies: visual analysis, audio processing, machine learning, and visualization and interaction technology. A good measure of overall progress is the TRECVID competition, which poses increasingly hard and realistic problems and obliges participants to discuss their methods openly. In 2005, the best interactive video search came from the coordinating partner of this project, the University of Amsterdam. VIDI-Video will boost the performance of video search by forming a 1,000-element thesaurus of detectors for instances of audio, visual, or mixed-media content. In the US, a similar grand challenge has been posed. This project's approach is to let the system learn many, possibly weaker, detectors instead of modelling a few of them carefully: the combination of many detectors, each describing a different aspect of the video content, provides a much richer basis for the semantics.
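
To make the many-weak-detectors idea concrete, the following is a minimal sketch, not the project's implementation, of how per-shot confidence scores from a large concept thesaurus might be fused to rank shots for a semantic query. All names, array sizes, and concept indices below are hypothetical:

```python
import numpy as np

# Hypothetical setup: each video shot gets a score in [0, 1] from each of
# N machine-learned concept detectors (e.g. "speech", "crowd", "explosion").
N_CONCEPTS = 1000
N_SHOTS = 5000
rng = np.random.default_rng(0)
detector_scores = rng.random((N_SHOTS, N_CONCEPTS))  # stand-in for real detector output

def rank_shots(query_weights: np.ndarray, scores: np.ndarray, top_k: int = 10):
    """Rank shots by a weighted combination of many (possibly weak) detectors.

    query_weights maps a semantic query onto the concept thesaurus; even if
    each individual detector is unreliable, combining many of them gives a
    richer basis for ranking than a few carefully modelled detectors.
    """
    combined = scores @ query_weights          # one fused relevance score per shot
    return np.argsort(combined)[::-1][:top_k]  # indices of best-matching shots

# Example query: emphasise a handful of concepts relevant to, say, "street protest".
weights = np.zeros(N_CONCEPTS)
weights[[12, 87, 430]] = [0.5, 0.3, 0.2]       # hypothetical concept indices
print(rank_shots(weights, detector_scores))
```

A practical attraction of this fusion scheme is graceful degradation: a query drawing on many detectors still ranks usefully when a few of them misfire.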

The consortium presents excellent expertise and resources: machine learning with active one-class classifiers, which minimizes the need for annotated examples, is led by the University of Surrey, UK. Video stream processing is led by CERTH, Greece. Audio event detection is led by INESC-ID, Portugal. Visual image processing is led by the University of Amsterdam, the Netherlands. The University of Florence, Italy, leads the efforts in interaction, and CVC, Spain, leads software consolidation. Finally, Beeld & Geluid, the Netherlands, and FRD, Italy, as application stakeholders, provide data and perform evaluation and dissemination. The concrete output will be a fully implemented audio-visual search engine consisting of two main parts, a learning system and a runtime system, where the former feeds its results into the latter after each round of training and thesaurus update.
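
As an illustration of active one-class learning, here is a minimal sketch that uses scikit-learn's OneClassSVM as a stand-in classifier; the project's actual method, data, and query strategy are not specified here, and the annotator interaction is simulated purely to keep the sketch self-contained:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical illustration: train on a small set of positive examples only,
# then repeatedly ask a (simulated) annotator to label the unlabelled samples
# the model is least certain about, minimizing annotation effort.
rng = np.random.default_rng(1)
positives = rng.normal(loc=0.0, scale=1.0, size=(20, 64))    # few labelled examples
unlabelled = rng.normal(loc=0.5, scale=2.0, size=(500, 64))  # large unlabelled pool

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(positives)

for _ in range(3):  # a few active-learning rounds
    # Decision values near zero lie close to the boundary: most informative.
    margins = np.abs(model.decision_function(unlabelled))
    query_idx = np.argsort(margins)[:5]          # the 5 most uncertain samples
    # In a real system the annotator labels these; here we accept them all
    # as positives purely to keep the example runnable.
    positives = np.vstack([positives, unlabelled[query_idx]])
    unlabelled = np.delete(unlabelled, query_idx, axis=0)
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(positives)

print("positives after active learning:", len(positives))
```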

The learning system will consist of software to be developed for overall video processing, visual analysis, audio analysis, integrated feature detection, and multimedia querying with a user interface. All subsystems will be delivered both stand-alone and integrated into the two final, connected systems. This modularity, with each subsystem also usable on its own, allows independent development and efficient exploitation, since commercial opportunities often target components rather than entire systems.
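
One way such stand-alone yet composable subsystems could be structured is sketched below; the interface, class names, and placeholder features are hypothetical and not the project's actual design:

```python
from typing import Protocol

class Subsystem(Protocol):
    """Hypothetical common interface: each subsystem runs stand-alone on a
    video record, or is composed with others into the integrated pipeline."""
    def process(self, video: dict) -> dict: ...

class VisualAnalysis:
    def process(self, video: dict) -> dict:
        video["visual_features"] = ["colour-histogram", "edges"]  # placeholder
        return video

class AudioAnalysis:
    def process(self, video: dict) -> dict:
        video["audio_events"] = ["speech", "music"]  # placeholder
        return video

def integrated_pipeline(video: dict, subsystems: list[Subsystem]) -> dict:
    # The same components that work stand-alone run in sequence when integrated.
    for subsystem in subsystems:
        video = subsystem.process(video)
    return video

print(integrated_pipeline({"id": "shot-001"}, [VisualAnalysis(), AudioAnalysis()]))
```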

Prominent market players such as Microsoft, Yahoo!, and especially Google are entering the huge video market in full force, buying up existing resources, rights, and search engines. It should be noted, however, that none of these search engines comes close to what is targeted in this project. A close eye will be kept on this very dynamic and promising market as the exploitation and dissemination plans are drawn up and regularly updated.