DIMACS Workshop on Video Mining

November 4 - 6, 2002

Yiannis Aloimonos, University of Maryland, College Park

Title: Spatiotemporal Representations from Image Sequences: From Illusions to Video Mining

We present a new uncertainty principle governing visual processing. Because of noise in images, features such as points, lines and image movement cannot be estimated accurately; the estimates suffer from an inherent statistical bias. This bias is the reason for a large number of visual illusions. At the same time, this bias has several consequences for spatiotemporal representations that can be used for video mining operations. We concentrate on one such cognitively impenetrable representation, namely motion segmentation, and we outline the framework of video grammars, which, coupled with motion segmentation, becomes a formal tool for video mining. Joint work with Cornelia Fermuller.

Arnon Amir, IBM Almaden Research Center

Title: Efficient Video Browsing using Multiple Synchronized Views

Humans can browse text documents in a very efficient manner. A user can find within seconds the relevant document from ten retrieved items shown on the screen. Browsing of multiple audio and video documents, however, can be very time-consuming. Even the task of browsing a single one-hour video to find a relevant segment might take considerable time. Different visualization tools have been developed to assist in this task, such as storyboard, fast playback and video summarization. The first part of the talk will cover several views derived from original videos, including slide shows, speech speedup, and accelerating fast forward. These views are then integrated into a search and retrieval system that allows the user to browse through the search results and to switch between views while maintaining their respective locations in the video. Special architecture considerations are required for a web-based, client-server implementation. Support for multiple synchronized views is included in the MPEG-7 standard. Joint work with Dulce Ponceleon and Savitha Srinivasan.

Shih-Fu Chang, Columbia University

Title: Video Indexing, Summarization, and Adaptation

Recently, researchers have been very active in developing automatic techniques for audio-visual content description. Such descriptions facilitate new applications such as multimedia search engines, personalized media filters, and intelligent video navigators. My group has particularly focused on emerging applications in two areas: personal media recorders (e.g., the settop box environment) and mobile multimedia services.

In the media recorder environment, the goal is to automatically extract information about program structures and important events. Our approaches have been based on three equally important principles: multimedia fusion, preserving content production syntax, and modeling the human perceptual process. To illustrate these principles, we will present our video indexing projects in the sports, film, and medical domains.

The results of the above content analysis research can be extended to other applications, such as ubiquitous media. In this area, we advocate a new approach, called content-adaptive video streaming and skimming. In content-adaptive streaming, bandwidth allocation is dynamically adapted based on the detected importance of content in each video segment. Such adaptation enables improved video quality and new interactive features over limited communication bandwidth. In skimming, a temporally shortened version of the video is automatically generated to meet the user's preference while preserving a maximal amount of information. Our research explores the production syntax, human comprehension models, and a utility optimization framework for generating the optimal skims.
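The idea of adapting bandwidth to content importance can be illustrated with a minimal sketch. It assumes each segment already has a non-negative importance score and simply splits a fixed budget in proportion to those scores above a per-segment floor. The function name, the floor parameter, and the proportional rule are illustrative assumptions, not the allocation scheme used in the talk.

```python
def allocate_bandwidth(importance, total_kbps, floor_kbps=0.0):
    """Split total_kbps across segments in proportion to importance scores.

    Hypothetical helper: every segment gets floor_kbps, and the remaining
    budget is shared in proportion to the (non-negative) importance scores.
    """
    if not importance:
        return []
    if any(w < 0 for w in importance):
        raise ValueError("importance scores must be non-negative")
    n = len(importance)
    reserved = floor_kbps * n
    if reserved > total_kbps:
        raise ValueError("per-segment floor exceeds the total budget")
    spare = total_kbps - reserved
    weight_sum = sum(importance)
    if weight_sum == 0:
        # No segment stands out, so share the budget evenly.
        return [total_kbps / n] * n
    return [floor_kbps + spare * w / weight_sum for w in importance]


# Three segments; the middle one was detected as the most important.
rates = allocate_bandwidth([1.0, 3.0, 1.0], total_kbps=500.0, floor_kbps=50.0)
```

The floor keeps low-importance segments watchable, while the proportional share gives the detected highlights most of the spare capacity.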

In addition, I will present my personal views about potential application areas which may benefit the most from research on automatic content analysis.

Daniel DeMenthon, University of Maryland, College Park

Title: Video Indexing and Retrieval using Spatio-Temporal Descriptions of Sequences

We describe a system presently under development that lets users with access to large collections of videos take a short query video sequence and identify and rank all occurrences of "similar" sequences in the collection. Collections of videos are analyzed to produce sequences of spatio-temporal descriptors that summarize the location, color and dynamics of independently moving regions with only a small number of bytes. Similarity between video sequences is defined using these descriptors. A distance measure has been developed between such descriptors. It is used to generate a tree structure for each video database and to provide efficient retrieval. Joint work with Remi Megret and David Doermann.

Arjen P. de Vries, CWI, The Netherlands

Title: Database Techniques and Video Data Management

Database technology has been extremely successful for administrative data, mainly because it offers a balance between flexibility and efficiency. Flexibility is obtained by enforcing data independence, a strict separation between requests expressed in a declarative query language and the actual approach to computing the answer to the request. Changing the physical properties of the data, such as the storage scheme or its access structures, does not affect client programs. Efficiency is obtained through query optimization, in the translation from the original declarative query into the query plan expressed in terms of physical operators.

The talk discusses our approach to generalizing these ideas to applications that handle video data rather than administrative data. Existing database technology does not accommodate video data management well, and we illustrate the problems encountered with both flexibility (expressiveness of the query languages) and efficiency (mainly related to data volume) in the setting of our participation in the Video Track at TREC-10. We discuss our research direction in solving these problems through better support for array data types at the conceptual level, supported by novel query processing techniques at the physical level. The proposed extensions to database management systems aim to regain an acceptable balance between flexibility and efficiency.

Nevenka Dimitrova, Philips Research USA

Title: Multimedia Story Segmentation

Today users have to cope with overwhelming numbers of TV channels and Web content sources. We introduce automatic content augmentation, as a novel approach to contextual information extraction on behalf of the user, where the context is provided by the primary content source (i.e., TV channel) and tailored to the user's preferences. There are three key aspects to content augmentation: (i) automatic extraction of annotations from the primary content source, (ii) Web Information Extraction, which automatically derives structured information from unstructured Web documents, and (iii) user modeling and personalization of the augmented content, including query formulation and relevance feedback. In this presentation I will focus on automatic extraction and annotation of stories from TV news programs for content augmentation. I will present a method of multimedia story boundary segmentation, which we call single descent. In contrast to text-based stories, which describe a complete narrative structure, multimedia stories combine text with audio and visual information. Basic to the method proposed here is the observation that multimedia stories are characterized by multiple intervals of constant or slowly varying multimedia attributes such as color family histograms, mid-level audio categories, and text categories. The constancy is measured with respect to a numerical description of the attribute, such as a set of probabilities. We assume that there always exists a dominant attribute, such as text categories, that drives the initial story segmentation. We show different methods of combining all the segments, anchored by the dominant modality/attribute segments, by using conditional set operators such as union and/or intersection of uniform segments.
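The set operators mentioned above can be sketched on time intervals. The example assumes each attribute has already produced a sorted list of non-overlapping (start, end) segments in seconds over which it is roughly constant; the attribute names and values are made up, and this is an illustration of interval intersection and union rather than the single descent method itself.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Merge two interval lists, fusing intervals that touch or overlap."""
    merged = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical uniform segments: text category is the dominant attribute.
text_segments = [(0, 60), (80, 130)]
audio_segments = [(5, 55), (70, 140)]

strict_stories = intersect(text_segments, audio_segments)
loose_stories = union(text_segments, audio_segments)
```

Intersection keeps only the spans where both attributes agree, giving conservative story segments, while union extends the dominant segments wherever the other attribute is also uniform.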

Ajay Divakaran, Mitsubishi Electric Research Laboratories

Title: Video Indexing and Summarization using the Motion Activity Descriptor

MPEG-7 or "Multimedia Content Description Interface" is a recently proposed standard that will enable content-based browsing of multimedia databases much as text is browsed on the World Wide Web today. We present the MPEG-7 motion activity descriptor, our invention, and its applications. The descriptor is compact and can be extracted in the compressed domain, so it is easy to extract and match. It captures the gross motion characteristics of a video segment. It enables effective indexing of video, either by itself or in combination with other features. We will describe a video summarization technique based on motion activity. Finally, we will demonstrate an application of our work to rapid video browsing.
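As a rough illustration of how a compact activity measure might be computed from compressed-domain data, the sketch below assumes that activity intensity is approximated by the standard deviation of block motion-vector magnitudes and then quantized to a small integer level. The function names and the quantization thresholds are hypothetical placeholders and are not taken from the standard or from the talk.

```python
import math
import statistics


def activity_intensity(motion_vectors):
    """Standard deviation of motion-vector magnitudes for one frame or segment."""
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    return statistics.pstdev(magnitudes) if magnitudes else 0.0


def quantize(value, thresholds=(1.0, 3.0, 6.0, 12.0)):
    """Map a raw intensity to a level from 1 to 5 (thresholds are made up)."""
    return 1 + sum(value >= t for t in thresholds)


# Hypothetical motion vectors (dx, dy) read from a compressed stream.
static_shot = [(0, 0), (0, 0), (0, 0), (0, 0)]
busy_shot = [(0, 0), (10, 0), (0, 20), (3, 4)]

static_level = quantize(activity_intensity(static_shot))
busy_level = quantize(activity_intensity(busy_shot))
```

Because the result is a single small integer per segment, comparing two segments reduces to comparing two numbers, which is what makes such a descriptor cheap to match.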

Huiping Li, Xavier Gibert-Serra and David Doermann, University of Maryland, College Park

Title: Automatic Genre Classification of Video

Recent advances in digital camera technology, computing power and digital networking have enabled widespread access to a wide variety and large quantities of digital video. How to organize and annotate these video sources is the focus of a great deal of recent research in this field. Classification of digital video into categories such as sports, news, movies, commercials, documentaries and surveillance is an important task which will lead to greater efficiency in indexing, filtering, retrieval and browsing of the data from diverse sources or large repositories.

We will present work on using statistical methods to classify video genres. Features are extracted at the physical level (motion, color), content level (text, face, logo, sound), and structural level (shot length, scene transitions) and are used for training and classification. One method of particular interest is the Hidden Markov Model (HMM). We will present our experimental results and discuss the advantages and challenges of using an HMM-based scheme.
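A minimal sketch of HMM-based classification follows, assuming one discrete HMM per genre and a single structural feature: each shot is coded as 0 (short) or 1 (long). A sequence is assigned to the genre whose model gives it the highest likelihood, computed with the scaled forward algorithm. The two toy models and their parameters are invented for illustration and are not the trained models from the talk.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Log P(obs | model) for a discrete HMM via the scaled forward algorithm.

    obs must be a non-empty sequence of symbol indices.
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        norm = sum(alpha)
        if norm == 0.0:
            return float("-inf")
        total += math.log(norm)          # accumulate log of the scaling factor
        alpha = [a / norm for a in alpha]  # rescale to avoid underflow
    return total


def classify(obs, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Toy models: (start probabilities, transition matrix, emission matrix).
MODELS = {
    "news": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.2, 0.8], [0.3, 0.7]]),
    "commercial": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.7, 0.3]]),
}

mostly_short = [0, 0, 1, 0, 0, 0]
mostly_long = [1, 1, 1, 0, 1]
```

Here `classify(mostly_short, MODELS)` favors the commercial model and `classify(mostly_long, MODELS)` favors the news model, since the toy emission probabilities tie short shots to commercials.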

Mubarak Shah, University of Central Florida

Title: Video Categorization using Semantics and Semiotics

The amount of audio-visual data in the world is enormous and is increasing every day. It is very important to organize, categorize and structure this data to make it easier to retrieve and browse. This talk will deal with the problem of categorizing news and talk shows, movie previews, and scenes in feature films. We believe that categorization of videos can be achieved by exploring the concepts and meanings of the videos. To do this, we need to bridge the gap between the low-level contents and the high-level concepts. A categorization methodology for video should begin with an understanding of Cinematic Principles (also referred to as Film Grammar in the movie literature).

We have developed a framework for categorizing videos by bridging the gap between computable features and understandable concepts. First we study the attributes of TV shows, movie previews, and films, and how they provide a time experience to the audience. Then we describe techniques for retrieving "computable features" from the videos. Finally, we formulate models for the computable features, which provide a higher-level understanding of the videos.

We will present a new technique for structuring news and game show videos by exploiting the scene transitions specific to these genres of programs. We will also show that understanding of film techniques can be applied to the problem of movie genre classification using movie previews. Finally, we will propose an approach to segmenting a full feature film into scenes, and semantically labeling each scene.

Alexander Hauptmann, Carnegie Mellon University

Title: Finding Information in a Digital Video Archive

The Informedia Digital Video Library system extracts information from digitized video sources and allows full content search and retrieval over all extracted data. The system uniquely utilizes integrated speech, image and natural language understanding to process broadcast video. The News-on-Demand collection enables review of continuously captured television and radio news content from multiple countries in a variety of languages. A user can look for relevant material and review the sequence of news stories related to an event of interest in the world news. The Informedia system allows information retrieval in both spoken language and video or image domains. Queries for relevant news stories may be made using words, images or maps. Fast, high-accuracy automatic transcriptions of broadcast news stories are generated through speech recognition, and closed captions (teletext) are incorporated where available. Faces are detected and can be searched for in the video. Text visible on the screen is recognized through video OCR and can also be searched for. Images can be searched for using multiple image retrieval mechanisms. I will show how this system was used to answer queries in the 2001 TREC Video Retrieval Evaluation.

Ruoyu Roy Wang and Thomas Huang, University of Illinois at Urbana-Champaign

Title: A Framework of Human Motion Tracking and Event Detection for Video Indexing and Mining

Tracking and identifying human body motions are important issues for automatic indexing and discovering human-related events in video. Dynamic information discovered in this way, such as motion trajectories and event types, is ideal metadata for video indexing. Model-based abnormal human motion event discovery also has significant applications in video mining. The major challenges encountered in these tasks, often in the form of monocular video analysis, arise from the following: unknown camera motion, background clutter, non-rigid articulation of objects of interest, and occlusion. To robustly track and analyze human motion in the light of these unknown conditions, we report a Maximum A Posteriori (MAP) probabilistic framework based on random sampling and temporal integration, under which tracking and event analysis operate simultaneously.

Our state space model permits the use of a mixture of discrete and continuous features, which are represented and propagated harmoniously in a particle-based random sampling fashion. Our formulation of object tracking and analysis as state space traversal problems allows analysis of multiple frames. The maximization of a joint data and state likelihood across time enables us to track objects robustly and analyze their events concurrently.

To deal with the curse of high dimensionality as the size of the state space increases, we identify active nodes in our temporal trellis with the Expectation Maximization (EM) algorithm. The random samples are propagated through video frames based on multiple modes discovered by the EM algorithm. The sparsity of the nodes affords good scalability as the dimensionality of the state space increases. Such a mixture summarization of local probabilistic density has multiple purposes, including: 1) increased efficiency of re-sampling, 2) prevention of sample depletion, 3) identification of active nodes for further trellis decoding, and 4) ability to tie event analysis to the discrete portions of active nodes.

The event analysis can co-occur with tracking because they operate under essentially the same state space. The discrete part of the space is our exemplar-based human body model, while the continuous counterpart is a set of allowable transformations of the model, such as translation and scaling. We model profile views of the human body by their 2D appearance. We first extract spatial and temporal gradient features of walking humans in a large training set, and then learn a set of exemplar templates to serve as our 2D model for human motions. Through our definition of the pair-wise distance between a model and an observation, and training of a transition matrix between model configurations, we can construct three common components in our tracking and analysis framework: object representation, object measurement model, and object dynamics model. These enable temporal event detection to be performed by a finite-state machine, whose parameters are derived from supervised training.

We demonstrate the efficiency of our framework with both simulated numerical experiments and real video sequences of walking pedestrians.

Aya Aner, Lijun Tang, John R. Kender, Columbia University

Title: Beyond Key Frames: The Physical Setting as a Video Mining Primitive

We present an automatic tool for the compact representation, cross-referencing, and exploration of long video sequences, which is based on a novel visual abstraction of semantic content. Our highly compact graph-like representation results from the non-temporal clustering of scene segments into a new conceptual form grounded in the recognition of real-world backgrounds. We represent shots and scenes using mosaics derived from establishing shots, and employ a novel method for the comparison of scenes based on these representative mosaics. We then cluster scenes together into a more useful higher level of abstraction -- the physical setting. We demonstrate our work using situation comedies, where each half-hour (40,000-frame) episode is well-structured by rules governing background use. Consequently, browsing, indexing, and comparison across videos by physical setting is very fast. Further, we show that the analysis of the frequency of use of these physical settings leads directly to high-level contextual identification of the main plots in each video. We demonstrate these contributions with a browsing tool whose top-level single page displays the settings of several episodes. This page expands to a single page for each episode, and each episode menu summary is further expanded into scenes and shots, all by mouse-clicking on appropriate plots and settings according to user interests.

C.-C. Jay Kuo, University of Southern California

Title: Movie Content Analysis and Abstraction via Multimodal Information

The problem of automatically extracting semantic structure from movies and summarizing it in a hierarchical manner will be addressed in this talk. Multiple media cues are employed in this procedure, including visual, audio and text information. The generated hierarchy provides a compact yet meaningful abstraction of video data similar to the conventional table of contents, which can facilitate a user's access to multimedia content, including browsing and retrieval. We pay special attention to the detection and classification of important events in movies, such as two-speaker dialog scenes, multiple-speaker dialog scenes, and story-progressing scenes. After that, the problem of identifying speakers from a movie dialog scene will be examined. While most previous work on speaker identification has been carried out based on pure audio data, more robust results can be obtained by integrating knowledge from multiple media sources such as visual and audio information when they are available. Experimental results will be given to illustrate the performance of the proposed methodology.

Longin Jan Latecki, Temple University

Title: Context-dependent Detection of Unpredictable Events in Videos

By unpredictable events we mean parts of videos, represented by their most significant frames, that are substantially different from their neighboring frames. This definition automatically implies that unpredictable events are context-dependent. We present an approach to detecting unpredictable events and demonstrate its performance on surveillance videos.

We map a video sequence into a polygonal trajectory by mapping each frame into a feature vector and joining the vectors representing consecutive frames by line segments. Shape analysis of the resulting polygonal curve allows us to detect frames representing unpredictable events.
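The mapping from frames to a polygonal curve can be sketched as follows. Each frame is assumed to be already reduced to a small feature vector, and consecutive vectors form the vertices of the curve. Since the abstract does not specify the shape analysis, a simple vertex score (turn angle weighted by the lengths of the two adjacent segments) stands in for it here; the score and the sample feature values are assumptions made for illustration.

```python
import math


def vertex_scores(points):
    """Score each interior vertex of a polygonal curve by how sharply it bends.

    points is a list of equal-length feature vectors, one per frame.
    End points receive a score of 0.0.
    """
    scores = [0.0] * len(points)
    for i in range(1, len(points) - 1):
        u = [b - a for a, b in zip(points[i - 1], points[i])]
        v = [b - a for a, b in zip(points[i], points[i + 1])]
        len_u = math.sqrt(sum(x * x for x in u))
        len_v = math.sqrt(sum(x * x for x in v))
        if len_u == 0.0 or len_v == 0.0:
            continue  # repeated frame: no direction change to measure
        cos_turn = sum(x * y for x, y in zip(u, v)) / (len_u * len_v)
        turn = math.acos(max(-1.0, min(1.0, cos_turn)))
        # Sharp turns between long segments score highest.
        scores[i] = turn * len_u * len_v / (len_u + len_v)
    return scores


# Hypothetical 2-D features per frame; frame 3 departs from its neighbors.
trajectory = [(0, 0), (1, 0), (2, 0), (3, 4), (4, 0), (5, 0)]
scores = vertex_scores(trajectory)
most_unusual_frame = scores.index(max(scores))
```

Because each score depends only on a frame and its neighbors on the curve, the same frame content can score high in one context and low in another, which matches the context-dependent definition above.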

Rainer Lienhart, Intel Labs

Title: Content-based Video Retrieval

Two important aspects of content-based video retrieval and data mining are automatic media content analysis and support for efficient browsing. As an example of automatic media content analysis, the speaker will present his novel system for text localization and text segmentation in images, web pages and videos. This will be followed by his work on automatic video abstracting, which creates a short video from much longer source video material. Video abstracting is a specific form of media browsing. Finally, the talk will conclude with a short overview of current research on smart media processing at Intel Labs.

B.S. Manjunath, University of California, Santa Barbara

Title: Mining Images and Video

Image and video mining techniques have novel applications in aerial image analysis and remote sensing. For example, consumer-grade video cameras are being increasingly used as remote sensing devices in geographic domains. These cameras are being flown in small planes over large areas of the Amazon forest to provide inexpensive alternatives to the more expensive satellite imagery. The video datasets are valuable resources for studying deforestation processes, but the size of the datasets makes manual analysis challenging. Automatic analysis techniques are needed. This talk will explore challenges to image/video processing in the context of data mining and present our recent work on mining aerial imagery and video. We have extensively investigated the use of homogeneous texture to annotate and classify aerial images. These image features have been shown to be effective at classifying the land types of interest: primary forest, secondary forest, and pasture. Analyzing the spatial arrangements of these land types gives further information about the stage of deforestation. For example, islands of forest surrounded by pasture are prone to dying out. Spatial data structures, such as Spatial Event Cubes, summarize the spatial arrangements of the land types. These summaries in turn allow effective analysis and visualization.

Milan Petkovic, University of Twente

Title: Knowledge-based Techniques for Content-based Video Retrieval

As the amount of publicly available video data grows, the need to query this data efficiently becomes significant. Consequently, content-based retrieval of video data turns out to be a challenging and important problem. In this talk, we address the specific aspect of inferring semantics automatically from raw video data using different knowledge-based methods. In particular, we focus on three techniques, namely, rules, Hidden Markov Models (HMMs), and Dynamic Bayesian Networks (DBNs). First, a rule-based approach that supports spatio-temporal formalization of high-level concepts is introduced. Then we shift our focus to stochastic methods and demonstrate how HMMs and DBNs can be effectively used for content-based video retrieval. All these approaches are integrated within our prototype video database management system and validated in the particular domains of tennis and Formula 1 videos. For these specific domains we introduce robust audio-visual feature extraction schemes and a text detection and recognition method. Knowledge-based techniques are used to map audio-visual features into high-level concepts such as approach to the net, rally, service, backhand slice stroke, etc. in tennis, and for the extraction of highlights in Formula 1 videos. For the latter, special attention will be given to the fusion of the evidence obtained from different media information sources. Finally, we present our experimental results, demonstrating the validity of the approaches, as well as advantages of their integrated use. Joint work with Willem Jonker.

Visvanathan Ramesh, Siemens Corporate Research

Title: Statistical Methods for Real-time Video Surveillance

The proliferation of cheap sensors and increased processing power have made real-time acquisition and processing of video information more feasible. Real-time video analysis tasks requiring object detection and tracking can increasingly be performed efficiently on standard PCs. Smart cameras are being designed that enable on-camera applications to directly output compressed data or meta-event information instead of raw video. These advances, along with major breakthroughs in communication and the Internet, are making possible real-time video monitoring in a variety of application sectors such as Industrial Automation, Transportation, Automotive Systems, Security/Surveillance, and Communications.

The real-time imaging group at SCR is focusing on the development of integrated, end-to-end solutions for video applications that require object detection, tracking, and action/event analysis. This talk will present an overview of our research on statistical methods for real-time video surveillance systems. Our systems for traffic monitoring will be highlighted. Implications of the availability of such systems on algorithms for Video Mining will also be discussed. Joint work with Dorin Comaniciu, Nikos Paragios, and other Ph.D. students.

Malcolm Slaney, IBM Almaden Research Center

Title: Mixtures of Probability Experts for Audio Retrieval and Indexing

This paper describes a system for connecting non-speech sounds and words with linked multi-dimensional vector spaces. An approach based on mixtures of experts learns the mapping between one space and the other. We describe the conversion of audio and semantic data into their respective vector spaces. Two different mixtures of probability expert models are trained to learn the associations between acoustic queries and their corresponding semantic explanation, and vice versa. Quantitative performance data are presented based on commercial sound effect CDs.

John R. Smith, IBM T.J. Watson Research Center

Title: Statistical Modeling and Retrieval of Video Content

We describe a novel statistical framework for modeling and retrieving video content based on semantic concepts including scenes, objects, and events. The framework provides explicit models for simple concepts that have broad applicability and are deployed at query time to compose more complex concepts depending on the specific query. The retrieval framework also provides methods for integrating model-based querying with feature-based techniques in iterative searching that allows selective query refinement and relevance feedback. We describe experimental results of applying the statistical framework for video retrieval to a large automatically indexed corpus of video.

Nuno Vasconcelos, Compaq Research

Title: Bayesian Models of Video Structure for Segmentation and Content Characterization

Prior knowledge about video structure can be used both as a means to improve the performance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulation is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.