
A Survey on MPEG-7: A Multimedia Content Description Interface
Author: Randa Hilal
email: rhilal@mcs.kent.edu
homepage: http://www.mcs.kent.edu/~rhilal
Prepared for Prof. Javed I. Khan
Department of Computer Science, Kent State University
Date: November 2001
Abstract: As its name, "Multimedia Content Description Interface", suggests, MPEG-7 is a standard for describing multimedia data, or the contents of multimedia files, in order to facilitate searching multimedia databases and libraries and surfing the web for multimedia resources using criteria such as text, sound, or even graphics. This paper looks at the MPEG-7 standard as of March 2001, and at some of its applications in searching, filtering, classifying, and indexing multimedia resources.
Table of Contents:
Objectives of MPEG-7 standard
MPEG-7 tools
Scope of the standard
MPEG-7 application areas
MPEG-7 parts and their functionalities
Representing Internet Streaming Media Metadata using MPEG-7 Multimedia Description Schemes
TV Anytime as an application scenario for MPEG-7
Spoken Content Metadata and MPEG-7
On the Evolution of Videotext Description Scheme and Its Validation Experiments for MPEG-7
Research Groups
Research Papers
Other Relevant Links
Introduction
The Moving Picture Experts Group (MPEG) started its first standard, MPEG-1, in January 1988. It was intended for audio and video compression, together with all the functions needed for multiplexing and synchronizing audio and video streams into one stream, called Systems. Although MPEG-1 was designed for specific applications such as interactive CD and digital audio broadcasting, it was generic enough to be used for other applications as well.
While MPEG-1 was designed with specific applications in mind, MPEG-2, which started in July 1990, addressed the multiplexing of one or more elementary streams of video and audio, as well as other data streams, into single or multiple streams suitable for storage or transmission. This was done by developing two system layers: the Transport Stream (TS), designed for environments where errors are likely, such as storage or transmission over lossy or noisy media (cable, satellite, and terrestrial), and the Program Stream (PS), similar to MPEG-1 and designed for relatively error-free environments.
In July 1993 MPEG started working on its third standard, MPEG-4. Unlike MPEG-1 and MPEG-2, which required a bit rate of no less than 1 Mbps, MPEG-4, as its original title "very low bitrate audio-visual coding" suggests, allowed the decoding to be implemented on a wide range of programmable devices. MPEG-4 can encode units of aural, visual, or audiovisual content, called "media objects". These objects can be of natural or synthetic origin. MPEG-4 can also describe the composition of these objects, and multiplex and synchronize the data associated with them so they can be transported over network channels providing a QoS appropriate for the nature of the specific media objects.
At this point we had all the tools needed for digitizing, encoding, decoding, compressing, and transferring multimedia content over media of widely varying bandwidth and capacity. One important link was still missing in the chain: how can we search for multimedia content among the wealth of multimedia resources available from media libraries and databases, and what criteria can we use to search for, and select, a particular resource? In October 1998 MPEG issued a call for proposals for a new standard, MPEG-7, that would devise standard ways of searching, filtering, classifying, and indexing multimedia data.
In the following sections of this paper we will look at the MPEG-7 standard: its scope, objectives, parts, application areas, and functionalities. We will also look at some examples of research papers that used the MPEG-7 standard to implement a variety of applications.
Overview of the MPEG-7 Standard
MPEG-1, MPEG-2, and MPEG-4 made a wealth of audiovisual information available in digital form, but the value of this information depends greatly on the ease of finding, retrieving, accessing, filtering, and managing it. MPEG-7, an ISO/IEC standard, is not the first attempt to use metadata to describe, organize, and manage multimedia resources. There have been many attempts to use various forms of metadata and description schemes to facilitate the different ways of managing and finding digital multimedia data when needed. Among these attempts are the Dublin Core scheme, widely used for simple descriptions such as author names, date of publication, etc., and XML/RDF, which defines the relationship between any two entities, gives this relationship a name, and uses an XML format to describe it. What MPEG-7 does is define standard schemes and use a standard language to describe the content of audio and video records, movies, speech clips, graphics, text, and even still pictures. The use of MPEG-7 is not restricted to database retrieval applications such as digital libraries; it also extends to applications such as broadcast channel selection, multimedia editing, and multimedia directory services. These applications are widely varied: they can run in real or non-real time, they can be push or pull applications, and they are intended for consumption by human users as well as by computational systems.
Objectives of MPEG-7 standard
MPEG-7 was not unique in what it set out to do, namely providing a description of the content of multimedia resources, but it was unique in the way it standardized the core technology and extended the limited capabilities of proprietary solutions by covering more data types, such as still pictures, graphics, 3D models, audio, speech, and video, as well as special cases of these data types such as facial expressions, personal characteristics, music mood, and so on.
MPEG-7 description tools are independent of the way the content is coded or stored; they work for digital data as well as analogue data, or even material printed on paper.
MPEG-7 offers different granularities; a description can be as general or as detailed as we want. Although MPEG-7 is not application specific and does not depend on the way the content is coded, it can build on features that other standards offer when available. For example, it can use MPEG-4's treatment of the object as the unit of encoding to attach a description to an individual object within an audio or video file.
MPEG-7 can match its description of a resource to the application it is used for: for a visual application it can give descriptions relating to shape, color, size, texture, position, or movement; for an audio application it can describe mood, tempo, tempo changes, etc. On a more sophisticated level it can give descriptions that include semantic information. Some low-level features included in a description can be extracted automatically; other, more sophisticated features may need manual extraction. Besides the description of the multimedia data itself, MPEG-7 has to include some other information, such as:
- The form: such as coding scheme and data size.
- Conditions for accessing the material: such as intellectual property rights information and price.
- Classifications: such as parental rating.
- Links to other relevant material: can help in finding more search material.
- The context: such as, in documentary or teaching material, the title of the material, the name of the author, or the date or place the material was created.
- Information describing the creation and production processes of the content: such as director and title.
- Information relevant to the usage: copyright pointers, usage history, and broadcast schedule.
- Information on the storage features: format, encoding.
- Structural information on spatial, temporal, or spatio-temporal components: scene cuts, segmentation in regions, region motion tracking.
- Conceptual information about the content: such as objects, events, and their interactions.
- Information about how to browse the content: such as the use of summaries.
- Information about the interaction of the user with the content: such as user preferences and usage history.
MPEG-7 tools
To accomplish its task MPEG-7 defines a set of tools. These tools may or may not all appear in a given description; moreover, the separation between them may not always be clear, depending on the content described and the application using the description. Furthermore, the descriptions may be stored with the audio-visual content on the same storage medium, or may be stored remotely on some other system; in the latter case some additional tools are needed to link the content to its description. Content and queries on that content do not have to match in type: visual content can be queried using a visual description, a textual description, or even a speech description. MPEG-7 tools are very flexible; they can work for many different applications and environments. This allows the coexistence of MPEG-7 with other leading standards such as the SMPTE Metadata Dictionary, Dublin Core, TV Anytime, etc.
The main tools of MPEG-7 are:
- Descriptors (D): used to describe the various features of multimedia content; they define the syntax and semantics of each content feature. Figure 1 shows an example of a Descriptor.
- Description Schemes (DS): pre-defined structures of Descriptors and Description Schemes that specify the semantics of their relationships.
- Description Definition Language (DDL): a language to define new Description Schemes and Descriptors, or to extend existing ones. It provides a standardized grammar and syntax for unambiguously defining Descriptors and Description Schemes so they can be parsed by a variety of systems. In March 2000 it was decided to adopt W3C's XML Schema language as the DDL, with the provision of extending it to satisfy all the MPEG-7 requirements. [9] lists some of these extensions: parameterized array sizes; typed references; built-in array and matrix data types; and enumerated data types for MimeType, CountryCode, RegionCode, CurrencyCode, and CharacterSetCode. MPEG-7-specific parsers will be developed by adding validation of these additional constructs to standard XML Schema parsers.
- System tools: they support the multiplexing and synchronization of Descriptors with the content they describe.
<CatalogueEntry xsi:type="NewsDoc">
  <Title>CNN 6 o'clock News</Title>
  <Producer>David James</Producer>
  <Date>1999</Date>
  <Broadcaster>CNN</Broadcaster>
</CatalogueEntry>
Figure 1: Descriptor
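Because MPEG-7 descriptions are XML documents, they can be produced and consumed with ordinary XML tooling. A minimal sketch in Python, using the illustrative element names from the catalogue entry above rather than the normative MPEG-7 schema:

```python
import xml.etree.ElementTree as ET

# Build a small catalogue-entry description. Element names follow the
# illustrative figure above, not the normative MPEG-7 schema, and the
# namespace machinery is omitted for brevity.
entry = ET.Element("CatalogueEntry", {"type": "NewsDoc"})
ET.SubElement(entry, "Title").text = "CNN 6 o'clock News"
ET.SubElement(entry, "Producer").text = "David James"
ET.SubElement(entry, "Date").text = "1999"
ET.SubElement(entry, "Broadcaster").text = "CNN"

xml_text = ET.tostring(entry, encoding="unicode")

# Any consumer can parse the description back and read its features.
parsed = ET.fromstring(xml_text)
print(parsed.find("Broadcaster").text)  # CNN
```

Interoperability comes from the standardized description format itself: any parser that understands the schema can consume a description produced by any other tool.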
Scope of the standard
MPEG-7 deals with content stored on-line or off-line, or streamed, and it can operate in real-time and non-real-time environments. A real-time environment in this context is one where the description is generated while the content is being captured. MPEG-7 works both for pull applications, such as retrieval from digital libraries, and for push applications, such as filtering audio-visual streams broadcast over the Internet. Generating descriptions requires first extraction of the features (analysis); then the description is generated; finally, search engine applications are employed to complete the job. Feature extraction can be automated at the lower levels of description, or interactive when higher levels of feature extraction are required. But the implementation of feature extraction, whether automatic, manual, or semi-automatic, is beyond the scope of MPEG-7; it was left for the industry to compete on, since interoperability does not require it. Likewise, the implementation of search engines and filter-agent algorithms is beyond the scope of MPEG-7 and was left to the industry to develop. Figure 2 shows a schematic representation of the scope of the MPEG-7 standard.
Figure 2: scope of the MPEG-7 standard
MPEG-7 places great emphasis on describing audio-visual data, but this data may contain text that also needs to be searched and filtered; for that reason MPEG-7 considered existing solutions for describing text.
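The processing chain just described, where only the description format in the middle is normative while extraction and search are left to implementers, can be sketched as follows. All function bodies are toy stand-ins invented for illustration:

```python
# Sketch of the MPEG-7 processing chain: feature extraction and search
# are non-normative (left to industry); only the description format in
# the middle is standardized. The function bodies are toy stand-ins.

def extract_features(content: str) -> dict:
    # Non-normative: e.g. automatic low-level analysis.
    return {"length": len(content), "words": content.split()}

def generate_description(features: dict) -> dict:
    # Normative scope of MPEG-7: a standardized description.
    return {"Descriptor": {"WordCount": len(features["words"])}}

def search(descriptions: list, min_words: int) -> list:
    # Non-normative: a search engine consuming descriptions.
    return [d for d in descriptions if d["Descriptor"]["WordCount"] >= min_words]

docs = ["a short clip", "a much longer annotated news segment"]
descriptions = [generate_description(extract_features(d)) for d in docs]
print(search(descriptions, min_words=4))
```

Two vendors may compete on better `extract_features` and `search` implementations while remaining interoperable, because both read and write the same description format.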
MPEG-7 application areas
MPEG-7 standard tools will make it possible to support a wide range of applications, such as digital library searching and indexing, broadcast media selection, and media editing. They will make it possible to search the web for multimedia data using a variety of criteria, in the same manner it is searchable for text using textual criteria. The following list of applications that will benefit from MPEG-7 is only a sample of the endless possibilities.
- Architecture, real estate, and interior design.
- Cultural services in history museums and art galleries.
- Digital library searches for all kinds of archived multimedia resources.
- E-commerce, advertising, on-line catalogues, e-shops.
- Education, such as searching for support material.
- Home entertainment: management of personal multimedia collections, home video editing, karaoke, etc.
- Investigation services, human characteristics recognition, forensics.
- Journalism, e.g. searching for speeches by famous people.
- Directory services, yellow pages, tourist information, etc.
- Remote sensing, cartography, ecology, natural resources management, etc.
- Shopping for different items.
- Social services, such as dating.
- Surveillance, traffic control, and the like.
And the list goes on and on…
Querying a resource can be done in many different ways:
- Play a few notes of a song.
- Draw a few lines on a screen and find an image that matches the drawing.
- Define an object by its color, texture, shape, etc. and find an object that matches the definition.
- Define multimedia objects and the relationships between them, and find what matches.
- Describe an action and get scenarios that match.
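Defining an object by its color can be sketched as descriptor matching. Histogram intersection is one common similarity measure for color histograms; the normative MPEG-7 color descriptors are more elaborate, and the matching step itself is outside the standard's scope:

```python
# Query-by-color sketch: compare a user's query histogram against stored
# image histograms using histogram intersection. The 4-bin histograms
# and the similarity measure are illustrative, not the MPEG-7 descriptors.

def histogram_intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))

# Toy normalized color histograms (bins sum to 1).
query   = [0.7, 0.2, 0.1, 0.0]   # a mostly-red object drawn by the user
image_a = [0.6, 0.3, 0.1, 0.0]
image_b = [0.0, 0.1, 0.3, 0.6]

scores = {"a": histogram_intersection(query, image_a),
          "b": histogram_intersection(query, image_b)}
best = max(scores, key=scores.get)
print(best)  # image_a matches the red query better
```

A higher intersection means more shared mass across the color bins, so image_a (also mostly red) outranks image_b.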
MPEG-7 parts and their functionalities
1. MPEG-7 Systems: includes the Descriptor (D) and Description Scheme (DS) tools used to create descriptions and to synchronize content with its descriptions, as well as tools for managing and protecting intellectual property. It defines the terminal architecture and normative interfaces.
2. MPEG-7 Description Definition Language: the language used to create new Description Schemes and, eventually, new Descriptors. It also allows the extension and modification of existing DSs. The XML Schema language was chosen as the basis of the DDL, with its structural and datatype components; some MPEG-7-specific components will be added to it as well.
3. MPEG-7 Visual: Descriptors and Description Schemes dealing only with visual descriptions. It includes color, texture, shape, motion, localization, and other descriptors; each can be basic or sophisticated. Table 1 shows some of the current descriptors.
4. MPEG-7 Audio: Descriptors and Description Schemes dealing only with audio descriptions. This includes six technologies: the audio description framework (scale tree, low-level descriptors), sound effect description tools, instrumental timbre description tools, spoken content description, the uniform silence segment, and finally the melodic descriptors that facilitate query-by-humming. Table 1 shows some of the current descriptors.
5. MPEG-7 Multimedia Description Schemes (MDS): Descriptors and Description Schemes dealing with generic and multimedia features. Generic features pertain to all types of media, such as vectors and time. Multimedia description tools are used when more than one medium needs to be described at the same time. They are grouped in five groups:
o Content description: representation of perceivable information.
o Content management: information about the media features, the creation, and the usage of the audio-visual content.
o Content organization: representing the analysis and classification of several audio-visual contents.
o Navigation and access: specification of summaries and variations of the audio-visual content.
o User interaction: description of user preferences and usage history pertaining to the consumption of the multimedia material.
6. MPEG-7 Reference Software, the eXperimental Model (XM): an experimental software implementation of the standard. It includes the simulation platform of MPEG-7: Descriptors (Ds), Description Schemes (DSs), Coding Schemes (CSs), and the Description Definition Language (DDL). The experimental model has normative and non-normative parts. The normative parts consist of the Descriptor and Description Scheme syntax and semantics, and the binary representations of both. The optional non-normative parts of the software are the recommended data structures and the procedures performed on them for extraction and similarity matching.
7.
MPEG-7 Conformance – guidelines and procedures for testing the conformance
of MPEG-7 implementations.
Type: Visual
  Basic structures: grid layout; histogram
  Color: color space; dominant color; color histogram; color quantification
  Texture: spatial image intensity distribution; homogeneous texture
  Shape: object bounding box; region-based shape; contour-based shape; 3D shape descriptor
  Motion: camera motion; object motion trajectory; parametric object motion; motion activity; motion trajectory features (e.g., speed, direction, acceleration)
Type: Audio
  Speech annotation: lattice of words and phonemes plus metadata
  Timbre: ratio of even to odd harmonics; harmonic attack coherence
  Melody: melodic contour and rhythm

Table 1 – Overview of the current descriptors
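The melody descriptors above are what enable query-by-humming. The idea can be sketched with a melodic contour, where each note is reduced to whether it moves up, down, or repeats relative to the previous one; the actual MPEG-7 melody tools are considerably richer than this sketch:

```python
# Query-by-humming sketch: reduce a pitch sequence to its contour
# (U = up, D = down, R = repeat) and match contours exactly. The tune
# database and MIDI-style pitch numbers are illustrative.

def contour(pitches):
    return "".join(
        "U" if b > a else "D" if b < a else "R"
        for a, b in zip(pitches, pitches[1:])
    )

database = {
    "tune-1": [60, 62, 64, 62, 60],   # up, up, down, down
    "tune-2": [60, 60, 67, 67, 69],
}
hummed = [55, 57, 59, 57, 55]          # same shape as tune-1, sung in a lower key

query = contour(hummed)
matches = [name for name, p in database.items() if contour(p) == query]
print(matches)  # ['tune-1']
```

Because the contour discards absolute pitch, an out-of-key or off-pitch hummed query can still retrieve the right tune, which is exactly why contour-style descriptors suit this application.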
Examples of MPEG-7 application areas
This section presents some of the studies that were built on the preliminary MPEG-7 standard. These examples clarify the ideas of Descriptors, Description Schemes, and the Description Definition Language, and show how MPEG-7 will come in handy in conjunction with many other applications and technologies.
Representing Internet Streaming Media Metadata using MPEG-7 Multimedia Description Schemes
This study was done by Eric Rehm of Singingfish.com, an Internet startup company in Seattle, Washington, that began the construction and population of a searchable database of Internet streaming media. The study used the MPEG-7 Multimedia Description Scheme (MDS) as a guiding model to build on. The Multimedia Description Group of MPEG-7 created a top-level entity called the "Generic AV DS", which describes the audio and visual contents of a single AV document. This entity was used in the study as the basis for an implementation of a searchable database of Internet streaming media. Figure 3 shows the MPEG-7 AV Description Schemes. This paper is important in that it shows the hierarchy of the Description Schemes, which are essentially generic structures on which Descriptors are built.
Figure 3- Streaming AV Description Scheme
Rehm found that the following data and relationships had to be modeled:
1. Overall structure: single content item, playlist, SMIL-authored content, etc.
2. Media information: URL link(s) to the stream, bit rate, media format (RealMedia, Windows Media, etc.), duration, MIME type, media type (audio, video, animation, etc.).
3. Creation information: title, author, copyright, artist, album, record label, language, etc.
4. Classification: category and genre. Categories are the root nodes of the taxonomy; a genre represents a path from a root node using a controlled vocabulary.
5. Related material: referencing page URL(s), title, anchor text, HTML meta tags (description keywords).
6. Usage information: copyrights.
7. Spoken text: transcript from speech recognition.
The Overall Structure
The MPEG-7 Segment DS and Segment Decomposition allowed them to model any hierarchical playlist format they encountered on the Internet. Figure 4 shows the structure support from the MPEG-7 MDS.
Figure 4 – Structure Support from MPEG-7 MDS
A Segment DS is actually an abstract class. Subclasses of the Segment DS, namely the VideoSegment, AudioSegment, and TextSegment DSs, were designed to contain information about the audio, video, and text in an AV content item.
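The abstract-class-plus-decomposition pattern can be sketched in a few lines; a segment may be decomposed into child segments, which is what lets a whole playlist be modeled as a tree. Class and field names here are illustrative, not the normative DS names:

```python
# Sketch of the Segment DS hierarchy: an abstract base with
# media-specific subclasses, where decomposition builds a tree.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:                      # abstract in the MDS
    label: str
    children: List["Segment"] = field(default_factory=list)

    def decompose(self, child: "Segment") -> None:
        self.children.append(child)

@dataclass
class VideoSegment(Segment): pass

@dataclass
class AudioSegment(Segment): pass

@dataclass
class TextSegment(Segment): pass

playlist = Segment("playlist")
playlist.decompose(VideoSegment("intro clip"))
playlist.decompose(AudioSegment("theme music"))
print(len(playlist.children))  # 2
```

Arbitrarily nested playlists fall out of the same mechanism, since any child segment can itself be decomposed further.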
Media Information
The
MPEG-7 MediaInformation DS contains descriptions that are specific to the
storage media. It can contain one or
more MediaProfile DSs. Each MediaProfile
represents one of possibly many variations that can be produced form a master
media depending on the values chosen for the MediaCoding, MediaFormat (storage
format) etc. Internet streaming media
content is often encoded in more than one commercial format (RealMedia, Windows
Media, QuickTime), each at several bit rates, with each variation at a separate
URL. So they encoded the commercial
format with the MediaFormat’s System element.
See figure 5 Media Information DS.
Figure 5- Media
Information DS
If
two identical instances of particular stream exist on the Internet (very common
with MP3 for example), they can be simply represented with multiple
MediaInstance description.
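The MediaInformation idea, one logical content item with several profiles (commercial format at a given bit rate, each with its own URL), can be sketched as a small data model. Field names and the selection helper are illustrative, not the normative DS:

```python
# Sketch of MediaInformation with multiple MediaProfiles: the same
# content encoded in several formats and bit rates at separate URLs.

from dataclasses import dataclass
from typing import List

@dataclass
class MediaProfile:
    system: str        # e.g. "RealMedia", "WindowsMedia", "QuickTime"
    bitrate_kbps: int
    url: str

@dataclass
class MediaInformation:
    profiles: List[MediaProfile]

    def best_for(self, max_kbps: int) -> MediaProfile:
        # Pick the highest bit rate the client connection can carry.
        usable = [p for p in self.profiles if p.bitrate_kbps <= max_kbps]
        return max(usable, key=lambda p: p.bitrate_kbps)

info = MediaInformation([
    MediaProfile("RealMedia", 28, "http://example.com/clip-28.rm"),
    MediaProfile("RealMedia", 56, "http://example.com/clip-56.rm"),
    MediaProfile("WindowsMedia", 100, "http://example.com/clip-100.asf"),
])
print(info.best_for(max_kbps=64).url)
```

A search engine holding such a description can hand each user the variation that fits their player and connection, without re-crawling the streams.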
Creation Information
The
CreationMetaInformation DS binds together creation and classification
information about AV content and other material related to it. It contains by whom and with what name, when,
and where the content was created. This
information can be extracted in many different ways. Automatic extraction form the header of the
stream is one way, and automatic extraction from the referring web page that
contain the URL of the stream is another way.
Figure 6- the CreationMetaInformation
Figure 6- CreationMetaInformation
DS
Classification
The Classification DS is used as part of the larger CreationMetaInformation DS to categorize Internet streams into a proprietary taxonomy.
Related material
The RelatedMaterial DS is also part of the CreationMetaInformation DS; it holds information about the web page(s) that contain links to the streaming media. Such data has proven to increase search precision and recall.
Usage Information
The
Rights DS within the UsageMetaInformation DS is used to capture the copyrights.
Spoken Text
Spoken text is extracted by using speech recognition tools, closed-caption decoding, or transcripts provided by the content producer. It is captured in the SpokenContent DS as part of the AudioSegment DS. See figure 4.
Summary Information
The
MPEG-7 SequentialSummary DS is only used when there is a need to represent
multiple key frames extracted from a single Internet streaming video.
TV Anytime as an application scenario for MPEG-7
Reference [11] shows how the TV Anytime Forum can make use of the MPEG-7 Description Schemes to realize what is intended from the TV Anytime technology.
TV Anytime is an organization for the development and standardization of the tools and technologies needed for the creation of an integrated entertainment/information gateway. It aims at providing value-added services, such as personalizing and controlling material of special interest to the end user, accessed via TVs or computational systems. To that end, TV Anytime specified three required technologies: metadata, content referencing, and rights management.
Metadata
Metadata is the core of the MPEG-7 standard. TV Anytime could use the rich library of MPEG-7 Description Schemes without having to reinvent them. But TV Anytime would not need all the tools offered by MPEG-7; tools it does not need include the low-level audio-visual features such as color and loudness. This calls for a mechanism to profile MPEG-7 for certain types of applications.
Content referencing
The AV material stream and the metadata stream in MPEG-7 are two separate streams that may not reside on the same storage medium. The AV stream can be digital or analog, and the transfer medium can be cable or satellite. The important functionality required is the ability of the receiver to synchronize and link the two streams together. MPEG-7 supports this linkage via different reference and time Description Schemes, such as the MediaLocator DS to specify the link, the MediaTimePoint DS to specify the absolute start time, and the MediaDuration DS to specify a segment's duration.
Rights management
MPEG-7 has a specific Description Scheme to manage copyrights and other rights, but MPEG-7 cannot deal with the security issues of TV Anytime.
How can TV Anytime use MPEG-7 metadata?
An XML Schema language parser can extract the information included in an MPEG-7 generic metadata Description Scheme, validate it, and map it to memory according to the specific TV Anytime metadata format. Then an application that performs services such as searching and accessing AV material can use the metadata information in memory to do so.
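The parse-and-extract step can be sketched with a standard XML parser reading a TV Anytime-style program description shaped like the sample instance later in this section. Element names follow that sample; namespaces and schema validation are omitted for brevity:

```python
import xml.etree.ElementTree as ET

# Sketch: read a program description shaped like the sample instance
# in this section and pull out the annotated highlight segments.
# Namespaces and schema validation are omitted for brevity.

doc = """
<program>
  <generalInfo><annotation><Who>Team A - Team B</Who></annotation></generalInfo>
  <highlight><annotation><Who>Player One</Who></annotation></highlight>
  <highlight><annotation><Who>Player Two</Who></annotation></highlight>
</program>
"""

root = ET.fromstring(doc)
scorers = [h.findtext("annotation/Who") for h in root.findall("highlight")]
print(scorers)  # ['Player One', 'Player Two']
```

Once the highlights are in memory, a receiver can offer the "show me only the goals" service without understanding anything about how the metadata was originally authored.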
Example of a possible scenario
Since every broadcast contains segments of higher or lesser importance, an end user may request to view only the highlights of a broadcast, such as the goals in a soccer game. Figure 7 shows a simplified extract of an MPEG-7 Description Scheme: a simplified high-level DS that has all the components required for time-exact linking into AV material (MediaTime), for the physical location of the AV material (MediaLocator), for human comments on some AV material (StructuredAnnotation), and for the specification of a certain segment of an AV material (HighlightSegment). Figure 8 shows a simple example of a description scheme specific to the TV Anytime application; it uses the generic MPEG-7 schemes to create a TV Anytime-specific metadata scheme. Figure 9 shows a sample of a metadata Descriptor that uses the XML Schema language; the sample in figure 9 describes the soccer final of the 1999 Champions League in Europe.
<schema
xmlns="http://www.w3.org/1999/XMLSchema"
xmlns:mp7="http://www.mpeg7.org/MP7Schema"
targetNamespace="http://www.mpeg7.org/MP7Schema"
elementFormDefault="unqualified"
attributeFormDefault="unqualified">
...
<!-- Schema component to
locate material in time -->
<complexType name="MediaTime">
<choice>
<element
name="MediaTimePoint" type="mp7:MediaTimePoint"/>
<element name="MediaRelTime"
type="mp7:MediaRelTime"/>
</choice>
<element name="MediaDuration" type="mp7:MediaDuration"
minOccurs="0"/>
</complexType>
<!-- Schema component to
locate material physically -->
<complexType name="MediaLocator">
<element name="MediaURL"
type="mp7:MediaURL"/>
<element name="MediaTime"
type="mp7:MediaTime" minOccurs="0"/>
</complexType>
<!-- Schema component to
annotate material -->
<complexType name="StructuredAnnotation">
<element name="Who" type="mp7:ControlledTerm"
minOccurs="0"/>
<element name="WhatObject"
type="mp7:ControlledTerm" minOccurs="0"/>
<element name="WhatAction"
type="mp7:ControlledTerm" minOccurs="0"/>
<element name="Where"
type="mp7:ControlledTerm" minOccurs="0"/>
<element name="When"
type="mp7:ControlledTerm" minOccurs="0"/>
<element name="TextAnnotation" type="string"
minOccurs="0"/>
<attribute ref="xml:lang"/>
</complexType>
<!-- Schema component for
segments being a highlight in the material -->
<complexType name="HighlightSegment">
<element name="VideoSegmentLocator" type="mp7:VideoSegmentLocator"
minOccurs="0"/>
<element name="AudioSegmentLocator"
type="mp7:AudioSegmentLocator" minOccurs="0"/>
<attribute name="name" type="string"
use="optional"/>
<attribute name="themeIds" type="IDREFS"
use="optional"/>
</complexType>
</schema>
Figure 7 – simplified extract of MPEG-7 Description Scheme
<schema xmlns="http://www.w3.org/1999/XMLSchema"
xmlns:mp7="http://www.mpeg7.org/MP7Schema"
xmlns:tva="http://www.tv-anytime.org/TVASchema"
targetNamespace="http://www.tv-anytime.org/TVASchema"
elementFormDefault="unqualified"
attributeFormDefault="unqualified">
<import
namespace="http://www.mpeg7.org/MP7Schema"/>
<element
name="program">
<complexType>
<element
name="generalInfo" type="tva:generalInfoType" />
<element
name="highlight" type="tva:highlightType"
minOccurs="0" maxOccurs="unbounded" />
</complexType>
</element>
<complexType
name="generalInfoType">
<element
name="annotation" type="mp7:StructuredAnnotation" />
<element
name="link" type="mp7:MediaLocator" />
</complexType>
<complexType
name="highlightType">
<element
name="segment" type="mp7:HighlightSegment" />
<element
name="annotation" type="mp7:StructuredAnnotation" />
</complexType>
</schema>
Figure 8 – simple example of a description scheme specific to the TV Anytime application
<program
xmlns="http://www.tv-anytime.org/TVASchema">
<generalInfo>
<annotation lang="eng">
<Who>Manchester United - Bayern
Munich</Who>
<WhatAction>soccer champions league
final Europe</WhatAction>
<Where>Barcelona,
Spain</Where>
<When><Y>1999</Y><M>5</M><D>29</D></When>
</annotation>
<link><MediaURL>http://...</MediaURL></link>
</generalInfo>
<highlight>
<segment>
<videoSegmentLocator>...</videoSegmentLocator>
<themeIds>goal</themeIds>
</segment>
<annotation>
<Who>Mario Basler</Who>
<WhatObject>Bayern
Munich</WhatObject>
<When><M>6</M></When>
</annotation>
</highlight>
<highlight>
<segment>
<videoSegmentLocator>...</videoSegmentLocator>
<themeIds>goal</themeIds>
</segment>
<annotation>
<Who> Teddy Sheringham</Who>
<WhatObject>Manchester
United</WhatObject>
<When><M>91</M></When>
</annotation>
</highlight>
...
</program>
Figure 9 – Sample of a metadata Descriptor
Spoken Content Metadata and MPEG-7
There
are two level of descriptions in MPEG-7, one is low-level description that can
be automatically extracted such as image color for visual items and Fourier
power spectrum of audio items, the other is high-level description, semantic,
that requires human intervention because it contain a lot of abstractions of
humanly understood concepts. With the
increasing need to cut on cost and automate most of these extractions a new mid
level aroused that uses a lot of automation for extracting the abstract
concepts. As an example of these
automation attempts, is spoken content of audio, topic identification in text,
and object identification in images. But
these applications are not perfect because they contain so many variable such
as in non-canonical English, the sound of the words “picture” and “pitcher” can
be identical but their meaning is different.
The only way to disambiguate this is through topical of positional
context.
As an example, we are going to look at spoken content and at how MPEG-7 deals with the shortcomings of the tools for automatic speech recognition (ASR). Spoken content forms an essential component of the audio-visual description. This content may be extracted at a number of levels, from phonetic subword units (phones) through syllables to words. To illustrate the design considerations, consider the annotation of images: when taking a picture, a person can include a short comment about the person in the picture or the place where the picture was taken. These comments can be used to construct metadata structures using ASR tools, and the end user can later use these descriptors to query the picture database via audio or textual queries.
ASR systems suffer from some problems; their accuracy is limited by ambient noise, out-of-vocabulary words, ungrammatical constructions, and poor enunciation. Special attention must be given to the limitations of current ASR systems and to the methods by which the metadata may be utilized for retrieval or other purposes. Two problems must be considered:
Extraction failures: as shown in figure 10, a hypothetical lattice representing the phrase "please be quite sure", ASR decoding results are stored in some form of lattice. These lattices represent a large number of hypotheses, and many of the decodings contain the correct hypothesis even when the most probable one is incorrect. The solution is to retain all the possible hypotheses in the metadata. This works well for short audio captions, but it is neither practical nor accurate for large audio files.
Figure 10 – Hypothetical lattice representing the phrase "please be quite sure"
Extraction limitations: an ASR system matches the audio against a vocabulary of 20,000 to 60,000 words. This vocabulary omits many of the nouns that can be found in an audio file, and those nouns are often crucial to the meaning. By retaining the phonetic representation of these sounds, we may still be able to retrieve an audio document by example, through combined word and phone retrieval. As a result, the lattices need to contain a combination of words and phones.
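To make the lattice idea concrete, here is a minimal sketch in Python of a combined word/phone lattice. The structure, symbols, and probabilities are our own illustration, not MPEG-7's actual lattice representation: links between nodes carry either a word or a phone hypothesis, so an out-of-vocabulary span can still be matched at the phone level.

```python
# Illustrative combined word/phone lattice (not the MPEG-7 syntax):
# each link between two nodes carries one hypothesis and its probability.
from dataclasses import dataclass

@dataclass
class Link:
    start: int      # source node id
    end: int        # target node id
    label: str      # word or phone symbol
    is_word: bool   # True for a word link, False for a phone link
    prob: float     # hypothesis probability

# A tiny lattice for the caption "please be quite sure"; the phone link
# between nodes 2 and 3 is a fallback for an out-of-vocabulary hypothesis.
lattice = [
    Link(0, 1, "please", True, 0.9),
    Link(1, 2, "be", True, 0.8),
    Link(2, 3, "quite", True, 0.4),
    Link(2, 3, "quiet", True, 0.35),
    Link(2, 3, "k", False, 0.1),   # phone-level fallback (symbol invented)
    Link(3, 4, "sure", True, 0.9),
]

def best_word_at(lattice, start, end):
    """Return the most probable word hypothesis spanning start -> end."""
    words = [l for l in lattice if l.is_word and l.start == start and l.end == end]
    return max(words, key=lambda l: l.prob).label if words else None
```

A retrieval engine could walk such links to match either a word query or, when the word is out of vocabulary, a phone sequence derived from the query.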
MPEG-7 SpokenContent Descriptor
After looking at the problems ASR suffers from, we can now look at how the decoding results are used in the MPEG-7 SpokenContent Descriptor. The authors of [12] argue that speech needs to be represented as a combined word and phone lattice. Some audio documents may contain more than one spoken annotation, as in a photographic library where each photo has its own annotation. In that case, multiple lattices must be retained, with links attaching them to other metadata, and a separate header holds information pertaining to all the lattices.
To deal with usability issues: the multiple lattices by themselves do not form adequate metadata for the spoken content, so a special SpokenContent header stores the language, a word lexicon, a phone lexicon, and optionally a word-to-phone index.
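The header contents just listed can be sketched as a simple record. The field names and example values below are our own shorthand for illustration, not normative MPEG-7 syntax:

```python
# Rough sketch of the information a SpokenContent header carries for a
# collection of lattices (field names and values are illustrative only).
from dataclasses import dataclass, field

@dataclass
class SpokenContentHeader:
    language: str                    # e.g. "en"
    word_lexicon: list[str]          # words that may appear on word links
    phone_lexicon: list[str]         # phone set used by the extracting ASR
    # optional word -> phone-sequence index for combined retrieval
    word_phone_index: dict[str, list[str]] = field(default_factory=dict)

header = SpokenContentHeader(
    language="en",
    word_lexicon=["please", "be", "quite", "sure"],
    phone_lexicon=["p", "l", "iy", "z", "b", "k", "w", "ay", "t", "sh", "r"],
    word_phone_index={"please": ["p", "l", "iy", "z"]},
)
```

Keeping the lexicons and phone set in one shared header avoids repeating them in every lattice, which matters when a document carries many short annotations.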
To deal with interoperability issues: the ASR decoder that decodes a query must use the same phone set for the language as the one recorded in the metadata, because this is the only reliable basis for retrieval-by-example. The ASR system used to extract the spoken content for the metadata and the one used at retrieval time may differ widely in their capabilities (the former is usually much more advanced), so [12] suggests including the phone set in the header as well.
On the Evolution of Videotext Description
Scheme and Its Validation Experiments for MPEG-7
Videotext is text superimposed on, or embedded in, images and video frames. For example, videotext can be the anchor’s name in a news clip, football scores superimposed on the frame, the opening and closing credits of a video, or even text written on someone’s clothing. Videotext can be extracted and used to browse, search, and classify video material. [13] looked at the standardization efforts around the VideoText Description Scheme (DS) and modeled and tested the validity of the VideoText DS for browsing and classifying videos. An application that extracts face and text information from video was used; the extracted information was stored in the XML format proposed by MPEG-7 (shown below), then parsed and used to browse and classify videos.
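As a rough illustration of the parsing step, a browser could read such descriptions with a standard XML parser and group videotext occurrences by type. The element and attribute names below follow the Videotext DS excerpt quoted in Figure 11; the instance values are invented:

```python
# Hypothetical Videotext-style instance data; element/attribute names
# follow the Videotext DS, the values are invented for illustration.
import xml.etree.ElementTree as ET

doc = """
<VideoDescription>
  <Videotext TextType="Superimposed" FontSize="24">
    <Text>John Smith, CNN</Text>
  </Videotext>
  <Videotext TextType="Embedded">
    <Text>Main Street</Text>
  </Videotext>
</VideoDescription>
"""

root = ET.fromstring(doc)

# Group the recognized strings by videotext type for a simple browsing index.
index = {}
for vt in root.findall("Videotext"):
    index.setdefault(vt.get("TextType"), []).append(vt.findtext("Text"))
```

Such an index is enough to drive simple browsing (e.g., show only superimposed captions), which is the scenario the authors validate below.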
What is VideoText Description Scheme?
VideoText DS is an MPEG-7 Description Scheme derived from the MovingRegion DS, which covers basic video-object attributes such as the bounding box and trajectory. It inherits all the attributes, decompositions, Descriptors, and Description Schemes of the MovingRegion DS. It also contains the syntactic attributes of the text, such as its language, font size, and font style, together with temporal and visual information such as its time, motion, color, and spatial location. Figure 11 shows the syntactic aspects of the VideoText DS.
<!-- ################################### -->
<!-- "Videotext DS": Syntactic Aspects   -->
<!-- ################################### -->
<simpleType name="TextDataType" base="string">
  <enumeration value="Superimposed"/>
  <enumeration value="Embedded"/>
</simpleType>
<complexType name="Videotext" base="MovingRegion" derivedBy="extension">
  <element name="Text" type="TextualDescription" minOccurs="0" maxOccurs="1"/>
  <attribute name="TextType" type="TextDataType" use="optional"/>
  <attribute name="FontSize" type="positiveInteger" use="optional"/>
  <attribute name="FontType" type="string" use="optional"/>
</complexType>
Figure 11 – VideoText Description Scheme – Syntactic
aspects
VideoText
DS contains the following elements and attributes:
o TextDataType: there are two types of text in a video: embedded text, which appears in the scene itself (on people’s clothing, shop signs, street names), and superimposed text, which is generated by title machines in studios.
o Videotext: a text region in a video or set of images.
o Text: the string containing the text recognized in the videotext.
o TextType: attribute giving the type of the videotext.
o FontSize: integer specifying the font size.
o FontType: string specifying the font style.
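To show how these elements and attributes come together in an instance, here is a small sketch that emits a description with the syntactic attributes above, using Python's standard ElementTree. The names follow the DS excerpt in Figure 11; the values are invented:

```python
# Sketch: emitting a Videotext description instance. Element and attribute
# names follow the Figure 11 DS excerpt; all values are invented examples.
import xml.etree.ElementTree as ET

vt = ET.Element("Videotext", {
    "TextType": "Superimposed",  # one of the TextDataType enumeration values
    "FontSize": "24",            # positiveInteger, serialized as a string
    "FontType": "Arial",
})
# The recognized string itself (minOccurs=0, maxOccurs=1 in the DS).
ET.SubElement(vt, "Text").text = "Final score: 2-1"

xml_str = ET.tostring(vt, encoding="unicode")
```

An extraction application would produce one such element per detected text region, nested inside the surrounding MovingRegion-derived description.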
<!-- ########################################## -->
<!-- "VideotextObject DS": Semantic Aspects     -->
<!-- ########################################## -->
<complexType name="VideotextObjectDS" base="Object" derivedBy="extension">
  <attribute name="id" type="ID"/>
  <attribute name="href" type="uri"/>
  <attribute name="CharacterCode" type="string"/>
</complexType>
Figure
12 – VideoTextObject Description Scheme – Semantic aspects
A videotext often appears next to an object in a video clip: text appearing under a face may be the name of that person, and words appearing on an object may be its brand name. This clear relationship between an object and a text in a video leads to the definition of a new DS called VideoTextObject, which contains the semantic attributes of the VideoText DS. Figure 12 shows the VideoTextObject DS.
Extraction of VideoText DS
Extraction can be done automatically or manually. There are three different methods for automatic extraction: region analysis, edge analysis, and the texture method. IBM proposed a region-based algorithm and Philips an edge-based one; the authors used both in their experiments on validating the VideoText DS for video browsing.
Validation of the VideoText DS
Two typical scenarios were used to test the validity of the VideoText DS: video browsing and video classification based on the VideoText DS. For video browsing, the authors used their automatic videotext event-detection technique to detect the presence of videotext in the video stream. Two videotext extraction applications were then used and compared: the IBM system and the Philips system described above. The test video files were taken from the MPEG-7 test data.
For video classification, the authors adopted an existing videotext application that classifies video segments into known categories based on the location of faces and text (the observation being that different TV categories exhibit different face and text trajectory patterns). Two methods were used and compared for extracting text and face trajectories: a domain-based method and Hidden Markov Models (HMMs).
The paper concluded that the VideoText DS proposed by the MPEG-7 group is a powerful feature: it provides rich, high-level semantic information that can be used in numerous video applications.
What is beyond MPEG-7?
Today, many elements exist to build an infrastructure for the delivery and consumption of multimedia content, but one piece is still missing: a “big picture” describing how the existing elements, and those under development, relate to each other. That is the aim of MPEG-21.
MPEG-21’s job can be described in several ways:
o MPEG-21 will define a multimedia framework to enable transparent and augmented use of multimedia resources across a wide range of networks and devices used by different communities.
o The multimedia content delivery chain encompasses content creation, production, delivery, and consumption. To support this, content has to be identified, described, managed, and protected. Transport and delivery of content will occur over a heterogeneous set of terminals and networks, within which events will occur and require reporting; such reporting must cover reliable delivery and the management of personal data, preferences, and financial transactions while taking user privacy into account. Doing this requires a multimedia framework that orchestrates the work of all the different parts.
The MPEG-21 multimedia framework will identify and define the key elements needed to support the multimedia delivery chain described above, the relationships between them, and the operations they support.
Summary
It is clear that this paper gave MPEG-7 more attention than the other MPEG standards, but we felt that, for the reader to understand what MPEG-7 is all about, it was necessary to understand what the other standards did and how they differ in the way multimedia content is coded, stored, delivered, decoded, and retrieved. We looked at what MPEG-1, MPEG-2, and MPEG-4 did for the encoding of multimedia resources, then at what MPEG-7 is intended to do and how it builds on features of the previous standards for efficient retrieval of multimedia content, and finally we gave an overview of the task MPEG-21 is intended to achieve.
The research papers included gave examples of attempts to put the MPEG-7 Description Schemes and Descriptors into action. It was obvious that the various MPEG-7 Ds and DSs are the way forward for searching, browsing, indexing, and managing multimedia content, although some improvements will be needed as the MPEG-7 standard evolves.
References
Research Groups
[1] ISO/IEC JTC1/SC29/WG11, Coding of Moving Pictures and Audio. “Short MPEG-1 Description,” Leonardo Chiariglione, June 1996. http://mpeg.telecomitalialab.com/standards/mpeg-1/mpeg-1.htm
[2] ISO/IEC JTC1/SC29/WG11, Coding of Moving Pictures and Audio. “Short MPEG-2 Description,” Leonardo Chiariglione, October 2000. http://mpeg.telecomitalialab.com/standards/mpeg-2/mpeg-2.htm
[3] ISO/IEC JTC1/SC29/WG11, Coding of Moving Pictures and Audio. “Overview of the MPEG-4 Standard,” Rob Koenen, March 2001. http://mpeg.telecomitalialab.com/standards/mpeg-4/mpeg-4.htm#E9E1
[4] ISO/IEC JTC1/SC29/WG11, Coding of Moving Pictures and Audio. “Overview of the MPEG-7 Standard (version 5.0),” José M. Martínez, March 2001. http://mpeg.telecomitalialab.com/standards/mpeg-7/mpeg-7.htm
[5] ISO/IEC JTC1/SC29/WG11, Coding of Moving Pictures and Audio. “MPEG-21 Overview,” Jan Bormans and Keith Hill, July 2001. http://mpeg.telecomitalialab.com/standards/mpeg-21/mpeg-21.htm
Research Papers
[6] Leonardo Chiariglione. “Open Source in MPEG.” ACM Digital Library.
[7] Jane Hunter. “MPEG-7 Behind the Scenes.” Distributed Systems Technology Centre, University of Queensland. http://www.dlib.org/dlib/september99/hunter/09hunter.html
[8] Michael J. Hu and Ye Jian. “Multimedia Description Framework (MDF) for Content Description of Audio/Video Documents.” ACM Digital Library.
[9] The XML Cover Pages. http://www.oasis-open.org/cover/mpeg7.html
[10] Eric Rehm (Singingfish.com). “Representing Internet Streaming Media Metadata using MPEG-7 Multimedia Description Schemes.” ACM Digital Library.
[11] Silvia Pfeiffer and Uma Srinivasan. “TV Anytime as an Application Scenario for MPEG-7.” ACM Digital Library.
[12] J.P.A. Charlesworth and P.N. Garner. “Spoken Content Metadata and MPEG-7.” ACM Digital Library.
[13] Chitra Dorai, Ruud Bolle, Nevenka Dimitrova, Lalitha Agnihotri, and Gang Wei. “On the Evolution of Videotext Description Scheme and Its Validation Experiments for MPEG-7.” ACM Digital Library.
Other Relevant Links
[14] XML Schema Tutorial for DDL. http://archive.dstc.edu.au/mpeg7-ddl/mpeg7-xmlschema.ppt
Scope of Survey
Since the MPEG-7 standard was still under study when this paper was written, the information given here about MPEG-7 is based on what was available from the International Organisation for Standardisation up to March 2001. The study papers included in this survey are based on what was available from the ACM Digital Library up to November 2001. This survey is neither a complete study of the MPEG standards nor a study of each and every functionality of MPEG-7; it is an overview of the available standards in chronological order up to the date of writing, geared toward giving the novice reader who wants to explore this area a general idea that will help in deciding on more in-depth readings.