Project: MPEG-7-based Audio Annotation for the Archival of Digital Video

Spoken Content Demonstrator

This demonstration tool extracts an MPEG-7 SpokenContent description from an input speech signal.

The MPEG-7 SpokenContent Description Scheme (DS) is a standardized representation of the output of an Automatic Speech Recognition (ASR) system.

It consists of:
  1. A header containing general information about the spoken signal and the ASR system (notably the word or phone lexicon used by the recognizer and some phone confusion statistics).
  2. A lattice: a directed graph in which each path represents one possible transcription. Each node in the graph represents a time point between the beginning and the end of the speech signal. A link between two nodes corresponds to a recognition hypothesis (e.g. a word or a phone).
This information is stored in a specific MPEG-7 XML format.
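
The lattice structure described above can be illustrated with a minimal in-memory sketch. It is not the demonstrator's implementation and does not read the MPEG-7 XML format; the names Link and enumerate_paths, the node times, the phone labels, and the probabilities are all hypothetical and chosen only to show how paths through the graph correspond to alternative transcriptions. The sketch assumes every link points forward in time, so the graph has no cycles.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One recognition hypothesis between two lattice nodes (hypothetical type)."""
    start: int    # index of the source node
    end: int      # index of the target node
    label: str    # hypothesised phone (or word)
    prob: float   # probability assigned to this hypothesis


def enumerate_paths(node_times, links):
    """Return all (labels, probability) paths from the first node to the last,
    sorted by decreasing probability. Assumes links only go forward in time."""
    outgoing = defaultdict(list)
    for link in links:
        outgoing[link.start].append(link)

    final = len(node_times) - 1
    results = []

    def walk(node, labels, prob):
        if node == final:
            results.append((labels, prob))
            return
        for link in outgoing[node]:
            walk(link.end, labels + [link.label], prob * link.prob)

    walk(0, [], 1.0)
    return sorted(results, key=lambda r: r[1], reverse=True)


# Illustrative data only: four time points (in seconds) and five phone links.
node_times = [0.00, 0.12, 0.25, 0.40]
links = [
    Link(0, 1, "h", 0.9),
    Link(0, 1, "f", 0.1),
    Link(1, 2, "a", 0.8),
    Link(1, 2, "o", 0.2),
    Link(2, 3, "t", 1.0),
]

paths = enumerate_paths(node_times, links)
best_labels, best_prob = paths[0]
print(" ".join(best_labels), round(best_prob, 3))
```

With this toy data the lattice encodes four alternative phone sequences, of which the most probable is the one printed. Exhaustive enumeration is only practical for small lattices; a real system would typically use dynamic programming over the graph instead.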

This demonstration is based on a phone recognizer using a lexicon of 45 German phonetic units (including silence). Since no word models are defined here, the resulting lattices contain only phone hypotheses. The extracted SpokenContent descriptions can be used in various applications, notably spoken document retrieval (SDR).
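
As a rough illustration of how phone lattices can support retrieval, the sketch below indexes every pair of consecutive phones found along any lattice path and ranks documents by how many of the query's phone pairs they contain. This is one simple approach chosen for illustration, not the method used by this project. The function names phone_ngrams and score, the document names, and the phone symbols are hypothetical, and the toy lattices do not use the demonstrator's 45-unit German lexicon.

```python
def phone_ngrams(links, n=2):
    """Collect every run of n consecutive phone labels along any lattice path.

    links is a list of (start_node, end_node, phone) tuples."""
    outgoing = {}
    for start, end, phone in links:
        outgoing.setdefault(start, []).append((end, phone))

    grams = set()

    def extend(node, run):
        if len(run) == n:
            grams.add(tuple(run))
            return
        for end, phone in outgoing.get(node, []):
            extend(end, run + [phone])

    for start in list(outgoing):
        extend(start, [])
    return grams


def score(query_phones, grams, n=2):
    """Fraction of the query's phone n-grams that occur in the lattice."""
    wanted = [tuple(query_phones[i:i + n])
              for i in range(len(query_phones) - n + 1)]
    if not wanted:
        return 0.0
    return sum(g in grams for g in wanted) / len(wanted)


# Illustrative lattices only, one per (hypothetical) audio document.
documents = {
    "clip_a": [(0, 1, "h"), (0, 1, "f"), (1, 2, "a"), (1, 2, "o"), (2, 3, "t")],
    "clip_b": [(0, 1, "m"), (1, 2, "u"), (2, 3, "s")],
}

query = ["h", "o", "t"]
ranking = sorted(
    documents,
    key=lambda name: score(query, phone_ngrams(documents[name])),
    reverse=True,
)
print(ranking)
```

Because the index is built from all lattice paths rather than from a single best transcription, a query can match a document even when the matching phones were not the recognizer's top hypothesis.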


To run the demonstration:
  1. Upload an audio file in WAV or MP3 format.
  2. Start the SpokenContent extraction process.
  3. Download the resulting MPEG-7 XML SpokenContent description.

