Introduction
This demonstration tool extracts an MPEG-7 SpokenContent description from
an input speech signal.
The MPEG-7 SpokenContent Description Scheme (DS) is a standardized
representation of the output of an Automatic Speech Recognition (ASR)
system.
It consists of:
- A header containing general information about the spoken signal and
the ASR system (notably the word or phone lexicon used by the
recognizer and some phone confusion statistics).
- A lattice, i.e. a directed graph in which the different
paths represent the different possible transcriptions. Each node in the
graph represents a time point between the beginning and the end of the
speech signal. A link between two nodes corresponds to a recognition
hypothesis (e.g. a word or a phone).
This information is stored in a specific MPEG-7 XML format.
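The lattice structure described above can be illustrated with a minimal
Python sketch. It is not the MPEG-7 schema and not the code of this tool:
the class names (Link, Lattice), the build_example helper, the phone
labels, the time stamps and the probabilities are all assumptions made up
for illustration. Nodes are time points, each link carries one recognition
hypothesis, and every path from the first to the last node is one possible
transcription.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Link:
    """One recognition hypothesis between two time points (hypothetical type)."""
    target: int         # index of the node this link leads to
    label: str          # hypothesised phone (or word)
    probability: float  # illustrative score, assumed to lie in (0, 1]


@dataclass
class Lattice:
    """Directed graph; node i stands for the time point times[i] (in seconds)."""
    times: List[float]
    links: Dict[int, List[Link]] = field(default_factory=dict)

    def add_link(self, source: int, target: int, label: str, probability: float) -> None:
        # A hypothesis always spans a positive stretch of the signal.
        if not self.times[source] < self.times[target]:
            raise ValueError("links must point forward in time")
        self.links.setdefault(source, []).append(Link(target, label, probability))

    def transcriptions(self, node: int = 0) -> Iterator[Tuple[List[str], float]]:
        """Yield (labels, path probability) for every path from node to the last node."""
        if node == len(self.times) - 1:
            yield [], 1.0
            return
        for link in self.links.get(node, []):
            for labels, prob in self.transcriptions(link.target):
                yield [link.label] + labels, link.probability * prob


def build_example() -> Lattice:
    """Build a toy lattice with two competing phone hypotheses (invented values)."""
    lattice = Lattice(times=[0.00, 0.08, 0.15, 0.31])
    lattice.add_link(0, 1, "h", 0.9)
    lattice.add_link(1, 2, "a", 0.6)  # two alternatives for the same time span
    lattice.add_link(1, 2, "E", 0.4)
    lattice.add_link(2, 3, "t", 1.0)
    return lattice


if __name__ == "__main__":
    for labels, prob in build_example().transcriptions():
        print(" ".join(labels), round(prob, 2))
```

In this toy example the two links between the second and third time point
are competing hypotheses, so the lattice encodes two transcriptions that
share their first and last phone. Enumerating all paths is only practical
for small lattices such as this one.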
This demonstration is based on a phone recognizer using a lexicon of 45
German phonetic units (including silence). Since we do not define any word
model here, the resulting lattices only contain phone hypotheses.
The extracted SpokenContent DSs can serve several kinds of applications,
most notably spoken document retrieval (SDR).
Steps
- Upload an audio file in WAV or MP3 format.
- Start the SpokenContent extraction process.
- Download the resulting MPEG-7 XML SpokenContent description.