


The construction of a standard speech database is an unavoidable
requirement for the progressive development of speech recognition and
understanding systems. The past three decades have seen a steady growth of
interest in corpus-based techniques for speech and natural language processing
throughout the world. Corpus-based methods are found at the heart of many
language and speech processing systems. Though Bangla is one of the most widely
spoken languages, with about 245 million speakers around the world, the history
of corpus generation and corpus-based Bangla speech recognition is short and
limited to the last few years. Among the very few examples of Bangla speech
corpora, probably the first was C-DAC’s Bangla Katha Bhandar, released in 2005
[1]. It was a product of the Center for Development of Advanced Computing (CDAC)
and is a collection of annotated speech corpora for Bangla. A similar work was
carried out by the Center for Research on Bangla Language Processing of
Bangladesh in 2010 [2]. In between these two, a research project financed by
the MOSICT of Bangladesh was completed in June 2008. Under this project,
large-scale speech corpora were recorded in the SIPL of Islamic University [3].
The SIPL speech corpora differ from the other two in that they were designed
especially for Bangla speech recognition. As a continuation of the project,
organizing, labeling and similar processing of the results are still ongoing.
This paper describes the design and development processes of the connected word
speech corpus. After the basics of speech corpora, a short description of the
BdNC01 text corpus is given to explain the selection of words for the speech
database design. In the following subsections, the speech recording, editing
processes and final outcome are discussed. The paper concludes with the usability of the



A corpus is a collection of pieces of language text in
electronic form, selected according to external criteria to represent, as far
as possible, a language or language variety as a source of data for linguistic
research [4]. A speech corpus, or spoken corpus, is a database of speech audio
files and text transcriptions in a format that can be used to create acoustic
models, which can then be used with a speech recognition engine [5]. In a broad
sense, speech corpora are of two types:

1. Read speech – This includes book excerpts, broadcast
news, lists of words and sequences of numbers.

2. Spontaneous speech – This includes dialogs between two or
more people (including meetings), narratives such as a person telling a story,
map tasks in which one person explains a route on a map to another, and
appointment tasks in which two people try to find a common meeting time based
on their individual schedules.

A special kind of speech corpus is the non-native speech
database, which contains speech with a foreign accent.

A speech corpus is the basis both for analyzing the
characteristics of the speech signal and for developing speech synthesis and
recognition systems. Corpus content has become more complex, and corpus size
larger, as computational power and speech technology have developed. One method
of selecting the speech content of a corpus is to derive the speech corpus from
a text corpus. For example, WSJCAM0, a speech corpus of British English, was
recorded at Cambridge University from the Wall Street Journal text corpus [6].
Before recording a speech corpus, careful selection of the vocabulary is
important, since on average each out-of-vocabulary word causes between 1.5 and
2 recognition errors [7]. The recognizer vocabulary is usually designed with
the goal of maximizing lexical coverage for the expected input. A
straightforward approach is to choose the N most frequent words in the training
data, which means that the usefulness of the vocabulary is highly
dependent upon the representativeness of the training data [8].
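
The vocabulary selection described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the function names, the toy English token lists and the helper `expected_error_range` are assumptions made for illustration. Only the "N most frequent words" rule and the figure of 1.5 to 2 errors per out-of-vocabulary word come from the text.

```python
from collections import Counter


def build_vocabulary(training_tokens, n):
    """Return the set of the n most frequent tokens in the training data."""
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(n)}


def oov_rate(vocabulary, test_tokens):
    """Return the fraction of test tokens that are out of vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return missing / len(test_tokens)


def expected_error_range(oov_count, low=1.5, high=2.0):
    """Rough error estimate: each OOV word causes between `low` and `high` errors."""
    return oov_count * low, oov_count * high


# Toy data (hypothetical); a real study would use a large text corpus.
training = "the cat sat on the mat the dog sat on the log".split()
test = "the cat sat on the rug".split()

vocab = build_vocabulary(training, 3)
rate = oov_rate(vocab, test)
oov_count = sum(1 for token in test if token not in vocab)
print(sorted(vocab), round(rate, 3), expected_error_range(oov_count))
```

Measuring the out-of-vocabulary rate on text that was not used to build the vocabulary is what exposes the dependence on representative training data: a vocabulary drawn from unrepresentative text will show a high rate on the expected input.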

Speech recognition systems can be categorized by several parameters.
Influential parameters include speech type, speaker dependency and vocabulary
size. The importance of these parameters depends on the typical design
considerations of a recognition system, which may be closely related to a
specific application or task [9]. In terms of speech type, speech recognition
devices usually face recognition problems with isolated (discrete), connected,
or continuous speech. Discrete speech requires a significant pause between
words, perhaps 250 milliseconds. A single utterance may consist of a single
word or a short string of isolated words. In continuous speech recognition
systems, fluent speech flows with a rhythm and the words run into each other,
which makes recognition harder. In between these two, connected speech
recognizers do not require an intermediate pause between inputs, but are able
to detect word boundaries within a string of connected speech. They do,
however, require that the user carefully enunciate each word, as in dictation.
Although much of the relevant literature treats connected words and continuous
speech as alternative terms, the wide diversity of applications makes it
necessary to define connected words separately. In speech recognition, the
difference in classification between “connected words” and “continuous speech”
is somewhat technical. A connected word recognizer uses words as recognition
units, which can be trained in an isolated word mode. This type of system is
particularly efficient in dictation and voice command recognition. Discrete,
connected, and continuous speech recognition systems can be classified further
as either speaker-dependent or speaker-independent systems. Speaker-dependent
systems require that each speaker enter several samples of each word in the
vocabulary to form the reference templates [10]. Another
important consideration in designing a speech corpus is its vocabulary size. The
adjectives “small”, “medium” and “large” are applied to vocabulary sizes of the
order of 100, 1000 and (over) 5000 words, respectively. A typical small
vocabulary recognizer can recognize only the ten digits; a typical large
vocabulary recognition system can recognize 20000 words [9]. For dictation and
voice command recognition, a medium-size vocabulary may be considered enough
for satisfactory performance. This is supported by the study of Gould, Conti,
and Hovanyecz [10], which examined the feasibility of a limited-capability
automatic dictation machine simulated in isolated and connected speech modes
with various vocabulary sizes. In their experiment, users composed and edited
letters with the simulated voice recognizer, which had either a 1000-word
vocabulary or an unlimited vocabulary. The 1000-word vocabulary was composed of
the 1000 most frequently used English words. A later analysis indicated
that roughly 75% of the words used in the letter-writing task were available in
the 1000-word vocabulary.
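
To make the discrete speech condition concrete, the following minimal Python sketch splits an utterance into words wherever the signal stays silent for at least 250 milliseconds, the pause length mentioned above. It is an illustration under stated assumptions, not the method of any cited recognizer: the function name, the 10 ms frame length, the energy threshold and the synthetic energy values are all hypothetical.

```python
def split_on_pauses(frame_energies, frame_ms=10, threshold=0.01, min_pause_ms=250):
    """Return (start_ms, end_ms) spans of speech separated by long pauses.

    A run of frames below `threshold` counts as a word boundary only if it
    lasts at least `min_pause_ms`; shorter gaps stay inside the segment.
    """
    min_pause_frames = -(-min_pause_ms // frame_ms)  # ceiling division
    segments = []
    start = None        # first frame of the segment being built
    last_voiced = None  # most recent frame at or above the threshold
    for i, energy in enumerate(frame_energies):
        if energy < threshold:
            continue
        if start is None:
            start = i
        elif i - last_voiced - 1 >= min_pause_frames:
            # The silent gap was long enough: close the previous segment.
            segments.append((start * frame_ms, (last_voiced + 1) * frame_ms))
            start = i
        last_voiced = i
    if start is not None:
        segments.append((start * frame_ms, (last_voiced + 1) * frame_ms))
    return segments


# Synthetic energies: 200 ms speech, 300 ms pause, 150 ms speech,
# 100 ms short gap, 50 ms speech.
energies = [0.5] * 20 + [0.0] * 30 + [0.4] * 15 + [0.0] * 10 + [0.6] * 5
print(split_on_pauses(energies))  # [(0, 200), (500, 800)]
```

The 300 ms pause produces a boundary while the 100 ms gap does not. A connected word recognizer cannot rely on such pauses and has to detect word boundaries by other means.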



The BdNC01 corpus is a text corpus collected from the web editions
of several influential Bangla newspapers during 2005-2011. BdNC01 contains a
large amount of Bangla text, including more than 11 million word tokens. As a
requirement of this work, a program was developed in the C language to parse
and sort the text in the BdNC01 corpus; the result was a list of words with
their frequency of occurrence in the text. The objective of this processing was
to select a list of 1000 or more high-frequency words that is a good
representative of the language, in order to construct a significant
connected speech database. A part of the list is shown in Table-1, and the 1000
most frequent words were selected to find some practical Bangla sentences. From
three randomly selected issues of daily newspapers, 52 sentences were chosen
such that they include these high-frequency words. The list of sentences was
accepted for a small-to-medium vocabulary speech database and includes 252
different words in 343 places. A special characteristic of this list is that
some words occur in multiple places with different contexts.
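
The processing described above can be sketched in a few lines of Python. This is not the C program used in the project, which is not reproduced here. The tokenization rule, the function names, the `min_share` parameter and the three toy sentences are assumptions chosen for illustration. The sketch builds a word frequency list, keeps sentences made up of high-frequency words, and reports the number of different words and the number of word places in the selected sentences.

```python
import re
from collections import Counter

# Assumption: a word is any run of characters other than whitespace and
# common punctuation, including the Bangla danda.
TOKEN = re.compile(r"[^\s।,.;:!?()]+")


def tokenize(text):
    """Split text into word tokens."""
    return TOKEN.findall(text)


def frequency_list(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokenize(text)).most_common()


def select_sentences(sentences, frequent_words, min_share=1.0):
    """Keep sentences whose share of high-frequency tokens is at least min_share."""
    selected = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if not tokens:
            continue
        share = sum(token in frequent_words for token in tokens) / len(tokens)
        if share >= min_share:
            selected.append(sentence)
    return selected


def vocabulary_stats(sentences):
    """Return (number of different words, number of word places)."""
    tokens = [token for sentence in sentences for token in tokenize(sentence)]
    return len(set(tokens)), len(tokens)


# Toy sample (hypothetical); the real input would be the BdNC01 text.
sentences = ["আমি ভাত খাই।", "আমি বই পড়ি।", "তুমি গান গাও।"]
freq = frequency_list(" ".join(sentences))
top_words = {word for word, _ in freq[:5]}
selected = select_sentences(sentences, top_words)
print(freq[0], len(selected), vocabulary_stats(selected))
```

Applied to the 52 selected sentences, `vocabulary_stats` would return the pair the text reports as 252 different words in 343 places. The gap between the two numbers reflects words that recur in different contexts.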