Speech Research Lab: Speech Synthesis, Speech Recognition, and Speech Processing

Index:

General Information
About ModelTalker
ModelTalker Software
InvTool Software & Tutorial
BCCdb Software

Hardware Requirements
Inventory Design

How ModelTalker Works:

The ModelTalker System requires the person whose voice we wish to emulate to record an inventory of speech. These recordings are stored in a database. Speech is then synthesized from this database, using longer stretches of recorded speech for utterances that closely or exactly match recordings in the database, and shorter stretches concatenated together for novel utterances. When speech is synthesized from the longer stretches, its intonation and timing patterns do not need to be altered, so the result is close to the quality of recorded natural speech. When speech is synthesized by concatenating shorter stretches, the timing and intonation patterns are generated from a set of rules. The resulting speech sounds like high-quality synthetic speech but still retains the characteristics of the recorded voice.

The shortest stretch of speech in the ModelTalker database is a unit we call a "biphone". A biphone is a segment of speech that consists of two adjacent, distinct phonemes in a word or phrase. For example, in the word "hat", one biphone would be "ha" and another would be "at". Synthesis consists of choosing biphones and joining them at the locations where they match best acoustically. Because American English uses a limited number of phonemes (around 42), the number of biphones needed for synthesis is manageable: at most about 42 × 42 = 1,764 ordered pairs. However, because a phoneme can be significantly affected acoustically by the sounds surrounding it, we can improve synthesis by including more than one example of common biphones, recorded in different contexts.
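The idea of a biphone can be illustrated with a minimal Python sketch. This is not ModelTalker's code: the function names and the phoneme labels ("h", "ae", "t") are assumptions made for illustration only. It splits a phoneme sequence into adjacent pairs and computes a loose upper bound on the number of biphones for a given phoneme count.

```python
def biphones(phonemes):
    """Return the adjacent phoneme pairs (biphones) in a phoneme sequence."""
    return list(zip(phonemes, phonemes[1:]))


def max_biphone_count(num_phonemes):
    """Loose upper bound: every ordered pair of phonemes, whether or not it occurs."""
    return num_phonemes * num_phonemes


# Hypothetical phoneme labels for the word "hat".
hat = ["h", "ae", "t"]
print(biphones(hat))          # the pairs corresponding to "ha" and "at"
print(max_biphone_count(42))  # bound for roughly 42 phonemes
```

The bound counts every ordered pair, so the number of biphones that must actually be recorded may be smaller. Recording several examples of common biphones in different contexts would raise the total again.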

By including a complete set of biphones in the database, it is possible to synthesize virtually any word or phrase in American English. The database also includes common words and phrases, function words ("the", "and", etc.) in many different contexts, and user-chosen words and phrases. During synthesis, the database is searched, and any words or phrases found as complete entities are used directly. If a phrase, word, syllable, or other longer stretch of speech cannot be found, biphones are used for synthesis instead.
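The search order described above, longest recorded stretch first and biphones as the fallback, can be sketched as a greedy longest-match in Python. This is only an illustration under stated assumptions, not ModelTalker's actual algorithm: `plan_units`, `recorded`, and `lexicon` are hypothetical names, matching is done only on whole words (the real system may also match syllables and other stretches), and biphones that span word boundaries are ignored for simplicity.

```python
def plan_units(words, recorded, lexicon):
    """Greedy longest-match: prefer recorded words/phrases, else fall back to biphones.

    words    -- list of words in the utterance
    recorded -- set of words and phrases stored whole (hypothetical database)
    lexicon  -- dict mapping a word to its phoneme list (hypothetical)
    """
    plan = []
    i = 0
    while i < len(words):
        # Try the longest stretch starting at word i first, then shorter ones.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in recorded:
                plan.append(("recorded", chunk))
                i = j
                break
        else:
            # No recorded stretch starts here: build the word from biphones.
            phones = lexicon[words[i]]
            for pair in zip(phones, phones[1:]):
                plan.append(("biphone", pair))
            i += 1
    return plan


recorded = {"good morning", "the"}
lexicon = {"hat": ["h", "ae", "t"]}
for unit in plan_units(["good", "morning", "the", "hat"], recorded, lexicon):
    print(unit)
```

In this example "good morning" and "the" are taken whole from the recorded set, while "hat" is assembled from two biphones. A greedy match is the simplest choice for a sketch; a real system would also weigh how well neighboring units match acoustically.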

Currently, approximately 1650 words and phrases are recorded for the database. However, we are still investigating what the optimal inventory of recorded speech should contain (see Inventory Design).