A.I. duPont Hospital for Children
University of Delaware
Info for Clinicians
With the help of Clinicians and clients with ALS, we seek to:
The ModelTalker speech synthesis system is a state-of-the-art speech synthesizer designed specifically to meet the needs and interests of AAC users. The system can provide a personalized voice with the quality of recorded natural speech for selected utterances, and high-quality, unrestricted text-to-speech synthesis in the same "voice" for novel utterances. The ModelTalker system presently works with augmentative communication tools such as the Prentke-Romich Wivik program. In the course of the project, it will be extended to work with emerging computer-based AAC devices as a "plug-compatible" synthesizer, giving augmented communicators the alternative of using their own personalized voice in their AAC device.
The ModelTalker system requires that the person whose voice we wish to emulate record an inventory of speech. These recordings are stored in a database. Speech is then synthesized from this database, using longer stretches of recorded speech for utterances that closely or exactly match recordings in the database, and shorter stretches of recorded speech appended together for novel utterances. When speech is synthesized from the longer stretches, the intonation and timing patterns do not need to be altered, so the result is close to the quality of recorded natural speech. When speech is synthesized by appending shorter stretches, the timing and intonation patterns are generated from a set of rules. The resulting speech sounds like high-quality synthetic speech, but still retains the characteristics of the recorded voice in the database.
The shortest stretch of speech in the ModelTalker database is something we refer to as a "biphone". A biphone is a segment of speech that consists of two adjacent, distinct phonemes in a word or phrase. For example, in the word "hat", one biphone would be "ha", and another would be "at". Synthesis consists of choosing biphones and appending them at the locations where they match best acoustically. Because American English uses a limited number of phonemes (around 42), the number of biphones needed for synthesis is manageable. However, because a phoneme can be significantly affected acoustically by the sounds surrounding it, we can improve synthesis by including more than one example of common biphones in different contexts.
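The biphone idea can be illustrated with a minimal Python sketch. It is not ModelTalker code: the function name and the single-letter phoneme labels are assumptions made for illustration, and the count at the end is only an upper bound on ordered phoneme pairs, since not every pair occurs in English.

```python
def biphones(phonemes):
    """Return every pair of adjacent phonemes, in order."""
    return [(a, b) for a, b in zip(phonemes, phonemes[1:])]

# "hat" written with three illustrative phoneme labels (not a real phoneme set).
print(biphones(["h", "a", "t"]))  # [('h', 'a'), ('a', 't')]

# Upper bound on distinct ordered pairs for an inventory of about 42 phonemes.
PHONEME_COUNT = 42
print(PHONEME_COUNT ** 2)  # 1764
```

The pairs overlap by one phoneme, which is why "hat" with three phonemes yields two biphones.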
By including a complete set of biphones in the database, it is possible to synthesize virtually any word or phrase in American English. The database also includes common words and phrases, function words ("the", "and", etc.) in many different contexts, and user-chosen words and phrases. During synthesis, the database is searched, and if words or phrases can be found as complete entities, they will be used. If no phrase, word, syllable, or other longer stretch of speech can be found, biphones will be used for synthesis.
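The search order described above (whole recorded stretches first, biphones as a fallback) can be sketched as a greedy longest-match lookup. This is only an illustration under stated assumptions, not the actual ModelTalker search: `plan_units`, the `recorded` set, and the `pronunciations` table are hypothetical names, the phoneme labels are made up, and the sketch ignores acoustic matching, syllable-sized units, and biphones that span word boundaries.

```python
def plan_units(words, recorded, pronunciations):
    """Greedy longest-match plan: recorded stretches first, biphones as fallback."""
    plan = []
    i = 0
    while i < len(words):
        # Try the longest stretch starting at word i, then shorter ones.
        for j in range(len(words), i, -1):
            stretch = tuple(words[i:j])
            if stretch in recorded:
                plan.append(("recorded", " ".join(stretch)))
                i = j
                break
        else:
            # No recorded stretch starts here: fall back to biphones.
            phones = pronunciations[words[i]]
            for a, b in zip(phones, phones[1:]):
                plan.append(("biphone", a + b))
            i += 1
    return plan

# Hypothetical database contents, for illustration only.
recorded = {("good", "morning"), ("the",)}
pronunciations = {"hat": ["h", "a", "t"]}

print(plan_units(["good", "morning", "the", "hat"], recorded, pronunciations))
```

In this example the phrase "good morning" and the function word "the" are taken from the recordings as complete entities, while "hat" is built from the biphones "ha" and "at".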
Currently, approximately 1400 words and phrases are recorded for the database. However, we are still in the process of investigating the content of the optimal inventory of recorded speech (see Inventory Design).