The automatic extractor first creates a separate file of pitch-synchronous zcep coefficients for each carrier word. During this first pass, it also gathers statistics for that specific inventory, such as the average duration of each phoneme and the means and standard deviations of the zcep coefficients for each phoneme. During the second pass, each phoneme segment is evaluated and a table is created that records where that segment matches best with every other segment of the same type, a number signifying the goodness of that match, a link to the previous segment in the carrier word in which the current segment is located, and a link to the following segment in the same carrier word. The goodness ratings for all of a segment's matches are added together to give an overall goodness rating for that segment. The overall ratings for adjacent segments are then added together to give an overall diphone rating.
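The two summations at the end of this pass can be sketched as follows. This is only an illustration, not the extractor's actual code: the tuple layout of the match table, the segment identifiers, and the function names `segment_ratings` and `diphone_ratings` are all assumptions made for the example.

```python
from collections import defaultdict


def segment_ratings(matches):
    """Sum the goodness of every match recorded for each segment.

    `matches` is an iterable of (segment_id, other_segment_id, goodness)
    rows. This row layout is assumed here for illustration only.
    """
    totals = defaultdict(float)
    for segment_id, _other_id, goodness in matches:
        totals[segment_id] += goodness
    return dict(totals)


def diphone_ratings(totals, next_link):
    """Add the overall ratings of adjacent segments.

    `next_link` maps a segment id to the id of the following segment in
    the same carrier word, or to None for the last segment of the word.
    """
    ratings = {}
    for segment_id, following_id in next_link.items():
        if following_id is not None:
            ratings[(segment_id, following_id)] = (
                totals.get(segment_id, 0.0) + totals.get(following_id, 0.0)
            )
    return ratings


# Hypothetical data: segment "a1" is followed by segment "b1".
table = [("a1", "a2", 0.5), ("a1", "a3", 0.25), ("b1", "b2", 1.0)]
totals = segment_ratings(table)
print(diphone_ratings(totals, {"a1": "b1", "b1": None}))
```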
Then, for each diphone, at most 5 examples (those with the best overall goodness ratings) are chosen for the third pass. These examples are compared to each other in exactly the same manner as above (at the phoneme segment level), except that each segment is compared only with other segments of the same phoneme type that belong to the diphone examples in the third-pass set. A new set of goodness ratings is determined for each diphone.
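Selecting the third-pass set amounts to a per-diphone shortlist. The sketch below assumes that a higher rating means a better match (the text does not say which direction is better) and uses a hypothetical (diphone_type, example_id, rating) record layout; `shortlist` is an illustrative name.

```python
import heapq
from collections import defaultdict


def shortlist(examples, k=5):
    """Keep at most `k` best-rated examples of each diphone type.

    `examples` is an iterable of (diphone_type, example_id, rating).
    Assumption: a larger rating is a better rating.
    """
    by_type = defaultdict(list)
    for diphone_type, example_id, rating in examples:
        by_type[diphone_type].append((rating, example_id))
    return {
        diphone_type: [ex for _rating, ex in heapq.nlargest(k, items)]
        for diphone_type, items in by_type.items()
    }


# Hypothetical ratings for six examples of one diphone and two of another.
candidates = [("a-b", f"w{i}", float(i)) for i in range(1, 7)]
candidates += [("b-c", "w7", 2.0), ("b-c", "w8", 3.0)]
print(shortlist(candidates))
```

If lower ratings turn out to be better, `heapq.nsmallest` would be the corresponding call.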
Based on these scores, the best-rated example of each diphone is extracted from its carrier word at its most inclusive boundaries. For instance, a diphone x1x2 consists of some part of x1 and some part of x2. x1 matches with all other phonemes of type 1 in the final diphone set at various locations. Thus the first boundary of diphone x1x2 will be the location that includes all the matching locations x1 has with other 1's, but no more. The extraction location in x2 is handled in the same way (diphone x1x2 will include all matching locations of x2 with other phonemes of type 2, but no more). A table of all the possible diphone matching locations is written to a file for use by the speech synthesizer.
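One reading of "most inclusive boundaries" is that the diphone starts at the earliest matching location in x1 and ends at the latest matching location in x2. The sketch below follows that reading; it assumes matching locations are given as times on the carrier word's own timeline, and `inclusive_boundaries` is a hypothetical name.

```python
def inclusive_boundaries(x1_match_times, x2_match_times):
    """Return (start, end) spanning every matching location, and no more.

    Assumption: each list holds the times (on the carrier word's
    timeline) at which x1 or x2 matched other segments of the same
    phoneme type in the final diphone set.
    """
    if not x1_match_times or not x2_match_times:
        raise ValueError("each segment needs at least one matching location")
    # Earliest match in x1 opens the diphone; latest match in x2 closes it.
    return min(x1_match_times), max(x2_match_times)


# Hypothetical matching locations, in seconds.
print(inclusive_boundaries([0.12, 0.10, 0.15], [0.31, 0.35, 0.29]))
```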
Weights
Currently, the goodness ratings used to choose both the best phoneme segment (and hence
diphone example) and the best boundary locations are controlled by 3 weights for
plosives and 4 weights for all other sounds. The 3 weights for the plosives
are:
Three different partial lists of carrier words were each recorded by 4 different talkers (2 male and 2 female), making a total of 12 sets of partial carrier word lists. The lists differed in the type of carrier word: nonsense words, 3-syllable words, and 7-syllable words. Each list was designed so that the diphone inventory resulting from it would be sufficient to synthesize 72 nonsense sentences.
Each of the carrier word sets was labeled phonetically using the speech lab's HMM labeler, and the labels were then adjusted by hand. Each set was also amplitude adjusted to 90% of the maximum amplitude.
Then the automatic extractor was run twice on each of the carrier word sets. In the first run, the search space for the boundary location in the 2 phoneme segments being matched was limited to the center 1/100th of the total duration of each segment, so the match point was effectively restricted to the exact middle of each phoneme. In the second run, the search space was limited to the center 1/2 of the total duration of each segment, so possible boundary matching locations were checked within the center 50% of each of the 2 phoneme segments being matched.
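The two search-space conditions differ only in the fraction of each segment that is searched. A minimal sketch of that window computation, assuming segment start and end times are known from the labels (`center_window` and the millisecond values are illustrative, not taken from the extractor):

```python
def center_window(start, end, fraction):
    """Return the (low, high) window covering the center `fraction` of a segment."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if end < start:
        raise ValueError("segment end precedes its start")
    midpoint = (start + end) / 2
    half_width = (end - start) * fraction / 2
    return midpoint - half_width, midpoint + half_width


# Hypothetical 100 ms segment under the two conditions described above.
narrow = center_window(0.0, 100.0, 1 / 100)  # center 1/100th of the segment
wide = center_window(0.0, 100.0, 1 / 2)      # center 1/2 of the segment
print(narrow, wide)
```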
Finally, the 72 nonsense sentences were synthesized from each inventory, giving 72 x 3 x 4 x 2 = 1728 sentences. However, a review of the 7-syllable inventories revealed numerous mislabels, and correcting the labels would be tedious and time consuming, making 7-syllable words an impractical choice of carrier words for full diphone inventories. Previous diphone inventories created from longer carrier words had also been suboptimal because of the tendency to make unstressed segments extremely short, which makes it difficult to create intelligible speech from them. So it was decided to limit the comparison study to inventories created from the 3-syllable carrier words and the nonsense carrier words, or 72 x 2 x 4 x 2 = 1152 sentences.
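The sentence counts follow from crossing the experimental factors. The short check below enumerates the conditions; the talker labels and the way the conditions are named are placeholders for illustration only.

```python
from itertools import product

sentences = range(72)                                # 72 nonsense sentences
list_types = ["nonsense", "3-syllable", "7-syllable"]
talkers = ["M1", "M2", "F1", "F2"]                   # placeholder labels
search_windows = ["center 1/100", "center 1/2"]

full_design = list(product(sentences, list_types, talkers, search_windows))

# Dropping the 7-syllable inventories leaves the reduced comparison set.
reduced_design = [row for row in full_design if row[1] != "7-syllable"]

print(len(full_design), len(reduced_design))
```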
Next Step
A somewhat formal comparison of the intelligibility and naturalness of the 2 sets
under the 2 conditions is planned for the next few weeks.