

Speech Research Laboratory

duPont Hospital for Children
and the
University of Delaware



Automatic Diphone Extraction


Objective

Improve automatic diphone extraction techniques and provide software that device manufacturers and speech clinicians can use to quickly develop user-selected voices for the ASEL speech synthesizer.

Method

The automatic extractor creates a set of diphones with variable, context-dependent boundaries from a set of recorded carrier words and a corresponding set of pitch period files. Each pitch period file lists the location of the onset of each pitch period in voiced speech and of each 10 msec epoch in unvoiced speech. The extractor assumes that the carrier words have already been labeled with the appropriate phoneme labels (using an HMM labeler developed in the lab and trained on the TIMIT database).

The automatic extractor first creates a separate file of pitch-synchronous zcep coefficients for each carrier word. During this first pass, it also gathers statistics such as the average duration of each phoneme in that specific inventory and the mean and standard deviation of the zcep coefficients for each phoneme. During the next pass, each phoneme segment is evaluated and a table is created that records where that segment matches best with every other segment of the same type, a number signifying the goodness of that match, a link to the previous segment in the carrier word that contains the current segment, and a link to the following segment in the same carrier word. The goodness ratings for all of a segment's matches are added together to give an overall goodness rating for that segment. The overall ratings for adjacent segments are then added together to give an overall diphone rating.
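The way the second pass rolls match scores up into segment and diphone ratings can be sketched as follows. This is a minimal illustration rather than the lab's code: the data layout (`match_scores`, `next_segment`), the segment ids, and the function names are hypothetical, and it assumes that a higher number means a better match, which the description does not state.

```python
def segment_ratings(match_scores):
    """Sum each segment's pairwise match scores into one overall rating.

    match_scores maps a segment id to the list of goodness values for its
    best match with every other segment of the same phoneme type.
    """
    return {seg: sum(scores) for seg, scores in match_scores.items()}


def diphone_ratings(seg_rating, next_segment):
    """Add the ratings of adjacent segments to rate each candidate diphone.

    next_segment maps a segment id to the id of the segment that follows it
    in the same carrier word (None for the last segment of the word).
    """
    ratings = {}
    for seg, nxt in next_segment.items():
        if nxt is not None:
            ratings[(seg, nxt)] = seg_rating[seg] + seg_rating[nxt]
    return ratings


# Toy data: one carrier word with three labeled segments (made-up values).
match_scores = {"w1:b": [0.5, 0.25], "w1:a": [1.0, 0.5, 0.5], "w1:t": [0.25]}
next_segment = {"w1:b": "w1:a", "w1:a": "w1:t", "w1:t": None}

seg_rating = segment_ratings(match_scores)
dip_rating = diphone_ratings(seg_rating, next_segment)
```

Here the "link to the following segment" from the table is modeled as a plain dictionary lookup; the real table also stores the match locations, which this sketch leaves out.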

For each diphone, at most 5 examples (those with the best overall goodness ratings) are then chosen for the third pass. These examples are compared with each other in exactly the same manner as above (at the phoneme segment level), except that each segment is compared only with other segments of the same phoneme type that are among the diphone examples in the third-pass set. A new set of goodness ratings is determined for each diphone.
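The shortlisting step before the third pass amounts to a top-5 selection per diphone type. The sketch below shows one way to express it; the tuple layout, the names `shortlist` and `MAX_EXAMPLES`, and the assumption that a higher rating is better are all illustrative and not taken from the extractor itself.

```python
from collections import defaultdict

MAX_EXAMPLES = 5  # "at most 5 examples" of each diphone go to the third pass


def shortlist(candidates, max_examples=MAX_EXAMPLES):
    """Keep the best-rated examples of each diphone type.

    candidates is an iterable of (diphone_type, example_id, rating) tuples.
    Returns a dict mapping diphone_type to example ids, best first.
    """
    by_type = defaultdict(list)
    for diphone_type, example_id, rating in candidates:
        by_type[diphone_type].append((rating, example_id))
    return {
        diphone_type: [ex for _, ex in sorted(examples, reverse=True)[:max_examples]]
        for diphone_type, examples in by_type.items()
    }


# Toy data: seven examples of one diphone and a single example of another.
candidates = [("b-a", f"ex{i}", float(i)) for i in range(7)] + [("a-t", "ex7", 3.0)]
kept = shortlist(candidates)
```

A diphone with fewer than 5 recorded examples simply keeps all of them, which matches the "at most" wording.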

Based on these scores, the best-rated example of each diphone is extracted from its carrier word at its most inclusive boundaries. For instance, a diphone x1x2 consists of some part of x1 and some part of x2. x1 matches all other phonemes of type 1 in the final diphone set at various locations. Thus the first boundary of diphone x1x2 will be the location that includes all the matching locations x1 has with other phonemes of type 1, but no more. The extraction location in x2 is handled in the same way (diphone x1x2 will include all matching locations of x2 with other phonemes of type 2, but no more). A table of all the possible diphone matching locations is written to a file for use by the speech synthesizer.
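Read literally, "most inclusive boundaries" means taking the earliest matching location in x1 and the latest matching location in x2. The following sketch illustrates that reading; the function name and the use of millisecond offsets within the carrier word are assumptions made for the example.

```python
def inclusive_boundaries(x1_match_locations, x2_match_locations):
    """Return (start, end) for extracting diphone x1x2 from its carrier word.

    Each argument lists the locations (here, hypothetical millisecond offsets
    in the carrier word) at which that segment matched other segments of the
    same phoneme type in the final diphone set.
    """
    if not x1_match_locations or not x2_match_locations:
        raise ValueError("each segment needs at least one matching location")
    start = min(x1_match_locations)  # earliest match in x1: covers all, no more
    end = max(x2_match_locations)    # latest match in x2: covers all, no more
    return start, end


start, end = inclusive_boundaries([120, 135, 128], [210, 198, 224])
```

Cutting this wide lets the synthesizer later trim each diphone to whichever stored matching location suits the neighboring diphone.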

Weights
Currently, the goodness ratings used to choose both the best phoneme segment (and hence diphone example) and the best boundary locations are controlled by 3 weights for plosives and 4 weights for all other sounds. The 3 weights for plosives are:

  1. Weight of the magnitude of the burst
  2. Weight of the deviation from the average duration of the phoneme
  3. Weight of how close to 1 the combined fractions of the 2 phoneme examples being concatenated come.

For all other sounds, the 4 weights are:

  1. Weight of the goodness of the matching of the spectral parameters
  2. Weight of the closeness in pitch period duration
  3. Weight of the deviation from the average duration of the phoneme
  4. Weight of how close to 1 the combined fractions of the 2 phoneme examples being concatenated come.

There is also an all-or-nothing weight designed to strongly encourage 2 voiced phoneme examples to be matched only at voiced locations (thus guarding against mislabeled and imprecisely labeled phonemes).
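The description does not say how the weights are combined, so the sketch below assumes the simplest option, a linear weighted sum of per-term scores in which higher is better, with the all-or-nothing weight modeled as a large fixed penalty. The term names, weight values, and penalty size are hypothetical.

```python
def weighted_goodness(term_scores, weights, voicing_ok=True, voicing_penalty=1e6):
    """Combine per-term scores into one goodness value (higher is better).

    term_scores and weights are dicts keyed by term name. voicing_ok is False
    when two voiced examples would be joined at an unvoiced location; the
    large fixed penalty stands in for the all-or-nothing weight.
    """
    missing = set(weights) - set(term_scores)
    if missing:
        raise KeyError(f"no score for weighted terms: {sorted(missing)}")
    total = sum(weights[name] * term_scores[name] for name in weights)
    if not voicing_ok:
        total -= voicing_penalty
    return total


# Hypothetical weights and scores for a non-plosive match (four terms).
OTHER_WEIGHTS = {
    "spectral_match": 2.0,
    "pitch_period_closeness": 1.0,
    "duration_closeness": 0.5,
    "fraction_closeness": 0.5,
}
scores = {
    "spectral_match": 0.75,
    "pitch_period_closeness": 0.5,
    "duration_closeness": 1.0,
    "fraction_closeness": 0.5,
}

good = weighted_goodness(scores, OTHER_WEIGHTS)
bad = weighted_goodness(scores, OTHER_WEIGHTS, voicing_ok=False)
```

A plosive match would use the same function with a three-entry weight dictionary. Keeping the weights in a dictionary makes it easy to try different settings without touching the scoring code.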



Current Experiment

The current experiment was designed to determine the best type of carrier word for creating diphone inventories, and whether variable boundaries improve the quality of the inventories or make little difference.

Three different partial lists of carrier words were each recorded by 4 different speakers (2 male and 2 female), making a total of twelve recorded sets. The lists differed in the type of carrier word: nonsense words, 3-syllable words, and 7-syllable words. Each list was designed so that the diphone inventory resulting from it would be sufficient to synthesize 72 nonsense sentences.

Each of the carrier word sets was labeled phonetically using the Speech Lab's HMM labeler, and then the labels were adjusted by hand. Each set was also amplitude-adjusted to 90% of the maximum amplitude.

Then the automatic extractor was run twice on each of the carrier word sets. In the first run, the search space for the boundary location in the 2 phoneme segments being matched was limited to the center 1/100th of the total duration of each segment. Thus, the location where 2 phoneme segments matched was effectively restricted to the exact middle of each phoneme. In the second run, the search space was limited to the center 1/2 of the total duration of each segment. Thus, possible boundary matching locations were checked within the center 50% of each of the 2 phoneme segments being matched.
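The two search-space conditions can be pictured as a window centered on each phoneme segment. The sketch below computes that window and filters pitch marks to it. The function names and the millisecond units are hypothetical, and treating candidate boundaries as pitch period onsets is an assumption based on the pitch period files described earlier.

```python
def search_window(seg_start, seg_end, center_fraction):
    """Return (lo, hi) bounding the central part of a phoneme segment.

    center_fraction is the share of the segment's duration to search,
    for example 0.01 for the center 1/100th or 0.5 for the center 1/2.
    """
    if not 0.0 < center_fraction <= 1.0:
        raise ValueError("center_fraction must be in (0, 1]")
    duration = seg_end - seg_start
    midpoint = seg_start + duration / 2.0
    half_width = duration * center_fraction / 2.0
    return midpoint - half_width, midpoint + half_width


def candidate_boundaries(pitch_marks, seg_start, seg_end, center_fraction):
    """List the pitch marks that fall inside the central search window."""
    lo, hi = search_window(seg_start, seg_end, center_fraction)
    return [mark for mark in pitch_marks if lo <= mark <= hi]


# Toy segment from 100 ms to 200 ms with a pitch mark every 10 ms.
marks = list(range(100, 201, 10))
narrow = candidate_boundaries(marks, 100.0, 200.0, 0.01)
wide = candidate_boundaries(marks, 100.0, 200.0, 0.5)
```

With the narrow setting only a mark at or very near the midpoint survives, which is why that condition behaves like a fixed mid-phoneme boundary; the wide setting leaves several candidates for the matching step to choose among.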

Finally, the 72 nonsense sentences were synthesized, creating 72 x 3 x 4 x 2 = 1728 sentences (72 sentences, 3 carrier word types, 4 speakers, 2 boundary conditions). However, after reviewing the 7-syllable inventories, it became apparent that there were numerous mislabels and that correcting the labels would be tedious and time-consuming, making 7-syllable words impractical as carrier words for full diphone inventories. Previous diphone inventories created from longer carrier words had also been suboptimal because of the tendency for unstressed segments to be made extremely short, which makes it difficult to create intelligible speech from them. So it was decided to limit the comparison study to inventories created from the 3-syllable carrier words and the nonsense carrier words, or 72 x 2 x 4 x 2 = 1152 sentences.

Next Step
A somewhat formal comparison of the intelligibility and naturalness of the 2 sets under the 2 conditions is planned for the next few weeks.








This document was last updated April 7, 1998
Web Comments/Questions: yarringt@asel.udel.edu