Automatic Diphone Extraction

OBJECTIVE: Improve automatic diphone extraction techniques and provide software that device manufacturers and speech clinicians could use to quickly develop user-selected voices for the ASEL speech synthesizer.

Background

The automatic extractor takes a set of recorded carrier words and a set of corresponding files, each listing the locations of the onset of each pitch period in voiced speech and of 10 msec epochs in unvoiced speech. From these it creates a set of diphones with variable, context-dependent boundaries. The extractor assumes that the carrier words have been previously labeled with the appropriate phoneme labels.

The automatic extractor makes a first pass in which it creates a separate file of pitch-synchronous zcep coefficients for each carrier word and determines information such as the average duration of each phoneme in that specific inventory. During the next pass, each phoneme example is evaluated and a table is created that records where that example matches best with every other example of that phoneme, a number signifying the goodness of that match, a link to the previous phoneme example in the carrier word in which the current example is located, and a link to the following phoneme example in the same carrier word. The goodness ratings for all the matches with that example are added together to give an overall goodness rating for that example. The overall rating is then added to the overall rating of each adjacent phoneme example from the original carrier word in order to get an overall diphone rating for the previous-current phoneme example diphone and for the current-following phoneme example diphone.
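The rating bookkeeping in this pass can be sketched roughly as follows. This is only an illustration, not the extractor's actual code: the function names and the list-of-tuples layout are made up, list order stands in for the previous/following links, and it assumes the diphone rating is simply the sum of the two adjacent examples' overall ratings.

```python
from typing import List, Tuple

# One phoneme example: (phoneme label, goodness of its best match with each
# other example of the same phoneme). This layout is hypothetical.
PhonemeExample = Tuple[str, List[float]]


def overall_rating(match_goodness: List[float]) -> float:
    """Add up the per-match goodness values for one phoneme example."""
    return sum(match_goodness)


def diphone_ratings(word: List[PhonemeExample]) -> List[Tuple[str, str, float]]:
    """Rate every adjacent pair of phoneme examples in one carrier word.

    Assumption: the diphone rating is the sum of the overall ratings of
    the two phoneme examples that make up the diphone.
    """
    overall = [overall_rating(goodness) for _, goodness in word]
    ratings = []
    for i in range(len(word) - 1):
        left, right = word[i][0], word[i + 1][0]
        ratings.append((left, right, overall[i] + overall[i + 1]))
    return ratings
```

Each phoneme example in the middle of a carrier word contributes to two diphone ratings: once as the second half of the previous-current diphone and once as the first half of the current-following diphone.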

For each diphone, at most 5 examples (those with the best overall goodness ratings) are then chosen for the second pass. These diphones are compared to each other in exactly the same manner as above (on a phoneme-example level), except that each phoneme example is compared only with other examples of that phoneme that were among the diphone examples in the second pass set. A new set of goodness ratings is determined for each diphone.
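The selection of candidates for the second pass might look something like the sketch below. The function name, the tuple layout, and the example identifiers are hypothetical, and the sketch assumes that a higher rating means a better example, which the description above does not actually state.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def select_candidates(
    rated: Iterable[Tuple[str, str, float]],
    max_per_diphone: int = 5,
) -> Dict[str, List[str]]:
    """Keep at most `max_per_diphone` best-rated examples of each diphone.

    `rated` yields (diphone name, example id, overall diphone rating).
    Assumption: larger rating = better example.
    """
    by_diphone = defaultdict(list)
    for diphone, example_id, rating in rated:
        by_diphone[diphone].append((rating, example_id))
    return {
        diphone: [eid for _, eid in heapq.nlargest(max_per_diphone, items)]
        for diphone, items in by_diphone.items()
    }
```

If the ratings were really costs (lower is better), `heapq.nsmallest` would be the drop-in replacement.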

Based on these scores, the best rated example of each diphone is extracted from its carrier word at its most inclusive boundaries. For instance, a diphone x1x2 consists of some part of x1 and some part of x2. x1 matches with all other examples of phoneme 1 in the final diphone set at various locations. Thus the first boundary of diphone x1x2 will be located so that it includes all the matching locations x1 has with other examples of phoneme 1, but no more. The diphone extraction location in x2 will be handled in the same way (diphone x1x2 will include all matching locations of x2 with other examples of phoneme 2, but no more). A table of all the possible diphone matching locations is printed into a file for use by the speech synthesizer.
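As a rough sketch of the "most inclusive boundaries" idea: if the matching locations are expressed as times within the carrier word, then the diphone has to start at the earliest matching location in x1 and end at the latest matching location in x2. The function name and the use of times in seconds are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def inclusive_boundaries(
    x1_match_times: Sequence[float],
    x2_match_times: Sequence[float],
) -> Tuple[float, float]:
    """Return (start, end) of diphone x1x2 within its carrier word.

    The start is the earliest location at which x1 matches another
    example of its phoneme; the end is the latest location at which x2
    matches another example of its phoneme. Anything outside that span
    is never needed for a join, so it is left out.
    """
    if not x1_match_times or not x2_match_times:
        raise ValueError("each phoneme example needs at least one match")
    return min(x1_match_times), max(x2_match_times)


# x1 matches other examples at 120, 128 and 135 ms; x2 at 210, 225, 240 ms.
start, end = inclusive_boundaries([0.120, 0.135, 0.128], [0.210, 0.240, 0.225])
```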

Weights

Currently, the goodness ratings used both to choose the best phoneme (and hence diphone) and to choose the best boundary locations are controlled by 3 weights for plosives and 4 weights for all other sounds. The 3 weights for the plosives are:

  1. Weight of magnitude of burst
  2. Weight of the deviation from the average duration of the phoneme
  3. Weight of how close the 2 fractions of the 2 phoneme examples being concatenated come to adding up to 1.

For all other sounds, the 4 weights are:

  1. Weight of the goodness of the matching of the spectral parameters
  2. Weight of the closeness in pitch period duration
  3. Weight of the deviation from the average duration of the phoneme
  4. Weight of how close the 2 fractions of the 2 phoneme examples being concatenated come to adding up to 1.

There is also an all-or-nothing weight that was designed to strongly encourage 2 voiced phoneme examples to only be matched at voiced locations (thus guarding against mislabeled phonemes and imprecisely labeled phonemes).
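To make the role of the weights concrete, here is one way the 4 non-plosive weights and the all-or-nothing voicing weight could be combined. This is a guess at the general shape, not the extractor's formula: the names, the default values, the linear combination, and the choice to return a negated cost (so that higher means better) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Hypothetical weight settings for non-plosive sounds."""
    spectral: float = 1.0   # mismatch of the spectral parameters
    pitch: float = 1.0      # difference in pitch period duration
    duration: float = 1.0   # deviation from the average phoneme duration
    fraction: float = 1.0   # how far the 2 fractions are from summing to 1
    unvoiced_join: float = 1e6  # all-or-nothing penalty


def match_goodness(
    spectral_dist: float,
    pitch_period_diff: float,
    duration_dev: float,
    frac_a: float,
    frac_b: float,
    join_is_voiced: bool,
    w: Weights = Weights(),
) -> float:
    """Goodness of joining two voiced phoneme examples (higher is better)."""
    cost = (
        w.spectral * spectral_dist
        + w.pitch * abs(pitch_period_diff)
        + w.duration * abs(duration_dev)
        + w.fraction * abs(1.0 - (frac_a + frac_b))
    )
    if not join_is_voiced:
        # Assumed form of the all-or-nothing weight: swamp every other term.
        cost += w.unvoiced_join
    return -cost
```

For plosives, the spectral and pitch terms would presumably be replaced by a single burst magnitude term, leaving 3 weights.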



Informal Experiments with Weight Settings

After carefully evaluating the effects of these weights on the overall goodness of inventories produced, I've come to the conclusion that it pretty much doesn't make a whit of difference what you set those weights to. Yeah, increasing the weight of the spectral match may make one inventory sound better in one place, but it will also make it sound worse in another. And there's no consistency to what makes what sound better where. Now, I've got to admit I've largely been focusing on voiced sounds and especially spectral mismatches, because those errors are the most jarring to my ear. However, there is some evidence that the spectral mismatches don't really affect overall intelligibility all that much and I should be looking at the fricatives, affricates, and plosives a bit more carefully. I haven't played much with the plosive weights yet, although intuitively (and hence how it is set in the automatic extractor) the weight of the burst should be the most salient weight. If I get time sometime, I'll play with that.


Experiments with Different Carrier Word Types

Three different partial lists of carrier words were each recorded by 4 different speakers, for a total of 12 recorded sets. The lists differed in the type of word used: nonsense words, 3 syllable words, and 7 syllable words. The lists were each designed so that the diphone inventories resulting from them would be sufficient to synthesize 72 nonsense sentences. Two male and 2 female talkers recorded each of the different types of carrier word lists.

Next, each of the sets was labeled phonetically using the hmm labeler, and then the labels were hand corrected. Each set was also amplitude adjusted to 90% of the maximum amplitude using the aa program.

Then the automatic extractor was run twice on each of the carrier word sets. It was run once with the search space for the boundary location in the 2 phoneme examples being matched limited to the center 1/100th of the total duration of each of the phoneme examples. Thus, where 2 phoneme examples matched was pretty much restricted to the exact middle of each phoneme. The automatic extractor was then run with the search space for the boundary location limited to the center 1/2 of the total duration of each of the phoneme examples. Thus, in this case, the 2 phoneme examples could be joined anywhere within the center 50% of each of the examples.
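The two search space settings amount to the following window calculation. The function name and the use of milliseconds are just for illustration; the extractor itself may represent the search space differently.

```python
from typing import Tuple


def center_window(start: float, end: float, fraction: float) -> Tuple[float, float]:
    """Return the centered sub-interval covering `fraction` of [start, end].

    fraction = 0.01 gives the "center 1/100th" condition and
    fraction = 0.5 gives the "center 1/2" condition.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if end < start:
        raise ValueError("end must not precede start")
    duration = end - start
    middle = start + duration / 2.0
    half_width = duration * fraction / 2.0
    return middle - half_width, middle + half_width


# A phoneme example that runs from 0 to 100 msec:
narrow = center_window(0.0, 100.0, 0.01)  # join allowed only near 50 msec
wide = center_window(0.0, 100.0, 0.5)     # join allowed from 25 to 75 msec
```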

Finally, the 72 nonsense sentences were synthesized, thus creating 72 x 3 x 4 x 2 = 1728 sentences.

Results

Upon a quick informal listening, it became immediately apparent that the 3 sets of carrier words had not been labeled similarly, and that the 3 syllable and the 7 syllable word sets needed to be relabeled. After a quick pass over the 7 syllable word sets, it became equally apparent that no one was going to be able to hand-correct the phoneme labels in a complete inventory of 7 syllable carrier words without losing her sanity. So only the 3 syllable carrier word sets were relabeled and recorrected. The automatic extractor was then rerun under both conditions on all 4 of the 3 syllable sets and the sentences were regenerated.

A somewhat formal comparison of the 2 sets under the 2 conditions is planned (fairly soon), although informally I'd have to say that I think the sentences from the nonsense words sound slightly better than the sentences from the 3 syllable words, and that the sentences extracted with a 50% search space for the boundary sound just slightly better than the sentences extracted with a 1% search space.



Discussion and Future Directions

Overall, I'd have to say that if you've got a good carrier word inventory, you get good diphones for synthesis, and if you've got bad carrier words, speech synthesized from the resulting diphones isn't going to sound smooth and natural. I still need to do a more formal comparison of the nonsense vs. the 3 syllable carrier word inventories, and I should spend some time playing with the weights to see if certain consonants become clearer and more identifiable depending on the settings. Currently there is an informal comparison experiment on the web. I need to alter that so that the 3 syllable inventories are included, and so that random sentences are generated rather than the same 10 sentences for all 4 talkers for all 4 conditions (2 carrier word types, 2 automatic extractor settings). I will probably use that same experiment layout here at the lab, with other ASEL employees as subjects, for a quick comparison experiment. Once that's determined, we can then record a full set of carrier words and run the extractor on them to create a complete voice. At some point we may want to add an amplitude blending algorithm at the diphone boundaries in synthesized speech. Also, a technical report on how the extractor works is currently being written. Otherwise, I don't know how much more work we'll put into the extractor at this point.