OBJECTIVE: Improve automatic diphone extraction techniques and provide software that device manufacturers and speech clinicians could use to quickly develop user-selected voices for the ASEL speech synthesizer.
The automatic extractor makes a first pass in which it creates a separate file of pitch-synchronous zcep coefficients for each carrier word and determines information such as the average duration of each phoneme in that specific inventory. During the next pass, each phoneme example is evaluated and a table is created for it. The table records where that example matches best with every other example of that phoneme, a number signifying the goodness of that match, a link to the previous phoneme example in the carrier word that contains the current example, and a link to the following phoneme example in the same carrier word. The goodness ratings for all the matches with that example are added together to give an overall goodness rating for that example. The overall rating is then added to that of each adjacent phoneme example from the original carrier word to get an overall diphone rating for the previous-current phoneme example diphone and for the current-following phoneme example diphone.
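The rating arithmetic described above can be sketched roughly as follows. This is a minimal illustration, not the extractor's actual code: the function names, the dictionary layout, and the example identifiers are all assumptions made for the sketch, and it assumes each example's table holds one goodness number per other example of the same phoneme.

```python
from collections import defaultdict


def overall_ratings(match_goodness):
    """Sum the goodness of every match in each example's table.

    match_goodness maps (example, other_example) -> goodness number.
    This layout is hypothetical; the report only says such a table exists.
    """
    totals = defaultdict(float)
    for (example, _other), goodness in match_goodness.items():
        totals[example] += goodness
    return dict(totals)


def diphone_ratings(totals, following):
    """Add the overall ratings of adjacent phoneme examples.

    following maps each example to the next example in the same
    carrier word, or None at the end of the word (hypothetical layout).
    """
    ratings = {}
    for example, nxt in following.items():
        if nxt is not None:
            ratings[(example, nxt)] = totals.get(example, 0.0) + totals.get(nxt, 0.0)
    return ratings


# Made-up goodness numbers for two phonemes, "a" and "b".
matches = {
    ("a1", "a2"): 1.0,
    ("a1", "a3"): 2.0,
    ("a2", "a1"): 1.0,
    ("b1", "b2"): 0.5,
}
totals = overall_ratings(matches)
ratings = diphone_ratings(totals, {"a1": "b1", "b1": None})
```

Because every example appears once as "current", walking only the following-example links covers both the previous-current and the current-following diphones without counting any diphone twice.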
For each diphone, then, at most 5 examples (those with the best overall goodness ratings) are chosen for the second pass. These diphones are compared to each other in exactly the same manner as above (on a phoneme-example level), except that each phoneme example is compared only with other examples of that phoneme that were among the diphone examples in the second pass set. A new set of goodness ratings is determined for each diphone.
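The selection of at most 5 examples per diphone could look something like the sketch below. The report does not say whether a larger or a smaller goodness number counts as "best", so the direction is left as a parameter; that parameter, the function name, and the tuple layout are assumptions for illustration only.

```python
import heapq
from collections import defaultdict


def select_candidates(diphone_examples, max_per_diphone=5, higher_is_better=True):
    """Keep the best-rated examples of each diphone.

    diphone_examples is an iterable of (diphone_label, example_id, rating)
    tuples (a hypothetical layout). higher_is_better is an assumption:
    flip it if a smaller goodness number means a better match.
    """
    by_label = defaultdict(list)
    for label, example_id, rating in diphone_examples:
        by_label[label].append((rating, example_id))
    pick = heapq.nlargest if higher_is_better else heapq.nsmallest
    return {
        label: [example_id for _rating, example_id in pick(max_per_diphone, items)]
        for label, items in by_label.items()
    }


# Seven made-up examples of one diphone and two of another.
examples = [("a-b", f"e{i}", float(i)) for i in range(1, 8)]
examples += [("b-c", "f1", 2.0), ("b-c", "f2", 1.0)]
chosen = select_candidates(examples)
```

A diphone with fewer than 5 examples simply keeps all of them, which matches the "at most 5" wording.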
Based on these scores, the best-rated example of each diphone is extracted from its carrier word at its most inclusive boundaries. For instance, a diphone x1x2 consists of some part of x1 and x2. x1 matches with all other examples of 1 in the final diphone set at various locations. Thus the first boundary of diphone x1x2 will be located so that it includes all the matching locations x1 has with other 1's, but no more. The diphone extraction location in x2 will be handled in the same way (diphone x1x2 will include all matching locations of x2 with other examples of 2, but no more). A table of all the possible diphone matching locations is printed into a file for use by the speech synthesizer.
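One way to read the "most inclusive boundaries" rule is that the diphone starts at the earliest matching location in x1 and ends at the latest matching location in x2. The sketch below follows that reading; the function name and the assumption that all locations are times in seconds on the carrier word's own time axis are illustrative, not taken from the extractor.

```python
def inclusive_boundaries(x1_match_times, x2_match_times):
    """Return (start, end) of the diphone x1x2 within its carrier word.

    x1_match_times: locations in x1 where it matches other examples of 1.
    x2_match_times: locations in x2 where it matches other examples of 2.
    Both are assumed to be times in seconds within the carrier word.
    """
    if not x1_match_times or not x2_match_times:
        raise ValueError("each phoneme example needs at least one matching location")
    # The earliest match in x1 and the latest match in x2 cover every
    # matching location without extending past any of them.
    return min(x1_match_times), max(x2_match_times)


# Made-up matching locations for one diphone.
start, end = inclusive_boundaries([0.12, 0.10, 0.15], [0.31, 0.35, 0.33])
```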
Weights
Currently, the goodness ratings used both to choose the best phoneme (and hence
diphone) and to choose the best boundary locations are controlled by 3 weights
for plosives and 4 weights for all other sounds. The 3 weights for the plosives
are:
Next, each of the sets was labeled phonetically using the hmm labeler, and then the labels were hand-corrected. Each set was also amplitude-adjusted to 90% of the maximum amplitude using the aa program.
Then the automatic extractor was run twice on each of the carrier word sets. The first run limited the search space for the boundary location in the 2 phoneme examples being matched to the center 1/100th of the total duration of each phoneme example. Thus, the point where 2 phoneme examples matched was restricted almost exactly to the middle of each phoneme. The second run limited the search space for the boundary location to the center 1/2 of the total duration of each phoneme example. Thus, in this case, the 2 phoneme examples could be joined anywhere within the center 50% of each of the examples.
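The two search-space conditions amount to taking a window centered on the midpoint of each phoneme example. A minimal sketch, assuming phoneme boundaries are given as start and end times in milliseconds (the function name and units are illustrative, not from the extractor):

```python
def search_window(start, end, fraction):
    """Return the (low, high) limits of the centered search space.

    start, end: boundaries of one phoneme example (assumed milliseconds).
    fraction: share of the total duration to search, e.g. 0.01 or 0.5.
    """
    if end < start:
        raise ValueError("end must not precede start")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    mid = (start + end) / 2
    half_width = (end - start) * fraction / 2
    return mid - half_width, mid + half_width


# A made-up 200 ms phoneme example under the two conditions.
narrow = search_window(0.0, 200.0, 0.01)  # center 1/100th of the duration
wide = search_window(0.0, 200.0, 0.5)     # center 1/2 of the duration
```

For a 200 ms phoneme the narrow condition leaves only about 2 ms around the midpoint, while the wide condition allows a join anywhere in the middle 100 ms.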
Finally, the 72 nonsense sentences were synthesized, thus creating 72 x 3 x 4 x 2 = 1728 sentences.
Results
Upon a quick informal listening, it became immediately apparent that the 3 sets of
carrier words had not been labeled similarly, and that the 3 syllable and the 7
syllable word sets needed to be relabeled. After a quick pass over the 7 syllable
word sets, it became equally apparent that no one was going to be able to
hand-correct the phoneme labels in a complete inventory of 7 syllable carrier words
without losing her sanity. So only the 3 syllable carrier word sets were relabeled
and recorrected. The automatic extractor was then rerun under both conditions on
all 4 of the 3 syllable sets and the sentences were regenerated.
A somewhat formal comparison of the 2 sets under the 2 conditions is planned fairly soon. Informally, I think the sentences from the nonsense words sound slightly better than the sentences from the 3 syllable words, and the sentences extracted with a 50% search space for the boundary sound just slightly better than those extracted with a 1% search space.