Altering natural speech in any non-trivial way almost invariably degrades it. The present study compared two methods for altering F0 in natural speech, both of which are intended to generate minimal signal degradation. Overall, both methods were found to succeed about equally well when intelligibility was used as the measure of signal degradation. Despite the gross similarity in the effects of the two methods on speech intelligibility, several aspects of the results merit discussion.
First, this study was motivated by the observation that the HYBRID method produced speech that was subjectively preferable to that produced by the PSOLA method. The basis of this subjective impression may be related to the observation in the present study that PSOLA tended to produce both the best and the poorest speech in the experiment. In subjectively assessing speech quality, listeners may assign greater weight to the perceptible failures of a signal processing method than to the (effectively imperceptible) successes of the method. This is at least consistent with our results, which suggest that degradation from the HYBRID method may be less severe but more common than degradation due to PSOLA.
A second factor in comparing the two methods is stability across talkers. For the two talkers in the present study, we found that the adult's speech was more degraded by F0 shifts than was the child's speech. Note that overall, the adult's stimuli were more intelligible than those of the child talker, but they were also more adversely affected by shifting F0. This was true independent of processing method (although PSOLA produced slightly more degradation than HYBRID LPC for the adult talker) and independent of direction of F0 shift. While LPC-based methods are often cited as being heavily affected by individual talker differences, we see no evidence of this being a problem in the present case, since the HYBRID method was slightly less affected by talker differences than was the PSOLA method.
Finally, we consider the finding in the present study that PSOLA performed better in lowering F0, while the HYBRID method performed better in raising F0 (cf. Figure 2). Recall that, with the HYBRID method, F0 is lowered by appending zeroes onto the residual to extend each pitch period. Thus, only information in the filter coefficients is retained during the extended portions of pitch periods, and that information may be unavailable if the filter ringing decays rapidly (i.e., if the poles are relatively broad in bandwidth). By contrast, with the PSOLA method, pitch period length must be more than doubled before the extended portion of each pitch period contains only zeroes. In effect, PSOLA retains more information about the original speech signal in the extended portions of each pitch period. If this is the case, the HYBRID method could be improved by (a) increasing the filter order to retain more information about the speech signal, (b) increasing the effective length of a pitch frame by windowing in the way PSOLA does, or (c) a combination of the two.
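The dependence on pole bandwidth described above can be illustrated with a minimal sketch. It is not the implementation used in the study: a single two-pole resonator stands in for the full LPC filter, the residual for one pitch period is simplified to a single impulse, and all names and parameter values (`two_pole_response`, `extended_energy_fraction`, the 500 Hz resonance, the 8 kHz sampling rate, the period and padding lengths) are assumptions chosen for illustration only.

```python
import numpy as np

def two_pole_response(excitation, freq_hz, bandwidth_hz, fs):
    """Filter `excitation` through one all-pole resonator (hypothetical
    stand-in for an LPC synthesis filter)."""
    r = np.exp(-np.pi * bandwidth_hz / fs)   # pole radius set by bandwidth
    theta = 2.0 * np.pi * freq_hz / fs       # pole angle set by frequency
    a1, a2 = -2.0 * r * np.cos(theta), r * r
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        y[n] = excitation[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

def extended_energy_fraction(bandwidth_hz, period=80, pad=40,
                             freq_hz=500.0, fs=8000.0):
    """Fraction of output energy that falls in the zero-padded extension
    of one pitch period (assumed values, for illustration only)."""
    residual = np.zeros(period)
    residual[0] = 1.0                         # simplified one-impulse residual
    padded = np.concatenate([residual, np.zeros(pad)])  # lower F0 by padding
    y = two_pole_response(padded, freq_hz, bandwidth_hz, fs)
    return float(np.sum(y[period:] ** 2) / np.sum(y ** 2))

narrow = extended_energy_fraction(bandwidth_hz=50.0)
broad = extended_energy_fraction(bandwidth_hz=400.0)
print(f"narrow-bandwidth pole: {narrow:.3e}")
print(f"broad-bandwidth pole:  {broad:.3e}")
```

Under these assumptions, the broad-bandwidth resonator leaves far less signal energy in the appended portion of the pitch period than the narrow-bandwidth one, which is the situation in which the extended portion would carry little information about the original speech.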
While the HYBRID LPC method we have described is not better overall at preserving speech intelligibility than the PSOLA method, it may produce slightly better-sounding speech, and it seems to have advantages for some talkers and/or when F0 is increased. Future studies will examine possible reasons for the poorer performance of the HYBRID method when F0 is decreased.