Speech Research Lab: Speech Synthesis, Speech Recognition, and Speech Processing

Dysarthric Speech Database

Method

1. Dysarthric Speakers and Sentence Material

To examine segmental intelligibility, a list of words and associated foils was constructed in a manner similar to that described by Kent, Weismer, Kent, and Rosenbek (1989). Each word in the list (e.g., boat) was associated with a number of (usually) minimally different foils (e.g., vote, moat, goat). The word and its associated foils formed a closed response set from which listeners in a word identification task selected a response given a dysarthric production of the target word. However, unlike the test designed by Kent et al. (1989), the test words were embedded in short semantically anomalous sentences, with three test words per sentence (e.g., The boat is reaping the time). Note also that, unlike Kent et al. (1989), who used exclusively monosyllabic words, these test materials included disyllabic verbs in which the final consonant of the first syllable of the verb could be the phoneme of interest. That is, the /p/ of reaping could be tested with foils such as reading and reeking. The complete stimulus set consisted of 74 monosyllabic nouns and 37 disyllabic verbs embedded in sentences (two nouns and one verb per sentence). To counterbalance the effect of position within the sentence for the nouns, talkers recorded 74 sentences: the first 37 were randomly generated from the stimulus word list, and the second 37 were constructed by swapping the first and second nouns in each of the first 37 sentences.
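The counterbalancing scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_sentences` and `render`, the fixed random seed, and the placeholder word lists are hypothetical, and the sketch does not reproduce the procedure or the word list actually used to generate the recorded sentences.

```python
import random


def build_sentences(nouns, verbs, seed=0):
    """Return 2 * len(verbs) (noun, verb, noun) triples.

    The first half pairs randomly ordered nouns with randomly ordered verbs;
    the second half repeats the first half with the two nouns swapped.
    """
    if len(nouns) != 2 * len(verbs):
        raise ValueError("expected exactly two nouns per verb")
    rng = random.Random(seed)  # seed is an assumption, for repeatability only
    nouns, verbs = list(nouns), list(verbs)
    rng.shuffle(nouns)
    rng.shuffle(verbs)
    first_half = [
        (nouns[2 * i], verbs[i], nouns[2 * i + 1]) for i in range(len(verbs))
    ]
    second_half = [(n2, v, n1) for (n1, v, n2) in first_half]
    return first_half + second_half


def render(sentence):
    """Place a (noun, verb, noun) triple in the carrier frame."""
    first_noun, verb, second_noun = sentence
    return f"The {first_noun} is {verb} the {second_noun}"


# Placeholder stimuli (hypothetical): 74 nouns and 37 verbs.
nouns = [f"noun{i}" for i in range(74)]
verbs = [f"verb{i}ing" for i in range(37)]
sentences = build_sentences(nouns, verbs)
```

With this construction every noun occurs once in the first noun position and once in the second across the 74 sentences, which is the property the swap is meant to guarantee.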

The talkers were eleven young adult males with dysarthrias resulting from either cerebral palsy (CP) or head trauma. Seven of the talkers had CP: three had spastic CP with quadriplegia, two had athetoid CP (one quadriplegic), and two had mixed spastic and athetoid CP with quadriplegia. The remaining four talkers had sustained head trauma (one quadriplegic and one with spastic quadriparesis), with cognitive function ranging between Levels VI and VII on the Rancho Scale. The speech of one talker (head trauma, quadriplegic) was extremely unintelligible; it was therefore not marked at the phoneme level, and perceptual data were not collected for this talker.

The recording sessions were conducted in a wheelchair-accessible, sound-attenuated booth using a table-mounted Electro-Voice RE55 dynamic omnidirectional microphone connected to a Sony PCM-2500 digital audio tape recorder situated outside the recording booth. The talker was seated, typically in a wheelchair, next to the experimenter or speech pathologist, and approximately 12 inches from the microphone. Each session began with a brief examination by a speech pathologist, including administration of the Frenchay Dysarthria Assessment (Enderby, 1983). Following the assessment and a short break, the experimenter entered the recording booth to lead the talker through a series of recordings: the set of 74 semantically anomalous sentences described above, followed by two paragraphs of connected speech. The speech material was typed in large print on paper placed in front of the talker, and the talker was given some time to familiarize himself with it before the recording began. For the sentence material, each sentence was read first by the experimenter and then repeated by the talker. This assisted all talkers in pronouncing the words and was essential for some subjects with limited eyesight or literacy. On average, the entire recording session was completed in two and a half to three hours, including time for breaks.

The recorded sentences of both the dysarthric talker and the experimenter were later digitized, and the six words in each sentence were marked using a waveform/spectrogram display and editing program (Bunnell and Mohammed, 1992). Phonetic labeling was done using a DHMM labeling program (Menéndez-Pidal et al., 1996), and the automatically assigned labels for the dysarthric speech were later corrected by undergraduate students from the University of Delaware under close supervision of the authors.

2. Listeners and Identification Testing

A minimum of five normal-hearing listeners were recruited from students at the University of Delaware for listening tests with each of the dysarthric speakers. Listeners were seated in a sound-dampened room facing a touch-screen terminal and heard sentences presented binaurally over TDH-49 headphones at an average level of 72 dB SPL.

The sentences were presented either in their original form or in a time-warped version. The time-warped sentences were adjusted to match the timing of the corresponding sentence spoken by the experimenter as a prompt for the dysarthric talker (the experimenter's prompts are included in this database) and were typically about half the duration of the original speech. The mode of presentation was random within a set of trials, with the constraint that half of the presentations were in original mode and half were time-warped. The presentation order of the pre-recorded sentences was also randomized. The waveforms and perceptual data for the temporally adjusted speech are not included with this database.

At the start of each trial, the terminal screen was cleared and a new sentence frame appeared. Each of the three target word locations in the sentence contained a list of possible response words, from which listeners selected the word that they thought the talker was attempting to say. For instance, a sentence might appear as follows:

    FIN      SIPPING      BATH
The THIN  is SINNING  the BADGE
    SIN      SITTING      BATCH
    BIN      SIPPING      BASH
    PIN                   BASS
    INN

Thus, each target word was associated with several similar-sounding foils, and the listener had to pick the correct alternative from the list (depending on the target word, four to six alternatives were available). Subjects selected a response alternative by touching it on the CRT screen. To minimize possible list position effects in the response data, the order of response alternatives for each target word was randomized each time a sentence was presented.
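As a rough illustration of the closed-set response procedure, the sketch below shuffles a target word together with its foils on every presentation and scores the alternative that was touched. The names `response_list` and `score_touch` and the seed are hypothetical; the word set is the boat/vote/moat/goat example given earlier, and nothing here reflects the actual test program.

```python
import random


def response_list(target, foils, rng):
    """Return the target and its foils in a fresh random order."""
    options = [target, *foils]
    rng.shuffle(options)
    return options


def score_touch(options, touched_index, target):
    """Return True if the touched alternative is the target word."""
    return options[touched_index] == target


rng = random.Random(7)  # seed chosen arbitrarily for the example
options = response_list("boat", ["vote", "moat", "goat"], rng)
```

Because the list is reshuffled on each call, the target's position on the screen carries no information about the correct answer.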

There were two sets of 37 sentences from each talker. The first set contained one repetition of each word from the stimulus pool and the second set contained a second repetition of each word (with the initial and final nouns swapped in each sentence). Within a presentation block, all sentences in one or the other of the two sets were presented once in original form and once in time-warped form. Each set of sentences was presented 12 times to each of 5 listeners. Thus, there were a total of 60 presentations of each production of each word in each mode. Note, however, that the recorded material for talker SC had ten listeners, so the amount of data was doubled. Talker JF had two extra listeners with a total of 9 extra set-pair presentations between them. Each production for talker JF, therefore, was heard 69 times in each mode.
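The block structure and the resulting presentation counts can be checked with a short sketch. It assumes, as described above, that a block contains every sentence of one set once in each mode; the function names, the seed, and the use of sentence numbers 1 to 37 for the first set are illustrative assumptions.

```python
import random

MODES = ("original", "time-warped")


def make_block(sentence_ids, rng):
    """One presentation block: every sentence once per mode, in random order."""
    trials = [(sentence, mode) for sentence in sentence_ids for mode in MODES]
    rng.shuffle(trials)
    return trials


def presentations_per_mode(listeners, blocks_per_listener, extra_blocks=0):
    """Number of times each production is heard in a given mode."""
    return listeners * blocks_per_listener + extra_blocks


rng = random.Random(1)  # arbitrary seed for the example
block = make_block(range(1, 38), rng)  # hypothetical numbering of one set

standard = presentations_per_mode(5, 12)                 # most talkers
talker_sc = presentations_per_mode(10, 12)               # ten listeners
talker_jf = presentations_per_mode(5, 12, extra_blocks=9)  # nine extra set pairs
```

Building the block from all sentence and mode pairs satisfies the half-original, half-time-warped constraint by construction, so no separate balancing step is needed.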

Data were recorded using a program that kept track of the mode of presentation, the sentence, the word within the sentence, the correct alternative, the subject's response, and the response reaction time in milliseconds (up to 30 seconds).
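One way to picture a logged trial is as a small record holding the fields listed above. The sketch below is an assumption about representation only: the class `Trial`, its field names, the use of `None` for a missing response, and the `tally` helper are all hypothetical and do not describe the format written by the original program.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

TIMEOUT_MS = 30_000  # reaction times were recorded for up to 30 seconds


@dataclass(frozen=True)
class Trial:
    mode: str                 # "original" or "time-warped"
    sentence: int             # sentence number
    word_position: int        # position of the target word in the sentence
    correct: str              # the correct alternative
    response: Optional[str]   # None if the listener did not respond (assumed)
    rt_ms: Optional[int]      # reaction time in milliseconds, None if no response


def tally(trials):
    """Return (N, proportions) over the trials that received a response."""
    answered = [t for t in trials if t.response is not None]
    n = len(answered)
    if n == 0:
        return 0, {}
    counts = Counter(t.response for t in answered)
    return n, {word: count / n for word, count in counts.items()}


# Made-up trials for a single production of "boat".
trials = [
    Trial("original", 1, 1, "boat", "boat", 1500),
    Trial("original", 1, 1, "boat", "boat", 2100),
    Trial("original", 1, 1, "boat", "vote", 3000),
    Trial("original", 1, 1, "boat", "boat", 1800),
    Trial("original", 1, 1, "boat", None, None),
]
n, proportions = tally(trials)
```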

3. Perceptual Data

Only the perceptual data for the dysarthric speech in its original form are included with this database. For each talker, the data are collapsed across listeners and presentations. Except for talkers SC and JF, each line in a talker's data file represents the distribution of 60 responses for a single utterance (5 listeners times 12 presentations). For talkers SC and JF, the distributions represent 120 and 69 responses, respectively (see above). Note that N will occasionally be slightly lower because of "sleepers," that is, cases where there was no response or where the program timed out before the subject responded. Each line of data lists the talker, the sentence number, the word position within the sentence, the word, the target-phoneme position within the word (initial, final, intervocalic, or medial), the number of response alternatives, the target phoneme, and each of the foils in the response set. When the number of response alternatives is less than the maximum of 6, asterisks are used as fillers. Next is N, and then the proportion of correct responses for the target phoneme, followed by the proportion chosen for each of the foils (in the order in which they are listed). Each data file contains 222 lines (74 sentences times 3 words per sentence). Note, however, that the data file for talker SC contains two extra lines. This is because of an error in which the word "bat" in sentences 18 and 55 was sometimes presented with foils for an initial "b" and sometimes presented with foils for a final "t".
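A reader loading these files might parse each line along the following lines. This is a sketch under several assumptions that the description above does not settle: that fields are separated by whitespace, that the target and foils together always occupy six slots, and that fillers consist only of asterisks. The sample line, the talker code XX, and all names in the code are made up for illustration and are not taken from the database.

```python
from dataclasses import dataclass

MAX_ALTERNATIVES = 6  # target plus up to five foils


def is_filler(token):
    """True for padding tokens made up only of asterisks (assumed format)."""
    return bool(token) and set(token) == {"*"}


@dataclass
class WordResult:
    talker: str
    sentence: int
    word_position: int
    word: str
    phoneme_position: str
    n_alternatives: int
    target: str
    foils: list
    n_responses: int
    p_correct: float
    p_foils: dict


def parse_line(line):
    """Parse one whitespace-separated data line (field order as described above)."""
    tokens = line.split()
    head, rest = tokens[:6], tokens[6:]
    slots = rest[:MAX_ALTERNATIVES]
    target = slots[0]
    foils = [s for s in slots[1:] if not is_filler(s)]
    n_responses = int(rest[MAX_ALTERNATIVES])
    proportions = [
        float(p) for p in rest[MAX_ALTERNATIVES + 1:] if not is_filler(p)
    ]
    return WordResult(
        talker=head[0],
        sentence=int(head[1]),
        word_position=int(head[2]),
        word=head[3],
        phoneme_position=head[4],
        n_alternatives=int(head[5]),
        target=target,
        foils=foils,
        n_responses=n_responses,
        p_correct=proportions[0],
        p_foils=dict(zip(foils, proportions[1:])),
    )


# Hypothetical line, not real data from the database.
SAMPLE = "XX 1 1 boat initial 4 b v m g * * 60 0.50 0.25 0.15 0.10"
result = parse_line(SAMPLE)
```

If the actual files use a different delimiter or pad the proportion columns differently, only `parse_line` would need to change; the record layout follows the field order given in the text.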

References

Bunnell, H. T., and Mohammed, O. (1992). "EDWave - A PC-based Program for Interactive Graphical Display, Measurement and Editing of Speech and Other Signals." Software presented at the 66th Annual Meeting of the Linguistic Society of America.

Enderby, P.M. (1983). "Frenchay Dysarthria Assessment." College-Hill Press.

Kent, R.D., Weismer, G., Kent, J.F., and Rosenbek, J.C. (1989). "Toward Phonetic Intelligibility Testing in Dysarthria." Journal of Speech and Hearing Disorders, 54, 482-499.

Menéndez-Pidal, X., Polikoff, J.B., Peters, S.M., Leonzio, J.E., and Bunnell, H.T. (1996). "The Nemours Database of Dysarthric Speech." Proceedings of the Fourth International Conference on Spoken Language Processing, October 3-6, Philadelphia, PA, USA.