An Augmentative Communication Interface Based On Conversational Schemata

Peter B. Vanderheyden (vanderhe@asel.udel.edu)
Applied Science and Engineering Laboratories
Department of Computer and Information Sciences
University of Delaware / A. I. duPont Institute
Wilmington, DE 19899 USA

Many people with severe speech and motor impairments make use of augmentative and alternative communication (AAC) systems. These systems can employ a variety of techniques to organize stored words, phrases, and sentences, and to make them available to the user. It is argued in this paper that an AAC system should make better use of the regularities in an individual's conversational experiences and the expectations that an individual normally brings into a conversational context. An interface and methodology are proposed for organizing and retrieving sentences appropriate to a particular conversation context, possibly developed from earlier conversations. These conversations are represented according to the schema structures discussed by Schank (1982) as a model for memory and cognitive organization. The interface allows the user to proceed with minimal effort through conversations that follow the schema closely, and facilitates the derivation of new schemata when a conversation diverges from an earlier one. This interface is intended to operate in parallel with and to complement a user's existing electronic communication system. Investigations to consider the effectiveness of the interface and methodology are planned for the future.

keywords: augmentative communication, natural language processing, schemata, scripts
1 Introduction

The goal of designing a system for augmentative and alternative communication (AAC) is to facilitate the communication of people who have difficulty with speech, writing, and sign language. Speech synthesis and digitally-encoded recorded speech have made it possible to provide a person with a voice. For many people with severe speech and motor impairments, keyboards and many other computer input devices are difficult to use, and sentence production can be very slow. The design of the interface to the AAC system, therefore, can go a long way towards enabling the user to communicate with this new voice more effectively and with less effort.

Early AAC devices were simply boards or books containing symbols such as letters, words, and pictures. The person using the communication board pointed at a symbol on the board, and the person with whom they were conversing was responsible for identifying the symbols pointed at, and interpreting their meaning.
The burden of interpretation was laid on this second person, who was also given the power to control the conversation and to manage the topic and turns.

Computerized AAC systems provide the augmented communicator with a voice, but are also able to organize words and sentences so that they are more easily accessible. Some systems apply natural language techniques in order to predict the user's next word or to fill in missing words, making use of lexical, syntactic, and semantic information within the current sentence (Demasco & McCoy, 1992). In this way, a well-formed sentence might be produced with less time and effort. An AAC system should also take the greater conversational context into consideration, and this paper develops and discusses one approach for doing so in order to further facilitate augmented communication.

2 Schemata

2.1 Schemata for stories

When listening to a friend describe the day's events and hearing that he picked up his clothes from the laundromat, one would generally assume that the clothes had been cleaned and that the friend had paid before leaving the laundromat with them. How is it that we can infer these details if they were never explicitly stated? Schank and Abelson (1977) offered one explanation: as a result of picking up clothes at the laundromat many times, we have developed a mental script representing the typical sequence of events involved in visiting the laundromat, and in that typical sequence the clothes were cleaned and we paid for them before leaving with them. Scripts are built up during the course of an individual's life, as a result of the individual's experiences, perceptions, and interpretations of those experiences.

Processing information by incorporating it into scripts has a number of advantages. The sheer magnitude of the information is reduced, because the typical sequence of events is stored only once rather than separately for every occurrence that follows it.
Only when an exception occurs, when, for example, we notice that the spot on a favorite shirt has not been removed, do we store the details of a new event. By representing the typical sequence of events for a given situation, scripts also provide a means of inferring actions that have not yet taken place or have not been explicitly stated. A script for a familiar situation can also be abstracted and used to provide initial expectations in a related but novel situation: if we take our clothes to a laundromat that we have never visited before, we can generalize our experiences in the familiar laundromat to apply to this new one.

Schank (1982; informally in 1990) extended and modified the idea of scripts into a hierarchy of schema structures called MOPs (memory organization packets). Continuing with the laundromat example and taking the abstraction a level or two higher, in any novel situation where we are the customer, we would expect to pay for services rendered; in Schank's system, a metaMOP would represent this general customer-in-a-store situation. MetaMOPs represent high-order goals. This metaMOP would contain separate MOPs for more specific instances of this goal, such as picking up clothes from the laundromat or picking up the car from the mechanic. MOPs can themselves be hierarchical, so a general laundromat MOP can be associated with any number of MOPs for specific laundromats.

Each MOP contains scenes, or groups of actions that occur within the MOP. The MOP for picking up clothes at the laundromat might include an entrance scene, a scene for getting the clothes, a scene for paying, and so on. Each scene has associated with it any number of scripts, where a script contains the actual actions that have taken place. One script for the entrance scene, for example, may include opening the door, walking in, and greeting the shopkeeper.
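This MOP/scene/script hierarchy can be sketched in a few lines of Python. This is an illustrative sketch only; the class names, field names, and the specific laundromat instance are assumptions of this example, not part of Schank's implementations.

```python
from dataclasses import dataclass, field

@dataclass
class Script:
    """Concrete actions that have actually taken place in a scene."""
    actions: list

@dataclass
class Scene:
    """A group of actions within a MOP, e.g. 'entrance' or 'pay'."""
    name: str
    scripts: list = field(default_factory=list)

@dataclass
class MOP:
    """A memory organization packet; may inherit scenes from a parent MOP."""
    name: str
    parent: "MOP" = None
    own_scenes: list = field(default_factory=list)

    def scenes(self):
        # A MOP with no scenes of its own inherits them from its parent,
        # rather than storing a redundant, identical copy.
        return self.own_scenes or (self.parent.scenes() if self.parent else [])

# The general laundromat MOP defines the typical scene sequence ...
laundromat = MOP("picking up clothes at the laundromat", own_scenes=[
    Scene("entrance", [Script(["open the door", "walk in", "greet the shopkeeper"])]),
    Scene("get the clothes"),
    Scene("pay"),
])
# ... and a MOP for one specific (hypothetical) laundromat inherits it.
corner = MOP("the laundromat on the corner", parent=laundromat)
assert [s.name for s in corner.scenes()] == ["entrance", "get the clothes", "pay"]
```

Storing the scene sequence once in the general MOP, and letting specific MOPs reach it through inheritance, mirrors the storage-saving property of scripts described above.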
To demonstrate the use of schemata in understanding stories and answering questions, Schank (1982) described a number of computer experiments, including CYRUS. CYRUS contained databases of information about two former Secretaries of State, integrated new information into these databases, and provided answers to questions such as "Have you been to Europe recently?" and "Why did you go there?" Miikkulainen (1993) developed DISCERN, a computer program that built schemata from input text, represented the schemata subsymbolically (in terms of features and probabilities, rather than words), and answered questions based on the input texts.

2.2 Schemata for conversation

Conversations can be described in terms of the intention and form of each utterance, and the overall structure in which the utterances of the participants occur. The question-answering systems of Schank and Miikkulainen demonstrate that the literal, or locutionary, exchange of information that is one component of conversation can be described in terms of schemata.

Kellermann et al. (Kellermann, Broetzmann, Lim, and Kitao, 1989) described conversations between undergraduate students meeting for the first time as a MOP, and identified 24 scenes. These conversations appeared to have three phases: initiation, maintenance, and termination. Scenes in the initiation phase, for example, included exchanging greetings, introducing themselves, and discussing their current surroundings. Scenes tended to be weakly ordered within each phase, but strongly ordered between phases, so that a person rarely entered a scene in an earlier phase from a scene in a later phase. A number of scenes involved what the investigators called subroutines, or common sequences of generalized acts: get facts, discuss facts, evaluate facts, and so on.
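The ordering constraint that Kellermann et al. observed, weak ordering within phases but strong ordering between them, can be expressed as a simple transition check. This is a hedged sketch: the scene names below are only a small sample of the 24 scenes they identified, and the function is invented for illustration.

```python
# Phases occur in this order; scenes within a phase are only weakly ordered,
# but speakers rarely move back to a scene in an earlier phase.
PHASES = ["initiation", "maintenance", "termination"]

SCENE_PHASE = {
    "exchange greetings":    "initiation",
    "introduce selves":      "initiation",
    "discuss surroundings":  "initiation",
    "get facts":             "maintenance",
    "discuss facts":         "maintenance",
    "evaluate facts":        "maintenance",
    "say farewells":         "termination",
}

def expected_transition(current_scene, next_scene):
    """A move between scenes is expected unless it returns to an earlier phase."""
    cur = PHASES.index(SCENE_PHASE[current_scene])
    nxt = PHASES.index(SCENE_PHASE[next_scene])
    return nxt >= cur

assert expected_transition("exchange greetings", "get facts")      # forward: common
assert expected_transition("get facts", "discuss facts")           # same phase: fine
assert not expected_transition("say farewells", "introduce selves")  # backward: rare
```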
JUDIS (Turner and Cullingford, 1989) is a natural language interface for Julia (Cullingford and Kolodner, 1986), an interactive system that played the part of a caterer's assistant and helped the user plan a meal. JUDIS operated on goals, with each goal represented by a MOP containing the characters (caterer and customer), scenes (either mandatory or optional), and the sequence of events. Higher-level MOPs handled higher-level goals, such as the goal of getting information, while lower-level MOPs handled lower-level goals, such as answering yes-no questions. JUDIS recognized that the person with whom it was interacting had goals of their own, and tried to model that person's goals on the basis of their utterances. It also recognized that several MOPs could contain the same scene, and that several scenes could contain the same utterance. Only one MOP executed at a time, but other MOPs that were consistent with the current state of the conversation were activated as well.

3 Augmentative Communication Systems

An augmentative communication system must address the abilities and needs of the individual, and the context in which the system will be used.

3.1 Word-based, sentence-based, and letter-based systems

In a word-based interface, the user selects individual words and word-endings. Such systems offer a great deal of flexibility: the user can produce any sentence for which the vocabulary is available, and is in complete control of the sentence's content, length, and form. However, word-based sentence production relies heavily on manual dexterity or access rate, and on the individual's linguistic and cognitive abilities. An individual who can select only one item per minute will either produce very short sentences or leave long gaps in a conversation while selecting the words. Such a system may not be suited to an individual who has difficulty generating well-formed or appropriate sentences (Elder and Goossens', 1994).
Sentence-based systems allow an individual to utter an entire sentence by selecting a single key sequence, resulting in much faster sentence production. The sentence that is produced can be prepared to be long or short, and linguistically well-formed, thus overcoming some of the difficulties of word-based systems. However, strict sentence-based systems have shortcomings of their own. The user is limited to the often small number of sentences prestored in the system. These sentences are syntactically correct, but cannot be modified to be appropriate in a given semantic or pragmatic context. As well, the user can incur additional cognitive load if the interface design makes the task of locating and retrieving sentences non-trivial (Baker, 1982).

A third class of systems is letter-based, requiring the user to enter words letter by letter. Letter-based systems can have many of the strengths and weaknesses of word-based systems. Letter-based input is flexible, potentially removing even the constraints imposed by system vocabulary limits. However, the demands of entering each letter can be even greater than the demands of entering whole words. This is one reason why some letter-based systems attempt to predict the word as it is being entered, reducing some demands on the user but possibly introducing others (Koester & Levine, 1994).

Of course, systems need not be exclusively letter-based, word-based, or sentence-based. On the Liberator from the Prentke Romich Corporation, for example, a user can map an icon key sequence to a word, a phrase, or an entire sentence. Templates can be set up containing a phrase or sentence with gaps to be filled by the user at the time of utterance.

3.2 Conversational considerations

CHAT (Alm, Newell, and Arnott, 1987) was a prototype communication system that recognized general conversational structure.
A model conversation would begin with greetings, then move on to smalltalk and the body of the conversation, and finally to wrap-up remarks and farewells. Often, the exact words we use for a greeting or a farewell are not as important as the fact that we say something. A person using CHAT could select the mood and the name of the person with whom they were about to speak, and have CHAT automatically generate an appropriate utterance for that stage of the conversation. An utterance was chosen randomly from a list of alternatives for each stage. Similarly, while pausing in our speech to think, or while listening to the other participant in a conversation, it is customary to occasionally fill these gaps with some word or phrase. CHAT could select and utter such fillers on demand.

To assist the augmented communicator during the less predictable main body of the conversation, a database management system and interface called TOPIC (Alm et al., 1989) was developed. The user's utterances were recorded by the system, and identified by their speech acts, subject keywords, and frequencies of use. When the user selected a topic from the database, the system suggested possible utterances using an algorithm that considered the current semantics of the conversation, the subject keywords associated with entries in the database, and the frequency with which entries were accessed. The possibility of allowing the user to follow scripts was also considered.

These systems offered the user an interface into a database of possible utterances, drawn either from fixed lists specific to several different positions in the conversation (CHAT), or from sentences reused from previous conversations and organized by semantic links (TOPIC). Once the conversation had entered the main body phase, however, there was no representation of the temporal organization of utterances. As well, topics were linked in a relatively arbitrary net, rather than organized hierarchically.
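CHAT's handling of the predictable stages can be illustrated as random selection from stage- and mood-specific phrase lists. This is a minimal sketch of the idea only; the phrases, dictionary layout, and function name are invented for illustration and are not taken from CHAT.

```python
import random

# Illustrative phrase lists, keyed by conversational stage and mood.
UTTERANCES = {
    ("greeting", "cheerful"): ["Hi {name}, great to see you!",
                               "Hello {name}, how are things?"],
    ("farewell", "cheerful"): ["Bye {name}, take care!",
                               "See you soon, {name}."],
    ("filler", "neutral"):    ["Hmm, let me think.",
                               "Just a moment."],
}

def speak(stage, mood, name=""):
    """Pick one alternative at random for the current stage and mood."""
    phrase = random.choice(UTTERANCES[(stage, mood)])
    return phrase.format(name=name)

# Selecting a mood and a partner's name yields a stage-appropriate utterance.
greeting = speak("greeting", "cheerful", name="Alex")
```

Because what matters at these stages is that something appropriate is said, rather than the exact wording, random choice among alternatives also keeps repeated greetings from sounding mechanical.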
The body of a conversation is not always as difficult to predict as these systems may lead one to believe. There are many contexts in which conversations can proceed more or less according to expectations. For example, when ordering lunch at a favorite restaurant or when dropping off or picking up a laundry order, conversational exchanges are often quite standardized.

4 A Schema-based AAC Interface

The interface proposed in this section represents conversations according to schemata, in order to take advantage of their predictable structure. The conversation is broken down into smaller and smaller substructures that are constructed to correspond well with a person's own intuitive representation. When a conversation proceeds in the expected manner, the interface follows along with it and displays appropriate and complete sentences from which the user may select an utterance. Sentence and phrase templates can be set up to provide some flexibility with very little effort demanded of the user. The interface provides the speed of access inherent in many sentence-based systems, but with greater flexibility. Combining these advantages with the user's regular AAC system, the resulting system seems well adapted to facilitating conversation.

4.1 The schema framework

The initial schemata in the interface are developed a priori. In the evaluation studies described in a later section, these schemata will be developed in consultation with users. A separate MOP is planned out for each conversational context that is likely to recur frequently, and for which there are reasonably well-developed expectations. These MOPs are then grouped by similar goals, and a higher-level MOP is developed by generalizing among the members of each group. For example, MOPs for going to McDonald's, Burger King, and the lunchroom cafeteria might all be grouped under the "eating at a self-serve restaurant" MOP.
Going to other restaurants might fall under the "eating at a waiter-served restaurant" MOP, and together these two MOPs would be contained in the more general "eating at a restaurant" MOP. A metaMOP is defined here as any category of activities that spans several MOPs. For example, the "going out to eat" metaMOP might contain MOPs for choosing where to go eat, travelling to a restaurant, eating at a restaurant, and then returning home. A receptionist's metaMOP for "dealing with a new employee" might contain MOPs for the introductory conversation, a description of the company, introductions to several other employees, and showing the new employee to their office.

Each MOP contains an ordered sequence of scenes, each scene contains a list of (currently no more than one) scripts, and each script contains a partially ordered set of sentences. A sentence may contain slots as well as words, and each slot is associated with a group of fillers. When the user selects a sentence with slots, the appropriate list of fillers is displayed, and any one of these can be used to fill the slot. The user can also choose to enter a slot filler, or the entire sentence for that matter, using their regular existing AAC system.

Sentence templates containing slots and fillers provide a convenient means of producing a wide range of sentences of similar form. For example, the sentences "I'd like a shake and an order of fries" and "I'd like a Big Mac and a root beer" could be produced by selecting a template "I'd like a ____" and the fillers 'shake' and 'order of fries' or 'Big Mac' and 'root beer'. (The conjunction 'and' is inserted automatically when multiple fillers are selected for a single slot.) Only three selections are required to produce either sentence, rather than nine selections to enter the words separately (or six, if 'order of fries' counts as one word). The templates also capture the intuitive similarities in the form and function of the sentences.
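The slot-and-filler mechanism, including the automatic insertion of 'and', can be sketched as follows. This is an illustrative simplification: the function name is invented here, and any article a filler needs (such as the "an" in "an order of fries") is assumed to be part of the filler text itself.

```python
def fill_template(template, fillers, slot="____"):
    """Fill the template's slot with one or more fillers,
    inserting the conjunction 'and' between multiple fillers."""
    text = " and ".join(fillers)
    return template.replace(slot, text)

# Three selections (one template plus two fillers) produce a nine-word sentence.
assert fill_template("I'd like a ____", ["shake", "an order of fries"]) == \
    "I'd like a shake and an order of fries"
assert fill_template("I'd like a ____", ["Big Mac", "a root beer"]) == \
    "I'd like a Big Mac and a root beer"
```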
The hierarchical structure of this representation quite naturally leads to inheritance of properties by lower-level schemata. If "eating at McDonald's" involves the same sequence of scenes as the more general "eating at a fast food restaurant", for example, it would access these scenes by inheritance from the more general MOP rather than containing a redundant and identical copy of them. The McDonald's MOP could still differ from the more general MOP by having its own set of fillers for slots (such as the list of food items that can be ordered) and its own set of scripts and sentences.

Inheritance also provides a mechanism for supplying schemata for contexts that are novel but similar in some respects to existing ones. When the user enters a new fast food restaurant for the first time, a MOP for a restaurant serving similar food could be selected if one exists; if not, the MOP for the general fast food restaurant could be selected. Even in a restaurant that the user has never been in before, the interface is thus able to provide an appropriately organized set of sentences and sentence templates.

4.2 Sequential organization

The sentences in a MOP are presented to the user according to the order of the scenes in the MOP. As a scene begins, the first sentence in the scene's script is highlighted. When the scene is completed, the sentences it contains scroll out of sight, and the sentences for the next scene are displayed. In this way, the interface keeps pace with the conversation, and minimizes the need to search for the next desired sentence. A conversation may advance to the next (or any other) scene at any time by scrolling to and selecting a sentence in that scene.

The simplest method for participating in a conversation requires the user to access only two keys.
The user confirms with one key that the highlighted sentence should be used, and the interface utters this sentence and highlights the next one in the current scene. This cycle repeats until the user advances to the next scene with the second key, at which point the first sentence of that scene is highlighted. (At the very least, a third key is needed if sentences in a scene are to be uttered out of order.)

For the purpose of this two-key operation, scenes are assumed to be strongly ordered within a conversation MOP. If scene B follows scene A in the MOP, selecting a sentence in scene B (or any later scene) indicates that scene A has been completed. In many cases this appears to be a reasonable generalization, though perhaps not always. One generally greets a person at the beginning of a conversation, for example, but it may also happen that the greeting is made after some initial exchange. To return to a previous scene, the user can either select that scene directly, or cycle through the remaining scenes by repeatedly selecting the "next scene" key until the first scene comes into focus once more.

5 Future Work

5.1 Preliminary investigations

The goal of this interface is to facilitate an augmented communicator's participation in conversations. A preliminary investigation will ask several people who have AAC systems to comment on the effectiveness of such an interface after using it for a few days. Each AAC user will participate in developing the schema hierarchy for their own interface. In an iterative fashion, the author will guide each user in developing a small number of simple preliminary schemata. The users will then employ these schemata in conversations with their regular communication devices. The author and the user together will review interactions recorded by the communication device, and enhance the existing schemata or develop new ones. This cycle will be repeated until the schemata are developed to the user's satisfaction.
A schema development program is planned for the future, to allow users to continue refining and adding schemata to the interface on their own. This interface is intended to be applicable to all AAC users and all conversational contexts. For this reason, it is hoped that a diverse group of people will be able to participate in this preliminary investigation. Of particular interest will be people who use their AAC systems in the context of their employment.

5.2 AAC users with developmental delays

The interface and methodology developed in this paper have many features in common with strategies for training developmentally delayed adolescents and adults to use their augmentative communication systems (Elder and Goossens', 1994). Elder and Goossens' discussed the conversational contexts of domestic living, vocational training opportunities, leisure/recreation, and community living. They developed an activity-based communication training curriculum, in which students are taught context-appropriate communication in the process of performing the relevant activity with the instructor or with another student. A script is generated for each activity, and is represented by an overlay to be placed over the individual's AAC system. The authors emphasized the importance of a concentrated message set, meaning that all of the words, phrases, or sentences required to complete an activity should appear on a single overlay. Activities that have a logical sequence to them are advantageous, because one event in the activity can act as a cue to recall the next event, and it is impossible to successfully complete the activity in any but the correct order. Supplemental symbols could be added off to the side of an overlay for, in one example, specific food types in a "making dinner" script.

These similarities suggest that the schema-based interface developed in this paper may be an effective aid in communication training.
The groups of activities described by Elder and Goossens' are all good candidates for a conversational schema hierarchy. The script-based overlays are analogous to the scripts contained within a conversation MOP, and the supplemental symbols are similar to the filler sets for sentence slots.

5.3 Possible extensions to the interface

AAC systems that attempt to predict words on the basis of the initial letter(s) selected by the user in a domain-nonspecific context may have a very large vocabulary to consider for each word in a sentence. A similar problem of scale can face systems that attempt to complete partial or telegraphic sentences. A schema-based interface makes use of the current MOP and the current position within the MOP to define a specific conversational domain. This domain could serve to constrain or prioritize the vocabulary and semantics that the system would need to consider, and so reduce the time needed to process the sentence.

The network of MOPs and their substructures must currently be constructed by the investigator, in consultation with the user. Determining which contexts should be represented is obviously a highly subjective issue that reflects the individuality of one's experiences. It would be preferable to develop a means by which users could construct their own hierarchy of schemata. Better still would be a dynamic system that could store sentences as they were produced during a conversation, and whose schemata could be created and updated interactively by the user.

6 Summary

An interface for augmentative communication systems is proposed that makes the expected content of a conversation available to the user. This can facilitate interaction in predictable situations by reducing the need to produce common utterances from scratch. A methodology is described for organizing conversations in a variety of contexts according to hierarchical schema structures.
At the highest level, complex goals are represented by metaMOPs, and more specific goals by MOPs. Each MOP contains a list of scenes in the order in which they are expected to occur in the conversation. Each scene contains sentences from which the AAC user can choose. Sentences can be complete, or in the form of templates containing slots to be filled in as needed. This interface makes it possible to participate in a conversation using only two keys if the conversation fits a MOP closely, but it is in general intended to be used together with an individual's regular AAC system.

7 References

Alm, N., Newell, A. F., and Arnott, J. L. (1987) A Communication Aid Which Models Conversational Patterns. In Proceedings of the RESNA 10th Annual Conference (pp. 127-129). San Jose, CA.

Alm, N., Newell, A. F., and Arnott, J. L. (1989) Database Design For Storing and Accessing Personal Conversational Material. In Proceedings of the RESNA 12th Annual Conference (pp. 147-148). New Orleans, LA.

Baker, B. (1982) Minspeak. Byte, 186-202.

Cullingford, R. E., and Kolodner, J. L. (1986) Interactive advice giving. In Proceedings of the 1986 IEEE International Conference on Systems, Man and Cybernetics (pp. 709-714). Atlanta, GA.

Demasco, P. W., and McCoy, K. F. (1992) Generating text from compressed input: An intelligent interface for people with severe motor impairments. Communications of the ACM, 35(5), 68-78.

Elder, P. S., and Goossens', C. (1994) Engineering Training Environments for Interactive Augmentative Communication: Strategies for adolescents and adults who are moderately/severely developmentally delayed. Southeast Augmentative Communication Conference Publications Clinician Series: Birmingham, AL.

Kellermann, K., Broetzmann, S., Lim, T.-S., and Kitao, K. (1989) The Conversation Mop: Scenes in the stream of discourse. Discourse Processes, 12, 27-61.

Koester, H. H., and Levine, S. P. (1994)
Quantitative indicators of cognitive load during use of a word prediction system. In Proceedings of the RESNA '94 Annual Conference (pp. 118-120). Nashville, TN.

Miikkulainen, R. (1993) Subsymbolic Natural Language Processing: An integrated model of scripts, lexicon, and memory. MIT Press: Cambridge, MA.

Schank, R. C., and Abelson, R. P. (1977) Scripts, plans, goals and understanding: An inquiry into human knowledge structures. Erlbaum: Hillsdale, NJ.

Schank, R. C. (1982) Dynamic Memory: A theory of reminding and learning in computers and people. Cambridge University Press: NY.

Schank, R. C. (1990) Tell Me A Story: A new look at real and artificial memory. Charles Scribner's Sons: NY.

Turner, E. H., and Cullingford, R. E. (1989) Using Conversation MOPs in Natural Language Interfaces. Discourse Processes, 12, 63-90.

8 Acknowledgments

This work has been supported by a Rehabilitation Engineering Research Center Grant from the National Institute on Disability and Rehabilitation Research (#H133E30010). Additional support has been provided by the Nemours Foundation.