A SOFTWARE ENGINEERING APPROACH TO DEVELOPING AN OBJECT-ORIENTED LEXICAL ACCESS DATABASE AND SEMANTIC REASONING MODULE

by

Wendy Mair Zickus

A thesis submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Master of Science in Computer and Information Sciences

Spring 1995

Copyright 1995 by Wendy Mair Zickus
All Rights Reserved

Acknowledgments

This work has been supported by a Rehabilitation Engineering Research Center Grant from the National Institute on Disability and Rehabilitation Research of the U.S. Department of Education (#H133E30010). Additional support has been provided by the Nemours Research Programs.

I would like to thank my mother, Mary G. Mair, for being a surrogate mother to my daughter, Rebecca, while I was working on my graduate studies. I would also like to thank both her and my father, Robert D. Mair, for being supportive of my work over the past few years.

To my advisors, Kathleen F. McCoy and Patrick W. Demasco, I am indebted for their guidance, patience, and faith in me. When I started working with them in the Natural Language Processing Lab at the Center for Applied Science and Engineering, located at the A. I. duPont Institute, I was a new graduate student in the Computer Science Department interested in Object-Oriented Programming. Now I am an Object-Oriented Natural Language Processing programmer with some expertise in Computational Lexical Semantics.

I want to thank my daughter, Rebecca, for understanding all of the times when I could not be there due to projects and academic obligations, and especially for allowing me to help her celebrate her birthday in June instead of May due to finals.

And last, but definitely not least, I'd like to thank my husband, Timothy E. Zickus, for his love, support, and seemingly endless editing of this thesis. His pride and belief in my abilities and capabilities were of immeasurable importance to the completion of my research work at the University of Delaware.

TABLE OF CONTENTS

LIST OF FIGURES
ABSTRACT

Chapter

1 INTRODUCTION

2 RELATED RESEARCH
  Motivation
  Application Domain: Augmentative and Alternative Communication
    AAC Language Technology Overview
    Some Natural Language Processing (NLP) used in AAC Applications
    Need for Word Knowledge in NLP based AAC Systems
    Compansion
  Computational Lexical Semantic Overview
  Dictionaries and Lexicons
    Paper-Based
    On-Line
    WordNet
    Efforts in Merging Lexical Resources

3 LAD DESIGN
  Motivation
  Functional Interface
    Argument Structure
    Functions for Semantic Knowledge
    Functions for Syntactic Knowledge
    Functions for Other Specialized Knowledge
  LAD Resources
    WordNet
    Case Frames
    Morphology
    Phonetic Information
    Syllabification Information
    Frequency Information
  LAD Mapping
    Semantic Categories
    Queries Requiring Multiple Databases
    Complicated Queries
  Application Areas
  Object-Oriented Design
  Extensibility
  Implementation Status

4 SEMANTIC REASONING MODULE
  Motivation
  Case Frames and Semantic Reasoning
    Semantic Categories
    Preferences
  SRM and LAD (Knowledge Representations)
    Secondary Verb Frames vs. WordNet Verb Frames
    Semantic Output
  SRM Processing

5 CONCLUSIONS
  Accomplishments
  LAD Implementation Status
  WordNet Usage
  SRM and LAD Results
  SRM Performance vs. Compansion Performance
  Discussion
  Future Work

BIBLIOGRAPHY

Appendix
A LAD SOURCE DESCRIPTION
B SRM SOURCE DESCRIPTION
C SEMANTIC CATEGORIES
D WORDNET VERB FRAMES

LIST OF FIGURES

Figure 2.1 Specified Case Frame for the Verb BREAK
Figure 2.2 List of 25 Unique Beginners for WordNet Nouns
Figure 2.3 Hyponymic Relations of Seven WordNet Unique Beginners
Figure 2.4 WordNet Verb Sentence Frames
Figure 3.1 Language Access Database Diagram
Figure 3.2 LAD Object-Oriented Class Hierarchy
Figure 4.1 Specified Case Frame for the Verb LIKE
Figure 4.2 List of Filled Case Frames
Figure 4.3 Algorithm for the Generation of Filled Case Frames (Part 1)
Figure 4.4 Algorithm for the Generation of Filled Case Frames (Part 2)

ABSTRACT

The concept of enabling computers to understand and generate `natural' language has been enticing to mankind ever since we first saw and heard HAL in the movie "2001: A Space Odyssey." However, the state of the art in Natural Language Processing (NLP) is a long way from making this concept a reality. One of the major bottlenecks in the implementation of NLP is the lexicon: the place where the system's information about words is stored. There are difficulties in deciding what information should be stored in a lexicon, and even greater difficulties in acquiring this information. To date there have been several efforts which provide on-line lexical database resources with varying amounts of lexical information for NLP systems. One problem that remains, however, is that a particular NLP application may need information from several such sources, and there is no standard way to access and combine information from several lexical resources. This thesis describes the Language Access Database (LAD) system, a system designed to incorporate multiple lexical databases and tools under one consistent functional interface in order to facilitate systems requiring syntactic, semantic and lexical knowledge. The technical aspects of the design were mainly influenced by research in the areas of Computational Lexical Semantics, Augmentative and Alternative Communication, and Natural Language Processing. The implementational aspects of the design were influenced by the paradigm of Object-Oriented programming to create a system that is easily extendable and upgradable. This thesis presents LAD and the Semantic Reasoning Module (SRM), a semantic parsing system designed to test the LAD system's functionality.

Chapter 1

INTRODUCTION

In order for communication to exist, there must be some agreed upon representation of symbols, whether iconic, gestural, vocal, or written, and an agreed upon set of rules upon which to build combinations of symbols into sentences of meaning. Human beings have the capacity to learn how to communicate, to pass along the semantic and syntactic rules to their following generations, and eventually to expand upon and refine these symbols and rules. Through these representations, mankind has the ability to communicate, to generate new ideas and thoughts, to learn, and to build upon the knowledge accumulated in the complex structure called language. According to the American Heritage Dictionary (1985), language is:

1. The use by human beings of voice sounds, and often of written symbols that represent these sounds, in organized combinations and patterns to express and communicate thoughts and feelings.
2. A system of words formed from such combinations and patterns, used by the people of a particular country or by a group of people with a shared history or set of traditions.

Since the advent of computers, humans have been searching for ways to improve them. We strive to make them faster, increase their memory capacities, and improve the user-interface peripherals and the methods we use to communicate with them. From a business perspective, there is at times a feeling of frustration voiced about the computer not living up to its heralded increased-productivity and decreased-cost expectations. This frustration is caused partially by the steep learning curve employees face in mastering the computer and its software, and partially by the unnatural communication methods we use to describe our problem to the computer, solve it, and interpret the results. This is one impetus pushing computer scientists to give our programs the capability to understand or generate language in order to provide a more `natural' way for computer users to be able to communicate with their systems and vice versa.

Natural Language Processing (NLP) is the area within computer science that deals with trying to create ways to process human language for various reasons and applications. In order to do this, there must be some means of representing language structure, and a mechanism for using this representation for generating and understanding language within the computer. Efforts to find solutions to these tasks have focused on three major areas of research: syntactic, semantic and pragmatic. The syntax of a language refers to the grammatical rules defining allowable ways to link words together into sentences. Semantics refers to the meanings of individual words and to the meanings formed by combining the words in a sentence. [FOOTNOTE: One NLP solution used to resolve word sense ambiguities is to look at the individual words in a sentence in relation to the other words of the sentence (i.e., the words in a sentence constrain the possible meanings of each other) (Small & Rieger, 1982).] Pragmatics refers to the overall context in which language is used (i.e., the situational context and past conversational experiences between conversing partners), creating expectations about future language interactions and context. This thesis will focus on the issues raised by the semantics of language.

One of the difficulties faced by the area of semantics is that many words have multiple meanings (e.g., shoe can refer to a protective covering worn on the foot, a restraining device used to help cars brake, or a container for cards in a casino). Once the intended meaning of each word is determined, a semantic system must reason about how the individual word meanings can be combined to resolve the meaning of the sentence. This has caused a focus on the kinds of information needed in order to do semantic parsing and reasoning. Do we need hierarchically linked relationships between words as well as syntactic and definitional information? How should we address the word sense disambiguation issue? [FOOTNOTE: This is one of the major questions faced today by researchers in Computational Semantics. It was the focus of the Post-COLING94 Workshop on Directions of Lexical Research, and is one of the main foci of the upcoming 1st Annual Workshop for the IFIP Working Group for Natural Language Processing and Knowledge Representation to be held at the University of Pennsylvania in April 1995.]
Some of this research has resulted in the specialized field of Computational Lexical Semantics, while other research has led to an increased understanding and improved methods and techniques necessary to handle the complex task of understanding and generating natural language. [FOOTNOTE: Though the state of the art in NLP is still restricted to limited domains, application coverage for NLP systems is now broader (Allen, 1994).]

Augmentative and Alternative Communication (AAC) research focuses on developing technologies to aid in spoken and written communication processes for people with motor and/or cognitive impairments. The abilities of people using AAC devices cover a broad spectrum, from mildly to more severely impaired individuals and from those with a single disability to those with a combination of cognitive, speech or language impairments. AAC devices are designed to enhance the language capabilities of this group and thus come in a variety of forms with a variety of technological options. Some AAC devices are non-electronic boards or books that may contain letters, words, concepts, or phrases written in traditional orthography or pictorial representations. Others are electronic, containing similar representations and usually accompanied by speech synthesis that provides a user with the capability to select and "speak" an utterance. Some electronic AAC devices include predictive capabilities to allow the user to create messages with a reduced physical load. This thesis will focus on the AAC devices that incorporate Artificial Intelligence and Natural Language Processing techniques and principles. Such systems require knowledge attached to the words in their system dictionaries.

The dictionary requirements for NLP-based AAC systems reflect the needs of the three major NLP research areas: syntactic, semantic and pragmatic. For the syntactic area, there is a need for grammars (i.e., a set of rules that refer to word categories and word patterns), morphological information (e.g., +S for plural, +ING), and word category information (e.g., NOUN, VERB, PREPOSITION). Syntactic knowledge is needed for systems that use word prediction, grammar checkers, and/or language tutoring (McCoy & Demasco, 1995). Semantic knowledge is useful for systems that process and/or generate language. Examples of semantic knowledge needed would include selectional restrictions, case frames, conceptual meaning for words (e.g., the semantic categories for Mary are human and animate, for window are fragile and physical object, and for hammer are tool and physical object), relational knowledge (e.g., a car is-a vehicle, a canary is-a bird), attributive knowledge (e.g., a canary is small and yellow with wings, feathers and a beak), functional knowledge (e.g., knives are used for cutting, trees are used for shade, energy and building) and knowledge of parts (e.g., a car has a windshield, a hood, brakes, fenders, tires, a steering wheel). Semantic knowledge has proven useful to AAC systems which try to understand the meaning the user of the system is attempting to convey. And finally, for the area of pragmatics, knowledge concerning situational context, communication history, and conversational conventions (e.g., turn-taking rules, organizational patterns for story-telling) is needed. This pragmatic information is important for systems that model typical conversational patterns (Alm et al., 1993). There are a variety of applications that require large amounts of lexical knowledge.
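To make the enumeration above concrete, the following is a minimal sketch of a lexical entry that bundles these kinds of word knowledge. It is purely illustrative: the field names and the example data are not drawn from any actual AAC system or from LAD.

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical lexical entry bundling the kinds of word knowledge
    // enumerated above; all names here are illustrative assumptions.
    struct LexicalEntry {
        std::string word;
        std::vector<std::string> categories;    // word category: NOUN, VERB, ...
        std::vector<std::string> morphology;    // inflected forms: canaries
        std::vector<std::string> semanticTypes; // conceptual: animate, physical object
        std::vector<std::string> isA;           // relational: canary is-a bird
        std::vector<std::string> hasA;          // parts: wings, feathers, beak
        std::vector<std::string> attributes;    // attributive: small, yellow
        std::vector<std::string> usedFor;       // functional: (a knife) cutting
    };

    int main() {
        // The canary example from the text expressed in this structure.
        LexicalEntry canary {"canary", {"NOUN"}, {"canaries"},
                             {"animate", "physical object"}, {"bird"},
                             {"wings", "feathers", "beak"},
                             {"small", "yellow"}, {}};
        std::cout << canary.word << " is-a " << canary.isA[0] << '\n';
    }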
While there are several on-line lexical resources available today, no single one contains all the information which may be necessary for AAC/NLP applications. [FOOTNOTE: Some of the efforts to develop such on-line lexical resources will be discussed in Chapter 2.] In the Language Access Database (LAD), we attempt to provide such a resource by creating a virtual dictionary which combines information from several available lexical and linguistic sources in a transparent, seamless manner. This project will provide a uniform dictionary interface to a variety of lexical resources (e.g., WordNet, collocative information, verb case frame data, word frequency data, phonetic information, syntactic information, category information). LAD provides the user (i.e., an application program) with a functional interface through which it may pose a variety of queries. The LAD program will search the appropriate lexical resource in order to handle a query. LAD shields the user program from the task of understanding which (set of) lexical resource(s) actually contains the necessary information. Many different kinds of lexical information can be obtained through this consistent user interface. LAD has the capability to extract information needed by systems to enhance their capabilities to understand language, to generate semantically and syntactically correct language, to process language and/or to produce speech-synthesized language, depending on the needs of the individual systems.

The focus of this thesis is the LAD system and its design. Before describing LAD in detail, Chapter 2 provides the technical background. It first motivates the need for a system like LAD by describing the application domain of Augmentative and Alternative Communication, concentrating on NLP-based AAC applications. It also covers the area of Computational Lexical Semantics and work in the area of dictionaries (both paper-based and on-line), including efforts at merging lexical resources. The chapter highlights current research, the knowledge base needs of systems, and the knowledge that available lexical sources provide. Chapter 3 details the LAD design: its functional interface, lexical resources, mapping capabilities, application areas, object-oriented design, extensibility, and implementation status. This chapter provides a complete description of LAD including the different kinds of lexical information that it can provide, a listing of applications that can benefit from LAD, and the flexibility the LAD design provides with respect to future enhancements and/or additional lexical resources. Chapter 4 details an application written to test and evaluate LAD called the Semantic Reasoning Module. It describes the kinds of information the module needs, its semantic reasoning, and how LAD provides the information needed for it to function properly. Finally, Chapter 5 concludes this thesis with an analysis of the testing and evaluation of LAD, a discussion regarding what LAD does and does not provide, and a look into the future directions of the LAD research project.

Chapter 2

RELATED RESEARCH

Motivation

A great deal of knowledge is required in order to understand language. There are many areas within Computer Science and other disciplines researching the different `kinds' of knowledge needed for this task, such as Computational Lexical Semantics, Linguistics, Psycholinguistics, Augmentative and Alternative Communication (AAC), and Natural Language Processing (NLP), to name just a few.
Some of these research areas concentrate on the fundamental structure of language, its syntactic and semantic properties, while other areas concentrate on what is needed in order for a computerized system to represent the meaning of textual input to facilitate computer understanding and/or generation of language. They all agree that the task set before them is difficult and may end up being impossible to accomplish in its most complete form. The progress made in this area has already led to vast improvements in computerized understanding and processing within restricted domains and applications (Chen & Huang, 1994). The need for and benefits of systems that will allow for unconstrained input justify the continued efforts of researchers in this field (McHale & Crowter, 1994).

The following section provides an overview of the application area of AAC, which was the catalyst for this work. We focus on NLP-based AAC systems and their lexical semantic requirements. The following sections provide background into the area of Computational Lexical Semantics, on which this thesis has relied for technical information about words (i.e., information that must be associated with words for various kinds of processing). The final section covers research on dictionaries, both paper-based and on-line, for further guidance on the kind of information that should be available on words.

Application Domain: Augmentative and Alternative Communication

The field of Augmentative and Alternative Communication (AAC) is devoted to developing alternative technology for people with disabilities [FOOTNOTE: These disabilities are primarily caused by afflictions such as cerebral palsy and ALS.] that prevent normal means of communication. The technology may provide an alternative means of communication and/or may be used in the assessment and training of these individuals (Demasco et al., 1994). The research has centered around the areas of input (e.g., eye gaze and gesture), language (e.g., representation, organization and processing of language), and output (e.g., speech synthesis) (Demasco & Mineo, 1995). This thesis will focus on the language area in the following subsections, giving an overview of the current research in AAC and NLP-based AAC technology. We will look at the knowledge required by these systems and discuss an example of NLP-based AAC technology called Compansion (Demasco & McCoy, 1992).

AAC Language Technology Overview

The majority of the research in the language area of AAC can be grouped into the three categories of representation, organization and processing of language. Research in the representation category concerns vocabulary design (e.g., icons, word boards) (Mineo et al., 1994; Chang et al., 1993; Magnusson & Briem, 1994; Albacete et al., 1994). The organization of language category includes organizational models (Demasco et al., 1994) and access issues (e.g., scanning, direct access) (Koester & Levine, 1994). The processing category includes research on rate enhancement techniques (e.g., letter prediction, word prediction) (Venkatagiri, 1993; Hamilton, 1994; Demasco et al., 1994; McCoy et al., 1994B; Vanderheyden et al., 1994) and language correction (Suri & McCoy, 1993; Morris et al., 1992). In early research, little attention was paid to the syntactic, semantic or language issues involved in developing AAC technologies (Kraat, 1990).
However, since the late nineteen-eighties, a subarea of AAC research has emerged that concentrates on applying Natural Language Processing techniques to enhance AAC communication devices. The University of Delaware and the University of Dundee are forerunners in this area of research (McCoy & Demasco, 1995). One of the major problems involved in developing intelligent AAC devices is that suddenly the system designer not only must be concerned with usability issues and the placement of and access to words, but also must provide the information about words which is necessary to do intelligent reasoning.

Some Natural Language Processing (NLP) used in AAC Applications

The use of Artificial Intelligence (AI) and Natural Language Processing (NLP) techniques in the development of AAC systems and devices continues to grow both in research laboratories and, more recently, in commercial products. The use of AI/NLP methods in any application area requires significant language knowledge about words and their various morphological forms, syntactic categories (e.g., NOUN, VERB, CONJUNCTION), thematic roles (e.g., AGENT, THEME, LOCATIVE) (Fillmore, 1977) and control information (McHale & Crowter, 1994). Within AAC, the need to support relatively unconstrained message production (in contrast to structured communication as in a database query language) requires that this knowledge be broad as well as detailed.

Some of the underlying NLP work looks at many of the variables involved for a system to understand ill-formed language (Weischedel & Sondheimer, 1983; Jensen et al., 1983). This research includes developing methods to resolve issues such as unknown words, word sense disambiguation, resolving referents, etc. (Granger, 1983; Fass & Wilks, 1983). For well-formed input, a system can generally resolve most of these issues, as long as the domain of the application is specific. In the case of ill-formed input, however, there are additional obstacles to consider (e.g., filling in missing words and/or punctuation, ad hoc abbreviations, lack of tense agreement, causality violations, goal violations, objects or events referenced out of context, subject-verb disagreement, confused homonyms). In order to deal with such irregularities of input, NLP researchers have been required to develop new methodologies (Small & Rieger, 1982).

Need for Word Knowledge in NLP based AAC Systems

Many word-based AAC systems contain words and word phrases written in traditional orthography. Other systems are based on graphic or icon-based conceptual representations. The communication devices designed for children with disabilities tend to be icon-based; one of their functions is to enable children to understand that language can be used to communicate desires and needs, such as getting help, obtaining objects, and expressing likes and dislikes (von Tetzchner, 1988). However, as the child progresses, she/he requires new methods to communicate at her/his increased vocabulary levels. One method to help accommodate this is through spelling; another is through learning more creative and complicated sequences of icons.

There are two predominant icon-based languages used in AAC today. One is the Blissymbolic [FOOTNOTE: Blissymbolics is a communication system designed by Charles Bliss based on symbols. It was originally intended as an international symbol system to minimize misunderstandings between people of different language groups (e.g., Chinese, English, and Japanese speakers).
Since 1971, Blissymbolics has been used on communication devices by children with speech disabilities (Magnusson & Briem, 1994).] language (Archer, 1977), and the other is the iconic language of the Minspeak system, which is based upon the principle of semantic compaction. [FOOTNOTE: Semantic compaction is the process of mapping concepts onto multi-meaning icon sequences and using these sequences to retrieve messages stored in a computerized system. It was developed by Bruce Baker (Baker, 1982).] In particular, Minspeak provides an excellent example of a system developed with linguistic and syntactic principles in mind. It will be the focus of the following discussion.

One major focus in the design of the Minspeak language is finding conceptually associated iconic representations and then combining these icons with syntactically bound icons. For instance, a sequence of the APPLE icon and the VERB icon will produce the word eat, while a sequence of the APPLE icon and the NOUN icon will produce the word food. Minspeak considers the icon images to act as indexes to concepts, and these concepts translate to words depending on the conceptual mapping of the iconic sequences (Chang et al., 1992). The researchers involved with developing the iconic languages used by Minspeak have recently been studying the use of AI techniques to introduce semantics to the language design. They have introduced some components from Conceptual Dependency Theory [FOOTNOTE: Conceptual Dependency (CD) is a theory of Natural Language Processing developed by Roger Schank that provides computer programs with the capability of "understanding" natural language sentences by representing sentences in a series of primitives. The goal of CD is to represent language such that it aids drawing inferences from sentences and is independent of the language (e.g., Chinese, English, French) (Rich & Knight, 1991).] to enhance the development of appropriate iconic sequences for use in programming the device (Chang et al., 1993; Albacete et al., 1994).

Kraat (1990), voicing concerns about the aphasic community, has pointed out that AAC has not addressed the issue of semantic meanings in language; this concern is being addressed by researchers such as Baker and Magnusson, who are beginning to incorporate AI techniques into Minspeak and Blissymbolics. The BlissGrammar program, for example, will correct sentences made with Blissymbols so that the text-to-speech converter will speak or print out easily understood sentences (Magnusson & Briem, 1994). Other AAC researchers have applied AI techniques to word prediction (VanDyke, 1991), letter-based abbreviation expansion (Stum & Demasco, 1992), word-based systems (McCoy & Demasco, 1995), and sentence retrieval techniques (Waller et al., 1992).

In order to incorporate AI/NLP techniques into these methods, a great deal of knowledge is required. Word and letter prediction systems need to be able to access frequency information. [FOOTNOTE: Most frequency information is derived from either bigram and trigram frequencies or from corpora information. A study of simulated experiments at the Applied Science and Engineering Laboratories showed that using corpora frequency data, predictions after the first letter were correct 44.8% of the time, versus 27.9% using bigram prediction (Hamilton, 1994).] Letter-based abbreviation expansion systems require a set of rules and frequency information.
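As an illustration of the kind of frequency lookup such prediction systems rely on, here is a minimal sketch of offering the most frequent completions of a typed prefix from a unigram table. The words, counts, and function names are illustrative assumptions, not drawn from any system described in this thesis.

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Return up to n completions of `prefix`, ranked by corpus frequency.
    std::vector<std::string> predict(const std::map<std::string, int>& freq,
                                     const std::string& prefix, size_t n) {
        std::vector<std::pair<std::string, int>> hits;
        for (const auto& entry : freq)
            if (entry.first.compare(0, prefix.size(), prefix) == 0)
                hits.push_back(entry);           // word starts with the prefix
        std::sort(hits.begin(), hits.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });
        std::vector<std::string> out;
        for (size_t i = 0; i < hits.size() && i < n; ++i)
            out.push_back(hits[i].first);
        return out;
    }

    int main() {
        // Toy unigram frequencies standing in for real corpus data.
        std::map<std::string, int> freq = {
            {"the", 69971}, {"that", 10595}, {"this", 5146}, {"they", 3619}};
        for (const auto& w : predict(freq, "th", 3))
            std::cout << w << '\n';   // the, that, this
    }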
One of the more sophisticated NLP-based AAC systems, Compansion (Demasco & McCoy, 1992), was developed at the University of Delaware and the Center for Applied Science and Engineering. Compansion is an example of a word-based system that requires a great deal of knowledge in order to do its processing. That system and its necessary knowledge are described in the following section.

Compansion

One example of an intelligent AAC technique is Compansion, an approach that takes telegraphic input from a user and expands it into a syntactically and semantically well-formed sentence. The Compansion technique assumes a communication system based on words, pictures, or icons (i.e., non-spelling) and attempts to enhance the user's message production rate [FOOTNOTE: While the Compansion technique has been primarily described as a rate enhancement technique, it also has potential applications in helping users learn how to produce grammatical sentences.] by requiring only the selection of content words. One advantage of such a system is that it reduces the need to represent morphological information (e.g., verb inflections, endings). This is potentially very beneficial for systems that use picture-based representations.

The Compansion system uses both syntactic and semantic knowledge in its processing. The system's processing consists of three major phases. The first phase is the "word order" parser, which relies on a syntactic grammar capturing the regularity of the telegraphic input. The word order parser is responsible for grouping words into sentence-sized chunks and indicating each word's part of speech. It is also responsible for attaching modifiers (e.g., compound possessives, adjectives and adverbs) to the word they are most likely modifying. The second phase consists of the "semantic" parser, which reasons about the meaning of the content words of the sentence-sized chunks and develops a semantic representation for each chunk. The final phase is the translator/generator, which takes the output of the semantic parser and generates an English sentence using a syntactic grammar of English.

Here we focus on the knowledge used by the semantic parser since it requires the most sophisticated knowledge sources. The semantic parser takes a set of words and attempts to fit these items into a well-formed semantic structure, thus determining the intended meaning. In the current implementation, processing is non-incremental; all of the input words are taken together and a semantic representation is created which best accommodates the set of words as a whole. Generally there will be at most one word identified as the main verb in the input words; the parser must determine which semantic role is being played by the other words. Consider the processing of the input John break window hammer. Once break is identified as the verb, the parser must decide which word of the input represents the agent/experiencer (i.e., the person or thing doing the action), which represents the theme (i.e., the thing being acted upon), and so on (Fillmore, 1977). This information is represented in the semantic parser in the form of a case frame that includes preference information (Fass & Wilks, 1983) to help the system reason about the possible intended use for each word of input.
For example, a simplified form of the specified case frame for break is shown in Figure 2.1 below:

    verb - break
    agexp  [toFillPref 4] [[human 3] [animate 2] [ergative 2]]
    theme  [toFillPref 3] [[physical 3] [fragile 4] [object 1]]
    instr  [toFillPref 2] [[tool_box 3] [tool 3] [solid 1]]
    goal   [toFillPref 1] [human 3]
    benef  [toFillPref 1] [[human 3] [organization 2] [animate 2]]
    loc    [toFillPref 1] [place 4]
    timee  [toFillPref 1] [time 4]

Figure 2.1 Specified Case Frame for the Verb BREAK

The above frame captures two kinds of preferences used by the system. The first kind of preference, the case importance preference, is indicated by the toFillPref rating. This indicates which of the roles are most important to fill for a particular verb. In the frame above, the important roles to fill are the AGEXP (agent/experiencer), THEME, and INSTR (instrument). This is indicated by their respective ratings of four, three and two. A four indicates a very high preference, while decreasing numbers indicate less preferred, but possible, options. The second kind of preference rating is the case filler preferences, which are motivated by Preference Semantics (Wilks, 1975). They indicate what kinds of objects could fill a role and give a rating on the "goodness" of each type of filler. The frame in Figure 2.1 indicates that the AGEXP role is preferred to be filled by a human with a rating of three, but that any animate object or ergative object (e.g., a car) would also be acceptable with ratings of two. The THEME role is preferred to be filled by a fragile object, but a physical or abstract object could also serve as a filler, with ratings of four, three, and one respectively.

A third kind of preference used by the system, but not directly illustrated in Figure 2.1, is the higher-order case preferences. This case preference is intended to account for interactions between cases and their fillers. For instance, if a non-human animate (e.g., a dog) is the AGEXP of a material process (e.g., eating), then it is highly unlikely that an INSTR is being used; however, if the AGEXP is a human, an INSTR would be very likely. In the semantic parser, this preference is captured by a rule which subtracts from the overall goodness of an interpretation when these conditions are found. An interpretation is given a rating by adding together the numbers obtained by multiplying the case filler and case importance ratings for each word's case (and then subtracting off the number indicated by higher-order case preferences). So for the input John break window hammer, we prefer that John be the AGEXP (with a rating of 4 times 3), window be the THEME (with a rating of 3 times 4) and hammer be the INSTR (with a rating of 2 times 3), giving us an overall preference rating of 30.

A final kind of knowledge, the idiosyncratic case constraints, is not used to add or subtract from a case frame's preference rating, but is intended to capture mandatory and forbidden cases within a frame. For instance, a mandatory feature of the verb hit is the THEME, while GOAL is a forbidden feature for this same verb. These constraints are placed directly on the verb frames by the absence or presence of these cases (McCoy et al., 1994A).

The basic idea of the semantic parser is to fit the non-verb words of input into the case frame in the best way possible. In order to do this, the semantic parser must access type-information associated with each word; a small sketch of how this scoring plays out is given below.
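The following minimal sketch scores the John break window hammer assignment using the ratings of Figure 2.1. It assumes the simplified frame above, ignores the higher-order and idiosyncratic constraints, and uses invented identifiers throughout; it is not the actual Compansion implementation.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // One case slot: its importance (toFillPref) and its filler preferences.
    struct CaseSlot {
        int toFillPref;
        std::map<std::string, int> fillerPrefs;  // semantic type -> goodness
    };

    // Score one word in one slot: toFillPref times the best rating among
    // the word's semantic types, or 0 if no type is acceptable.
    int score(const CaseSlot& slot, const std::vector<std::string>& types) {
        int best = 0;
        for (const auto& t : types) {
            auto it = slot.fillerPrefs.find(t);
            if (it != slot.fillerPrefs.end() && it->second > best)
                best = it->second;
        }
        return slot.toFillPref * best;
    }

    int main() {
        CaseSlot agexp {4, {{"human", 3}, {"animate", 2}, {"ergative", 2}}};
        CaseSlot theme {3, {{"physical", 3}, {"fragile", 4}, {"object", 1}}};
        CaseSlot instr {2, {{"tool_box", 3}, {"tool", 3}, {"solid", 1}}};

        // Type-information the parser must look up for each word.
        std::vector<std::string> john   {"human", "animate"};
        std::vector<std::string> window {"fragile", "physical", "object"};
        std::vector<std::string> hammer {"tool", "physical", "object"};

        int total = score(agexp, john)      // 4 * 3 = 12
                  + score(theme, window)    // 3 * 4 = 12
                  + score(instr, hammer);   // 2 * 3 = 6
        std::cout << total << '\n';         // 30, as in the text
    }

The best interpretation is the assignment of words to slots with the highest total; the actual parser would also subtract the higher-order penalties described above.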
For instance, the parser must be able to tell that John is a human, that hammer is not a human but a physical object, and that a window is fragile. With this information the semantic parser can reason about the words of input and generate the sentence John breaks the window with a hammer.

A major problem for Compansion as a viable AAC system is the amount of information it requires about each word it may encounter. This information is not readily available in any standard database or on-line dictionary, and it must currently be specified by an NLP researcher. This, of course, greatly hampers the extendability and customization of the system.

Computational Lexical Semantic Overview

Computational Semantics is a cross-disciplinary research area focused on enabling computers to reason about a language's semantics. The area brings together NLP computer scientists, linguists, and psycholinguists, whose shared focus is on developing methods for determining the "meaning" of natural language sentences. Many Computational Semanticists have devoted their time to determining appropriate meanings for component parts (Jackendoff, 1978; Levin & Pinker, 1992; Pustejovsky, 1991) (e.g., the various meanings for prepositional "location" phrases). Because many believe that the meaning of a sentence is primarily influenced by the main verb, much of the work has been devoted to determining the meanings of particular verbs and how they influence sentence meanings (Levin, 1992; Palmer & Polguere, 1994; Levin & Hovav, 1992). Others have studied how language and word/sentence meanings are acquired (Levin & Pinker, 1992; Jubak, 1992).

The methodology in Computational Lexical Semantics is to study the acquisition and use of component pieces of language and identify how these pieces influence the meaning of the whole sentence. Because these component pieces are made up of words, much work has been devoted to identifying the lexical information that could account for word meaning. The ultimate goal of lexical semantics is to develop lexicons (containing appropriate meaningful information), methods, and tools that will produce multi-purpose systems capable of handling unrestricted language (McHale & Crowter, 1994). Current research into the role that phrasal syntactic properties play in the understanding of words has resulted in a unifying focus, leading the community to believe that the core elements of semantic representations are becoming clearer (Levin & Pinker, 1992). This has in turn led to increased attention being focused on lexicon representations (Church, 1994; Macleod et al., 1994), the application of computational and statistical techniques to corpora and machine-readable dictionaries (Hogan & Levin, 1994), and new tools for the study of lexical representation (Berwick et al., 1994; Zickus, 1994).

NLP systems require knowledge about the words they must process. This knowledge is difficult to acquire and represent in a machine, and has definite effects on system performance, leading to the term "lexical bottleneck" (Byrd, 1989). Current Computational Lexical Semantics (CLS) research aimed at resolving this bottleneck has led to areas of specialization that may or may not need to be applied in conjunction with each other (i.e., the problem is too big to be attacked from one front, so the work has been split up into various areas in the hope that by combining efforts from one or more of these specialized groups, a solution will be found).
There are groups seeking to improve the quality of computerized lexicons by basing them on psycholinguistic principles (Miller et al., 1993; Miller & Fellbaum, 1992). Another group is attempting to further refine these improved lexicons and/or other CLS tools by applying them to each other, such as using Beth Levin's work on diathesis alternations to further improve WordNet or vice versa (Berwick et al., 1994; Zickus, 1994; Gu, 1994). Other groups are looking for ways to improve syntactic and/or semantic analysis methods through corpus-based lexical research (Church, 1994; Zernik, 1989). Some combine this corpus-based research with improved lexicons such as WordNet for the purpose of syntactic parsing (Macleod et al., 1994). There are groups focusing on defining and isolating the specific needs of Machine Translation systems (Chen & Huang, 1994), and finally, there are groups researching the lexical needs of integrated systems (Gros et al., 1994; McKevitt & Guo, 1994). While these separate groups are focusing on their specific research, they also maintain a focus on the end result of their work: resolving the lexical bottleneck. There have been discussions at various international and national workshops [FOOTNOTE: The International Post-COLING94 Workshop on the Directions of Lexical Research in Beijing, China was devoted to these issues, as is the upcoming 1st Annual Workshop for the IFIP Working Group for Natural Language Processing and Knowledge Representation in April 1995 at the University of Pennsylvania. Other discussions include the SIGLEX Workshop at Berkeley in June 1991 and a workshop on the Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity at the AAAI 1995 Spring Symposium Series at Stanford University in March 1995.] and conferences concerning what is needed for lexical processing, what is currently available, and what needs to be done, where feasible.

There are two notable research efforts that have recently been incorporated into the work of many Computational Lexical Semantics groups: the work by Beth Levin on defining classes of English verbs based on diathesis alternations, and WordNet, an example of an improved lexicon developed by researchers at Princeton University. The work by Beth Levin separates verbs into distinct verb classes based on the syntactic phenomena in which a verb may participate (Levin, 1993). This class differentiation research is on-going and could prove to be an important tool for generating systems that can extract semantic meaning from language and/or generate syntactically and semantically correct language. WordNet is the main lexical resource accessed by LAD and will be described in the next section on dictionaries.

Dictionaries and Lexicons

Dictionaries have been around for a long time and have developed relatively standardized formats through years of publishing (Vizetelly, 1915). Though the shape, size and organizational structure have changed over the centuries, the main role of the dictionary has not. Dictionaries provide users with a way of locating definitional, morphological, relational, antonymical, syllabic and phonetic information about words. This section will look at the format and function of paper-based dictionaries, on-line dictionaries, lexicons and WordNet, and conclude with a discussion of other efforts in merging lexical resources.
Paper-Based

While there are various paper-based versions of dictionaries available today (e.g., Webster's, American Heritage), they all have similar organizational formats. The words are stored in alphabetical order, and they contain a wide range of information about words such as spelling, pronunciation, inflected and derivative forms, etymology, parts of speech, definitions, illustrative uses of alternate senses, synonyms, antonyms, and special usage examples. This information has proven useful to people over the years. There are, however, some difficulties in locating certain types of information in paper-based dictionaries. For instance, if you look up the word poplar, you will find out that a poplar is a kind of tree, but in order to locate coordinate [FOOTNOTE: Coordinate terms, or sister terms, are terms that are derived from the same parent, or superordinate. They can be thought of as kind-of terms of that parent (e.g., poplar, ash and maple are all coordinate terms and are kinds of trees).] terms for poplar (an example of a lateral kind of search), you would need to start at the beginning of the dictionary and look at the definitions of every word to see if they are also types of trees. Thus, using paper-based dictionaries, you can search vertically for superordinate terms or subordinate terms without too much difficulty, but to search laterally takes a great deal of effort.

On-Line

With the advent of computer database systems, it was a logical step to create on-line dictionaries. The data can be stored in ways that link dictionary items not only in a hierarchical manner, but also in a lateral manner, thus reducing the time for lexical searches and increasing the flexibility of the searches. It wasn't long, though, before researchers decided to augment the information maintained within on-line lexicons to provide information needed for systems designed to process language. This led to the on-going debate concerning what approach to take in order to store both conceptual meaning [FOOTNOTE: Conceptual meaning is the cognitive content of words. This captures the expression of phenomena that are deeply embedded in the language.] and collocative meaning. [FOOTNOTE: Collocative meaning is the communication of meaning via associations between words or word classes. It attempts to explain words in terms of their relations with other lexical items and word classes.]

One school of thought believes that the best way to encode semantic knowledge is through corpora analysis (Velardi et al., 1991). These researchers don't store words as in a normal dictionary, but use corpora to produce conceptually related concepts which they then use to interpret the meaning of words taken in context. Examples of conceptually related concepts (CRCs) are: an activity is related to a location, which is a place; a change is related to a final state, which is a product; and farming is related to a location, which is a field. They build semantic knowledge bases of these CRCs, in place of lexicons, from corpora analysis. The resulting semantic knowledge base can then be used in semantic processing for domain-specific applications. The other school of thought is to design on-line lexicons that incorporate additional knowledge necessary for deriving conceptual and collocative meanings (Miller et al., 1993). There are two on-line lexicons commonly used by the research community: the Longman Dictionary of Contemporary English (LDOCE) and WordNet. The next section describes WordNet in more detail.
WordNet

WordNet (Miller et al., 1993) is an on-line dictionary/thesaurus developed by a group of psycholinguists at Princeton University with over 95,000 lexical entries, 51,000 being simple words and 44,000 being collocations (e.g., attorney general). WordNet has the ability to distinguish a number of different senses of words and to produce synonym sets, called synsets, [FOOTNOTE: A synset is a grouping of synonymous words (i.e., words with the same meaning). An example of a synset is [merchandise, wares, product].] and sentence glosses for these differing senses. The primary purpose of a dictionary is to provide definitions for words. WordNet provides definitions for all synsets and thus for every sense of every word.

WordNet incorporates in its design psycholinguistic theories and research attempting to mimic how people organize lexical memory (Miller et al., 1993). The central organizing feature for the lexical entries is a lexical matrix: a mapping between words and the synsets to which they belong. There are four different word databases associated with WordNet: one for nouns, one for verbs, one for adverbs and one for adjectives, each one having a slightly different hierarchical representation. This hierarchical organizational method is very natural for nouns, as it is based on psycholinguistic findings that people store nouns in their memories in a hierarchical manner from specific to general (Miller, 1993). WordNet has isolated 25 unique beginner synsets (see Figure 2.2), which are used to build up the hierarchical structures for nouns (see Figure 2.3).

    [act, action, activity]   [animal, fauna]        [artifact]
    [attribute, property]     [body, corpus]         [cognition, knowledge]
    [communication]           [event, happening]     [feeling, emotion]
    [food]                    [group, collection]    [location, place]
    [motive]                  [natural object]       [natural phenomenon]
    [person, human being]     [plant, flora]         [possession]
    [process]                 [quantity, amount]     [relation]
    [shape]                   [state, condition]     [substance]
    [time]

Figure 2.2 List of 25 Unique Beginners for WordNet Nouns

Figure 2.3 Hyponymic Relations of Seven WordNet Unique Beginners

WordNet uses a lexical inheritance system in its hierarchical structuring of nouns. This allows semantic components to be inherited from their superordinate/parent terms. The base synset for most hierarchies is [thing, entity]. As mentioned previously, WordNet stores its words in synsets. In so doing, a noun is stored in as many synsets as there are different senses to the word. For instance, bank is in six separate synsets (e.g., Senses 1, 2, 3 - [bank], Sense 4 - [depository financial institution, bank], Sense 5 - [bank, supply, reserve], and Sense 6 - [bank, bank building]). Notice how the first three synsets look identical on this level. However, if you look at their superordinate terms, the differing senses of the word become evident. The superordinate term synset for Sense 1 of bank is [slope, incline], for Sense 2 is [ridge], and for Sense 3 is [array, arrangement]. By placing nouns into synsets in this manner, WordNet can be used by systems to enhance word sense disambiguation.

One of WordNet's strengths is that all of its over 95,000 words are broken into their different senses. However, this strength is also a weakness in that there is no agreed upon standardization among the lexical research community as to the number of senses any given word has, or what label should be given to these senses (Ide & Veronis, 1994).
As an example, Martha Palmer is a researcher who has spent years defining the different senses of the word break (Palmer, 1990; Palmer & Polguere, 1994). Her work has merged the 30 different paper-based dictionary meanings into four core senses for break. WordNet, on the other hand, has 23 different senses for break, although some of the discrepancies between WordNet's different senses are difficult to distinguish (e.g., Sense 1 - violate, fail to agree with, go against, break, be in violation of; Sense 2 - transgress, violate, go against, breach, break, be in violation of). It is important, though, to have some way to disambiguate word meanings in order to facilitate semantic reasoning. For instance, if we are talking about the word bank, are we talking about the financial institution where we make monetary transactions, or are we talking about the bank by the river where we go fishing or swimming or picnicking? To this end, the sense information in WordNet is a valuable lexical tool.

As mentioned previously, there are some limitations with traditional dictionaries. One such shortcoming is that the information stored with a word is often incomplete. When one looks up a noun, for example platypus, one learns that it is a semiaquatic, egg-laying mammal, but unless one is an expert on mammals, there is no way, other than by looking up mammal, to find out if a platypus has hair. Dictionaries are ordered alphabetically and not grouped semantically; therefore such searches can be cumbersome. This weakness in contemporary dictionaries demonstrates one of the major strengths of WordNet: its semantic and lexical relations. By using the WordNet on-line lexicon, it is easy to discover the attributes of a given noun by traversing the semantic links of its superordinate term (i.e., its parent).

WordNet's semantic and lexical links also enable searching for various relationships to other words in the English language. One of these relationships is the is-a hierarchical relationship, which is one of the most pivotal in giving meaning to words that are related to each other. For instance, if we know what a vehicle is, then it helps us to understand what a car is by knowing that a car is-a vehicle. This information is useful for applications that need semantic relationships between words, such as systems designed to teach English as a second language. Another important lexically-linked feature needed by many lexical systems is a means by which you can discover what parts or attributes a word possesses. This is often referred to as the has-a relationship. As in the case of the is-a relationship, the WordNet database has semantic links to help derive this information. For instance, using has-a knowledge you can determine that a car has an accelerator, gun, throttle, automobile engine, boot, luggage compartment, trunk, fender, bumper, car door, car mirror, car window, mudguard, first gear, low gear, floorboard, glove compartment, gear box, transmission, radiator grill and so on, all using WordNet's semantic and lexical links.

Another feature lacking in contemporary dictionaries, but available through WordNet, is information about coordinate terms (i.e., sister terms). Someone looking for information about other mammals would be forced to search the dictionary from beginning to end looking for terms that are classified as mammals.
The prototypical lexical entry for a word points to its superordinate term, not laterally to its coordinate terms or downwards to its hyponyms (i.e., its children or subordinate terms). This is one of the strengths of WordNet: its ability to reach related terms easily through its direct links to superordinate, coordinate and hyponymic terms makes searches for such information routine.

The familiarity rating of a word is a useful indication of that word's probability of use. In order to incorporate familiarity into WordNet, the designers have attached a syntactically tagged index of familiarity to each word. Generally, frequency is used as an indication of familiarity for words; however, WordNet's designers note that the relationship between frequency of occurrence and polysemy has been well-documented ever since 1945 (Miller, 1993). They go on to contend that polysemy predicts the amount of lexical access as well as frequency does, since the more frequently a word is used, the more different meanings it has in dictionaries. Therefore, WordNet uses polysemy as its familiarity index. The familiarity rating is a useful way for users to locate a suitable alternative choice for a word simply by traversing a word's hierarchical links and finding the word with the highest familiarity rating. For instance, by traversing the WordNet hierarchical links for the word diary, which has a familiarity count of 2, you can find the alternative choice of writing or writings, which has a familiarity count of 7. This is a useful feature for systems interested in writing, generation, or teaching English as a second language. This familiarity feature is available not only for nouns in WordNet, but also for the verb hierarchy.

Due to the nature of verbs, the complexity of their predicate argument structures (e.g., noun phrases), and other verb features (e.g., entailment, polysemy), it is unreasonable to represent verb entries in the lexicon merely with synsets (Miller & Fellbaum, 1992). To compensate for the complex nature of verbs, WordNet organizes them into senses based both on synsets and on other verb features. It also provides short glosses to indicate the "meaning" of each sense (Fellbaum, 1993). WordNet was not designed to recognize the syntactic regularities that are a part of the semantic meaning of verbs; to compensate for this, WordNet includes one or more generic verb sentence frames for each verb synset. These frames distinguish features of the verbs by demonstrating the kinds of sentences in which they may participate. These frames indicate specific features of verbs that have been highlighted by researchers such as Beth Levin (e.g., argument structure, prepositional phrases and adjuncts, sentential complements and animacy of the noun arguments) (Miller & Fellbaum, 1992). The sentence frames are limited to a standard, generic format that is shown in Figure 2.4 (for a complete list of the WordNet verb frames, see Appendix D).

    Somebody ----s something
    Somebody ----s somebody
    Something ----s
    Somebody ----s
    Something is ----ing PP
    Somebody ----s something to somebody
    Something ----s something
    Somebody ----s that CLAUSE
    Something ----s somebody

Figure 2.4 WordNet Verb Sentence Frames

WordNet is an exceptional on-line dictionary, containing a great depth and breadth of lexical and semantic knowledge.
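The hierarchy traversals described above (walking an is-a chain and comparing familiarity counts) can be illustrated with a toy structure. The sketch below is not WordNet's actual programming interface; the diary and writing familiarity counts (2 and 7) come from the text, while the intermediate entries and all identifiers are invented for illustration.

    #include <iostream>
    #include <map>
    #include <string>

    // Toy synset record: a link to the superordinate term plus the
    // polysemy-based familiarity count. Not WordNet's real data format.
    struct Synset {
        std::string hypernym;   // superordinate term ("" at the top)
        int familiarity;        // familiarity (polysemy) count
    };

    int main() {
        std::map<std::string, Synset> net = {
            {"diary",         {"journal", 2}},       // counts 2 and 7 from the text;
            {"journal",       {"writing", 3}},       // the rest is invented
            {"writing",       {"communication", 7}},
            {"communication", {"", 5}},
        };

        // Walk up the is-a chain from "diary", reporting familiarity;
        // "writing" (count 7) emerges as the more familiar alternative.
        for (std::string w = "diary"; !w.empty(); w = net[w].hypernym)
            std::cout << w << " (familiarity " << net[w].familiarity << ")\n";
    }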
WordNet is currently being used by many researchers (Church, 1994; Macleod et al., 1994; Berwick et al., 1994; Zickus, 1994; Sutcliffe et al., 1994) as a tool to help solve some of the basic road-blocks in Computational Lexical Semantics and NLP, especially as a means of word sense disambiguation.

WordNet does have some weaknesses. The morphology information within WordNet is minimal, and it works in only one direction: stripping off endings or searching exception lists to find the root form of a word (e.g., went is changed to go during a WordNet database search). There is no morphological mechanism for adding endings to root words (e.g., being able to pluralize story to stories). Morphological information is important to systems involved in Natural Language Processing/Generation. Additionally, if someone desires phonetic information (e.g., for speech synthesis systems), information on non-noun/verb/adjective/adverb terms (e.g., for word-based systems), proper nouns, or information on function words, they need to go to another source.

WordNet makes a nice base for developing a multi-purpose linguistic tool. It has the traditional dictionary information (e.g., definitions, example sentences, part of speech tags, illustrative uses of alternate senses, synonyms, antonyms), and it takes advantage of computer technology by incorporating semantically related links within its lexicon to provide additional lexical information. WordNet is the primary dictionary resource accessed by the Language Access Database (LAD), which will be described in Chapter 3.

Efforts in Merging Lexical Resources

There are other current research efforts in merging lexical resources besides the LAD project. While the two discussed here have similarities with LAD, there are significant differences in approach, design and functionality. The first work is being developed by Kevin Knight and Steve Luk at USC/Information Sciences Institute, and the second work is by Michael McHale and John Crowter at Griffiss Air Force Base in New York. Since both works use the Longman Dictionary of Contemporary English (LDOCE), a brief description of it is provided before further discussion proceeds.

LDOCE is designed to be a learner's dictionary of English as a second language. It contains over 27,000 words, chosen for their core vocabulary aspects and frequency of current usage, and over 70,000 word senses. It includes short definitions, examples of usage, syntactic categories (e.g., adj followed by to), semantic categories (e.g., human, animate object), and pragmatic categories (e.g., economics and business). One of the nice features of LDOCE, from the angle of incorporating it into other computerized systems, is that it has a controlled defining vocabulary of approximately 2000 basic words. This allows systems to need less word knowledge while processing the definitions; however, it can also lead to more complex syntactic representational structures in order to fully define words within the limitations of a constrained vocabulary.

Knight and Luk (1994) are building a large-scale knowledge base for machine translation using semi-automatic methods for manipulating and merging existing resources.
Their resources are: 1) the PENMAN Upper Model, a network of about 200 nodes used to influence linguistic choices; 2) the ONTOS model from Carnegie Mellon University, an ontology designed to support machine translation; 3) LDOCE; 4) WordNet; and 5) the Harper-Collins Spanish-English bilingual dictionary, containing thousands of Spanish words with English translations. Their methodology is to merge these on-line dictionaries, semantic networks and bilingual resources through semi-automatic methods (e.g., conceptual matching of semantic taxonomies and definitions). The chosen resources give them access to much of the semantic and syntactic information needed by NLP-based systems, including semantic categories (e.g., human, inanimate object, tool), sense information, and phonetic information (though they do not mention it, LDOCE provides pronunciation data on its words). This work has not been completed. Most of their effort so far has gone into developing the merging algorithms needed to match up the different resources. No complete testing has been done; therefore, evaluation of this system is not possible.

The goal of the second research group is to construct a lexicon from a machine readable dictionary. The resources used are: 1) a Principle-Based Parser (PBP), based on Chomsky's Government-Binding Theory; 2) LDOCE; and 3) Roget's International Thesaurus, with approximately 225,000 words. A PBP is based on a lexical theory of grammar and needs more information than most parsers require (e.g., morphological forms, part of speech, syntactic complements, thematic roles and control information). The combination of LDOCE and Roget's seems to provide the information necessary for the PBP.

One of the limitations of this system concerns thematic roles (e.g., AGENT, THEME, BENEFACTIVE, EXPERIENTIAL, and LOCATIVE). LDOCE does not explicitly provide thematic role information. McHale and Crowter (1994) extracted it by searching for repeated patterns of words in the LDOCE word definitions (e.g., the pattern to cause to is indicative of an AGENT role). This proved successful for only two-thirds of the verbs; of these, ten percent had conflicting patterns and had to be re-analyzed by hand. They were also limited to the five thematic roles listed above. They developed a mapping facility that correctly mapped sixty-three percent of the word senses in Roget's Thesaurus to the word definitions in LDOCE. This work is not completed. However, they did implement a lexical browser (limited to aerospace terminology) that allows a user to select a word and see the word's definition along with links to any aerospace terms. The browser also has speech output, uttering the word and a sample sentence from LDOCE.

The system being designed by Knight and Luk (1994) is for the specific domain of machine translation and has not addressed the needs of other domains. Their design of merging the five specified resources together limits their system's extensibility. In other words, if they wish to add a feature, such as frequency, they need to re-work a great deal of their system in order to merge in additional resources. They have addressed the domain-specific needs of machine translation systems, but have not provided for the needs of other NLP-based systems.
While McHale and Crowter (1994) were aiming to provide NLP technology with the ability to handle unconstrained language, personnel and time restrictions forced them to cut back, and they have thus fallen short of their original goals. Their system is designed to provide a large, general lexicon for a broad-coverage, domain-independent, syntactic parser (PBP); however, it does not address the needs of some of the NLP-based systems discussed earlier. What is needed is a system with a wider scope of information and a more extensible architecture (for incorporation of future lexical and linguistic resources), containing semantic and syntactic information capable of supplying the needs of several real-world applications.

Chapter 3
LAD DESIGN

Motivation

One of the limitations of AAC devices today is the size of their dictionaries and the lack of information available in them. While some may contain an adequate number of words, none of them contain sufficient information on these words to do semantic and syntactic reasoning, such as the word information needed by a semantic parser. In addition, while there is substantial interest in the development of natural language interfaces within the general software community, there currently do not exist any lexical databases that provide both broad coverage (in terms of numbers of words) and sufficient depth of information (e.g., case frames) for individual words (Zickus et al., 1995; McHale & Crowter, 1994).

There are several lexical tools and systems available on the market today; however, there are limitations with most, if not all, of them. They each have their own strengths, weaknesses, and specializations. In order to take advantage of more than one tool at a time, there needs to be a centralized interface system that will extract and glean the desired information in a consistent, understandable and functional manner. Thus the idea behind the Language Access Database (LAD) was developed.

The approach to designing LAD has been to create an implementation with C++ and Lisp [FOOTNOTE: In our NLP laboratories, we often use Lisp to develop prototypes and C++ for commercial application development.] interfaces that allows a programmer to access several different databases (or lexical resources) in a seamless manner. LAD provides the programmer with a set of functions which can be used to retrieve various types of information. LAD then queries the appropriate databases, retrieves the desired information, and returns it to the user. Thus the user is given the impression that a single database containing a variety of information is available, while LAD may actually access several different lexical resources to handle a query.

At the same time, the programmer is given as much or as little control as they need. For instance, a user can simply query LAD about the frequency of a word, and LAD will return the frequency rating found for the most generally accepted meaning of that word in some default corpus. Alternatively, if the programmer prefers, LAD provides mechanisms for the user to specify a particular "sense" of the word they are interested in and/or specify which corpora they would like to use. Thus it provides ways for users to override its default settings and to exercise a great deal of control over the way a specific query is processed.

LAD accesses several different lexical resources, the most unusual of these being the on-line dictionary/thesaurus WordNet, which was discussed in the previous chapter.
It is WordNet that contains much of the semantic information needed for intelligent AAC applications. This chapter will describe the functionality of the LAD design, the lexical resources available to LAD, the LAD mapping functionality for different application areas, the object-oriented design of LAD, its extensibility, and finally its current implementation status.

Functional Interface

LAD is designed to be a lexical resource for a variety of applications. A goal of this work is to provide a functional interface that allows a user to query for several different types of lexical information. Often the methods provided allow for optional arguments that give the programmer more control over the way the query is evaluated. In this section, the LAD design functionality will be described. The section is broken into four sub-sections; the first describes the argument structure of the LAD methods, and the next three reflect the "kind" of lexical knowledge returned by the LAD functions (i.e., semantic, syntactic, and other specialized knowledge) (for more detailed LAD specifications, see Appendix A).

Argument Structure

LAD was designed based on the object-oriented paradigm. This gives the system the capability of having a flexible argument structure, providing LAD with an adaptability and robustness that would otherwise have been difficult to obtain. All of the functions described in the next three sub-sections have the over-loaded argument capability of accepting more detailed query specifications. This means that every function, besides being able to provide the more generic information requested for a word, can also return information for a specific sense of a word, a part of speech for a word, or a particular lexical source used for the information retrieval. For instance, an is-a(book, dramatic_composition) query would search all senses of the word book to see if dramatic_composition is in its hierarchical lexical links, while is-a(book, dramatic_composition, 2) would only search (WordNet) sense 2 of book for this superordinate. Frequency searches might add part of speech and/or source information to the query, as in frequency(and, CONJUNCTION) or frequency(writes, VERB, mobywords). This capacity for query-specific information is an important one for the LAD system, enabling it to return both generic and specific information on words from its lexical resources.

The next three sections describe the functions of LAD; the kind of argument(s) they take and the output they yield are generally illustrated with examples.

Functions for Semantic Knowledge

is-a(word, parent)
This function takes two arguments: the word that is-a information is needed for and the parent term being considered for the is-a query. For instance, if we know what a vehicle is, then it helps us to understand what a car is by knowing that a car is a vehicle. The output from this function is an integer, 0 for false and 1 for true. In the example shown, the output is 1 for true. If the query was for the input book, dramatic_composition, and 2 (indicating sense 2 of book), the answer would be 0 for false, since dramatic_composition is not part of the hierarchy for book as in "record, recordbook, book."

is-hierarchy(word)
This function takes one argument, a word; it returns the is-a information associated with the word. For instance, you might know that a poplar is a tree, but you might want to know what other superordinate terms it has.
In this case, the output is a character string containing all of the hierarchical links for poplar (e.g., "poplar, poplar tree => angiospermous tree, flowering tree => tree => woody plant, ligneous plant => vascular plant, tracheophyte => plant, flora, plant life => life form, organism, being, living thing => entity"). This tells the user system what other categories poplar can be considered as an is-a term. If multiple hierarchies exist, they are separated, and if a sense is specified, only the hierarchy for that sense of the word is returned.

list-semantic-categories(word)
This function takes one argument, the word for which semantic category information is needed. Semantic category information is related to the is-a information described in the previous example. In this case, however, the categories returned are limited to those used by the Compansion system (listed in Appendix C). For instance, if we ask for the list of semantic categories hammer belongs to, it returns the list "object, inanimate, tool." The output from this query is a character string.

has-a(word, attribute)
This function takes two arguments: the word that has-a information is needed for and the attribute being searched for to determine if the word contains it. For instance, if we want to know whether a car has-a windshield, we use this query. The output from this function is an integer, 0 for false and 1 for true. In the example shown, the output is 1 for true.

has-parts(word)
This function takes one argument, the word that we want has-parts information about. For instance, you might want to know all of the attributes a car has. The output from this function is a character string. In this case, the output is a character string containing "car - windshield, engine, trunk, wheels, steering wheel, mirror, fender, accelerator, bumper, floorboard, glove compartment, roof, suspension, tail pipe, turn signal, hood, radiator grille, gears,..." The output for sense 3 of car (as in railway car) would be "suspension, suspension system."

semantic-properties(word, property)
This function takes two arguments, a word and a semantic property. The goal of this function is to provide semantic information about a word; for instance, given the noun tree, what is its means of reproduction? In this case the word would be tree, the property would be reproduction, and the output would be the character string "seeds." Other semantic queries that could be made of the noun tree would be utility, yielding "shade, protection from the wind, produces oxygen, provides fuel, provides wood for construction," or height, yielding "trees are generally tall." The output of this query is a list of character strings.

definitions(word)
This function takes one argument, the word for which a definition is required. For instance, if the argument is tree, the output will be "Sense 1=>tree, tree diagram - a figure that branches from a single root; `genealogical tree,' Sense 2=>tree - a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms." If the function is given the arguments platypus and 1 (requesting sense-specific information), the output would be "a platypus is a semiaquatic, egg-laying mammal."

familiarity(word)
This function takes one argument, the word needing familiarity information. There is a great deal of research into the polysemous [FOOTNOTE: In general, polysemy refers to the multiple meanings some words have, especially verbs.] nature of words.
Some researchers contend that polysemy is an indication of the frequency of a word's usage, while others contend that it has to do with the familiarity of a word. The LAD familiarity rating is based on polysemy. As we have defined familiarity, it indicates a prediction on lexical access. The output is in the form of an integer ranging from 1 (not familiar) to higher ratings, such as 48 (very familiar). The familiarity rating for the word break is 45.

sense-info(word)
This function has one argument, the word for which sense information is needed. The output, an integer, is the number of senses found for that word. For example, the output for the sense-info query for the word break is 23. This function does not have a sense-specific overloaded form, as it would not be logical to ask how many senses sense 1 of a word has.

alternate-word(word)
This function takes as an argument a word for which an alternate choice is needed. Often when someone is writing a paper, letter or thesis, it is useful to get a good alternative for a word (e.g., when writing about broncos, another acceptable term would be horse, its generic type). LAD uses the index of familiarity to provide useful alternative words. The output of this query is a character string. For example, given the input bronco, LAD will return "horse."

coordinate-terms(word)
This function takes one argument, the word whose coordinate (or sister) terms are wanted. The output is in the form of a character string. So, if it is given the argument watch, it will return "clock, hourglass, sandglass, sundial, timer, atomic clock, ticker." If it is given a specific sense of the word, as in ash with sense 2 (for a type of wood or tree), the output will include "yellowwood, balsa wood, boxwood, acacia, redwood, bamboo, poplar, birch..."

is-coordinate-terms(word, sister)
This function takes two arguments: the word being queried about and a possible sister term. The output is an integer, 0 representing false and 1 representing true, indicating whether the sister argument is indeed a coordinate term of the word. For example, if the first argument is ash and the second argument is poplar, the result would be 1, or true.

synonyms(word)
This function takes one argument, the word for which synonyms are needed. The output of this function is a character string. For instance, if the argument is the word shoe, the output would be "footwear, footgear, horseshoe, U-shaped plate, brake shoe." If there is a sense specifier argument to this function, for example, break and 4, the output will be "disperse, dissipate, scatter, break, spread out, break up, come apart, separate, part, split, move apart."

parent-terms(word)
There is one argument to this function call, the word requiring parent-term information. As in previous examples, if a sense specifier argument is provided, then the parent-term for that specified sense of the word will be returned; otherwise it will return all of the parent-terms for all senses of the word. The output from this function is a character string. For instance, if there is one argument, shoe, the output will be "{footwear, footgear}, {plate, scale, shell}, {restraint}," and if the arguments are platypus and 1, the output will be "mammal."

children-terms(word)
There is one argument for this function call, the word whose children-terms are wanted. By adding a sense specifier argument, it searches for the specified sense of the word; otherwise it searches all senses of the word.
For instance, if the argument is lake with no sense specified, the output will be "{reservoir, artificial lake}, bayou, Great Lakes, Lake Erie, Lake Huron, Lake Ontario, Lake Michigan, Lake Superior, {lagoon, laguna, lagune}, {pond, pool}, Caspian Sea." The results from this search are character strings. In the case of multiple strings for different senses, they will be returned in a list format to provide a separation between the different strings.

antonyms(word)
There is one argument for this function, the word for which antonym data is needed. All antonyms for the word will be returned unless there is a sense specifier argument, in which case the function will return sense-specific information. For example, if the one argument male is given, the output will be "woman, female." The output for this function is a character string. In the case of multiple antonyms for different senses, they will be returned in a list format to provide a separation between the differing antonym strings.

nominal-relationship(word)
There is one argument to this function, the word for which nominal relationship information is needed. The output of these searches is either a character string or a list of character strings, depending on whether sense-specific information is requested. For instance, if the word is Amazon and sense 1 is specified, the output will be "river." If there are multiple answers, as in the case of no sense specification, the answers are separated by being placed in a list.

reverse-nominal-relationship(word)
Alternatively, LAD has also been designed to include a reverse nominal relationship function. This takes one argument, the familiar category name, and returns a character string. For example, if the argument is satellite, the output string will be "moon, astronomy satellite, communications satellite, space station, Salyut, Skylab, sputnik, spy satellite, weather satellite, meteorological satellite."

case-frames(verb)
This function has one argument, the verb for which a specified case frame is desired. The output is a verbFrame, a specified verb case frame object. The particular case frame contains the information needed for the SRM project described in the next chapter. There are twenty distinct verb case frames (see Appendix B), each having different ratings and semantic category lists. For instance, given the verb break, the output would look like:

verb - break
agexp [toFillPref 4] [[human 3][animate 2][ergative 2]]
theme [toFillPref 3] [[physical 3][fragile 4][object 1]]
instr [toFillPref 2] [[tool_box 3] [tool 3] [solid 1]]
goal [toFillPref 1] [human 3]
benef [toFillPref 1] [[human 3][organization 2][animate 2]]
loc [toFillPref 1] [place 4]
timee [toFillPref 1] [time 4]

Functions for Syntactic Knowledge

is-part-of-speech(word)
This function takes two arguments: the word being searched for and a part of speech tag (an overloaded argument). The output is an integer, 0 for false and 1 for true. For instance, if the arguments are bow and NOUN, the output will be 1. Similarly, for bow and VERB, the output would also be 1, for true. However, if the arguments were bow, VERB, 9, the output would be 0, since bow has 10 senses as a NOUN, but only 4 as a VERB.

get-part-of-speech(word)
This function takes one argument, the word for which part of speech information is sought. The output is a string indicating the parts of speech for the word. For instance, if the argument is bow, the output would be "NOUN, VERB, ADJECTIVE."
word-compounds(word)
This function takes one argument, the word for which compound usages are sought. The output is a character string. For instance, if the argument is dandelion, the output would be "common dandelion, dandelion, dandelion green, dwarf dandelion, fall dandelion, krigia dandelion, russian dandelion."

morphology(word, tense, kind)
This function can take up to three arguments. The first argument is the word for which morphological information is desired. The next two arguments specify the kind of morphological information wanted for that word. The output of this function is a character string. For instance, if the arguments are be, present, first-person-singular, the output would be "am." The third argument is optional, since not all morphological searches require one; for example, candy, plural would yield "candies."

Functions for Other Specialized Knowledge

pronunciation(word)
This function has one argument, the word for which phonetic information is requested. For instance, if the arguments are lead, NOUN, and sense 9, the output will be the phonetic string for the word as used in the sentence I need more lead for my pencil. Whereas, if the arguments are lead and VERB, the output will be the phonetic string for lead as in I will lead the girl scout troop tomorrow.

syllabification(word)
There is one argument for this function, the word requiring syllabification information. The output is a character string. For instance, if the argument is category, the output will be "cat-e-gor-y."

frequency(word)
This function takes one argument, the word requiring frequency information. The output is a floating point number. For instance, the output for the arguments and and CONJUNCTION would be 1.68549, and for the arguments writes, VERB, and Moby Words it would be 0.211624.
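Before turning to the resources themselves, the overloaded argument structure underlying the functions above can be made concrete. The fragment below sketches what a few of these signatures might look like in the C++ interface; the names and types are illustrative, reconstructed from the query examples in this section rather than taken from the actual LAD header (see Appendix A for the real specifications).

#include <string>

// Illustrative interface sketch; an abstract class standing in for
// the real LAD functional interface.
class lad_interface {
public:
    virtual ~lad_interface() {}

    // is-a: search all senses, or only the specified sense.
    virtual int is_a(const std::string& word,
                     const std::string& parent) = 0;
    virtual int is_a(const std::string& word,
                     const std::string& parent, int sense) = 0;

    // frequency: default corpus, or a given part of speech,
    // or a given part of speech and source.
    virtual double frequency(const std::string& word) = 0;
    virtual double frequency(const std::string& word,
                             const std::string& pos) = 0;
    virtual double frequency(const std::string& word,
                             const std::string& pos,
                             const std::string& source) = 0;
};

// Usage mirroring the examples in this section:
//   db.is_a("book", "dramatic_composition");     searches every sense
//   db.is_a("book", "dramatic_composition", 2);  searches sense 2 only
//   db.frequency("writes", "VERB", "mobywords");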
LAD Resources

The LAD project aims at providing a great deal of functional capability. While many lexical resources are available today, not one of them provides all of the information LAD requires. Therefore, in order for LAD to accommodate the functionality just discussed, it needs access to several linguistic and lexical database resources. The primary LAD resource is the WordNet database. Some of the other sources that LAD can access include an internally developed verb case frame database (where verb frames such as the one from our previous example are stored), a morphology database, a file containing phonetic information, a syllabification file, and databases containing various information derived from the Brown corpus and the Carterette corpus. The following sections cover some of the lexical resources of LAD. It should be noted that LAD has been designed for extensibility; because of this, augmenting it with additional databases is a fairly straightforward process.

Figure 3.1 shows the overall structure of LAD and the resources it accesses. One important function of LAD is the integration of multiple lexical resources and/or files. These resources are shown on the right part of the figure. The architecture is extensible in that new lexical resources can be added without modification of the database engine.

WordNet

LAD derives a majority of its semantic knowledge from WordNet. The is-a and has-a relationships, word definitions, familiarity, sense information, alternate-term information, coordinate terms, synonyms, parent-terms, children-terms, antonyms, nominal-relationships, reverse-nominal-relationships, word-compounds and most of the semantic-property information are all derived from the WordNet lexicon. The hierarchical structure of the WordNet data, as described in Chapter 2, is one of the major reasons so much semantic knowledge can be extracted. This aspect of WordNet, combined with its broad lexical coverage (over 95,000 lexical entries), has provided LAD with a vast amount of available knowledge and has proven WordNet an excellent choice as LAD's primary lexical source.

[Figure Diagram]
Figure 3.1 Language Access Database Diagram

Case Frames

The case frame information is contained in a secondary database consisting of a flat file with over 85 verbs and their attached case frames. The case frames contain data used for the semantic reasoning done in the Semantic Reasoning Module (SRM) described in the next chapter. The SRM application was developed to test the LAD design, including the extensibility required for adding additional resources. In addition, the case frame database illustrates how multiple lexical resources can be used to retrieve desired information. For instance, the number of verb case frames that LAD has access to is larger than 85 due to the synonym sets in WordNet. Some systems may need verb frame information on verbs that do not have frames (e.g., pummel). By default, LAD first searches for a verb frame in the case frame database. In the case where the verb is not represented in the secondary database, LAD retrieves synonyms of the verb from WordNet (e.g., crush) and then checks the case frame database for one of the synonyms.
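This default search order is easy to express in code. The sketch below is a simplified illustration using toy in-memory tables in place of the two resources; it is not the LAD implementation itself (the actual classes are described in Appendix A).

#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-ins for the case frame database and the WordNet
// synonym lookup (illustration only).
static std::set<std::string> frameDatabase = {"break", "crush", "like"};
static std::map<std::string, std::vector<std::string>> wordnetSynonyms = {
    {"pummel", {"crush"}}       // per the example above
};

// Return the verb whose frame should be used: the verb itself if it
// has a frame, otherwise the first synonym that does.
std::string frame_verb(const std::string& verb)
{
    if (frameDatabase.count(verb))
        return verb;
    for (const std::string& syn : wordnetSynonyms[verb])
        if (frameDatabase.count(syn))
            return syn;
    return "";   // no frame found; see Chapter 4 for possible recovery
}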
Morphology

An important aspect of computerized language generation and understanding is the ability to do morphological reasoning about words. For example, if the word of input is trees, a system needs the ability to reduce it to the root word tree in order to search for information associated with the word in various databases. [FOOTNOTE: Word information is normally associated with root words in databases. There are exceptions, however, such as WordNet, which has a morphological processor that reduces words to their root form before it searches the database.] Likewise, for the generation of language, if the input is John be happy or They be happy, morphological information is needed to find the third person present singular or third person present plural form of the word be in order to generate either John is happy or They are happy. As these examples show, morphology is an important lexical tool. Much of the needed morphological information is contained in collections of databases provided by Moby Words. However, one might instead imagine a database which contains only exception data, together with a processor which handles other morphological information in a rule-based manner. [FOOTNOTE: There is a rule-based morphology program with exception tables presently being used in a different application at the Natural Language Processing Laboratory.] Incorporating such a system would be straightforward given the LAD design.

Phonetic Information

The phonetic information LAD will access is contained in the Moby Pronunciator II file from the Moby Word II collection. It contains over 175,000 phonetic entries using the standard International Phonetic Alphabet. Moby Pronunciator II contains many common names and phrases borrowed from other languages, as well as a large number of common English words, compound words and phrases. This resource will provide phonetic information for applications interested in producing speech synthesis or phonetic information.

Syllabification Information

The syllabic information LAD will access is contained in the Moby Hyphenator II file from the Moby Word II collection. It contains 187,175 hyphenated single and compound words. This resource will provide syllabic information for applications requiring hyphenated word information.

Frequency Information

Corpora have been used by various NLP applications in place of large lexicons to enable semantic and contextual word meaning (Velardi et al., 1991). The Brown Corpus is one of the most prominently analyzed and used corpora. It was compiled by W. Nelson Francis and Henry Kučera at Brown University using written American English sources sampled in 1961. It contains 1,014,294 words from sources that include the press, religion, popular lore, biographies, essays, general fiction, science fiction, adventure fiction, romance stories and humor. It comes in both a syntactically tagged version and an untagged version. The frequency information database will access frequency information from the Brown Corpus, the Carterette Corpus and Moby Word II, which includes a frequency file of the 1,000 most frequently used English words, a file with the 1,000 most commonly used English words on the Internet computer network in 1992, and a file with the 1,185 most frequently occurring substrings in the King James Version Bible.

LAD Mapping

LAD is intended to be a useful semantic and lexical tool for multiple systems, which are designed independently of LAD and LAD's resources. It is anticipated that queries may require some mapping between various sources in order to be fulfilled. For instance, a query may require LAD to access one database in order to get information needed to access another database to properly retrieve the necessary information. In addition, there may be queries which require retrieving similar information from two different databases, where the information itself is categorized differently in the two databases. In order to handle these naming discrepancies and complex searches, LAD has a mapping capability which allows information to be used consistently across different databases, even though the information may be given different names in each. The mapping and the availability of multiple database accesses are handled transparently to the functional interface.

Semantic Categories

A semantic category is a way of expressing attributes of a word or the roles a word can fill (e.g., song -> auditory, auditory communication, auditory sensation). The Semantic Reasoning Module (SRM), described in Chapter 4, requires a list of the semantic categories a word can fill in order to reason semantically about possible word roles. LAD facilitates the SRM by searching the WordNet database to determine what categories can be attributed to a word. However, there are some semantic categories in the SRM that have different names in the WordNet database (e.g., SRM - ingestible, WordNet - foodstuff, nutrient; SRM - visual, WordNet - visual_perception, visual_communication). To compensate for discrepancies between the specifications of LAD data and a user-system's data, LAD uses the mapping capability previously described. For example, when the SRM wants to know whether a carrot is an ingestible item, LAD will access the mapping table to find ingestible's translation into WordNet categories. Finding that ingestible translates into foodstuff or nutrient, LAD will search the WordNet database for foodstuff and/or nutrient as elements of the hierarchy for carrot and return a true or false answer to the SRM (see Appendix C for the complete WordNet to SRM mapping). In this manner, LAD transparently handles discrepancies between system specifications and its accessible source specifications.
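A few rows of this mapping can be sketched as a simple translation table. The code below is illustrative only: the hierarchy test is a toy stub standing in for LAD's actual WordNet search, and the table shows just the entries mentioned in this section (the full mapping appears in Appendix C).

#include <map>
#include <string>
#include <vector>

// Sample SRM-to-WordNet category translations from this section.
static std::map<std::string, std::vector<std::string>> srmToWordNet = {
    {"ingestible", {"foodstuff", "nutrient"}},
    {"visual",     {"visual_perception", "visual_communication"}}
};

// Toy stub; the real test searches the word's WordNet hierarchy.
static bool in_wordnet_hierarchy(const std::string& word,
                                 const std::string& category)
{
    return word == "carrot" &&
           (category == "foodstuff" || category == "nutrient");
}

// True if any WordNet translation of the SRM category appears in the
// word's hierarchy, e.g., carrot qualifies as ingestible via foodstuff.
bool fills_srm_category(const std::string& word, const std::string& srmCat)
{
    for (const std::string& cat : srmToWordNet[srmCat])
        if (in_wordnet_hierarchy(word, cat))
            return true;
    return false;
}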
Queries Requiring Multiple Databases

NLP systems are often designed with more than one knowledge base. These multiple knowledge bases enable complex reasoning about different "kinds" of data. For example, a system requiring corpora knowledge would, through LAD, get information from both the Brown and Carterette corpora; if it specified a particular corpus, it would receive data from just that one. [FOOTNOTE: Different corpora are gathered from different sources. Depending on the sources in a corpus, the statistical data retrieved can be influenced.] For instance, frequency information can be determined in a number of ways, one being the count of a word's occurrences in a large corpus. LAD will have access to several corpora. These corpora differ in the type of information found in them; for instance, the Carterette Corpus contains samples of spoken language, while the Brown Corpus contains samples of written language from a wide variety of sources. LAD provides multiple frequency information queries to enable users to choose the specific databases from which they obtain frequency information, if they have a stronger preference for one corpus over another. However, if no database is specified, LAD will use a default path to retrieve frequency information. Alternatively, LAD could combine frequency information from several different sources, using mapping information to decide if and how the information can reasonably be combined.

Another example of LAD requiring multiple databases is the searching methodology for verb case frames mentioned previously. If LAD cannot find a case frame for the verb it is processing, it accesses information from WordNet that enables it to continue its search in the verb case frame database.

Complicated Queries

At times, NLP systems require data that necessitates LAD searching one or more databases before searching a final database to find the answer. For instance, in the case of a system requiring corpora statistics for a specific sense and part of speech of a word, LAD would first need to access the WordNet database for the part of speech specified and then locate the specific sense of that word. Finally, LAD would need to call its corpora mapping function with the located word sense in order to retrieve the desired information.

Another example of a complicated query would be one required by a computerized system with speech synthesis capabilities, which relies on phonetic data to enable it to generate speech. LAD will be able to search the phonetic database for a given word and part of speech tag and return the pronunciation string. Phonetic information will be available to LAD through the phonetic database, but it will require a mapping function from a specific part of speech category for a word to the corresponding WordNet database (e.g., if the word is a noun, then LAD needs to access that word in the WordNet noun database). Then it needs to locate the intended word sense (based on semantic information, e.g., lead as a metal), and conclude its search by using a mapping utility that maps from WordNet to the pronunciations located in the phonetic database. This enables LAD to accommodate different pronunciations of the same word (i.e., lead as in He is the Project lead vs. lead as in I need lead for my pencil).
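The chain of lookups behind such a pronunciation query might be organized as below. This is a hedged sketch, not the planned implementation: the helper functions are hypothetical stand-ins for the WordNet sense lookup and the Moby Pronunciator access, the sense-key encoding is invented for illustration, and the toy entries merely show the two pronunciations of lead.

#include <string>

// Hypothetical sense resolution; a real system would consult the
// WordNet index for the given part of speech.
static std::string sense_key(const std::string& word,
                             const std::string& pos, int sense)
{
    return word + "%" + pos + "%" + std::to_string(sense);
}

// Toy phonetic lookup standing in for the Moby Pronunciator II file
// (approximate IPA shown; the verb sense number is invented).
static std::string phonetic_entry(const std::string& key)
{
    if (key == "lead%NOUN%9") return "lɛd";    // the metal
    if (key == "lead%VERB%1") return "liːd";   // to guide
    return "";
}

// Chained query: resolve the word sense in WordNet first, then map
// that sense onto its entry in the phonetic database.
std::string pronunciation(const std::string& word,
                          const std::string& pos, int sense)
{
    return phonetic_entry(sense_key(word, pos, sense));
}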
Application Areas

LAD is designed to interact with multiple lexical databases in a transparent manner, so that the user-system treats LAD as a single dictionary. The resulting system will be a useful tool for various NLP-based AAC applications. Some applications that could benefit from LAD would be a system with a speech synthesizer needing pronunciation information, and a system with a syntactic-based word predictor using morphological information to predict correct verb forms. LAD will also be able to provide frequency information for systems designed to enable word and/or letter prediction. These systems rely on frequency information to determine appropriate choices for the next word or letter(s) of input. LAD is currently being tested with a semantic parser based on the reasoning principles used in Compansion (discussed in the section Application Domain: Augmentative and Alternative Communication in Chapter 2). This application requires verb case frame information as well as semantic category information and is discussed in more detail in the next chapter. LAD can also provide the necessary information for other NLP-based systems, such as machine translation systems or syntactic parsers, as discussed at the end of Chapter 2.

Object-Oriented Design

The object-oriented paradigm is a set of theories, standards, and methods that together represent a way of organizing knowledge (Budd, 1991). Actions occur when a message is sent to an object; the object interprets the message and then performs some method in response. All objects are instances of a class, and if the same message is passed to multiple objects of the same class, they perform the same method. The sender of the message does not need to know, or even care to know, how the action is accomplished, just that it is done (Budd, 1991).

To take this into the realm of everyday living, if I call my local florist, Bill, and order a bouquet of red roses for my mother, I do not care about the details of how the florist actually fulfills my request; I just care that the order is filled. In this example, Bill, an instance of the class florist, gets a message, an order of a bouquet of red roses to be sent to Mary Mair, requiring him to use a method, create_red_rose_bouquet, a function providing the actions needed to make up the bouquet of red roses, and another method, delivery, to deliver my order to my mother. In addition, I can be assured that if I call Bill to deliver a vase of pink carnations to my neighbor, Lorraine, Bill will recognize the different variables and use the appropriate methods to carry out this different request.

This notion of objects and classes is the basic building block of object-oriented programming. To build upon this and increase the power and flexibility of the paradigm, the additional notion of inheritance is utilized. Classes can be organized into hierarchical structures that can inherit attributes and methods from the class (or classes) from which they are derived.
To take this back to my previous example: Bill and I are both humans and thus inherit a great deal of attributes and knowledge from the class human. I am also a Computer Scientist and thus have a great deal of knowledge from the class Computer_Scientist, while Bill is a member of the Florist class and thus has a great deal of knowledge associated with the Florist class. The human class might contain a function detailing the basic actions needed to create a bouquet of red roses: get a vase, fill it with water, and place red roses in the vase. The Florist class needs more detailed information than this; it would therefore have a method that overrides the basic method in the human class with one that goes into more detail, such as cutting the stems at an angle of 30 degrees while the stem is under water, placing the roses into a vase filled with specially treated nutritious water, and so on. With this brief discussion and example of the object-oriented paradigm in place, the following paragraphs describe the LAD object-oriented design as depicted in Figure 3.2.

LAD is based on the object-oriented paradigm. Its base class, lad, contains the virtual methods needed to perform its functions. Presently derived from the lad class are the wordnet class and the case_frame class. These classes contain methods that require more detailed instructions than the basic methods in the lad class. For instance, the is-a function requires information from WordNet in order to return an answer of substance. The lad method for is-a only indicates that it returns 0 (false/NULL). The wordnet class is-a method, on the other hand, actually searches the WordNet databases and retrieves the is-a information associated with the specified word. The case_frame class is-a method knows how to handle is-a information about particular verbs. For instance, the query break is-a relational would return false, since break is a material verb and not a relational verb.

[Figure Diagram]
Figure 3.2 LAD Object-Oriented Class Hierarchy

As depicted in Figure 3.2, the level directly under the base class, lad, is the first level of derived classes that contain their own methods and data specifications. This enables the derived classes to inherit some functionality from the base class and to overload those methods that need to provide additional functionality. The level beneath the case_frame class depicts the two classes that form the case frame class (for more detailed information, see Appendix B): specified case frames, which hold the case and role choices for specific verbs, and filled case frames, which hold the words that have filled a case along with its preference rating. The dotted lines below the specified-case-frame class indicate several derived classes for the different types of verbs (e.g., relational, attributive, oral, written). These classes specify to the system what the toFillPref ratings are on each case of a specified frame, and what categories (with their associated ratings) can fill each case. This hierarchical structure gives LAD the ability to grow and change with relative ease.
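The two levels just described can be summarized in a skeleton like the following. The method bodies are placeholders rather than excerpts from the LAD source (which is detailed in Appendix A); the sketch simply shows the base class declaring a virtual is-a method that the derived resource classes override.

#include <string>

class lad {
public:
    virtual ~lad() {}
    // The base method only establishes the interface; by itself it
    // returns 0 (false/NULL).
    virtual int is_a(const std::string& word, const std::string& parent)
    {
        return 0;
    }
};

class wordnet : public lad {
public:
    virtual int is_a(const std::string& word, const std::string& parent)
    {
        // ...search the WordNet hierarchy for parent above word...
        return 0;   // placeholder
    }
};

class case_frame : public lad {
public:
    virtual int is_a(const std::string& verb, const std::string& verbClass)
    {
        // ...verb-type test; e.g., break is-a relational is false,
        // since break is a material verb...
        return 0;   // placeholder
    }
};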
Extensibility

LAD has been designed using the object-oriented paradigm to increase its extensibility and robustness. Under this paradigm, information not currently available to LAD can be added easily, without any major changes to the existing code. Instead, a new class can be created which knows how to handle the information accessible in the new lexical resources. If this class is derived from the LAD hierarchy, it will then be readily accessible to LAD. As an example, pronunciations are not available in WordNet or the case frame secondary database currently accessible by LAD. By creating a pronunciation class, instances of pronunciation can be stored in a secondary database and accessed in a manner similar to the way verbs with case frames are currently accessed in the case frame database.

The functionality of LAD is general enough to be used by many different derived modules. A resource having functional capabilities not presently available in the LAD design can be included in LAD with minimal code changes (e.g., add the new virtual functions to the base lad class, derive the new class from lad, define the methods associated with this class, and add calling capabilities to the LAD driver in order to initiate the newly declared and defined functions). In this manner, LAD not only has the ability to increase its lexical and linguistic resources, but it can also incorporate new design features with a minimal amount of effort. This is a notable difference and improvement over the design of the system developed by Knight and Luk (1994) mentioned in Chapter 2. They merge their five lexical resources together, and would require substantial alterations to the existing code in order to incorporate new resources, if they were desired.

Implementation Status

LAD has not been fully implemented, as we felt it was important to first complete a full analysis of LAD's functional design, implementing the WordNet and case frame classes with enough functionality that testing and then evaluation could be performed. Currently implemented are the functions that perform the is-a, list-semantic-categories, has-a and case frame queries (for more details on LAD's functions, see Appendix A), which are the functions needed by the Semantic Reasoning Module (SRM). The following chapter describes the SRM in more detail.

Chapter 4
SEMANTIC REASONING MODULE

Motivation

LAD was designed to be a useful lexical resource for multiple applications. In order to test LAD's capabilities, the Semantic Reasoning Module [FOOTNOTE: The SRM was based on principles developed in Compansion, but implemented in C++, instead of Lisp, to test out the theoretical and functional design of LAD.] (SRM) was designed and implemented. The SRM takes as input uninflected content words intended as a telegraphic message. [FOOTNOTE: In the input, the verb is identified and the other words are limited to nouns which play some role with respect to the verb.] The SRM returns a list of all the possible filled case frames which capture possible intended messages associated with the input words. Each filled case frame indicates which role each of the words plays with regard to the verb, and has a "case rating" associated with it that can be used to compare the relative goodness of the various generated filled case frames. In order to accomplish this, the SRM obtains case frame and semantic information from LAD. The following sections describe certain requirements of the SRM and how LAD is able to satisfy them (for more detailed SRM function specifications, see Appendix B).

Case Frames and Semantic Reasoning

The SRM uses two case frame representations to perform semantic reasoning. The first type of case frame is the specified verb case frame, previously discussed in Chapter 2.
The second is the filled case frame, described in more detail in the section on Semantic Output in this chapter, which is used to generate syntactically and semantically correct sentences. The SRM takes as input a verb and one or more uninflected nouns, which are intended to fit together into a well-formed sentence. The system's goal is to determine the proper roles of the content words with respect to the verb, and to generate filled case frames that reflect all possible combinations of these roles.

The first interaction between LAD and the SRM consists of the extraction of a verb case frame for the input verb. The information contained within the case frame specifies what roles can be filled, how important it is to fill each of these roles, and the kinds of words that can be used to fill each role. This case frame provides semantic expectations for the remaining words of input. The next interaction between LAD and the SRM is to determine what semantic categories the words of input can fill. In some cases these semantic categories must be mapped onto categories appropriate for the verb frame by LAD's mapping mechanism (as discussed in Chapter 3). Using the information retrieved by LAD, the SRM can generate all possible filled case frames and present them to the user. The filled case frames (along with their goodness ratings) are the SRM output. These could then be taken by another component to generate syntactically and semantically correct sentences for the words of input, which could be returned to the user for feedback as to which generated sentence best matches their intended meaning.

Semantic Categories

The semantic categories used by the Semantic Reasoning Module capture information that can be associated with words. This information provides distinctions between classes (e.g., animate vs. inanimate) and is motivated by the roles that objects can play with respect to various verbs. These semantic categories are the same categories employed by the Compansion system (see Appendix C).

Preferences

The job of the SRM is to determine likely sentence meaning. For the SRM, this sentence meaning is captured in a case frame representation based on work by Fillmore (1977). In this representation, the verb is central and the nouns are said to play a small number of roles with respect to the verb. The roles used include AGEXP (agent/experiencer), the object doing the action; for us, the AGEXP does not necessarily imply intentionality, as in predicate adjective sentences (e.g., John is the AGEXP in John is happy). THEME is the object being acted upon, while INSTR is the object or tool used in performing the action of the verb. GOAL can be thought of as a receiver, which is not to be confused with the BENEF (beneficiary) of the action; for example, in John gave a book to Mary for Jane, Mary is the GOAL while Jane is the BENEF. We also have a LOC case, which captures the location in which the situation is taking place (this case may be further decomposed into TO-LOC, FROM-LOC, and AT-LOC), and TIME, which captures time information (this case may also be further decomposed).

The final output of the SRM is a number of filled and rated case frames which place the nouns of the input in all of their reasonable roles.
So, for example, the input mary like car generates two different filled case frames (one with a rating of 18 and the other with a rating of 9):

The verb is like
The total filled case rating is 18
The agexp->caseName is mary
the caseFiller category is human with a rating of 12
The theme->caseName is car
the caseFiller category is object with a rating of 6

The verb is like
The total filled case rating is 9
The theme->caseName is car
the caseFiller category is object with a rating of 6
The benef->caseName is mary
the caseFiller category is human with a rating of 3

The SRM must have the ability to reason about the most likely way that the given words can fit into a case frame. In order to do this, a set of preference ratings is associated with each verb case frame, indicating possible ways of completing the frame. In addition, these preferences are used to rank the filled case frames against each other. There are three different kinds of semantic case preferences: case filler preferences, which are found in many NLP case-based systems and indicate the semantic categories that are most reasonable for filling a given case (e.g., animate agents are preferred for most verbs); case importance preferences, which indicate which cases are most important to fill (e.g., the agent case is generally more important to fill than the beneficiary case); and higher-order case preferences, which capture interactions between the ways various cases are filled. Preference ratings fall on a 1-4 scale: 4 signifies a high preference, while 1 signifies that the choice, while acceptable, is only appropriate in special cases.

The case filler preferences are used in other semantic representations to indicate preferred fillers for the roles of a particular verb case frame and are based on the case filler preferences described in Preference Semantics (Wilks, 1975). The kinds of objects that could fill a particular role are indicated, along with a rating for each possible role filler type. For example, the preference for filling the BENEF case for break is: ((human 3) (organization 2) (animate 2)). This indicates that, given the choice, a human should fill this role, but an organization or an animate object are also reasonable alternatives.

Case importance preference ratings indicate which cases are more important to fill for a particular verb. For example, in the case frame shown in Figure 4.1 for like, it is more likely that the role of AGEXP will be filled than any of the other cases. To indicate this, a higher value (four) is given as the preference for filling the AGEXP case, while lower values (one or two) are given as the preferences for filling the other cases.

The higher-order case preferences are used to compensate for exceptions-to-the-rule situations. For instance, if a non-human animate (e.g., dog) fills the AGEXP role for the verb eat, it is highly unlikely that an INSTR is being used by the dog to do the eating; yet if a human fills the AGEXP role, an INSTR is very reasonable. The idea is to use these higher-order case preferences to compensate for this kind of situation by, in essence, lowering the case importance preference rating.
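The ratings above can be reproduced by a single sum: a filled frame's rating adds, for each filled case, the case importance preference (toFillPref) multiplied by the case filler preference. The sketch below, using the values from the like frame in Figure 4.1 just below, recovers the ratings 18 and 9 for the two mary like car readings; the struct and function names are illustrative, not the SRM's own.

#include <vector>

struct FilledCase {
    int toFillPref;    // case importance preference (1-4)
    int fillerPref;    // case filler preference for the chosen category
};

// Frame rating = sum over filled cases of toFillPref * fillerPref.
int frame_rating(const std::vector<FilledCase>& cases)
{
    int total = 0;
    for (const FilledCase& c : cases)
        total += c.toFillPref * c.fillerPref;
    return total;
}

// Reading 1: mary as AGEXP (4 * 3 = 12) + car as THEME (2 * 3 = 6) = 18.
// Reading 2: car as THEME (2 * 3 = 6) + mary as BENEF (1 * 3 = 3) = 9.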
verb - like
agexp [toFillPref 4] [[human 3][organization 2][animate 1]]
theme [toFillPref 2] [object 3]
instr [toFillPref 1] [[cognitive 3] [tool 1]]
benef [toFillPref 1] [[human 3][organization 2][animate 2]]
loc [toFillPref 1] [place 4]
timee [toFillPref 1] [time 4]

Figure 4.1 Specified Case Frame for the Verb LIKE

These preferences are captured in a set of verb frames which are stored in one of the lexical resources available to LAD, which provides functions for retrieving the verb frames. For example, in the verb case frame for like shown in Figure 4.1, the case importance preferences (indicated by toFillPref after each role) indicate a very high preference for filling the AGEXP case, and a preference for filling the THEME case over the remaining cases. The case filler preferences for each role are captured in the list following each toFillPref; they indicate that a human is preferred for the AGEXP role (though an organization or animate is also acceptable), and that an object is needed to fill the THEME role. Thus, given the input like mary car, mary can fill the role of AGEXP or BENEF (though AGEXP is preferred over BENEF) and car can fill the role of THEME.

SRM and LAD (Knowledge Representations)

As was discussed in Chapter 3, LAD relies on WordNet to retrieve the semantic categories of words. One problem is that a particular application (e.g., the SRM) may use a set of semantic categories which do not exactly match the categories that WordNet uses. This is in fact the case with the SRM, since it is based on the case frames developed for the Compansion system (see Appendix C for the SRM's semantic categories). For instance, the SRM refers to a class inanimate, but WordNet does not classify any words as inanimate, classifying them instead as inanimate_objects. To compensate for this, LAD uses its mapping function to map to and from the WordNet and SRM semantic categories. Thus it maps inanimate to inanimate_objects and allows the SRM to function as though WordNet stored the semantic information exactly as the SRM needs it.

Secondary Verb Frames vs. WordNet Verb Frames

The verb case frame database has a small number of entries (88 to be exact), while the SRM needs LAD to return verb case frames for a much larger number of verbs. LAD accomplishes this by not only searching the verb case frame database for case frames, but, if that search comes up empty, proceeding to generate a list of synonyms for the verb and searching the secondary database for a synonymous entry. If an entry is found for one of the verb's synonyms, then that case frame is returned. Part of our future work will investigate how the system might recover if no synonymous verb has an entry. One approach would be to generate a case frame from the WordNet verb frame data (see Appendix D). Such case frames would not be as complete as the others (i.e., they may not contain as many semantic categories in their preferences), but they would allow the SRM to continue reasonable processing. Again, as with the semantic category mapping, the verb case frame search is done in a transparent manner.

Semantic Output

The output from the SRM is currently a list of filled case frames, indicating what role each word of input should play for a particular possible sentence generation, with a total rating on the goodness of the sentence. Figure 4.2 contains the list of filled case frames for the input sequence break mary window hammer.
The preference rating for a sentence is the summation of the ratings on the filled roles in the filled case frame (each obtained by multiplying the case importance preference, toFillPref, by the case filler preference). The preference ratings for the filled frames shown in Figure 4.2 are 21, 12, and 12 respectively. From this, the preferred interpretation comes from the first frame, which might be generated as Mary broke the window with the hammer. The generator might return all possible generations in such a manner that the sentence with the highest preference rating is returned first, the next highest rated sentence next, and so forth. The user could then select the preferred sentence from those generated, with the most likely generation being at the top of the list.

The verb is break
The total filled case rating is 21
The agexp->caseName is mary
the caseFiller category is human with a rating of 12
The theme->caseName is window
the caseFiller category is object with a rating of 3
The instr->caseName is hammer
the caseFiller category is tool with a rating of 6

The verb is break
The total filled case rating is 12
The theme->caseName is window
the caseFiller category is object with a rating of 3
The instr->caseName is hammer
the caseFiller category is tool with a rating of 6
The goal->caseName is mary
the caseFiller category is human with a rating of 3

The verb is break
The total filled case rating is 12
The theme->caseName is window
the caseFiller category is object with a rating of 3
The instr->caseName is hammer
the caseFiller category is tool with a rating of 6
The benef->caseName is mary
the caseFiller category is human with a rating of 3

Figure 4.2 List of Filled Case Frames

SRM Processing

The SRM processing can be quite complex, since many of the input words may appropriately fill several roles. The SRM is required to ensure that: 1) each word of input appears in some role in every generated filled case frame; 2) no two words fill the same role in a filled case frame; and 3) every filled case frame has a unique assignment of input words to roles. To handle this complexity, the SRM uses the algorithm shown in Figures 4.3 and 4.4. At the start of processing, the algorithm has a list of words, each with an associated rolelist. The rolelist indicates each role a word may play (according to the semantic specifications captured in the case frame) and a preference strength associated with each role, indicating how much that word filling that role would add to the filled frame's overall rating. For clarity, the "failure points" have been left out of the algorithm. For instance, if two words of input could each fill only the same single role, the algorithm would not be able to generate any filled case frames. The algorithm has been separated into two figures to depict the two main loops of processing done by the SRM.

1. Check for all words that can only fill one role and place them on the mandatory rolelist.
2. Call ridConflicts to eliminate the roles filled in Step 1 from the remaining words' rolelists.
3. Go back to Step 1 until there are no conflicting roles or words with only one possible role to fill remaining.
4. If there are no words remaining with multiple possible roles, then generate a filled case frame using the words on the mandatory rolelist; otherwise go to Step 5 to handle the recursion required for further processing.
Figure 4.3 Algorithm for the Generation of Filled Case Frames (Part 1)

At this point, either the processing is completed (in the case where all the words of input can only fill one role once conflicts are eliminated), or else more complex processing needs to proceed (in the case where there are words that can fill multiple roles).

5. Take the first word with multiple possible roles off of the multiple role wordlist.
6. Pop a role off of this word's rolelist and put it on a temporary list.
7. If there are more words on the multiple role wordlist, go to Step 8; otherwise generate a filled case frame using the mandatory list and the temporary list, and put this frame on the list of filled case frames to be returned. If there are no more roles on this word's rolelist, then return the list of filled frames; otherwise go back to Step 6.
8. Check the rolelist of the next word on the multiple role wordlist to see if its first role conflicts with the roles on the temporary list; if it does, skip this role and move to the next role. Repeat Step 8 until a role is reached that does not conflict with any roles on the temporary list.
9. If there are more words on the multiple role wordlist, go back to Step 8; else if there are no more roles on this word's rolelist, add the current role from this word to the frames on the partially filled framelist; otherwise go to Step 10.
10. Generate a filled case frame for each entry remaining on this word's rolelist, and put the frame on a partially filled framelist.

Figure 4.4 Algorithm for the Generation of Filled Case Frames (Part 2)

This algorithm ensures that each word of input is in every generated filled case frame, that no two words conflict over a role, and that every generated frame is unique. A compact sketch of the first loop (Figure 4.3) is given below.
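The following sketch covers Steps 1 through 4 only, under the simplifying assumption that a word's rolelist is a flat vector; failure points are omitted, as in the figures. All names here are illustrative; the SRM's actual implementation is the verbFrame::genFilledFrames method described in Appendix B.

    #include <string>
    #include <utility>
    #include <vector>

    struct RoleChoice { std::string role; int rating; };

    struct WordEntry {
        std::string word;
        std::vector<RoleChoice> rolelist;   // roles this word may still fill
    };

    // Prune already-claimed roles from the remaining words' rolelists
    // (the job ridConflicts performs in the SRM).
    static void ridConflicts(const std::vector<std::string>& used,
                             std::vector<WordEntry>& words) {
        for (WordEntry& w : words) {
            std::vector<RoleChoice> kept;
            for (const RoleChoice& rc : w.rolelist) {
                bool claimed = false;
                for (const std::string& u : used)
                    if (u == rc.role) claimed = true;
                if (!claimed) kept.push_back(rc);
            }
            w.rolelist = kept;
        }
    }

    // Steps 1-3: repeatedly move single-role words onto the mandatory
    // list and remove their roles from every other word's rolelist.
    std::vector<std::pair<std::string, RoleChoice> >
    collectMandatory(std::vector<WordEntry>& words) {
        std::vector<std::pair<std::string, RoleChoice> > mandatory;
        bool changed = true;
        while (changed) {                                // Step 3: iterate
            changed = false;
            std::vector<std::string> claimed;
            std::vector<WordEntry> remaining;
            for (WordEntry& w : words) {
                if (w.rolelist.size() == 1) {            // Step 1
                    mandatory.push_back(std::make_pair(w.word, w.rolelist[0]));
                    claimed.push_back(w.rolelist[0].role);
                    changed = true;
                } else {
                    remaining.push_back(w);
                }
            }
            words = remaining;
            ridConflicts(claimed, words);                // Step 2
        }
        // Step 4: if words is now empty, a filled case frame can be built
        // directly from the mandatory list; otherwise the recursive Part 2
        // processing of Figure 4.4 takes over for the remaining words.
        return mandatory;
    }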
The next chapter concludes this thesis with an analysis and evaluation of LAD: its design, implementation, testing results, and present resource limitations, along with a discussion of future work.

Chapter 5

CONCLUSIONS

The Language Access Database was designed to give a user access to several on-line lexical and linguistic sources in a unified manner, thus providing various software applications with a convenient source of syntactic, semantic, and other lexical information. This information will enhance the capabilities of systems which attempt to understand natural language, to generate semantically and/or syntactically correct sentences, to produce speech synthesized language, and so on. LAD was designed using the principles of the object-oriented paradigm to allow for flexibility and extensibility. This final chapter examines the goals and accomplishments of LAD, presents an analysis of the LAD system (e.g., testing and evaluations, resource weaknesses, and a comparison of output from the SRM's LAD-enriched system and the Compansion system), discusses the results, and concludes with a look into future directions for LAD.

Accomplishments

The completed LAD design was based on providing traditional dictionary functionality as well as enhanced capabilities to incorporate the semantic and syntactic needs of NLP-based systems. The current implementation encompasses enough of the LAD functionality to be tested and evaluated before a complete implementation is done. In its current form, LAD has access to the WordNet database and a verb case frame database. The Semantic Reasoning Module, used as a testing and validation system, was designed with principles developed in the Compansion system.

LAD Implementation Status

LAD has been fully designed, but not fully implemented. The reason for this is that LAD has been designed as a fully extensible system. Thus, as new applications need to use LAD, everything is in place to add access to any additional lexical resources they might require. The completed portion of LAD is sufficient for testing its usefulness in the implementation of the SRM. The functionality required for this implementation utilizes all of LAD's major design components (i.e., multiple database access, an application-specific database in the form of the verb case frame database, the mapping information required to translate data from one database's terminology to another's, and sophisticated access to WordNet's semantic information). Thus the implemented portion of LAD provides significant access to needed information that was not previously available. In addition, the testing of LAD using the SRM validates the design principles upon which LAD is based. Now that the first stage of testing is completed, the results warrant further testing by using LAD in other applications. Currently implemented are the functions needed for SRM processing; these perform the is-a, list-semantic-categories, has-a, and case frame queries. The LAD implementation will proceed to incorporate more lexical resources, as its design intended, and more of the functions as applications and lexical databases warrant.

WordNet Usage

WordNet has proven to be an excellent choice as the main lexical database due to its broad coverage of words and sense information. It has limitations in certain areas of its lexicon, the most notable being proper names. However, this will be compensated for by adding a mapping function that maps entries from the Moby Words II files of proper names of people and proper names of places to their equivalent expressions in WordNet. For instance, the mapping mechanism can be used to map male names like John to Tom (which is included in WordNet), and to map names of cities to equivalent cities (e.g., taking care to map seaports to seaports, major urban areas to other major urban areas, etc.). Moby Words II contains 21,986 names, including 4,946 commonly given female names and 3,897 commonly given male names in English speaking countries, and 10,196 places in the United States. This addition to WordNet will address one of its weaknesses, although there is still a lack of proper names of products and businesses (e.g., Ford, Chevy, IBM, duPont). This final weakness could be corrected with a flat-file database of names of companies and their products, if needed or desired.

SRM and LAD Results

The results of the SRM implementation using LAD were very supportive of further work with LAD, and with other applications that will make use of it. There were a total of 32 input test strings processed by the SRM, and the expected results [FOOTNOTE: The SRM produced a list of filled case frames which comply with its complex processing requirements (see Chapter 4, Section SRM Processing), each frame is semantically logical, and at least one of the filled case frames could be used to generate a valid sentence.] were achieved for 29 of these test strings. The three input test strings which did not work in the SRM failed due to shortcomings in the mapping function used to map the semantic categories of the SRM to the names found in WordNet. For example, WordNet does not have a definition or sense of the word bone as in a bone a dog would eat.
The SRM also uses the category name instrument where WordNet uses instrumentality for the word fork, and the SRM uses the category name time where WordNet uses season and time_of_year for the word summer. Presumably, the mappings in the mapping table (see Appendix C) could be extended to handle these cases. Some of the test strings included in the SRM test were:

    go mary tom store
    break mary window hammer
    break mary window hammer tom
    break mary hammer rock tom
    ask mary question
    hit tom mary dog
    buy mary book yesterday
    eat mary slice bread
    break mary thermostat
    utter mary sentence
    steal mary bread
    go mary restaurant yesterday
    fly mary europe
    order mary hamburger tom
    write mary paper chemistry
    go mary beijing
    eat mary hamburger fork
    remember mary swimming summer
    eat mary dog bone

For the 29 successful test strings, LAD enabled the SRM to produce complete lists of the filled case frames represented by the words of input, where each frame contains all of the words and every frame has a different pattern of words for the cases they fill. For example, consider the input string break mary window hammer. The cases that mary can fill are AGEXP (with the semantic category human and a rating of 12), GOAL (with the semantic category human and a rating of 3), and BENEF (with the semantic category human and a rating of 3). The case that window can fill is THEME (with the semantic category fragile and a rating of 12). Finally, the cases that hammer can fill are THEME (with the semantic category object and a rating of 3) and INSTR (with the semantic category tool and a rating of 6). Notice that mary can fill three different cases, while window can only fill the THEME case; hammer will only be allowed to fill the INSTR case, as the THEME case will always be filled by window. Therefore the results from the SRM for these words of input should (and do) yield the correct filled case frames (as shown in the next section). The results of the SRM implementation using LAD were very good and show that LAD has a promising future for providing various applications with semantic knowledge. The mapping function needs to be further tested and refined to account for the discrepancies that became evident during the testing session; with that caveat, LAD has proven to be a useful tool for providing semantic knowledge.

SRM Performance vs. Compansion Performance

The SRM does not have the full functionality of Compansion, because it cannot generate sentences nor handle certain exceptional cases. However, a comparison of the SRM-generated filled case frames with the sentences generated from Compansion is feasible by using the same (or equivalent) test strings.
The following is an example of the SRM testing results:

input string: break mary window hammer

The verb is break
The total filled case rating is 30
  The agexp->caseName is mary
    the caseFiller category is human with a rating of 12
  The theme->caseName is window
    the caseFiller category is fragile with a rating of 12
  The instr->caseName is hammer
    the caseFiller category is tool with a rating of 6

The verb is break
The total filled case rating is 21
  The theme->caseName is window
    the caseFiller category is fragile with a rating of 12
  The instr->caseName is hammer
    the caseFiller category is tool with a rating of 6
  The goal->caseName is mary
    the caseFiller category is human with a rating of 3

The verb is break
The total filled case rating is 21
  The theme->caseName is window
    the caseFiller category is fragile with a rating of 12
  The instr->caseName is hammer
    the caseFiller category is tool with a rating of 6
  The benef->caseName is mary
    the caseFiller category is human with a rating of 3

Compansion has a generator which generates sentences from the filled case frames its semantic parser builds. The generator may not be able to generate semantically and syntactically correct sentences from all of the frames, so the output from Compansion is only the sentences it is able to generate. The following is the Compansion output for the same test string as in the previous SRM example:

input string: mary break window hammer
output--> Mary breaks the window with the hammer.

The first filled case frame from the SRM would be used by a generator to create the same sentence as the output from Compansion above. The following two filled case frames might be used to generate the sentence "The window was broken with the hammer for Mary." It is important to note that Compansion takes word order into account, so the previous sentence could not be generated from the ordered input string mary break window hammer. What is encouraging from these results is that for input strings which do not require the extra exception processing Compansion provides, similar results were produced. When Compansion was tested on the same 32 strings that were used to test the SRM, it could successfully process 15 of the 29 strings that the SRM could process, and one of the strings that the SRM could not process (mary eat pizza fork). This difference in results is mainly due to the fact that the SRM has a larger vocabulary than Compansion, because the SRM obtains its word knowledge from LAD. While Compansion is limited to a vocabulary of just over 1,000 words, the SRM has access to the over 95,000 words in WordNet; once the proper names mapping is completed, it will have access to an additional 21,986 proper names from the Moby Words II data files. Some of the test input strings that the SRM could handle and Compansion could not were:

    break mary hammer rock tom
    break mary hammer rock
    write mary paper chemistry
    go mary beijing
    eat mary slice bread
    eat mary dinner tom
    break mary thermostat
    utter mary sentence
    steal mary bread
    corrupt mary morals
    build mary software
    design mary software
    go mary restaurant yesterday
    fly mary europe

In all of the cases above, except for the first three, Compansion's failure was due to lack of access to semantic knowledge about one or more of the words in the input string. In the first three cases, the failure may have been due to a lack of semantic knowledge or due to the generator not being able to formulate a sentence which captured the semantic representation produced by Compansion's semantic parser.
The two cases which failed for both Compansion and the SRM did so because of vocabulary limitations in Compansion and mapping failures in LAD.

Discussion

LAD has proven to be a valuable semantic tool, enabling the NLP-based semantic parser, the SRM, by providing the knowledge it requires to process a large number of telegraphic inputs. The vocabulary that LAD provides is significantly larger than that of the prototype Compansion system, and this greatly increases the variety of input strings the SRM can process. LAD now needs to be provided with access to additional databases and additional capabilities of the WordNet database so that it can be integrated with other applications. The initial results on LAD are very encouraging, and validate the need for future work.

Future Work

As mentioned previously, LAD has been fully designed, but only partially implemented. A great deal of the future work lies in completing the implementation of LAD, linking in more data sources, and implementing the undefined methods. Some specific features to be included in the future work are detailed in the following paragraphs.

Currently, LAD retrieves a verb case frame from a secondary database by default. On occasions where the verb is not represented in the secondary database, a case frame is found by first retrieving synonyms of the verb from WordNet (e.g., for crush) and then checking the secondary database for these synonyms (e.g., beat, defeat, whip, trounce, vanquish, overcome, fragment, break, separate). Presently, if that search still fails, processing is halted. However, in the future we would like to be able to generate a case frame based on the WordNet verb frame (see Appendix D). These verb frames are very basic and lack detail (e.g., Somebody ----s something), but could be used to build a minimal specified case frame from which further processing could continue.

The specified and filled case frames presently accessed by LAD are useful semantic tools. In the future it would be beneficial to have verb case frames that contain syntactic information as well, based on Beth Levin's work on diathesis alternations and verb classes. Using her verb classes, it is hoped that verb case frames can be developed that would incorporate syntactic information as well as semantic meaning. This would improve the syntactic generation of sentences from the filled case frames. The initial stages of this work are discussed in (Zickus, 1994).

A number of enhancements are being planned that will increase the ultimate utility of LAD, including a compiler that will produce a more compact version of the database based on a specification of words. This would reduce the overall memory and disk space requirements when LAD is used in a practical system. For instance, one could imagine generating a database containing just pronunciation information for some specific list of input words, or for some list of words which occur at or above a specified frequency in the Brown corpus. It is anticipated that this compiler could be used to generate a lexicon of specific information for an application that might run on a PC or AAC device with limited memory. In addition, while LAD is intended to be used primarily by programmers, it will also be necessary for non-technical people to enter new information into the system; a front-end program will be developed to facilitate this process.

BIBLIOGRAPHY
Albacete, P. L., Chang, S. K., Polese, G., and Baker, B. (1994). Iconic Language Design for People with Significant Speech and Multiple Impairments. ASSETS `94 - The First Annual ACM Conference on Assistive Technologies, 23-30.

Allen, B. P. (1994). Case-Based Reasoning: Business Applications. Communications of the ACM (Vol. 37, No. 3), 40-42.

Allen, J. (1987). Natural Language Understanding. The Benjamin/Cummings Publishing Company, Inc., Menlo Park.

Alm, N., Todman, J., Elder, L., and Newell, A. F. (1993). Computer Aided Conversation for Severely Physically Impaired Non-Speaking People. Human Factors in Computing Systems: INTERCHI`93 Conference Proceedings, ACM Press, New York, 236-241.

American Heritage Dictionary (1985). 2nd College Edition. Houghton Mifflin Company Publishers, Boston.

Anson, D. (1993). The Effect of Word Prediction on Typing Speed. The American Journal of Occupational Therapy (Vol. 47, No. 11), 1039-1042.

Archer, L. (1977). Blissymbolics: A Nonverbal Communication System. Journal of Speech and Hearing Disorders (Vol. 42), 568-579.

Baker, B. (1982). Minspeak: A Semantic Compaction System that Makes Self Expression Easier for Communicatively Disabled Individuals. Byte (Vol. 7, No. 9), 186-202.

Berwick, R. C., Jones, D., Cho, F., Kahn, Z., Kohl, K., Radhakrishnan, A., Sauerland, U., and Ulicny, B. (1994). Issues in Modern Lexical Theory: the (E)VCA Project. Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 47-61.

Budd, T. (1991). An Introduction to Object-Oriented Programming. Addison-Wesley Publishing Company, New York.

Byrd, R. J. (1989). Large Scale Cooperation on Large Scale Lexical Acquisition. Workshop on Lexical Acquisition, IJCAI, Detroit.

Chang, S. K., Costagliola, G., Orefice, S., Polese, G., and Baker, B. R. (1992). A Methodology for Iconic Sentences for Augmentative Communication. Proceedings of the 1992 IEEE Workshop on Visual Languages, 110-116.

Chang, S. K., Orefice, S., Polese, G., and Baker, B. R. (1993). Deriving the Meaning of Iconic Language Design with Application to Augmentative Communication. Proceedings of the 1993 IEEE Workshop on Visual Languages, 267-274.

Chen, Z. and Huang, H. (1994). Lexical Knowledge Organization in Machine Translation. Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 108-111.

Church, K. W. (1994). Just-In-Time Lexicons. Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 1-10.

Demasco, P. W., and McCoy, K. F. (1992). Generating Text From Compressed Input: An Intelligent Interface for People with Severe Motor Impairments. Communications of the ACM (Vol. 35, No. 5), 68-78.

Demasco, P., and Mineo, B. (1995). AAC Technology: Next Year and Next Decade. Presentation at the Pennsylvania Speech Hearing Language Association, Valley Forge, PA, March 1995.

Demasco, P., Newell, A. F., and Arnott, J. L. (1994). The Application of Spatialization and Spatial Metaphor to Augmentative and Alternative Communication. ASSETS `94 - The First Annual ACM Conference on Assistive Technologies, 31-38.

Fass, D., and Wilks, Y. (1983). Preference Semantics, Ill-Formedness, and Metaphor. American Journal of Computational Linguistics (Vol. 9, Nos. 3-4), 188-196.

Fellbaum, C. (1993). English Verbs as a Semantic Net. CSL Report 43, July 1990, Revised March 1993.

Fellbaum, C., Gross, D., and Miller, K. (1993). Adjectives in WordNet. CSL Report 43, July 1990, Revised March 1993.
Fillmore, C. J. (1977). The Case for Case Reopened. Syntax and Semantics VIII: Grammatical Relations, P. Cole and J. M. Sadock (eds.), Academic Press, 59-81.

Foulds, R. A. (1980). Communication Rates for Nonspeech Expression as a Function of Manual Tasks and Linguistic Constraints. Proceedings of International Conference on Rehabilitation Engineering - Toronto, 83-87.

Granger, R. H. (1983). The NOMAD System: Expectation-Based Detection and Correction of Errors During Understanding of Syntactically and Semantically Ill-Formed Text. American Journal of Computational Linguistics (Vol. 9, Nos. 3-4), 188-196.

Gros, J., Zganec, M., Mihelic, F., and Pavesic, N. (1994). A Lexicon for Automatic Speech Recognition and Understanding. Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 186-191.

Gu, Y. (1994). Some Theoretical Considerations on the Lexicon with Reference to the Minimalist Program. Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 62-65.

Hamilton, K. (1994). Predictive Letter Scanner for Augmentative Communication. Proceedings of the RESNA'94 Annual Conference, M. Binion (ed.), RESNA Press, Arlington, VA, 121-123.

Hendrix, G. G., Sacerdoti, E. D., Sagalowicz, D., and Slocum, J. (1978). Developing a Natural Language Interface to Complex Data. ACM Transactions on Database Systems (0362-5915), 563-584.

Hogan, C., and Levin, L. S. (1994). Data Sparseness and the Acquisition of Syntax-Semantic Mappings from Corpora. Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 153-159.

Ide, N., and Veronis, J. (1994). Machine-Readable Dictionaries: What Have We Learned, Where Do We Go? Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 137-146.

Jackendoff, R. (1978). Grammar as Evidence for Conceptual Structure. Linguistic Theory and Psychological Reality, M. Halle, J. Bresnan, and G. Miller (eds.), MIT Press, Cambridge, 201-228.

Jensen, K., Heidorn, G. E., Miller, L. A., and Ravin, Y. (1983). Parse Fitting and Prose Fixing: Getting a Hold on Ill-Formedness. American Journal of Computational Linguistics (Vol. 9, Nos. 3-4), 147-160.

John, B. E., and Morris, J. H. (1993). HCI in the School of Computer Science at Carnegie Mellon University. Human Factors in Computing Systems: INTERCHI`93 Conference Proceedings, ACM Press, New York, 49-50.

Johnson, G. J. (1994). Of Metaphor and the Difficulty of Computer Discourse. Communications of the ACM (Vol. 37, No. 12), 97-102.

Jubak, J. (1992). In the Image of the Brain: Breaking the Barrier Between the Human Mind and Intelligent Machines. Little Brown and Company, Boston.

Koester, H. H., and Levine, S. P. (1994). Validation of a Keystroke-Level Model for a Text Entry System Used by People with Disabilities. ASSETS `94 - The First Annual ACM Conference on Assistive Technologies, 115-122.

Knight, K., and Luk, S. K. (1994). Building a Large-Scale Knowledge Base for Machine Translation. Proceedings of the 12th National Conference on Artificial Intelligence, Part 1 (of 2), Seattle, WA, 773-778.

Kraat, A. W. (1990). Augmentative and Alternative Communication: Does It Have a Future in Aphasia Rehabilitation? Aphasiology (0268-7838, Vol. 4, No. 4), 321-338.

Levin, B. (1992). A Preliminary Analysis of (De)Causative Verbs in English. Workshop on the Acquisition of the Lexicon - University of Pennsylvania, January 1992, 1-36.
Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. The University of Chicago Press.

Levin, B., and Hovav, M. R. (1992). The Lexical Semantics of Verbs of Motion: the Perspective from Unaccusativity. Thematic Structure: Its Role in Grammar, I. M. Roca (ed.), Foris Publications, New York, 247-269.

Levin, B., and Pinker, S. (1992). Introduction. Lexical & Conceptual Semantics, B. Levin and S. Pinker (eds.), Blackwell Publishers, Cambridge, 1-6.

Light, J. (1988). Augmentative Communication: State of the Art in North America. ICAART88 - Montreal, 536-540.

Macleod, C., Grishman, R., and Meyers, A. (1994). Developing Multiply Tagged Corpora for Lexical Research. Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 11-22.

Magnusson, J. H., and Briem, S. (1994). Application of Isbliss and Blissgrammar Symbolic Processing Programs. Contribution to the International Conference Beyond Normalization Towards One Society for All, Reykjavik, Iceland, 1-11.

McCoy, K. F., and Demasco, P. W. (1995). Some Applications of Natural Language Processing to the Field of Augmentative and Alternative Communication. The Fourteenth International Joint Conference on Artificial Intelligence, IJCAI 95, Montreal, August 1995. Submitted.

McCoy, K. F., Demasco, P. W., Jones, M. A., Pennington, C. A., Vanderheyden, P. B., and Zickus, W. M. (1994A). A Communication Tool for People with Disabilities: Lexical Semantics for Filling in the Pieces. ASSETS `94 - The First Annual ACM Conference on Assistive Technologies, 107-114.

McCoy, K. F., McKnitt, W. M., Peischl, D. M., Pennington, C. A., Vanderheyden, P. B., and Demasco, P. W. (1994B). AAC-User Therapist Interactions: Preliminary Linguistic Observations and Implications for Compansion. Proceedings of the RESNA `94 Annual Conference, M. Binion (ed.), RESNA Press, Arlington, VA, 129-131.

McHale, M. L., and Crowter, J. J. (1994). Constructing A Lexicon From A Machine Readable Dictionary. Army Rome Laboratory Technical Report #RL-TR-94-178, Rome Laboratory, Griffiss AFB, New York.

McKevitt, P., and Guo, C. (1994). From Chinese Rooms to Irish Rooms: Perspectives on Language and Perception. Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 160-173.

Miller, G. A. (1993). Nouns in WordNet: A Lexical Inheritance System. CSL Report 43, July 1990, Revised March 1993.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. (1993). Introduction to WordNet: An On-Line Lexical Database. CSL Report 43, July 1990, Revised March 1993.

Miller, G. A., and Fellbaum, C. (1992). Semantic Networks of English. Lexical & Conceptual Semantics, B. Levin and S. Pinker (eds.), Blackwell Publishers, Cambridge, 197-229.

Mineo, B., Demasco, P., Gray, J., and Bender, R. (1994). Systematic Assessment of Picture-Based Language Performance Via Computer. Proceedings of the 1994 ISAAC Conference, Maastricht, The Netherlands: IRV, 111-113.

Morris, C., Newell, A., Booth, L., Ricketts, I., and Arnott, J. (1992). Syntax PAL: A System to Improve the Written Syntax of Language-Impaired Users. Assistive Technology (Vol. 4, No. 2), RESNA Press, Arlington, VA, 51-59.

Newell, A. F. (1987). How Can We Develop Better Communication Aids? AAC Augmentative and Alternative Communication (0743-4618/87/0301-0036), 36-40.

Palmer, M. (1990). Customizing Verb Definitions for Specific Semantic Domains. Machine Translation 5, Kluwer Academic Publishers, 5-30.
Palmer, M., and Polguere, A. (1994). A Lexical and Conceptual Analysis of BREAK: A Computational Perspective. Computational Lexical Semantics, P. Saint-Dizier and E. Viegas (eds.), Cambridge University Press, to appear in 1994.

Pustejovsky, J. (1991). The Generative Lexicon. Computational Linguistics (Vol. 17, No. 4), 409-441.

Rich, E., and Knight, K. (1991). Artificial Intelligence (second edition). McGraw-Hill, Inc., New York.

Shneiderman, B. (1993). Designing the User Interface: Strategies for Effective Human-Computer Interaction (second edition). Addison-Wesley Publishing Company, New York.

Small, S., and Rieger, C. (1982). Parsing and Comprehending with Word Experts (A Theory and its Realization). Strategies for Natural Language Processing, W. G. Lehnert and M. H. Ringle (eds.), Lawrence Erlbaum Associates Publishers, Hillsdale, NJ, 89-147.

Stum, G. M., and Demasco, P. (1992). Flexible Abbreviation Expansion. Proceedings of the RESNA International `92 Conference, J. J. Presperin (ed.), Washington, D.C., RESNA Press, Arlington, VA, 371-373.

Suri, L. Z., and McCoy, K. F. (1993). Correcting Discourse-Level Errors in a CALL System for Second Language Learners. Technical Report 94-02, Department of Computer and Information Sciences, University of Delaware, Newark, DE.

Sutcliffe, R. F. E., O'Sullivan, D., Sharkey, N. E., Vossen, P., Slator, B. E. A., McElligott, A., and Bennis, L. (1994). A Psychometric Performance Metric for Semantic Lexicons. Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 75-88.

Tin, E., and Akman, V. (1994). Computational Situation Theory. Sigart Bulletin (Vol. 5, No. 4), ACM Press, New York, 4-18.

Vanderheiden, G. C. (1984A). Augmentative Communication: Trends and Priorities in Research and Delivery. Proceedings of the 2nd International Conference on Rehabilitation Engineering - Ottawa, 23-26.

Vanderheiden, G. C. (1984B). High and Low Technology Approaches in the Development of Communication Systems for Severely Physically Handicapped Persons. Exceptional Education Quarterly (Vol. 4, No. 4), 40-56.

Vanderheyden, P. B., Pennington, C. A., Peischl, D. M., McKnitt, W. M., McCoy, K. F., Demasco, P. W., van Balkom, H., and Kamphuis, H. (1994). Developing AAC Systems that Model Intelligent Partner Interactions: Methodological Considerations. Proceedings of the RESNA `94 Annual Conference, M. Binion (ed.), RESNA Press, Arlington, VA, 126-128.

VanDyke, J. A. (1991). Word Prediction for Disabled Users: Applying Natural Language Processing to Enhance Communication. Thesis for Honors Bachelor of Arts in Cognitive Studies, University of Delaware, Newark, DE, 1991.

Velardi, P., Fasolo, M., and Pazienza, M. T. (1991). How to Encode Semantic Knowledge: A Method for Meaning Representation and Computer-Aided Acquisition. Computational Linguistics (Vol. 17, No. 2), 153-170.

Venkatagiri, H. S. (1993). Efficiency of Lexical Prediction as a Communication Acceleration Technique. Augmentative and Alternative Communication (Vol. 9, September), 161-167.

Vizetelly, F. H. (1915). The Development of the Dictionary of the English Language. Funk and Wagnalls Publishers, New York.

von Tetzchner, S. (1988). Aided Communication for Handicapped Children. Ergonomics in Rehabilitation, Mital & Karwowski (eds.), Taylor and Francis Ltd. Publishers, 233-252.

Waller, A., Broumley, L., and Newell, A. F. (1992). Incorporating Conversational Narratives in an AAC Device. Presented at ISAAC-92.
Abstract appears in Augmentative and Alternative Communication, 8.

Weischedel, R. M., and Sondheimer, N. K. (1983). Meta-Rules as a Basis for Processing Ill-Formed Input. American Journal of Computational Linguistics (Vol. 9, Nos. 3-4), 161-177.

Wilks, Y. (1975). An Intelligent Analyzer and Understander of English. Communications of the ACM (Vol. 18, No. 5), 264-274.

Winograd, T. (1983). Language as a Cognitive Process. Volume 1: Syntax. Addison-Wesley Publishing Company, Reading, MA.

Zernik, U. (1989). Lexicon Acquisition: Learning from Corpus by Capitalizing on Lexical Categories. DARPA Speech and Natural Language Workshop, February 1989, Philadelphia, 1556-1562.

Zickus, W. M. (1994). A Comparative Analysis of Beth Levin's English Verb Class Alternations and WordNet's Senses for the Verb Classes HIT, TOUCH, BREAK and CUT. Proceedings of the Post-COLING94 International Workshop on Directions of Lexical Research, Tsinghua University, Beijing, 66-74.

Zickus, W. M., McCoy, K. F., Demasco, P. W., and Pennington, C. A. (1995). A Lexical Database for Intelligent AAC Systems. RESNA 95, Vancouver, June 1995. To appear in 1995 proceedings.

Appendix A

LAD SOURCE DESCRIPTION

Base Class: lad

The lad class is a NULL base class. All the classes that will access different lexical and linguistic sources will be derived from the lad class. In this manner there will be a minimum of code changes necessary in order to increase LAD's usefulness. All of the functions are virtual functions that will be overridden in the derived classes. When new classes are added to LAD, they will be publicly derived from lad. If a new class has functions not already part of the LAD design, these new methods will need to be added to the list of virtual functions in the lad class, their overriding definitions will be placed in the derived class for which they are useful, and alterations will need to be made to main.cc (the driver for the LAD module) in order to enable access to the new class methods. Any matters concerning control will be handled either within the methods themselves or in the driver. A sketch of this NULL base class pattern is given below.
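The following is a minimal sketch of the pattern, assuming simplified types; Table A.1 gives the full interface.

    class valList;     // list type used throughout LAD
    class verbFrame;   // specified case frame type (see Appendix B)

    class lad {
    public:
        lad() {}                 // NULL base class: nothing to construct
        virtual ~lad() {}        // ... and nothing to destroy
        // Every query returns 0 here; each is overridden in whichever
        // derived class (wordnet, SecDB) can actually answer it.
        virtual valList*   listCategories(char*, char*) { return 0; }
        virtual int        isan(char*, char*)           { return 0; }
        virtual int        hasa(char*, char*)           { return 0; }
        virtual verbFrame* getVerbFrame(char*)          { return 0; }
        // ... remaining virtual functions as listed in Table A.1
    };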
Table A.1 lists the lad functions and specifies the allowable arguments for these functions. The table is followed by a description of the functions and the actions they provide.

Table A.1: lad Class Function Names

    Function                              Arguments
    lad()
    ~lad()
    virtual valList* listCategories       char* word, char* typesearch
    virtual int isWordRole                char* word, char* wordpos, char* typesearch, char* searchrole
    virtual int isInSD                    char* word
    virtual verbFrame* getVerbFrame       char* word
    virtual valList* getWNVerbFrame       char* word
    virtual verbFrame* makeVerbFrame      char* word
    virtual verbFrame* getVerbFrameSyn    char* word, int flag
    virtual valList* getVerbSynonym       char* word
    virtual valList* getAllVerbSynonym    char* word
    virtual int isan                      char* word, char* parent
    virtual int isav                      char* word, char* parent
    virtual int hasa                      char* word, char* part
    virtual char* mappings                char* key

All functions are virtual in the base class lad, except for the constructor and destructor. See below for detailed descriptions:

lad::lad()
Since this is a NULL base class, this constructor does nothing.

lad::~lad()
This destructor destroys nothing, since this is a NULL base class.

virtual valList* lad::listCategories(char* word, char* typesearch)
This function takes as arguments word, the word being searched for, and a char* representing the type of search to be done, typesearch. This function returns 0 in the lad class; it is further defined in the derived wordnet class.

virtual int lad::isWordRole(char* word, char* wordpos, char* typesearch, char* searchrole)
This function takes as arguments a word with its wordpos (NOUN, VERB, etc.), a typesearch to indicate the kind of search being done (hypen/hypev for an "isa" search, partn for a "hasa" search, and coorn for a "coordinate term" search), and searchrole (the role name we are looking to see if word has in its hierarchy). This function returns 0 in the NULL base class lad; it is further defined in the derived wordnet class.

virtual int lad::isInSD(char* word)
This function takes as an argument word, and checks to see if it is in the secondary database containing verb case frames. This function returns 0 in the NULL base class lad; it is further defined in the derived SecDB class.

virtual verbFrame* lad::getVerbFrame(char* entry)
This function takes an argument, entry, and returns 0, since it is further specified in the derived SecDB class.

virtual valList* lad::getWNVerbFrame(char* entry)
This function takes a verb and searches the WordNet VERB database for that verb. It returns 0 in the NULL base class; it is further defined in the derived wordnet class.

virtual verbFrame* lad::makeVerbFrame(char* entry)
This function takes a string (which is a WordNet VERB frame, see Appendix D) and returns 0. This function is defined in the derived wordnet class.

virtual verbFrame* lad::getVerbFrameSyn(char* entry, int flag)
This function takes a verb as its argument and returns 0, as it is further specified in the derived wordnet class.

virtual valList* lad::getVerbSynonym(char* entry)
This function takes a verb as its argument and returns 0, as it is further specified in the derived wordnet class.

virtual valList* lad::getAllVerbSynonym(char* entry)
This function takes a verb as its argument and returns 0, as it is further specified in the derived wordnet class.

virtual int lad::isan(char* word, char* parent)
This function takes a word and its parent-term as arguments and returns 0 in this base class; it is defined in the derived wordnet class.

virtual int lad::isav(char* word, char* parent)
This function takes a word and its parent-term as arguments and returns 0 in this base class; it is defined in the derived wordnet class.

virtual int lad::hasa(char* word, char* part)
This function takes a word and its part as arguments and returns 0 in this base class; it is defined in the derived wordnet class.

virtual char* lad::mappings(char* key)
This function takes a word as its argument and returns 0 in this base class; it is defined in the derived wordnet class.

Public Derived Class: wordnet : public lad class

The wordnet class handles all the interactions with the WordNet library functions and databases. Every function in this class extracts the information needed for the LAD methods that WordNet can provide (e.g., hierarchical, sense, synonym, and coordinate term data). This class is not complete, as there is more information that can be extracted from WordNet than is currently being done (see Chapter 3, Functional Interface section). A list of the functions implemented in the wordnet class is shown in Table A.2.
Table A.2: wordnet Class Function Names

    Function                      Arguments
    wordnet()
    ~wordnet()
    valList* listCategories       char* word, char* typesearch
    int isWordRole                char* word, char* wordpos, char* typesearch, char* searchrole
    valList* getWNVerbFrame       char* word
    verbFrame* makeVerbFrame      char* word
    verbFrame* getVerbFrameSyn    char* word, int flag
    valList* getVerbSynonym       char* word
    valList* getAllVerbSynonym    char* word
    int isan                      char* word, char* parentTerm
    int isav                      char* word, char* parentTerm
    int hasa                      char* word, char* part
    char* mappings                char* key

The following are function descriptions of the wordnet class:

wordnet::wordnet()
Since this class has no private members, the constructor does nothing.

wordnet::~wordnet()
This destructor destroys nothing, since this class has no private members.

valList* wordnet::listCategories(char* word, char* typesearch)
This function takes in a key, word, and a string representing the type of search to be done, typesearch, as its arguments. It searches the WordNet NOUN database for word and then checks its isa hierarchical links, looking to see if word can fill the semantic categories used in our semantic reasoning. This function calls the mappings function to enable a mapping between the semantic categories used by the SRM and Compansion systems and the category names used by WordNet. It returns a list of the possible semantic categories word can play. For example, listCategories(hammer, hypen) returns the list [object, inanimate, tool].

int wordnet::isWordRole(char* word, char* wordpos, char* typesearch, char* searchrole)
This function takes as arguments a word, its wordpos (NOUN, VERB, etc.), a typesearch to indicate the kind of search being done (hypen/hypev for an isa search, partn for a hasa search, and coorn for a coordinate term search), and the object of the search, searchrole. For example, isWordRole(window, NOUN, hypen, object) searches the WordNet NOUN database using the hypen pointer to determine if a window isa object. The output from this function is an integer, 0 for false and 1 for true. In the example shown, the output is 1 for true.

valList* wordnet::getWNVerbFrame(char* entry)
This function takes a verb, entry, and searches the WordNet VERB database for that verb. It will locate the verb frames for that verb and then call the makeVerbFrame function to convert each WordNet verb frame to an appropriate specified case frame. It returns a list of the possible specified case frames for entry.

verbFrame* wordnet::makeVerbFrame(char* entry)
This function takes a string (which is a WordNet verb frame, see Appendix D) and converts it into a specified case frame. It is called by getWNVerbFrame and returns a specified case frame for entry.

verbFrame* wordnet::getVerbFrameSyn(char* entry)
This function takes a verb, entry, that is not in the secondary database, and goes to WordNet searching for synonyms of entry. It then searches the secondary database for a specified case frame using the list of synonym terms. It is called by getVerbFrame and will return a specified case frame, or 0 if no frames are found for any of the verb's synonyms.

valList* wordnet::getVerbSynonym(char* entry)
This function takes a verb, entry, and searches the WordNet VERB database for that verb. It returns a list of synonyms for entry, or it will return 0. This function is called by the getVerbFrameSyn function.

valList* wordnet::getAllVerbSynonym(char* entry)
This function takes a verb, entry, and searches the WordNet VERB database for that verb. It returns a list of synonyms for that verb and that verb's superordinates, or it will return 0. This function is called by the getVerbFrameSyn function.

int wordnet::isan(char* word, char* parent)
This function takes a noun, word, and a parent-term, parent, as input. It searches WordNet's NOUN database for word, and then looks to see if the parent-term is in the hierarchy for word. If it finds parent, it returns 1 for true; otherwise it returns 0 for false.

int wordnet::isav(char* word, char* parent)
This function takes a verb, word, and a parent-term, parent, as input. It searches WordNet's VERB database for word, and then looks to see if the parent-term is in the hierarchy for word. If it finds parent, it returns 1 for true; otherwise it returns 0 for false.

int wordnet::hasa(char* word, char* part)
This function takes a noun, word, and a part-term, part, as input. It searches WordNet's NOUN database for word, and then looks to see if the part-term is in the attribute list for word. If it finds part, it returns 1 for true; otherwise it returns 0 for false.

char* wordnet::mappings(char* key)
This function is called by the listCategories function while it is parsing the hierarchical lexical links of a word. As each word of a link is parsed, it is passed as the argument to the mappings function, so it can be compared to entries in the two-dimensional translation array (see Appendix C). If key is located in the first column of the array, the entry in the second column of trans at the same index is returned. In this manner, we are able to map the semantic categories used by the SRM or Compansion to the categories used by WordNet. An illustrative sketch of this lookup is given below.
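As an illustration only, the lookup over the trans array might look like the following sketch; the entries shown are a small subset of the full table in Appendix C, and the signatures here are simplified assumptions.

    #include <cstring>

    static const char* trans[][2] = {
        // WordNet category      Compansion/SRM category
        { "inanimate_object",    "inanimate" },
        { "living_thing",        "animate"   },
        { "beverage",            "drink"     },
        { "location",            "place"     },
    };

    const char* mappings(const char* key) {
        for (const auto& row : trans)
            if (std::strcmp(row[0], key) == 0)
                return row[1];     // translated category name
        return 0;                  // no mapping for this key
    }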
Public Derived Class: SecDB : public lad class

Presently the functions in this class are used to obtain verb case frame information. Table A.3 lists the functions implemented for this class and the valid arguments for these functions.

Table A.3: SecDB Class Function Names

    Function                   Arguments
    SecDB()
    ~SecDB()
    verbFrame* getVerbFrame    char* word
    int isInSD                 char* entry

The following are function descriptions of the SecDB class:

SecDB::SecDB()
Since this class has no private members, the constructor does nothing.

SecDB::~SecDB()
This destructor destroys nothing, since this class has no private members.

verbFrame* SecDB::getVerbFrame(char* word)
This function searches for a case frame for a given verb, word, in the secondary database. The path of control for this search is to first check in the secondary database for word by calling the isInSD function; if it is there, then the associated case frame is returned. If no case frame is found, then it searches WordNet for synonyms of word by calling the getVerbFrameSyn function. This wordnet class function traverses the list of synonyms (checking the secondary database for a synonymous entry); if it finds an entry, it returns the associated case frame. If no case frame is found, then it searches WordNet for synonyms of the parent of word and again checks the secondary database for entries. If getVerbFrameSyn finds no case frame, then it searches WordNet for word and generates a list of possible case frames from the WordNet verb frames by calling the getWNVerbFrame function. The output from this function is a list of case frames for word. This path of control is sketched below.

int SecDB::isInSD(char* entry)
This function is used in the getVerbFrame function. It searches the secondary database to see if a word is entered there; if it is, it returns 1 for true, otherwise it returns 0 for false.
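The path of control just described can be summarized in the following sketch. The stub classes and the lookupFrame and firstFrameOf helpers are hypothetical stand-ins, not LAD functions; only the control flow mirrors the description above.

    class valList;
    class verbFrame;

    class wordnet {                        // see Table A.2
    public:
        verbFrame* getVerbFrameSyn(char*, int) { return 0; }  // stub
        valList*   getWNVerbFrame(char*)       { return 0; }  // stub
    };

    class SecDB {                          // see Table A.3
    public:
        int        isInSD(char*)      { return 0; }   // stub
        verbFrame* lookupFrame(char*) { return 0; }   // hypothetical lookup
    };

    verbFrame* firstFrameOf(valList*) { return 0; }   // hypothetical helper

    verbFrame* getVerbFrameFor(SecDB& sec, wordnet& wn, char* word) {
        // 1. The verb itself may have a case frame in the secondary database.
        if (sec.isInSD(word))
            return sec.lookupFrame(word);
        // 2. Otherwise look for a frame under the verb's WordNet synonyms
        //    (and, failing that, synonyms of the verb's parent).
        if (verbFrame* frame = wn.getVerbFrameSyn(word, 0))
            return frame;
        // 3. Finally, generate candidate frames from WordNet's verb frame
        //    data (Appendix D) and take the first for simplicity.
        return firstFrameOf(wn.getWNVerbFrame(word));
    }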
Appendix B

SRM SOURCE DESCRIPTION

Base Class: filledCaseFrame : public val class

The filledCaseFrame class defines the members and methods used to create filled case frames. These case frames are used by the SRM to provide the user with a list of frames that show all of the possible generations from the words of input. This class uses a struct, f_case, that contains a char* caseName, a char* caseFiller, and an int caseRating, thus allowing the SRM to record the word of input that fills a case (caseName), the semantic category under which the word fills it (caseFiller), and the rating given to the placement of that word into that case (caseRating). (E.g., for the verb break, Mary would be given a rating of 12 for the role of agexp and a rating of 3 for filling either the goal or benef roles.) A sketch of this struct is given below.
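As a concrete illustration, f_case amounts to the following struct. The member names follow the thesis; the example values come from the break mary window hammer frames, and the comments are interpretive.

    struct f_case {
        char* caseName;    // word of input filling the case, e.g., "mary"
        char* caseFiller;  // semantic category it fills the case as, e.g., "human"
        int   caseRating;  // rating for this placement, e.g., 12
    };

    // mary filling AGEXP as a human with rating 12:
    f_case agexp = { (char*)"mary", (char*)"human", 12 };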
Table B.1 provides a list of the private members of the filledCaseFrame class (e.g., the verb, the total frame rating, and the eight f_case roles provided in every frame), the methods used to manipulate the data stored in these members, and the arguments that are valid for the methods. The filledCaseFrame class is derived from the val class (a class implemented to handle integer and string values in various forms, e.g., in registers, in lists, and alone) in order to utilize the class methods that enable making lists of filledCaseFrame objects. In this manner, we have profited from the object-oriented advantage of re-usable code.

Table B.1: filledCaseFrame Class

    Private Members         Functions                   Function Arguments
    char* caseVerb          filledCaseFrame
    int filledCaseRating    filledCaseFrame             char* name
    f_case* agexp           ~filledCaseFrame
    f_case* theme           int isSet                   char* role
    f_case* instr           int getType
    f_case* goal            f_case* findCase            char* role
    f_case* benef           void setName                char* role, char* name
    f_case* loc             void setRating              char* role, int rate
    f_case* timee           void setFiller              char* role, char* entry
    f_case* tense           void setVerb                char* newverb
                            void setFilledCaseRating
                            void fillCases              char* role, int rate, char* cat, char* newword, char* newverb
                            void print                  ostream &os

The following function descriptions give detailed information about the arguments, actions, and output they provide.

filledCaseFrame::filledCaseFrame()
This constructor instantiates the f_case* members with 0 (NULL) for their caseName, caseFiller and caseRating items, the caseVerb to 0 (NULL), and the filledCaseRating to 0.

filledCaseFrame::filledCaseFrame(char* name)
This constructor instantiates the f_case* members with 0 (NULL) for their caseName, caseFiller and caseRating items, the caseVerb to name, and the filledCaseRating to 0.

filledCaseFrame::~filledCaseFrame()
This destructor deletes the f_case* members and returns the caseVerb and filledCaseRating to 0 (NULL).

int filledCaseFrame::isSet(char* role)
This function finds the specific case designated by role, and returns 1 if it has already been filled or else returns 0. This function is used to ensure that once a case role has been filled by a word, another word does not overwrite the data. It is called by the setName, setRating and setFiller methods.

int filledCaseFrame::getType()
This function is inherited from the val class. It is used internally by other functions to enable the system to get type information from an object.

f_case* filledCaseFrame::findCase(char* role)
This function returns an f_case* pointer to the case specified by role. This function is used by setName, setRating, and setFiller to locate the proper case before filling it with the proper data.

void filledCaseFrame::setName(char* role, char* name)
This function gets a pointer to the proper case by calling findCase(role), then initializes this case's caseName with name.

void filledCaseFrame::setRating(char* role, int rating)
This function gets a pointer to the proper case by calling findCase(role), then initializes this case's caseRating with rating.

void filledCaseFrame::setFiller(char* role, char* entry)
This function gets a pointer to the proper case by calling findCase(role), then initializes this case's caseFiller with entry.

void filledCaseFrame::setFilledCaseRating()
This function totals up the caseRating integers for all of the cases in the filled case frame and then initializes the filledCaseRating member with this total.

void filledCaseFrame::setVerb(char* newverb)
This function initializes the caseVerb member of the filled case frame with newverb.

void filledCaseFrame::fillCases(char* role, int rate, char* cat, char* newword, char* newverb)
This function calls setVerb(newverb), setRating(role, rate), setFiller(role, cat), and setName(role, newword) to initialize a case within a filled case frame.

void filledCaseFrame::print(ostream &os)
This function prints out the contents of a filled case frame.

Base Class: verbFrame

The verbFrame class contains the members and methods needed to represent and manipulate specified case frames. The individual cases within the frame are made up of slot* structures which contain a string caseName, to indicate the role name associated with that slot* case; an integer toFillPref, to provide the semantic preference placed on that slot* case; and a valList* fillWith, to provide the list of semantic categories that can fill that case and the ratings associated with each category on this list. The specified case frames are the backbone of the semantic reasoning for the SRM. They provide the semantic preference information, as well as details as to what kinds of words can fill a specific role in a sentence. There are twenty different types of verb specified case frames, represented by the twenty classes derived from the verbFrame class. They differ in the fillWith and toFillPref members of their slot* roles, and provide semantic reasoning for a variety of different kinds of verbs. These twenty derived classes are: relational, attributive, verbal, written, oral, material, erg_do (ergative do verbs), ani_do (animate do verbs), peop_do (people do verbs), ingest, drink, eat, inhale, PA_trans, mental, cognitive, sensory, tactile, visual and auditory. Their details are not included in this Appendix, since they differ from the verbFrame class only in the contents of the two previously mentioned slot* members and inherit their functions from the verbFrame class. Table B.2 contains a list of the private members, the functions, and the valid arguments which make up the verbFrame class.
Table B.2: verbFrame Class

    Private Members    Functions                        Function Arguments
    char* caseVerb     verbFrame
    slot* agexp        verbFrame                        char* name
    slot* theme        ~verbFrame
    slot* instr        valList* genFilledFrames         valList* values, char* verb, valList* mand, valList* used
    slot* goal         filledCaseFrame* filledFrames    valList* mand, char* verb
    slot* benef        valList* genMultFrames           valList* tmp2list, valList* mandroles, char* verb
    slot* loc          int complexList                  valList* mult, valList* role, valList* mandRoles, char* verb, valList* retList, filledCaseFrame* frame, int recurFlag, int wdct, int recurctr
    slot* timee        void removeCat                   valList* rolelist
    slot* tense        void ridConflicts                valList* used, valList* checking
                       slot* get_frame                  char* role
                       int isType
                       void set_role                    char* entry
                       char* get_role                   char* role
                       void setToFillPref               char* role, int pref
                       void setFillWithPref             char* role, valList* list
                       int getToFillPref                char* role
                       valList* getFillWithPref         char* role
                       int getRegIntValue               valReg* item
                       int getCatRating                 char* role, char* category
                       int numHighRating                char* role
                       int getMaxRating                 char* role
                       valList* findAllRoles            char* word
                       valList* findBestRoleNRate       char* word
                       int findBestRate                 char* word
                       valList* findBestRole            char* word
                       valList* getMaxCatStr            char* role
                       int getMaxFillWithPref           char* role, char* word
                       char* getStrAssocMaxPref         char* role, char* word
                       void print_frame

The following descriptions give detailed information about the arguments, actions, and output of the functions listed in Table B.2:

verbFrame::verbFrame()
This constructor instantiates the eight slot* members with 0 (NULL) for their toFillPref and fillWith items, their slot* member caseName to "agexp", "theme", "instr", "goal", "benef", "loc", "timee", and "tense" (respectively), and the caseVerb to 0 (NULL).

verbFrame::verbFrame(char* name)
This constructor instantiates the eight slot* members with 0 (NULL) for their toFillPref and fillWith items, their slot* member caseName to "agexp", "theme", "instr", "goal", "benef", "loc", "timee", and "tense" (respectively), and the caseVerb to name.

verbFrame::~verbFrame()
This destructor deletes the slot* members and returns the caseVerb to 0 (NULL).

valList* verbFrame::genFilledFrames(valList* values, char* verb, valList* mand, valList* used)
This function is responsible for generating the filledCaseFrame* objects that are the final output of the SRM. The algorithm it follows is: 1) check for all words that can only fill one role from the values list and put them on the mand list; 2) call ridConflicts to eliminate the roles used in Step 1 from the remaining words' rolelists; 3) go back to Step 1 until there are no conflicting roles left on the rolelists of the remaining words; and 4) if there are no words left with multiple roles they can fill, then all the words are on the mandroles list, so call filledFrames to generate the filledCaseFrame* objects from the mandroles list; otherwise call genMultFrames to handle the recursion required for further processing.

filledCaseFrame* verbFrame::filledFrames(valList* mand, char* verb)
This function takes a list of words that can only fill one case, and generates and returns the filledCaseFrame* for this list.

valList* verbFrame::genMultFrames(valList* tmp2list, valList* mandroles, char* verb)
This function takes as arguments a list of words that can play multiple roles (with their lists of roles), a list of roles that are already filled because one or more words can only fill one role, and the verb of input.
It checks whether tmp2list is a one-element list; if it is, then the function simply generates filled frames for each role in the one-element list, making sure there are no conflicts (i.e., checking to ensure that these roles are not already filled by the mandroles list elements), and adds the categories that are in the mandroles list. If tmp2list has more than one element, then more complex reasoning needs to be done, so the function calls complexList to handle the intricacies of finding all combinations of all of the roles on all of the lists on tmp2list without any conflicting roles occurring.

int verbFrame::complexList(valList* mult, valList* role, valList* mandRoles, char* verb, valList* retList, filledCaseFrame* frame, int recurFlag, int wdct, int recurctr)
This function has the job of finding all combinations of cases a word can play in a filled case frame when there is a list of words, mult, that can each fill multiple cases. It must make sure that no two words fill the same case (a case conflict) and that all the words are used in every filled frame, while ensuring that the cases on mandRoles are not filled, since they are already filled by words that can only fill one role and thus must fill these roles. This function is called recursively (that is, it is called within the processing of a call to itself), which is why the variables recurFlag, wdct, and recurctr are required. The final output, retList, is a list of filledCaseFrame objects.

void verbFrame::removeCat(valList* rolelist)
This function removes a category from a list, rolelist. It is used in the ridConflicts function to remove categories from lists if they are mandatorily filled by words that can only fill one role.

void verbFrame::ridConflicts(valList* used, valList* checking)
This function searches the checking list for rolelists with entries that match entries of the used list. If matches are found, they are removed from the rolelists. The used list is a list of roles that must be maintained, since they are the only possible roles their associated words can fill.

slot* verbFrame::get_frame(char* role)
This function returns a slot* pointer to the case corresponding to role (e.g., get_frame(agexp) will return a pointer to the slot* agexp).

int verbFrame::isType()
This function returns the integer value associated with VERB; it is used to find the type of a verbFrame* object. This is needed because there are twenty types of specified verb frames derived from this class, each one having its own enumerated value.

void verbFrame::set_role(char* entry)
This function sets the caseVerb to entry.

char* verbFrame::get_role(char* role)
This function returns the caseName of the case corresponding to role.

void verbFrame::setToFillPref(char* role, int pref)
This function locates the case associated with role and sets its toFillPref to pref. For instance, setToFillPref("agexp", 4) will find the slot* case for agexp and set its toFillPref to 4.

void verbFrame::setFillWithPref(char* role, valList* list)
This function locates the slot* case corresponding to role and sets its fillWith to list. For instance, setFillWithPref("theme", category_list) will find the slot* case for theme and initialize its fillWith value to category_list.

int verbFrame::getToFillPref(char* role)
This function locates the slot* case that corresponds to role and returns its toFillPref integer value.
valList* verbFrame::getFillWithPref(char* role)
This function locates the slot* case that corresponds to role and returns its fillWith valList* value.

int verbFrame::getRegIntValue(valReg* item)
This function extracts the integer value from a valReg* item. There are functions in the valReg* class that extract the register contents, but they return a val* item; for this class we need the value returned as an integer, not a valInt* item.

int verbFrame::getCatRating(char* role, char* category)
This function locates the slot* case corresponding to role and searches its fillWith list for an entry that matches category. It returns the integer value associated with category if category was found; otherwise it returns 0.

int verbFrame::numHighRating(char* role)
This function locates the slot* case corresponding to role, searches its fillWith list for all entries whose rating equals the maximum rating found by a call to getMaxRating, and returns the total number of such entries.

int verbFrame::getMaxRating(char* role)
This function locates the slot* case corresponding to role, searches its fillWith list for the entry with the highest rating, and returns this maximum rating.

valList* verbFrame::findAllRoles(char* word)
This function searches the WordNet NOUN database for word. Its goal is to locate all of the semantic categories of word that match the semantic category roles in the fillWith lists associated with the eight cases of the verbFrame* object that calls it, to place these in the valList* return_list, and then to return this list. It must ensure that a specific semantic category is not added to return_list more than once (since a category could appear in more than one of the eight fillWith lists). The return_list is a list of valReg* registers containing the semantic category name and the integer rating associated with that name.

valList* verbFrame::findBestRoleNRate(char* word)
This function has not been implemented; it is intended to find the best role and rating for a word.

int verbFrame::findBestRate(char* word)
This function calls findAllRoles to find all the roles word can fill, along with the integer ratings associated with these roles, and returns the maximum of these ratings.

valList* verbFrame::findBestRole(char* word)
This function calls findBestRate to determine the maximum rating word can provide, then locates all the roles word can fill that have this rating and places them on a list. The output is a list of the maximum-rated roles a word can fill in a case frame.

valList* verbFrame::getMaxCatStr(char* role)
This function calls getMaxRating(role) and numHighRating(role) in order to return a list of the categories that share the same maximum rating for a specific case in a verb frame.

int verbFrame::getMaxFillWithPref(char* role, char* word)
This function calls getMaxRating(role) for the specific case corresponding to role and returns this maximum integer value.

char* verbFrame::getStrAssocMaxPref(char* role, char* word)
This function has not been implemented; it is intended to return the word associated with the maximum rating.

void verbFrame::print_frame()
This function prints out the contents of a verbFrame* specified case frame.

Appendix C

SEMANTIC CATEGORIES

One of the major projects of the ASEL NLP lab, Compansion, has its verb hierarchy based on a systemic grammar that utilizes semantic categories to capture semantic preference ratings.
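The chain of findAllRoles, findBestRate, and findBestRole described above amounts to a max-then-filter pass over (role, rating) pairs. Here is a minimal self-contained sketch of that composition; a vector of pairs stands in for the valList* of valReg* registers that findAllRoles actually returns, so the types and names are assumptions for illustration.

    #include <string>
    #include <utility>
    #include <vector>

    // A (role, rating) pair list stands in for the valList* of valReg*
    // registers returned by findAllRoles.
    typedef std::vector<std::pair<std::string, int> > RoleRatings;

    // Sketch of findBestRate: the maximum rating over every role the word
    // can fill (0 if it can fill none).
    int findBestRate(const RoleRatings& allRoles)
    {
        int best = 0;
        for (size_t i = 0; i < allRoles.size(); ++i)
            if (allRoles[i].second > best)
                best = allRoles[i].second;
        return best;
    }

    // Sketch of findBestRole: every role whose rating equals that maximum.
    std::vector<std::string> findBestRole(const RoleRatings& allRoles)
    {
        int best = findBestRate(allRoles);
        std::vector<std::string> bestRoles;
        for (size_t i = 0; i < allRoles.size(); ++i)
            if (allRoles[i].second == best)
                bestRoles.push_back(allRoles[i].first);
        return bestRoles;
    }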
Since it is envisioned that LAD will interact with several different lexical and linguistic resources, there is a need to choose a base set of semantic categories that can then be mapped to the semantic categories of these multiple resources. We chose to use the semantic categories of the current implementation of Compansion for the SRM as well. Below is the current list of semantic categories we use:

    animate          fragile          time
    communication    ergative         ingestible
    inanimate        human            communicator
    writing          tool_box         food
    instrument       physical         message
    oral             tool             description
    organization     place            abstract
    object           solid

In order for LAD to search the WordNet databases for these semantic categories, a mapping function was needed to map the above category names to the category names used in WordNet. To help facilitate this, the translation array was constructed as shown below (a code sketch of such a lookup appears after the verb frame list in Appendix D):

    WordNet Category                 Compansion/SRM Category
    [inanimate_object]          <->  [inanimate]
    [inanimate]                 <->  [inanimate]
    [visual_percept]            <->  [visual]
    [visual_communication]      <->  [visual]
    [visual]                    <->  [visual]
    [ergative]                  <->  [ergative]
    [human]                     <->  [human]
    [living_thing]              <->  [animate]
    [organism]                  <->  [animate]
    [animal]                    <->  [animate]
    [animate]                   <->  [animate]
    [auditory_communication]    <->  [auditory]
    [auditory_sensation]        <->  [auditory]
    [auditory]                  <->  [auditory]
    [food]                      <->  [food]
    [foodstuff]                 <->  [ingestible]
    [nutrient]                  <->  [ingestible]
    [ingestible]                <->  [ingestible]
    [dairy_product]             <->  [drink]
    [beverage]                  <->  [drink]
    [drink]                     <->  [drink]
    [respiration]               <->  [inhaled]
    [breathing]                 <->  [inhaled]
    [inhaled]                   <->  [inhaled]
    [cognitive_content]         <->  [cognitive]
    [cognition]                 <->  [cognitive]
    [explanation]               <->  [cognitive]
    [cognitive]                 <->  [cognitive]
    [business]                  <->  [organization]
    [organization]              <->  [organization]
    [facility]                  <->  [place]
    [residence]                 <->  [place]
    [building]                  <->  [place]
    [business]                  <->  [place]
    [location]                  <->  [place]
    [place]                     <->  [place]
    [speech_act]                <->  [oral]
    [oral]                      <->  [oral]
    [abstraction]               <->  [abstract]
    [abstract]                  <->  [abstract]
    [toolbox]                   <->  [tool_box]
    [tool_box]                  <->  [tool_box]
    [instrument]                <->  [instrument]
    [time]                      <->  [time]
    [thin]                      <->  [fragile]
    [brittle]                   <->  [fragile]
    [pane]                      <->  [fragile]
    [mirror]                    <->  [fragile]
    [fragile]                   <->  [fragile]
    [person]                    <->  [communicator]
    [causal_agent]              <->  [communicator]
    [communicator]              <->  [communicator]
    [message]                   <->  [message]
    [communication]             <->  [communication]
    [writing]                   <->  [writing]
    [object]                    <->  [object]
    [tool]                      <->  [tool]
    [solid]                     <->  [solid]

Appendix D

WORDNET VERB FRAMES

Something ----s
Somebody ----s
It is ----ing
Something is ----ing PP
Something ----s something Adjective/Noun
Something ----s Adjective/Noun
Somebody ----s Adjective
Somebody ----s something
Somebody ----s somebody
Something ----s somebody
Something ----s something
Something ----s to somebody
Somebody ----s on something
Somebody ----s somebody something
Somebody ----s something to somebody
Somebody ----s something from somebody
Somebody ----s somebody with something
Somebody ----s somebody of something
Somebody ----s something on somebody
Somebody ----s somebody PP
Somebody ----s something PP
Somebody ----s PP
Somebody's (body part) ----s
Somebody ----s somebody to INFINITIVE
Somebody ----s somebody INFINITIVE
Somebody ----s that CLAUSE
Somebody ----s to somebody
Somebody ----s to INFINITIVE
Somebody ----s whether INFINITIVE
Somebody ----s somebody into V-ing something
Somebody ----s INFINITIVE
Somebody ----s VERB-ing
Somebody ----s something with something
It ----s that CLAUSE
Something ----s INFINITIVE
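As noted in Appendix C, the translation array pairs each WordNet category name with a Compansion/SRM category. Here is a minimal sketch of how such a table and its lookup might be coded; the entries are abbreviated to a handful of representative pairs from the full table, and the helper name toSRMCategory is hypothetical, for illustration only.

    #include <cstring>

    // Abbreviated stand-in for the Appendix C translation array: each
    // WordNet category name is paired with its Compansion/SRM category.
    // Only a few of the full table's pairs are shown here.
    struct CategoryPair { const char* wordnet; const char* srm; };

    static const CategoryPair translation[] = {
        { "inanimate_object", "inanimate"    },
        { "living_thing",     "animate"      },
        { "organism",         "animate"      },
        { "foodstuff",        "ingestible"   },
        { "causal_agent",     "communicator" },
        { "location",         "place"        },
        { 0, 0 }   // sentinel marking the end of the table
    };

    // Hypothetical lookup helper: map a WordNet category name onto its SRM
    // category, or return 0 when the name is not in the table.
    const char* toSRMCategory(const char* wordnetName)
    {
        for (const CategoryPair* p = translation; p->wordnet != 0; ++p)
            if (std::strcmp(p->wordnet, wordnetName) == 0)
                return p->srm;
        return 0;
    }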