A SOFTWARE ENGINEERING APPROACH TO DEVELOPING AN OBJECT-ORIENTED LEXICAL ACCESS DATABASE AND SEMANTIC REASONING MODULE by Wendy Mair Zickus A thesis submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Master of Science in Computer and Information Sciences Spring 1995 Copyright 1995 by Wendy Mair Zickus All Rights Reserved Acknowledgments This work has been supported by a Rehabilitation Engineering Research Center Grant from the National Institute on Disability and Rehabilitation Research of the U.S. Department of Education (#H133E30010). Additional support has been provided by the Nemours Research Programs. I would like to thank my mother, Mary G. Mair, for being a surrogate mother to my daughter, Rebecca, while I was working on my graduate studies. I would also like to thank both her and my father, Robert D. Mair, for being supportive of my work over the past few years. To my advisors, Kathleen F. McCoy and Patrick W. Demasco, I am indebted for their guidance, patience, and faith in me. When I started working with them in the Natural Language Processing Lab at the Center for Applied Science and Engineering, located at the A. I. duPont Institute, I was a new graduate student in the Computer Science Department interested in Object-Oriented Programming. Now I am an Object-Oriented Natural Language Processing programmer with some expertise in Computational Lexical Semantics. I want to thank my daughter, Rebecca, for understanding all of the times when I could not be there due to projects and academic obligations, and especially for allowing me to help her celebrate her birthday in June instead of May due to finals. And last, but definitely not least, I'd like to thank my husband, Timothy E. Zickus for his love, support and seemingly endless editing of this thesis. His pride and belief in my abilities and capabilities were of immeasurable importance to the completion of my research work at The University of Delaware. TABLE OF CONTENTS LIST OF FIGURES viii ABSTRACT ix Chapter 1 INTRODUCTION 1 2 RELATED RESEARCH 7 Motivation 7 Application Domain: Augmentative and Alternative Communication 8 AAC Language Technology Overview 9 Some Natural Language Processing (NLP) used in AAC Applications 10 Need for Word Knowledge in NLP based AAC Systems 11 Compansion 13 Computational Lexical Semantic Overview 17 Dictionaries and Lexicons 20 Paper-Based 20 On-Line 21 WordNet 22 Efforts in Merging Lexical Resources 30 3 LAD DESIGN 34 Motivation 34 Functional Interface 36 Argument Structure 36 Functions for Semantic Knowledge 37 Functions for Syntactic Knowledge 44 Functions for Other Specialized Knowledge 45 LAD Resources 46 WordNet 47 Case Frames 49 Morphology 49 Phonetic Information 50 Syllabification Information 50 Frequency Information 51 LAD Mapping 51 Semantic Categories 52 Queries Requiring Multiple Databases 53 Complicated Queries 54 Application Areas 55 Object-Oriented Design 56 Extensibility 59 Implementation Status 60 4 SEMANTIC REASONING MODULE 61 Motivation 61 Case Frames and Semantic Reasoning 62 Semantic Categories 63 Preferences 63 SRM and LAD (Knowledge Representations) 67 Secondary Verb Frames vs. WordNet Verb Frames 67 Semantic Output 68 SRM Processing 69 5 CONCLUSIONS 73 Accomplishments 73 LAD Implementation Status 74 WordNet Usage 75 SRM and LAD Results 75 SRM Performance vs. Compansion Performance 77 Discussion 80 Future Work 81 BIBLIOGRAPHY 83 Appendix A LAD SOURCE DESCRIPTION 90 Appendix B SRM SOURCE DESCRIPTION 98 Appendix C SEMANTIC CATEGORIES 108 Appendix D WORDNET VERB FRAMES 111 LIST OF FIGURES Figure 2.1 Specified Case Frame for the Verb BREAK 15 Figure 2.2 List of 25 Unique Beginners for WordNet Nouns 23 Figure 2.3 Hyponymic Relations of Seven WordNet Unique Beginners 24 Figure 2.4 WordNet Verb Sentence Frames 28 Figure 3.1 Language Access Database Diagram 48 Figure 3.2 LAD Object-Oriented Class Hierarchy 58 Figure 4.1 Specified Case Frame for the Verb LIKE 66 Figure 4.2 List of Filled Case Frames 69 Figure 4.3 Algorithm for the Generation of Filled Case Frames (Part 1) 70 Figure 4.4 Algorithm for the Generation of Filled Case Frames (Part 2) 71 ABSTRACT The concept of enabling computers to understand and generate `natural' language has been enticing to mankind ever since we first saw and heard HAL in the movie "2001: A Space Odyssey." However, the state of the art in Natural Language Processing (NLP) is a long way from creating a reality of this concept. One of the major bottlenecks in the implementation of NLP is the lexicon: the place where the system's information about words is stored. There are difficulties in deciding what information should be stored in a lexicon, and even greater difficulties in acquiring this information. To date there have been several efforts which provide on-line lexical database resources with varying amounts of lexical information for NLP systems. One problem that remains, however, is that a particular NLP application may need information from several such sources, and there is no standard way to access and combine information from several lexical resources. This thesis describes the Language Access Database (LAD) system, a system designed to incorporate multiple lexical databases and tools under one consistent functional interface in order to facilitate systems requiring syntactic, semantic and lexical knowledge. The technical aspects of the design were mainly influenced by research in the areas of Computational Lexical Semantics, Augmentative and Alternative Communication, and Natural Language Processing. The implementational aspects of the design were influenced by the paradigm of Object-Oriented programming to create a system that is easily extendable and upgradable. This thesis presents LAD and the Semantic Reasoning Module (SRM), a semantic parsing system designed to test the LAD system's functionality. Chapter 1 INTRODUCTION In order for communication to exist, there must be some agreed upon representation of symbols, whether iconic, gestural, vocal, or written, and an agreed upon set of rules upon which to build combinations of symbols into sentences of meaning. Human beings have the capacity to learn how to communicate, to pass along the semantic and syntactic rules to their following generations, and eventually to expand upon and refine these symbols and rules. Through these representations, mankind has the ability to communicate, to generate new ideas and thoughts, to learn and to build upon the knowledge built up in the complex structure called language. According to the American Heritage Dictionary (1985), language is: The use by human beings of voice sounds, and often of written symbols that represent these sounds, in organized combinations and patterns to express and communicate thoughts and feelings. A system of words formed from such combinations and patterns, used by the people of a particular country or by a group of people with a shared history or set of traditions. Since the advent of computers, humans have been searching for ways to improve them. We strive to make them faster, increase their memory capacities, improve the user-interface peripherals and the methods we use to communicate with them. From a business perspective, there is a feeling of frustration at times, voiced about the computer not living up to its heralded increased productivity and decreased cost expectations. This frustration is partially caused by the high learning curve for employees to master the computer and its software, and partially due to the unnatural communication methods we use to describe our problem to the computer, solve it, and interpret the results. This is one impetus pushing computer scientists to give our programs the capability to understand or generate language in order to provide a more `natural' way for computer users to be able to communicate with their systems and vice versa. Natural Language Processing (NLP) is the area within computer science that deals with trying to create ways to process human language for various reasons and applications. In order to do this, there must be some means of representing language structure, and a mechanism for using this representation for generating and understanding language within the computer. Efforts to find solutions to these tasks have focused on three major areas of research: syntactic, semantic and pragmatic. The syntax of a language refers to the grammatical rules defining allowable ways to link words together into sentences. Semantics refers to the meanings of individual words and to the meanings formed by combining the words in a sentence. {FOOTNOTE: One NLP solution used to resolve word sense ambiguities is to look at the individual words in a sentence in relation to the other words of the sentence (i.e., the words in a sentence constrain the possible meanings of each other) (Small & Rieger, 1982).] Pragmatics refers to the overall context in which language is used (i.e., the situational context and past conversational experiences between conversing partners), creating expectations about future language interactions and context. This thesis will focus on the issues raised by the semantics of language. One of the difficulties faced by the area of semantics is that many words have multiple meanings (e.g., shoe can refer to protective covering worn on the foot, a restraining device used to help cars brake, or a container for cards in a casino). Once this is determined, a semantic system must reason about how individual word meanings can be used in resolving the meaning of a sentence. This has caused a focus on the kinds of information needed in order to do semantic parsing and reasoning. Do we need hierarchically linked relationships between words as well as syntactic and definitional information? How should we address the word sense disambiguation issue? [FOOTNOTE: This is one of the major questions faced today by researchers in Computational Semantics. It was the focus of the Post-COLING94 Workshop on Directions of Lexical Research, and is one of the main foci of the upcoming 1st Annual workshop for the IFIP Working Group for Natural Language Processing and Knowledge Representation to be held at the University of Pennsylvania in April 1995.] Some of this research has resulted in the specialized field of Computational Lexical Semantics, while other research has led to an increased understanding and improved methods and techniques necessary to handle the complex task of understanding and generating natural language. [FOOTNOTE: Though the state of the art in NLP still has restricted domains, there is nowadays broader application coverage for NLP systems (Allen, 1994).] Augmentative and Alternative Communication (AAC) research focuses on developing technologies to aid in spoken and written communication processes for people with motor and/or cognitive impairments. The abilities of people using AAC devices cover a broad spectrum of capabilities, from the mild to the more severely impaired individuals and from those with a single disability to those with a combination of cognitive, speech or language impairments. AAC devices are designed to enhance the language capabilities of this group and thus come in a variety of forms with a variety of technological options. Some AAC devices are non-electronic boards or books that may contain letters, words, concepts, or phrases written in traditional orthography or pictorial representations. Others are electronic, containing similar representations and usually accompanied by speech synthesis that provide a user with the capability to select and "speak" an utterance. Some electronic AAC devices include predictive capabilities to allow the user to create messages with a reduced physical load. This thesis will focus on the AAC devices that incorporate Artificial Intelligence and Natural Language Processing techniques and principles. Such systems require knowledge attached to the words in their system dictionaries. The dictionary requirements for NLP-based AAC systems reflect the needs of the three major NLP research areas: syntactic, semantic and pragmatic. For the syntactic area, there is a need for grammars (i.e., a set of rules that refer to word categories and word patterns), morphological information (e.g., +S for plural, +ING), and word category information (e.g., NOUN, VERB, PREPOSITION). Syntactic knowledge is needed for systems that use word prediction, grammar checkers, and/or language tutoring (McCoy & Demasco, 1995). Semantic knowledge is useful for systems that process and/or generate language. Examples of semantic knowledge needed would include selectional restrictions, case frames, conceptual meaning for words (e.g., the semantic categories for Mary are human and animate, window are fragile and physical object, hammer are tool and physical object), relational knowledge (e.g., a car is-a vehicle, a canary is-a bird), attributive knowledge (e.g., a canary is small and yellow with wings, feathers and a beak), functional knowledge (e.g., knives are used for cutting, trees are used for shade, energy and building) and knowledge of parts (e.g., a car has a windshield, a hood, brakes, fenders, tires, steering wheel). Semantic knowledge has proven useful to AAC systems which try to understand the meaning the user of the system is attempting to convey. And finally, for the area of pragmatics, knowledge concerning situational context and communication history, conversational information (i.e., turn-taking rules, organizational patterns for story-telling) is needed. This pragmatic information is important for systems that model typical conversational patterns (Alm et al., 1993). There is a variety of applications that require large amounts of lexical knowledge. While there are several on-line lexical resources available today, no single one contains all the information which may be necessary for AAC/NLP applications. [FOOTNOTE: Some of the efforts to develop such on-line lexical resources will be discussed in Chapter 2.] In the Language Access Database (LAD), we attempt to provide such a resource by creating a virtual dictionary which combines information from several available lexical and linguistic sources in a transparent, seamless manner. This project will provide a uniform dictionary interface to a variety of lexical resources (e.g., WordNet, collocative information, verb case frame data, word frequency data, phonetic information, syntactic information, category information). LAD provides the user (i.e., application program) with a functional interface through which they may pose a variety of queries. The LAD program will search the appropriate lexical resource in order to handle a query. LAD shields the user program from the task of understanding which (set of) lexical resource(s) actually contains the necessary information. Many different kinds of lexical information can be obtained through this consistent user interface. LAD has the capability to extract information needed by systems to enhance their capabilities to understand language, to generate semantically and syntactically correct language, to process language and/or to produce speech-synthesized language, depending on the needs of the individual systems. The focus of this thesis is the LAD system and its design. Before describing LAD in detail, Chapter 2 provides the technical background. It first motivates the need for a system like LAD by describing the application domain of Augmentative and Alternative Communication concentrating on NLP-based AAC applications. It also covers the area of Computational Lexical Semantics and work in the area of dictionaries (both paper-based and on-line, and efforts at merging lexical resources). The chapter highlights current research, knowledge base needs of systems, and available knowledge that current lexical sources provide. Chapter 3 details the LAD design: its functional interface, lexical resources, mapping capabilities, application areas, object-oriented design, extensibility, and implementation status. This chapter provides a complete description of LAD including the different kinds of lexical information that it can provide, a listing of applications that can benefit from LAD, and the flexibility the LAD design provides with respect to future enhancements and/or additional lexical resources. Chapter 4 details an application written to test and evaluate LAD called the Semantic Reasoning Module. It describes the kinds of information the module needs, its semantic reasoning, and how LAD provides the information needed for it to function properly. Finally, Chapter 5 concludes this thesis with an analysis of the testing and evaluation of LAD, a discussion regarding what LAD does and does not provide, and a look into the future directions of the LAD research project. Chapter 2 RELATED RESEARCH Motivation A great deal of knowledge is required in order to understand language. There are many areas in Computer Science and other professions researching the different `kinds' of knowledge needed for this task, such as Computational Lexical Semantics, Linguistics, Psycholinguistics, Augmentative and Alternative Communication (AAC), and Natural Language Processing (NLP), just to name a few. Some of these research areas concentrate on the fundamental structure of language, its syntactic and semantic properties, while other areas concentrate on what is needed in order for a computerized system to represent the meaning of textual input to facilitate computer understanding and/or generation of language. They all agree that the task set before them is difficult and may end up being impossible to accomplish in its most complete form. The progress made in this area has already led to vast improvements in computerized understanding and processing within restricted domains and applications (Chen & Huang, 1994). The need for and benefits of systems that will allow for unconstrained input justifies the continued efforts of researchers in this field (McHale & Crowter, 1994). The following section provides an overview of the application area of AAC which was the catalyst of this work. We focus on NLP-based AAC systems and their lexical semantic requirements. The following sections provide background into the area of Computational Lexical Semantics, on which this thesis has relied for technical information about words (i.e., information that must be associated with words for various kinds of processing). The final section covers research on dictionaries, both paper-based and on-line, for further guidance on the kind of information that should be available on words. Application Domain: Augmentative and Alternative Communication The field of Augmentative Alternative Communication (AAC) is devoted to developing alternative technology for people with disabilities [FOOTNOTE: These disabilities are primarily caused by afflictions such as cerebral palsy, ALS, etc.] that prevent normal means of communication. The technology may provide an alternative means of communication and/or may be used in the assessment and training for these individuals (Demasco et al., 1994). The research has centered around the areas of input (e.g., eye gaze and gesture), language (e.g., representation, organization and processing of language), and output (e.g., speech synthesis) (Demasco & Mineo, 1995). This thesis will focus on the language area in the following subsections, giving an overview into the current research in AAC and NLP-based AAC technology. We will look at the knowledge required by these systems and discuss an example of NLP-based AAC technology called Compansion (Demasco & McCoy, 1992). AAC Language Technology Overview The majority of the research in the language area of AAC can be grouped in the three categories of representation, organization and processing of language. The representation group research concerns vocabulary design (e.g., icons, word board) (Mineo et. al., 1994; Chang et al., 1993; Magnusson & Briem, 1994; Albacete et al., 1994). The organization of language group includes organizational models (Demasco et al., 1994) and access issues (e.g., scanning, direct access) (Koester & Levine, 1994). The processing group includes research on rate enhancement techniques (e.g., letter prediction, word prediction) (Venkatagiri, 1993; Hamilton, 1994; Demasco et al., 1994; McCoy et al., 1994B; Vanderheyden et al., 1994) and language correction (Suri & McCoy, 1993; Morris et al., 1992). In early research, little attention was paid to the syntactic, semantic or language issues involved (Kraat, 1990) in developing AAC technologies. However, since the late nineteen-eighties, a subarea in AAC research that concentrates on applying Natural Language Processing techniques to enhance AAC communication devices has emerged. The University of Delaware and the University of Dundee are forerunners in this area of research (McCoy & Demasco, 1995). One of the major problems involved in developing intelligent AAC devices is that suddenly the system designer/user need not only be concerned about the usability issues, placement and access to words, but also needs to provide the information about words which is necessary to do intelligent reasoning. Some Natural Language Processing (NLP) used in AAC Applications The use of Artificial Intelligence (AI) and Natural Language Processing (NLP) techniques in the development of AAC systems and devices continues to grow both in research laboratories and, more recently, in commercial products. The use of AI/NLP methods in any application area requires significant language knowledge about words and their various morphological forms, syntactic categories (e.g., NOUN, VERB, CONJUNCTION), thematic roles (e.g., AGENT, THEME, LOCATIVE) (Fillmore, 1977) and control information (McHale & Crowter, 1994). Within AAC, the need to support relatively unconstrained message production (in contrast to structured communication as in a database query language) requires that this knowledge be broad as well as detailed. Some of the underlying NLP work looks at many of the variables involved for a system to understand ill-formed language (Weischedel & Sondheimer, 1983; Jensen et al., 1983). This research includes developing methods to resolve issues such as, unknown words, word sense disambiguation, resolving referents, etc. (Granger, 1983; Fass & Wilks, 1983). In order to handle well-formed input, a system can generally resolve most of these variables, as long as the domain of the application is specific. In the case of ill-formed input, however, there are additional obstacles to consider (e.g., filling in missing words and/or punctuation, ad hoc abbreviations, lack of tense agreement, causality violation, goal violation, object or event referenced out of context, subject-verb disagreement, homonyms confused). In order to deal with irregularities of input, NLP researchers have been required to develop new methodologies (Small & Rieger, 1982). Need for Word Knowledge in NLP based AAC Systems Many word-based AAC systems contain words and word phrases written in traditional orthography. Other systems are based on graphic or icon-based conceptual representations. The communication devices designed for children with disabilities tend to be icon-based; one of their functions is to enable children to understand that language can be used to communicate desires and needs, such as getting help, obtaining objects, expressing likes and dislikes (von Tetzchner, 1988). However, as the child progresses, she/he requires new methods to communicate at her/his increased vocabulary levels. One method to help accommodate this is through spelling, another would be through learning more creative and complicated sequences of icons. There are two predominate icon-based languages used in AAC today. One is the Blisssymbolic [FOOTNOTE: Blissymbolics is a communication system designed by Charles Bliss based on symbols. It was originally thought to be an international symbol system to minimize misunderstandings between people of different language groups (e.g., Chinese, English, and Japanese speakers). Since 1971, Blisssymbolic has been used on communication devices by children with speech disabilities (Magnusson & Briem, 1994).] language (Archer, 1977) and the other is the iconic language of the Minspeak system which is based upon the principle of semantic compaction [FOOTNOE: Semantic compaction is the process involving mapping concepts on to multi-meaning icon sequences and using these sequences to retrieve messages stored in a computerized system. It was developed by Bruce Baker. (Baker, 1982). In particular, Minspeak provides an excellent example of a system developed with linguistic and syntactic principles in mind. It will be the focus of the following discussion. One major focus in the design of the Minspeak language is finding conceptually associated iconic representations and then combining these icons with syntactically bound icons. For instance, a sequence of the APPLE icon and the VERB icon will produce the word eat, while a sequence of the APPLE icon and the NOUN icon will produce the word food. Minspeak considers the icon images to act as indexes to concepts and these concepts translate to words depending on the conceptual mapping of the iconic sequences (Chang et al., 1992). The researchers involved with developing the iconic languages used by Minspeak have recently been studying the use of AI techniques to introduce semantics to the language design. They have introduced some components from Conceptual Dependency Theory [FOOTNOTE: Conceptual Dependency (CD) is a theory of Natural Language Processing developed by Roger Schank that provides computer programs with the capability of "understanding" natural language sentences by representing sentences in a series of primitives. The goal of CD is to represent language such that it aids drawing inferences from sentences and is independent of the language (e.g., Chinese, English, French) (Rich & Knight, 1991).] to enhance the development of appropriate iconic sequences for use in programming the device (Chang et al., 1993; Albacete et al., 1994). Kraat's concerns about the aphasic community (Kraat, 1990) has pointed out that AAC has not addressed the issue of semantic meanings in language; this concern is being addressed by researchers such as Baker and Magnusson who are beginning to incorporate AI techniques into Minspeak and Blisssymbolic. The BlissGrammar program, for example, will correct sentences made with Blissymbols so that the text-to-speech converter will speak or print out easily understood sentences (Magnusson & Briem, 1994). Other AAC researchers have employed AI techniques to word prediction (VanDyke, 1991), letter-based abbreviation expansion (Stum & Demasco, 1992), word-based systems (McCoy & Demasco, 1995), and sentence retrieval techniques (Waller et al., 1992). In order to incorporate the AI/NLP techniques to these methods, a great deal of knowledge is required. Word and letter prediction systems need to be able to access frequency information. [FOOTNOTE: Most frequency information is derived from either bigram and trigram frequencies or from corpora information. A study of simulated experiments at the Applied Science and Engineering Laboratories showed that using corpora frequency data, predictions after the first letter were correct 44.8% vs. a correct rating of 27.9% using bigram prediction (Hamilton, 1994).] Letter-based abbreviation expansion systems require a set of rules and frequency information. One of the more sophisticated NLP-based AAC systems, Compansion (Demasco & McCoy, 1992), was developed at the University of Delaware and the Center for Applied Science and Engineering. Compansion is an example of a word-based system that requires a great deal of knowledge in order to do its processing. That system and its necessary knowledge is described in the following section. Compansion One example of an intelligent AAC technique is Compansion, an approach that takes telegraphic input from a user and expands it into a syntactically and semantically well-formed sentence. The Compansion technique assumes a communication system based on words, pictures, or icons (i.e., non-spelling) and attempts to enhance the user's message production rate [FOOTNOTE: While the Compansion techniques has been primarily described as a rate enhancement technique, it also has potential applications in helping users learn how to produce grammatical sentences.] by requiring only the selection of content words. One advantage of such a system is that it reduces the need to represent morphological information (e.g., verb inflections, endings). This is potentially very beneficial for systems that use picture-based representations. The Compansion system uses both syntactic and semantic knowledge in its processing. The system's processing consists of three major phases. The first phase is the "word order" parser and relies on a syntactic grammar which captures the regularity of the telegraphic input in this initial parsing phase. The word order parser is responsible for grouping words into sentence-sized chunks and indicating each word's part of speech. It is also responsible for attaching modifiers (e.g., compound possessives, adjectives and adverbs) to the word they are most likely modifying. The second phase consists of the "semantic" parser which reasons about the meaning of the content words of the sentence-sized chunks and develops a semantic representation for each chunk. The final phase is the translator/generator which takes the output of the semantic parser and generates an English sentence using a syntactic grammar of English. Here we focus on the knowledge used by the semantic parser since it requires the most sophisticated knowledge sources. The semantic parser takes a set of words and attempts to fit these items into a well-formed semantic structure, thus determining the intended meaning. In the current implementation, processing is non-incremental; all of the input words are taken together and a semantic representation is created which best accommodates the set of words as a whole. Generally there will be at most one word identified as the main verb in the input words; the parser must determine which semantic role is being played by the other words. Consider the processing of the input John break window hammer. Once break is identified as the verb, the parser must decide which word of the input represents the agent/experiencer (i.e., person or thing doing the action), which represents the theme (i.e., thing being acted upon), and so on (Fillmore, 1977). This information is represented in the semantic parser in the form of a case frame that includes preference information (Fass & Wilks, 1983) to help the system reason about the possible intended use for each word of input. For example, a simplified form of the specified case frame for break is shown in Figure 2.1 below: [Figure Diagram] verb - break agexp [toFillPref 4] [[human 3] [animate 2] [ergative 2]] theme [toFillPref 3] [[physical 3] [fragile 4] [object 1]] instr [toFillPref 2] [[tool_box 3] [tool 3] [solid 1]] goal [toFillPref 1] [human 3] benef [toFillPref 1] [[human 3][organization 2][animate 2]] loc [toFillPref 1] [place 4] timee [toFillPref 1] [time 4] Figure 2.1 Specified Case Frame for the Verb BREAK The above frame captures two kinds of preferences used by the system. The first kind of preference, the case importance preference, is indicated by the toFillPref rating. This indicates which of the roles are most important to fill for a particular verb. In the frame above, the important roles to fill are the AGEXP (agent/experiencer), THEME, and INSTR (instrument). This is indicated by their respective ratings of four, three and two. A four indicates a very high preference while decreasing numbers indicate less preferred, but possible, options.The second kind of preference rating is the case filler preferences which are motivated by Preference Semantics (Wilks, 1975). They indicate what kinds of objects could fill a role and indicate a rating on the "goodness" of each type of filler. The frame in Figure 2.1 indicates that the AGEXP role is preferred to be filled by a human with a rating of three, but that any animate object or ergative object (e.g., a car) would also be acceptable with ratings of two. The THEME role is preferred to be filled by a fragile object, but a physical or abstract object could also serve as a filler, with ratings of four, three, and one respectively. A third kind of preference used by the system, but not directly illustrated in the Figure 2.1, is the higher-order case preferences. This case preference is intended to account for interactions between cases and their fillers. For instance, if a non-human animate (e.g., dog) is the AGEXP of a material process (e.g., eating), then it is highly unlikely that an INSTR is being used; however if the AGEXP is a human, an INSTR would be very likely. In the semantic parser, this preference is captured by a rule which subtracts from the overall goodness of an interpretation when these conditions are found. An interpretation is given a rating by adding together the numbers obtained by multiplying the case filler and case importance ratings for each word's case (and then subtracting off the number indicated by higher-order case preferences). So for the input John break window hammer, we prefer that John be the AGEXP (with a rating of 4 times 3), window be the THEME (with a rating of 3 times 4) and hammer be the INSTR (with a rating of 2 times 3), giving us an overall preference rating of 30. The idiosyncratic case constraints are not used to add or subtract from a case frame's preference rating, but are intended to capture mandatory and forbidden cases within a frame. For instance, a mandatory feature of the verb hit is the THEME, while GOAL is a forbidden feature for this same verb. These constraints are placed directly on the verb frames by the absence or presence of these cases (McCoy et al., 1994A). The basic idea of the semantic parser is to fit the non-verb words of input into the case frame in the best way possible. In order to do this, the semantic parser must access type-information associated with each word. For instance, it must be able to tell that John is a human, that hammer is not a human but a physical object, and that a window is fragile. With this information the semantic parser can reason about the words of input and generate the sentence John breaks the window with a hammer. A major problem for Compansion as a viable AAC system is the amount of information it requires about each word it may encounter. This information is not readily available in any standard database or on-line dictionary, and it must currently be specified by an NLP researcher. This, of course, greatly hampers the extendability and customization of the system. Computational Lexical Semantic Overview Computational Semantics is a cross-disciplinary research area focusing on enabling computers to reason about a language's semantics. This area is comprised of NLP computer scientists, linguists, and psycholinguists and the focus is on developing methods for determining the "meaning" of natural language sentences. Many Computational Semanticists have devoted their time to determining appropriate meanings for component parts (Jackendoff, 1978; Levin & Pinker, 1992; Pustejovsky, 1991) (e.g., the various meanings for prepositional "location" phrases). Because many believe that the meaning of a sentence is primarily influenced by the main verb, much of the work has been devoted to determining the meanings of particular verbs and how they influence sentence meanings (Levin, 1992; Palmer & Polguere, 1994; Levin & Hovav, 1992). Others have studied how language and word/sentence meanings are acquired (Levin & Pinker, 1992; Jubak, 1992). The methodology in Computational Lexical Semantics is to study the acquisition and use of component pieces of language and identify how these pieces influence the meaning of the whole sentence. Because these component pieces are made up of words, much work has been devoted to the study of appropriate lexical meaning that could explain their meaning. The ultimate goal of lexical semantics is to develop lexicons (containing appropriate meaningful information), methods, and tools that will produce multi-purpose systems capable of handling unrestricted language (McHale & Crowter, 1994). Current research into the role that phrasal syntactic properties play in the understanding of words has resulted in a unifying focus, leading the community to believe that the core elements of semantic representations are becoming clearer (Levin & Pinker, 1992). This has in turn led to increased attention being focused on lexicon representations (Church, 1994; Macleod et al., 1994), the application of computational and statistical techniques to corpora and machine-readable dictionaries (Hogan & Levin, 1994), and to new tools for the study of lexical representation (Berwick et al., 1994; Zickus, 1994). NLP systems require knowledge about the words they must process. This knowledge is difficult to acquire and represent in a machine, and has definite effects on system performance, leading to the term "lexical bottleneck" (Byrd, 1989). Current Computational Lexical Semantics (CLS) research aimed at resolving this bottleneck has led to areas of specialization that may or may not need to be applied in conjunction with each other (i.e., the problem is too big to be attacked from one front, so the work has been split up into various areas in the hope that by combining efforts from one or more of these specialized groups, a solution will be found). There are groups seeking to improve the quality of computerized lexicons by basing them on psycholinguistic principles (Miller et al., 1993; Miller & Fellbaum, 1992). Another group is attempting to further refine these improved lexicons and/or other CLS tools by applying them to each other, such as using Beth Levin's work on diathesis alternations in order to further improve WordNet or vice versa (Berwick et al., 1994; Zickus, 1994; Gu, 1994). Other groups are looking for ways to improve syntactic and/or semantic analysis methods through corpus-based lexical research (Church, 1994; Zernik, 1989). Some combine this corpus-based research with improved lexicons such as WordNet for the purpose of syntactic parsing (Macleod et al., 1994). There are groups focusing on defining and isolating the specific needs of Machine Translation systems (Chen & Huang, 1994), and finally, there are groups researching the lexical needs of integrated systems (Gros et al., 1994; McKevitt & Guo, 1994). While these separate groups are focusing on their specific research, they also maintain a focus on the end result of their work, resolving the lexical bottleneck. There have been discussions at various international and national workshops [FOOTNOTE: The International Post-COLING94 Workshop on the Directions of Lexical Research in Beijing, China was devoted to these issues, as is the upcoming 1st Annual Workshop for the IFIP Working Group for Natural Language Processing and Knowledge Representation in April 1995 at the University of Pennsylvania. Other discussions include the SIGLEX Workshop at Berkeley in June 1991 and a workshop on the Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generative at the AAAI 1995 Spring Symposium Series at Stanford University in March 1995.] and conferences concerning what is needed for lexical processing, what is currently available, and what needs to be done, if it is feasible. There are two notable research efforts that have been incorporated by many Computational Lexical Semantic groups recently: the work by Beth Levin on defining classes of English verbs based on diathesis alternations, and WordNet, an example of an improved lexicon developed by researchers at Princeton University. The work by Beth Levin separates verbs into distinct verb classes based on the syntactic phenomena in which a verb may participate (Levin, 1993). This class differentiation research is on-going and could prove to be an important tool for generating systems that can extract semantic meaning from language and/or generate syntactically and semantically correct language. WordNet is the main lexical resource accessed by LAD and will be described in the next section on dictionaries. Dictionaries and Lexicons Dictionaries have been around for a long time and have developed relatively standardized formats through years of publishing (Vizetelly, 1915). Though the shape, size and organizational structure has changed over the centuries, the main role of the dictionary has not. Dictionaries provide users with a way of locating definition, morphological, relational, antonymical, syllabic and phonetic information about words. This section will look at the format and function of paper-based dictionaries, on-line dictionaries, lexicons and WordNet, and conclude with a discussion of other efforts in merging lexical resources. Paper-Based While there are various paper-based versions of dictionaries available today (e.g., Websters, American Heritage), they all have similar organizational formats. The words are stored in alphabetical order, and they contain a wide range of information about words such as spelling, pronunciation, inflected and derivative forms, etymology, parts of speech, definitions, illustrative uses of alternate senses, synonyms, antonyms, and special usage examples. This information has proven useful to people over the years. There are, however, some difficulties in locating certain types of information in paper-based dictionaries. For instance, if you look up the word poplar, you will find out that a poplar is a kind of tree, but in order to locate coordinate [FOOTNOTE: Coordinate terms, or sister terms, are terms that are derived from the same parent, or superordinate. They can be thought of as kind-of terms of that parent. (e.g., poplar, ash, maple are all coordinate terms and are kind-of trees).] terms for poplar (an example of a lateral kind of search), you need to start at the beginning of the dictionary and look at the definitions of every word to see if they are also types of trees. Thus using paper-based dictionaries, you can search vertically for superordinate terms or subordinate terms without too much difficulty, but to search laterally takes a great deal of effort. On-Line With the advent of computer database systems, it was a logical step to create on-line dictionaries. The data can be stored in ways to link dictionary items not only in a hierarchical manner, but also in a lateral manner, thus reducing the time for lexical searches and increasing the flexibility of the searches. It wasn't long, though, before researchers decided to augment the information maintained within on-line lexicons to provide information needed for systems designed to process language. This led to the on-going debate concerning what approach to take in order to store both conceptual meaning [FOOTNOTE: Conceptual meaning is the cognitive content of words. This captures the expression of phenomena that are deeply embedded in the language.] and collocative meaning. [FOOTNOTE: Collocative meaning is the communication of meaning via associations between words or word classes. It attempts to explain words in terms of their relations with other lexical items and word classes.] One school of thought believes that the best way to encode semantic knowledge is through corpora analysis (Velardi et al, 1991). These researchers don't store words as in a normal dictionary, but use corpora to produce conceptually related concepts which they then use to interpret the meaning of words taken in context. Examples of conceptually related concepts (CRC's) are an activity is related to a location which is a place, a change is related to a final state which is a product, and farming is related to a location which is a field. They build semantic knowledge bases of these CRC's, in place of lexicons, from corpora analysis. The resulting semantic knowledge base can then be used in semantic processing for domain-specific applications. The other school of thought is to design on-line lexicons that incorporate additional knowledge necessary for deriving conceptual and collocative meanings (Miller et al., 1993). There are two on-line lexicons commonly being used by the research community, these are the Longman's Dictionary of Contemporary English (LDOCE) and WordNet. The next section describes WordNet in more detail. WordNet WordNet (Miller et al., 1993) is an on-line dictionary/thesaurus developed by a group of psycholinguists at Princeton University with over 95,000 lexical entries, 51,000 being simple words and 44,000 being collocations (e.g., attorney general). WordNet has the ability to distinguish a number of different senses of words and to produce synonym sets, called synsets, [FOOTNOTE: A synset is a grouping of synonymous words (i.e., words with the same meaning). An example of a synset is [merchandise, wares, product].] and sentence glosses for these differing senses. The primary purpose of a dictionary is to provide definitions for words. WordNet provides definitions for all synsets and thus for every sense of every word. WordNet incorporates pyscholinguistic theories and research attempting to mimic how people organize lexical memory (Miller et al., 1993) in its design. The central organizing feature for the lexical entries is a lexical matrix: a mapping between words and the synsets to which they belong. There are four different word databases associated with WordNet: one for nouns, one for verbs, one for adverbs and one for adjectives; each one having a slightly different hierarchical representation. This hierarchical organizational method is very natural for nouns, as it is based on psycholinguistic findings that people store nouns in their memories in a hierarchical manner from specific to general (Miller, 1993). WordNet has isolated 25 unique beginner synsets (see Figure 2.2), which are used to build up the hierarchical structures for nouns (see Figure 2.3). [act, action, activity] [animal, fauna] [artifact] [attribute, property] [body, corpus] [cognition, knowledge] [communication] [event, happening] [feeling, emotion] [food] [group, collection] [location, place] [motive] [natural object] [natural phenomenon] [person, human being] [plant, flora] [possession] [process] [quantity, amount] [relation] [shape] [state, condition] [substance] [time] Figure 2.2 List of 25 Unique Beginners for WordNet Nouns [Figure Diagram] Figure 2.3 Hyponymic Relations of Seven WordNet Unique Beginners WordNet uses a lexical inheritance system in its hierarchical structuring of nouns. This allows for semantic components to be inherited from their superordinate/parent terms. The base synset for most hierarchies is [thing, entity]. As mentioned previously, WordNet stores its words in synsets. In so doing, a noun is stored in as many synsets as there are different senses to the word. For instance, bank is in 6 separate synsets, (e.g., Sense 1, 2, 3 - [bank], Sense 4 - [depository financial institution, bank], Sense 5 - [bank, supply, reserve], and Sense 6 - [bank, bank building]). Notice how the first three synsets look identical on this level. However, if you look at their superordinate terms, the differing senses of the word become evident. The superordinate term synset for Sense 1 of bank is [slope, incline], for Sense 2 is [ridge] and for Sense 3 is [array, arrangement]. By placing nouns into synsets in this manner, WordNet can be used by systems to enhance word sense disambiguation. One of WordNet's strengths is the fact that it has all of its over 95,000 words broken into their different senses. However, this strength is also a weakness in that there is no agreed upon standardization among the lexical research community as to the number of senses any given word has, or what label should be given to these senses (Ide & Veronis, 1994). As an example, Martha Palmer is a researcher who has spent years defining the different senses of the word break (Palmer, 1990; Palmer & Polguere, 1994). Her work has merged the 30 different paper-based dictionary meanings into four core senses for break. WordNet, on the other hand, has 23 different senses for break, although some of the discrepancies between WordNet's different senses are difficult to distinguish (e.g., Sense 1 - violate, fail to agree with, go against, break, be in violation of; Sense 2 - transgress, violate, go against, breach, break, be in violation of). It is important though, to have some way to disambiguate word meanings in order to facilitate semantic reasoning. For instance, if we are talking about the word bank, are we talking about the financial institution where we make monetary transactions or are we talking about the bank by the river where we go fishing or swimming or picnicking? To this end, the sense information in WordNet is a valuable lexical tool. As mentioned previously, there are some limitations with traditional dictionaries. One such shortcoming is that the information stored with a word is often incomplete. When one looks up a noun, for example platypus, one learns that it is a semiaquatic, egg-laying mammal, but unless one is an expert on mammals, there is no way, other than by looking up mammal, to find out if a platypus has hair. Dictionaries are ordered alphabetically and not grouped semantically, therefore such searches can be cumbersome. This weakness in contemporary dictionaries demonstrates one of the major strengths of WordNet: its semantic and lexical relations. By using the WordNet on-line lexicon, it is easy to discover the attributes of a given noun by traversing the semantic links of its superordinate term (i.e., its parent). WordNet's semantic and lexical links also enable searching for various relationships to other words in the English language. One of these relationships is the is-a hierarchical relationship, which is one of the most pivotal in giving meaning to words that are related to each other. For instance if we know what a vehicle is, then it helps us to understand what a car is by knowing that a car is-a vehicle. This information is useful for applications that need semantic relationships between words, such as systems designed to teach English as a second language. Another important lexically-linked feature needed by many lexical systems is a means by which you can discover what parts or attributes a word possesses. This is often referred to as the has-a relationship. As in the case of the is-a relationship, the WordNet database has semantic links to help derive this information. For instance, using has-a knowledge you can determine that a car has an accelerator, gun, throttle, automobile engine, boot, luggage compartment, trunk, fender, bumper, car door, car mirror, car window, mudguard, first gear, low gear, floorboard, glove compartment, gear box, transmission, radiator grill and so on, all using WordNet's semantic and lexical links. Another feature lacking in contemporary dictionaries, but available through WordNet, is information about coordinate terms (i.e., sister terms). Someone looking for information about other mammals would be forced to search the dictionary from beginning to end looking for terms that are classified as mammals. The prototypical lexical entry for a word points to its superordinate term, not laterally to its coordinate terms or downwards to its hyponyms (i.e., its children or subordinate terms). This is one of the strengths of WordNet: its ability to reach related terms easily through its direct links to superordinate, coordinate and hyponymic terms makes searches of such information routine. The familiarity rating of a word is a useful indication for the probability of use for a word. In order to incorporate familiarity into WordNet, the designers have attached a syntactically tagged index of familiarity to each word. Generally, frequency is used as an indication of familiarity for words, however, WordNet's designers note that the relationship between frequency of occurrence and polysemy has been well-documented ever since 1945 (Miller, 1993). They go on to contend that polysemy predicts the amount of lexical access as well as frequency does, since the more frequently a word is used, the more different meanings it has in dictionaries, therefore, WordNet uses polysemy as its familiarity index. The familiarity rating is a useful way for users to locate a suitable alternative choice for a word simply by traversing a word's hierarchical links and finding the word with the highest familiarity rating. For instance, by traversing the WordNet hierarchical links for the word diary which has a familiarity count of 2, you can find the alternative choice of writing or writings which has a familiarity count of 7. This is a useful feature for systems interested in writing, generation, or teaching English as a second language. This familiarity feature is not only available for nouns in WordNet, but also for the verb hierarchy. Due to the nature of verbs, the complexity of their predicate argument structures (e.g., noun phrases), and other verb features (e.g., entailment, polysemy), it is unreasonable to represent verb entries in the lexicon merely with synsets (Miller & Fellbaum, 1992). To compensate for the complex nature of verbs, WordNet organizes them into senses based both on synsets and other verb features. It also provides short glosses to indicate the "meaning" of each sense (Fellbaum, 1993). WordNet was not designed to recognize the syntactic regularities that are a part of the semantic meaning of verbs; to compensate for this, WordNet includes one or more generic verb sentence frames for each verb synset. These frames distinguish features of the verbs by demonstrating the kinds of sentences in which they may participate. These frames indicate specific features of verbs that have been highlighted by researchers such as Beth Levin (e.g., argument structure, prepositional phrases and adjuncts, sentential complements and animacy of the noun arguments) (Miller & Fellbaum, 1992). The sentence frames are limited to a standard, generic format that is shown in Figure 2.4 (for a complete list of the WordNet verb frames, see Appendix D). Somebody ----s something Somebody ----s somebody Something ----s Somebody ----s Something is ----ing PP Somebody ----s something to somebody Something ----s something Somebody ----s that CLAUSE Something ----s somebody Figure 2.4 WordNet Verb Sentence Frames WordNet is an exceptional on-line dictionary, containing a great depth and breadth of lexical and semantic knowledge. It is currently being used by many researchers (Church, 1994; Macleod et al., 1994; Berwick et al., 1994; Zickus, 1994; Sutcliffe et al., 1994) as a tool to help solve some of the basic road-blocks in Computational Lexical Semantics and NLP, especially as a means of word sense disambiguation. WordNet does have some weaknesses. The morphology information within WordNet is minimal and it only goes in one direction: stripping off endings or searching exception lists to find the root form of a word (e.g., went is changed to go during a WordNet database search). There is no morphological mechanism for adding endings to root words (e.g., being able to pluralize story to stories). Morphological information is important to systems involved in Natural Language Processing/Generation. Additionally, if someone desires phonetic information (e.g., speech synthesis systems), information on non-noun/verb/adjective/adverb terms (e.g., word-based systems), proper nouns, or information on function words; they need to go to another source. WordNet makes a nice base for developing a multi-purpose linguistic tool. It has the traditional dictionary information (e.g., definitions, example sentences, part of speech tags, illustrative uses of alternate senses, synonyms, antonyms), and it takes advantage of computer technology by incorporating semantically related links within its lexicon to provide additional lexical information. WordNet is the primary dictionary resource accessed by the Language Access Database (LAD) which will be described in Chapter 3. Efforts in Merging Lexical Resources There are other current research efforts in merging lexical resources besides the LAD project. While the two discussed here have similarities with LAD, there are significant differences in approach, design and functionality. The first work is being developed by Kevin Knight and Steve Luk at USC/Information Sciences Institute and the second work is by Michael McHale and John Crowter at Griffiss Air Force Base in New York. Since both works use the Longman's Dictionary of Contemporary English (LDOCE), a brief description of it is provided before further discussion proceeds. LDOCE is designed to be a learner's dictionary of English as a second language. It contains over 27,000 words, chosen for their core vocabulary aspects and frequency of current usage, and over 70,000 word senses. It includes short definitions, examples of usage, syntactic categories (e.g., adj followed by to), semantic categories (e.g., human, animate object), and pragmatic categories (e.g., economics and business). One of the nice features of LDOCE, from the angle of incorporating it into other computerized systems, is that it has a controlled defining vocabulary of approximately 2000 basic words. This allows systems to need less word knowledge while processing the definitions; however, this can also lead to more complex syntactic representational structures in order to fully define words with the limitations of a constrained vocabulary. Knight and Luk (1994) are building a large-scale knowledge base for machine translation using semi-automatic methods for manipulating and merging existing resources. Their resources are: 1) the PENMAN Upper Model which is a network of about 200 nodes to influence linguistic choices; 2) the ONTOS model from Carnegie Mellon University which is an ontology designed to support machine translation; 3) LDOCE; 4) WordNet; and 5) the Harper-Collins Spanish-English bilingual dictionary containing thousands of Spanish words with English translations. Their methodology is to merge these on-line dictionaries, semantic networks and bilingual resources through semi-automatic methods (e.g., conceptual matching of semantic taxonomies, definitions). By using the resources chosen, they have access to much of the semantic and syntactic information needed by NLP-based systems. These include semantic categories (e.g., human, inanimate object, tool), sense information, and phonetic information (though they do not mention it, LDOCE provides pronunciation data on its words). This work has not been completed. Most of their work so far has been in developing the merging algorithms needed in order to match up the different resources. No complete testing has been done, therefore evaluation of this system is not possible. The goal of the second research group is to construct a lexicon from a machine readable dictionary. The resources used are: 1) a Principle-Based Parser (PBP), based on Chomsky's Government-Binding Theory; 2) LDOCE; and 3) Roget's International Thesaurus, with approximately 225,000 words. A PBP is based on a lexical theory of grammar and needs more information than most parsers require (e.g., morphological forms, part of speech, syntactic complements, thematic roles and control information). The combination of LDOCE and Roget's seems to provide the information necessary for the PBP. One of the limitations of this system is with regard to thematic roles (e.g., AGENT, THEME, BENEFACTIVE, EXPERIENTIAL, and LOCATIVE). LDOCE does not explicitly provide thematic role information. McHale and Crowter (1994) used a method of searching for repeated patterns of words in the word definitions in LDOCE in order to extract this information (e.g., the pattern to cause to is indicative of an AGENT role). This only proved successful for two-third's of the verbs; of these, ten percent had conflicting patterns and had to be re-analyzed by hand. They were also limited to the five thematic roles listed above. They developed a mapping facility that correctly mapped sixty-three percent of the word senses in Roget's Thesaurus to the word definitions in LDOCE. This work is not completed. However they did implement a lexical browser (limited to aerospace terminology) that will allow a user to select a word which produces the word's definition and links to any aerospace terms. The browser also has speech output in the form of uttering the word and a sample sentence from LDOCE. The system being designed by Knight and Luk (1994) is for the specific domain of machine translation and has not addressed the needs of other domains. Their design of merging the five specified resources together limits their system's extensibility. In other words, if they wish to add a feature, such as frequency, they need to re-work a great deal of their system in order to merge in additional resources. They have addressed the domain-specific needs of machine translation systems, but have not provided for the needs of other NLP-based systems. While McHale and Crowter (1994) were aiming at providing NLP technology with the ability to handle unconstrained language, they had to cut back on these aims due to personnel and time restrictions and thus have fallen short of their original goals. Their system is designed to provide a large, general lexicon for a broad coverage, domain-independent, syntactic parser (PBP); however, it does not address the needs of some of the NLP-based systems discussed earlier. What is needed is a system with a wider scope of information, a more extensible architecture (for incorporation of future lexical and linguistic resources), that contains semantic and syntactic information capable of supplying the needs of several real-world applications. Chapter 3 LAD DESIGN Motivation One of the limitations of AAC devices today is the size and lack of information available in their dictionaries. While some may contain an adequate amount of words, none of them contain sufficient information on these words to do semantic and syntactic reasoning, such as the word information needed for a semantic parser. In addition, while there is substantial interest in the development of natural language interfaces within the general software community, there currently do not exist any lexical databases that provide both a broad coverage (in terms of numbers of words) and sufficient depth of information (e.g., case frames) for individual words (Zickus et al., 1995; McHale & Crowter, 1994). There are several lexical tools and systems available on the market today; however, there are limitations with most, if not all, of them. They each have their own strengths, weaknesses, and specializations. In order to take advantage of more than one tool at a time, there needs to be a centralized interface system that will extract and glean the desired information in a consistent, understandable and functional manner. Thus the idea behind the Language Access Database (LAD) was developed. The approach to designing LAD has been to create an implementation with C++ and Lisp [FOOTNOTE: In our NLP laboratories, we often use Lisp to develop prototypes and C++ for commercial application development.] interfaces that allows a programmer to access several different databases (or lexical resources) in a seamless manner. LAD provides the programmer with a set of functions which can be used to retrieve various types of information. LAD then queries appropriate databases, retrieves the desired information, and returns it to the user. Thus the user is given the impression that a single database containing a variety of information is available, while LAD may actually access several different lexical resources to handle a query. At the same time, the programmer is given as much or as little control as they need. For instance, a user can simply query LAD about the frequency of a word and LAD will return the frequency rating found for the most generally accepted meaning of that word in some default corpus. Alternatively, if the programmer prefers, LAD provides mechanisms for the user to specify a specific "sense" of the word they are interested in and/or specify which corpora they would like to use. Thus it provides ways for users to override its default settings and to exhibit a great deal of control over the way a specific query is processed. LAD accesses several different lexical resources, the most unusual of these being the on-line dictionary/thesaurus WordNet, that was discussed in the previous chapter. It is WordNet that contains much of the semantic information needed for intelligent AAC applications. This chapter will describe the functionality of the LAD design, the lexical resources available to LAD, the LAD mapping functionality for different application areas, the object-oriented design of LAD, its extensibility, and finally its current implementation status. Functional Interface LAD is designed to be a lexical resource for a variety of applications. A goal of this work is to provide a functional interface that allows a user to query for several different types of lexical information. Often the methods provided allow for optional arguments that give the programmer more control over the way the query is evaluated. In this section, the LAD design functionality will be described. This section has been broken into four sub-sections; the first one describes the argument structure of the LAD methods and the next three reflect the "kind" of lexical knowledge being returned (e.g., semantic, syntactic, and other specialized knowledge) by the LAD functions (for more detailed LAD specifications, see Appendix A). Argument Structure LAD was designed based on the object-oriented paradigm. This gives the system the capability of having a flexible argument structure, providing LAD with an adaptability and robustness that otherwise would have been difficult to obtain. All of the functions described in the next three sub-sections have the over-loaded argument capability of being able to add more detailed query specifications. This means that every function, besides being able to provide the more generic information requested for a word, can also return information for a specific sense of a word, part of speech for a word, and lexical source used for the information retrieval. For instance, an is-a(book, dramatic_composition) would search all senses of the word book to see if dramatic_composition is in its hierarchical lexical links, while is-a(book, dramatic_composition, 2) would only search (WordNet) sense 2 of book for this superordinate. Frequency information searches might add either part of speech and/or source information into its query, such as frequency(and, CONJUNCTION) or frequency(writes,VERB, mobywords). This concept of providing specific information for queries is an important one for the LAD system. It enables the capacity to return both generic and specific information on words from its lexical resources. The next three sections describe the functions of LAD; the kind of argument(s) they take and the output they yield is generally illustrated with examples. Functions for Semantic Knowledge is-a(word, parent) This function takes two arguments; the word that is-a information is needed for and the parent-term being considered for the is-a query. For instance if we know what a vehicle is, then it helps us to understand what a car is by knowing that a car is a vehicle. The output from this function is an integer, 0 for false and 1 for true. In the example shown, the output is a 1 for true. If the query was for the input book, dramatic_composition, and 2 (indicating sense 2 of book), the answer would be 0 for false, since dramatic_composition is not part of the hierarchy for book as in "record, recordbook, book." is-hierarchy(word) This function takes one argument, word; it returns the is-a information associated with the word. For instance, you might know that a poplar is a tree, but you might want to know what other superordinate terms it has. In this case, the output is a character string containing all of the hierarchical links for poplar (e.g., "poplar, poplar tree => angiospermous tree, flowering tree => tree => woody plant, ligneous plant => vascular plant, tracheophyte => plant, flora, plant life => life form, organism, being, living thing => entity"). This tells the user system what other categories poplar can be considered as an is-a term. If multiple hierarchies exist, they are separated, and if sense is specified, only the hierarchy for that sense of the word is returned. list-semantic-categories(word) This function takes one argument, the word for which semantic category information is needed. Semantic category information is related to the is-a information described in the previous example. In this case, however, the categories returned are limited to those used by the Compansion system (listed in Appendix C). For instance, if we ask for the list of semantic categories hammer belongs to, it returns the list, "object, inanimate, tool." The output from this query is a character string. has-a(word, attribute) This function takes two arguments, the word that has-a information is needed for and the attribute being searched to determine if the word contains it. For instance, if we want to know if a car has-a windshield, we will use this query. The output from this function is a integer, 0 for false and 1 for true. In the example shown, the output is a 1 for true. has-parts(word) This function takes one argument, the word that we want has-parts information about. For instance, you might want to know all of the attributes a car has. The output from this function is a character string. In this case, the output is a character string containing "car - windshield, engine, trunk, wheels, steering wheel, mirror, fender, accelerator, bumper, floorboard, glove compartment, roof, suspension, tail pipe, turn signal, hood, radiator grille, gears,..." The output for sense 3 of car (as in railway car) would return "suspension, suspension system." semantic-properties(word, property) This function takes two arguments, a word and a semantic property. The goal of this function is provide semantic information about a word; for instance, given the noun tree, what is its means of reproduction? In this case the word would be tree, the property would be reproduction, and the output would be the character string "seeds." Other semantic queries that could be made of the noun tree would be utility yielding "shade, protection from the wind, produces oxygen, provides fuel, provides wood for construction," or height yielding "trees are generally tall." The output of this query is a list of character strings. definitions(word) This function takes one argument, the word for which a definition is required. For instance, if the argument is tree, the output will be "Sense 1=>tree, tree diagram - a figure that branches from a single root; `genealogical tree,' Sense 2=>tree - a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms." If the function is given the arguments platypus and 1 (requesting sense specific information), the output would be "a platypus is a semiaquatic, egg-laying mammal." familiarity(word) This function takes one argument, the word needing familiarity information. There is a great deal of research into the polysemous [FOOTNOTE: In general, polysemy refers to the multiple meanings some words have, especially verbs.] nature of words. Some researchers contend that polysemy is an indication of the frequency of a words usage, while others contend that it has to do with the familiarity of a word. The LAD familiarity rating is based on polysemy. As we have defined familiarity, it indicates a prediction on lexical access. The output is in the form of an integer ranging from 1 (not familiar) to higher ratings, such as 48 (very familiar). The familiarity rating for the word break is 45. sense-info(word) This function has one argument, the word for which sense information is needed. The output is the number of senses found for that word and is an integer. For example, the output for the sense-info query for the word break is 23. This argument does not have a sense specific overloaded function, as it would not be logical to ask how many senses sense 1 of a word has. alternate-word(word) This function takes as an argument a word for which an alternate choice is needed. Often times when someone is writing a paper, letter or thesis, it is useful to get a good alternative for a word (e.g., when writing about broncos, another acceptable term would be horse, its generic type). LAD uses the index of familiarity to provide useful alternative words. The output of this query is a character string. For example, given the input bronco, LAD will return "horse." coordinate-terms(word) This function will take one argument, the word looking for its coordinate (or sister) terms. The output is in the form of a character string. So, if it is given the argument watch, it will return "clock, hourglass, sandglass, sundial, timer, atomic clock, ticker." If it is given a specific sense of the word, as in ash and the sense is 2 (for a type of wood or tree), the output will include, "yellowwood, balsa wood, boxwood, acacia, redwood, bamboo, poplar, birch..." is-coordinate-terms(word, sister) This function will take two arguments, the word being queried about and a possible sister term. The output is an integer, 0 representing false and 1 representing true, indicating if the sister argument is indeed a coordinate term of the word or not. For example, if the first argument is ash and the second argument is poplar, the result would be 1, or true. synonyms(word) This function takes one argument, the word for which synonyms are needed. The output of this function is a character string. For instance if the argument is the word shoe, the output would be "footwear, footgear, horseshoe, U-shaped plate, brake shoe." If there is a sense specifier argument to this function, for example, break and 4, the output will be "disperse, dissipate, scatter, break, spread out, break up, come apart, separate, part, split, move apart." parent-terms(word) There is one argument to this function call, the word requiring parent-term information. As in previous examples, if a sense specifier argument is provided, then the parent-term for that specified sense of the word will be returned, else it will return all of the parent-terms for all senses of the word. The output from this function is a character string. For instance, if there is one argument, shoe, the output will be "{footwear, footgear}, {plate, scale, shell}, {restraint}" and if the arguments are platypus and 1, the output will be "mammal." children-terms(word) There is one argument for this function call, the word looking for its children-terms. By adding a sense specifier argument, it searches for the specified sense of the word, otherwise it searches for all senses of the word. For instance, if the argument is lake with no sense specified, the output will be "{reservoir, artificial lake}, bayou, Great Lakes, Lake Erie, Lake Huron, Lake Ontario, Lake Michigan, Lake Superior, {lagoon, laguna, lagune}, {pond, pool}, Caspian Sea)." The results from this search are character strings. In the case of multiple strings for different senses, they will be returned in a list format to provide a separation between the different strings. antonyms(word) There is one argument for this function, the word for which antonym data is needed. All antonyms for the word will be returned, unless there is a sense specifier argument, then the function will return sense specific information. For example, if one argument, male, is given, the output will be "woman, female." The output for this function is a character string. In the case of multiple antonyms for different senses, they will be returned in a list format to provide a separation between differing antonym strings. nominal-relationship(word) There is one argument to this function, the word for which nominal relationship information is needed. The output of these searches is either a character string or a list of character strings, depending on whether sense specific information is requested. For instance, if the word is Amazon and sense 1 is specified, the output will yield "river." If there are multiple answers, as in the case of no sense specification, the answers are separated by being placed in a list. reverse-nominal-relationship(word) Alternatively, LAD also has been designed to include a reverse nominal relationship function. This will take one argument, the familiar category name. It will return a character string for its output. For example, if the argument is satellite, the output string will be "moon, astronomy satellite, communications satellite, space station, Salyut, Skylab, sputnik, spy satellite, weather satellite, meteorological satellite." case-frames(verb) This function has one argument, the verb for which a specified case frame is desired. The output is a verbFrame, specified verb case frame object. The particular case frame contains the information needed for the SRM project described in the next chapter. There are twenty distinct verb case frames (see Appendix B), each having different ratings and semantic category lists. For instance, given the verb break, the output would look like: verb - break agexp [toFillPref 4] [[human 3][animate 2][ergative 2]] theme [toFillPref 3] [[physical 3][fragile 4][object 1]] instr [toFillPref 2] [[tool_box 3] [tool 3] [solid 1]] goal [toFillPref 1] [human 3] benef [toFillPref 1] [[human 3][organization 2][animate 2]] loc [toFillPref 1] [place 4] timee [toFillPref 1] [time 4] Functions for Syntactic Knowledge is-part-of-speech(word) This function takes two arguments, the word that is being searched and a part of speech tag (an overloaded argument). The output is an integer, 0 for false and 1 for true. For instance, if the arguments are bow and NOUN, the output will be 1. Similarly, if you have bow and VERB, the output would also be 1, for true. However, if the arguments were bow, VERB, 9, the output would be 0, since bow has 10 senses as a NOUN, but only 4 as a VERB. get-part-of-speech(word) This function takes one argument, the word for which you are searching for part of speech information. The output is a string indicating the parts of speech for the word. For instance, if the argument is bow, the output would be "NOUN, VERB, ADJECTIVE." word-compounds(word) This function takes one argument, the word for which you are searching for compound usages. The output is a character string. For instance, if the argument is dandelion, the output would be "common dandelion, dandelion, dandelion green, dwarf dandelion, fall dandelion, krigia dandelion, russian dandelion." morphology(word, tense, kind) This function can take up to three arguments. The first argument is the word for which morphological information is desired. The next two arguments specify the kind of morphological information wanted for that word. The output of this function will be a character string. For instance if the arguments are be, present, first-person-singular, the output would be "am." The third argument is optional, since not all morphological searches require a third argument, for example, candy, plural would yield "candies." Functions for Other Specialized Knowledge pronunciation(word) This function has one argument, the word for which phonetic information is requested. For instance, if the arguments are lead, NOUN, sense 9, the output will be the phonetic string for the word as used in the sentence I need more lead for my pencil. Whereas, if the arguments are lead and VERB, the output will be the phonetic string for lead as in I will lead the girl scout troop tomorrow. syllabification(word) There is one argument for this function. It is the word requiring syllabification information. The output will be a character string. For instance if the argument is category, the output will be "cat-e-gor-y." frequency(word) This function will take one argument, the word requiring frequency information. The output will be a floating point integer. For instance, the output for and, and CONJUNCTION would be 1.68549, or if the arguments were writes, VERB, and Moby Words, the output would be 0.211624. LAD Resources The LAD project aims at providing a great deal of functional capabilities. While many lexical resources are available today, not one of them provides all of the information LAD requires. Therefore, in order for LAD to accommodate the functionality just discussed, it needs to have access to several linguistic and lexical database resources. The primary LAD resource is the WordNet database. Some of the other sources that LAD can access include an internally developed verb case frame database (where verb frames such as the one from our previous example are stored), a morphology database, a file containing phonetic information, a syllabification file, and databases containing various information derived from the Brown corpus and the Carterette corpus. The following sections cover some of the lexical resources of LAD. It should be noted that LAD has been designed for extensibility and because of this, augmenting it with additional databases is a fairly straightforward process. Figure 3.1 shows the overall structure of LAD and the resources it accesses. One important function of LAD is the integration of multiple lexical resources and/or files. These resources are shown on the right part of the figure. The architecture is extensible in that new lexical resources can be added without modification of the database engine. WordNet LAD derives a majority of its semantic knowledge from WordNet. The is-a and has-a relationships, word definitions, familiarity, sense-information, alternate-term-information, coordinate terms, synonyms, parent-terms, children-terms, antonyms, nominal-relationships, reverse-nominal relationships, word-compounds and most of the semantic-property information, are all derived from the WordNet lexicon. The hierarchical structure of the WordNet data, as described in Chapter 2, is one of the major reasons so much semantic knowledge can be extracted. This aspect of WordNet combined with its broad lexical coverage (over 95,000 lexical entries), has provided LAD with a vast amount of available knowledge and has proven to be an excellent choice as its primary lexical source. [Figure Diagram] Figure 3.1 Language Access Database Diagram Case Frames The case frame information is contained in a secondary database consisting of a flat-file with over 85 verbs and their attached case frames. The case frames contain data used for the semantic reasoning done in the Semantic Reasoning Module (SRM) described in the next chapter. The SRM application was developed to test the LAD design, including the extensibility required for adding additional resources. In addition, the case frame database illustrates how multiple lexical resources can be used to retrieve desired information. For instance, the number of verb case frames that LAD has access to is larger than 85 due to the synonym sets in WordNet. For example, some systems may need verb frame information on verbs that do not have frames (e.g., pummel). By default, LAD currently searches for a verb frame from the case frame database. In the case where the verb is not represented in the secondary database, LAD searches for synonyms of the verb from WordNet (e.g., crush) and then checks in the case frame database for one of the synonyms. Morphology An important aspect of computerized language generation and understanding is the ability to do morphological reasoning about words. For example, if the word of input is trees, a system will need the ability to reduce it to the root word tree in order to search for information associated with the word in various databases. [FOOTNOTE: It is normally the case that word information is associated with root words in databases. There are instances, however, such as WordNet which has a morphological processor that reduces words to their root form before it searches the database. ] Likewise, for the generation of language, if the input is John be happy or They be happy, morphological information is needed to find the third person present singular or third person present plural form of the word be in order to generate either John is happy or They are happy. As you can see from these examples, morphology is an important lexical tool. Much of the needed morphological information is contained in collections of databases provided by Moby Words. However, one might imagine a database which contains only exception data and a processor which handles other morphological information in a rule-based system. [FOOTNOTE: There is presently a rule-based morphology program with exception tables presently being used in a different application at the Natural Language Processing Laboratory.] Incorporating such a system would be straightforward given the LAD design. Phonetic Information The phonetic information LAD will access is contained in the Moby Pronunciator II file from the Moby Word II collection. It contains over 175,000 phonetic entries using the standard International Phonetic Alphabet. Moby Pronunciator II contains many common names and phrases borrowed form other languages, as well as a large number of common English words, compound words and phrases. This resource will provide phonetic information for applications interested in producing speech synthesis or phonetic information. Syllabification Information The syllabic information LAD will access is contained in the Moby Hyphenator II file from the Moby Word II collection. It contains 187,175 hyphenated single and compound words. This resource will provide syllabic information for applications requiring hyphenated word information. Frequency Information Corpora have been used by various NLP applications in place of large lexicons to enable semantic and contextual word meaning (Velardi et al., 1991). The Brown Corpus is one of the more prominently analyzed and used corporas. It was compiled by W. Nelson Francis and Henry Kurcera at Brown University using written American English sources sampled in 1961. It contains 1,014,294 words from sources that include the press, religion, popular lore, biographies, essays, general fiction, science fiction, adventure fiction, romance stories and humor. It comes in both a syntactically tagged version and an untagged version. The frequency information database will access frequency information from the Brown Corpus, the Carterette Corpus and Moby Word II, which includes a frequency file of the 1,000 most frequently used English words, a file with the 1,000 most commonly used English words on the Internet computer network in 1992, and a file with the 1,185 most frequently occurring substrings in the King James Version Bible. LAD Mapping LAD is intended to be a useful semantic and lexical tool for multiple systems, which are designed independently of LAD and LAD's resources. It is anticipated that queries may require some mapping between various sources in order to be fulfilled. For instance a query may require LAD to access one database in order to get information needed to access another database to properly retrieve the necessary information. In addition, there may be queries which require retrieving similar information from two different databases, but the information itself may be categorized differently in the two databases. In order to handle these naming discrepancies and complex searches, LAD has a mapping capability which allows information between different databases to be used consistently from one to another, even though the information may be given different names in the different databases. The mapping and availability of multiple database accesses are done in a transparent manner to the functional interface. Semantic Categories A semantic category is a way of expressing attributes of a word or the roles a word can fill (e.g,. song -> auditory, auditory communication, auditory sensation). The Semantic Reason Module (SRM), described in Chapter 4, requires a list of semantic categories a word can fill in order to semantically reason about possible word roles. LAD facilitates the SRM by searching the WordNet database to determine what categories can be attributed to a word. However, there are some semantic categories in the SRM that have different names in the WordNet database (e.g., SRM - ingestible, WordNet - foodstuff, nutrient; SRM - visual, WordNet - visual_perception, visual_communication). To compensate for discrepancies between specifications for LAD data and user-system's data, LAD uses the mapping capability previously described. For example, when the SRM wants to know if a carrot is an ingestible item, LAD will access the mapping table to find ingestible's translation into WordNet categories. Finding that ingestible translates into foodstuff or nutrient, LAD will search the WordNet database for foodstuff and/or nutrient as elements of the hierarchy for carrot and return a true or false answer to the SRM (see Appendix C for the complete WordNet to SRM mapping). In this manner, LAD transparently handles discrepancies in system specifications and its accessible source specifications. Queries Requiring Multiple Databases NLP systems often are designed with more than one knowledge base. These multiple knowledge bases enable complex reasoning about different "kinds" of data. For example, a system requiring corpora knowledge using LAD would get information from both the Brown and Carterette corpora. However, if they specified a specific corpus, then they would receive data from just one corpus. [FOOTNOTE: Different corpora are gathered from different sources. Depending on the sources in the corpora, it can influence statistical data retrieved.] For instance, frequency information can be determined in a number of ways, one being the word count of a word in a large corpus. LAD will have access to several corpora. These corpora differ in the type of information found in them; for instance, the Carterette Corpus contains samples of spoken language and the Brown Corpus contains samples of written language from a wide variety of sources. LAD provides multiple frequency information queries to enable users to chose specific databases from where they obtain frequency information, if they have a stronger preference for one corpus over the other. However, if no database is specified, LAD will use a default path to retrieve frequency information. Alternately, LAD could combine frequency information from several different sources using mapping information to decide if and how the information can reasonably be combined. Another example of LAD requiring multiple databases is the searching methodology for verb case frames mentioned previously. If LAD cannot find a case frame for the verb it is processing, then it accesses information from WordNet that will enable it to continue its search in the verb case frame database. Complicated Queries At times, NLP systems require data that necessitates LAD to search in one or more databases before searching a final database to find the answer. For instance, in the case of a system requiring corpora statistics for a specific sense and part of speech information on a word, LAD would first need to access the WordNet database for the part of speech specified and then locate the specific sense of that word. Finally, LAD would need to call its corpora mapping function with the located word sense in order to retrieve the desired information. Another example of a complicated query would be one required for a computerized system with speech synthesis capabilities which relies on phonetic data to enable it to generate speech. LAD will be able to search the phonetic database for a given word and part of speech tag and return the pronunciation string. Phonetic information will be available for LAD through the phonetic database, but it will require a mapping function from a specific part of speech category for a word to the corresponding WordNet database (e.g., if the word is a noun, then it needs to access that word in the WordNet noun database). Then it needs to locate the intended word sense (based on semantic information it gets a word sense, e.g., lead as a metal), and conclude its search by using a mapping utility that maps from WordNet to the pronunciations located in the phonetic database. This will enable LAD to accommodate different pronunciations of the same word (i.e., lead as in He is the Project lead vs. lead as in I need lead for my pencil). Application Areas LAD is designed to interact with multiple lexical databases in a transparent manner, so that the user-system treats LAD as a single dictionary. The resulting system will be a useful tool for various NLP-based AAC applications. Some applications that could benefit from LAD would be a system with a speech synthesizer needing pronunciation information, and a system with a syntactic-based word predictor using morphological information to predict correct verb forms. LAD will also be able to provide frequency information for systems designed to enable word and/or letter prediction. These systems rely on frequency information to determine appropriate choices for the next word/letter(s) of input. LAD is currently being tested with a semantic parser based on the reasoning principles used in Compansion (discussed in the Section, Application Domain: Augmentative and Alternative Communication, in Chapter 2). This application requires verb case frame information as well as semantic category information and is discussed in more detail in the next chapter. LAD can also provide the necessary information for other NLP-based systems, such as machine translation or syntactic parsers, as discussed at the end of Chapter 2. Object-Oriented Design The object-oriented paradigm is a set of theories, standards, and methods that together represent a way of organizing knowledge (Budd, 1991). Actions occur when a message is sent to an object, the object interprets the message and then performs some method in response to the message. All objects are instances of a class and if the same message is passed to multiple objects from the same class, they perform the same method. The sender of the message to the object does not need to know or even care to know how the action is accomplished, just that it is done (Budd, 1991). To take this into the realm of everyday living, if I call my local florist, Bill, and order a bouquet of red roses for my mother, I do not care about the details of how the florist actually fulfills my request; I just care that the order is filled. In this example, Bill, an instance of the class florist, gets a message, an order of a bouquet of red roses to be sent to Mary Mair, requiring him to use a method, create_red_rose_bouquet, a function providing actions to make up the bouquet of red roses, and another method, delivery, to deliver my order to my mother. In addition, I can be assured that if I call Bill to deliver a vase of pink carnations to my neighbor, Lorraine, Bill will recognize the different variables and use the appropriate methods to carry out this different request. This notion of objects and classes is the basic building block of object-oriented programming. In order to build upon this to increase the power and flexibility of the paradigm the additional notion of inheritance is utilized. Classes can be organized into hierarchical structures that can inherit attributes and methods from the class (or classes) that it is derived from. To take this back into my previous example, Bill and I are both humans and thus inherit a great deal of attributes and knowledge from the class human. I am also a Computer Scientist and thus have a great deal of knowledge from the class Computer_Scientist, while Bill is also a member of the Florist class and thus has a great deal of knowledge associated with the Florist class. In the human class there might be a function detailing the basic actions needed in order to create a bouquet of red roses, such as get a vase, fill it with water and place red roses in the vase. The Florist class needs more detailed information than this. Therefore they would have a method that overrides the basic method in the human class with one that goes into more detail, such as cutting the stems at an angle of 30 degrees while the stem is under water, placing the roses into a vase filled with specially treated nutritious water, and so on. With this brief discussion and example into what the object-oriented paradigm is, the following paragraph will describe the LAD object-oriented design as depicted in Figure 3.2. LAD is based on the object-oriented paradigm. Its base class, lad, contains the virtual methods needed to perform its functions. Presently derived from the lad class are the wordnet class and the case_frame class. Their respective classes contain methods that require more detailed instructions than the basic methods in the lad class. For instance, the is-a function requires information from WordNet in order to return an answer of substance. The lad method for is-a only indicates that it returns 0 (false/NULL). On the other hand the wordnet class is-a method actually searches the WordNet databases and retrieves the is-a information associated with the specified word. The case_frame class is-a method knows how to handle is-a information about particular verb words. For instance, the query break is-a relational would return false, since break is a material verb and not a relational verb. [Figure Diagram] Figure 3.2 LAD Object-Oriented Class Hierarchy As depicted in Figure 3.2, the level directly under the base class, lad, is the first level of derived classes that contain their own methods and data specifications. This enables LAD to inherit some functionality from the base class and overload those methods within their own classes that need to provide additional functionality. The level beneath the case-frame class depicts the two classes that form the case frame class (for more detailed information, see Appendix B); specified case frames, which hold the case and role choices for specific verbs, and filled case frames, which hold the words that have filled a case and its preference rating. The dotted-lines below the specified-case-frame class indicate several derived classes for the different types of verbs (e.g., relational, attributive, oral, written). These classes specify to the system what the toFillPref ratings are on each case of a specified frame, and what categories (with their associated ratings) can fill each case. This hierarchial structure gives LAD the ability to grow and change with relative ease. Extensibility LAD has been designed using an object-oriented paradigm to increase its extensibility and robustness. By using this paradigm, information not currently available to LAD can be added easily without any major changes to the existing code. Instead, a new class can be created which knows how to handle the information accessible in the new lexical resources. If this class is derived from the LAD hierarchy, it will then be readily accessible to LAD. As an example, pronunciations are not available in WordNet or the case-frame secondary database which are currently accessible by LAD. By creating a pronunciation class, instances of pronunciation can be stored in a secondary database and accessed in a means similar to the way verbs with case frames are currently accessed in the case-frame database. The functionality of LAD is general enough to be used by many different derived modules. A resource having functional capabilities not presently available in the LAD design, can be included in LAD with minimal code changes (e.g., add the new virtual functions to the base lad class, derive the new class from lad, define the methods associated with this class, and add calling capabilities to the LAD driver in order to initiate the newly declared and defined functions). In this manner, LAD not only has the ability to increase its lexical and linguistic resources, but it can incorporate new design features with a minimal amount of effort. This is a notable difference and improvement over the design of the system developed by Knight and Luk (1994) mentioned in Chapter 2. They merge their five lexical resources together, and would require substantial alterations to the existing code in order to incorporate new resources, if they were desired. Implementation Status LAD has not been fully implemented, as we felt it was important to complete a full analysis of LAD's functional design, implementing the WordNet and case frame classes with enough functionality so that testing and then evaluation could be performed. Currently implemented are the functions that perform the is-a, list-semantic-categories, has-a and case frame queries (for more details on LAD's functions, see Appendix A) which are needed by the Semantic Reasoning Module (SRM). The following chapter will go into the SRM in more detail. Chapter 4 SEMANTIC REASONING MODULE Motivation LAD was designed to be a useful lexical resource for multiple applications. In order to test LAD's capabilities, the Semantic Reasoning Module [FOOTNOTE: The SRM was based on principles developed in Compansion, but implemented in C++, instead of Lisp, to test out the theoretical and functional design of LAD.] (SRM) was designed and implemented. The SRM takes as input uninflected content words intended as a telegraphic message. [FOOTNOTE: In the input, the verb is identified and the other words are limited to be nouns which play some role with respect to the verb.] The SRM returns a list of all the possible filled case frames which capture possible intended messages associated with the input words. Each filled case frame indicates which role each of the words play with regard to the verb, and has a "case rating" associated with it that can be used to compare the relative goodness of the various generated filled case frames. In order to accomplish this, the SRM obtains case frame and semantic information from LAD. The following sections will describe certain requirements of the SRM and how LAD is able to satisfy them (for more detailed SRM function specification, see Appendix B). Case Frames and Semantic Reasoning The SRM uses two case frame representations to perform semantic reasoning. The first type of case frame is the specified verb case frame, previously discussed in Chapter 2. The second is the filled case frame, described in more detail in the section on Semantic Output in this chapter, which will be used to generate syntactically and semantically correct sentences. The SRM takes as input a verb and one or more uninflected nouns, which are intended to fit together into a well-formed sentence. The system's goal is to determine the proper roles of the content words with respect to the verb, and to generate filled case frames that reflect all possible combinations of these roles. The first interaction between LAD and the SRM consists of the extraction of a verb case frame for the input verb. The information contained within the case frame specifies what roles can be filled, how important it is to fill each of these roles, and the kinds of words that can be used to fill each role. This case frame provides semantic expectations for the remaining words of input. The next interaction between LAD and the SRM is to determine what semantic categories the words of input can fill. In some cases these semantic categories must be mapped by LAD onto categories appropriate for the verb frame by LAD's mapping mechanism (as discussed in Chapter 3). Using the information retrieved by LAD, the SRM can generate all possible filled case frames and present them to the user. The filled case frames (along with their goodness rating) is the SRM output. These could then be taken by another component to generate syntactically and semantically correct sentences for the words of input, and could be returned to the user for feedback as to which generated sentence best matches their intended meaning. Semantic Categories The semantic categories used by the Semantic Reasoning Module capture information that can be associated with words. This information provides distinctions between classes (e.g., animate vs. inanimate) and is motivated by the roles that objects can play with respect to various verbs. These semantic categories are the same categories employed by the Compansion system (see Appendix C). Preferences The job of the SRM is to determine likely sentence meaning. For the SRM, this sentence meaning is captured in a case frame representation based on work by Fillmore (1977). In this representation, the verb is central and the nouns are said to play a small number of roles with respect to the verb. The roles used include: AGEXP (agent/experiencer) is the object doing the action. For us, the AGEXP does not necessarily imply intentionality such as in predicate adjective sentences (e.g., John is the AGEXP in John is happy). THEME is the object being acted upon, while INSTR is the object or tool used in performing the action of the verb. GOAL can be thought of as a receiver, which is not to be confused with the BENEF (beneficiary) of the action. For example, in John gave a book to Mary for Jane, Mary is the GOAL while Jane is the BENEF. We also have a LOC case which captures the location in which the situation is taking place (this case may be further decomposed into TO-LOC, FROM-LOC, and AT-LOC), and TIME which captures time information (this case may also be further decomposed). The final output of the SRM creates a number of filled and rated case frames which place the nouns of the input in all of their reasonable roles. So, for example, the input, mary like car, generates two different filled case frames (one with a rating of 18 and the other with a rating 9): The verb is like The total filled case rating is 18 The agexp->caseName is mary the caseFiller category is human with a rating of 12 The theme->caseName is car the caseFiller category is human with a rating of 6 The verb is like The total filled case rating is 9 The theme->caseName is car the caseFiller category is human with a rating of 6 The benef->caseName is mary the caseFiller category is human with a rating of 3 The SRM must have the ability to reason about the most likely way that the given words can fit into a case frame. In order to do this, a set of preference ratings are associated with each verb case frame which indicate possible ways of completing a case frame. In addition, these preferences are used to rank the filled case frames against each other. There are three different kinds of semantic case preferences: case filler preferences, case importance preferences, and higher-order case preferences. The case filler preferences are found in many NLP case-based systems and indicate preferences for the semantic categories that are most reasonable for filling out a given case (e.g., animate agents are preferred for most verbs), the case importance preferences indicate which cases are most important to fill (e.g., the agent case is generally more important to fill than the beneficiary case), and higher-order case preferences (which capture interactions between the way various cases are filled out). Preference ratings fall on a 1-4 scale: 4 signifies a high preference, while 1 signifies while acceptable, it is only appropriate in special cases. The case filler preferences are used in other semantic representations to indicate preferred filled roles of a particular verb case frame and are based on the case filler preferences described in Preference Semantics (Wilks, 1975). The kinds of objects that could fill a particular role are indicated, along with a rating for each possible role filler type. For example, the preference for filling the BENEF case for break is: ((human 3) (organization 2) (animate 2)). This indicates that given the choice, a human should fill this role, but an organization or an animate object are also reasonable alternatives. Case importance preference ratings indicate what cases are more important to fill for a particular verb. For example, with the case frame shown in Figure 4.1 for like, it is more likely that the role of AGEXP will be filled than any of the other cases. To indicate this, a higher value (four) is given as the preference of filling the AGEXP case, while lower values (one or two) are given as the preference for filling the other cases. The higher-order case preferences are used to compensate for exceptions-to-the-rule situations. For instance, if a non-human animate (e.g., dog) fills the AGEXP role for the verb eat, it is highly unlikely that an INSTR is being used by the dog to do the eating. Yet if a human fills the AGEXP role, this is very reasonable. The idea is to use these higher-order case preferences to compensate for this kind of situation by, in essence, lowering the case importance preference rating. verb - like agexp [toFillPref 4] [[human 3][organization 2][animate 1]] theme [toFillPref 2] [object 3] instr [toFillPref 1] [[cognitive 3] [tool 1]] benef [toFillPref 1] [[human 3][organization 2][animate 2]] loc [toFillPref 1] [place 4] timee [toFillPref 1] [time 4] Figure 4.1 Specified Case Frame for the Verb LIKE These preferences are captured in a set of verb frames which are stored in one of the lexical resources available to LAD, which contain functions for retrieving the verb frames. For example, Figure 4.1 shows the verb case frame for like, the case filler preferences (indicated by toFillPref after each role) indicate a very high preference for filling the AGEXP case, and a preference for filling the THEME case over the other cases. The case filler preferences for each role are captured in the list following each toFillPref, they indicate that a human is preferred for the AGEXP role, however an organization or animate is also acceptable, and that an object is needed to fill the THEME role. Thus, given the input like mary car, mary can fill the roles of the AGEXP or BENEF (though the role of agexp is preferred over the BENEF) and car can fill the role of THEME. SRM and LAD (Knowledge Representations) As was discussed in Chapter 3, LAD relies on WordNet to retrieve semantic categories of words. One problem is that a particular application (e.g., SRM) may use a set of semantic categories which do not exactly match the categories that WordNet uses. This is in fact the case with the SRM since it is based on the case frames developed for the Compansion system (see Appendix C for the SRM's semantic categories). For instance, the SRM refers to a class inanimate, although WordNet does not classify any words as inanimate, but instead classifies them as inanimate_objects. To compensate for this, LAD uses its mapping function to map to and from the WordNet and the SRM's semantic categories. Thus it maps inanimate to inanimate_objects and allows the SRM to function as though WordNet stored the semantic information identically to its needs. Secondary Verb Frames vs. WordNet Verb Frames The verb case frame database has a small number of entries (88 to be exact), while the SRM needs LAD to return verb case frames for a much larger number of verbs. LAD accomplishes this task by not only searching the verb case frame database for case frames, but if that search is empty, proceeding to generate a list of synonyms for the verb and searching the secondary database for a synonymous entry. If an entry is found for one of the verb's synonyms, then that case frame is returned. Part of our future work will investigate how the system might recover if no synonymous verb has an entry. For example, one approach would be to generate a case frame from the WordNet verb frame data (see Appendix D). These case frames will not be as complete as the other case frames (i.e., they may not contain as many semantic categories in their preferences). However they would allow the SRM to continue reasonable processing. Again, as in the case of the semantic category mapping, the verb case frame search is done in a transparent manner. Semantic Output The output from the SRM is currently a list of filled case frames, indicating what role each word of input should play for a particular possible sentence generation, with a total rating on the goodness of the sentence. Figure 4.2 contains a list of the filled case frames for the input sequence break mary window hammer. The preference rating for a sentence is the summation of the ratings on the filled roles in the filled case frame (gotten by multiplying the to fill preference and the case filler preferences). The preference ratings for the filled frames shown in Figure 4.2 are 21, 12, and 12 respectively. From this, the preferred interpretation comes from the first frame which might be generated as Mary broke the window with the hammer. The generator might return all possible generations in such a manner that the sentence with the highest preference rating is returned first, the next highest rated sentence next, and so forth. The user could then select the preferred sentence from those generated, with the most likely generation being at the top of the list. The verb is break The total filled case rating is 21 The agexp->caseName is mary the caseFiller category is human with a rating of 12 The theme->caseName is window the caseFiller category is object with a rating of 3 The instr->caseName is hammer the caseFiller category is tool with a rating of 6 The verb is break The total filled case rating is 12 The theme->caseName is window the caseFiller category is object with a rating of 3 The instr->caseName is hammer the caseFiller category is tool with a rating of 6 The goal->caseName is mary the caseFiller category is human with a rating of 3 The verb is break The total filled case rating is 12 The theme->caseName is window the caseFiller category is object with a rating of 3 The instr->caseName is hammer the caseFiller category is tool with a rating of 6 The benef->caseName is mary the caseFiller category is human with a rating of 3 Figure 4.2 List of Filled Case Frames SRM Processing The SRM can perform very complex processing since many of the input words may appropriately fill several roles. The SRM is required to ensure that: 1) each word of input appears in some role in every generated filled case frame; 2) that no two words fill the same role in a filled case frame; and 3) that every filled case frame has a unique assignment of input words to roles. To handle this complex processing, the SRM uses the following algorithm shown in Figures 4.3 and 4.4. At the start of processing, the algorithm has a list of words, each with an associated rolelist. The rolelist indicates each role a word may play (according to the semantic instructions captured in the case frame) and a preference strength associated with each role that indicates how much that word filling that role would add to the filled frame's overall rating. For clarity, the "failure points" have been left out of the algorithm. For instance, if two words of input could each only fill the same role, the algorithm would not be able to generate any filled case frames. The algorithm has been separated into two figures to depict the two main loops of processing done by the SRM. 1. Check for all words that can only fill one role and place them on the mandatory rolelist. 2. Call ridConflicts to eliminate the roles filled in Step 1 from the remaining word's rolelists. 3. Go back to Step 1 until there are no conflicting roles or words with only one possible role to fill remaining. 4. If there are no words remaining with multiple possible roles, then generate a filled case frame using the words on the mandatory rolelist, otherwise go to Step 5 to handle the recursion required for further processing. Figure 4.3 Algorithm for the Generation of Filled Case Frames (Part 1) At this point, either the processing is completed (in the case where all the words of input can only fill one role once conflicts are eliminated), or else more complex processing needs to proceed (in the case where there are words that can fill multiple roles). 5. Take the first word with multiple possible roles off of the multiple role wordlist. 6. Pop a role off of this word's rolelist and put it on a temporary list. 7. If there are more words on the multiple role wordlist, go to Step 8, otherwise generate a filled case frame using the mandatory list and the temporary list. Put this frame on the list of filled case frames to be returned. If there are no more roles on this word's rolelist, then return the list of filled frames, otherwise go back to Step 6. 8. Check the rolelists of the next word on the multiple role wordlist to see if its first role conflicts with the roles on the temporary list, if it does, skip this role and move to the next role. Repeat Step 8 until a role is reached that does not conflict with any roles on the temporary list. 9. If there are more words on the multiple role wordlist, go back to Step 8, else if there are no more roles on this word's rolelist, add the current role from this word to the frames on the partially filled framelist, otherwise go to Step 10. 10.Generate a filled case frame for each entry remaining on this word's rolelist, and put the frame on a partially filled framelist. Figure 4.4 Algorithm for the Generation of Filled Case Frames (Part 2) This algorithm ensures each word of input is in every generated filled case frame without any words conflicting over a role, and that every generated frame is unique. The next chapter concludes this thesis with an analysis and evaluation of LAD, its design, implementation, testing results and its present resource limitations, and a discussion on future work. Chapter 5 CONCLUSIONS The Language Access Database was designed to give a user access to several on-line lexical and linguistic sources in a unified manner thus providing various different software applications with a convenient source of syntactic, semantic, and other lexical information. This information will enhance the capabilities of systems which attempt to understand natural language, to generate semantically and/or syntactically correct sentences, to produce speech synthesized language, and so on. LAD was designed using the principles of the object-oriented paradigm to allow for flexibility and extensibility. This final chapter examines the goals and accomplishments of LAD, an analysis of the LAD system (e.g., testing and evaluations, resource weaknesses, comparison of output from the SRM's LAD enriched system and the Compansion system), a discussion of the results of LAD, and concludes with a look into the future directions for LAD. Accomplishments The completed LAD design was based on providing traditional dictionary functionality as well as enhanced capabilities to incorporate the semantic and syntactic needs of NLP-based systems. The current implementation encompasses enough of the LAD functionality to be tested and evaluated before a complete implementation is done. In its current form, LAD has access to the WordNet database and a verb case frame database. The Semantic Reasoning Module, used as a testing and validation system, was designed with principles developed in the Compansion system. LAD Implementation Status LAD has been fully designed, but not fully implemented. The reason for this is that LAD has been designed as a fully extensible system. Thus, as new applications need to use LAD, everything is in place to add access to any additional lexical resources they might require. The completed portion of LAD is sufficient for testing its usefulness in the implementation of the SRM. The functionality required for this implementation utilize all of LAD's major design components (i.e., multiple database access, an application-specific database in the case frame database, mapping information required in mapping data from one database terminology to another, and sophisticated access to WordNet's semantic information). Thus the implemented portion of LAD provides significant access to needed information that was not previously available. In addition, the testing of LAD using the SRM validates the design principles upon which LAD is based. Now that the first stage of testing is completed, the results warrant further testing by using LAD in other applications. Currently implemented are the functions needed for the SRM processing. These perform the is-a, list-semantic-categories, has-a and case frame queries. The LAD implementation will proceed to incorporate more lexical resources, as its design intended, and more of the functions as applications and lexical databases warrant. WordNet Usage WordNet has proven to be an excellent choice as the main lexical database due to its broad coverage of words and sense information. It has limitations in certain areas of its lexicon, the most notable being proper names. However this will be compensated for by adding a mapping function using the Moby Words II proper names of people and proper names of places files to their equivalent expressions in WordNet. For instance, the mapping mechanism can be used to map male names like John to Tom (which is included in WordNet), and to map names of cities to equivalent cities (i.e., taking care to map seaports to seaports, major urban areas to other major urban areas, etc.). Moby Words II contains 21,986 names, including 4,946 commonly given female names and 3,897 commonly given male names in English speaking countries, and 10,196 places in the United States. This addition to WordNet will improve one of its weaknesses, although there is still a lack of proper names of products and businesses (e.g., Ford, Chevy, IBM, duPont). This final weakness could be corrected with a flat-file database of names of companies and their products, if needed or desired. SRM and LAD Results The results of the SRM implementation using LAD were very supportive of further work with LAD, and with other applications that will make use of it. There were a total of 32 input test strings processed by the SRM, and the expected results [FOOTNOTE: The SRM produced a list of filled case frames which comply with its complex processing requirements (see Chapter 4, Section SRM Processing), each frame is semantically logical, and at least one of the filled case frames could be used to generate a valid sentence.] were achieved from 29 of these test strings. The three input test strings which did not work in the SRM were due to failures in the mapping function used to map the semantic categories of the SRM to the names found in WordNet. For example, WordNet does not have a definition or sense of the word bone as in a bone a dog would eat. The SRM also uses the category name instrument and WordNet uses instrumentality for the word fork, and the SRM uses the category name time and WordNet uses season and time_of_year for the word summer. Presumably, the mappings in the mapping table (see Appendix C) could be extended to handle these cases. Some of the test strings included in the SRM test were: go mary tom store break mary window hammer break mary window hammer tom break mary hammer rock tom ask mary question hit tom mary dog buy mary book yesterday eat mary slice bread break mary thermostat utter mary sentence steal mary bread go mary restaurant yesterday fly mary europe order mary hamburger tom write mary paper chemistry go mary beijing eat mary hamburger fork remember mary swimming summer eat mary dog bone In the 29 successful test strings, LAD enabled the SRM to produce complete lists of filled case frames represented by the words of input, where each frame contains all of the words and every frame has a different pattern of words for the cases they fill. For example, consider the input string break mary window hammer. The cases that mary can fill are AGEXP (with the semantic category of human and a rating of 12), GOAL (with the semantic category of human and a rating of 3), and BENEF (with the semantic category of human and a rating of 3). The case that window can fill is THEME (with the semantic category fragile and a rating of 1). Finally, the cases that hammer can fill are THEME (with the semantic category object and a rating of 3), and INSTR (with the semantic category tool and a rating of 6). You will notice that mary can fill three different cases, while window can only fill the THEME case, and hammer will only be allowed to fill the INSTR case, as the THEME case will always be filled by window. Therefore the results from the SRM with these words of input should (and do) yield the correct filled case frames (as shown in the next section). The results of the SRM implementation using LAD were very good and show that LAD has a promising future for providing various applications with semantic knowledge. The mapping function needs to be further tested and refined to account for discrepancies, as became evident during the testing session. LAD has thus proven to be a useful tool for providing semantic knowledge. SRM Performance vs. Compansion Performance The SRM does not have the full functionality of Compansion because it cannot generate sentences, nor handle certain exceptional cases. However, a comparison of the SRM-generated filled case frames with the sentences generated from Compansion is feasible by using the same (or equivalent) test strings. The following is an example of the SRM testing results: input string: break mary window hammer The verb is break The total filled case rating is 30 The agexp->caseName is mary the caseFiller category is human with a rating of 12 The theme->caseName is window the caseFiller category is fragile with a rating of 12 The instr->caseName is hammer the caseFiller category is tool with a rating of 6 The verb is break The total filled case rating is 21 The theme->caseName is window the caseFiller category is fragile with a rating of 12 The instr->caseName is hammer the caseFiller category is tool with a rating of 6 The goal->caseName is mary the caseFiller category is human with a rating of 3 The verb is break The total filled case rating is 21 The theme->caseName is window the caseFiller category is fragile with a rating of 12 The instr->caseName is hammer the caseFiller category is tool with a rating of 6 The benef->caseName is mary the caseFiller category is human with a rating of 3 Compansion has a generator which generates sentences from the filled case frames its semantic parser builds. The generator may not be able to generate semantically and syntactically correct sentences from all of the frames, so the output from Compansion is only the sentences it is able to generate. The following is the Compansion output for the same test string as in the previous SRM example: input string: mary break window hammer output--> Mary breaks the window with the hammer. The first filled case frame from the SRM would be used by a generator to create the same sentence as the output from Compansion above. The following two filled case frames might be used to generate the sentence "The window was broken with the hammer for Mary." It is important to note that Compansion takes word order into account, so the previous sentence could not be generated from the ordered input string mary break window hammer. What is encouraging from these results is that for input strings which do not require the extra exception processing Compansion provides, similar results were produced. When Compansion was tested on the same 32 strings that were used to test the SRM the results were that it could successfully process 15 of the 29 strings that the SRM could process and one of the strings that the SRM could not process (mary eat pizza fork). This difference in results is mainly due to the fact that the SRM has a larger vocabulary than Compansion, because the SRM obtains its word knowledge from LAD. While Compansion is limited to a vocabulary of over 1,000 words, the SRM has access to the over 95,000 words in WordNet and once the proper names mapping is completed, it will have an additional 21,986 proper names access from the MobyWords data files. Some of the test input strings that the SRM could handle and Compansion could not were: break mary hammer rock tom break mary hammer rock write mary paper chemistry go mary beijing eat mary slice bread eat mary dinner tom break mary thermostat utter mary sentence steal mary bread corrupt mary morals build mary software design mary software go mary restaurant yesterday fly mary europe In all of the cases above, except for the first three, Compansion's failure was due to lack of access to semantic knowledge about one or more of the words in the input string. In the first three cases, the failure may have been due to a lack of semantic knowledge or due to the generator not being able to formulate a sentence which captured the semantic representation produced by Compansion's semantic parser. For the two cases which failed for both Compansion and the SRM, it was because of vocabulary limitations of Compansion and mapping failures in LAD. Discussion LAD has proven to be a valuable semantic tool, enabling the NLP-based semantic parser, SRM, by providing the knowledge it requires to process a large number of telegraphic inputs. The vocabulary size that LAD provides is significantly larger than the prototype Compansion system, and this greatly increases the variety of input strings the SRM can process. LAD now needs to be provided with access to ad