Using Local Focus to Correct Illegal NP Omissions [1]
(A Ph.D. Proposal)

Linda Z. Suri

Technical Report No. 93-07
March, 1992

Linda Z. Suri
Department of Computer and Information Sciences
103 Smith Hall
University of Delaware
Newark, DE 19716
suri@udel.edu

1 Introduction

Correcting text which is ill-formed with respect to grammar and/or discourse strategies is a challenging problem. We are working on this problem from the perspective of helping deaf writers produce text which conforms to the standard rules of English. [2] This perspective may prove to be particularly interesting since the native language of some deaf writers is American Sign Language (ASL), which differs from English in both its syntax and its discourse strategies and thus may have an interesting influence on written English. ASL is a visual-gestural language whose grammar is distinct from and independent of the grammar of English or any other spoken language [Sto60], [Lid80], [Ing78], [BP78], [BC80], [Bak80], [Pad81], [Pad82], [HS83], [KB79], [BPB83]. In addition to sign order rules, ASL syntax includes systematic modulations to signs as well as non-manual behavior for morphological and grammatical purposes [BC80], [Lid80], [Pad81], [KB79], [KG83], [Ing78], [Bak80]. The modality of ASL encourages simultaneous communication of information, which is not possible with the completely sequential nature of written English.

The work described in this proposal is part of a much larger project. The long term goal of the overall project is to develop an instructional writing tool which will take a writing sample from a deaf person, analyze it to identify deviations from standard English, engage the user in a corrective tutorial dialogue, and generate text which is correct with respect to the context. The overall system can be seen as having two phases.
The identification phase will rely on a grammar of English which has been augmented with a set of syntactic and semantic error productions ([Sle82], [WS83], [WVJ78]) which extend the language accepted by the parser and semantic interpreter to include the types of deviations we expect. The interactive tutorial and correction phase will be driven by annotations on the error productions as well as discourse information which will be tracked through the dialogue.

The work being proposed in this document is part of the correction and tutorial phase. In particular, it focuses on discourse information which must be tracked in order to generate a correction to a particular class of errors. The particular solution that we are proposing is motivated by the hypothesis that the underlying source of these errors is the transfer of a discourse strategy from ASL to English. This hypothesis is substantiated by an analysis of writing samples and a comparison of ASL and English which has led us to conclude that language transfer (LT), if defined broadly enough, can explain many of the errors we have uncovered [3]. This explanation has led us to an algorithm for correcting the particular class of errors we are concentrating on in this thesis. The algorithm works for every instance we have so far uncovered in our samples.

The proposed thesis work focuses on one error class which we have found to be particularly prevalent and interesting: the illegal omission of (both pleonastic [4] and contentful) NP's. The question under investigation for the proposed thesis is how these omissions can be corrected.

In this proposal, we first substantiate the claim of language transfer by briefly describing the sample analysis which motivated our current beliefs. This includes a brief exploration of LT (see [Sur91] for a much more thorough background on LT), a characterization of how LT might manifest itself, and a description of the results of the analysis.
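The identification phase described above can be illustrated with a small sketch. This is not the project's actual parser: the names (ErrorProduction, check_clause) and the feature-dictionary clause representation are hypothetical, and a real error production would live inside a grammar rather than a stand-alone checker. The sketch only shows the key idea that a relaxed rule accepts a known deviation and carries an annotation that later drives the tutorial and correction phase.

```python
from dataclasses import dataclass

@dataclass
class ErrorProduction:
    """A relaxed rule that accepts a known deviation and carries an
    annotation used later by the tutorial/correction phase."""
    name: str
    matches: callable   # predicate over a simplified clause representation
    annotation: str     # message driving the corrective dialogue

def third_person_agreement_error(clause):
    # Accepts e.g. "My brother like to go": a 3rd person singular
    # subject paired with an uninflected (base-form) present verb.
    return (clause["subj_person"] == 3 and clause["subj_number"] == "sg"
            and clause["verb_form"] == "base")

SV_AGREEMENT = ErrorProduction(
    name="missing-3sg-s",
    matches=third_person_agreement_error,
    annotation="Add -s to the verb for a 3rd person singular subject.")

def check_clause(clause, error_productions):
    """Return the annotations of every error production the clause matches."""
    return [p.annotation for p in error_productions if p.matches(clause)]

clause = {"subj_person": 3, "subj_number": "sg", "verb_form": "base"}
print(check_clause(clause, [SV_AGREEMENT]))
```

In the full system the annotations would be attached to productions in the grammar itself, so that the parse that accepted the deviant clause records which relaxations it used.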
While previous work on LT has mostly concentrated on the sentence level, we hypothesize that LT may also occur at the discourse level. That is, not only may a writer transfer syntactic structures and lexical items from a native to a second language, but discourse and cohesion strategies may be transferred as well (and we believe our analysis substantiates this claim). We describe classes of ASL verbs and a discourse strategy of ASL. The rich verb agreement of one class of verbs explains some empty categories (EC's) in ASL, and the discourse strategy (Topic-NP Deletion) explains EC's occurring with ASL verbs with no agreement morphology. We use these facts about ASL (and the existence of other instances of LT between ASL and English) to explain illegal NP omissions in deaf writing samples.

Next, we explain how we can use this information about ASL to correct illegal NP omissions in deaf written English. Basically, the analysis of ASL verbs and EC's indicates that the topic is most likely to be the referent of the omitted NP. Thus an algorithm that tracks topic might successfully fill in the missing NP's. We provide examples of illegal NP omissions in deaf writing samples which we think can be explained by this analysis.

Having chosen this approach to illegal NP correction, we discuss focus (topic) tracking research which has been done in NLP. We explain why we chose one of these (Sidner's focus tracking algorithm ([Sid79], [Sid83])) as the basis for the algorithm we will use. Next, we describe the proposed focusing algorithm, and show how the algorithm predicts the correct missing referent for three deaf writing samples with illegal NP omissions. Each of the examples motivates extensions that needed to be made to Sidner's algorithm. We discuss these along with some open questions about how to further extend the algorithm. The major contribution of this thesis will be the development of the focusing algorithm.
The tracking of focus is important to Natural Language Processing since the recognition of focus is important for understanding discourse, and it is particularly important for anaphora (including ellipsis) resolution. Thus, this focusing algorithm should be useful not only for the larger project of correcting the written English of ASL natives, but for other NLP tasks as well.

2 Writing Sample Analysis

The purpose of this section is to describe the writing sample analysis which has motivated our conjecture that many errors in the writing of ASL natives can be explained as LT. We first describe some background on LT and then explain our analysis.

2.1 LT

The term Language Transfer is commonly used to refer to the use of knowledge of one's native language (L1) in the production and/or comprehension of a second language (L2). Transfer may be positive (in the sense that it may speed the acquisition of the L2); however, it may also result in deviations in L2 production in places where the L1 and L2 differ. While the existence of LT has been a rather controversial subject over the years (see [McL87], [GS83a], [Sur91]), much recent research has provided convincing evidence of LT resulting in the transfer of L1 lexical items, syntax rules and pragmatic production rules to L2 (e.g., [Sch82] and [SR79] as described in [McL87]; [Kle77], [Hak76], and [Gas79] as described in [Gas84]; [GS83a]; and [McL87]).

Given that transfer has been documented between spoken languages, one might ask whether LT could occur between ASL (a visual-gestural language) and written English. [5] At first glance, transfer may seem surprising since the components of ASL grammar and written English grammar are different [Sto60], [BP78], [Pad82], [HS83], [KB79], [BPB83]. ASL grammar components include sign order, morphological modulations of signs, and non-manual behavior which occurs simultaneously with the manual signs [BC80], [Lid80], [Pad81], [KG83], [Ing78], [Bak80].
Written English grammar components include word order, morphological modulations of words, and punctuation, but nothing that clearly corresponds to the simultaneous non-manual behavior found in ASL. On the surface, the fact that ASL and written English occur in different modalities seems problematic as well. However, research shows that much of ASL processing occurs on the side of the brain primarily used for processing spoken and written languages (the left side of the brain), as opposed to the (right) side of the brain primarily used for visual and spatial functions [Sac90]. Thus, we expect transfer is likely to occur between ASL and written English, but it is not immediately clear how the transfer will manifest itself (particularly with respect to the non-manual component of ASL grammar).

2.1.1 A Characterization of LT

Because of the differences (in grammar and modality) between ASL and English, we have attempted to abstractly characterize how languages could differ in a way which is independent of the grammar components. We have identified several ways in which languages may differ which might lead to (negative) transfer.

o Two languages may differ in when they mark a particular feature. As a result, the marking of that feature in the L2 may seem either redundant or overly concise/imprecise from the perspective of the native language. For example, in ASL it is usual to establish tense at the beginning of a discourse, and then not to mark it again until the time frame changes. Of course, in English tense is marked (on the verb) in every finite clause. So, marking tense in every finite clause in English may seem redundant to the ASL native. Transfer of such a feature (i.e., when to mark tense) might explain omission errors (in this case, of tense markings) in the written English of ASL natives.

o Two languages may differ in how they mark a feature.
For example, in ASL, Yes/No questions are distinguished from declarative statements with non-manual markers (facial expression and body shifts). This is radically different from the word order changes which typically mark Yes/No questions in written English. Thus LT might explain errors in Yes/No question formation by the ASL user.

o Languages differ in regard to requiring morphological changes or additional lexical items for strictly syntactic reasons. For example, adding an "s" to a present tense verb for a third person singular subject in English (typically) conveys no extra information, but is a syntactic requirement. There is no close counterpart to this in ASL, which may explain omissions of this morphological item in the written English of ASL natives.

o As with any two languages, English often has two or more words or phrases which correspond to a single ASL sign (or sign sequence), and vice versa. For example, ASL uses the same sign (i.e., lexical item) for "other" and "another". Thus, LT might explain why ASL learners of written English might take some time to learn which word ("other" or "another") to use in English.

2.1.2 Examples of Error Classes Attributable to LT

We collected writing samples from a number of different schools and organizations for the deaf. We concentrated on eliciting samples from Deaf people who are (native) ASL signers. This was done in order to increase the probability of finding errors specific to the deaf population and to allow us to target a more homogeneous population. Thus far, we have analyzed forty-eight Freshman and Sophomore writing evaluation samples from Gallaudet University, a liberal arts university for the deaf, seventeen writing evaluation samples from the National Technical Institute for the Deaf (NTID), twelve first draft papers from students in the high school program at the Margaret S.
Sterck School, a deaf school in Delaware, and five letters and essays written by ASL natives contacted through the Bicultural Center in Washington, DC. The total sample size is approximately 25,000 words.

======================================================================
o Conjunctions: 4
  - Omitted conjunction: 1
  - Inappropriate conjunction: 3
o Prepositions: 66
  - Omitted preposition: 26
  - Inappropriate preposition: 27
  - Extra preposition: 13
o Determiners: 63
  - Omitted determiner: 35
  - Inappropriate determiner: 9
  - Extra determiner: 19
o Incorrect Number on Noun: 23
o Incorrect Subject-Verb Agreement: 11
o Tense and Aspect: 70
  - Dropped tense: 5
  - Other tense/aspect problems: 65
o Mixing up English words or phrases which share a single ASL sign: 12
o BE, HAVE (non-Auxiliary): 16
  - Omitted BE: 9
  - Lack of BE/HAVE distinction: 7
o Other Omitted Main Verbs: 7
o Incorrect WH-phrase: 4
o Adjective Problems: 13
  - Incorrect Adjective Choice: 3
  - Incorrect Adjective Formation: 10
o Incorrect Nominalization: 5
o Relative Clauses: 14
  - Relative pronoun deletion: 4
  - Resumptive pronoun: 1
  - Other: 9
o Pronouns: 12
  - Incorrect pronoun choice (including pleonastic): 7
  - Inappropriate pronoun use (where full definite descriptions are required): 4
  - Lack of pronoun use (overuse of definite descriptions): 1
o Pleonastic Pronoun Deletion: 10
  - Object: 5
  - Subject: 5
o Focus/Discourse Structuring Problems: 49
  - Omission of focused element (subject: 4; object: 4): 8
  - Problems carrying over general/specific description strategies: 5
  - Structuring problems with "because": 8
  - Ambiguous modifier attachment: 1
  - Other (possibly related to carry-over of topic-comment strategies): 27
o Redundancy Problems: 2
o Not Enough Sentence Breaks: 6
o Other: 104 (23% of errors in database)

Table 1: Error Taxonomy
======================================================================

Table 1 contains the error taxonomy [6] we have derived from the analysis of 79 writing samples.
Also included in the table is the number of sentences (out of 214 sentences) which contained a deviation which could be explained by each classification. These numbers are based on only the 17 samples (3313 words) which have already been added into our database. [7] The intent of the numbers is to give the reader an idea of how often the various classes occur in relation to each other. In Appendix B, we give an example for each error class listed in Table 1.

We will show how many of the error classes uncovered in our analysis can be explained as following from one (or more) of the four categories of differences between ASL and written English given above, and thus may be explained by LT. For each illustrated error class, we provide examples of the error class and then explain how it could be captured by our characterization above. (More detail can be found in [Sur91].)

Conjunctions

o Conjunction Deletion:
  - "He taught _ directed, for almost 30 years ..." [8]

While researchers have identified several kinds of conjunctive markings in ASL, from body shifts to particular lexical items ([Pad81], [BS88]), there are many places where an explicit conjunction would be required in written English, but not in ASL. For instance, conjoined verbs do not require an explicit separate lexical item; instead the verbs are signed one after the other [Fan83]. Therefore, it is not surprising that an ASL signer would omit 'and' between (the final and next-to-final) conjoined verbs in written English. This omission could be the result of the marking seeming redundant or radically different to the ASL native.

Subject-Verb Agreement

o Incorrect Subject-Verb Agreement:
  - "My brother like to go..."

In ASL, not all verbs mark subject agreement for person and number. For certain verbs (some directional and classifier verbs) subject agreement is indicated by a change in handshape, a change in movement, or (rarely) the use of an overt NP where it would not normally be needed [Pad81].
There is a large class of verbs in ASL which do not vary in form for person and number of the subject (see [Pad81]). In addition, some directional verbs vary in form according to the person and number of the object [BC80], [Fan83]. That subject-verb agreement is a syntactic constraint in English, coupled with the difference in when and how agreement is marked in the two languages, might explain deviations in marking subject-verb agreement in the written English of ASL natives.

Tense and Aspect

o Dropping tense on verbs (within or across sentences):
  - "We went to see Senator Biden's office ... Then we go to see the Vietnam memorial ...."

o General tense/aspect problems:
  - "Many students rather live at college, than living at home." (Correction: "Many students would rather live at college than live at home.")

In our data we found missing and incorrect tense markings, and missing and incorrect aspect markings. These deviations might be explained by the differences between when and how ASL and English mark tense and aspect. Some ways that English marks tense and aspect are through the use of modals and auxiliary verbs, and through morphological changes to the verbal elements. ASL does not use auxiliaries and it does not modulate verb signs for tense. [9] In ASL, tense is generally established once at the beginning of a discourse (e.g., by using a time sign), and that time frame is understood to persist until the next time frame is established. The dropping of verb tenses in the written English of the deaf might be explained by this; marking tense by modulating every verb in English might seem redundant to someone fluent in ASL. Similarly, the problems we found with the formation of English verb tenses might also be explained by the fact that ASL marks time in a radically different manner from the way that English does (i.e., using a specific time indicator instead of adding phonemes and auxiliary verbs).
English also uses the sequential addition of auxiliaries, prepositional phrases, adverbs, etc. for aspectual distinctions. ASL signs are modified, through changes in movement of the underlying sign, for aspectual changes [KB79]. Adverbial modification is often achieved through facial expression [BS88]. Since the methods of achieving aspectual distinctions in English are so radically different from those of ASL, the above aspectual deviations can be explained by LT.

BE, HAVE

o Missing BE:
  - "Once the situation changes they _ different people."

o Lack of distinction between "be" and "have" (as main verbs):
  - "... some birth controls are side-effect." (Correction: "... have side-effects ...")
  - "I wish to go to Hawaii because it is beautiful and nice weather." (Ellipsis implies "... and is nice weather." Correction: "... and has nice weather")

ASL does not have a "be" sign, which could explain its omission in the written language of the deaf. In ASL, the idea of being is conveyed by radically different means, for instance, by a topic-comment structure. Generally, a topic is set up, and then properties are attributed to the topic (the topic and comment are distinguished non-manually). While ASL does have a "have" sign, it is often omitted if it can be assumed from the context.

Lexical Items

o Mixing up English words or phrases which share a single ASL sign:
  - "Somehow, I am interesting in ASL and I want to learn it."

A single sign in ASL corresponds to both "interesting" and "interested." Thus this error is attributable to LT.

2.2 LT Summary

While we have only explained a few examples, at least 82% of our error codes (which represent a finer distinction than that given in Table 1), and at least 76% of the individual errors reported in Table 1, are attributable to LT in a similar manner. [10] That so many errors fit the characterization above suggests that LT is an important predictor of errors and that we should use this characterization to hypothesize new error classes.
Our characterization may also be useful in developing language tutoring systems for second language learners of other languages.

2.3 Discourse Level Errors and LT

We believe that several of the error classes we have uncovered in the written English of ASL natives are the result of transfer of discourse strategies from ASL to written English. We term the resulting errors discourse errors. They may manifest themselves at the sentence level (resulting in a sentence which is ill-formed syntactically or semantically), or they may only be apparent in a longer stretch of text.

Wilbur [Wil77] reports that the language instruction of the deaf has concentrated on the sentence level and suggests that many errors could result from the writer not understanding when and how to use particular structures (such as relative clauses and pronouns). We believe that the concentration on sentence level teaching may contribute to the discourse errors we have found, and it may result in discourse level errors persisting longer than sentence level errors. Discussions with educators of the deaf and ASL researchers confirmed the idea that a skilled deaf writer may develop his/her writing skills to a point where he or she produces text which lacks discourse cohesion, even though the individual sentences are grammatically correct.

Much of the research on LT has concentrated on how differences in L1 and L2 syntax affect the surface syntactic form of L2 production. Other researchers have explored the effects of features which are more discourse-related; we summarize their work here.

Odlin claims that comprehension and production problems in an L2 may arise due to lack of familiarity with a discourse pattern, or lack of familiarity with culturally specific knowledge. See [Odl89] for many examples of such problems.
Some particularly interesting examples are the less accurate recollection by Americans (as compared with Japanese) of information in a passage written in an indirect form (common in Japanese) [Odl89]; differences in value judgments of writing due to cultural differences with respect to indirectness [Odl89]; and comprehension problems for British students caused by the inclusion of supernatural events in a story from the Kathlamet [Odl89]. These examples point to the facts that different cultures organize information differently, different languages exhibit significantly different levels of directness, and culturally specific knowledge may play a greater role in comprehension than one might expect.

Zobl believed LT could occur as the prolongation of the use of L1 pragmatic strategies involving given versus new information [GS83a]. Thus, LT might result in overuse of a particular pragmatic strategy in an L2 that does not make much use of that particular strategy. Koch and Bartelt have documented similar transfer with respect to overuse of repetition: Koch saw evidence suggesting that Arabic discourse may encourage Arab students to repeat words and phrases in English [Odl89], and Bartelt saw evidence of overuse of repetition in the English writing of Navajo and Apache students [Odl89].

Rutherford argues that whether one's native language is topic-prominent, subject-prominent, or neither, and the extent to which one's native language tends to use word order to express pragmatic information, influence L2 production. The first feature seems to influence how often one uses topic-comment structures, and the second influences the use of dummy subjects (in languages that have dummy subjects). We might also see a lack of use of, or a lack of sensitivity to, pragmatic strategies when someone whose native language is less influenced by pragmatics produces an L2 which is heavily influenced by pragmatics.
Gass ([GS83b], [Gas89]) argued that one must learn not only the possible word orders of an L2, but to what extent surface word order is determined by pragmatic factors and to what extent it is determined by grammatical factors. She studied the L2 production of Italian speakers learning English, and English speakers learning Italian. In Italian, surface word order is largely determined by pragmatics; in English, surface word order is primarily determined by grammatical relationships. She noted that only Italian speakers at advanced levels of English acquisition seemed to recognize the importance of syntax in English.

The work described above suggests that LT can explain the effects of an L1 on an L2 in terms of word order, use or lack of use of dummy subjects, use or lack of use of topic-comment structures, use or lack of use of pragmatic strategies, indirectness, repetitiveness, and use of cultural information. Thus, we propose that differences in language structuring and cohesion strategies must also be examined as sources of potential difficulty for referent formation and anaphora resolution. We will briefly describe the differences between ASL and English with respect to some discourse strategies, in order to explain how we believe LT may manifest itself between ASL and written English in terms of referent formation and anaphora resolution.

3 ASL and English Discourse Strategies and Deletions

3.1 Loci and inflecting verbs

Before describing ASL discourse strategies, it is important to understand two related aspects of ASL: establishing a locus in space as associated with an NP or a referent, and inflecting verbs for agreement. (The following description is largely based on [LM91] since she so clearly and concisely described these features of ASL.)

In ASL, the locus associated with a referent that is present is the location of the referent. Pronominal reference to a present referent is achieved by indicating its locus.
A locus may be indicated by pointing or gazing at the locus, or by using that locus as the starting or end point of a verb. Reference to 1st, 2nd and (a present) 3rd person are achieved in this manner. For referents which are not present, abstract loci can be associated with each referent in the signing space in front of the signer's body. "This is accomplished by producing the sign for the referent at some arbitrary locus in space, or making the sign and then pointing to the locus with the index finger, or by eyegaze in the direction of the locus while making the sign." (p. 25-6, [LM91]) Subsequent pronominal reference to a non-present referent assigned to a locus is achieved in the same manner as that for pronominal reference to present referents. Abstract loci associations persist in discourse until a new framework is established, and the number of such loci, while theoretically unlimited, typically does not exceed 5.

ASL has four types of verbs: 1) inflecting, inflectable or agreeing verbs, 2) plain or non-agreeing verbs, 3) spatial verbs, and 4) classifier verbs [LM91]. (We will discuss only the first two.) Inflecting verbs have very rich subject and object agreement morphology, and plain verbs have no agreement morphology for subject and object. Inflecting verbs are those that are marked for subject and/or object agreement. Subject or object agreement is achieved by a change in the movement of the underlying sign. Specifically, the movement of the sign begins at the locus of the subject, and ends at the locus of the object. [11] For example, if one signs "John kicks Mary", the starting point of the sign KICK is at the locus for John, and the ending point of the sign is at the locus for Mary. "Thus, ASL verbs do not indicate the common person and number agreement, but agreement with actual referents." (p. 29, [LM91]) Padden claims that subject agreement is optional, though there may be restrictions on the optionality [LM91].
If the verb is not inflected for the subject, then the starting point of the verb is in a neutral location in front of the signer's body. On the other hand, object agreement is obligatory if the locus for the object has been established. Plain or non-agreeing verbs do not change in any way based on the subject or object.

3.2 Null Argument Structures in ASL

Lillo-Martin [LM91] identifies two kinds of null-argument structures in ASL: those involving inflecting verbs, and those involving plain verbs. The inflecting verbs in ASL are rich in morphology, like verbs in Italian (which is considered a pro-drop language). Lillo-Martin argues that when an empty category occurs with such a verb (that is, when a subject or object is not explicitly signed for such a verb), the referent of the empty category (EC) is determined by the agreement morphology. In terms of ASL, the agreement morphology causes the verb sign to begin at the locus of the subject, and end at the locus of the object. The arguments of the verb can thus be recovered from these locations. Thus, Lillo-Martin's analysis of EC's with respect to inflecting verbs in ASL is consistent with the analysis of EC's in pro-drop languages (which share the characteristic of having strong morphological markings).

The plain verbs of ASL have no agreement morphology, and thus one would not expect dropped NP's, because the referents couldn't be recovered through verb morphology. However, these verbs do allow null arguments in some contexts. The same is true of Chinese verbs. Lillo-Martin argues that the deletions with respect to these verbs are similar to deletions in other languages (like Chinese), termed discourse-oriented languages, which do not have morphological markings yet do allow NP deletions that can be recovered from context. Her analysis of deletions/EC's with respect to plain verbs in ASL is based on Huang's analysis of EC's in Chinese.
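The contrast between inflecting and plain verbs can be summarized in a small computational sketch. This is an illustrative model only (the dictionary representation, the locus labels, and the function names are our own hypothetical choices, not anything from [LM91]): an inflecting verb's start and end loci identify its subject and object even when no NP is signed, while a plain verb carries no agreement information, so its null arguments must be recovered from the discourse instead.

```python
loci = {}  # locus label -> referent established in the signing space

def establish(locus, referent):
    """Associate a referent with an abstract locus in the signing space."""
    loci[locus] = referent

def recover_arguments(verb):
    """For an inflecting verb, the subject is read off the start locus
    and the object off the end locus; plain verbs have no agreement
    morphology, so nothing can be recovered from the verb itself."""
    if verb["type"] != "inflecting":
        return None, None
    return loci.get(verb["start_locus"]), loci.get(verb["end_locus"])

# "John kicks Mary": KICK moves from John's locus to Mary's locus.
establish("left", "John")
establish("right", "Mary")
kick = {"sign": "KICK", "type": "inflecting",
        "start_locus": "left", "end_locus": "right"}
print(recover_arguments(kick))  # ('John', 'Mary')
```

The None results for plain verbs are exactly the cases where, on Lillo-Martin's analysis, Topic NP Deletion (discussed next) supplies the missing referent.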
Specifically, Lillo-Martin argues that ASL allows Topic NP Deletion, i.e., that the topic of a sentence may be "deleted under identity with a topic of a preceding sentence" [Hua84], and that ASL plain verbs may have an EC subject/object as the result of topic movement. Thus, as Huang argues for Chinese verbs, null argument structures of plain verbs arise when the signer topicalizes the sentence, thus creating an EC subject/object coindexed with the topic, and deletes the topic under coindexation with a topic of a preceding sentence. Languages that allow Topic NP Deletion are said to be discourse-oriented languages. Discourse-oriented languages are very sensitive to pragmatics. English is said to be a sentence-oriented language [LM91]. Sentence-oriented languages do not allow Topic NP Deletion. The point of central importance is that the deletion of an NP that co-refers with the topic of a previous sentence (or discourse topic) is not permissible in English, though it often is in ASL.

4 Transfer of Discourse Strategies from ASL to English

The central hypothesis of the proposed thesis is that the differences between ASL and English at the discourse level may explain some of the cohesion errors in deaf writing noted both in our initial analysis and informally by others. Of particular interest to us are the NP deletion errors which might result from the writer carrying the discourse-oriented aspects of ASL over to English even though English is a sentence-oriented language. In this case, the writer appears to believe that the NP can be omitted because it is a topic of the preceding text, even though English does not allow such omissions. Examples of NP omissions which we think are related to the transfer of ASL discourse strategies include the following [12]:

o "I think that Gallaudet College should require all deaf students to take speech and speechreading courses. Therefore, they can improve their oral skills for their future use.
I am going to tell you that why the deaf student should take_"

o "There are many things I like about NTID. They offer supporting services like interpreters and notetakers for mainstream classes which I had experiences through my public schools. Now NTID/RIT offers same thing that my school offered but only better supporting services. That is why I like about NTID. But one thing worries me that most about NTID/RIT is financial problems. I hope I could find some ways to solve_."

o "First, in summer I live at home with my parents. I can budget money easily. I did not spend lot of money at home because at home we have lot of good foods, I ate lot of foods. While living at college I spend lot of money because _ go out to eat almost everyday. At home, sometimes my parents gave me some money right away when I need _. While in college, I could not ask my parents for money right away because I live in Washington DC and my parents live in Illinois. It is too far."

For each of the above examples, discussions with ASL informants confirm that the omitted items would be understood, and that the corresponding ASL discourse would be acceptable/grammatical if the omitted NP were not signed, pronominally referenced, or indicated by verb agreement. We propose that these (and other NP deletions like them) can be corrected if we track the topic, or, in computational linguistics terms, the local focus of discourse. We propose to do this by developing a modified version of Sidner's focus tracking algorithm [Sid79], [Sid83]. Sidner's algorithm tracks both a local focus and an actor focus. A claim of this work is that the deleted referent can be recovered by using focusing data structures and rules similar to those developed by Sidner for recovering referents of definite pronouns and definite noun phrases.

5 Focusing Algorithm

5.1 Related Work on Focusing

There have been several other approaches to tracking focus of one kind or another through a discourse.
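The correction strategy proposed above can be sketched in miniature. This is a deliberately crude simplification, under the assumption that a single "current topic" is carried forward from clause to clause; the actual proposal uses Sidner-style focusing data structures rather than this one-slot rule, and the clause representation and function name are hypothetical.

```python
def correct_omissions(clauses):
    """Fill illegally omitted subjects with the currently tracked topic.

    Each clause is a dict; a None subject marks an illegal omission
    (e.g. "... because _ go out to eat almost everyday")."""
    topic = None
    corrected = []
    for clause in clauses:
        if clause.get("subject") is None and topic is not None:
            # The omitted NP is assumed to co-refer with the topic,
            # mirroring Topic NP Deletion carried over from ASL.
            clause = {**clause, "subject": topic, "corrected": True}
        if clause.get("subject") is not None:
            topic = clause["subject"]   # crude one-slot topic update
        corrected.append(clause)
    return corrected

text = [
    {"subject": "I", "verb": "spend"},
    {"subject": None, "verb": "go out"},   # "... because _ go out to eat"
]
print(correct_omissions(text)[1]["subject"])  # 'I'
```

A single topic slot fails as soon as the focus shifts and returns, which is precisely why the proposal turns to Sidner's stack-based local focus machinery in the next section.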
Grosz [Gro77], Grosz and Sidner [GS86], and McCoy and Cheng [MC88] describe algorithms for tracking discourse focus, as opposed to local focus. Discourse focus is intended to capture the broad set of things that the discourse is about. These algorithms are concerned with a level of focusing which is too global for tracking the omitted NP's. This belief is supported by tests of our local focusing algorithm on sample texts. In addition, Grosz's work relies on a task model and requires that the structure of the discourse reflect the structure of the task. The discourses that we must handle do not fall under "task oriented" dialogues for which the task model is evident. While we cannot use either Grosz and Sidner's or McCoy and Cheng's models in our work, we should note that Sidner's algorithm and our extensions to her algorithm are intended to track local focus, and are consistent with these other models.

======================================================================
CF    Current Focus
AF    Actor Focus
PFL   Potential Focus List
PAFL  Potential Actor Focus List
EC    Empty Category

Figure 1: Key to Abbreviations
======================================================================

[?] describe an approach for tracking local focus. While the model is similar to Sidner's model in that it has a backward looking center which roughly corresponds to Sidner's CF, and a set of forward looking centers which roughly corresponds to Sidner's potential focus list, the model (as described) leaves many unanswered questions. For instance, it does not include a record of past forward and backward looking centers corresponding to the focus stack in Sidner's algorithm. The stacking mechanism (which we have found the need to extend beyond what Sidner specified) is necessary for recovering the referents of deleted NP's. [Dah86] describes a focusing algorithm which uses syntactic rather than thematic criteria to determine focus preferences.
Sidner's algorithm uses thematic, syntactic and pronominal information to determine focus preferences, and we have found these focus preferences, in the context of our expansion of Sidner's algorithm, to be useful in correcting illegal NP omissions. Thus, we have chosen to explore expansion of Sidner's algorithm (which is relatively well-defined) rather than exploring expansion of another approach. This approach has shown success thus far. We should note that while Sidner's algorithm requires inferencing to confirm the co-specification of an anaphor, any system will need inferencing at least to confirm a co-specification, and confirming a co-specification by inferencing is far easier than selecting a co-specifier by inferencing (as other systems have done). Since Sidner's approach relies heavily on linguistic knowledge, it also relieves us of the burden of representing and reasoning with a significant amount of world knowledge beyond that needed for confirming the co-specification of an anaphor.

5.2 Overview of our Algorithm

Our focusing algorithm is basically what is described in [Sid79] and [Sid83] to track local and actor foci, but we have had to augment the algorithm to handle complex sentence types, and to track some additional information. The proposed focusing algorithm works by recording certain information as a discourse progresses from one sentence to the next. In each (simple) sentence, the actor focus (AF) is identified with the thematic agent of the sentence. (If the sentence has no agent, then the previous AF is retained.) The Potential Actor Focus List (PAFL) contains all NP's that specify an animate element of the database and do not occur in agent position. If a sentence has a pronoun in agent position, the previous AF is chosen for its co-specification. [13] Tracking local focus requires some additional machinery. The first sentence in a text can be said to be about something.
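To make the bookkeeping concrete, the AF update just described might be sketched as follows. This is a minimal Python sketch under our own naming (the class and method names are not part of the proposal), and the animacy check against the database is elided:

```python
# Sketch of actor-focus bookkeeping: the AF is identified with the
# thematic agent; if a sentence has no agent, the previous AF is
# retained; the PAFL holds animate NP's not in agent position.
from dataclasses import dataclass, field
from typing import Optional, List


@dataclass
class ActorFocusState:
    af: Optional[str] = None                      # current actor focus
    pafl: List[str] = field(default_factory=list)  # potential actor focus list

    def update(self, agent: Optional[str], animate_non_agents: List[str]) -> None:
        if agent is not None:
            self.af = agent      # AF := thematic agent of this sentence
        # agentless sentence: the previous AF is simply retained
        self.pafl = list(animate_non_agents)


state = ActorFocusState()
state.update(agent="I", animate_non_agents=["my parents"])
state.update(agent=None, animate_non_agents=[])  # no agent: AF stays "I"
```

Running the two updates leaves the AF as I, illustrating the retention rule for agentless sentences.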
That something is generally different from the actor focus and is called the current focus (CF) of the sentence. [14] The CF can generally be identified via syntactic means, taking into consideration the thematic roles of the elements in the sentence (see Appendix A for description of algorithm). In addition to the CF, an initial sentence introduces a number of other items (any of which can go on to become the focus of the next sentence). Thus, these items are recorded in a potential focus list (PFL). [15] After the first sentence, at any given point in a well-formed text, the writer has a number of options:

o Continue talking about the same thing; in this case, the CF doesn't change.

o Talk about something just introduced; in this case, the CF becomes a member of the previous sentence's PFL.

o Return to a topic of previous discussion. In this case that topic must have been the CF of a previous sentence, or must have been on the PFL of a previous sentence. [16]

The decision (by the reader/hearer/algorithm) as to which of these alternatives was taken is based on the thematic roles (with particular attention to the agent role) held by the anaphoric elements of the current sentence, and whether their co-specification is the CF of the previous sentence, a member of the PFL of the previous sentence, or an element in the CF stack or the PFL stack. Confirmation of co-specifications requires inferencing based on general knowledge and semantics. At each sentence in the discourse, the CF and PFL of the previous sentence are stacked for the possibility of subsequent return. When one of these items is returned to, the stacked CF's and PFL's above it are popped, and are thus no longer available for return.

5.3 Filling in a Missing NP

We propose using information from the focus algorithm to identify the referent of an illegally omitted NP (and extending the focus algorithm to calculate the CF and PFL in the presence of an illegally omitted NP).
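The stacking behavior just described can be sketched as follows. This is an illustrative Python fragment with our own function names; in the real algorithm, whether an item is actually being "returned to" is confirmed by inferencing, not by simple equality:

```python
# Sketch of the focus stacks: each sentence pushes the previous
# (CF, PFL) pair; returning to a stacked item pops everything
# above it, making the popped entries unavailable for return.
def push(stack, cf, pfl):
    stack.append((cf, pfl))


def return_to(stack, target):
    """Pop stacked (CF, PFL) entries until `target` is a stacked CF
    or a member of a stacked PFL; entries above it are discarded."""
    while stack:
        cf, pfl = stack[-1]
        if target == cf or target in pfl:
            return stack.pop()   # the entry being returned to
        stack.pop()              # no longer available for return
    return None


stack = []
push(stack, "HOME", ["SUMMER"])
push(stack, "I", ["MONEY", "EASILY"])
entry = return_to(stack, "HOME")   # discards the ("I", ...) entry first
```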
To identify the referent of a missing NP, we treat the omitted NP (whose position in the sentence will be identified by syntactic analysis) as an anaphor which, like Sidner's treatment of full definite NP's and personal pronouns, co-specifies with an element recorded by the focusing algorithm. We define preferences among the focus data structures which are similar to Sidner's [Sid79], [Sid83]. Essentially, we expect the omitted item to co-specify with a focus item of higher priority than the co-specifier of a pronoun or definite NP. More specifically, when we encounter an omitted NP in other than agent position, we first try to fill the deleted NP with the CF of the immediately preceding sentence. If semantics and general knowledge inferencing cause this co-specification to be rejected, we then consider members of the PFL of the previous sentence for filling the deleted NP. If these too are rejected, we consider stacked CF's and elements of stacked PFL's, taking into account preferences (yet to be determined) among these elements. When we find an element that is acceptable according to syntax, semantics and general knowledge, we fill the empty NP with that element. When we encounter an omitted NP in agent position in a simple sentence or a sentence-initial clause, we first test the previous AF as co-specifier, then members of the PAFL, the previous CF, and finally stacked AF's, CF's and PAFL's. To identify the missing agent NP in a non-sentence-initial clause, our algorithm will first test the agent of the previous clause, and then follow the same preferences just given. Further preferences are yet to be determined, including those between the stacked AF, stacked PAFL, and stacked CF. While filling in an NP is much like finding the co-specifier of any other anaphor, we place the additional constraint that a missing NP in a clause should be filled before the co-specifiers of other anaphora are calculated. We impose this constraint because of the following.
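The preference ordering for a non-agent omitted NP (previous CF, then previous PFL, then stacked elements) amounts to a simple ordered search. In this illustrative Python sketch, `acceptable` stands in for the syntactic, semantic and general-knowledge checks that the proposal delegates to inferencing:

```python
# Sketch of filling an omitted non-agent NP: candidates are tried in
# order of focus preference; the first one not rejected by the
# syntax/semantics/inference check is chosen.
def fill_omitted_np(prev_cf, prev_pfl, stacked_elements, acceptable):
    for candidate in [prev_cf, *prev_pfl, *stacked_elements]:
        if acceptable(candidate):
            return candidate
    return None  # no co-specifier found


# Toy run modeled on Example 3: the check rejects everything but a
# stacked CF, so the stacked element fills the empty NP.
choice = fill_omitted_np(
    prev_cf="DEAF STUDENTS",
    prev_pfl=["THEIR ORAL SKILLS", "THEIR FUTURE USE"],
    stacked_elements=["SPEECH AND SPEECH READING COURSES"],
    acceptable=lambda c: c == "SPEECH AND SPEECH READING COURSES",
)
```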
First, we are assuming that an omitted NP is the most focused element in the sentence. Second, we are assuming [17] that if there is an omitted NP in a clause, all NP's that co-reference that NP in that clause are also omitted (since we think they could also be omitted for the same reason that the first omitted NP was omitted, i.e., they will be understood from the context). [18] Thus, if we were to first compute the co-specifier of another anaphor, that anaphor would be assigned the most focused possible co-specifier; the omitted NP would not be able to co-specify with that most focused co-specifier (under the second assumption) and would thus be forced to co-specify with a less focused item (violating the first assumption).

5.4 Computing the CF

We must decide how to compute the CF in the presence of illegally omitted NP's. As is specified in the algorithm mentioned in section 5.2, the CF of a sentence (in a coherent discourse) will be related to the elements contained in the data structures maintained by the focusing algorithm: it will be the same as the CF of the last sentence, an item introduced in the last sentence and thus a member of the PFL of the last sentence, or an element of the stacked CF's or stacked PFL's. The decision as to which one of these moves is taken is determined by the anaphoric elements in the sentence and their co-specifications. When computing the CF, we treat illegally omitted NP's as anaphora since they (implicitly) co-specify something in the preceding discourse. Sidner's algorithm orders the focusing data structures, giving preference to the previous CF, then the previous PFL, and finally considering the focus stacks, and takes the first such element that has a co-specifier in the current sentence. Exceptions to this ordering occur for either thematic reasons or due to the type of anaphor used.
In keeping the AF and CF different, the algorithm prefers a non-agent anaphor co-specifying a PFL member over an agent co-specifying the CF. The anaphora themselves are prioritized: pronouns are considered better indicators of focus than definite NP's. Thus, the algorithm prefers a PFL member co-specified by a pronoun to the CF co-specified by a full definite description. In determining how the algorithm should compute the CF in the presence of omitted NP's, it is important to remember that discourse-oriented languages allow deletions of NP's that are the topic of the discourse. Thus there is strong evidence that a deleted NP (in the writing of an ASL native) is the intended topic. Note that Sidner prefers pronouns to definite descriptions as the likely CF since pronouns are strong indicators of focus. We include that preference, and add the preference that omitted NP's are preferred to pronouns since they are even stronger indicators of focus (at least in discourse-oriented languages). Thus, in adapting Sidner's algorithm to handle the omitted NP's, we want to prefer the deleted non-agent as the focus, as long as it ties closely to the previous sentence. Thus, we prefer the co-specifier of the omitted non-agent NP as the (new) CF if it co-specifies with either the last CF or a member of the last PFL. If the omitted NP is in agent position, we prefer for the new CF to be a pronominal (or, as a second choice, full definite description) non-agent anaphor co-specifying either the last CF or a member of the last PFL (allowing the deleted agent NP to be the AF and keeping the AF and CF different). [19] If no anaphor meets these criteria, then the members of the CF and PFL focus stacks will be considered, testing a co-specifier of the omitted NP before co-specifiers of pronouns and definite descriptions at each stack level. The description above applies to simple sentences.
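As a rough illustration of these preferences, the following Python sketch ranks CF candidates first by whether the co-specifying anaphor is a non-agent, and then by anaphor strength (omitted NP over pronoun over full definite description). This is our own simplification of the ordering described above, not Sidner's full rule set:

```python
# Sketch of CF selection among the anaphora of the current sentence:
# omitted NP's are the strongest focus indicators, then pronouns, then
# full definite NP's; non-agent anaphora are preferred so that the AF
# and CF stay distinct.
ANAPHOR_RANK = {"omitted NP": 0, "pronoun": 1, "definite NP": 2}


def pick_cf(candidates):
    """candidates: (co_specifier, anaphor_type, is_agent) triples."""
    return min(candidates,
               key=lambda c: (c[2], ANAPHOR_RANK[c[1]]))[0]


# An agent pronoun loses to a non-agent definite description...
cf = pick_cf([("I", "pronoun", True), ("MONEY", "definite NP", False)])
# ...and a non-agent omitted NP beats a non-agent pronoun.
cf2 = pick_cf([("HOME", "pronoun", False), ("MONEY", "omitted NP", False)])
```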
When we have a complex sentence, we compute the CF and PFL for each clause as if the clause occurred as a simple sentence in isolation, and then use this information from each clause to compute the CF and PFL of the entire sentence (as briefly described in section 8). This aspect of the algorithm will be one of the major contributions of this thesis. A fuller description of the Focusing Algorithm can be found in Appendix A.

6 Overview of the Algorithms

The algorithm that we will implement and use to track focus and fill in missing NP's is composed of several smaller algorithms (two of which were discussed in sections 5.4 and 5.3). A high-level description of the modules of this algorithm is in figure 2. The Discourse Initial Algorithm selects the CF and PFL for the first sentence of a discourse. Next a loop is entered to process the remaining sentences in the discourse. The Co-Specification Algorithm takes the CF, AF, PFL, PAFL and focus stacks and uses a set of preferences to determine the co-specifiers of the definite NP's, definite pronouns, and empty NP's in the current sentence (discussed in section 5.3). We should note that the implemented algorithm will only compute the co-specifiers of the empty anaphora. We will not calculate co-specifiers for the non-empty anaphora; instead, we will assume we are given their co-specifications via a set of oracles. Such co-specifications could be calculated by anaphora resolution algorithms similar to Sidner's definite anaphora resolution algorithms. We will not reproduce Sidner's anaphora resolution algorithms here; basically they impose preferences among the focusing data structures and the kinds of co-specification relationships that an anaphor could have with each of the data structures. (Carter extended Sidner's algorithms to handle intrasentential anaphora.) The Focus Tracking algorithm (partially discussed in section 5.4) calculates the new CF for each (non-discourse-initial) sentence, and stacks the previous CF and PFL.
The Actor Focus Algorithm selects an AF and PAFL, and stacks the previous AF and PAFL. The Potential Focus Algorithm calculates the PFL for non-discourse-initial sentences. More complete descriptions of these algorithms can be found in Appendix A. The remaining thesis work will continue to flesh out these algorithms. For instance, the algorithms in the appendix do not handle the complex sentence and related extensions that will be discussed below.

=====================================================================
Discourse Initial Algorithm   % establishes CF, PFL, AF, PAFL of the
                              % first sentence
LOOP
  Co-Specification Algorithm  % selects co-specifications using the
                              % CF, AF, PFL, PAFL, and focus stacks
                              % from the previous sentence
  Focus Tracking Algorithm    % updates the CF and CF stack to
                              % reflect the current sentence
  Actor Focus Algorithm       % updates the AF, PAFL, AF stack and
                              % PAFL stack to reflect the current sentence
  PFL Calculation             % updates the PFL and PFL stack to
                              % reflect the current sentence
GOTO LOOP

Figure 2: Flow of Algorithm Processing
=====================================================================

7 Example 1

Below, we describe the behavior of the extended algorithm on an example from our collected texts containing both a deleted object and a deleted subject.

Example: "(S1) First, in summer I live at home with my parents. (S2) I can budget money easily. (S3) I did not spend lot of money at home because at home we have lot of good foods, I ate lot of foods. (S4) While living at college I spend lot of money because _ go out to eat almost everyday. (S5) At home, sometimes my parents gave me some money right away when I need _. (S6) While in college, I could not ask my parents for money right away because I live in Washington DC and my parents live in Illinois."

After the Discourse Initial Algorithm is applied to S1, the CF is HOME, the PFL contains SUMMER and the LIVE VP, the AF is I, and the stacks are empty.
Focus Data Structures after S1:
  CF        HOME
  PFL       SUMMER, the LIVE VP
  AF        I
  CF stack  empty
  PFL stack empty

For S2, we first apply the co-specification algorithm. Next, the Focus Tracking Algorithm is applied. I is the only anaphor, so it becomes the CF; the PFL contains MONEY, EASILY, and the BUDGET VP; the AF is I; the CF stack contains HOME; and the PFL stack contains the previous PFL.

Focus Data Structures after S2:
  CF        I
  PFL       MONEY, EASILY, the BUDGET VP
  AF        I
  CF stack  HOME
  PFL stack SUMMER, the LIVE VP

S3 is a complex sentence using the conjunction "because." Such sentences are not explicitly handled by Sidner's algorithm. As noted earlier, we will compute the CF and PFL for each clause of a complex sentence as if it were a simple sentence following the preceding sentence, and then calculate the CF and PFL of the whole sentence based on those calculations. [20] For "X BECAUSE Y" sentences, we prefer elements of the X clause as focus candidates to those of the Y clause (see section 8). Thus, we take the CF from the main clause, and rank elements in the main clause before elements in the second clause on the PFL. [21] In this case, the co-specification algorithm will identify several anaphora: "I", "money", and "at home". The CF becomes MONEY since it co-specifies with a member of the PFL and since the anaphor co-specifying the last CF (I) is the agent of S3. Because the PFL algorithm will order the elements of the main clause before the elements in the other clause (the one after "because"), the PFL will contain HOME and the NOT SPEND VP, followed by GOOD FOOD, HOME, and the HAVE VP. The AF remains I. We stack the CF, AF and the PFL of S2.

Focus Data Structures after S3:
  CF        MONEY
  PFL       HOME, the NOT SPEND VP, GOOD FOOD, HOME, the HAVE VP
  AF        I
  CF stack  I, HOME
  PFL stack PFL of S2, followed by the PFL stack of S2

Note that S4 has a missing agent in the second clause.
To identify the missing agent in a non-sentence-initial clause, our co-specification algorithm (which fills empty NP's) will first test the agent of the last clause for possible co-specification. Since this poses no contradiction, the omitted NP is filled with "I". The CF is computed by first considering the first clause of S4, since the X clause is the preferred clause of an X BECAUSE Y construct. Since "money" co-specifies with the CF of S3, and nothing else in the preferred clause co-specifies a member of the PFL, MONEY remains the CF. The PFL contains COLLEGE, the SPEND VP, ALMOST EVERY DAY, the TO EAT VP, and the GO OUT TO EAT VP. We stack the CF, AF, and PFL of S3.

Focus Data Structures after S4:
  CF        MONEY
  PFL       COLLEGE, the SPEND VP, ALMOST EVERY DAY, the TO EAT VP, the GO OUT TO EAT VP
  AF        I
  CF stack  MONEY, I, HOME
  PFL stack PFL of S3, followed by the PFL stack of S3

S5 contains a subordinate clause with a missing object. Our co-specification algorithm first considers the CF, MONEY, as the co-specifier of the omitted NP; semantics, syntax, and inferencing with discourse and general knowledge do not prevent this co-specification, so it is adopted. The co-specification algorithm is then applied to other NP's. The Focus Tracking Algorithm chooses MONEY as the CF, since it is the co-specifier of an omitted non-agent NP occurring in the preferred clause of this sentence (i.e., the verb complement clause).

Focus Data Structures after S5:
  CF        MONEY
  PFL       the NEED VP, MY PARENTS, HOME, the GIVE VP
  AF        I
  CF stack  MONEY, MONEY, I, HOME
  PFL stack PFL of S4, followed by the PFL stack of S4

The next sentence of the text (S6) confirms MONEY as the CF of S5, thus giving some support to our expansion of the focus algorithm to handle omitted NP's.

8 Discussion and Required Extensions

One of the major extensions needed in Sidner's algorithm has to do with handling complex sentences.
Based on a limited analysis of sample texts, we propose that we will compute the CF and PFL of a complex sentence based on a classification of sentence types. For instance, for a sentence of the form "X BECAUSE Y" or "BECAUSE Y, X", we prefer the expected focus of the effect clause as CF, and order elements of the X clause on the PFL before elements of the Y clause. Analogous PFL orderings apply to the other sentence types described below. For a sentence of the form "X CONJ Y", where X and Y are sentences, and CONJ is "and", "or", or "but", we prefer the expected focus of the Y clause. For a sentence of the form "IF X (THEN) Y", we will prefer the expected focus of the THEN clause, while for "X, IF Y", we will prefer the expected focus of the X clause. For a sentence with a verb complement, we will prefer the thematic positions of the verb complement (i.e., order the thematic positions of the verb complement before the thematic positions of the matrix sentence). Further study is needed to determine other preferences and actions (including how to further order elements on the PFL) for these and other sentence types. These preferences will likely depend on thematic roles and syntactic criteria (such as whether an element occurs in the clause that contains the expected CF). We also need to explore AF calculation for complex sentences. At this point, we will pick the AF from the preferred clause, and put the agents in other clauses at the beginning of the PAFL, followed by all other non-agent (animate) NP's. The decisions about how these and other extensions should proceed have been or will be based on analysis of both standard written English and the written English of deaf people. The algorithm will be developed to match the intuitions of native English speakers as to how focus shifts. A second difference between our algorithm and Sidner's is that we stack the PFL's as well as the CF's.
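The clause preferences just listed can be summarized in a small lookup table. This is a Python sketch with our own sentence-type labels; as noted above, further orderings within and beyond these types are left to future study:

```python
# Sketch of which clause's expected focus is preferred as the CF
# for each complex sentence type discussed above.
CLAUSE_PREFERENCE = {
    "X BECAUSE Y":     "X",      # effect clause preferred
    "BECAUSE Y, X":    "X",
    "X CONJ Y":        "Y",      # CONJ is "and", "or", or "but"
    "IF X THEN Y":     "Y",      # the THEN clause
    "X, IF Y":         "X",
    "VERB COMPLEMENT": "complement clause",
}


def preferred_clause(sentence_type):
    """Return the clause whose expected focus is preferred as CF."""
    return CLAUSE_PREFERENCE[sentence_type]
```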
Some NP omissions we have analyzed require returning to a stacked PFL. It seems reasonable that stacking the PFL's may be needed for processing standard English (and not just for our purposes). One reason we believe we need to stack PFL's is that, in complex sentences, focus sometimes seems to revolve around the theme of one clause and later returns to revolve around items in another clause. Further investigation may indicate that we need to add new data structures or enhance existing data structures to handle focus shifts related to these and other complex discourse patterns. We should note that while we prefer the AF as the co-specifier of an omitted agent NP (recall our discussion of step 3, above), Sidner's recency rule [22] suggests that perhaps we should prefer a member of the PFL if it is the last constituent of the previous sentence (since a null argument seems similar to pronominal reference). However, our studies show that a rule analogous to the recency rule does not seem to be needed for resolving the co-specifier of an omitted NP. In addition, Carter [Car87] feels the recency rule leads to unreliable predictions for co-specifiers of pronouns. Thus, we do not expect to change our algorithm to reflect the recency rule. (We also suspect we will abandon the recency rule for resolving pronouns.) The analysis given above for filling in the missing NP of S4 fills the NP based on the focus information from the previous sentence. Alternately, we can consider filling missing NP's in relative clauses with the topic of the main clause. This would be consistent with an analysis where the relative clause is assumed to be topicalized with a topic based on the main clause. We need to explore this alternative analysis for filling in NP's to determine which is more accurate, or under which conditions each analysis should be applied.
Another task is to specify focus preferences among stacked PFL's and stacked CF's, perhaps taking thematic and syntactic information into consideration.

9 Example 2

Example: "There are many things I like about NTID. They offer supporting services like interpreters and notetakers for mainstream classes which I had experiences through my public schools. Now NTID/RIT offers same thing that my school offered but only better supporting services. That is why I like about NTID. (S1) But one thing worries me that most about NTID/RIT is financial problems. (S2) I hope I could find some ways to solve _."

9.1 Discussion and Required Extensions

An important question raised by this example is how to handle a paragraph-initial, but not discourse-initial, sentence. Do we want to treat it as discourse-initial, or as any other non-discourse-initial sentence? At this point, we suggest (based on analysis of samples) that if its sentence type fits a particular class of types, we would use the preferences of the Discourse Initial Algorithm to calculate the CF and PFL (in this sense treating the sentence as discourse-initial), while retaining the CF and PFL stacks and pushing the last CF and PFL (in this way treating the sentence as not discourse-initial). Handling a paragraph-initial sentence in this manner allows certain syntactic structures to play a more prominent role than they would otherwise. Two sentence types that would be included in this class are mentioned by Sidner (p. 284, [Sid83]):

  agent:  "There once was a prince who was changed into a frog."
  object: "There was a tree which Sanchez had planted."

We include sentences starting with "First,", "Second,", "Third,", etc. in this class of sentences. We will explore whether other sentence types should be included in this class. If a paragraph-initial sentence does not fall into this class of sentence types, we will treat it as any other non-discourse-initial sentence.
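This decision procedure can be sketched as follows. The Python fragment below uses an illustrative, incomplete set of sentence-type labels of our own; which types actually belong in the class is explicitly left open above:

```python
# Sketch of the paragraph-initial decision: sentences in a designated
# class use the Discourse Initial preferences for CF/PFL calculation,
# while the focus stacks are retained (and the last CF/PFL pushed).
PARA_INITIAL_CLASS = {"there-insertion", "ordinal-opener"}  # illustrative labels


def paragraph_initial_policy(sentence_type):
    if sentence_type in PARA_INITIAL_CLASS:
        return "discourse-initial CF/PFL preferences, stacks retained"
    return "normal non-discourse-initial processing"


policy = paragraph_initial_policy("ordinal-opener")  # e.g. a "First, ..." sentence
```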
In this example, we will treat S1 as non-discourse-initial. First we use the co-specification algorithm to find co-specifiers of anaphora. Since S1 is a pseudo-cleft agent sentence, using the Focus Tracking Algorithm, we will pick the cleft agent as the CF. Thus, after S1, the CF is FINANCIAL PROBLEMS, the PFL contains THE ONE THING THAT WORRIES ME THE MOST ABOUT NTID/RIT, ME, and NTID/RIT, and the AF is ME. (We stack the CF, AF, and PFL of the previous sentence.)

Focus Data Structures after S1:
  CF        FINANCIAL PROBLEMS
  PFL       THE ONE THING THAT WORRIES ME THE MOST ABOUT NTID/RIT, ME, NTID/RIT
  AF        ME
  CF stack  the CF of the previous sentence, followed by the CF stack of the previous sentence
  PFL stack the PFL of the previous sentence, followed by the PFL stack of the previous sentence

We need to fill in a non-agent empty NP in S2. We first test the previous CF, FINANCIAL PROBLEMS. Since there is no reason to reject the previous CF as the referent of the empty NP, the algorithm fills the empty NP with the old CF. Thus, the empty NP is correctly filled by the algorithm.

10 Example 3

"(S1) I think that Gallaudet College should require all deaf students to take speech and speechreading courses. (S2) Therefore, they can improve their oral skills for their future use. (S3) I am going to tell you that why the deaf student should take _"

10.1 Discussion and Required Extensions

This example illustrates an assumption discussed in section 5.3. The assumption was that if there is an omitted NP in a clause, all NP's that co-reference that NP in that clause are also omitted. As a result, we will reject any filling of an NP that makes it impossible to find the co-specifier of a full definite noun phrase in the same clause. Under the assumption discussed above, the focusing algorithm functions as follows.
After S1, by the Discourse Initial algorithm, the CF is SPEECH AND SPEECH-READING COURSES since that is the theme of the verb complement, and the PFL contains DEAF STUDENTS, the TAKE SPEECH AND SPEECH READING COURSES VP, GALLAUDET COLLEGE, and the REQUIRE VP. By the Actor Focus Algorithm, the AF is DEAF STUDENTS and the PAFL is GALLAUDET COLLEGE.

Focus Data Structures after S1:
  CF        SPEECH AND SPEECH-READING COURSES
  PFL       DEAF STUDENTS, the TAKE SPEECH AND SPEECH READING COURSES VP, GALLAUDET COLLEGE, the REQUIRE VP
  AF        DEAF STUDENTS
  CF stack  empty
  PFL stack empty

Sidner's (third person agent pronoun) anaphora resolution algorithm would correctly predict actor ambiguity for "they". (It isn't clear whether the students will improve their abilities or the courses will.) We will assume our oracles will make the same predictions. If we assume "they" of S2 refers to DEAF STUDENTS, then, by the Focus Tracking Algorithm, since only members of the PFL are co-specified by anaphora, the CF becomes DEAF STUDENTS; by the Potential Focus Algorithm, the PFL contains THEIR ORAL SKILLS, THEIR FUTURE USE, and the CAN IMPROVE VP; and the CF stack contains SPEECH AND SPEECH READING COURSES.

Focus Data Structures after S2 ("they" co-specifies DEAF STUDENTS):
  CF        DEAF STUDENTS
  PFL       THEIR ORAL SKILLS, THEIR FUTURE USE, the CAN IMPROVE VP
  AF        DEAF STUDENTS
  CF stack  SPEECH AND SPEECH READING COURSES
  PFL stack DEAF STUDENTS, the TAKE SPEECH AND SPEECH READING COURSES VP, GALLAUDET COLLEGE, the REQUIRE VP

Then, for S3, the co-specification algorithm tries to fill the empty object NP with the previous CF (the first choice for an omitted non-agent NP). However, we reject this (DEAF STUDENTS) because "deaf students" of "... why the deaf students should ..." can not co-specify with the filled omitted NP (based on our assumption that an omitted NP can not co-specify any non-empty NP in the same clause).
Next, we try filling the NP with members of the PFL; semantics causes the rejection of THEIR ORAL SKILLS and THEIR FUTURE USE, and a VP can not fill an NP. So, we look at the focus stack, which contains SPEECH AND SPEECH READING COURSES. There is no reason to reject this and, thus, the missing NP is correctly filled by the algorithm. If we assume "they" of S2 to refer to SPEECH AND SPEECH READING COURSES, then since "they" co-specifies the previous CF of S1, and "their oral skills" and "their future use" both co-specify DEAF STUDENTS, which is a member of the previous PFL, we must choose the new CF from among the previous CF and these PFL members. Since the anaphor co-specifying the CF is in agent position, we shift the focus to a member of the PFL co-specified by a non-agent. In this case, only one member of the PFL is co-specified by the remaining anaphora, so we select it (DEAF STUDENTS) as the CF. (We need to expand the algorithm to handle multiple anaphora co-specifying multiple members of the PFL.) The new PFL contains THEIR ORAL SKILLS, THEIR FUTURE USE, and the CAN IMPROVE VP. SPEECH AND SPEECH READING COURSES is pushed on the CF stack.

Focus Data Structures after S2 ("they" co-specifies SPEECH AND SPEECH READING COURSES):
  CF        DEAF STUDENTS
  PFL       THEIR ORAL SKILLS, THEIR FUTURE USE, the CAN IMPROVE VP
  AF        DEAF STUDENTS
  CF stack  SPEECH AND SPEECH READING COURSES
  PFL stack DEAF STUDENTS, the TAKE SPEECH AND SPEECH READING COURSES VP, GALLAUDET COLLEGE, the REQUIRE VP

The analysis for filling the empty NP of S3 in this case (i.e., where "they" co-specifies SPEECH AND SPEECH READING COURSES) is the same as for the first case, since the contents of the focus data structures are the same after both case analyses of S2.
While both analyses yield the correct result for filling in the empty NP of S3, considering all possibilities for an ambiguous pronoun requires unnecessary work; we therefore anticipate that in the written English tutorial system we will query the user as to the correct referent when we encounter such an ambiguity.

11 Conclusions

We have discussed proposed extensions to Sidner's algorithm to track local focus in the presence of illegally omitted NP's, and to use the extended focusing algorithm to identify the intended co-specifiers of omitted NP's. This strategy is reasonable since LT may lead a writer to use a rule of discourse-oriented ASL which allows the omission of an NP that is the topic of a preceding sentence when writing sentence-oriented English. The focus algorithm is potentially beneficial for correcting other errors in deaf writing. For example, it may be useful in identifying the intended referent when the writer has used a pronoun where a full definite description is required (to avoid ambiguity). The major contribution of this thesis is the provision of a focusing algorithm that is more detailed and realistic with respect to English, especially for complex sentence types. Another contribution of this thesis is further documentation that there is transfer of discourse strategies in the language production of a second language learner. Additionally, it develops a methodology for capturing a particular class of errors expected to occur in the production of a sentence-oriented language being acquired by a speaker/signer of a discourse-oriented language.

12 Plan

We plan to study more text samples from ASL natives and from native speakers of English to test the proposed extensions and to identify extensions that will address other focusing questions which we've identified. In particular, we will look for other examples of NP omissions.
We will implement the focusing algorithm, assuming as input a GB-like syntactic parse tree and a semantic representation (as described in section 14.1) of the sentence. The co-specification confirmation will be done by an oracle, since the required inferencing is beyond the scope of this work. The input requirements and output behavior of the oracle will be specified. Finally, we will test our LT hypothesis and the extended focusing algorithm by checking whether the algorithm can correctly identify the omitted NP's in deaf writing samples which are different from the examples used in developing the algorithm.

13 Acknowledgments

We would like to thank John Albertini of the National Technical Institute for the Deaf (NTID), Bob McDonald of Gallaudet University, Lore Rosenthal of the Pennsylvania School for the Deaf, George Schellum (formerly) of the Margaret S. Sterck School, and MJ Bienvenu of the Bicultural Center for helping us gather writing samples. A good part of our knowledge of ASL comes from discussions with ASL signers. We would like to thank our informants, April Nelson of Rosemont College, Don Ruble of Bloomsburg State College, and Jean Quillen and Carmine Salvato of the Pennsylvania School for the Deaf. We also thank Lore Rosenthal of the Pennsylvania School for the Deaf for interpreting for us. In addition, we want to thank the numerous people from deaf schools and organizations who have discussed this project with us. We thank Julie Van Dyke for her implementation of the English grammar, and Jeff Reynar for his implementation of the error productions which will be used in the identification phase of the eventual overall system. We thank Karen Hamilton for the implementation of the database retrieval functions.

References

[Bak80] C. Baker. Sentences in American Sign Language. In C. Baker and R. Battison, editors, Sign Language and the Deaf Community, pages 75-86. National Association of the Deaf, Silver Spring, MD, 1980.
[BC80] C. Baker and D. Cokely. American Sign Language: A Teacher's Resource Text on Grammar and Culture. TJ Publishers, Silver Spring, MD, 1980.
[BP78] C. Baker and C. Padden. Focusing on the non-manual components of American Sign Language. In P. Siple, editor, Understanding Language through Sign Language Research, pages 27-58. Academic Press, New York, 1978.
[BPB83] K. Bellman, H. Poizner, and U. Bellugi. Invariant characteristics of some morphological processes in American Sign Language. Discourse Processes, 6:199-223, 1983.
[BS88] C. Baker-Shenk. Comparative linguistic analysis for interpreters. In D. Cokely, editor, Sign Language Interpreter Training Curriculum, pages 84-108. University of New Brunswick, Fredericton, NB, 1988.
[Car87] David Carter. Interpreting Anaphors in Natural Language Texts. John Wiley and Sons, New York, 1987.
[Dah86] Deborah A. Dahl. Focusing and reference resolution in PUNDIT. In Proceedings of the 1986 National Conference on Artificial Intelligence, pages 1083-1088, Philadelphia, PA, August 1986.
[Fan83] Lou Fant. The American Sign Language Phrase Book. Contemporary Books, Inc., Chicago, 1983.
[Fil68] C. J. Fillmore. The case for case. In E. Bach and R. Harms, editors, Universals in Linguistic Theory, pages 1-90. Holt, Rinehart, and Winston, New York, 1968.
[Gas79] Susan Gass. Language transfer and universal grammatical relations. Language Learning, 29:327-344, 1979.
[Gas84] S. Gass. A review of interlanguage syntax: Language transfer and language universals. Language Learning, 34(2):115-132, 1984.
[Gas89] S. M. Gass. How do learners resolve linguistic conflicts? In S. M. Gass and J. Schachter, editors, Linguistic Perspectives on Second Language Acquisition, chapter 8, pages 183-199. Cambridge University Press, New York, 1989.
[Gro77] Barbara Grosz. The representation and use of focus in dialogue understanding. Technical Report 151, SRI International, Menlo Park, CA, 1977.
[GS83a] S. Gass and L. Selinker, editors. Language Transfer in Language Learning. Newbury House, Rowley, MA, 1983.
[GS83b] S. M. Gass and L. Selinker. Introduction to section 3. In S. Gass and L. Selinker, editors, Language Transfer in Language Learning. Newbury House, Rowley, MA, 1983.
[GS86] Barbara J. Grosz and Candace L. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175-204, July-August 1986.
[Hak76] K. Hakuta. A case study of a Japanese child learning English as a second language. Language Learning, 26:321-351, 1976.
[HS83] R. J. Hoffmeister and C. Shettle. Adaptations in communication made by deaf signers to different audience types. Discourse Processes, 6:259-274, 1983.
[Hua84] C.-T. James Huang. On the distribution and reference of empty pronouns. Linguistic Inquiry, 15(4):531-574, Fall 1984.
[Ing78] R. M. Ingram. Theme, rheme, topic and comment in the syntax of American Sign Language. Sign Language Studies, 20:193-218, Fall 1978.
[KB79] E. S. Klima and U. Bellugi. The Signs of Language. Harvard University Press, Cambridge, MA, 1979.
[KG83] J. Kegl and P. Gee. Narrative/story structure, pausing and American Sign Language. Discourse Processes, 6:243-258, 1983.
[KK78] Richard R. Kretschmer Jr. and Laura W. Kretschmer. Language Development and Intervention with the Hearing Impaired. University Park Press, Baltimore, MD, 1978.
[Kle77] H. Kleinmann. Avoidance behavior in adult second language acquisition. Language Learning, 27:93-108, 1977.
[Lid80] Scott K. Liddell. American Sign Language Syntax. Mouton Publishers, 1980.
[LM91] Diane C. Lillo-Martin. Universal Grammar and American Sign Language. Kluwer Academic Publishers, Boston, 1991.
[MC88] Kathleen F. McCoy and Jeannette Cheng. Focus of attention: Constraining what can be said next. In C. L. Paris, W. R. Swartout, and W. C. Mann, editors, Proceedings of the 4th International Workshop on Natural Language Generation, Santa Catalina Island, July 1988. Kluwer Academic Publishers, Boston.
[McL87] R. McLaughlin. Theories of Second-Language Acquisition. Edward Arnold, London, 1987.
[Odl89] T. Odlin. Language Transfer. Cambridge University Press, New York, 1989.
[Pad81] C. Padden. Some arguments for syntactic patterning in American Sign Language. Sign Language Studies, 32:239-259, Fall 1981.
[Pad82] C. Padden. Interaction of Morphology and Syntax in American Sign Language. PhD thesis, UCSD, 1982.
[Pad88] C. Padden. Interaction of Morphology and Syntax in American Sign Language. Garland Publishing, Inc., New York, 1988.
[PQ73] D. Power and S. Quigley. Deaf children's acquisition of the passive voice. Journal of Speech and Hearing Research, 16:5-11, 1973.
[QP84] S. P. Quigley and P. V. Paul. Language and Deafness. College-Hill Press, Inc., San Diego, 1984.
[QPS77] S. P. Quigley, D. J. Power, and M. W. Steinkamp. The language structure of deaf children. The Volta Review, 79(80):72-84, February-March 1977.
[QSW74] S. P. Quigley, N. L. Smith, and R. B. Wilbur. Comprehension of relativized sentences by deaf students. Journal of Speech and Hearing Research, 17:325-341, 1974.
[QWM76] S. Quigley, R. Wilbur, and D. Montanelli. Complement structures in the language of deaf students. Journal of Speech and Hearing Research, 19:448-457, 1976.
[RQP76] W. K. Russell, S. P. Quigley, and D. J. Power. Linguistics and Deaf Children: Transformational Syntax and Its Application. The Alexander Graham Bell Association for the Deaf, Inc., Washington, D.C., 1976.
[Sac90] Oliver W. Sacks. Seeing Voices. University of California Press, Berkeley and Los Angeles, CA, 1990.
[Sch82] J. Schumann. Simplification, transfer and relexification as aspects of pidginization and early second language acquisition. Language Learning, 33:337-366, 1982.
[Sid79] Candace L. Sidner. Towards a Computational Theory of Definite Anaphora Comprehension in English Discourse. PhD thesis, MIT, June 1979.
[Sid83] Candace L. Sidner. Focusing in the comprehension of definite anaphora. In Robert C. Berwick and Michael Brady, editors, Computational Models of Discourse, chapter 5, pages 267-330. MIT Press, Cambridge, MA, 1983.
[Sle82] D. Sleeman. Inferring (mal) rules from pupil's protocols. Proceedings of ECAI-82, 9:160-164, 1982.
[SR79] J. Schachter and W. E. Rutherford. Discourse function and language transfer. Working Papers on Bilingualism, 19:1-12, 1979.
[Sto60] W. C. Stokoe, Jr. Sign language structure. Studies in Linguistics: Occasional Papers, (8), 1960.
[Str88] Michael Strong. Language Learning and Deafness. Cambridge University Press, New York, 1988.
[Sur91] Linda Z. Suri. Language transfer: A foundation for correcting the written English of ASL signers. Technical Report TR-91-19, Dept. of CIS, University of Delaware, 1991.
[Wil77] R. B. Wilbur. An explanation of deaf children's difficulty with certain syntactic structures of English. The Volta Review, 79(80):85-92, February-March 1977.
[WS83] Ralph M. Weischedel and Norman K. Sondheimer. Meta-rules as a basis for processing ill-formed input. American Journal of Computational Linguistics, 9(3-4):161-176, 1983.
[WVJ78] Ralph M. Weischedel, Wilfried M. Voge, and Mark James. An artificial intelligence approach to language instruction. Artificial Intelligence, 10:225-240, 1978.

14 Appendix A

All of the algorithms described in this thesis assume the following as input: a syntax tree and a semantic representation of the sentence which indicates which NP fills which thematic role [Fil68]. Further input specifications are given for each particular algorithm.

14.1 Discourse Initial Algorithm

The CF and PFL of a discourse-initial sentence are calculated slightly differently than for a non-discourse-initial sentence. Sidner referred to this algorithm as the Expected Focus Algorithm. We will use her algorithm, but refer to the output as the CF and PFL (rather than the expected focus and DEF) for simplicity.
The algorithm relies on syntax in cases where the syntax is a strong indicator of focus (e.g., there-insertion sentences). If this is not the case, then the thematic roles of the sentence (giving preference to theme) are taken as the indicators of focus.

IF (the sentence is an is-a sentence) THEN
    CF = the subject of the sentence
ELSE IF (the sentence is a there-insertion sentence) THEN
    CF = the object of the there-insertion sentence
ELSE IF (THEME is a verb complement) THEN
    CF = THEME of the verb complement
ELSE % THEME is not a verb complement
    CF = THEME;
PFL = all other thematic positions, with the AGENT last, followed by the verb phrase;
IF the sentence has an agent THEN
    AF = agent
ELSE
    retain AF;
PAFL = all NP's;

Extensions related to those discussed for the PFL in sections 7 and 8 will be needed to calculate the CF and PFL of a discourse-initial complex sentence.

14.2 Filling in a Missing NP (Co-specification Algorithm)

Missing non-agent NP:
    Try to fill the empty NP with the CF
    Try to fill the empty NP with members of the PFL
    Try to fill the empty NP with stacked CF's and stacked PFL elements, under preferences yet to be determined. (For example, should one go through all CF's before PFL's, or go down the CF and PFL stacks layer by layer?)

Missing agent NP:
    IF trying to fill an agent NP in a simple sentence or in the first clause of a complex sentence THEN
        Try to fill the empty NP with the AF
        Try to fill the empty NP with the PAFL
        Try to fill the empty NP with the CF
        Try to fill the empty NP with the stacked AF's and PAFL's
    ELSE IF trying to fill an NP in agent position in other than the first clause of a complex sentence THEN
        Try to fill the empty NP with the agent of the previous clause
        Try to fill the empty NP with the AF
        Try to fill the empty NP with the PAFL
        Try to fill the empty NP with the CF
        Try to fill the empty NP with the stacked AF's and PAFL's
    (IF discourse-initial sentence, THEN fill with "I")

(For anaphora which are non-empty, we rely on oracles to give us the co-specifiers.)
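As a rough sketch (ours, not the proposed implementation), the preference ordering for a missing non-agent NP might be coded as follows, with `acceptable` standing in for the semantic oracle and with one arbitrary choice made for the open question of stack ordering (all CF's before stacked PFL elements):

```python
# Sketch of the co-specification algorithm for a missing non-agent NP.
# `acceptable` stands in for the semantic/selectional check done by an
# oracle; the interleaving of the stacks is an open question in the
# text, so the CF stack is (arbitrarily) exhausted first here.

def fill_missing_nonagent_np(cf, pfl, cf_stack, pfl_stack, acceptable):
    """Return the first focus-structure element that semantics accepts
    as the filler for an omitted non-agent NP, or None."""
    # 1. Try the Current Focus.
    if acceptable(cf):
        return cf
    # 2. Try members of the Potential Focus List.
    for cand in pfl:
        if acceptable(cand):
            return cand
    # 3. Fall back to stacked CF's (most recent first), then to
    #    stacked PFL elements.
    for cand in reversed(cf_stack):
        if acceptable(cand):
            return cand
    for old_pfl in reversed(pfl_stack):
        for cand in old_pfl:
            if acceptable(cand):
                return cand
    return None

# The S3 example above: semantics rejects the CF and the PFL members,
# so the stacked CF fills the empty NP.
filler = fill_missing_nonagent_np(
    cf="DEAF STUDENTS",
    pfl=["THEIR ORAL SKILLS", "THEIR FUTURE USE"],
    cf_stack=["SPEECH AND SPEECH READING COURSES"],
    pfl_stack=[],
    acceptable=lambda c: c == "SPEECH AND SPEECH READING COURSES",
)
print(filler)  # SPEECH AND SPEECH READING COURSES
```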
14.3 Proposed Focus Tracking Algorithm

Sidner mentions four sentence types which strongly mark focus and are usually not discourse initial (p. 284, [Sid83]):

pseudo-cleft agent: "The one who ate the rutabagas was Henrietta."
pseudo-cleft object: "What Henrietta ate was the rutabagas."
cleft agent: "It was Henrietta who ate the rutabagas."
cleft object: "It was the rutabagas that Henrietta ate."

To recognize the strong focus-marking tendencies of these syntactic structures, the first thing the focusing algorithm does is test whether the sentence is of one of these types and, if so, set the CF accordingly:

IF cleft or pseudo-cleft THEN
    IF the cleft item is not the previous CF, and some piece of the non-clefted item co-specifies with something in the focus data structures THEN
        CF = the cleft item
    ELSE
        the sentence is incoherent

The Focus Tracking Algorithm is used to update the CF based on the focusing data structures and the co-specifications of the anaphora in the current sentence. The algorithm here is based on Sidner's algorithm, and we have indicated which parts of the Focus Tracking Algorithm correspond to which steps of Sidner's algorithm by including step numbers in the comments. We have omitted the steps corresponding to do-anaphora (since we do not handle verbal anaphora) and focus sets (for clarity in the algorithm presentation; focus sets are rarely needed). Additions have been made to handle tracking focus in the presence of omitted NP's. Further modifications to handle complex sentence types must be made.

Input:
o CF (Current Focus) - the focus of the previous sentence. (Based on the Discourse Initial Algorithm at the start of sentence 2; otherwise the CF is determined by the last iteration of this algorithm.)
o PFL - the Potential Focus List from the previous sentence. (Based on the Discourse Initial Algorithm if on sentence 2; based on the PFL algorithm otherwise.)
o CF, PFL stacks - the history of past CF's and PFL's; both empty on sentence 2
o Information on the anaphora in the sentence and their co-specifications with elements in the focusing data structures.

Note: In what follows, the term "anaphor" includes an omitted NP.

Stack the old CF;
IF cleft or pseudo-cleft THEN
    % Strong syntactic indicators override the usual rules
    % based on thematics, syntax and focus history
    IF the cleft item is not the previous CF, and some piece of the non-clefted item co-specifies with something in the focus data structures THEN
        CF = the cleft item
    ELSE
        the sentence is incoherent
ELSE IF (there are multiple anaphora, at least one co-specifying the CF and at least one co-specifying something on the PFL) THEN
BEGIN % step 3
    IF there was an omitted non-agent NP AND (the omitted non-agent co-specifies the CF or something on the PFL) THEN
        CF = the omitted non-agent NP
    ELSE IF (there are anaphora co-specifying the CF and some members of the PFL) THEN
        % Need to expand this part to handle multiple anaphora
        % co-specifying multiple members of the PFL
        IF (the co-specifier of the CF is a non-agent AND the co-specifier of the PFL is an agent) THEN
            retain the CF as focus
        ELSE IF (the co-specifier of the CF is an agent AND the co-specifier of the PFL is a non-agent) THEN
            make the new CF the old PFL element
            (for multiple PFL co-specifications, prefer pronouns over full definite NP's, and consider PFL order (to be determined) if the choice is still ambiguous)
        ELSE IF (the CF is co-specified by a non-agent AND the PFL member is co-specified by a non-agent) THEN
            IF only the PFL member is mentioned by a pronoun THEN
                make the CF the PFL member
            ELSE
                retain the CF
END % step 3
ELSE IF (the CF is co-specified by an anaphor, but no member of the PFL is co-specified by an anaphor) THEN
BEGIN % step 4
    retain the CF as focus
END % step 4
ELSE IF (anaphora co-specify members of the PFL, but no anaphor co-specifies the CF) THEN
BEGIN % step 5
    IF (only one member of the PFL is co-specified) THEN
        CF = that member of the PFL
    ELSE
        choose the CF in the manner suggested by the ordering of the PFL (to be determined)
END % step 5
ELSE IF (there is an omitted NP in agent position) THEN
    % since we are in this step, we know no anaphor
    % co-specifying the CF or the PFL was found
    CF = the omitted NP
ELSE IF (the anaphora co-specify a member of the focus stack) THEN
    % (but no anaphor co-specifies the CF or a PFL member)
BEGIN % step 6
    move the CF to the stack member by popping the stack
END % step 6
ELSE IF ((no anaphora co-specify any of the CF, PFL, or focus stack) AND (the CF can fill a non-obligatory case OR the VP is related to the CF by nominalization)) THEN
    % step 8
    retain the CF
ELSE IF (no foci are mentioned) THEN
BEGIN % step 10
    retain the CF as focus;
    for any unspecified pronoun, the non-antecedent pronoun condition holds
END % step 10

14.4 The PFL Algorithm - How to Compute the PFL for a Simple Non-Discourse-Initial Sentence

At the end of processing each non-discourse-initial sentence, we compute the PFL. For any simple sentence, the Potential Focus List consists of a list of all elements in the knowledge network which are specified by NP's filling a thematic role [23], excluding an NP in agent position and excluding the NP which co-specifies the CF, followed by the verb phrase. This description fits that of Sidner's PFL algorithm. We propose to explore whether we can further order elements of the PFL based on thematic, syntactic or other criteria. One possibility is to order the elements by favoring elements filling obligatory roles over those filling non-obligatory roles. We also plan to explore how (i.e., in what form) the VP should be included on the PFL. At this point, we believe that since all NP's related to the VP are already on the list, it may suffice to put the verb on the list in order to handle nominalizations that serve as anaphora. We do not intend to handle VP anaphora.
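The PFL computation for a simple sentence can be sketched in a few lines. This is our own illustration (the `roles` mapping and names are hypothetical); it encodes only the base description above, not the proposed orderings or the complex-sentence extensions:

```python
# Sketch of the PFL computation for a simple, non-discourse-initial
# sentence: all thematic-role fillers except the agent and the element
# co-specifying the CF, followed by the verb phrase.

def compute_pfl(roles, cf, verb_phrase):
    """roles maps thematic roles to the knowledge-network elements
    specified by the NP's filling them."""
    pfl = [filler for role, filler in roles.items()
           if role != "AGENT" and filler != cf]
    pfl.append(verb_phrase)
    return pfl

# Hypothetical role assignment for a sentence like S2 of the example.
pfl = compute_pfl(
    roles={"AGENT": "DEAF STUDENTS",
           "THEME": "THEIR ORAL SKILLS",
           "GOAL": "THEIR FUTURE USE"},
    cf="DEAF STUDENTS",
    verb_phrase="CAN IMPROVE VP",
)
print(pfl)  # ['THEIR ORAL SKILLS', 'THEIR FUTURE USE', 'CAN IMPROVE VP']
```

Note that the agent exclusion and the CF exclusion are independent checks: here the agent NP also co-specifies the CF, but in general either condition alone removes a candidate.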
In sections 7 and 8, we discussed how we need to extend this algorithm to handle complex sentences.

14.5 AF Algorithm

PROC calculate the AF;
BEGIN
    Stack the AF and PAFL;
    IF the sentence has an agent THEN
        AF = agent
    ELSE
        retain AF;
    PAFL = all NP's;
END

15 Appendix B

o Conjunctions: 4
  - Omitted conjunction: 1
    "He taught _ directed, for almost 30 years ..."
  - Inappropriate conjunction: 3
    "Other thing that I don't like is some oral people talking with me without sign language but I can only understand in body language with oral."
    "my classmate is deaf and some of them can hear a few things."
o Prepositions: 66
  - Omitted preposition: 26
    "My brother like to go _ Castle Mall."
    "... the sign 'ONE-DAY-PAST' is glossed _ such words..."
  - Inappropriate preposition: 27
    "My dolls are hanging at my wall."
  - Extra preposition: 13
    "We help with each other with problems or anything."
o Determiners: 63
  - Omitted determiner: 35
    "Then we ate in _ bus."
  - Inappropriate determiner: 9
    "I will to build the sandcastle." [24]
  - Extra determiner: 19
    "A very little study was done..."
o Incorrect number on noun: 23
    "... in several language."
o Incorrect subject-verb agreement: 11
    "My brother like to go..."
o Tense and aspect: 70
  - Dropped tense: 5
    "The women's dormitory was clean and smell good."
  - Incorrect passive formation:
    "...they were not permit to enter due to their clothes."
    "Suppose it was someone in your family that getting killed."
  - Incorrect BE/HAVE/DO auxiliary pairing:
    "They were both English teachers and I do not heard from them since I left to college." ("... have not ...")
  - Verb subcategorization problems:
    "I really enjoyed to work at my father's office."
    "Third, Gallaudet will have more people interested to enroll Gallaudet University."
  - Problems related to use of "to":
    "The boys drive car and to listen the music."
  - Extra, incorrect or omitted modal: 2
    "All persons guilty of drunk driving _ be sent to jail." ("should be sent")
    "I have more positive to do and _ handle myself." ("can handle")
    "They should need to communicate more and meet more people to communicate each other." ("They need")
    "I might need more time to find right people or my reputation will become bad." ("might" - inappropriate)
  - Other tense/aspect problems: 65
    "I can go anywhere without clean my room..."
o Mixing up English words or phrases which share a single ASL sign: 12
    "Third, living at home is bored and quiet."
  - Omitted BE: 9
    "Once the situation changes they _ different people."
  - Lack of BE/HAVE distinction: 7
    "... some birth controls are side-effect."
o Other omitted main verbs: 7 (usually only with dummy subjects, if not BE or HAVE)
    "I enjoy with NTID student who is Deaf person and NTID staff and I like to talk with old NTID student because I like to hear about NTID's History." ("talking" or "being")
    "Better wait until I lived here more than one month." ("It would be better...")
o Incorrect WH-phrase: 4
    "Now you can see what I compared these two of my teachers." (Correction, from context: "... how I compared ...")
o Adjective problems: 13
  - Incorrect adjective choice: 3
    "Especially, I do feel good to have here Because the food are the best than Gallaudet College."
  - Incorrect adjective formation: 10
    "it was very complicate to know where exactly is the bank."
o Incorrect nominalization: 5
    "I, myself, will call the drunk driver a murder if he hit and killed my folks."
    "I have to learn alot of responisibles at NTID."
o Relative clauses: 14
  - Relative pronoun deletion: 4
    "Then we go to see President from 1960 to 1963 _ is John Kennedy." ("who was John Kennedy.")
  - Resumptive pronoun: 1
    "When I came to NTID for the first time, I met all my old friends that I didn't expected them to come to same school."
  - Other: 9
o Pronouns: 12
  - Incorrect pronoun choice (including pleonastic): 7
    "The students should have them..." ("them" refers to "birth control.")
  - Inappropriate pronoun use (where full definite descriptions are required): 4
    "Fraternities and Sororities will see each other again like an old time. If Gallaudet should not allow Greek organization to continue, they will not cooperate each other since they do not know each other very well."
  - Lack of pronoun use (overuse of definite descriptions): 1
    "My father hired me to run for my dad."
    "An airplane is better than driving a car. An airplane is very safe to go on a trip than driving a car. An airplane is faster than a car. An airplane can takes more people on the plane than less people in a car. A car is cheaper to go anywhere, but an airplane is more expensive to fly. The people can see many things happening in a car than flying on an airplane to see a plain sky."
o Pleonastic pronoun deletion: 10
  - Object: 5
    "I loved _ here at Rochester Institute of technology because it was very beautiful place..."
  - Subject: 5
    "Better wait until I lived here more than one month." [25]
    "The people are very friendly and interesting to get to know then who are from all over the U.S."
o Focus/discourse structuring problems: 49
  - Omission of focused element (subject: 4; object: 4): 8
    "I hope I could find some ways to solve _."
  - Problems carrying over general/specific description strategies: 5
    "Fraternities and Sororities here at XYZ DO provide social life. Some examples: parties; get togethers; workplaces; IM; and sports."
  - Structuring problems with "because": 8
    "Only one thing that I don't like NTID because of student always bothering me while I'm at dorm."
  - Ambiguous modifier attachment: 1
    "There are many things I like about NTID. They offer supporting services like interpreters and notetakers for mainstream classes which I had experiences through my public schools. Now NTID/RIT offers same thing that my school offered but only better supporting services." (Ambiguous as to what the student had experience with.)
  - Other (possibly related to carry-over of topic-comment strategies): 27
o Redundancy problems: 2
    "I still feel thankful for coming to NTID instead of Gallaudet the main reason why I stay here, is the warm feeling everyone have toward the others."
o Not enough sentence breaks: 6
o Other: 104 (23% of errors in database)

ENDNOTES

[1] This research was supported in part by NSF Grant #IRI-9010112. Support was also provided by the Nemours Foundation.
[2] This tool would be very useful to the deaf population. Since data on writing skills is not well-documented, we note that the reading comprehension level of deaf students is considerably lower than that of their hearing counterparts, "... with about half of the population of deaf 18-year-olds reading at or below a fourth grade level and only about 10% reading above the eighth grade level..." [Str88]
[3] It should be noted that we have not attempted to prove that language transfer is behind the errors we have found. Rather, we will show that language transfer is a reasonable explanation.
[4] A pleonastic NP is one which does not play a thematic role. For example, it in "It is raining" or "I like it when Mary sings", and there in "There is a book on the chair".
[5] Other researchers (e.g., [PQ73], [QSW74], [QWM76], [RQP76], [QPS77], [KK78], [QP84]) studied errors in deaf writing but did not attribute errors to LT.
[6] Error classes which occur less frequently have been classified under "Other".
[7] We have recently created a database to store analyzed writing samples. A database user can retrieve all sentences (with their corresponding corrected sentences) containing a particular error. Entering the data into the database is very time-consuming, which is why only 17 samples have been entered thus far.
[8] Note: "_" is used to mark places where we think the writer has omitted one or more words from the corresponding correct English sentence.
[9] The positioning of a verb with respect to the ASL time line may reflect tense.
[10] We do not claim that each instance of an error class that is attributable to LT necessarily resulted from LT, only that LT could explain the error, and thus may be the source of the error. We recognize that there are other sources for errors, including incorrect analyses of English on the part of the writer, and English instruction.
[11] There are verbs for which the movement begins at the direct object and ends at the subject. Padden [Pad88] refers to these as Backwards verbs.
[12] These are actual excerpts. Each typically contains several errors. Here we focus on the deleted NP's only.
[13] Throughout this paper, we discuss whether an NP co-specifies another NP, the Current Focus, or a member of a Potential Focus List, etc. By writing that X co-specifies Y, we mean that the knowledge network representation specified by X is the knowledge network representation specified by Y (in the case that Y is an NP), or corresponding to Y (in the case that Y is a focusing data structure).
[14] Sometimes the AF and the CF are the same.
[15] Sidner uses many terms and data structures to describe her algorithms. We will collapse these terms for simplicity. For instance, Sidner writes of a PFL, ALFL, and DEF, and we will refer to all of them as a PFL. She refers to an expected focus and a current focus (CF), but we will call them both CF (even though the CF of a discourse-initial sentence may never be confirmed as the focus, but only be expected to be the focus when processing the next sentence).
[16] Sidner's algorithm just stacked CF's. We have extended the algorithm to stack PFL's as well.
[17] We need to confirm that this assumption is reasonable based on further study of ASL and analysis of deaf writing samples.
[18] Recall we are only correcting NP's that are omitted under Topic NP deletion, as opposed to those deleted in the presence of rich verb morphology.
[19] As future work, we will explore how to resolve more than one non-agent anaphor in a sentence co-specifying PFL elements.
[20] If we were instead to split the sentence up, and treat each clause as a sentence, then the focus would shift away from MONEY when we process the second clause (which contradicts our intuition of what the focus is in this paragraph).
[21] The appropriateness of placing elements from both clauses in one PFL and ranking them according to clause membership will be further investigated. This construct ("X BECAUSE Y") is further discussed in section 8.
[22] Sidner's recency rule prefers a member of the PFL which occurred as the last constituent of the previous sentence as a co-specifier of a subject pronoun.
[23] Throughout this work, when we write that the CF is an NP, or that the PFL contains an NP, we mean that the data structure contains the element in the knowledge network which is specified by that NP.
[24] Some examples are errors in the context in which they occurred, but the sentences appear correct in isolation.
[25] Often more than one correction is possible. For example, here the correction could be "It is better to wait until ..." or "I had better wait ...".