CMSC 723 / LING 723: Computational Linguistics I
September 26, 2007: Dorr
Computational Morphology (Chapter 3), Word Classes, POS Tagging (Chapter 5)
Prof. Bonnie J. Dorr
Co-Instructor: Nitin Madnani
TA: Hamid Shahri

Computational Morphology
• Finite State Morphology
– Finite State Transducers (FST)
• Input/Output
• Analysis/Generation

Computational Morphology
WORD      STEM (+FEATURES)*
cats      cat +N +PL
cat       cat +N +SG
cities    city +N +PL
geese     goose +N +PL
ducks     (duck +N +PL) or (duck +V +3SG)
merging   merge +V +PRES-PART
caught    (catch +V +PAST-PART) or (catch +V +PAST)

Building a Morphological Parser
The Rules and the Lexicon
– General versus Specific
– Regular versus Irregular
– Accuracy, speed, space
– The Morphology of a language
Approaches
– Lexicon only
– Lexicon and Rules
  • Finite-state Automata
  • Finite-state Transducers
– Rules only

Lexicon-only Morphology
• The lexicon lists all surface-level and lexical-level pairs
• No rules …?
• Analysis/Generation is easy
• Very large for English
• What about Arabic or Turkish?
• Chinese?
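Since a lexicon-only analyzer has no rules, analysis is just table lookup over the listed surface/lexical pairs, and generation is the same lookup in the other direction. A minimal sketch (the entries and feature strings below are illustrative, not a full lexicon):

```python
# Lexicon-only morphological analysis: every surface form is listed
# explicitly with its lexical-level analysis (stem + features).
LEXICON = {
    "cats":  ["cat +N +PL"],
    "cat":   ["cat +N +SG"],
    "geese": ["goose +N +PL"],
    "ducks": ["duck +N +PL", "duck +V +3SG"],  # ambiguous surface form
}

def analyze(word):
    """Return all lexical-level analyses for a surface form."""
    return LEXICON.get(word, [])

# Generation is the inverse lookup: lexical-level string -> surface form.
GENERATE = {analysis: surface
            for surface, analyses in LEXICON.items()
            for analysis in analyses}

print(analyze("ducks"))          # ['duck +N +PL', 'duck +V +3SG']
print(GENERATE["goose +N +PL"])  # geese
```

The sketch also shows why the approach is "very large for English" and worse for Arabic or Turkish: every inflected form needs its own entry.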
Surface form   Lexical form
acclaim        acclaim N
acclaim        acclaim V+0
acclaimed      acclaim V+ed
acclaimed      acclaim V+en
acclaiming     acclaim V+ing
acclaims       acclaim N+s
acclaims       acclaim V+s
acclamation    acclamation N
acclamations   acclamation N+s
acclimate      acclimate V+0
acclimated     acclimate V+ed
acclimated     acclimate V+en
acclimates     acclimate V+s
acclimating    acclimate V+ing

Building a Morphological Parser
The Rules and the Lexicon
– General versus Specific
– Regular versus Irregular
– Accuracy, speed, space
– The Morphology of a language
Approaches
– Lexicon only
– Lexicon and Rules
  • Finite-state Automata
  • Finite-state Transducers
– Rules only

Lexicon and Rules: FSA Inflectional Noun Morphology
• English Noun Lexicon
reg-noun   irreg-pl-noun   irreg-sg-noun   plural
fox        geese           goose           -s
cat        sheep           sheep
dog        mice            mouse
• English Noun Rule

Lexicon and Rules: FSA English Verb Inflectional Morphology
reg-verb-stem   irreg-verb-stem   irreg-past-verb   past   past-part   pres-part   3sg
walk            cut               caught            -ed    -ed         -ing        -s
fry             speak             ate
talk            sing              eaten
impeach                           sang
                                  spoken

FSA for Derivational Morphology: Adjectival Formation

More Complex Derivational Morphology

Using FSAs for Recognition: English Nouns and their Inflection

Morphological Parsing
Finite-state automata (FSA)
– Recognizer
– One-level morphology
Finite-state transducers (FST)
– Two-level morphology
– Input-output pairs

Terminology for Kimmo-style Morphology
• Upper = lexical tape
• Lower = surface tape
• Characters correspond to pairs, written a:b
• If "a:a", write "a" for shorthand
• Two-level lexical entries
• # = word boundary
• ^ = morpheme boundary
• Other = "any feasible pair that is not in this transducer"
• Final states indicated with ":" and non-final states indicated with "."

Four-Fold View of FSTs
• As a recognizer
• As a generator
• As a translator
• As a set relater

Nominal Inflection FST

Lexical and Intermediate Tapes

Spelling Rules
– Consonant doubling: a single consonant letter is doubled before -ing/-ed (beg/begging)
– E-deletion:
silent e is dropped before -ing and -ed (make/making)
– E-insertion: e is added after s, z, x, ch, sh before s (watch/watches)
– Y-replacement: -y changes to -ie before -s and to -i before -ed (try/tries)
– K-insertion: verbs ending in vowel + -c add -k (panic/panicked)

Chomsky and Halle notation for E-insertion:
ε → e / {x, s, z} ^ __ s #

Intermediate-to-Surface Transducer

State Transition Table
How would we represent this in NLTK?
  from nltk_contrib.fst import fst
  f = fst.FST('pluralize')
  f.add_state('q0')
  f.initial_state = 'q0'
  f.set_final('q0')
  f.add_state('q1')
  f.add_arc('q0', 'q1', 'z', 'z')

Two-Level Morphology Sample Run

FSTs and Ambiguity
Parse Example 1: unionizable
– union +ize +able
– un+ ion +ize +able
Parse Example 2: assess
– assess(V)
– ass(N) +ess(N)
Parse Example 3: tender
– tender(AJ)
– ten(Num) +d(AJ) +er(CMP)
What to do about Global Ambiguity?
• Accept the first successful structure
• Run the parser through all possible paths
• Bias the search in some manner

Computational Morphology
The Rules and the Lexicon
– General versus Specific
– Regular versus Irregular
– Accuracy, speed, space
– The Morphology of a language
Approaches
– Lexicon only
– Lexicon and Rules
  • Finite-state Automata
  • Finite-state Transducers
– Rules only

Lexicon-Free Morphology: Porter Stemmer
• Lexicon-free FST approach
• By Martin Porter (1980)
  http://www.tartarus.org/%7Emartin/PorterStemmer/
  Here is one you can try online: http://www.utilitymill.com/utility/Porter_Stemming_Algorithm
• Cascade of substitutions given specific conditions:
  GENERALIZATIONS → GENERALIZATION → GENERALIZE → GENERAL → GENER

Porter Stemmer
Definitions
– C = string of one or more consonants, where a consonant is anything other than A, E, I, O, U, or a Y preceded by a consonant
– V = string of one or more vowels
– M = measure, corresponding roughly to the number of syllables
– Words = (C)(VC)^M(V), where the parenthesized parts are optional
  • M=0: TR, EE, TREE, Y, BY
  • M=1: TROUBLE, OATS, TREES, IVY
  • M=2: TROUBLES, PRIVATE, OATEN, ORRERY
Conditions
– *S: stem ends with S (and similarly *<X> = stem ends with letter <X>)
– *v*: stem contains a vowel
– *d: stem ends with a double consonant, e.g., -TT,
-SS
– *o: stem ends CVC, where the second C is not W, X, or Y, e.g., -WIL, -HOP

Porter Stemmer
Step 1: Plural Nouns and Third-Person Singular Verbs
SSES → SS   caresses → caress
IES → I     ponies → poni, ties → ti
SS → SS     caress → caress
S → ε       cats → cat

Step 2a: Verbal Past Tense and Progressive Forms
(M>0) EED → EE      feed → feed, agreed → agree
i. (*v*) ED → ε     plastered → plaster, bled → bled
ii. (*v*) ING → ε   motoring → motor, sing → sing

Step 2b: If 2a.i or 2a.ii was successful, Cleanup
AT → ATE    conflat(ed) → conflate
BL → BLE    troubl(ed) → trouble
IZ → IZE    siz(ed) → size
(*d and not (*L or *S or *Z)) → single letter    hopp(ing) → hop, tann(ed) → tan; but hiss(ing) → hiss, fizz(ed) → fizz
(M=1 and *o) → E    fil(ing) → file; but fail(ing) → fail

Porter Stemmer
Step 3: Y → I
(*v*) Y → I    happy → happi, sky → sky

Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL → ATE    relational → relate
(m>0) TIONAL → TION    conditional → condition, rational → rational
(m>0) ENCI → ENCE      valenci → valence
(m>0) ANCI → ANCE      hesitanci → hesitance
(m>0) IZER → IZE       digitizer → digitize
(m>0) ABLI → ABLE      conformabli → conformable
(m>0) ALLI → AL        radicalli → radical
(m>0) ENTLI → ENT      differentli → different
(m>0) ELI → E          vileli → vile
(m>0) OUSLI → OUS      analogousli → analogous
(m>0) IZATION → IZE    vietnamization → vietnamize
(m>0) ATION → ATE      predication → predicate
(m>0) ATOR → ATE       operator → operate
(m>0) ALISM → AL       feudalism → feudal
(m>0) IVENESS → IVE    decisiveness → decisive
(m>0) FULNESS → FUL    hopefulness → hopeful
(m>0) OUSNESS → OUS    callousness → callous
(m>0) ALITI → AL       formaliti → formal
(m>0) IVITI → IVE      sensitiviti → sensitive
(m>0) BILITI → BLE     sensibiliti → sensible

Porter Stemmer
Step 5: Derivational Morphology II: More
Multiple Suffixes
(m>0) ICATE → IC    triplicate → triplic
(m>0) ATIVE → ε     formative → form
(m>0) ALIZE → AL    formalize → formal
(m>0) ICITI → IC    electriciti → electric
(m>0) ICAL → IC     electrical → electric
(m>0) FUL → ε       hopeful → hope
(m>0) NESS → ε      goodness → good

Porter Stemmer
Step 6: Derivational Morphology III: Single Suffixes
(m>1) AL → ε      revival → reviv
(m>1) ANCE → ε    allowance → allow
(m>1) ENCE → ε    inference → infer
(m>1) ER → ε      airliner → airlin
(m>1) IC → ε      gyroscopic → gyroscop
(m>1) ABLE → ε    adjustable → adjust
(m>1) IBLE → ε    defensible → defens
(m>1) ANT → ε     irritant → irrit
(m>1) EMENT → ε   replacement → replac
(m>1) MENT → ε    adjustment → adjust
(m>1) ENT → ε     dependent → depend
(m>1 and (*S or *T)) ION → ε    adoption → adopt
(m>1) OU → ε      homologou → homolog
(m>1) ISM → ε     communism → commun
(m>1) ATE → ε     activate → activ
(m>1) ITI → ε     angulariti → angular
(m>1) OUS → ε     homologous → homolog
(m>1) IVE → ε     effective → effect
(m>1) IZE → ε     bowdlerize → bowdler

Porter Stemmer
Step 7a: Cleanup
(m>1) E → ε             probate → probat; but rate → rate
(m=1 and not *o) E → ε  cease → ceas
Step 7b: More Cleanup
(m>1 and *d and *L) → single letter    controll → control; but roll → roll

Porter Stemmer Errors (from Krovetz '93)
Errors of Omission (forms that should conflate but do not)
– European / Europe
– analysis / analyzes
– matrices / matrix
– noise / noisy
– explain / explanation
Errors of Commission (forms wrongly conflated)
– organization / organ
– doing / doe
– generalization / generic
– numerical / numerous
– university / universe

Back to last week's digression: How would FST composition be applied to stemming?
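One way to make the question concrete before looking at the machines themselves: approximate each rule by a whole-string rewrite function and cascade them. This is only a sketch under that simplification (a real two-level FST works pair-by-pair over tapes, and the function names here are illustrative):

```python
import re

def drop_ness(word):
    """NESS -> epsilon: strip a final -ness suffix, if present."""
    return word[:-4] if word.endswith('ness') else word

def y_to_i(stem):
    """(*v*) Y -> I: a final y becomes i if the rest of the stem has a vowel."""
    if stem.endswith('y') and re.search(r'[aeiou]', stem[:-1]):
        return stem[:-1] + 'i'
    return stem

def compose(*fns):
    """Cascade rewrite functions left to right, mimicking FST composition."""
    def composed(s):
        for fn in fns:
            s = fn(s)
        return s
    return composed

# For surface-to-stem stemming, strip the suffix first, then apply Y -> I.
stem_ness = compose(drop_ness, y_to_i)
print(stem_ness("happyness"))  # happi
print(y_to_i("sky"))           # sky (no vowel before the y)
```

The composed FST does both rewrites in a single pass over the tape; the function cascade here makes two passes, which is the usual cost of simulating composition naively.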
FST_3: (*v*) Y → I
  (0) --V:V--> (1) --Y:I--> (2)
FST_5: NESS → ε
  (0) --NESS:0--> (1)
FST_3 o FST_5:
  (0) --V:V--> (1) --Y:I--> (2) --NESS:0--> (3)
(0 on the surface side denotes the empty string ε)

Word Classes and Part-of-Speech Tagging
• Definition and Example
• Motivation
• Word Classes
• Rule-based Tagging
• Stochastic Tagging
• Transformation-Based Tagging (covered in a later lecture)
• Tagging Unknown Words

Definition: POS Tagging
"The process of assigning a part-of-speech or other lexical class marker to each word in a corpus" (Jurafsky and Martin)
WORDS: the girl kissed the boy on the cheek
TAGS: N, V, P, DET

An Example
WORD     LEMMA   TAG
the      the     +DET
girl     girl    +NOUN
kissed   kiss    +VPAST
the      the     +DET
boy      boy     +NOUN
on       on      +PREP
the      the     +DET
cheek    cheek   +NOUN
From: http://www.xrce.xerox.com/competencies/content-analysis/fsnlp/tagger.en.html

Motivation
• Speech synthesis: pronunciation
• Speech recognition: class-based N-grams
• Information retrieval: stemming, selection of high-content words
• Word-sense disambiguation
• Corpus analysis of language & lexicography

Word Classes
• Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …
• POS based on morphology and syntax
• Open vs. Closed classes
– Open: Nouns, Verbs, Adjectives, Adverbs
– Closed:
  • determiners: a, an, the
  • pronouns: she, he, I
  • prepositions: on, under, over, near, by, …

Open Class Words
• Every known human language has nouns and verbs
• Nouns: people, places, things
– Classes of nouns: proper vs. common; count vs. mass
• Verbs: actions and processes
• Adjectives: properties, qualities
• Adverbs: hodgepodge!
– Unfortunately, John walked home extremely slowly yesterday

Closed Class Words
• Idiosyncratic
• Examples:
– prepositions: on, under, over, …
– particles: up, down, on, off, …
– determiners: a, an, the, …
– pronouns: she, who, I, …
– conjunctions: and, but, or, …
– auxiliary verbs: can, may, should, …
– numerals: one, two, three, third, …

Prepositions from CELEX

English Single-Word Particles

Pronouns in CELEX

Conjunctions

Auxiliaries

Word Classes: Tag Sets
• Vary in number of tags: a dozen to over 200
• The size of the tag set depends on language, objectives, and purpose
– Some tagging approaches (e.g., constraint-grammar based) make fewer distinctions, e.g., conflating prepositions, conjunctions, and particles
– Simple morphology = more ambiguity = fewer tags

Word Classes: Tag Set Example
Example of Penn Treebank Tagging of a Brown Corpus Sentence
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Book/VB that/DT flight/NN ./.
Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.

The Problem
Words often have more than one word class: this
– This is a nice day = PRP
– This day is nice = DT
– You can go this far = RB

Word Class Ambiguity (in the Brown Corpus)
Unambiguous (1 tag): 38,857
Ambiguous (2-7 tags): 8,844
  2 tags: 6,731
  3 tags: 1,621
  4 tags: 357
  5 tags: 90
  6 tags: 32
  7 tags: 6
(DeRose, 1988)

Part-of-Speech Tagging
• Rule-Based Tagger: ENGTWOL
• Stochastic Tagger: HMM-based
• Transformation-Based Tagger: Brill (covered in a later lecture)

Rule-Based Tagging: ENGTWOL
Basic Idea:
– Assign all possible tags to words
– Remove tags according to a set of rules of the type:
  if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like "consider", then eliminate non-adv; else eliminate adv.
– Typically more than 1000 hand-written rules, but they may also be machine-learned

Sample ENGTWOL Lexicon

Stage 1 of ENGTWOL Tagging
First Stage: Run words through a Kimmo-style morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG

Stage 2 of ENGTWOL Tagging
Second Stage: Apply constraints.
Constraints are used in a negative way (to eliminate readings).
Example: Adverbial "that" rule
Given input: "that"
If (+1 A/ADV/QUANT) (+2 SENT-LIM) (NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV

Stochastic Tagging
• Based on the probability of a certain tag occurring, given various possibilities
• Necessitates a training corpus
• No probabilities for words not in the corpus
• The training corpus may be too different from the test corpus

Stochastic Tagging (cont.)
Simple Method: Choose the most frequent tag in the training text for each word!
– Result: 90% accuracy
– Why?
– This is a baseline: other methods will do better
– HMM taggers are an example

For next time: New Chapter 5 (particularly 5.5) and Chapter 6: HMMs
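The most-frequent-tag baseline above takes only a few lines to implement. A minimal sketch (the toy tagged corpus and function names are illustrative; a real experiment would train on something like the Brown corpus):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Map each word to the tag it appears with most often in training."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, default='NN'):
    """Tag each word with its most frequent training tag.

    Unseen words get a default tag: exactly the "no probabilities for
    words not in the corpus" problem noted above.
    """
    return [(w, model.get(w, default)) for w in words]

# Toy training data: (word, tag) pairs with Penn-Treebank-style tags.
corpus = [('the', 'DT'), ('that', 'DT'), ('that', 'DT'), ('that', 'IN'),
          ('flight', 'NN'), ('serve', 'VB'), ('dinner', 'NN')]
model = train_baseline(corpus)
print(tag(['that', 'flight'], model))  # [('that', 'DT'), ('flight', 'NN')]
```

Ambiguous words like "that" illustrate why this is only a baseline: the model always picks DT, so every IN occurrence of "that" is tagged wrong, and context-sensitive methods such as HMMs improve on it.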