CMSC 723 / LING 723: Computational Linguistics I
September 26, 2007: Dorr
Computational Morphology (Chapter 3),
Word Classes, POS Tagging (Chapter 5)
Prof. Bonnie J. Dorr
Co-Instructor: Nitin Madnani
TA: Hamid Shahri
Computational Morphology
• Finite State Morphology
– Finite State Transducers (FST)
• Input/Output
• Analysis/Generation
Computational Morphology
WORD       STEM (+FEATURES)*
cats       cat +N +PL
cat        cat +N +SG
cities     city +N +PL
geese      goose +N +PL
ducks      (duck +N +PL) or (duck +V +3SG)
merging    merge +V +PRES-PART
caught     (catch +V +PAST-PART) or (catch +V +PAST)
Building a Morphological Parser
• The Rules and the Lexicon
– General versus Specific
– Regular versus Irregular
– Accuracy, speed, space
– The Morphology of a language
• Approaches
– Lexicon only
– Lexicon and Rules
• Finite-state Automata
• Finite-state Transducers
– Rules only
Lexicon-only Morphology
• The lexicon lists all surface level and lexical level pairs
• No rules …?
• Analysis/Generation is easy
• Very large for English
• What about Arabic or Turkish?
• Chinese?
Surface form    Lexical entry
acclaim         acclaim $N$
acclaim         acclaim $V+0$
acclaimed       acclaim $V+ed$
acclaimed       acclaim $V+en$
acclaiming      acclaim $V+ing$
acclaims        acclaim $N+s$
acclaims        acclaim $V+s$
acclamation     acclamation $N$
acclamations    acclamation $N+s$
acclimate       acclimate $V+0$
acclimated      acclimate $V+ed$
acclimated      acclimate $V+en$
acclimates      acclimate $V+s$
acclimating     acclimate $V+ing$
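As a rough, hedged illustration (names and layout are mine, not from the lecture), a lexicon-only analyzer can store the pairs above directly and answer analysis and generation queries by lookup:

    # Minimal sketch of lexicon-only morphology: every surface/lexical pair is listed.
    PAIRS = [
        ("acclaim",    "acclaim $N$"),
        ("acclaim",    "acclaim $V+0$"),
        ("acclaimed",  "acclaim $V+ed$"),
        ("acclaims",   "acclaim $V+s$"),
        ("acclimates", "acclimate $V+s$"),
        # ... one entry per pair in the full listing above
    ]

    def analyze(surface):
        """Surface form -> all lexical-level analyses (may be ambiguous)."""
        return [lex for s, lex in PAIRS if s == surface]

    def generate(lexical):
        """Lexical entry -> all surface forms."""
        return [s for s, lex in PAIRS if lex == lexical]

    print(analyze("acclaim"))          # ['acclaim $N$', 'acclaim $V+0$']
    print(generate("acclaim $V+s$"))   # ['acclaims']

The listing makes the space cost explicit: every inflected form of every word must appear, which is why this approach becomes very large for English and explodes for morphologically rich languages such as Arabic or Turkish.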
Building a Morphological Parser
• The Rules and the Lexicon
– General versus Specific
– Regular versus Irregular
– Accuracy, speed, space
– The Morphology of a language
• Approaches
– Lexicon only
– Lexicon and Rules
• Finite-state Automata
• Finite-state Transducers
– Rules only
Lexicon and Rules: FSA Inflectional Noun Morphology
• English Noun Lexicon
reg-noun   irreg-pl-noun   irreg-sg-noun   plural
fox        geese           goose           -s
cat        sheep           sheep
dog        mice            mouse
• English Noun Rule
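As a hedged sketch (the organization and names are mine; the textbook FSA figure may differ in detail), the lexicon classes plus the -s rule can be simulated as a simple acceptor:

    # Word classes from the noun lexicon above
    REG_NOUN      = {"fox", "cat", "dog"}
    IRREG_PL_NOUN = {"geese", "sheep", "mice"}
    IRREG_SG_NOUN = {"goose", "sheep", "mouse"}

    def accepts_noun(word):
        """Accept irregular nouns as listed, and regular nouns bare or with -s."""
        if word in IRREG_SG_NOUN or word in IRREG_PL_NOUN:
            return True
        if word in REG_NOUN:
            return True
        return word.endswith("s") and word[:-1] in REG_NOUN

    print(accepts_noun("cats"), accepts_noun("geese"))   # True True
    # Note: accepts_noun("foxes") is False here, because spelling changes such as
    # e-insertion are exactly what the transducers on the following slides handle.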
Lexicon and Rules: FSA English Verb Inflectional Morphology
reg-verb-stem   irreg-verb-stem   irreg-past-verb       past   past-part   pres-part   3sg
walk            cut               caught, ate, eaten,   -ed    -ed         -ing        -s
fry             speak             sang, spoken
talk            sing
impeach
FSA for Derivational Morphology: Adjectival Formation
More Complex Derivational Morphology
Using FSAs for Recognition: English Nouns and their Inflection
Morphological Parsing
• Finite-state automata (FSA)
– Recognizer
– One-level morphology
• Finite-state transducers (FST)
– Two-level morphology
– input-output pair
Terminology for Kimmo-style Morphology
• Upper = lexical tape
• Lower = surface tape
• Characters correspond to pairs, written a:b
• If “a:a”, write “a” for shorthand
• Two-level lexical entries
• # = word boundary
• ^ = morpheme boundary
• Other = “any feasible pair that is not in this transducer”
• Final states indicated with “:” and non-final states indicated with “.”
Four-Fold View of FSTs
• As a recognizer
• As a generator
• As a translator
• As a set relater
Nominal Inflection FST
Lexical and Intermediate Tapes
Spelling Rules
Name                 Rule Description                                  Example
Consonant Doubling   1-letter consonant doubled before -ing/-ed        beg/begging
E-deletion           silent e dropped before -ing and -ed              make/making
E-insertion          e added after s, z, x, ch, sh before s            watch/watches
Y-replacement        -y changes to -ie before -s, -i before -ed        try/tries
K-insertion          verbs ending with vowel + -c add -k               panic/panicked
Chomsky and Halle Notation (the E-insertion rule):
ε → e / {x, s, z} ^ __ s#
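As a rough illustration (not from the lecture) of what the rule does, the same E-insertion can be written as a regular-expression substitution over an intermediate form that still carries the ^ and # boundary symbols:

    import re

    def e_insertion(intermediate):
        """Apply E-insertion to an intermediate form such as 'fox^s#'."""
        # insert e when a stem ending in s, z, x (or ch, sh, per the spelling-rule
        # table above) is followed by the morpheme boundary ^ and the suffix s#
        surface = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es#", intermediate)
        # erase the boundary symbols to obtain the surface string
        return surface.replace("^", "").replace("#", "")

    print(e_insertion("fox^s#"))    # foxes
    print(e_insertion("cat^s#"))    # cats
    print(e_insertion("watch^s#"))  # watches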
Intermediate-to-Surface
Transducer
State Transition Table
How would we represent this in NLTK?

    from nltk_contrib.fst import fst

    # create a named FST and make q0 both the initial and a final state
    f = fst.FST('pluralize')
    f.add_state('q0')
    f.initial_state = 'q0'
    f.set_final('q0')
    # add a second state and an arc that maps input 'z' to output 'z'
    f.add_state('q1')
    f.add_arc('q0', 'q1', 'z', 'z')
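For comparison, here is a library-free sketch of the same idea: the state transition table as a plain dictionary, with only the single arc from the example filled in (the rest of the intermediate-to-surface transducer would be added the same way):

    # (state, input symbol) -> (next state, output symbol)
    ARCS = {
        ("q0", "z"): ("q1", "z"),
        # ... remaining arcs of the intermediate-to-surface transducer
    }
    FINAL_STATES = {"q0"}

    def transduce(symbols, start="q0"):
        """Run the transducer over a list of input symbols; return output or None."""
        state, output = start, []
        for sym in symbols:
            if (state, sym) not in ARCS:
                return None              # no matching arc: the input is rejected
            state, out = ARCS[(state, sym)]
            output.append(out)
        return "".join(output) if state in FINAL_STATES else None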
Two-Level Morphology: Sample Run
FSTs and ambiguity
• Parse Example 1: unionizable
– union +ize +able
– un+ ion +ize +able
• Parse Example 2: assess
– assess (V)
– ass (N) + ess (N)
• Parse Example 3: tender
– tender (AJ)
– ten (NUM) + d (AJ) + er (CMP)
What to do about Global Ambiguity?
• Accept first successful structure
• Run parser through all possible paths
• Bias the search in some manner
Computational Morphology
• The Rules and the Lexicon
– General versus Specific
– Regular versus Irregular
– Accuracy, speed, space
– The Morphology of a language
• Approaches
– Lexicon only
– Lexicon and Rules
• Finite-state Automata
• Finite-state Transducers
– Rules only
Lexicon-Free Morphology: Porter Stemmer
• Lexicon-Free FST Approach
• By Martin Porter (1980)
  http://www.tartarus.org/%7Emartin/PorterStemmer/
• Here is one you can try online:
  http://www.utilitymill.com/utility/Porter_Stemming_Algorithm
• Cascade of substitutions given specific conditions:
  GENERALIZATIONS → GENERALIZATION → GENERALIZE → GENERAL → GENER
Porter Stemmer
• Definitions
– C = string of one or more consonants, where a consonant is any letter other than A, E, I, O, U, or Y preceded by a consonant
– V = string of one or more vowels
– M = measure, roughly corresponding to the number of syllables
– Words have the form [C](VC)^M[V], i.e., an optional consonant cluster, M vowel-consonant sequences, then an optional vowel cluster
• M=0: TR, EE, TREE, Y, BY
• M=1: TROUBLE, OATS, TREES, IVY
• M=2: TROUBLES, PRIVATE, OATEN, ORRERY
• Conditions
– *S - stem ends with S (and similarly for any other letter)
– *v* - stem contains a vowel
– *d - stem ends with a double consonant, e.g., -TT, -SS
– *o - stem ends CVC, where the second C is not W, X or Y, e.g., -WIL, -HOP
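The measure M can be computed directly from these definitions; the following sketch (function names are mine) reproduces the M=0/1/2 examples above:

    def is_consonant(word, i):
        """Porter: a consonant is any letter other than A, E, I, O, U,
        and other than Y preceded by a consonant."""
        ch = word[i].lower()
        if ch in "aeiou":
            return False
        if ch == "y":
            return i == 0 or not is_consonant(word, i - 1)
        return True

    def measure(word):
        """M = number of VC sequences in the [C](VC)^M[V] form of the word."""
        shape = []
        for i in range(len(word)):
            t = "C" if is_consonant(word, i) else "V"
            if not shape or shape[-1] != t:
                shape.append(t)
        return "".join(shape).count("VC")

    print(measure("tree"), measure("trouble"), measure("troubles"))   # 0 1 2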
Porter Stemmer
Step 1: Plural Nouns and Third Person Singular Verbs
SSES → SS    caresses → caress
IES  → I     ponies → poni, ties → ti
SS   → SS    caress → caress
S    → ε     cats → cat
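A minimal sketch of Step 1 as ordered string tests (the function name is mine; the slide's examples appear in the comments):

    def step1(word):
        """Porter Step 1: plural nouns and third person singular verbs."""
        if word.endswith("sses"):
            return word[:-2]      # caresses -> caress
        if word.endswith("ies"):
            return word[:-2]      # ponies -> poni, ties -> ti
        if word.endswith("ss"):
            return word           # caress -> caress
        if word.endswith("s"):
            return word[:-1]      # cats -> cat
        return word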
Step 2a: Verbal Past Tense and Progressive Forms
(M>0) EED → EE      feed → feed, agreed → agree
i.  (*v*) ED → ε    plastered → plaster, bled → bled
ii. (*v*) ING → ε   motoring → motor, sing → sing
Step 2b: If 2a.i or 2a.ii is successful, Cleanup
AT → ATE                                        conflat(ed) → conflate
BL → BLE                                        troubl(ed) → trouble
IZ → IZE                                        siz(ed) → size
(*d and not (*L or *S or *Z)) → single letter   hopp(ing) → hop, tann(ed) → tan, hiss(ing) → hiss, fizz(ed) → fizz
(M=1 and *o) → E                                fail(ing) → fail, fil(ing) → file
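Continuing the sketch, Step 2a can be written with the measure() and is_consonant() helpers defined earlier (a simplified illustration; the Step 2b cleanup rules are not shown):

    def contains_vowel(stem):
        return any(not is_consonant(stem, i) for i in range(len(stem)))

    def step2a(word):
        """Porter Step 2a: past tense and progressive forms."""
        if word.endswith("eed"):
            # (M>0) EED -> EE: agreed -> agree, but feed -> feed
            return word[:-1] if measure(word[:-3]) > 0 else word
        for suffix in ("ed", "ing"):
            if word.endswith(suffix):
                stem = word[: -len(suffix)]
                if contains_vowel(stem):
                    return stem    # plastered -> plaster, motoring -> motor
        return word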
Porter Stemmer
Step 3: Y → I
(*v*) Y → I    happy → happi, sky → sky
Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL → ATE    relational → relate
(m>0) TIONAL → TION    conditional → condition, rational → rational
(m>0) ENCI → ENCE      valenci → valence
(m>0) ANCI → ANCE      hesitanci → hesitance
(m>0) IZER → IZE       digitizer → digitize
(m>0) ABLI → ABLE      conformabli → conformable
(m>0) ALLI → AL        radicalli → radical
(m>0) ENTLI → ENT      differentli → different
(m>0) ELI → E          vileli → vile
(m>0) OUSLI → OUS      analogousli → analogous
(m>0) IZATION → IZE    vietnamization → vietnamize
(m>0) ATION → ATE      predication → predicate
(m>0) ATOR → ATE       operator → operate
(m>0) ALISM → AL       feudalism → feudal
(m>0) IVENESS → IVE    decisiveness → decisive
(m>0) FULNESS → FUL    hopefulness → hopeful
(m>0) OUSNESS → OUS    callousness → callous
(m>0) ALITI → AL       formaliti → formal
(m>0) IVITI → IVE      sensitiviti → sensitive
(m>0) BILITI → BLE     sensibiliti → sensible
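All of these rules share the same shape, so they can be table-driven; this sketch (rule list truncated, with longer suffixes listed before shorter ones so ATIONAL is tried before TIONAL) reuses measure() from above:

    STEP4_RULES = [
        ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
        ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
        # ... remaining (suffix, replacement) pairs from the table above
    ]

    def step4(word):
        """Apply the first matching rule, gated by the (m>0) condition."""
        for old, new in STEP4_RULES:
            if word.endswith(old):
                stem = word[: -len(old)]
                return stem + new if measure(stem) > 0 else word
        return word

    print(step4("relational"))   # relate
    print(step4("rational"))     # rational (m of 'r' is 0, so no change)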
Porter Stemmer
Step 5: Derivational Morphology II: More Multiple Suffixes
(m>0) ICATE → IC    triplicate → triplic
(m>0) ATIVE → ε     formative → form
(m>0) ALIZE → AL    formalize → formal
(m>0) ICITI → IC    electriciti → electric
(m>0) ICAL → IC     electrical → electric
(m>0) FUL → ε       hopeful → hope
(m>0) NESS → ε      goodness → good
Porter Stemmer
Step 6: Derivational Morphology III: Single Suffixes
(m>1) AL → ε                    revival → reviv
(m>1) ANCE → ε                  allowance → allow
(m>1) ENCE → ε                  inference → infer
(m>1) ER → ε                    airliner → airlin
(m>1) IC → ε                    gyroscopic → gyroscop
(m>1) ABLE → ε                  adjustable → adjust
(m>1) IBLE → ε                  defensible → defens
(m>1) ANT → ε                   irritant → irrit
(m>1) EMENT → ε                 replacement → replac
(m>1) MENT → ε                  adjustment → adjust
(m>1) ENT → ε                   dependent → depend
(m>1 and (*S or *T)) ION → ε    adoption → adopt
(m>1) OU → ε                    homologou → homolog
(m>1) ISM → ε                   communism → commun
(m>1) ATE → ε                   activate → activ
(m>1) ITI → ε                   angulariti → angular
(m>1) OUS → ε                   homologous → homolog
(m>1) IVE → ε                   effective → effect
(m>1) IZE → ε                   bowdlerize → bowdler
Porter Stemmer
Step 7a: Cleanup
(m>1) E → ε              probate → probat, rate → rate
(m=1 and not *o) E → ε   cease → ceas
Step 7b: More Cleanup
(m>1 and *d and *L) → single letter   controll → control, roll → roll
Porter Stemmer
• Errors of Omission (pairs that should be conflated but are not):
– European / Europe
– analysis / analyzes
– matrices / matrix
– noise / noisy
– explain / explanation
• Errors of Commission (pairs that are conflated but should not be):
– organization / organ
– doing / doe
– generalization / generic
– numerical / numerous
– university / universe
From Krovetz ’93
Back to last week’s digression: How would FST composition be applied to stemming?
FST_3: (*v*) Y → I    (states 0, 1, 2; arcs 0 -V:V-> 1, 1 -Y:I-> 2)
FST_5: NESS → ε       (states 0, 1; arc 0 -NESS:0-> 1)
FST_3 o FST_5         (states 0, 1, 2, 3; arcs 0 -V:V-> 1, 1 -Y:I-> 2, 2 -NESS:0-> 3)
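At the string level the idea can be sketched by cascading two rewrite functions: running the rules one after another computes the same input-output relation that the single composed transducer computes in one pass (function names and the simplified rules are mine):

    def rule_y_to_i(word):
        """FST_3: (*v*) Y -> I, e.g. happy -> happi, sky -> sky."""
        if word.endswith("y") and any(ch in "aeiou" for ch in word[:-1]):
            return word[:-1] + "i"
        return word

    def rule_drop_ness(word):
        """FST_5: NESS -> epsilon, e.g. goodness -> good."""
        return word[:-4] if word.endswith("ness") else word

    def compose(*rules):
        """Cascade the rules in order, mimicking transducer composition."""
        def composed(word):
            for rule in rules:
                word = rule(word)
            return word
        return composed

    stem = compose(rule_y_to_i, rule_drop_ness)
    print(stem("happy"), stem("goodness"))   # happi good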
Word Classes and Part-of-Speech Tagging
• Definition and Example
• Motivation
• Word Classes
• Rule-based Tagging
• Stochastic Tagging
• Transformation-Based Tagging (covered in a later lecture)
• Tagging Unknown Words
Definition: POS Tagging
“The process of assigning a part-of-speech or
other lexical class marker to each word in a
corpus” (Jurafsky and Martin)
WORDS and TAGS (tag set: N, V, P, DET):
the/DET girl/N kissed/V the/DET boy/N on/P the/DET cheek/N
An Example
WORD     LEMMA   TAG
the      the     +DET
girl     girl    +NOUN
kissed   kiss    +VPAST
the      the     +DET
boy      boy     +NOUN
on       on      +PREP
the      the     +DET
cheek    cheek   +NOUN
From: http://www.xrce.xerox.com/competencies/content-analysis/fsnlp/tagger.en.html
Motivation
• Speech synthesis — pronunciation
• Speech recognition — class-based N-grams
• Information retrieval — stemming, selection of high-content words
• Word-sense disambiguation
• Corpus analysis of language & lexicography
Word Classes
• Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …
• POS based on morphology and syntax
• Open vs. Closed classes
– Open:
• Nouns, Verbs, Adjectives, Adverbs
– Closed:
• determiners: a, an, the
• pronouns: she, he, I
• prepositions: on, under, over, near, by, …
Open Class Words
• Every known human language has nouns and verbs
• Nouns: people, places, things
– Classes of nouns
• proper vs. common
• count vs. mass
• Verbs: actions and processes
• Adjectives: properties, qualities
• Adverbs: hodgepodge!
– Unfortunately, John walked home extremely slowly yesterday
Closed Class Words
• Idiosyncratic
• Examples:
– prepositions: on, under, over, …
– particles: up, down, on, off, …
– determiners: a, an, the, …
– pronouns: she, who, I, …
– conjunctions: and, but, or, …
– auxiliary verbs: can, may, should, …
– numerals: one, two, three, third, …
Prepositions from CELEX
English Single-Word Particles
Pronouns in CELEX
Conjunctions
Auxiliaries
Word Classes: Tag Sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and purpose
– Some tagging approaches (e.g., constraint grammar based) make fewer distinctions, e.g., conflating prepositions, conjunctions, particles
– Simple morphology = more ambiguity = fewer tags
Word Classes: Tag set example
Example of Penn Treebank Tagging
of Brown Corpus Sentence
The/DT grand/JJ jury/NN commented/VBD
on/IN a/DT number/NN of/IN other/JJ
topics/NNS ./.
VB DT NN .
Book that flight .
VBZ DT NN VB NN ?
Does that flight serve dinner ?
The Problem
• Words often have more than one word class: this
– This is a nice day = PRP
– This day is nice = DT
– You can go this far = RB
Word Class Ambiguity (in the Brown Corpus)
Unambiguous (1 tag): 38,857
Ambiguous (2-7 tags): 8,844
  2 tags: 6,731
  3 tags: 1,621
  4 tags: 357
  5 tags: 90
  6 tags: 32
  7 tags: 6
(DeRose, 1988)
Part-of-Speech Tagging
• Rule-Based Tagger: ENGTWOL
• Stochastic Tagger: HMM-based
• Transformation-Based Tagger: Brill (covered in a later lecture)
Rule-Based Tagging: ENGTWOL
• Basic Idea:
– Assign all possible tags to words
– Remove tags according to a set of rules of this type: if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like “consider”, then eliminate non-adv; else eliminate adv.
– Typically more than 1000 hand-written rules, but they may be machine-learned.
Sample ENGTWOL Lexicon
Stage 1 of ENGTWOL Tagging
• First Stage: Run words through a Kimmo-style morphological analyzer to get all parts of speech.
• Example: Pavlov had shown that salivation …
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
Stage 2 of ENGTWOL Tagging
• Second Stage: Apply constraints.
• Constraints are used in a negative way.
• Example: Adverbial “that” rule
Given input: “that”
If
  (+1 A/ADV/QUANT)
  (+2 SENT-LIM)
  (NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV
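A simplified sketch (my own encoding, not ENGTWOL's actual rule formalism) of how this constraint could prune per-token candidate tag sets:

    def adverbial_that_rule(tokens, candidates, i):
        """Prune the tag set for tokens[i] == 'that' following the rule above.
        candidates is a list of tag sets parallel to tokens; SENT-LIM is assumed
        to appear on sentence-final punctuation."""
        if tokens[i].lower() != "that":
            return
        next_ok  = i + 1 < len(tokens) and candidates[i + 1] & {"A", "ADV", "QUANT"}
        next2_ok = i + 2 < len(tokens) and "SENT-LIM" in candidates[i + 2]
        prev_ok  = i == 0 or "SVOC/A" not in candidates[i - 1]
        if next_ok and next2_ok and prev_ok:
            candidates[i] &= {"ADV"}   # eliminate non-ADV tags
        else:
            candidates[i] -= {"ADV"}   # eliminate ADV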
Stochastic Tagging
• Based on the probability of a certain tag occurring given various possibilities
• Necessitates a training corpus
• No probabilities for words not in the corpus
• Training corpus may be too different from the test corpus
Stochastic Tagging (cont.)
Simple Method: Choose most frequent tag in training text for each word!
– Result: 90% accuracy
– Why?
– Baseline: others will do better
– HMM is an example
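A minimal sketch of this most-frequent-tag baseline (corpus handling and names are mine); it also makes the weaknesses listed above concrete, since unseen words simply fall back to a default tag:

    from collections import Counter, defaultdict

    def train_most_frequent_tag(tagged_sentences):
        """tagged_sentences: iterable of [(word, tag), ...] lists."""
        counts = defaultdict(Counter)
        for sentence in tagged_sentences:
            for word, tag in sentence:
                counts[word.lower()][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_words(words, model, default="NN"):
        # words never seen in training get the default tag
        return [(w, model.get(w.lower(), default)) for w in words]

    model = train_most_frequent_tag([[("the", "DT"), ("flight", "NN")],
                                     [("book", "VB"), ("the", "DT"), ("flight", "NN")]])
    print(tag_words(["Book", "that", "flight"], model))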
For next time …
• New Chapter 5 (particularly 5.5) and Chapter 6: HMMs