CMSC 723 / LING 723: Computational Linguistics I
September 26, 2007: Dorr
Computational Morphology (Chapter 3),
Word Classes, POS Tagging (Chapter 5)
Prof. Bonnie J. Dorr
Co-Instructor: Nitin Madnani
TA: Hamid Shahri
Computational Morphology
• Finite State Morphology
– Finite State Transducers (FST)
• Input/Output
• Analysis/Generation
Computational Morphology
WORD       STEM (+FEATURES)*
cats       cat +N +PL
cat        cat +N +SG
cities     city +N +PL
geese      goose +N +PL
ducks      (duck +N +PL) or (duck +V +3SG)
merging    merge +V +PRES-PART
caught     (catch +V +PAST-PART) or (catch +V +PAST)
Building a Morphological Parser
• The Rules and the Lexicon
– General versus Specific
– Regular versus Irregular
– Accuracy, speed, space
– The Morphology of a language
• Approaches
– Lexicon only
– Lexicon and Rules
• Finite-state Automata
• Finite-state Transducers
– Rules only
Lexicon-only Morphology
• The lexicon lists all surface level and lexical level pairs
• No rules …?
• Analysis/Generation is easy
• Very large for English
• What about Arabic or Turkish?
• Chinese?
Surface form    Lexical entry
acclaim         acclaim $N$
acclaim         acclaim $V+0$
acclaimed       acclaim $V+ed$
acclaimed       acclaim $V+en$
acclaiming      acclaim $V+ing$
acclaims        acclaim $N+s$
acclaims        acclaim $V+s$
acclamation     acclamation $N$
acclamations    acclamation $N+s$
acclimate       acclimate $V+0$
acclimated      acclimate $V+ed$
acclimated      acclimate $V+en$
acclimates      acclimate $V+s$
acclimating     acclimate $V+ing$
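As a rough, hedged illustration (names and layout are mine, not from the lecture), a lexicon-only analyzer can store the pairs above directly and answer analysis and generation queries by lookup:

    # Minimal sketch of lexicon-only morphology: every surface/lexical pair is listed.
    PAIRS = [
        ("acclaim",    "acclaim $N$"),
        ("acclaim",    "acclaim $V+0$"),
        ("acclaimed",  "acclaim $V+ed$"),
        ("acclaims",   "acclaim $V+s$"),
        ("acclimates", "acclimate $V+s$"),
        # ... one entry per pair in the full listing above
    ]

    def analyze(surface):
        """Surface form -> all lexical-level analyses (may be ambiguous)."""
        return [lex for s, lex in PAIRS if s == surface]

    def generate(lexical):
        """Lexical entry -> all surface forms."""
        return [s for s, lex in PAIRS if lex == lexical]

    print(analyze("acclaim"))          # ['acclaim $N$', 'acclaim $V+0$']
    print(generate("acclaim $V+s$"))   # ['acclaims']

The listing makes the space cost explicit: every inflected form of every word must appear, which is why this approach becomes very large for English and explodes for morphologically rich languages such as Arabic or Turkish.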
Building a Morphological Parser
• The Rules and the Lexicon
– General versus Specific
– Regular versus Irregular
– Accuracy, speed, space
– The Morphology of a language
• Approaches
– Lexicon only
– Lexicon and Rules
• Finite-state Automata
• Finite-state Transducers
– Rules only
Lexicon and Rules: FSA Inflectional Noun Morphology
• English Noun Lexicon
reg-noun   irreg-pl-noun   irreg-sg-noun   plural
fox        geese           goose           -s
cat        sheep           sheep
dog        mice            mouse
• English Noun Rule
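As a hedged sketch (the organization and names are mine; the textbook FSA figure may differ in detail), the lexicon classes plus the -s rule can be simulated as a simple acceptor:

    # Word classes from the noun lexicon above
    REG_NOUN      = {"fox", "cat", "dog"}
    IRREG_PL_NOUN = {"geese", "sheep", "mice"}
    IRREG_SG_NOUN = {"goose", "sheep", "mouse"}

    def accepts_noun(word):
        """Accept irregular nouns as listed, and regular nouns bare or with -s."""
        if word in IRREG_SG_NOUN or word in IRREG_PL_NOUN:
            return True
        if word in REG_NOUN:
            return True
        return word.endswith("s") and word[:-1] in REG_NOUN

    print(accepts_noun("cats"), accepts_noun("geese"))   # True True
    # Note: accepts_noun("foxes") is False here, because spelling changes such as
    # e-insertion are exactly what the transducers on the following slides handle.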
Lexicon and Rules: FSA English Verb Inflectional Morphology
reg-verb-stem   irreg-verb-stem   irreg-past-verb       past   past-part   pres-part   3sg
walk            cut               caught, ate, eaten,   -ed    -ed         -ing        -s
fry             speak             sang, spoken
talk            sing
impeach
FSA for Derivational Morphology: Adjectival Formation
More Complex Derivational Morphology
Using FSAs for Recognition: English Nouns and their Inflection
Morphological Parsing
• Finite-state automata (FSA)
– Recognizer
– One-level morphology
• Finite-state transducers (FST)
– Two-level morphology
– input-output pair
Terminology for Kimmo-style Morphology
• Upper = lexical tape
• Lower = surface tape
• Characters correspond to pairs, written a:b
• If “a:a”, write “a” for shorthand
• Two-level lexical entries
• # = word boundary
• ^ = morpheme boundary
• Other = “any feasible pair that is not in this transducer”
• Final states indicated with “:” and non-final states indicated with “.”
Four-Fold View of FSTs
• As a recognizer
• As a generator
• As a translator
• As a set relater
Nominal Inflection FST
Lexical and Intermediate Tapes
Spelling Rules
Name                 Rule Description                                  Example
Consonant Doubling   1-letter consonant doubled before -ing/-ed        beg/begging
E-deletion           silent e dropped before -ing and -ed              make/making
E-insertion          e added after s, z, x, ch, sh before s            watch/watches
Y-replacement        -y changes to -ie before -s, -i before -ed        try/tries
K-insertion          verbs ending with vowel + -c add -k               panic/panicked
Chomsky and Halle Notation (the E-insertion rule):
ε → e / {x, s, z} ^ __ s#
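As a rough illustration (not from the lecture) of what the rule does, the same E-insertion can be written as a regular-expression substitution over an intermediate form that still carries the ^ and # boundary symbols:

    import re

    def e_insertion(intermediate):
        """Apply E-insertion to an intermediate form such as 'fox^s#'."""
        # insert e when a stem ending in s, z, x (or ch, sh, per the spelling-rule
        # table above) is followed by the morpheme boundary ^ and the suffix s#
        surface = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es#", intermediate)
        # erase the boundary symbols to obtain the surface string
        return surface.replace("^", "").replace("#", "")

    print(e_insertion("fox^s#"))    # foxes
    print(e_insertion("cat^s#"))    # cats
    print(e_insertion("watch^s#"))  # watches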
Intermediate-to-Surface
Transducer
State Transition Table
How would we represent this in NLTK?

    from nltk_contrib.fst import fst

    # create a named FST and make q0 both the initial and a final state
    f = fst.FST('pluralize')
    f.add_state('q0')
    f.initial_state = 'q0'
    f.set_final('q0')
    # add a second state and an arc that maps input 'z' to output 'z'
    f.add_state('q1')
    f.add_arc('q0', 'q1', 'z', 'z')
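For comparison, here is a library-free sketch of the same idea: the state transition table as a plain dictionary, with only the single arc from the example filled in (the rest of the intermediate-to-surface transducer would be added the same way):

    # (state, input symbol) -> (next state, output symbol)
    ARCS = {
        ("q0", "z"): ("q1", "z"),
        # ... remaining arcs of the intermediate-to-surface transducer
    }
    FINAL_STATES = {"q0"}

    def transduce(symbols, start="q0"):
        """Run the transducer over a list of input symbols; return output or None."""
        state, output = start, []
        for sym in symbols:
            if (state, sym) not in ARCS:
                return None              # no matching arc: the input is rejected
            state, out = ARCS[(state, sym)]
            output.append(out)
        return "".join(output) if state in FINAL_STATES else None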
Two-Level Morphology: Sample Run
FSTs and ambiguity
• Parse Example 1: unionizable
– union +ize +able
– un+ ion +ize +able
• Parse Example 2: assess
– assess (V)
– ass (N) + ess (N)
• Parse Example 3: tender
– tender (AJ)
– ten (NUM) + d (AJ) + er (CMP)
What to do about Global Ambiguity?
• Accept first successful structure
• Run parser through all possible paths
• Bias the search in some manner
Computational Morphology
• The Rules and the Lexicon
– General versus Specific
– Regular versus Irregular
– Accuracy, speed, space
– The Morphology of a language
• Approaches
– Lexicon only
– Lexicon and Rules
• Finite-state Automata
• Finite-state Transducers
– Rules only
Lexicon-Free Morphology: Porter Stemmer
• Lexicon-Free FST Approach
• By Martin Porter (1980)
  http://www.tartarus.org/%7Emartin/PorterStemmer/
• Here is one you can try online:
  http://www.utilitymill.com/utility/Porter_Stemming_Algorithm
• Cascade of substitutions given specific conditions:
  GENERALIZATIONS → GENERALIZATION → GENERALIZE → GENERAL → GENER
Porter Stemmer
• Definitions
– C = string of one or more consonants, where a consonant is any letter other than A, E, I, O, U, or Y preceded by a consonant
– V = string of one or more vowels
– M = measure, roughly corresponding to the number of syllables
– Words have the form [C](VC)^M[V], i.e., an optional consonant cluster, M vowel-consonant sequences, then an optional vowel cluster
• M=0: TR, EE, TREE, Y, BY
• M=1: TROUBLE, OATS, TREES, IVY
• M=2: TROUBLES, PRIVATE, OATEN, ORRERY
• Conditions
– *S - stem ends with S (and similarly for any other letter)
– *v* - stem contains a vowel
– *d - stem ends with a double consonant, e.g., -TT, -SS
– *o - stem ends CVC, where the second C is not W, X or Y, e.g., -WIL, -HOP
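The measure M can be computed directly from these definitions; the following sketch (function names are mine) reproduces the M=0/1/2 examples above:

    def is_consonant(word, i):
        """Porter: a consonant is any letter other than A, E, I, O, U,
        and other than Y preceded by a consonant."""
        ch = word[i].lower()
        if ch in "aeiou":
            return False
        if ch == "y":
            return i == 0 or not is_consonant(word, i - 1)
        return True

    def measure(word):
        """M = number of VC sequences in the [C](VC)^M[V] form of the word."""
        shape = []
        for i in range(len(word)):
            t = "C" if is_consonant(word, i) else "V"
            if not shape or shape[-1] != t:
                shape.append(t)
        return "".join(shape).count("VC")

    print(measure("tree"), measure("trouble"), measure("troubles"))   # 0 1 2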
Porter Stemmer
Step 1: Plural Nouns and Third Person Singular Verbs
SSES → SS    caresses → caress
IES  → I     ponies → poni, ties → ti
SS   → SS    caress → caress
S    → ε     cats → cat
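A minimal sketch of Step 1 as ordered string tests (the function name is mine; the slide's examples appear in the comments):

    def step1(word):
        """Porter Step 1: plural nouns and third person singular verbs."""
        if word.endswith("sses"):
            return word[:-2]      # caresses -> caress
        if word.endswith("ies"):
            return word[:-2]      # ponies -> poni, ties -> ti
        if word.endswith("ss"):
            return word           # caress -> caress
        if word.endswith("s"):
            return word[:-1]      # cats -> cat
        return word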
Step 2a: Verbal Past Tense and Progressive Forms
(M>0) EED → EE      feed → feed, agreed → agree
i.  (*v*) ED → ε    plastered → plaster, bled → bled
ii. (*v*) ING → ε   motoring → motor, sing → sing
Step 2b: If 2a.i or 2a.ii is successful, Cleanup
AT → ATE                                        conflat(ed) → conflate
BL → BLE                                        troubl(ed) → trouble
IZ → IZE                                        siz(ed) → size
(*d and not (*L or *S or *Z)) → single letter   hopp(ing) → hop, tann(ed) → tan, hiss(ing) → hiss, fizz(ed) → fizz
(M=1 and *o) → E                                fail(ing) → fail, fil(ing) → file
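Continuing the sketch, Step 2a can be written with the measure() and is_consonant() helpers defined earlier (a simplified illustration; the Step 2b cleanup rules are not shown):

    def contains_vowel(stem):
        return any(not is_consonant(stem, i) for i in range(len(stem)))

    def step2a(word):
        """Porter Step 2a: past tense and progressive forms."""
        if word.endswith("eed"):
            # (M>0) EED -> EE: agreed -> agree, but feed -> feed
            return word[:-1] if measure(word[:-3]) > 0 else word
        for suffix in ("ed", "ing"):
            if word.endswith(suffix):
                stem = word[: -len(suffix)]
                if contains_vowel(stem):
                    return stem    # plastered -> plaster, motoring -> motor
        return word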
Porter Stemmer
Step 3: Y → I
(*v*) Y → I    happy → happi, sky → sky
Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL → ATE    relational → relate
(m>0) TIONAL → TION    conditional → condition, rational → rational
(m>0) ENCI → ENCE      valenci → valence
(m>0) ANCI → ANCE      hesitanci → hesitance
(m>0) IZER → IZE       digitizer → digitize
(m>0) ABLI → ABLE      conformabli → conformable
(m>0) ALLI → AL        radicalli → radical
(m>0) ENTLI → ENT      differentli → different
(m>0) ELI → E          vileli → vile
(m>0) OUSLI → OUS      analogousli → analogous
(m>0) IZATION → IZE    vietnamization → vietnamize
(m>0) ATION → ATE      predication → predicate
(m>0) ATOR → ATE       operator → operate
(m>0) ALISM → AL       feudalism → feudal
(m>0) IVENESS → IVE    decisiveness → decisive
(m>0) FULNESS → FUL    hopefulness → hopeful
(m>0) OUSNESS → OUS    callousness → callous
(m>0) ALITI → AL       formaliti → formal
(m>0) IVITI → IVE      sensitiviti → sensitive
(m>0) BILITI → BLE     sensibiliti → sensible
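All of these rules share the same shape, so they can be table-driven; this sketch (rule list truncated, with longer suffixes listed before shorter ones so ATIONAL is tried before TIONAL) reuses measure() from above:

    STEP4_RULES = [
        ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
        ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
        # ... remaining (suffix, replacement) pairs from the table above
    ]

    def step4(word):
        """Apply the first matching rule, gated by the (m>0) condition."""
        for old, new in STEP4_RULES:
            if word.endswith(old):
                stem = word[: -len(old)]
                return stem + new if measure(stem) > 0 else word
        return word

    print(step4("relational"))   # relate
    print(step4("rational"))     # rational (m of 'r' is 0, so no change)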
Porter Stemmer
Step 5: Derivational Morphology II: More Multiple Suffixes
(m>0) ICATE → IC    triplicate → triplic
(m>0) ATIVE → ε     formative → form
(m>0) ALIZE → AL    formalize → formal
(m>0) ICITI → IC    electriciti → electric
(m>0) ICAL → IC     electrical → electric
(m>0) FUL → ε       hopeful → hope
(m>0) NESS → ε      goodness → good
Porter Stemmer
Step 6: Derivational Morphology III: Single Suffixes
(m>1) AL → ε                    revival → reviv
(m>1) ANCE → ε                  allowance → allow
(m>1) ENCE → ε                  inference → infer
(m>1) ER → ε                    airliner → airlin
(m>1) IC → ε                    gyroscopic → gyroscop
(m>1) ABLE → ε                  adjustable → adjust
(m>1) IBLE → ε                  defensible → defens
(m>1) ANT → ε                   irritant → irrit
(m>1) EMENT → ε                 replacement → replac
(m>1) MENT → ε                  adjustment → adjust
(m>1) ENT → ε                   dependent → depend
(m>1 and (*S or *T)) ION → ε    adoption → adopt
(m>1) OU → ε                    homologou → homolog
(m>1) ISM → ε                   communism → commun
(m>1) ATE → ε                   activate → activ
(m>1) ITI → ε                   angulariti → angular
(m>1) OUS → ε                   homologous → homolog
(m>1) IVE → ε                   effective → effect
(m>1) IZE → ε                   bowdlerize → bowdler
Porter Stemmer
Step 7a: Cleanup
(m>1) E → ε              probate → probat, rate → rate
(m=1 and not *o) E → ε   cease → ceas
Step 7b: More Cleanup
(m>1 and *d and *L) → single letter   controll → control, roll → roll
Porter Stemmer
• Errors of Omission (pairs that should be conflated but are not):
– European / Europe
– analysis / analyzes
– matrices / matrix
– noise / noisy
– explain / explanation
• Errors of Commission (pairs that are conflated but should not be):
– organization / organ
– doing / doe
– generalization / generic
– numerical / numerous
– university / universe
From Krovetz ’93
Back to last week’s digression: How would FST composition be applied to stemming?
FST_3: (*v*) Y → I    (states 0, 1, 2; arcs 0 -V:V-> 1, 1 -Y:I-> 2)
FST_5: NESS → ε       (states 0, 1; arc 0 -NESS:0-> 1)
FST_3 o FST_5         (states 0, 1, 2, 3; arcs 0 -V:V-> 1, 1 -Y:I-> 2, 2 -NESS:0-> 3)
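At the string level the idea can be sketched by cascading two rewrite functions: running the rules one after another computes the same input-output relation that the single composed transducer computes in one pass (function names and the simplified rules are mine):

    def rule_y_to_i(word):
        """FST_3: (*v*) Y -> I, e.g. happy -> happi, sky -> sky."""
        if word.endswith("y") and any(ch in "aeiou" for ch in word[:-1]):
            return word[:-1] + "i"
        return word

    def rule_drop_ness(word):
        """FST_5: NESS -> epsilon, e.g. goodness -> good."""
        return word[:-4] if word.endswith("ness") else word

    def compose(*rules):
        """Cascade the rules in order, mimicking transducer composition."""
        def composed(word):
            for rule in rules:
                word = rule(word)
            return word
        return composed

    stem = compose(rule_y_to_i, rule_drop_ness)
    print(stem("happy"), stem("goodness"))   # happi good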
Word Classes and Part-of-Speech Tagging
• Definition and Example
• Motivation
• Word Classes
• Rule-based Tagging
• Stochastic Tagging
• Transformation-Based Tagging (covered in a later lecture)
• Tagging Unknown Words
Definition: POS Tagging
“The process of assigning a part-of-speech or
other lexical class marker to each word in a
corpus” (Jurafsky and Martin)
WORDS and TAGS (tag set: N, V, P, DET):
the/DET girl/N kissed/V the/DET boy/N on/P the/DET cheek/N
An Example
WORD     LEMMA   TAG
the      the     +DET
girl     girl    +NOUN
kissed   kiss    +VPAST
the      the     +DET
boy      boy     +NOUN
on       on      +PREP
the      the     +DET
cheek    cheek   +NOUN
From: http://www.xrce.xerox.com/competencies/content-analysis/fsnlp/tagger.en.html
Motivation
• Speech synthesis — pronunciation
• Speech recognition — class-based N-grams
• Information retrieval — stemming, selection of high-content words
• Word-sense disambiguation
• Corpus analysis of language & lexicography
Word Classes
• Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …
• POS based on morphology and syntax
• Open vs. Closed classes
– Open:
• Nouns, Verbs, Adjectives, Adverbs
– Closed:
• determiners: a, an, the
• pronouns: she, he, I
• prepositions: on, under, over, near, by, …
Open Class Words
• Every known human language has nouns and verbs
• Nouns: people, places, things
– Classes of nouns
• proper vs. common
• count vs. mass
• Verbs: actions and processes
• Adjectives: properties, qualities
• Adverbs: hodgepodge!
– Unfortunately, John walked home extremely slowly yesterday
Closed Class Words
• Idiosyncratic
• Examples:
– prepositions: on, under, over, …
– particles: up, down, on, off, …
– determiners: a, an, the, …
– pronouns: she, who, I, …
– conjunctions: and, but, or, …
– auxiliary verbs: can, may, should, …
– numerals: one, two, three, third, …
Prepositions from CELEX
English Single-Word Particles
Pronouns in CELEX
Conjunctions
Auxiliaries
Word Classes: Tag Sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and purpose
– Some tagging approaches (e.g., constraint grammar based) make fewer distinctions, e.g., conflating prepositions, conjunctions, particles
– Simple morphology = more ambiguity = fewer tags
Word Classes: Tag set example
Example of Penn Treebank Tagging
of Brown Corpus Sentence
The/DT grand/JJ jury/NN commented/VBD
on/IN a/DT number/NN of/IN other/JJ
topics/NNS ./.
VB DT NN .
Book that flight .
VBZ DT NN VB NN ?
Does that flight serve dinner ?
The Problem
• Words often have more than one word class: this
– This is a nice day = PRP
– This day is nice = DT
– You can go this far = RB
Word Class Ambiguity (in the Brown Corpus)
Unambiguous (1 tag): 38,857
Ambiguous (2-7 tags): 8,844
  2 tags: 6,731
  3 tags: 1,621
  4 tags: 357
  5 tags: 90
  6 tags: 32
  7 tags: 6
(DeRose, 1988)
Part-of-Speech Tagging
• Rule-Based Tagger: ENGTWOL
• Stochastic Tagger: HMM-based
• Transformation-Based Tagger: Brill (covered in a later lecture)
Rule-Based Tagging: ENGTWOL
• Basic Idea:
– Assign all possible tags to words
– Remove tags according to a set of rules of this type: if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like “consider”, then eliminate non-adv; else eliminate adv.
– Typically more than 1000 hand-written rules, but they may be machine-learned.
Sample ENGTWOL Lexicon
Stage 1 of ENGTWOL Tagging
• First Stage: Run words through a Kimmo-style morphological analyzer to get all parts of speech.
• Example: Pavlov had shown that salivation …
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
Stage 2 of ENGTWOL Tagging
• Second Stage: Apply constraints.
• Constraints are used in a negative way.
• Example: Adverbial “that” rule
Given input: “that”
If
  (+1 A/ADV/QUANT)
  (+2 SENT-LIM)
  (NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV
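A simplified sketch (my own encoding, not ENGTWOL's actual rule formalism) of how this constraint could prune per-token candidate tag sets:

    def adverbial_that_rule(tokens, candidates, i):
        """Prune the tag set for tokens[i] == 'that' following the rule above.
        candidates is a list of tag sets parallel to tokens; SENT-LIM is assumed
        to appear on sentence-final punctuation."""
        if tokens[i].lower() != "that":
            return
        next_ok  = i + 1 < len(tokens) and candidates[i + 1] & {"A", "ADV", "QUANT"}
        next2_ok = i + 2 < len(tokens) and "SENT-LIM" in candidates[i + 2]
        prev_ok  = i == 0 or "SVOC/A" not in candidates[i - 1]
        if next_ok and next2_ok and prev_ok:
            candidates[i] &= {"ADV"}   # eliminate non-ADV tags
        else:
            candidates[i] -= {"ADV"}   # eliminate ADV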
Stochastic Tagging
• Based on the probability of a certain tag occurring given various possibilities
• Necessitates a training corpus
• No probabilities for words not in the corpus
• Training corpus may be too different from the test corpus
Stochastic Tagging (cont.)
Simple Method: Choose most frequent tag in training text for each word!
– Result: 90% accuracy
– Why?
– Baseline: others will do better
– HMM is an example
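A minimal sketch of this most-frequent-tag baseline (corpus handling and names are mine); it also makes the weaknesses listed above concrete, since unseen words simply fall back to a default tag:

    from collections import Counter, defaultdict

    def train_most_frequent_tag(tagged_sentences):
        """tagged_sentences: iterable of [(word, tag), ...] lists."""
        counts = defaultdict(Counter)
        for sentence in tagged_sentences:
            for word, tag in sentence:
                counts[word.lower()][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag_words(words, model, default="NN"):
        # words never seen in training get the default tag
        return [(w, model.get(w.lower(), default)) for w in words]

    model = train_most_frequent_tag([[("the", "DT"), ("flight", "NN")],
                                     [("book", "VB"), ("the", "DT"), ("flight", "NN")]])
    print(tag_words(["Book", "that", "flight"], model))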
For next time …
• New Chapter 5 (particularly 5.5) and Chapter 6: HMMs