...

A Comparative Evaluation of Deep and Shallow Approaches to Common Grammatical Errors

by user

on
Category: Documents
49

views

Report

Comments

Transcript

A Comparative Evaluation of Deep and Shallow Approaches to Common Grammatical Errors
A Comparative Evaluation of
Deep and Shallow Approaches to
the Automatic Detection of
Common Grammatical Errors
Joachim Wagner, Jennifer Foster, and
Josef van Genabith
2007-07-26
National Centre for Language Technology
School of Computing, Dublin City University
1
Talk Outline
•
•
•
•
•
•
•
Motivation
Background
Artificial Error Corpus
Evaluation Procedure
Error Detection Methods
Results and Analysis
Conclusion and Future Work
2
Why Judge the Grammaticality?
• Grammar checking
• Computer-assisted language learning
– Feedback
– Writing aid
– Automatic essay grading
• Re-rank computer-generated output
– Machine translation
3
Why this Evaluation?
• No agreed standard
• Differences in
–
–
–
–
What is evaluated
Corpora
Error density
Error types
4
Talk Outline
•
•
•
•
•
•
•
Motivation
Background
Artificial Error Corpus
Evaluation Procedure
Error Detection Methods
Results and Analysis
Conclusion and Future Work
5
Deep Approaches
• Precision grammar
• Aim to distinguish grammatical sentences
from ungrammatical sentences
• Grammar engineers
– Increase coverage
– Avoid overgeneration
• For English:
– ParGram / XLE (LFG)
– English Resource Grammar / LKB (HPSG)
– RASP (GPSG to DCG influenced by ANLT)
6
Shallow Approaches
• Real-word spelling errors
– vs grammar errors in general
• Part-of-speech (POS) n-grams
–
–
–
–
–
Raw frequency
Machine learning-based classifier
Features of local context
Noisy channel model
N-gram similarity, POS tag set
7
Talk Outline
•
•
•
•
•
•
•
Motivation
Background
Artificial Error Corpus
Evaluation Procedure
Error Detection Methods
Results and Analysis
Conclusion and Future Work
8
Artificial Error Corpus
Real Error Corpus
(Small)
Chosen Error Types
Error Analysis
Automatic Error
Creation Modules
Common
Grammatical Error
Applied to BNC
(Big)
9
Common Grammatical Errors
• 20,000 word corpus
• Ungrammatical English sentences
– Newspapers, academic papers, emails, …
• Correction operators
–
–
–
–
Substitute (48 %)
Insert (24 %)
Delete (17 %)
Combination (11 %)
10
Common Grammatical Errors
• 20,000 word corpus
• Ungrammatical English sentences
– Newspapers, academic papers, emails, …
• Correction operators
–
–
–
–
Substitute (48 %)
Insert (24 %)
Delete (17 %)
Combination (11 %)
Agreement errors
Real-word spelling errors
11
Chosen Error Types
Agreement: She steered Melissa around a corners.
Real-word: She could no comprehend.
Extra word: Was that in the summer in?
Missing word: What the subject?
12
Automatic Error Creation
Agreement: replace determiner, noun or verb
Real-word: replace according to pre-compiled list
Extra word: duplicate token or part-of-speech,
or insert a random token
Missing word: delete token (likelihood based on
part-of-speech)
13
Talk Outline
•
•
•
•
•
•
•
Motivation
Background
Artificial Error Corpus
Evaluation Procedure
Error Detection Methods
Results and Analysis
Conclusion and Future Work
14
BNC Test Data (1)
BNC: 6.4 M sentences
4.2 M sentences (no speech,
poems, captions and list items)
Randomisation
1
2
3
4
5
…
10
10 sets with 420 K
sentences each
15
BNC Test Data (2)
Error corpus
1
2
3
4
5
…
10
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5
…
10
Error creation
Agreement
Real-word
…
…
10
10
Extra word
Missing word
…
10
16
BNC Test Data (3)
Mixed error type
¼ each
1
1
1
1
…
1
…
…
¼ each
10
10
10
10
10
17
BNC Test Data (4)
5 error types:
agreement, real-word, extra word, missing word, mixed errors
1
50 sets
1
…
10
1
1
1
…
10
10
10
1
1
…
10
10
1
…
10
10
1
1
…
10
10
Each 50:50 ungrammatical:grammatical
18
BNC Test Data (5)
Test
data
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
…
10
…
10
10
10
…
10
10
…
10
10
…
10
10
Example:
1st crossvalidation run
for agreement
errors
Training
data
(if required
by method)
19
Evaluation Measures
• Accuracy on ungrammatical data
acc_ungram =
# correctly flagged as ungrammatical
# ungrammatical sentences
• Accuracy on grammatical data
acc_gram =
# correctly classified as grammatical
# grammatical sentences
• Independent of error density of test data
20
Accuracy Graph
21
Region of Improvement
22
Region of Degradation
23
Undecided
24
Talk Outline
•
•
•
•
•
•
•
Motivation
Background
Artificial Error Corpus
Evaluation Procedure
Error Detection Methods
Results and Analysis
Conclusion and Future Work
25
Overview of Methods
XLE Output
POS n-gram
information
M1
M2
Basic methods
M3
M4
Decision tree methods
M5
26
M1
Method 1: Precision Grammar
• XLE English LFG
• Fragment rule
– Parses ungrammatical input
– Marked with *
• Zero number of parses
• Parser exceptions (time-out, memory)
27
M1
XLE Parsing
First 60 K
sentences
1
…
10
1
…
10
1
…
10
1
…
10
1
…
10
XLE
50 x 60 K = 3 M parse results
28
M2
Method 2: POS N-grams
• Flag rare POS n-grams as errors
• Rare: according to reference corpus
• Parameters: n and frequency threshold
– Tested n = 2, …, 7 on held-out data
– Best: n = 5 and frequency threshold 4
29
M2
POS N-gram Information
9 sets
1
Reference
n-gram table
Repeated for
n = 2, 3, …, 7
…
10
1
…
10
1
…
10
1
…
10
1
…
10
Rarest n-gram
3 M frequency values
30
M3
Method 3: Decision Trees on
XLE Output
• Output statistics
– Starredness (0 or 1) and parser exceptions
(-1 = time-out, -2 = exceeded memory, …)
– Number of optimal parses
– Number of unoptimal parses
– Duration of parsing
– Number of subtrees
– Number of words
31
M3
Decision Tree Example
Star?
>= 0
<0
Star?
U
<1
>= 1
Optimal?
<5
U = ungrammatical
G = grammatical
U
>= 5
U
G
32
M4
Method 4: Decision Trees on Ngrams
• Frequency of rarest n-gram in sentence
• N = 2, …, 7
– feature vector: 6 numbers
33
M4
Decision Tree Example
5-gram?
>= 4
<4
7-gram?
U
<1
>= 1
5-gram?
<45
U
>= 45
G
G
34
M5
Method 5: Decision Trees on
Combined Feature Sets
Star?
>= 0
<0
Star?
U
<1
>= 1
5-gram?
<4
U
>= 4
U
G
35
Talk Outline
•
•
•
•
•
•
•
Motivation
Background
Artificial Error Corpus
Evaluation Procedure
Error Detection Methods
Results and Analysis
Conclusion and Future Work
36
XLE Parsing of the BNC
• 600,000 grammatical sentences
• 2.4 M ungrammatical sentences
• Parse-testfile command
–
–
–
–
Parse-literally 1
Max xle scratch storage 1,000 MB
Time-out 60 seconds
No skimming
37
Efficiency
Time-out
10,000 BNC
sentences
(grammatical)
38
XLE Parse Results and Method 1
Gramm.
Agree.
Real-w. Extra-w. Missing
Covered
67.1%
35.4%
42.7%
40.3%
52.2%
Fragments
29.7%
58.3%
53.8%
56.4%
44.6%
No parse
0.3%
0.4%
0.3%
0.3%
0.3%
Time-out
0.6%
1.1%
0.7%
0.6%
0.6%
Out-of-memory
2.3%
4.8%
2.6%
2.4%
2.2%
2
3
4
3
2
67.1%
64.6%
57.3%
59.7%
47.8%
Crash (absolute)
Accuracy M1
39
XLE Coverage
5 x 600 K
Test data
40
Applying Decision Tree to XLE
M1
M3
41
Overall Accuracy for M1 and M3
Overall accuracy
0.75
0.70
0.65
M1
M3
0.60
0.55
0.50
0%
20%
40%
60%
80%
Error density of test data
100%
42
Varying Training Error Density
20%
25%
M3 33%
M3 40%
M3 50%
M1
M3 60%
M3 67%
M3 75%
43
Varying Training Error Density
20%
25%
M3 33%
M3 40%
M3 50%
M1
M3 60%
M3 67%
M3 75%
44
Varying Training Error Density
M3 40%
M3 43%
M1
M1: XLE
M3: with
decision
tree
M3 50%
45
Varying Training Error Density
M3 40%
M3 43%
M1
M1: XLE
M3: with
decision
tree
M3 50%
46
N-Grams and DT (M2 vs M4)
M2: Ngram
M4: DT
M4 25%
M4 50%
M2
M4 67%
M4 75%
47
Methods 1 to 4
M1: XLE
M2: Ngram
M3/4: DT
M3 43%
M4 50%
M1
M2
M3 50%
48
Combined Method (M5)
10%, 20%
25%
50%
67%
75%
80%
90%
49
All Methods
M1: XLE
M2: Ngram
M3/4: DT
M5: comb
M5 45%
M3 43%
M1
M4
M5 50%
M2
M3 50%
50
Breakdown by Error Type
m.w.
r.w.
e.w.
ag.
M5 45%
m.w.
r.w. e.w.
M1
ag.
M5 50%
m.w.
r.w.
e.w.
ag.
51
Breakdown by Error Type
m.w.
r.w.
e.w.
ag.
M5 45%
e.w. r.w.
m.w.
r.w. e.w.
M3 43%
M1
ag.
M5 50%
M3 50%
m.w.
e.w. r.w.
ag.
52
Talk Outline
•
•
•
•
•
•
•
Motivation
Background
Artificial Error Corpus
Evaluation Procedure
Error Detection Methods
Results and Analysis
Conclusion and Future Work
53
Conclusions
• Basic methods surprisingly close to
each other
• Decision tree
– Effective with deep approach
– Small and noisy improvement with shallow
approach
• Combined approach best on all but one
error type
54
Future Work
• Error types:
– Word order
– Multiple errors per sentence
•
•
•
•
Add more features
Other languages
Test on MT output
Establish upper bound
55
References
• E. Atwell: How to detect grammatical errors in a text without
parsing it. In Proceedings of the 3rd EACL, pp 38-45, 1987
• M. Butt, H. Dyvik, T. H. King, H. Masuichi, and C. Rohrer: The
parallel grammar project. In Proceedings of COLING-2002
• J. Foster: Good Reasons for Noting Bad Grammar: Empirical
Investigations into the Parsing of Ungrammatical Written
English. Ph.D. thesis, University of Dublin, Trinity College, 2005
• J. Wagner, J. Foster and J. van Genabith: A Comparative
Evaluation of Deep and Shallow Approaches to the Automatic
Detection of Common Grammatical Errors. In Proceedings of
EMNLP-CoNLL 2007
• I. H. Witten and E. Frank: Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations.
Morgan Kaufmann Publishers, 2000
56
Thank You!
Djamé Seddah
(La Sorbonne University)
National Centre for Language Technology
School of Computing, Dublin City University
57
Why not use F-Score?
• Precision and F-Score
– Depend on error density of test data
– What are true positives?
– Weighting parameter of F-score
• Recall and 1-Fallout
– Accuracy on ungrammatical data
– Accuracy on grammatical data
58
Results: F-Score
0.8
0.7
0.6
0.5
XLE
ngram
XLE+DT
ngram+DT
combined
0.4
0.3
0.2
0.1
0.0
Agreement
Real-word
Extra word
Missing word
Mixed errors
59
F-Score (tp=correctly flagged)
1.0
F-Score
0.8
0.6
Baseline
M1
M3
0.4
0.2
0.0
0%
20%
40%
60%
80%
Error density of test data
100%
60
POS n-grams and Agreement
Errors
n = 2, 3, 4, 5
XLE parser
F-Score 65 %
Best Accuracy 55 %
Best F-Score 66 %
61
POS n-grams and ContextSensitive Spelling Errors
Best Accuracy 66 %
XLE 60 %
Best F-Score 69 %
n = 2, 3, 4, 5
62
POS n-grams and Extra Word
Errors
Best Accuracy 68 %
XLE 62 %
Best F-Score 70 %
n = 2, 3, 4, 5
63
POS n-grams and Missing Word
Errors
n = 2, 3, 4, 5
XLE 53 %
Best Accuracy 59 %
Best F-Score 67 %
64
Inverting Decisions
65
Why Judge Grammaticality? (2)
• Automatic essay grading
• Trigger deep error analysis
– Increase speed
– Reduce overflagging
• Most approaches easily extend to
– Locating errors
– Classifying errors
66
Grammar Checker Research
• Focus of grammar checker research
–
–
–
–
Locate errors
Categorise errors
Propose corrections
Other feedback (CALL)
• Approaches:
– Extend existing grammars
– Write new grammars
67
N-gram Methods
• Flag unlikely or rare sequences
– POS (different tagsets)
– Tokens
– Raw frequency vs. mutual information
• Most publications are in the area of
context-sensitive spelling correction
– Real word errors
– Resulting sentence can be grammatical
68
Test Corpus - Example
• Missing Word Error
She didn’t want to face him
She didn’t to face him
69
Test Corpus – Example 2
• Context-sensitive spelling error
I love them both
I love then both
70
Cross-validation
•
•
•
•
Standard deviation below 0.006
Except Method 4: 0.026
High number of test items
Report average percentage
71
Example
Run
Stdev
F-Score
1
0.654
2
0.655
3
0.655
4
0.655
5
0.653
6
0.652
7
0.653
8
0.657
9
0.654
10
0.653
Method 1 – Agreement errors:
65.4 % average F-Score
0.001
72
Fly UP