A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors
Transcript
Slide 1: A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors
Joachim Wagner, Jennifer Foster, and Josef van Genabith
2007-07-26
National Centre for Language Technology, School of Computing, Dublin City University

Slide 2: Talk Outline
• Motivation
• Background
• Artificial Error Corpus
• Evaluation Procedure
• Error Detection Methods
• Results and Analysis
• Conclusion and Future Work

Slide 3: Why Judge the Grammaticality?
• Grammar checking
• Computer-assisted language learning
  – Feedback
  – Writing aid
  – Automatic essay grading
• Re-rank computer-generated output
  – Machine translation

Slide 4: Why this Evaluation?
• No agreed standard
• Differences in
  – What is evaluated
  – Corpora
  – Error density
  – Error types

Slide 5: Talk Outline (same outline as Slide 2)

Slide 6: Deep Approaches
• Precision grammar
• Aim to distinguish grammatical sentences from ungrammatical sentences
• Grammar engineers
  – Increase coverage
  – Avoid overgeneration
• For English:
  – ParGram / XLE (LFG)
  – English Resource Grammar / LKB (HPSG)
  – RASP (GPSG to DCG influenced by ANLT)

Slide 7: Shallow Approaches
• Real-word spelling errors
  – vs grammar errors in general
• Part-of-speech (POS) n-grams
  – Raw frequency
  – Machine learning-based classifier
  – Features of local context
  – Noisy channel model
  – N-gram similarity, POS tag set

Slide 8: Talk Outline (same outline as Slide 2)

Slide 9: Artificial Error Corpus
[Diagram: real error corpus (small) → error analysis → chosen error types → automatic error creation modules → applied to the BNC (big).]

Slide 10: Common Grammatical Errors
• 20,000-word corpus
• Ungrammatical English sentences
  – Newspapers, academic papers, emails, …
• Correction operators
  – Substitute (48%)
  – Insert (24%)
  – Delete (17%)
  – Combination (11%)

Slide 11: Common Grammatical Errors (continued)
Same content as Slide 10, with two classes highlighted: agreement errors and real-word spelling errors.

Slide 12: Chosen Error Types
Agreement: She steered Melissa around a corners.
Real-word: She could no comprehend.
Extra word: Was that in the summer in?
Missing word: What the subject?

Slide 13: Automatic Error Creation
Agreement: replace determiner, noun or verb
Real-word: replace according to a pre-compiled list
Extra word: duplicate a token or part-of-speech, or insert a random token
Missing word: delete a token (likelihood based on its part-of-speech)
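A rough picture of how such error creation modules might work is sketched below in Python. This is not the authors' implementation: the confusion lists, the random choice of positions and the plain token-list sentence format are assumptions made purely for illustration (the real modules work on POS-tagged BNC sentences).

    import random

    # Toy stand-ins for the pre-compiled lists used by the real modules (assumed).
    REAL_WORD_CONFUSIONS = {"not": "no", "them": "then", "an": "and"}
    AGREEMENT_SWAPS = {"is": "are", "was": "were", "this": "these", "corner": "corners"}

    def agreement_error(tokens):
        """Replace a determiner, noun or verb with a form that breaks agreement."""
        candidates = [i for i, t in enumerate(tokens) if t in AGREEMENT_SWAPS]
        if not candidates:
            return None
        i = random.choice(candidates)
        return tokens[:i] + [AGREEMENT_SWAPS[tokens[i]]] + tokens[i + 1:]

    def real_word_error(tokens):
        """Substitute a token according to a real-word confusion list."""
        candidates = [i for i, t in enumerate(tokens) if t in REAL_WORD_CONFUSIONS]
        if not candidates:
            return None
        i = random.choice(candidates)
        return tokens[:i] + [REAL_WORD_CONFUSIONS[tokens[i]]] + tokens[i + 1:]

    def extra_word_error(tokens):
        """Duplicate a random token (one of the strategies named on Slide 13)."""
        i = random.randrange(len(tokens))
        return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]

    def missing_word_error(tokens):
        """Delete a random token; the real module weights the choice by POS."""
        i = random.randrange(len(tokens))
        return tokens[:i] + tokens[i + 1:]

    if __name__ == "__main__":
        sentence = "she could not comprehend".split()
        print(" ".join(real_word_error(sentence)))  # e.g. "she could no comprehend"

Applying one such operation to each selected BNC sentence yields the ungrammatical half of the artificial corpus described on the following slides.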
Slide 14: Talk Outline (same outline as Slide 2)

Slide 15: BNC Test Data (1)
• BNC: 6.4 M sentences
• 4.2 M sentences after removing speech, poems, captions and list items
• Randomisation into 10 sets of 420 K sentences each

Slide 16: BNC Test Data (2)
[Diagram: each of the 10 sets is run through the error creation modules to produce an error corpus for each error type: agreement, real-word, extra word, missing word.]

Slide 17: BNC Test Data (3)
[Diagram: a mixed error type set is assembled from ¼ of each of the four error types.]

Slide 18: BNC Test Data (4)
• 5 error types: agreement, real-word, extra word, missing word, mixed errors
• 50 sets (10 per error type)
• Each 50:50 ungrammatical:grammatical

Slide 19: BNC Test Data (5)
[Diagram: each cross-validation run holds out one set as test data and uses the rest as training data, if required by the method. Example: 1st cross-validation run for agreement errors.]

Slide 20: Evaluation Measures
• Accuracy on ungrammatical data:
  acc_ungram = (# correctly flagged as ungrammatical) / (# ungrammatical sentences)
• Accuracy on grammatical data:
  acc_gram = (# correctly classified as grammatical) / (# grammatical sentences)
• Independent of the error density of the test data

Slide 21: Accuracy Graph [figure]

Slide 22: Region of Improvement [figure]

Slide 23: Region of Degradation [figure]

Slide 24: Undecided [figure]

Slide 25: Talk Outline (same outline as Slide 2)

Slide 26: Overview of Methods
[Diagram: M1 and M2 are the basic methods, built on XLE output and POS n-gram information respectively; M3 and M4 are decision tree methods over the same two feature sources; M5 combines both.]

Slide 27: M1 – Method 1: Precision Grammar
• XLE English LFG
• Fragment rule
  – Parses ungrammatical input
  – Marked with *
• Zero number of parses
• Parser exceptions (time-out, memory)

Slide 28: M1 – XLE Parsing
[Diagram: the first 60 K sentences of each of the 50 sets are parsed with XLE, giving 50 × 60 K = 3 M parse results.]

Slide 29: M2 – Method 2: POS N-grams
• Flag rare POS n-grams as errors
• Rare: according to a reference corpus
• Parameters: n and frequency threshold
  – Tested n = 2, …, 7 on held-out data
  – Best: n = 5 and frequency threshold 4

Slide 30: M2 – POS N-gram Information
[Diagram: for each run, a reference n-gram table is built from 9 sets and the frequency of the rarest n-gram is recorded for each sentence of the held-out set; repeated for n = 2, 3, …, 7, giving 3 M frequency values.]
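The core of the n-gram method (Slides 29–30) is simple enough to sketch in a few lines of Python. The fragment below is an illustrative reconstruction, not the authors' code: it assumes sentences are already POS-tagged, uses the best parameters reported on Slide 29 (n = 5, frequency threshold 4), and leaves open exactly how the threshold comparison was made in the original system.

    from collections import Counter

    N = 5          # best n reported on Slide 29
    THRESHOLD = 4  # best frequency threshold reported on Slide 29

    def pos_ngrams(tags, n=N):
        """All POS n-grams of a sentence given as a list of POS tags."""
        return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

    def build_reference_table(tagged_corpus, n=N):
        """Count POS n-grams over a reference corpus (here: 9 of the 10 BNC sets)."""
        table = Counter()
        for tags in tagged_corpus:
            table.update(pos_ngrams(tags, n))
        return table

    def rarest_ngram_frequency(tags, table, n=N):
        """Frequency of the rarest n-gram in the sentence (0 if unseen)."""
        grams = pos_ngrams(tags, n)
        if not grams:
            return None  # sentence shorter than n tags
        return min(table[g] for g in grams)

    def flag_as_ungrammatical(tags, table, n=N, threshold=THRESHOLD):
        """M2-style decision: flag the sentence if its rarest n-gram is rare
        (whether the cut-off is < or <= the threshold is an assumption here)."""
        freq = rarest_ngram_frequency(tags, table, n)
        return freq is not None and freq < threshold

M4 generalises this idea by handing the rarest-n-gram frequencies for all n = 2, …, 7 to a decision tree as a six-number feature vector, instead of relying on a single hand-picked n and threshold.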
Slide 31: M3 – Method 3: Decision Trees on XLE Output
• Output statistics
  – Starredness (0 or 1) and parser exceptions (-1 = time-out, -2 = exceeded memory, …)
  – Number of optimal parses
  – Number of unoptimal parses
  – Duration of parsing
  – Number of subtrees
  – Number of words

Slide 32: M3 – Decision Tree Example
[Diagram: a decision tree splitting on the starredness and number-of-optimal-parses features; leaves are labelled U = ungrammatical, G = grammatical.]

Slide 33: M4 – Method 4: Decision Trees on N-grams
• Frequency of the rarest n-gram in the sentence
• n = 2, …, 7: a feature vector of 6 numbers

Slide 34: M4 – Decision Tree Example
[Diagram: a decision tree splitting on the 5-gram and 7-gram frequency features; leaves are labelled U/G.]

Slide 35: M5 – Method 5: Decision Trees on Combined Feature Sets
[Diagram: a decision tree combining the XLE features (starredness) with the n-gram frequency features (5-gram); leaves are labelled U/G.]

Slide 36: Talk Outline (same outline as Slide 2)

Slide 37: XLE Parsing of the BNC
• 600,000 grammatical sentences
• 2.4 M ungrammatical sentences
• Parse-testfile command
  – Parse-literally 1
  – Max xle scratch storage 1,000 MB
  – Time-out 60 seconds
  – No skimming

Slide 38: Efficiency
[Graph: parsing efficiency on 10,000 grammatical BNC sentences, with the time-out marked.]

Slide 39: XLE Parse Results and Method 1

                      Gramm.   Agree.   Real-w.   Extra-w.   Missing
    Covered           67.1%    35.4%    42.7%     40.3%      52.2%
    Fragments         29.7%    58.3%    53.8%     56.4%      44.6%
    No parse           0.3%     0.4%     0.3%      0.3%       0.3%
    Time-out           0.6%     1.1%     0.7%      0.6%       0.6%
    Out-of-memory      2.3%     4.8%     2.6%      2.4%       2.2%
    Crash (absolute)      2        3        4         3          2
    Accuracy M1       67.1%    64.6%    57.3%     59.7%      47.8%

Slide 40: XLE Coverage
[Figure: coverage over the 5 × 600 K test data sets.]

Slide 41: Applying Decision Tree to XLE
[Figure: M1 compared with M3.]

Slide 42: Overall Accuracy for M1 and M3
[Graph: overall accuracy (0.50–0.75) against the error density of the test data (0%–100%) for M1 and M3.]

Slides 43–44: Varying Training Error Density
[Figures: M3 trained at error densities of 20%, 25%, 33%, 40%, 50%, 60%, 67% and 75%, compared with M1.]

Slides 45–46: Varying Training Error Density
[Figures: close-up of M1 (XLE) and M3 (with decision tree) trained at 40%, 43% and 50% error density.]

Slide 47: N-Grams and DT (M2 vs M4)
[Figure: M2 (n-gram) compared with M4 (decision tree) trained at 25%, 50%, 67% and 75% error density.]

Slide 48: Methods 1 to 4
[Figure: M1 (XLE), M2 (n-gram), M3 at 43% and 50%, and M4 at 50%.]

Slide 49: Combined Method (M5)
[Figure: M5 trained at error densities of 10%, 20%, 25%, 50%, 67%, 75%, 80% and 90%.]

Slide 50: All Methods
[Figure: M1 (XLE), M2 (n-gram), M3 (43% and 50%), M4 (50%), and the combined M5 (45% and 50%).]

Slide 51: Breakdown by Error Type
[Figure: accuracy of M1 and of M5 (trained at 45% and 50%) broken down by error type: agreement (ag.), real-word (r.w.), extra word (e.w.) and missing word (m.w.).]

Slide 52: Breakdown by Error Type
[Figure: as Slide 51, with M3 (43% and 50%) added.]

Slide 53: Talk Outline (same outline as Slide 2)

Slide 54: Conclusions
• Basic methods surprisingly close to each other
• Decision tree
  – Effective with the deep approach
  – Small and noisy improvement with the shallow approach
• Combined approach best on all but one error type

Slide 55: Future Work
• Error types:
  – Word order
  – Multiple errors per sentence
• Add more features
• Other languages
• Test on MT output
• Establish an upper bound

Slide 56: References
• E. Atwell: How to detect grammatical errors in a text without parsing it. In Proceedings of the 3rd EACL, pp. 38–45, 1987.
• M. Butt, H. Dyvik, T. H. King, H. Masuichi, and C. Rohrer: The Parallel Grammar Project. In Proceedings of COLING 2002.
• J. Foster: Good Reasons for Noting Bad Grammar: Empirical Investigations into the Parsing of Ungrammatical Written English. Ph.D. thesis, University of Dublin, Trinity College, 2005.
• J. Wagner, J. Foster and J. van Genabith: A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors. In Proceedings of EMNLP-CoNLL 2007.
• I. H. Witten and E. Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 2000.

Slide 57: Thank You!
Djamé Seddah (La Sorbonne University)
National Centre for Language Technology, School of Computing, Dublin City University

Slide 58: Why not use F-Score?
• Precision and F-Score
  – Depend on the error density of the test data
  – What are true positives?
  – Weighting parameter of the F-score
• Recall and 1 − fallout
  – Accuracy on ungrammatical data
  – Accuracy on grammatical data
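Slide 58's argument, that precision and F-score change with the error density while the two accuracies do not, is easy to verify numerically. The sketch below is only an illustration: the confusion-matrix counts are invented, and "positive" is taken to mean "flagged as ungrammatical", following the tp = correctly flagged convention of Slide 60.

    def measures(tp, fp, fn, tn):
        """Per-class accuracies plus precision and F1, with
        'positive' = flagged as ungrammatical (tp = correctly flagged)."""
        acc_ungram = tp / (tp + fn)   # recall: accuracy on ungrammatical data
        acc_gram = tn / (tn + fp)     # 1 - fallout: accuracy on grammatical data
        precision = tp / (tp + fp)
        f1 = 2 * precision * acc_ungram / (precision + acc_ungram)
        return acc_ungram, acc_gram, precision, f1

    # The same classifier behaviour (65% of ungrammatical sentences flagged,
    # 80% of grammatical sentences passed) at two different error densities.
    print(measures(tp=650, fp=200, fn=350, tn=800))   # 50% error density
    print(measures(tp=65,  fp=360, fn=35,  tn=1440))  # 10% error density

    # acc_ungram (0.65) and acc_gram (0.80) are identical in both runs, but
    # precision falls from about 0.76 to about 0.15, pulling the F-score down.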
Slide 59: Results: F-Score
[Bar chart: F-scores (axis 0.0–0.8) of the XLE, n-gram, XLE+DT, n-gram+DT and combined methods on agreement, real-word, extra word, missing word and mixed errors.]

Slide 60: F-Score (tp = correctly flagged)
[Graph: F-score (0.0–1.0) against the error density of the test data (0%–100%) for a baseline, M1 and M3.]

Slide 61: POS n-grams and Agreement Errors
• n = 2, 3, 4, 5
• XLE parser F-score: 65%
• Best accuracy: 55%
• Best F-score: 66%

Slide 62: POS n-grams and Context-Sensitive Spelling Errors
• n = 2, 3, 4, 5
• XLE: 60%
• Best accuracy: 66%
• Best F-score: 69%

Slide 63: POS n-grams and Extra Word Errors
• n = 2, 3, 4, 5
• XLE: 62%
• Best accuracy: 68%
• Best F-score: 70%

Slide 64: POS n-grams and Missing Word Errors
• n = 2, 3, 4, 5
• XLE: 53%
• Best accuracy: 59%
• Best F-score: 67%

Slide 65: Inverting Decisions [figure]

Slide 66: Why Judge Grammaticality? (2)
• Automatic essay grading
• Trigger deep error analysis
  – Increase speed
  – Reduce overflagging
• Most approaches easily extend to
  – Locating errors
  – Classifying errors

Slide 67: Grammar Checker Research
• Focus of grammar checker research
  – Locate errors
  – Categorise errors
  – Propose corrections
  – Other feedback (CALL)
• Approaches:
  – Extend existing grammars
  – Write new grammars

Slide 68: N-gram Methods
• Flag unlikely or rare sequences
  – POS (different tag sets)
  – Tokens
  – Raw frequency vs. mutual information
• Most publications are in the area of context-sensitive spelling correction
  – Real-word errors
  – Resulting sentence can be grammatical

Slide 69: Test Corpus – Example
• Missing word error
  She didn’t want to face him
  She didn’t to face him

Slide 70: Test Corpus – Example 2
• Context-sensitive spelling error
  I love them both
  I love then both

Slide 71: Cross-validation
• Standard deviation below 0.006
• Except Method 4: 0.026
• High number of test items
• Report average percentage

Slide 72: Example Run (Method 1, agreement errors)

    Run   F-Score
    1     0.654
    2     0.655
    3     0.655
    4     0.655
    5     0.653
    6     0.652
    7     0.653
    8     0.657
    9     0.654
    10    0.653

Average F-score: 65.4%; standard deviation: 0.001
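As a small closing check, the figures reported on Slides 71–72 can be reproduced directly from the ten per-run F-scores. This sketch only illustrates the reporting step; the scores themselves come from Slide 72, and whether the authors used the population or the sample standard deviation is not stated.

    from statistics import mean, pstdev

    # Per-run F-scores for Method 1 on agreement errors (Slide 72).
    f_scores = [0.654, 0.655, 0.655, 0.655, 0.653,
                0.652, 0.653, 0.657, 0.654, 0.653]

    average = mean(f_scores)   # 0.6541 -> reported as 65.4 %
    stdev = pstdev(f_scores)   # about 0.0014 -> reported as 0.001

    print(f"average F-score: {average:.1%}, standard deviation: {stdev:.3f}")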