Stagger: A modern POS tagger for Swedish

Robert Östling
Department of Linguistics
Stockholm University
SE-106 91 Stockholm
[email protected]

Abstract

The field of Part of Speech (POS) tagging has made slow but steady progress during the last decade, though many of the new methods developed have not previously been applied to Swedish. I present a new system, based on the Averaged Perceptron algorithm and semi-supervised learning, that is more accurate than previous Swedish POS taggers. Furthermore, a new version of the Stockholm-Umeå Corpus is presented, whose more consistent annotation leads to significantly lower error rates for the POS tagger. Finally, a new, freely available annotated corpus of Swedish blog posts is presented and used to evaluate the tagger’s accuracy on this increasingly important genre. Details of the evaluation are presented throughout, to ensure easy comparison with future results.

1. Introduction

The task of syntactic disambiguation of natural language, frequently referred to as part of speech (POS) tagging, aims to annotate each word token in a text with its part of speech and (often) its morphological features. I have implemented a new, freely available POS tagging system for Swedish, named Stagger,¹ and used it to evaluate recently developed tagging algorithms on Swedish, as well as the effects of improved corpus annotation on POS tagging accuracy.

2. Data

Two corpora were used to evaluate the accuracy of the POS tagger: an updated version of the Stockholm-Umeå Corpus, and a new corpus of Swedish blog texts.

2.1 Stockholm-Umeå Corpus

The Stockholm-Umeå Corpus (SUC) is a balanced and POS-annotated corpus of about one million words of Swedish text, which was originally developed at the universities of Stockholm and Umeå during the 1990s. Its most recent release (Gustafson-Capková and Hartmann, 2008) has become a de-facto standard for Swedish POS tagging research.
Due to the size of the corpus, multiple annotators have been used, and annotation (in)consistency is an issue. Källgren (1996) explored tagging errors in an earlier version of the corpus, and found that 1.2% of the words sampled contained POS annotation errors. Forsbom and Wilhelmsson (2010) corrected over 1500 errors in SUC 2.0, mostly in common, polysemous grammatical words, and found that this resulted in a small but significant improvement in POS tagger accuracy. We have incorporated the changes of Forsbom and Wilhelmsson (2010), as well as over 2500 other changes to the annotation, into version 3.0 of SUC.²

¹ http://www.ling.su.se/stagger
² More information and instructions for obtaining the corpus can be found at: http://www.ling.su.se/suc

2.2 Swedish blog texts

The language in so-called user-generated content, written by non-professionals in, for instance, blog posts or online forum posts, may differ considerably from traditional written language and poses a challenge to many Natural Language Processing applications, including POS tagging (Giesbrecht and Evert, 2009).

In order to evaluate the current POS tagger on user-generated content in Swedish, a small corpus (8 174 tokens) of blog texts was compiled and manually annotated with SUC-compatible POS tags and named entities. The corpus is freely available for download from the Stockholm University website.³

³ http://www.ling.su.se/sic

2.3 Unannotated data

For semi-supervised training, Collobert and Weston (2008) embeddings were induced from a corpus of about two billion tokens of Swedish blog texts.

2.4 Lexicon

In addition to the vocabulary in the training data, the SALDO lexicon of Swedish morphology (Borin and Forsberg, 2009) is used as a POS tag lexicon. For known words, only POS tags occurring with the word in either the training data or the SALDO lexicon are considered. For unknown words, all POS tags that occur with a token of the same type (e.g. number, emoticon or letter sequence) are considered.

3. Method

The tagger uses a feature-rich model, based on the averaged perceptron tagger of Collins (2002). A basic feature set similar to Collins’ is used. Details are omitted due to space limitations, but are documented in the software package.

In addition, 48-dimensional Collobert and Weston (2008) embeddings (C&W) were used as features in one tagger configuration. Each word can then be represented by a 48-dimensional vector, reflecting distributional (and indirectly syntactic and semantic) properties of the word.

Table 1: POS tagging accuracy in percent, with figures in bold significantly better than the others in the same column.

Configuration   SUC2    SUC3    Test3   Blogs
–               95.86   96.04   96.58   91.72
SALDO           96.32   96.52   96.94   92.45
SALDO+C&W       96.40   96.57   96.94   92.10

4. Results

Using SUC, two evaluations were performed: 10-fold cross-validation, and another using the training/development/test split in SUC 3.0. In the cross-validation, the 500 files were sorted alphanumerically and numbered from 0 to 499. Fold i (0..9) uses files k where k ≡ i (mod 10) for testing, and the remaining 450 files for training. A held-out set of 10 files in the training set is used to determine the number of training iterations. In the evaluations without cross-validation, the development set (2%) of SUC 3.0 was used for this purpose, and the test set (2%) was used for estimating the final accuracy.

Table 1 shows the results of the evaluation. SUC2 and SUC3 are cross-validations using SUC 2.0 and 3.0, respectively. Test3 uses the training/development/test sets of SUC 3.0, and Blogs uses the training and development sets of SUC 3.0 for training, and the annotated blog corpus for testing.

Forsbom and Wilhelmsson (2010) found that a subset of the changes to SUC explored in this work led to a significant reduction in errors made by a POS tagger, and as expected this effect was larger after more errors were corrected.
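The fold assignment used in the cross-validation can be illustrated with a minimal Python sketch. It is not taken from the Stagger package: the function name and the file names are hypothetical, and Python's default string sort is assumed to stand in for the alphanumeric ordering of the SUC files. The 10 held-out training files used to choose the number of iterations are not modelled, since the text does not say which files they are.

```python
def cross_validation_folds(files, n_folds=10):
    """Yield (fold, train, test): file number k is tested in fold k mod n_folds."""
    # Assumption: plain lexicographic sorting approximates the
    # "alphanumeric" ordering of the corpus files.
    ordered = sorted(files)
    for fold in range(n_folds):
        test = [f for k, f in enumerate(ordered) if k % n_folds == fold]
        train = [f for k, f in enumerate(ordered) if k % n_folds != fold]
        yield fold, train, test


# Hypothetical file names standing in for the 500 corpus files.
files = [f"file{k:03d}" for k in range(500)]
folds = list(cross_validation_folds(files))

for fold, train, test in folds:
    print(fold, len(train), len(test))
```

With 500 files and 10 folds, every fold gets 50 test files and 450 training files, and each file is used for testing exactly once.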
The evaluation also demonstrated the importance of a good lexicon: the SALDO lexicon of Swedish morphology contributed substantially to tagging accuracy. Finally, Collobert & Weston embeddings were shown to improve tagging accuracy by a small but significant⁴ amount in the cross-validation, similar to what Turian et al. (2010) showed for other NLP tasks. Surprisingly, given that the embeddings were computed from an unannotated blog corpus, the accuracy on the annotated blog corpus is instead significantly lower with C&W embeddings. However, since only three authors are represented in the blog corpus, it would be risky to draw overly general conclusions from this result.

⁴ McNemar’s test with p < 0.05 is used throughout to test for statistical significance.

5. Related work

Sjöbergh (2003) evaluated seven different POS tagging systems for Swedish through ten-fold cross-validation on SUC 2.0, where accuracies ranged between 93.8% and 96.0% for single systems (Carlberger and Kann, 1999, being the best), and a voting combination of all taggers reached 96.7%. However, since the details of his evaluation were not published, and he used a larger training data set in each fold (95%) than the present study (90%), our respective accuracy figures are not directly comparable.

6. Acknowledgments

Thanks to those who found and corrected the thousands of annotation errors in SUC 2.0 that have been fixed in SUC 3.0: Britt Hartmann, Kenneth Wilhelmsson, Eva Forsbom and the Swedish Treebank project. Further thanks to the two anonymous reviewers, who provided useful comments.

7. References

Lars Borin and Markus Forsberg. 2009. All in the family: A comparison of SALDO and WordNet. In Proceedings of the Nodalida 2009 Workshop on WordNets and other Lexical Semantic Resources – between Lexical Semantics, Lexicography, Terminology and Formal Ontologies, Odense.

Johan Carlberger and Viggo Kann. 1999. Implementing an efficient part-of-speech tagger. Software–Practice and Experience, 29:815–832.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP 2002, pages 1–8.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 160–167, New York, NY, USA. ACM.

Eva Forsbom and Kenneth Wilhelmsson. 2010. Revision of part-of-speech tagging in Stockholm Umeå Corpus 2.0. In Proceedings of SLTC 2010.

Eugenie Giesbrecht and Stefan Evert. 2009. Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German Web as Corpus. In Proceedings of the Fifth Web as Corpus Workshop (WAC5).

Sofia Gustafson-Capková and Britt Hartmann. 2008. Manual of the Stockholm Umeå Corpus version 2.0. Stockholm University.

Gunnel Källgren. 1996. Linguistic indeterminacy as a source of errors in tagging. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, COLING ’96, pages 676–680, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jonas Sjöbergh. 2003. Combining POS-taggers for improved accuracy on Swedish text. In Proceedings of NODALIDA.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 384–394, Stroudsburg, PA, USA. Association for Computational Linguistics.