Stagger: A modern POS tagger for Swedish
Robert Östling
Department of Linguistics
Stockholm University
SE-106 91 Stockholm
[email protected]
Abstract
The field of Part of Speech (POS) tagging has made slow but steady progress during the last decade, though many of the new
methods developed have not previously been applied to Swedish. I present a new system, based on the Averaged Perceptron
algorithm and semi-supervised learning, that is more accurate than previous Swedish POS taggers. Furthermore, a new version
of the Stockholm-Umeå Corpus is presented, whose more consistent annotation leads to significantly lower error rates for the
POS tagger. Finally, a new, freely available annotated corpus of Swedish blog posts is presented and used to evaluate the tagger’s
accuracy on this increasingly important genre. Details of the evaluation are presented throughout, to ensure easy comparison
with future results.
1. Introduction
The task of syntactic disambiguation of natural language,
frequently referred to as part of speech (POS) tagging, aims
to annotate each word token in a text with its part of speech
and (often) its morphological features.
I have implemented a new, freely available POS tagging
system for Swedish, named Stagger,1 and used it to evaluate recently developed tagging algorithms on Swedish, as
well as the effects of improved corpus annotation on POS
tagging accuracy.
2. Data
Two corpora were used to evaluate the accuracy of the POS
tagger: an updated version of the Stockholm-Umeå Corpus,
and a new corpus of Swedish blog texts.
2.1 Stockholm-Umeå Corpus
The Stockholm-Umeå Corpus (SUC) is a balanced and
POS-annotated corpus of about one million words of
Swedish text, which was originally developed at the universities of Stockholm and Umeå during the 1990s. Its most
recent release (Gustafson-Capková and Hartmann, 2008)
has become a de-facto standard for Swedish POS tagging
research.
Due to the size of the corpus, multiple annotators have
been used, and annotation (in)consistency is an issue. Källgren (1996) explored tagging errors in an earlier version of
the corpus, and found that 1.2% of the words sampled contained POS annotation errors. Forsbom and Wilhelmsson
(2010) corrected over 1500 errors in SUC 2.0, mostly in
common, polysemous grammatical words, and found that this resulted in a small but significant improvement in POS
tagger accuracy.
We have incorporated the changes of Forsbom and Wilhelmsson (2010), as well as over 2500 other changes to the annotation, into version 3.0 of SUC.2

1 http://www.ling.su.se/stagger
2 More information and instructions for obtaining the corpus can be found at: http://www.ling.su.se/suc
2.2 Swedish blog texts
The language in so-called user-generated content, written by non-professionals in, for instance, blog posts or online forum posts, may differ considerably from traditional
written language and poses a challenge to many Natural
Language Processing applications, including POS tagging
(Giesbrecht and Evert, 2009).
In order to evaluate the current POS tagger on user-generated content in Swedish, a small corpus (8 174 tokens) of blog texts was compiled and manually annotated
with SUC-compatible POS tags and named entities. The
corpus is freely available for download from the Stockholm
University website.3
2.3 Unannotated data
For semi-supervised training, Collobert and Weston (2008)
embeddings were induced from a corpus of about two billion tokens of Swedish blog texts.
2.4 Lexicon
In addition to the vocabulary in the training data, the
SALDO lexicon of Swedish morphology (Borin and Forsberg, 2009) is used as a POS tag lexicon. For known words,
only POS tags occurring with the word in either the training
data or the SALDO lexicon are considered. For unknown
words, all POS tags that occur with a token of the same type
(e.g. number, emoticon or letter sequence) are considered.
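The tag-candidate filtering described above can be sketched as follows. The dictionary structures and the token-type heuristics are hypothetical stand-ins for illustration, not Stagger's actual data structures; the tags RG and MAD/MID/PAD are the SUC tags for cardinal numbers and delimiters.

```python
def candidate_tags(word, training_tags, saldo_tags, open_class_tags):
    """Return the set of POS tags considered for `word`.

    training_tags / saldo_tags: dicts mapping word forms to tag sets
    (hypothetical stand-ins for the training-data and SALDO lookups).
    open_class_tags: fallback set for unknown alphabetic words.
    """
    seen = training_tags.get(word, set()) | saldo_tags.get(word, set())
    if seen:
        return seen  # known word: only tags observed in data or SALDO
    return token_type_tags(word, open_class_tags)

def token_type_tags(word, open_class_tags):
    """Crude token-type heuristics for unknown words (illustrative only)."""
    if word.replace(".", "").replace(",", "").isdigit():
        return {"RG"}                 # SUC tag for cardinal numbers
    if all(not c.isalnum() for c in word):
        return {"MAD", "MID", "PAD"}  # SUC delimiter tags
    return open_class_tags
```

A known word such as one seen in training data is thus restricted to its observed tags, while an unseen number or punctuation token falls into its type-specific tag set.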
3. Method
The tagger uses a feature-rich model, based on the averaged
perceptron tagger of Collins (2002).
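As a rough illustration of averaged perceptron training (a minimal, greedy sketch under simplified assumptions; Stagger's actual feature set and decoding are richer, and all names here are illustrative):

```python
from collections import defaultdict

def train_avg_perceptron(sentences, tagset, features, epochs=10):
    """Train a greedy averaged perceptron tagger.

    sentences: list of (words, gold_tags) pairs.
    features(words, i, tag): feature strings for tagging position i as tag.
    Returns the averaged feature weights.
    """
    w = defaultdict(float)       # current weights
    totals = defaultdict(float)  # running sums for averaging
    last = defaultdict(int)      # step at which each weight last changed
    step = 0

    def bump(f, delta):
        # lazily bring the running sum up to date, then update the weight
        totals[f] += (step - last[f]) * w[f]
        last[f] = step
        w[f] += delta

    for _ in range(epochs):
        for words, gold in sentences:
            for i, g in enumerate(gold):
                step += 1
                pred = max(tagset,
                           key=lambda t: sum(w[f] for f in features(words, i, t)))
                if pred != g:  # perceptron update on errors only
                    for f in features(words, i, g):
                        bump(f, +1.0)
                    for f in features(words, i, pred):
                        bump(f, -1.0)
    # finish the averages over all update steps
    return {f: (totals[f] + (step - last[f]) * w[f]) / step for f in w}
```

Averaging the weights over all updates, rather than keeping only the final weights, is what distinguishes Collins' averaged perceptron from the plain perceptron and reduces overfitting to the last training examples.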
A basic feature set similar to Collins’ is used. Details are
omitted due to space limitations, but are documented in the
software package. In addition, 48-dimensional Collobert
and Weston (2008) embeddings (C&W) were used as features in one tagger configuration. Each word can then be represented by a 48-dimensional vector, reflecting distributional (and indirectly syntactic and semantic) properties of the word.

3 http://www.ling.su.se/sic

Table 1: POS tagging accuracy in percent, with figures in
bold significantly better than the others in the same column.

Configuration   SUC2    SUC3    Test3   Blogs
–               95.86   96.04   96.58   91.72
SALDO           96.32   96.52   96.94   92.45
SALDO+C&W       96.40   96.57   96.94   92.10
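One simple way to expose such a word vector to the tagger is one real-valued feature per embedding dimension. This is a sketch under assumed structures (the lookup table and feature naming are hypothetical, not Stagger's actual encoding, which is documented in the software package):

```python
def embedding_features(word, embeddings, dim=48):
    """Map a word to real-valued features from its C&W-style embedding.

    embeddings: dict from word form to a length-`dim` list of floats;
    out-of-vocabulary words fall back to the zero vector.
    """
    vec = embeddings.get(word.lower(), [0.0] * dim)
    # one real-valued feature per embedding dimension
    return {f"emb[{j}]": v for j, v in enumerate(vec)}
```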
4. Results
Using SUC, two evaluations were performed: 10-fold cross-validation, and one using the training/development/test
split in SUC 3.0. In the cross-validation, the 500 files were
sorted alphanumerically and numbered from 0 to 499. Fold
i (0..9) uses files k where k ≡ i (mod 10) for testing,
and the remaining 450 files for training. A held-out set of
10 files in the training set is used to determine the number
of training iterations.
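The fold assignment above can be reproduced directly (file indices only; the alphanumeric sorting and reading of the actual SUC files are assumed to happen elsewhere):

```python
def fold_split(fold, n_files=500, n_folds=10):
    """Fold assignment as described: file k is test data in fold k mod 10."""
    test = [k for k in range(n_files) if k % n_folds == fold]
    train = [k for k in range(n_files) if k % n_folds != fold]
    return train, test
```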
In the evaluations without cross-validation, the development set (2%) of SUC 3.0 was used for this purpose, and the test set (2%) was used for estimating the final accuracy.
Table 1 shows the results of the evaluation. SUC2 and
SUC3 are cross-validations using SUC 2.0 and 3.0, respectively. Test3 uses the training/development/test sets of SUC
3.0, and Blogs uses the training and development sets of
SUC 3.0 for training, and the annotated blog corpus for
testing.
Forsbom and Wilhelmsson (2010) found that a subset of
the changes to SUC explored in this work led to a significant reduction in errors made by a POS tagger, and as
expected this effect was larger after more errors were corrected.
The evaluation also demonstrated the importance of using a good lexicon: the SALDO lexicon of Swedish morphology contributed substantially to tagging accuracy.
Finally, Collobert & Weston embeddings were shown
to improve tagging accuracy by a small but significant4
amount in the cross-validations, similar to what Turian et al.
(2010) showed for other NLP tasks. Surprisingly, given that the embeddings were computed from an unannotated blog corpus, the accuracy on the annotated blog corpus is instead significantly lower with C&W embeddings.
However, since there are only three authors represented in
the blog corpus, it would be risky to draw too general conclusions on the basis of this result.
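Footnote 4 states that McNemar's test with p < 0.05 is used throughout. A minimal sketch of the exact (binomial) form of the test, applied to the counts of tokens that exactly one of two taggers tags correctly, follows; the paper does not specify which variant of the test was used, so this is one plausible reading:

```python
import math

def mcnemar_p(b, c):
    """Exact two-sided McNemar test on discordant counts.

    b: tokens tagger A tags correctly and tagger B does not;
    c: tokens tagger B tags correctly and tagger A does not.
    Under the null hypothesis, b ~ Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # tail probability of the more extreme discordant count
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided p-value, capped at 1
```

Only the discordant tokens enter the test; tokens both taggers get right (or both get wrong) carry no information about which tagger is better.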
5. Related work
Sjöbergh (2003) evaluated seven different POS tagging systems for Swedish through ten-fold cross-validation on SUC
2.0, where accuracies ranged between 93.8% and 96.0% for
single systems (Carlberger and Kann, 1999, being the best),
and a voting combination of all taggers reached 96.7%.
However, since the details of his evaluation were not published, and he used a larger training data set in each fold (95%) than the present study (90%), our respective accuracy figures are not directly comparable.

4 McNemar's test with p < 0.05 is used throughout to test for statistical significance.
6. Acknowledgments
Thanks to those who found and corrected the thousands of
annotation errors in SUC 2.0 that have been fixed in SUC
3.0: Britt Hartmann, Kenneth Wilhelmson, Eva Forsbom
and the Swedish Treebank project. Further thanks to the
two anonymous reviewers, who provided useful comments.
7. References
Lars Borin and Markus Forsberg. 2009. All in the family:
A comparison of SALDO and WordNet. In Proceedings
of the Nodalida 2009 Workshop on WordNets and other
Lexical Semantic Resources – between Lexical Semantics, Lexicography, Terminology and Formal Ontologies,
Odense.
Johan Carlberger and Viggo Kann. 1999. Implementing an
efficient part-of-speech tagger. Software–Practice and
Experience, 29:815–832.
Michael Collins. 2002. Discriminative training methods
for hidden Markov models: Theory and experiments with
perceptron algorithms. In Proceedings of EMNLP 2002,
pages 1–8.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of
the 25th international conference on Machine learning,
ICML ’08, pages 160–167, New York, NY, USA. ACM.
Eva Forsbom and Kenneth Wilhelmsson. 2010. Revision
of part-of-speech tagging in Stockholm Umeå Corpus 2.0.
In Proceedings of SLTC 2010.
Eugenie Giesbrecht and Stefan Evert. 2009. Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German Web as Corpus. In Proceedings of the
Fifth Web as Corpus Workshop (WAC5).
Sofia Gustafson-Capková and Britt Hartmann. 2008. Manual of the Stockholm Umeå Corpus version 2.0. Stockholm University.
Gunnel Källgren. 1996. Linguistic indeterminacy as a
source of errors in tagging. In Proceedings of the 16th
conference on Computational linguistics - Volume 2,
COLING ’96, pages 676–680, Stroudsburg, PA, USA.
Association for Computational Linguistics.
Jonas Sjöbergh. 2003. Combining POS-taggers for improved accuracy on Swedish text. In Proceedings of
NODALIDA.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010.
Word representations: a simple and general method for
semi-supervised learning. In Proceedings of the 48th
Annual Meeting of the Association for Computational
Linguistics, ACL ’10, pages 384–394, Stroudsburg, PA,
USA. Association for Computational Linguistics.