Efficient Training Data Enrichment and Unknown Token Handling for POS Tagging of Non-standardized Texts
- In this work, we consider Part-of-Speech tagging of social media text as a fundamental task in Natural Language Processing. We present improvements to a Markov model tagger for social media text by adapting the parameter estimation methods for unknown tokens. In addition, we propose enriching the social media text corpus through a linear combination with a newspaper training corpus. Applying our tagger to a social media text corpus yields accuracies of around 94.8%, which is close to the accuracies achieved for standardized texts.
Author: | Melanie Neunerdt, Michael Reyer, Rudolf Mathar |
---|---|
URN: | https://nbn-resolving.org/urn:nbn:de:gbv:hil2-opus-2818 |
Parent Title (English): | Proceedings of the 12th edition of the KONVENS conference |
Document Type: | Conference Proceeding |
Language: | English |
Date of Publication (online): | 2014/10/23 |
Release Date: | 2014/10/23 |
Tag: | NLP for user-generated content |
GND Keyword: | Computerlinguistik |
First Page: | 186 |
Last Page: | 192 |
Contributor: | Faaß, Gertrud |
Institutes: | Fachbereich III / Informationswissenschaft und Sprachtechnologie |
DDC classes: | 400 Language / 400 Language, Linguistics |
Collections: | KONVENS 2014 / Proceedings of the 12th KONVENS 2014 |