Efficient Training Data Enrichment and Unknown Token Handling for POS Tagging of Non-standardized Texts

  • This work addresses Part-of-Speech (POS) tagging of social media text, a fundamental task in Natural Language Processing. We improve a Markov model tagger for social media text by adapting its parameter estimation methods for unknown tokens. In addition, we propose enriching the social media text corpus through a linear combination with a newspaper training corpus. Applied to a social media text corpus, our tagger achieves accuracies of around 94.8%, which comes close to the accuracies for standardized texts.

Author: Melanie Neunerdt, Michael Reyer, Rudolf Mathar
Parent Title (English): Proceedings of the 12th edition of the KONVENS conference
Document Type: Conference Proceeding
Date of Publication (online): 2014/10/23
Release Date: 2014/10/23
Tag: NLP for user-generated content
GND Keyword: Computerlinguistik
First Page: 186
Last Page: 192
Contributor: Faaß, Gertrud
Institutes: Department III / Information Science and Language Technology
DDC classes: 400 Language / 400 Language, Linguistics
Collections: KONVENS 2014 / Proceedings of the 12th KONVENS 2014
Licence (German): Creative Commons Attribution 3.0