Alpes4science project: SMS corpus processing and tokenization problems
- Virtual textual communication relies on digital media as carrier and mediator. SMS language is part of this type of communication and has its own particularities. An SMS text is characterized by unpredictable use of white space and special characters and by a lack of any writing standards, while at the same time staying close to spoken language. This paper presents the database of the alpes4science project, from the collection to the processing of the SMS corpus. We then present some of the most common SMS tokenization problems and work related to SMS normalization.
Author: | Eleni Kogkitsidou, Georges Antoniadis |
---|---|
URN: | https://nbn-resolving.org/urn:nbn:de:gbv:hil2-opus-2971 |
Parent Title (English): | Workshop proceedings of the 12th edition of the KONVENS conference |
Document Type: | Conference Proceeding |
Language: | English |
Date of Publication (online): | 2014/11/25 |
Release Date: | 2014/11/25 |
Tag: | cmc; corpus linguistics |
GND Keyword: | Computerunterstützte Kommunikation; Korpus <Linguistik> |
First Page: | 62 |
Last Page: | 66 |
Institutes: | Fachbereich III / Informationswissenschaft und Sprachtechnologie |
DDC classes: | 400 Language / 400 Language, Linguistics |
Collections: | KONVENS 2014 / Workshop Proceedings of the 12th KONVENS 2014 |