Alpes4science project : SMS corpus processing and tokenization problems

  • Virtual textual communication relies on digital media as carrier and mediator. SMS language is part of this type of communication and has its own particularities. An SMS text is characterized by unpredictable use of whitespace and special characters and by the absence of any writing standard, while staying close to spoken language. This paper presents the database of the alpes4science project, from the collection to the processing of the SMS corpus. We then present some of the most common SMS tokenization problems and work related to SMS normalization.

Author:Eleni Kogkitsidou, Georges Antoniadis
Parent Title (English):Workshop proceedings of the 12th edition of the KONVENS conference
Document Type:Conference Proceeding
Date of Publication (online):2014/11/25
Release Date:2014/11/25
Tag:cmc; corpus linguistics
GND Keyword:Computerunterstützte Kommunikation; Korpus <Linguistik>
First Page:62
Last Page:66
Institutes:Fachbereich III / Informationswissenschaft und Sprachtechnologie
DDC classes:400 Language / 400 Language, Linguistics
Collections:KONVENS 2014 / Workshop Proceedings of the 12th KONVENS 2014
Licence:Creative Commons - Attribution 3.0