Volltext-Downloads (blau) und Frontdoor-Views (grau)

For a fistful of blogs: Discovery and comparative benchmarking of republishable German content

  • We introduce two corpora gathered on the web and related to computer-mediated communication: blog posts and blog comments. In order to build such corpora, we addressed following issues: website discovery and crawling, content extraction constraints, and text quality assessment. The blogs were manually classified as to their license and content type. Our results show that it is possible to find blogs in German under Creative Commons license, and that it is possible to perform text extraction and linguistic annotation efficiently enough to allow for a comparison with more traditional text types such as newspaper corpora and subtitles. The comparison gives insights on distributional properties of the processed web texts on token and type level. For example, quantitative analysis reveals that blog posts are close to written language, while comments are slightly closer to spoken language.

Download full text files

Export metadata

Additional Services

Share in Twitter    Search Google Scholar    frontdoor_oas
Author:Adrien Barbaresi, Kay-Michael Würzner
Parent Title (English):Workshop proceedings of the 12th edition of the KONVENS conference
Document Type:Conference Proceeding
Date of Publication (online):2014/11/25
Release Date:2014/11/25
Tag:Korpuslinguistik; Webkorpora
cmc; corpus linguistics; web corpora
GND Keyword:Computerunterstützte Kommunikation; Korpus <Linguistik>
First Page:2
Last Page:10
PPN:Link zum Katalog
Institutes:Fachbereich III / Informationswissenschaft und Sprachtechnologie
DDC classes:400 Sprache / 400 Sprache, Linguistik
Collections:KONVENS 2014 / Workshop Proceedings of the 12th KONVENS 2014
Licence (German):License LogoCreative Commons - Namensnennung 3.0