Similarity Models in Distributional Semantics using Task Specific Information

  • In distributional semantics, unsupervised learning has been widely used for a large number of tasks, whereas supervised learning has received less coverage. In this dissertation, we investigate the supervised learning approach for semantic relatedness tasks in distributional semantics. The investigation mainly considers semantic similarity and semantic classification tasks. Existing and newly constructed datasets are used as input for the experiments. The new datasets are constructed from thesauri such as Eurovoc, a multilingual thesaurus maintained by the Publications Office of the European Union. The meaning of the words in the datasets is represented using a distributional semantic approach, which collects co-occurrence information from large texts and represents the words as high-dimensional vectors. English words are represented using the ukWaC corpus, and German words using the deWaC corpus. After each word has been represented as a high-dimensional vector, different supervised machine learning methods are applied to the selected tasks. The outputs of the supervised methods are evaluated by comparing their task performance and accuracy with the results of state-of-the-art unsupervised machine learning methods. In addition, multi-relational matrix factorization is introduced as a supervised learning method in distributional semantics. This dissertation shows that multi-relational matrix factorization is a good alternative method for integrating different sources of information about words in distributional semantics. The dissertation also introduces some new applications. One of them analyzes the text of a German company’s website and provides information about the company through a concept cloud visualization. The others are automatic recognition and disambiguation of Library of Congress Subject Headings and automatic identification of synonym relations in the Dutch Parliament thesaurus.
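As a rough illustration of the distributional representation described in the abstract, the following minimal sketch builds co-occurrence count vectors from a toy corpus and compares words by cosine similarity. It is not the pipeline used in the dissertation: the three-sentence corpus, the window size of 2, and the function names `cooccurrence_vectors` and `cosine` are assumptions made purely for this example, whereas the thesis works with large web corpora and supervised methods on top of such vectors.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[ctx] for ctx, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration only.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the car needs fuel".split(),
]
vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
```

In this toy setting, "cat" and "dog" share more context words than "cat" and "car", so `sim_cat_dog` comes out higher than `sim_cat_car`. Vectors of this kind are the sort of input that supervised methods like those discussed in the abstract would be trained on.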

Metadata
Author: Rosa Tsegaye Aga
URN: https://nbn-resolving.org/urn:nbn:de:gbv:hil2-opus4-9339
DOI: https://doi.org/10.25528/008
Publisher: Rosa Tsegaye Aga
Place of publication: Hildesheim
Advisor: Lars Schmidt-Thieme, Christian Wartena
Document Type: Doctoral Thesis
Language: English
Year of Completion: 2018
Granting Institution: Universität Hildesheim, Fachbereich IV
Date of final exam: 2019/04/10
Release Date: 2019/08/13
Tag: Distributional semantics, Supervised machine learning, Unsupervised machine learning, natural language processing, multi-relational matrix factorization
Number of pages: 146
Institutes: University
DDC classes: 000 Computer science, information and general works / 000 Generalities, science / 004 Computer science
000 Computer science, information and general works / 000 Generalities, science / 006 Special computer methods
400 Language / 410 Linguistics / 410 Linguistics
Licence (German): Creative Commons - Attribution