We present a network model for context-based retrieval that allows domain knowledge to be integrated into document retrieval. Based on the premise that the results provided by a network model employing spreading activation are equivalent to the results of a vector space model, we create a network representation of a document collection for retrieval. We extend this well-explored approach by blending it with techniques from knowledge representation. This leaves us with a network model for finding similarities in a document collection by content-based as well as knowledge-based similarities.
In this work we describe a "semantic personalization" web service for curriculum planning. Based on a semantic annotation of a set of courses, provided by the University of Hannover, reasoning about actions and change, in particular classical planning, is exploited for creating personalized curricula, i.e. for selecting and sequencing a set of courses which will allow a student to achieve her learning goal. The specific student's context is taken into account during the process: students with different initial knowledge will be suggested different solutions. The Curriculum Planning Service has been integrated as a new plug-and-play personalization service in the Personal Reader framework.
In this paper we present an interface for supporting a user in an interactive cross-language search process using semantic classes. In order to enable users to access multilingual information, different problems have to be solved: disambiguating and translating the query words, as well as categorizing and presenting the results appropriately. Therefore, we first give a brief introduction to word sense disambiguation, cross-language text retrieval and document categorization and finally describe recent achievements of our research towards an interactive multilingual retrieval system. We focus especially on the problem of browsing and navigation of the different word senses in one source and possibly several target languages. In the last part of the paper, we discuss the developed user interface and its functionalities in more detail.
An Evaluation of Text Retrieval Methods for Similarity Search of multi-dimensional NMR-Spectra
(2006)
Searching and mining nuclear magnetic resonance (NMR) spectra of naturally occurring substances is an important task in the investigation of new, potentially useful chemical compounds. Multi-dimensional NMR spectra are relational objects like documents, but consist of continuous multi-dimensional points called peaks instead of words. We develop several mappings from continuous NMR spectra to discrete text-like data. With the help of those mappings any text retrieval method can be applied. We evaluate the performance of two retrieval methods, namely the standard vector space model and probabilistic latent semantic indexing (PLSI). PLSI learns hidden topics in the data, which in the case of 2D-NMR data is interesting in its own right. Additionally, we develop and evaluate a simple direct similarity function, which can detect duplicates of NMR spectra. Our experiments show that the vector space model as well as PLSI, which are both designed for text data created by humans, can effectively handle the mapped NMR data originating from natural products. Additionally, PLSI is able to find meaningful "topics" in the NMR data.
Current document management systems (DMS) are designed to coordinate the collaborative creation and maintenance process of documents through the provision of a centralized repository. The focus is primarily on managing documents themselves. Relations between and within documents and effects of changes are largely neglected. To avoid inefficiencies, conflicts, and delays the support of modification management is indispensable. Here I present the design of the LOCUTOR system that aims to provide management of change functionality for arbitrary XML documents ranging from informal, e.g. instruction or construction manuals, to formal documents.
Background: Software product line engineering enables cost-effective and efficient development of product families with higher quality than single-system development. This is made possible by variability mechanisms, which allow products to be adapted closely to different customer needs. However, these variability mechanisms also increase complexity, since developers must consider how components interact across different product variants. New analysis methods and strategies have therefore been developed for the quality assurance of software product lines, among them variability-based code metrics. These metrics aim to avoid unnecessary complexity and to identify particularly error-prone code early, so that it can be subjected to additional quality measures. Our systematic literature study on this topic shows, however, that the benefit of these variability-based code metrics has been evaluated in only a few cases.
Objective: This thesis investigates to what extent variability-based code metrics can be used to improve the quality of software product lines. To this end, it examines whether empirical studies allow development guidelines to be derived for proactively avoiding complexity and the errors associated with it. The focus is on analysing whether the metrics under consideration can be used to identify potentially more error-prone code. This covers both the univariate use of individual metrics and the construction of prediction models with machine learning methods. It also examines whether the variability-based code metrics designed specifically for software product line engineering offer added value over established metrics from single-system development.
Method: An empirical study of 692 single-system and variability-based code metrics is carried out on the Linux kernel. First, it is analysed to what extent the measured metric values correlate with compilation errors and security vulnerabilities that were overlooked by the developers and were therefore discovered only after the commit or after the release, respectively. In addition, the metrics are grouped by the properties they measure, and four machine learning methods are used to identify error-prone code locations, in order to assess the usefulness of different code properties.
Results and conclusion: Although a significant relationship between the measured values and error-prone code locations can be shown for a large share of the metrics, univariate methods turn out to be unsuitable for practical use. Because of the strong class imbalance, with only 1.5% defective code functions (compilation errors) and 0.4% code functions with confirmed security vulnerabilities, a single metric achieves only F1 scores below 0.073. For lack of alternative implementations, and contrary to the claims of some publications, no development recommendations can be derived in this way either. By contrast, variability-based code metrics can be used successfully for defect prediction, provided that they also take the variability of connected artefacts into account.
For example, Random Forest achieves F1 scores of 0.667 (compilation errors) and 0.711 (security vulnerabilities).
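As a worked illustration of why such a strong class imbalance depresses the F1 score of weak predictors, the sketch below computes F1 for a trivial baseline that flags every function as defective. The baseline, the helper names `f1_score` and `f1_flag_everything`, and the sample size are assumptions made for illustration; they are not taken from the thesis.

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_flag_everything(defect_rate, total=100_000):
    """F1 of a hypothetical baseline that marks every function as defective.

    Recall is 1 by construction, so precision equals the defect rate.
    """
    tp = round(defect_rate * total)  # all defective functions are flagged
    fp = total - tp                  # so are all healthy ones
    return f1_score(tp, fp, fn=0)


# Defect rates quoted in the abstract: 1.5% and 0.4%.
for rate in (0.015, 0.004):
    print(f"defect rate {rate:.1%}: F1 = {f1_flag_everything(rate):.4f}")
```

With perfect recall, this baseline's F1 is 2r/(1+r) for a defect rate r, which stays close to 2r when r is small. A predictor has to raise precision far above the base rate before F1 moves noticeably.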
Geographic information retrieval (GeoIR) takes into account, for search queries (especially for web pages), not only the content of documents but also a spatial component, making it possible to search specifically for pages that are relevant to a particular region. To this end, GeoIR systems must be able to recognise the geographic context of a web page and to decide whether a page is region-specific ("local") at all or is purely informative with no geographic reference. In the following, approaches are presented for determining features of local pages and using them to classify web pages as global or local. Particular attention is paid to the linguistic and geographic characteristics of German web pages.
This paper takes the planned implementation of search engine technologies in the Fachportal Pädagogik as an occasion to examine the resulting new requirements for quality management. The central questions are which factors shape search situations and what conclusions follow for an evaluation design. A socio-technical view of the information retrieval (IR) system serves as the analytical framework.
Machine learning, statistics and knowledge engineering provide a broad variety of supervised learning algorithms for classification. In this paper we introduce the Automated Model Selection Framework (AMSF), which presents automatic and semi-automatic methods to select classifiers. To achieve this we split up the selection process into three distinct phases. Two of those select algorithms by static rules which are derived from a manually created knowledge base. At this stage of AMSF the user can choose between different rankers in the third phase. Currently, we use instance-based learning and a scoring scheme for ranking the classifiers. After evaluation of different rankers we will recommend the most successful to the user by default. Besides describing the architecture and design issues, we additionally point out the versatile ways AMSF is applied in a production process of the automotive industry.
The learners’ motivation has an impact on the quality of learning, especially in e-Learning environments. Most of these environments store data about the learner’s actions in log files. Logging the users’ interactions in educational systems makes it possible to track their actions at a refined level of detail. Data mining and machine learning techniques can “give meaning” to these data and provide valuable information for learning improvement. An area where improvement is absolutely necessary and of great importance is motivation, known to be an essential factor for preventing attrition in e-Learning. In this paper we investigate whether the analysis of log file data can be used to estimate the motivational level of the learner. A decision tree is built from a limited number of log files from a web-based learning environment. The results suggest that time spent reading is an important factor for predicting motivation; also, performance in tests was found to be a relevant indicator of the motivational level.
In this paper, we propose a case-based approach for characterizing and analyzing subgroup patterns: We present techniques for retrieving characteristic factors and cases, and merge these into prototypical cases for presentation to the user. In general, cases capture knowledge and concrete experiences of specific situations. By exploiting case-based knowledge for characterizing a subgroup pattern, we can provide additional information about the subgroup extension. We can then present the subgroup pattern in an alternative condensed form that characterizes the subgroup, and enables a convenient retrieval of interesting associated (meta-)information.
Agents in dynamic environments have to deal with complex situations including various temporal interrelations of actions and events. Discovering frequent patterns in such scenes can be useful for creating prediction rules which can be used to predict future activities or situations. We present the algorithm MiTemP, which learns frequent patterns based on a time interval-based relational representation. Additionally, the problem has been transferred to a pure relational association rule mining task which can be handled by WARMR. The two approaches are compared in a number of experiments. The experiments show the advantage of avoiding the creation of impossible or redundant patterns with MiTemP. While fewer patterns have to be explored on average with MiTemP, more frequent patterns are found at an earlier refinement level.
Recently, research projects such as PADLR and SWAP have developed tools like Edutella or Bibster, which are targeted at establishing peer-to-peer knowledge management systems. In such a system, it is necessary to obtain brief semantic descriptions of peers, so that routing algorithms or matchmaking processes can make decisions about which communities peers should belong to, or to which peers a given query should be forwarded. This paper provides a graph clustering technique on knowledge bases for that purpose. Using this clustering, we can show that our strategy requires up to 58% fewer queries than the baselines to yield full recall in a bibliographic peer-to-peer scenario.
Knowledge-intensive work plays an increasingly important role in organisations of all types. Such work is characterized by a defined input and a defined output, but not by a defined way of transforming the input into the output. Within this context, the research project DYONIPOS aims to support the two crucial roles in a knowledge-intensive organization: the process executer and the process engineer. Ad-hoc support will be provided for the knowledge worker by combining the development of context-sensitive, intelligent, and agile semantic technologies with contextual retrieval. DYONIPOS provides process executers with guidance through business processes and just-in-time resource support based on the current user context; these two functions are the focus of this paper.
Can crimes be modeled as data mining problems? We try to answer this question in this paper. Crimes are a social nuisance and cost our society dearly in several ways. Any research that can help solve crimes faster will pay for itself. Here we look at the use of a clustering algorithm in a data mining approach to help detect crime patterns and speed up the process of solving crimes. We look at k-means clustering with some enhancements to aid in the identification of crime patterns. We apply these techniques to real crime data from a sheriff’s office and validate our results. We also use a semi-supervised learning technique for knowledge discovery from the crime records and to help increase the predictive accuracy. We also developed a weighting scheme for attributes to deal with limitations of various out-of-the-box clustering tools and techniques. This easy-to-implement machine learning framework works with the geo-spatial plot of crime and helps to improve the productivity of detectives and other law enforcement officers. It can also be applied to counter-terrorism for homeland security.
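The combination of k-means with an attribute weighting scheme can be sketched as follows. This is a minimal illustration, assuming a weighted squared Euclidean distance and a naive initialisation from the first k records; the abstract does not describe the actual weighting scheme or enhancements, and the function names and toy data are hypothetical.

```python
def weighted_sq_dist(a, b, weights):
    """Squared Euclidean distance in which each attribute has its own weight."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def weighted_kmeans(points, k, weights, iterations=20):
    """Plain Lloyd iterations using the weighted distance.

    Centroids start at the first k points (an assumption made only to keep
    the sketch deterministic).
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid under the weighted distance.
        assignment = [
            min(range(k), key=lambda c: weighted_sq_dist(p, centroids[c], weights))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy records with two numeric attributes (hypothetical data).
records = [(0, 0), (10, 1), (0, 1), (10, 0)]
by_first, _ = weighted_kmeans(records, k=2, weights=(1, 0))
by_second, _ = weighted_kmeans(records, k=2, weights=(0, 1))
print(by_first, by_second)
```

Setting a weight to zero removes an attribute from the grouping entirely. Intermediate weights let an analyst state that some attributes matter more than others without changing the clustering algorithm itself.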
Passage retrieval is an essential part of question answering systems. In this paper we use statistical language models to perform this task. Previous work has shown that language modeling techniques provide better results for both document and passage retrieval. The motivation behind this paper is to define new smoothing methods for passage retrieval in question answering systems. The long-term objective is to improve the ability of question answering systems to isolate the correct answer by choosing and evaluating the appropriate section of a document. In this work we use a three-step approach. The first two steps are standard document and passage retrieval using the Lemur toolkit. As a novel contribution we propose as the third step a re-ranking using dedicated backing-off distributions. In particular, backing off from the passage-based language model to a language model trained on the document from which the passage is taken shows a significant improvement. For a TREC question answering task we can increase the mean average precision from 0.127 to 0.176.
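The re-ranking idea can be illustrated with a small sketch. It assumes linear interpolation between a passage unigram model and the unigram model of its parent document, which is one simple way to let a sparse passage model fall back on its document; the abstract does not specify the actual backing-off distributions. The function names, the weight `lam`, and the probability floor `eps` are illustrative, and the Lemur toolkit is not used here.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram model as a dict: word -> probability."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def score(query, passage, document, lam=0.5, eps=1e-6):
    """Query log-likelihood under a passage model smoothed by its document model."""
    p_passage = unigram(passage)
    p_document = unigram(document)
    log_prob = 0.0
    for word in query:
        p = lam * p_passage.get(word, 0.0) + (1 - lam) * p_document.get(word, 0.0)
        log_prob += math.log(max(p, eps))  # floor avoids log(0) for unseen words
    return log_prob


def rerank(query, candidates, lam=0.5):
    """Sort (passage, document) pairs by descending smoothed query likelihood."""
    return sorted(candidates, key=lambda pd: score(query, pd[0], pd[1], lam), reverse=True)


doc = "the cat sat on the mat while the dog slept".split()
first, second = "the cat sat".split(), "the dog slept".split()
best_passage, _ = rerank(["dog"], [(first, doc), (second, doc)])[0]
print(" ".join(best_passage))
```

A passage that lacks a query word still receives probability mass from its document, so it is penalised rather than eliminated.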
This contribution presents user-participatory methods for the database design of an information system. It shows how Extreme Programming, as a central approach of agile software development, can enable the synergistic interweaving of traditionally technology-driven software engineering (SE) with the user-centred methods of User-Centered Design (UCD), and what added value results. Since communication between system developers and domain experts was particularly important in the project presented, the procedures used, the problems encountered, and the solution approaches in requirements analysis are discussed. The use of interview and observation techniques is illustrated by the example of capturing the car multimedia application domain for the purpose of data and system modelling.
Recently, there has been an increased interest in the exploitation of background knowledge in the context of text mining tasks, especially text classification. At the same time, kernel-based learning algorithms like Support Vector Machines have become a dominant paradigm in the text mining community. Amongst other reasons, this is also due to their capability to achieve more accurate learning results by replacing the standard linear kernel (bag-of-words) with customized kernel functions which incorporate additional a priori knowledge. In this paper we propose a new approach to the design of ‘semantic smoothing kernels’ by means of an implicit superconcept expansion using well-known measures of term similarity. The experimental evaluation on two different datasets indicates that our approach consistently improves performance in situations where (i) training data is scarce or (ii) the bag-of-words representation is too sparse to build stable models when using the linear kernel.
In patent retrieval, ranking methods and techniques such as relevance feedback have not yet become established. Ranking systems are criticised above all for their lack of transparency for the user. Building on an analysis of search processes in patent retrieval, the PatentAide system attempts to implement a ranking system. PatentAide supports important techniques in the patent retrieval process such as term expansion, provides a ranked result list, and additionally allows dynamic relevance feedback.
This paper addresses the problem that, despite the rapidly developing range of search engines with visual result presentation, no consensus has yet been reached on a common basis for sustainable evaluations of information retrieval systems with a visualisation component. The problem is demonstrated by means of a state-of-the-art analysis, and a solution is developed and tested on an example: an integrated approach that combines suitable evaluation methods on the basis of a morphological framework.
Entwicklung eines dynamischen Entry Vocabulary Moduls für die Stiftung Wissenschaft und Politik
(2006)
Vocabulary mismatch between query and documents is a major problem in information retrieval. The Entry Vocabulary Module has become established in recent years as a solution. This contribution presents a dynamic Entry Vocabulary Module that expands the query in a multi-stage procedure, depending on intermediate results, for a collection with several content-related fields. The system was evaluated on a multilingual collection of around 600,000 specialist texts and produced positive results.
Employing lexical-semantic knowledge in information retrieval (IR) is recognised as a promising way to go beyond bag-of-words approaches to IR. However, it has not yet become a standard component of IR systems due to many difficulties which arise when knowledge-based methods are applied in IR. In this paper, we explore the use of semantic relatedness in IR computed on the basis of GermaNet, a German wordnet [Kunze, 2004]. In particular, we present several experiments on the German IR benchmarks GIRT’2005 (training set) and GIRT’2004 (test set) aimed at investigating the potential of semantic relatedness in IR as opposed to bag-of-words models, as implemented e.g. in Lucene [Gospodnetic and Hatcher, 2005]. These experiments shed some light upon how to combine the strengths of both models in our future work. Our evaluation results show some improvement in IR performance over the bag-of-words model, i.e. a significant increase in mean average precision of about 5 percentage points for the training set, but only a 1 percent increase for our test set.
This paper presents work in progress on an adaptive workflow management tool for digital design projects. The chip design follows standardized default processes which are adapted during an ongoing project by changing requirements from both design and application uncertainties. Our approach focuses on flexible monitoring and case-based authoring support of adaptive workflows in order to support the knowledge management in a real-world application.
In social bookmark tools users are setting up lightweight conceptual structures called folksonomies. Currently, the information retrieval support is limited. We present a formal model and a new search algorithm for folksonomies, called FolkRank, that exploits the structure of the folksonomy. The proposed algorithm is also applied to find communities within the folksonomy and is used to structure search results. All findings are demonstrated on a large-scale dataset. A long version of this paper has been published at the European Semantic Web Conference 2006.
In recent years there has been an increased interest in frequent pattern discovery in large databases of graph-structured objects. While the frequent connected subgraph mining problem for tree datasets can be solved in incremental polynomial time, it becomes intractable for arbitrary graph databases. Existing approaches have therefore resorted to various heuristic strategies and restrictions of the search space, but have not identified a practically relevant tractable graph class beyond trees. In this paper, we define the class of so-called tenuous outerplanar graphs, a strict generalization of trees, develop a frequent subgraph mining algorithm for tenuous outerplanar graphs that works in incremental polynomial time, and evaluate the algorithm empirically on the NCI molecular graph dataset.
The exchange of personal experiences is a way of supporting decision making and interpersonal communication. In this article, we discuss how augmented personal memories could be exploited in order to support such a sharing. We start with a brief summary of a system implementing an augmented memory for a single user. Then, we exploit results from interviews to define an example scenario involving sharable memories. This scenario serves as background for a discussion of various questions related to sharing memories and potential approaches to their solution. We especially focus on the selection of relevant experiences and sharing partners, sharing methods, and the configuration of those sharing methods by means of reflection.
The special treatment of geographic search queries is receiving increasing attention in information retrieval. This article gives an overview of current research activities and central problems in geographic information retrieval, with particular reference to the GeoCLEF project within the cross-lingual evaluation initiative CLEF. The information science group at the University of Hildesheim both took on organisational tasks in this project and carried out its own experiments. These focused on combining weighting approaches with Boolean retrieval and on the weighting of geographic proper names. Based on initial interpretations of the results and experience, further research needs and the group's own future plans, such as testing heuristics for query expansion, are outlined.
Hashing-based indexing is a powerful technology for similarity search in large document collections [Stein 2005]. It is based on the idea of interpreting hash collisions as an indicator of similarity, provided that a suitably constructed hash function is available. This paper discusses the conditions under which basic retrieval tasks can benefit from this new technology. In addition, two current hashing-based indexing approaches are presented, and the improvements they achieve on real retrieval tasks are compared. An analysis of this kind is new; it shows the enormous potential of tailored hashing-based indexing methods such as fuzzy fingerprinting.
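The idea of reading hash collisions as a similarity signal can be sketched with a toy example. The hash function below (the set of a document's k most frequent words) is purely illustrative. It is not fuzzy fingerprinting or either of the two approaches compared in the paper, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def toy_fingerprint(text, k=3):
    """Toy similarity-sensitive hash key: the set of the k most frequent words."""
    counts = Counter(text.lower().split())
    # Sort by descending count, then alphabetically, so ties break deterministically.
    top = sorted(counts, key=lambda w: (-counts[w], w))[:k]
    return frozenset(top)


def build_index(docs):
    """Map each fingerprint to the ids of all documents that hash to it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        index[toy_fingerprint(text)].append(doc_id)
    return index


def similar_candidates(index, query_text):
    """Documents colliding with the query are returned as similarity candidates."""
    return index.get(toy_fingerprint(query_text), [])


docs = {
    "a": "hash hash hash index index search",
    "b": "hash index hash search index hash search",
    "c": "wavelet gene expression network",
}
index = build_index(docs)
print(similar_candidates(index, "search index hash"))
```

Lookup cost depends on the bucket size rather than on the collection size, which is what makes the approach attractive for large collections. The quality of the results rests entirely on how well the hash function maps similar documents to the same key.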
Personalization involves gathering user-specific information during interaction with the user and using it to deliver results appropriate to the user’s needs. This paper presents a statistical method that learns the user's interests by collecting evidence from his search history. The method uses both the user's relevance point of view on familiar words, in order to infer and express his interests, and a correlation metric, in order to update them.
In this paper, the idea of ubiquitous information retrieval is presented in a storytelling manner. Starting from a rough review of information retrieval system usage, some empirical hints on IR in everyday life are given. Ch. 4 explores the heterogeneity of interaction with IRS for one day in the life of a (common search engine) user. Ch. 5 summarizes these observations and suggests research approaches for modelling information retrieval as an essential component of interaction in the information society.
Cross Language Information Retrieval (CLIR) enables people to search for information written in languages other than their query language. Information can be retrieved either from a single cross-lingual collection or from a variety of distributed cross-lingual sources. This paper presents initial results exploring the effectiveness of distributed CLIR using query-based sampling techniques, which to the best of our knowledge has not been investigated before. In distributed retrieval with multiple databases, query-based sampling provides a simple and effective way of acquiring accurate resource descriptions, which help to select which databases to search. Observations from our initial experiments show that the negative impact of query-based sampling on cross-language search may not be as great as it is on monolingual retrieval.
This thesis investigates ways of integrating and modelling distributed business processes using Web Services. In addition to the fundamentals, it covers advanced concepts in the area of Web Services and service-oriented architectures, such as security aspects, transactions, and the orchestration and choreography of services. The application context is the process portal solution of abaXX Technology AG (Stuttgart), within which Web Services are made usable by a prototype software tool (Web Service ImportWizard) developed as part of this thesis.
Automated production plants increasingly use sensor systems to monitor the quality produced and to ensure it on the basis of collected process data. The heterogeneity of the sensors integrated at different points in the process calls for an approach to simple integration. The goal of the integration is a view of quality prepared for different roles, which also includes feedback for fault deduction. This experience report presents the approach to the syntactic and semantic integration of quality data developed in the BridgeIT project. In particular, the approach makes it easy to connect new sensor systems.
The usage of Wikis for the purpose of knowledge management within a business company is only of value if the stored information can be found easily. The fundamental characteristic of a Wiki, its easy and informal usage, results in large amounts of steadily changing, unstructured documents. The widely used full-text search often provides search results of insufficient accuracy. In this paper, we will present an approach likely to improve search quality, through the use of Semantic Web, Text Mining, and Case Based Reasoning (CBR) technologies. Search results are more precise and complete because, in contrast to full-text search, the proposed knowledge-based search operates on the semantic layer.
The workshop week "Lernen, Wissen und Adaptivität 2006" (LWA 06) sees itself as a forum in which established researchers and those new to a field can present their current work and discuss it intensively with one another. This is the particular appeal of the event, which first took place in Magdeburg in 1999. This year's venue was the University of Hildesheim and, as in previous years, a number of interesting workshops were organised by various special interest groups of the Gesellschaft für Informatik e. V. (GI). These were: FG-ABIS (workshop of the special interest group "Adaptivität und Benutzermodellierung in interaktiven Softwaresystemen"); FG-IR (workshop of the special interest group "Information Retrieval"); FG-KDML / AK-KD (workshop of the special interest group "Knowledge Discovery, Data Mining und Maschinelles Lernen" and of the working group "Knowledge Discovery"); FG-WM (workshop of the special interest group "Wissensmanagement").
Due to the inherent characteristics of data streams, appropriate mining techniques rely heavily on window-based processing and/or (approximating) data summaries. Because resources such as memory and CPU time for maintaining such summaries are usually limited, the quality of the mining results is affected in different ways. Taking Frequent Itemset Mining and a corresponding Change Detection as selected mining techniques, we discuss in this paper extensions of stream mining algorithms that make it possible to determine the output quality under changes in the available resources (mainly memory space). Furthermore, we give directions on how to estimate resource consumption based on user-specified quality requirements.
This contribution describes retrieval experiments with a large multilingual corpus carried out for WebCLEF 2006 at the University of Hildesheim. The focus was on the use of HTML structural elements, the use of blind relevance feedback, and the evaluation of the language-independent indexing approach.
Evaluation metrics for rule learning typically, in one way or another, trade off consistency and coverage. In this work, we investigate this trade-off for three different families of rule learning heuristics, all of them featuring a parameter that implements this trade-off in different guises. These heuristics are the m-estimate, the F-measure, and the Klösgen measures. The main goals of this work are to extend our understanding of these heuristics by visualizing their behavior via isometrics in coverage space, and to determine optimal parameter settings for them. Interestingly, even though the heuristics implement this trade-off in quite different ways, their optimal settings realize quite similar evaluation functions. Our empirical results on a large number of datasets demonstrate that, even though we do not use any form of pruning, the quality of the rules learned with these settings outperforms standard rule learning heuristics and approaches the performance of Ripper, a state-of-the-art rule learning system that uses extensive pruning and optimization phases.
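For orientation, the three parametrised heuristics can be written down in a few lines. The sketch uses the definitions commonly given in the rule learning literature; the abstract itself states no formulas, so the exact forms, the function names, and the example counts are assumptions. Here `p` and `n` are the positive and negative examples covered by a rule, and `P` and `N` are the totals in the training set.

```python
def m_estimate(p, n, P, N, m):
    """Precision shrunk towards the class prior P/(P+N) by m virtual examples."""
    return (p + m * P / (P + N)) / (p + n + m)


def f_measure(p, n, P, N, beta):
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    precision = p / (p + n)
    recall = p / P
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


def kloesgen(p, n, P, N, omega):
    """Coverage raised to omega, times the precision gain over the class prior."""
    coverage = (p + n) / (P + N)
    return coverage ** omega * (p / (p + n) - P / (P + N))


# A hypothetical rule covering 40 of 50 positives and 10 of 50 negatives.
p, n, P, N = 40, 10, 50, 50
print(m_estimate(p, n, P, N, m=2))
print(f_measure(p, n, P, N, beta=1))
print(kloesgen(p, n, P, N, omega=1))
```

In each family the parameter shifts the balance in the same direction: larger values of `m`, `beta`, or `omega` give more weight to coverage, while small values reward consistency. All three functions assume that the rule covers at least one example.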
Class binarizations are effective methods that break a multi-class problem down into several two-class (binary) problems to improve weak learners. This paper analyzes the effects of these methods when a Naive Bayes learner is chosen as the base classifier. We consider the known unordered and pairwise class binarizations and propose an alternative approach for a pairwise calculation of a modified Naive Bayes classifier.
Pattern recognition of gene expression data on biochemical networks with simple wavelet transforms
(2006)
Biological networks show a rather complex, scale-free topology consisting of a few highly connected nodes (hubs) and many sparsely connected (peripheral and concatenating) nodes. Furthermore, they contain regions of rather high connectivity, as, e.g., in metabolic pathways. Analysing data for an entire network consisting of several thousand nodes and edges is not manageable. This inspired us to divide the network into functionally coherent sub-graphs and to analyse the data corresponding to each of these sub-graphs individually. We separated the network in a two-fold way: 1. clustering approach: sub-graphs were defined by more highly connected regions using a clustering procedure on the network; and 2. connected edge approach: paths of concatenated edges connecting striking combinations of the data were selected and taken as sub-graphs for further analysis. As experimental data we used gene expression data of the bacterium Escherichia coli, which was exposed to two distinct environments: oxygen rich and oxygen deprived. We mapped the data onto the corresponding biochemical network and extracted discriminating features using Haar wavelet transforms for both strategies. In comparison to standard methods, our approaches yielded a much more consistent image of the changed regulation in the cells. In general, our concept may be transferred to network analyses on any interaction data, provided that data for two comparable states of the associated nodes are available.
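To illustrate the kind of transform mentioned above, here is a minimal sketch of a single level of the orthonormal Haar wavelet transform applied to a sequence of values. The abstract does not state how expression values are ordered within a sub-graph or which coefficients serve as features, so the input sequence and the function name below are purely illustrative assumptions.

```python
import math


def haar_step(values):
    """One level of the orthonormal Haar transform.

    Returns (approximation, detail) coefficients computed from
    neighbouring pairs of the input sequence.
    """
    if len(values) % 2 != 0:
        raise ValueError("input length must be even")
    s = math.sqrt(2.0)
    pairs = list(zip(values[0::2], values[1::2]))
    approx = [(a + b) / s for a, b in pairs]   # scaled pairwise sums
    detail = [(a - b) / s for a, b in pairs]   # scaled pairwise differences
    return approx, detail


# Hypothetical expression differences (condition A minus condition B)
# for four genes along one path in the network.
approx, detail = haar_step([4.0, 2.0, 5.0, 5.0])
print(approx, detail)
```

Large detail coefficients mark neighbouring positions whose values differ strongly, while the approximation coefficients summarise the local level. Applying the step repeatedly to the approximation part gives a multi-level decomposition.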
The Personal Reader Framework enables the design, realization and maintenance of personalized Web Content Readers. In this architecture, personalized access to web content is realized by various Web Services, which we call Personalization Services. With our new approach of Configurable Web Services, we allow users to configure these Personalization Services. Such configurations can be stored and reused at a later time. The interface between users and Configurable Web Services is realized in a Personal Reader Agent. This Agent allows selection, configuration and calling of the Web Services and further provides personalization functionalities such as the reuse of stored configurations that suit the users' interests.
German Smart Sensor Web (GSSW) is an experimental system for the German Federal Armed Forces. Its purpose is to provide a secure integration infrastructure for networked sensors. GSSW has a middleware based on ontologies and software agent technology. It uses a semantic representation of sensor data and other information in the area of intelligence, surveillance and reconnaissance (ISR) to feed “smart” symbolic AI based assistance functions. Interface agents also use the knowledge representation to personalize different aspects of the user interface. In this contribution the current state of GSSW, its software architecture and the personalization features of the user interface layer are presented.
Predicting student performance (PSP) is an important task in Student Modeling, where we would like to know whether the students solve the given problems (tasks) correctly, so that we can understand how the students learn, provide them with early feedback, and help them improve their studying. This thesis introduces several approaches for student modeling, especially for PSP, which are mainly based on state-of-the-art techniques in Recommender Systems (RS). First, we formulate the PSP problem and show how to map it to the rating prediction task in RS and to a forecasting problem. Second, we propose using latent factor models, e.g., matrix factorization, for student modeling. These models can implicitly take into account the student and task latent factors (e.g., slip and guess) as well as student effect/bias and task effect/bias. Moreover, since similar students may have similar performances, we suggest using k-nearest neighbors collaborative filtering to take into account the correlations between the students and the tasks. Third, in students' problem solving, each student performs several tasks, and each task requires one or many skills, while the students are also required to master the skills that they have learned. We propose to exploit such multiple relationships by using a multi-relational matrix factorization approach. Fourth, as student performance (student knowledge) accumulates and improves over time, a trend line can be observed in his/her performance. As with time series, forecasting techniques are reasonable choices for this problem. Furthermore, it is well known that student (human) knowledge is diverse; thus, the thinking and performance of one student may differ from those of another. To cope with these aspects, we propose personalized forecasting methods which use the past performances of an individual student to forecast his/her own future performance.
Fifth, since student knowledge changes over time, temporal/sequential information is an important factor in PSP. We propose tensor factorization methods to model both the student/task latent factors and the sequential/temporal effects. Sixth, we open an issue for recommendation in e-learning, namely recommending tasks to the students. This approach can tackle existing issues in the literature, since we can recommend tasks to the students using their performance instead of their preference. Based on student performance, we can recommend suitable tasks to the students by filtering out the tasks that are too easy or too hard, or both, depending on the system goal. Furthermore, we propose using a context-aware factorization approach to utilize multiple interactions between the students and the tasks. Seventh, we discover a characteristic of student performance data, namely the class imbalance problem, i.e., the number of correct solutions is higher than the number of incorrect solutions, which may hinder classifiers' performance. To tackle this problem, we introduce several methods as well as a new evaluation measure for learning from imbalanced data. Finally, we validate the proposed methods in many experiments. We compare them with other state-of-the-art methods and empirically show that, in most cases, the proposed methods can improve the prediction results. We therefore conclude that our approaches are reasonable choices for student modeling, especially for predicting student performance. Last but not least, we raise some open issues for future research in this area.
The existing personalization systems typically base their services on general user models that ignore the issue of context-awareness. This position paper focuses on developing mechanisms for cross-context reasoning of the user models, which can be applied for the context-aware personalization. The reasoning augments the sparse user models by inferring the missing information from other contextual conditions. Thus, it upgrades the existing personalization systems and facilitates provision of accurate context-aware services.
This paper presents Prospector, a front-end to the Google search engine which, using individual and group models of users’ interests, re-ranks search results to better suit the user’s inferred needs. The paper outlines the motivation behind the development of the system, describes its adaptive components, and discusses the lessons learned thus far, as well as the work planned for the future.
A key argument for modeling knowledge in ontologies is the easy re-use and re-engineering of the knowledge. However, current ontology engineering tools provide only basic functionalities for analyzing ontologies. Since ontologies can be considered as graphs, graph analysis techniques are a suitable answer for this need. Graph analysis has been performed by sociologists for over 60 years, and resulted in the vivid research area of Social Network Analysis (SNA). While social network structures currently receive high attention in the Semantic Web community, there are only very few SNA applications, and virtually none for analyzing the structure of ontologies. We illustrate the benefits of applying SNA to ontologies and the Semantic Web, and discuss which research topics arise on the edge between the two areas. In particular, we discuss how different notions of centrality describe the core content and structure of an ontology. From the rather simple notion of degree centrality over betweenness centrality to the more complex eigenvector centrality, we illustrate the insights these measures provide on two ontologies, which are different in purpose, scope, and size.
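As a rough illustration of two of the centrality notions mentioned above, the sketch below computes degree centrality and eigenvector centrality on a small undirected graph. The toy concept graph, the adjacency-dictionary representation, and the function names are assumptions made for this example; they are not taken from the paper, and betweenness centrality is omitted for brevity.

```python
import math


def degree_centrality(graph):
    """Number of neighbours of each node, normalised by the maximum possible (n - 1)."""
    n = len(graph)
    if n < 2:
        return {v: 0.0 for v in graph}
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def eigenvector_centrality(graph, iterations=200):
    """Power iteration on (A + I), normalised to unit Euclidean length.

    Adding the identity leaves the eigenvectors of the adjacency matrix A
    unchanged but avoids oscillation on bipartite graphs such as trees.
    """
    score = {v: 1.0 for v in graph}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[u] for u in graph[v]) for v in graph}
        norm = math.sqrt(sum(x * x for x in new.values()))
        score = {v: x / norm for v, x in new.items()}
    return score


# Hypothetical concept graph: one root concept linked to three sub-concepts.
ontology = {
    "Thing": {"Person", "Organization", "Event"},
    "Person": {"Thing"},
    "Organization": {"Thing"},
    "Event": {"Thing"},
}
print(degree_centrality(ontology))
print(eigenvector_centrality(ontology))
```

Degree centrality only counts direct neighbours, whereas eigenvector centrality also rewards being connected to nodes that are themselves central, which is why the two measures can rank the concepts of a larger ontology differently.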
This paper gives an introduction to TIRA, a software architecture for building tailor-made information retrieval tools. TIRA enables users to interactively specify the processing pipeline of a desired IR tool as a graph: the nodes of the graph denote so-called "basic IR services", and the edges model control and data flows. TIRA provides the functionality of a runtime container for executing the specified processes in a distributed environment. One motivation for our research is the challenge of personalization: there is a discrepancy between IR theory and its algorithms on the one hand and the implementation, deployment, and execution of corresponding programs, adapted to personal preferences, on the other. This gap can be narrowed with adequate software engineering.
In this work we propose a novel, generalized framework for feature space transformation in unsupervised knowledge discovery settings. Unsupervised feature space transformation inherently is a multi-objective optimization problem. In order to facilitate data exploration, transformations should increase the quality of the result and should still preserve as much of the original data set information as possible. We exemplify this relationship on the problem of data clustering. First, we show that existing approaches to multi-objective unsupervised feature selection do not pose the optimization problem in an appropriate way. Furthermore, using feature selection only is often not sufficient for real-world knowledge discovery tasks. We propose a new, generalized framework based on the idea of information preservation. This framework enables feature selection as well as feature construction for unsupervised learning. We compare our method against existing approaches on several real world data sets.
Recommender systems are personalized information systems that learn individual preferences from interacting with users. Recommender systems use machine learning techniques to compute suggestions for the users. Supervised machine learning relies on optimizing for a suitable objective function. Suitability means here that the function actually reflects what users and operators consider to be good system performance. Most of the academic literature on recommendation is about rating prediction. For two reasons, this is not the most practically relevant prediction task in the area of recommender systems: First, the important question is not how strongly a user will express liking for a given item (through the rating), but rather which items a user will like. Second, obtaining explicit preference information like ratings requires additional actions on the part of the user, which always comes at a cost. Implicit feedback in the form of purchases, viewing times, clicks, etc., on the other hand, is abundant anyway. Very often, this implicit feedback is only present in the form of positive expressions of preference. In this work, we primarily consider item recommendation from positive-only feedback. A particular problem is the suggestion of new items, i.e. items that have no interaction data associated with them yet. This is an example of a cold-start scenario in recommender systems. Collaborative models like matrix factorization rely on interaction data to make predictions. We augment a matrix factorization model for item recommendation with a mechanism to estimate the latent factors of new items from their attributes (e.g. descriptive keywords). In particular, we demonstrate that optimizing the latent factor estimation with regard to the overall loss of the item recommendation task is superior to optimizing it with regard to the prediction error on the latent factors.
The idea of estimating latent factors from attributes can be extended to other tasks (new users, rating prediction) and prediction models, yielding a general framework to deal with cold-start scenarios. We also adapt the Bayesian Personalized Ranking (BPR) framework, which is state of the art in item recommendation, to a setting where more popular items are more frequently encountered when making predictions. By generalizing even more, we get Weighted Bayesian Personalized Ranking, an extension of BPR that allows importance weights to be placed on specific users and items. All method contributions are supported by experiments using large-scale real-life datasets from various application areas like movie recommendation and music recommendation. The software used for the experiments has been released as part of an efficient and scalable free software package.
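For readers unfamiliar with BPR, the following is a minimal sketch of one stochastic gradient step of the standard BPR criterion for a matrix factorization model. It covers plain BPR only, not the weighted or cold-start extensions described in the thesis. The function names, learning rate, regularization constant, and toy factor vectors are assumptions chosen for illustration.

```python
import math


def score(w_u, h_i):
    """Predicted preference of a user for an item: dot product of their factors."""
    return sum(a * b for a, b in zip(w_u, h_i))


def bpr_step(w_u, h_i, h_j, lr=0.1, reg=0.01):
    """One gradient-ascent step on ln sigmoid(x_uij) with L2 regularisation.

    w_u: factors of user u; h_i: factors of an item with positive feedback;
    h_j: factors of an item without feedback. The lists are updated in place.
    Returns x_uij as computed before the update.
    """
    x_uij = score(w_u, h_i) - score(w_u, h_j)
    g = 1.0 / (1.0 + math.exp(x_uij))  # equals 1 - sigmoid(x_uij)
    for f in range(len(w_u)):
        wu, hi, hj = w_u[f], h_i[f], h_j[f]  # keep old values for all three updates
        w_u[f] += lr * (g * (hi - hj) - reg * wu)
        h_i[f] += lr * (g * wu - reg * hi)
        h_j[f] += lr * (-g * wu - reg * hj)
    return x_uij


# Toy example with two latent factors; initially both items score the same.
user, seen_item, unseen_item = [0.1, 0.1], [0.1, -0.1], [-0.1, 0.1]
for _ in range(100):
    bpr_step(user, seen_item, unseen_item)
print(score(user, seen_item), score(user, unseen_item))
```

The criterion is pairwise: rather than predicting a rating, each step pushes the score of an item the user interacted with above the score of an item the user did not interact with, which matches the ranking nature of item recommendation from positive-only feedback.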
This paper reports on experiments that attempt to characterize the relationship between users and their knowledge of the search topic in a Question Answering (QA) system. It also investigates user search behavior with respect to the length of answers presented by a QA system. Two lengths of answers were compared: snippets (one to two sentences of text) and exact answers. A user test was conducted in which 92 factoid questions were judged by 44 participants, in order to explore the participants’ preferences, feelings and opinions about QA system tasks. The conclusions drawn from the results were that participants preferred the snippets set and obtained higher accuracy in finding answers from it. However, accuracy varied according to users’ topic familiarity: users were only substantially helped by the wider context of a snippet if they were already familiar with the topic of the question. Without such familiarity, users were about as accurate at locating answers from the snippets as they were with the exact answer set.
We propose the implementation of an intelligent information system on free and open source software. This system will consist of a case-based reasoning (CBR) system and several machine learning modules to maintain the knowledge base and train the CBR system thus enhancing its performance. Our knowledge base will include data on free and open source software provided by the Debian project, the FLOSSmole project, and other public free and open source software directories. We plan to enrich these data by learning additional information such as concepts and different similarities. With this knowledge base, we hope to be able to create an information system that will be capable of answering queries based on precise as well as vague criteria and give intelligent recommendations on software based on the preferences of the user.
Information on the internet is a vast resource for question answering. As the amount of available information from web pages increases, novel methods for finding precise answers to user queries and questions must be found. Standard information retrieval methods are efficient, but often fail to provide a user with short, precise answers. A deep linguistic analysis of all information is time consuming, but it offers more advanced means to find answers to a user’s question. Shallow natural language processing methods seem to work well on a limited range of questions, but they are not suitable for finding answers to more complex questions. This paper describes work in progress on the question answering system IRSAW (Intelligent Information Retrieval on the Basis of a Semantically Annotated Web), a system that combines information retrieval with a deep linguistic analysis of texts to obtain answers to natural language questions. In IRSAW, different techniques for finding answers lead to different sets of answer candidates, which are then merged to produce a final answer. The system’s architecture and functionality are described before evaluation results of a first prototype are presented.
In the context of genome research, the method of gene expression analysis has been used for several years. Related microarray experiments are conducted all over the world, and consequently, a vast number of microarray data sets is produced. Having access to this variety of repositories, researchers would like to incorporate this data in their analyses to increase the statistical significance of their results. In this paper, we present a new two-phase clustering strategy which is based on the combination of local clustering results to obtain a global clustering. The advantage of such a technique is that each microarray data set can be normalized and clustered separately. The set of different relevant local clustering results is then used to calculate the global clustering result. Furthermore, we present an approach based on technical as well as biological quality measures to determine weighting factors that quantify the proportion of each local result within the global result. The better the attested quality of the local results, the stronger their impact on the global result.
Many ubiquitous computing applications so far fail to live up to their expectations. While working perfectly in controllable laboratory environments, they seem to be particularly prone to problems related to a discrepancy between user expectation and system behavior when released into the wild. This kind of unwanted behavior of course prevents the vision of an emerging trend of context-aware and adaptive applications in mobile and ubiquitous computing from becoming reality. In this paper, we present examples from our practical work and show why, for ubiquitous computing, unwanted behavior is not just a matter of sufficient requirements engineering and good or bad technical system verification. We furthermore provide a classification of the phenomenon and an analysis of the causes of its occurrence and its resolvability in context-aware and adaptive systems.
User Centric Hierarchical Classification and Associated Evaluation Measures for Document Retrieval
(2006)
In a Web Service-based Semantic Web, long-term usage of single Services will become unlikely. Therefore, user modeling on the Web Service’s side might be imprecise due to the lack of a sufficient amount of user interaction. In our Personal Reader Framework, the user profile is stored centrally and can be used by different Web Services. By combining information about the user from different Web Services, the coverage and precision of such a centralized user profile increase. To preserve the user’s privacy, access to the user profile is restricted by policies.
This paper presents results from an initial user study exploring the relationship between system effectiveness, as quantified by traditional measures such as precision and recall, and users’ effectiveness and satisfaction with the results. The tasks involve finding images for recall-based tasks. It was concluded that no direct relationship between system effectiveness and users’ performance could be proven (as shown by previous research). People learn to adapt to a system regardless of its effectiveness. This study recommends that a combination of attributes (e.g. system effectiveness, user performance and satisfaction) is a more effective way to evaluate interactive retrieval systems. Results of this study also reveal that users are more concerned with accuracy than coverage of the search results.
Authors of menu optimization methods often use navigation time prediction models without validating whether the model is adequate for the site and its users. We review the assumptions underlying navigation time prediction models and present a method to validate these assumptions offline. Experiments on four web sites show how accurately the various model features describe the behavior of the users. These results can be used to select the best model for a new optimization task. In addition, we find that the existing optimization methods all use suboptimal models. This indicates that our results can significantly contribute to more effective menu optimization.
This paper presents an approach that, based on techniques of visual data exploration and semantics-based fusion, allows analysis methods such as data mining and visualization techniques to be used for knowledge generation in distributed, cooperative environments. Ontologies are used to semantically describe distributed sources, which makes it possible to fuse the data and analysis methods from these sources. The core of the architecture is the gateway component, which allows the analyst to use data and analysis methods in a distributed environment. The presented components were evaluated in a medical application scenario.
This paper presents a search system for information on scientists which was implemented prototypically for the area of information science, employing Web Content Mining techniques. The sources that are used in the implemented approach are online publication services and personal homepages of scientists. The system contains wrappers for querying the publication services and information extraction from their result pages, as well as methods for information extraction from homepages, which are based on heuristics concerning structure and composition of the pages. Moreover, a specialised search technique for finding personal homepages of information scientists was developed.
The World Wide Web is an important medium for communication, data transaction and retrieval. Data mining is the process of extracting interesting patterns from a set of data sources. Web mining is the application of data mining techniques to extract useful patterns from web data. Web mining can be divided into three categories: web usage mining, web content mining, and web structure mining. Web usage mining, or web log mining, is the extraction of interesting patterns from web server log entries. Those patterns are used to study user behavior and interests, facilitate support and services offered to the website user, improve the structure of the website, and facilitate personalization and adaptive websites. This paper aims to explore various research issues in web usage mining and its application in the field of adaptive and personalized websites.