Recommender systems are personalized information systems that learn individual preferences from interacting with users. They use machine learning techniques to compute suggestions for the users. Supervised machine learning relies on optimizing for a suitable objective function. Suitability here means that the function actually reflects what users and operators consider to be good system performance. Most of the academic literature on recommendation is about rating prediction. For two reasons, this is not the most practically relevant prediction task in the area of recommender systems: First, the important question is not how much a user will claim to like a given item (the rating), but rather which items a user will like. Second, obtaining explicit preference information like ratings requires additional actions on the user's side, which always come at a cost. Implicit feedback in the form of purchases, viewing times, clicks, etc., on the other hand, is abundant anyway. Very often, this implicit feedback is only present in the form of positive expressions of preference. In this work, we primarily consider item recommendation from positive-only feedback. A particular problem is the suggestion of new items -- items that have no interaction data associated with them yet. This is an example of a cold-start scenario in recommender systems. Collaborative models like matrix factorization rely on interaction data to make predictions. We augment a matrix factorization model for item recommendation with a mechanism to estimate the latent factors of new items from their attributes (e.g. descriptive keywords). In particular, we demonstrate that optimizing the latent factor estimation with regard to the overall loss of the item recommendation task is superior to optimizing it with regard to the prediction error on the latent factors. The idea of estimating latent factors from attributes can be extended to other tasks (new users, rating prediction) and prediction models, yielding a general framework to deal with cold-start scenarios. We also adapt the Bayesian Personalized Ranking (BPR) framework, which is state of the art in item recommendation, to a setting where more popular items are more frequently encountered when making predictions. By generalizing even further, we get Weighted Bayesian Personalized Ranking, an extension of BPR that allows importance weights to be placed on specific users and items. All method contributions are supported by experiments using large-scale real-life datasets from application areas like movie and music recommendation. The software used for the experiments has been released as part of an efficient and scalable free software package.
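The attribute-to-factor idea can be illustrated with a minimal sketch. The variant below fits a linear mapping from item attributes to learned item factors by ridge-regularized least squares, i.e. it optimizes the prediction error on the latent factors; the work above argues that optimizing the mapping against the item-recommendation loss itself works better. All shapes, data, and the regularization constant are illustrative assumptions, not the actual setup.

```python
import numpy as np

# Hypothetical shapes: V (n_items x k) item factors learned by a
# matrix factorization model, X (n_items x d) binary attribute
# vectors (e.g. descriptive keywords). Both are random stand-ins.
rng = np.random.default_rng(0)
n_items, k, d = 500, 16, 40
V = rng.normal(size=(n_items, k))          # stand-in for learned factors
X = rng.integers(0, 2, size=(n_items, d))  # stand-in for item attributes

# Ridge-regularized least-squares map M: attributes -> latent factors.
lam = 1.0
M = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ V)   # (d x k)

# A new (cold-start) item with attribute vector x_new gets estimated
# factors x_new @ M and can then be scored against any user's factors u.
x_new = rng.integers(0, 2, size=d)
v_new = x_new @ M
u = rng.normal(size=k)                     # some user's latent factors
print(u @ v_new)                           # item-recommendation score
```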
In this paper, the idea of ubiquitous information retrieval is presented in a storytelling manner. Starting from a rough review of information retrieval system usage, some empirical hints on IR in everyday life are given. Ch. 4 explores the heterogeneity of interaction with IR systems during one day in the life of a (common search engine) user. Ch. 5 summarizes these observations and suggests research approaches for modelling information retrieval as an essential component of interaction in the information society.
This paper presents user-participatory methods for the database design of an information system. It shows how Extreme Programming, as a central approach of agile software development, can enable the synergetic interweaving of traditionally technology-driven software engineering (SE) with user-centered methods of User-Centered Design (UCD), and what added value results from this. Since communication between system developers and domain experts played a particularly important role in the presented project, the corresponding procedures, problems encountered, and solution approaches in the requirements analysis are discussed. The use of interview and observation techniques is illustrated with the example of capturing the Car Multimedia application domain for the purpose of data and system modelling.
The learners’ motivation has an impact on the quality of learning, especially in e-Learning environments. Most of these environments store data about the learners’ actions in log files. Logging the users’ interactions in educational systems makes it possible to track their actions at a refined level of detail. Data mining and machine learning techniques can “give meaning” to these data and provide valuable information for learning improvement. An area where improvement is absolutely necessary and of great importance is motivation, known to be an essential factor for preventing attrition in e-Learning. In this paper we investigate whether analysis of log file data can be used to estimate the motivational level of the learner. A decision tree is built from a limited number of log files from a web-based learning environment. The results suggest that time spent reading is an important factor for predicting motivation; performance in tests was also found to be a relevant indicator of the motivational level.
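As a hedged illustration of the approach (not the paper's actual features or data), a decision tree can be trained on per-learner features extracted from the logs; the feature names and toy values below are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical per-learner features extracted from log files:
# [time spent reading (min), number of test attempts, average test score]
X = np.array([
    [120, 5, 0.85],
    [ 10, 1, 0.40],
    [ 95, 4, 0.70],
    [ 15, 2, 0.35],
    [ 60, 3, 0.65],
    [  5, 1, 0.20],
])
y = [1, 0, 1, 0, 1, 0]   # 1 = motivated, 0 = at risk of attrition

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Inspect which log features the tree uses to predict motivation.
print(export_text(tree, feature_names=[
    "time_reading", "test_attempts", "test_score"]))
```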
Personalization involves gathering user-specific information during interaction with the user, which is then used to deliver results appropriate to the user’s needs. This paper presents a statistical method that learns a user's interests by collecting evidence from his search history. The method relies on the user's relevance judgments on familiar words to infer and express his interests, and on a correlation metric to update them.
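A minimal sketch of the correlation idea, under the assumption that relevance feedback on familiar terms is available from the search history (the toy history below is invented, not the paper's data or exact metric):

```python
import numpy as np

# Hypothetical search history: (terms of the interaction, relevant to user?).
history = [
    ({"python", "pandas"}, 1),
    ({"python", "snake"}, 0),
    ({"pandas", "dataframe"}, 1),
    ({"zoo", "snake"}, 0),
]
vocab = sorted({t for terms, _ in history for t in terms})

# Term-presence matrix and relevance vector built from the history.
T = np.array([[t in terms for t in vocab] for terms, _ in history], dtype=float)
r = np.array([rel for _, rel in history], dtype=float)

# Pearson correlation of each term's presence with relevance serves
# as the (updated) interest weight for that term.
Tc, rc = T - T.mean(axis=0), r - r.mean()
weights = (Tc.T @ rc) / (np.linalg.norm(Tc, axis=0) * np.linalg.norm(rc) + 1e-12)
print(dict(zip(vocab, weights.round(2))))
```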
Class binarizations are effective methods that break a multi-class problem down into several 2-class or binary problems in order to improve weak learners. This paper analyzes the effects these methods have when a Naive Bayes learner is chosen as the base classifier. We consider the known unordered and pairwise class binarizations and propose an alternative approach for a pairwise calculation of a modified Naive Bayes classifier.
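For illustration, scikit-learn's one-vs-rest and one-vs-one wrappers reproduce the unordered and pairwise binarizations around a Naive Bayes base learner; the paper's modified pairwise Naive Bayes calculation itself is not part of scikit-learn, and the dataset here is just the standard iris example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
for name, clf in [
    ("plain NB ", GaussianNB()),
    ("unordered", OneVsRestClassifier(GaussianNB())),  # one-vs-all
    ("pairwise ", OneVsOneClassifier(GaussianNB())),   # one-vs-one
]:
    # Compare the binarization schemes by cross-validated accuracy.
    print(name, cross_val_score(clf, X, y, cv=5).mean().round(3))
```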
Hashing-based indexing is a powerful technology for similarity search in large document collections [Stein 2005]. It is based on the idea of interpreting hash collisions as an indicator of similarity, provided that an appropriately constructed hash function is available. This paper discusses the conditions under which fundamental retrieval tasks can benefit from this new technology. Furthermore, two recent hashing-based indexing approaches are presented, and the improvements they achieve on real retrieval tasks are compared. An analysis of this kind is new; it shows the enormous potential of tailor-made hashing-based indexing methods such as fuzzy-fingerprinting.
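The principle of "hash collisions as a similarity indicator" can be sketched with random-hyperplane hashing, a generic locality-sensitive scheme (not fuzzy-fingerprinting itself); dimensions and data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
dim, n_bits = 100, 8

# Random-hyperplane hashing: vectors with high cosine similarity
# fall on the same side of most random hyperplanes, so they collide.
planes = rng.normal(size=(n_bits, dim))

def lsh_key(v):
    return tuple((planes @ v > 0).astype(int))

doc_a = rng.normal(size=dim)
doc_b = doc_a + 0.1 * rng.normal(size=dim)   # near-duplicate of doc_a
doc_c = rng.normal(size=dim)                 # unrelated document

# A full-key collision indicates similarity without pairwise comparison.
print(lsh_key(doc_a) == lsh_key(doc_b))   # likely True
print(lsh_key(doc_a) == lsh_key(doc_c))   # almost surely False
```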
Authors of menu optimization methods often use navigation time prediction models without validating whether the model is adequate for the site and its users. We review the assumptions underlying navigation time prediction models and present a method to validate these assumptions offline. Experiments on four web sites show how accurately the various model features describe the behavior of the users. These results can be used to select the best model for a new optimization task. In addition, we find that the existing optimization methods all use suboptimal models. This indicates that our results can contribute significantly to more effective menu optimization.
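A minimal sketch of such an offline validation: compare the times predicted by candidate navigation models against observed navigation times taken from the logs. The model forms, coefficients, and data below are illustrative assumptions, not the paper's:

```python
import numpy as np

# Hypothetical log data: menu depth, links scanned, observed time (s).
depth    = np.array([1, 2, 3, 2, 4, 3])
breadth  = np.array([8, 5, 12, 6, 9, 4])
observed = np.array([2.1, 3.0, 5.2, 2.8, 6.4, 3.9])

# Two candidate navigation-time models (coefficients are invented).
linear_depth = 1.5 * depth + 0.5                            # time ~ depth
hick_hyman   = depth * (0.2 + 0.6 * np.log2(breadth + 1))   # Hick-Hyman style

# Offline validation: which model's predictions fit the logged behavior?
for name, pred in [("linear-depth", linear_depth), ("hick-hyman", hick_hyman)]:
    print(name, "MAE:", np.abs(pred - observed).mean().round(2))
```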
Cross Language Information Retrieval (CLIR) enables people to search for information written in languages other than their query language. Information can be retrieved either from a single cross-lingual collection or from a variety of distributed cross-lingual sources. This paper presents initial results exploring the effectiveness of distributed CLIR using query-based sampling techniques, which to the best of our knowledge has not been investigated before. In distributed retrieval with multiple databases, query-based sampling provides a simple and effective way of acquiring accurate resource descriptions, which helps in selecting which databases to search. Observations from our initial experiments show that the negative impact of query-based sampling on cross-language search may not be as great as it is on monolingual retrieval.
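A toy sketch of query-based sampling, assuming the remote database exposes only a search interface; the collection, search function, and seed term below are invented:

```python
import random
from collections import Counter

random.seed(7)

# Hypothetical remote collection that we can only query, not crawl.
collection = [
    "cross language retrieval with query translation",
    "monolingual retrieval baselines and evaluation",
    "distributed search over many text databases",
    "resource descriptions from sampled documents",
    "query based sampling for database selection",
]

def search(term, k=2):
    # Stand-in for the remote database's search interface.
    return [d for d in collection if term in d.split()][:k]

# Query-based sampling: start from a seed term, then draw new query
# terms from the documents sampled so far; the accumulated vocabulary
# serves as the resource description used for database selection.
description, seen = Counter(), set()
term = "retrieval"
for _ in range(10):
    for doc in search(term):
        if doc not in seen:
            seen.add(doc)
            description.update(doc.split())
    term = random.choice(list(description))
print(description.most_common(5))
```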
This paper presents Prospector, a front-end to the Google search engine which, using individual and group models of users’ interests, re-ranks search results to better suit the user’s inferred needs. The paper outlines the motivation behind the development of the system, describes its adaptive components, and discusses the lessons learned thus far, as well as the work planned for the future.
Pattern recognition of gene expression data on biochemical networks with simple wavelet transforms (2006)
Biological networks show a rather complex, scale-free topology consisting of a few highly connected nodes (hubs) and many nodes of low connectivity (peripheral and concatenating nodes). Furthermore, they contain regions of rather high connectivity, as e.g. in metabolic pathways. Analysing the data for an entire network consisting of several thousands of nodes and edges is not manageable. This inspired us to divide the network into functionally coherent sub-graphs and to analyse the data corresponding to each of these sub-graphs individually. We separated the network in a two-fold way: 1. clustering approach: sub-graphs were defined as more highly connected regions found by a clustering procedure on the network; 2. connected edge approach: paths of concatenated edges connecting striking combinations of the data were selected and taken as sub-graphs for further analysis. As experimental data we used gene expression data of the bacterium Escherichia coli, which was exposed to two distinct environments: oxygen rich and oxygen deprived. We mapped the data onto the corresponding biochemical network and extracted discriminating features using Haar wavelet transforms for both strategies. In comparison to standard methods, our approaches yielded a much more consistent image of the changed regulation in the cells. In general, our concept may be transferred to network analyses on any interaction data, whenever data for two comparable states of the associated nodes are available.
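A minimal sketch of the feature-extraction step, assuming expression values have been mapped onto a path of concatenated edges (the values below are invented): one level of the Haar wavelet transform separates the overall trend along the path from the local changes that serve as discriminating features.

```python
import numpy as np

def haar(signal):
    """One level of the orthonormal Haar wavelet transform."""
    s = np.asarray(signal, dtype=float)
    pairs = s.reshape(-1, 2)                      # length must be even
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
    return approx, detail

# Hypothetical expression ratios (oxygen rich vs. deprived) along a
# path of concatenated edges in the metabolic network.
path_values = [1.0, 1.1, 0.9, 1.0, 2.1, 2.0, 0.4, 0.5]
approx, detail = haar(path_values)
print("approximation:", approx.round(2))   # smoothed trend along the path
print("detail:       ", detail.round(2))   # local changes (features)
```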
Recently, research projects such as PADLR and SWAP have developed tools like Edutella and Bibster, which are targeted at establishing peer-to-peer knowledge management systems. In such a system, it is necessary to obtain brief semantic descriptions of peers so that routing algorithms or matchmaking processes can decide which communities peers should belong to, or to which peers a given query should be forwarded. This paper provides a graph clustering technique on knowledge bases for that purpose. Using this clustering, we can show that our strategy requires up to 58% fewer queries than the baselines to yield full recall in a bibliographic peer-to-peer scenario.
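As an illustration of the general idea (not the paper's specific algorithm), a modularity-based graph clustering of a small invented topic graph yields clusters that could serve as a peer's brief semantic description; this sketch uses networkx:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical peer knowledge base: topics are linked when they
# co-occur in the peer's bibliographic records.
G = nx.Graph([
    ("p2p", "routing"), ("routing", "networks"), ("p2p", "networks"),
    ("ontology", "rdf"), ("rdf", "semantic web"), ("ontology", "semantic web"),
    ("networks", "rdf"),   # weak bridge between the two communities
])

# Graph clustering: each community is a coherent topic cluster that
# can summarize what the peer knows about.
for community in greedy_modularity_communities(G):
    print(sorted(community))
```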
In this essay, the planned introduction of search engine technologies in the Fachportal Pädagogik is taken as an occasion to examine the new quality-management requirements associated with it. At the center are the questions of which factors shape the retrieval situations and which conclusions follow from them for an evaluation design. A socio-technical view of the information retrieval (IR) system serves as the analytical instrument.
We present a network model for context-based retrieval that allows domain knowledge to be integrated into document retrieval. Based on the premise that the results provided by a network model employing spreading activation are equivalent to the results of a vector space model, we create a network representation of a document collection for retrieval. We extend this well-explored approach by blending it with techniques from knowledge representation. This leaves us with a network model for finding similarities in a document collection through content-based as well as knowledge-based similarities.
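A minimal sketch of the premise, assuming a term-document network whose link weights come from the vector space model (the weights below are invented): one step of spreading activation from the query terms reproduces the vector space scores, and further knowledge-based links could add more hops.

```python
import numpy as np

# Hypothetical term-document network: terms connect to the documents
# that contain them; W[t, d] is the link strength (e.g. a tf-idf weight).
terms = ["retrieval", "network", "knowledge"]
docs = ["d1", "d2", "d3"]
W = np.array([
    [0.8, 0.0, 0.3],
    [0.2, 0.9, 0.0],
    [0.0, 0.4, 0.7],
])

# Activate the query's term nodes, then spread activation into the
# document nodes; one spreading step equals the vector space scores.
query = np.array([1.0, 0.0, 1.0])        # query: "retrieval knowledge"
doc_activation = query @ W
print(dict(zip(docs, doc_activation.round(2))))
```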
This paper presents a search system for information on scientists, implemented prototypically for the area of information science using Web Content Mining techniques. The sources used in the implemented approach are online publication services and personal homepages of scientists. The system contains wrappers for querying the publication services and extracting information from their result pages, as well as methods for information extraction from homepages based on heuristics concerning the structure and composition of the pages. Moreover, a specialised technique for finding personal homepages of information scientists was developed.
We propose the implementation of an intelligent information system on free and open source software. This system will consist of a case-based reasoning (CBR) system and several machine learning modules that maintain the knowledge base and train the CBR system, thus enhancing its performance. Our knowledge base will include data on free and open source software provided by the Debian project, the FLOSSmole project, and other public free and open source software directories. We plan to enrich these data by learning additional information such as concepts and different similarities. With this knowledge base, we hope to be able to create an information system capable of answering queries based on precise as well as vague criteria, and of giving intelligent recommendations on software based on the preferences of the user.
The World Wide Web is an important medium for communication, data transaction, and retrieval. Data mining is the process of extracting interesting patterns from a set of data sources. Web mining is the application of data mining techniques to extract useful patterns from web data. Web mining can be divided into three categories: web usage mining, web content mining, and web structure mining. Web usage mining, or web log mining, is the extraction of interesting patterns from web server log entries. Those patterns are used to study user behavior and interests, facilitate support and services introduced to the website user, improve the structure of the website, and facilitate personalization and adaptive websites. This paper explores various research issues in web usage mining and its application in the field of adaptive and personalized websites.
In automated production plants, more and more sensor systems are being deployed to monitor the quality of production and to ensure it on the basis of collected process data. The heterogeneity of the sensors integrated at different points in the process calls for an approach that allows simple integration. The goal of the integration is a quality view prepared for different roles, which also includes feedback for fault deduction. This experience report presents the approach to the syntactic and semantic integration of quality data developed in the BridgeIT project. In particular, the approach makes it easy to connect new sensor systems.
Can crimes be modeled as data mining problems? We try to answer this question in this paper. Crimes are a social nuisance and cost our society dearly in several ways. Any research that can help in solving crimes faster will pay for itself. Here we look at the use of clustering algorithms in a data mining approach to help detect crime patterns and speed up the process of solving crime. We look at k-means clustering with some enhancements to aid the identification of crime patterns. We apply these techniques to real crime data from a sheriff’s office and validate our results. We also use a semi-supervised learning technique for knowledge discovery from the crime records and to help increase the predictive accuracy. In addition, we developed a weighting scheme for attributes to deal with the limitations of various out-of-the-box clustering tools and techniques. This easy-to-implement machine learning framework works with the geo-spatial plot of crime and helps to improve the productivity of detectives and other law enforcement officers. It can also be applied to counter-terrorism for homeland security.
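A hedged sketch of the attribute-weighting idea around off-the-shelf k-means (the data, weights, and cluster count are illustrative assumptions, not the study's setup): scaling an attribute column by w weights its squared contribution to the Euclidean distance by w^2, which is one simple way to work around clustering tools that lack native attribute weights.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical crime records: [longitude, latitude, hour of day].
X = np.column_stack([
    rng.normal(0, 1, 200),      # longitude (standardized)
    rng.normal(0, 1, 200),      # latitude (standardized)
    rng.integers(0, 24, 200),   # hour of day
]).astype(float)

# Attribute weighting via column scaling: de-emphasize time of day,
# stress the geo-spatial attributes when forming crime-pattern clusters.
weights = np.array([1.0, 1.0, 0.3])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X * weights)
print(np.bincount(labels))      # cluster sizes
```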