Refine
Document Type
- Doctoral Thesis (7)
Language
- English (7)
Has Fulltext
- yes (7)
Is part of the Bibliography
- no (7)
Keywords
Institute
- Fachbereich IV (5)
- Informatik (2)
Time series represent the most widely spread type of data, occurring in a myriad of application domains, ranging from physiological sensors up to astronomical light intensities. The classification of time-series is one of the most prominent challenges, which utilizes a recorded set of expert-labeled time-series, in order to automatically predict the label of future series without the need of an expert.The patterns of time-series are often shifted in time, have different scales, contain arbitrarily repeating patterns and exhibit local distortions/noise. In other cases, the differences among classes are attributed to small local segments, rather than the global structure. For those reasons, values corresponding to a particular time-stamp have different semantics on different time-series. We call this phenomena as intra-class variations. The lion's share of this thesis is composed of presenting new methods that can accurately classify time-series instances, by handling variations.
The answer towards resolving the bottlenecks of intra-class variations relies on not using the time-series values as direct features. Instead, the approach of this thesis is to extract a set of features that, on one hand, represent all the variations of the data and, on the other hand, can boost classification accuracy. In other words, this thesis proposes a list of methods that addresses diverse aspects of intra-class variations.
The first proposed approach is to generate new training instances, by transforming the support vectors of an SVM. The second approach decomposes time-series through a segment-wise convolutional factorization. The strategy involves learning a set of patterns and weights, whose product can approximate each sub-sequence of the time series. However, the main contribution of the thesis is the third approach, called shapelet learning, which utilizes the training labels during the learning process, i.e. the process is supervised. Since the features are learned on the training labels, there is a higher tendency of performing strongly in terms of predicting the testing labels. In addition, we present a fast alternative method for shapelet discovery. Our strategy is to prune segment candidates using a two step approach. First of all, we prune candidates based on their similarity towards previously considered candidates. Secondly, non-similar (hence diverse) candidates are selected only if the features they produce improve the classification results. The last two chapters of the thesis describes two methods that extract features from datasets having special characteristics. More concretely, we propose a classification method suited for series having missing values, as well as a method that extract features from time series having repetitive patterns.
In this thesis we design and test Learning Analytics algorithms for personalized tasks' sequencing that suggests the next task to a student according to his/her specific needs. Our solution is based on a sequencing policy derived from the Vygotsky's Zone of Proximal Development (ZPD), which denes those tasks that are neither too easy not too dicult for the student. The sequencer, called Vygotsky Policy Sequencer (VPS), can identify tasks in the ZPD thanks to the information it receives from performance prediction algorithms able to estimate
the knowledge of the student.
Under this context we describe hereafter the thesis contributions.
(1) A feasibility evaluation of domain independent Matrix Factorization applied in ITS for Performance Prediction.
(2) An adaption and the related evaluation of a domain independent update for online learning Matrix Factorization in ITS.
(3) A novel Matrix Factorization update method based on Kalman Filters approach. Two different updating functions are used: (a) a simple one considering the task just seen, and (b) one able to derive the skills' deficiency of the student.
(4) A new method for offline testing of machine learning controlled sequencers by modeling simulated environment composed by a simulated students and tasks with continuous knowledge and score
representation and different diffculty levels.
(5) The design of a minimal invasive API for the lightweight integration of machine learning components in larger systems to minimize the risk of integration and the cost of expertise transfer.
Profiting from all these contributions, the VPS was integrated in a commercial system and evaluated with 100 children over a month.
The VPS showed comparable learning gains and perceived experience results with those of the ITS sequencer. Finally, thanks to its better modeling abilities, the students finish faster the assigned tasks.
Finding an available parking spot in city centers can be a cumbersome task for individual drivers and also negatively affects general traffic flow and CO2 emissions.
In the context of smart cities and the internet of things this problem can be mitigated by using available data to monitor and predict parking occupancy in order to guide users to an available parking location near their destination.
With this goal in mind there arise multiple challenges of which we introduce selected ones to propose novel solutions based on machine learning.
The focus of this work is to enable the usage of readily available and inexpensive data sources like parking meter transactions, opposed to expensive technology like in-ground sensors or cameras where the costs prevent a widespread coverage. Our proposed data sources do not directly monitor the actual parking availability but still provide enough signal for our algorithms to infer the real parking situation with high accuracy.
As part of this work we developed a parking availability prediction system based on parking meter transactions that was deployed to 33 german cities.
A main contribution of our work is the proposal of a novel way to generate labels based on the parking transactions and to use semi-supervised-, more specifically positive-unlabeled learning, to leverage the sparse signal in order to require as little data as possible.
Additionally, we utilize and design novel methodologies in the area of transfer learning to learn simultaneously from different cities which leads to the previously seldom explored setting of combining transfer learning with positive-unlabeled learning. We therefore introduce a novel algorithm to tackle this problem type.
We hope that our work enables the deployment of smart parking systems at lower costs and therefore leads towards the goal of smart parking guidance in smart cities.
Recent decades have seen exponential growth in data acquisition attributed to advancements in edge device technology. Factory controllers, smart home appliances, mobile devices, medical equipment, and automotive sensors are a few examples of edge devices capable of collecting data. Traditionally, these devices are limited to data collection and transfer functionalities, whereas decision-making capabilities were missing. However, with the advancement in microcontroller and processor technologies, edge devices can perform complex tasks. As a result, it provides avenues for pushing training machine learning models to the edge devices, also known as learning-at-the-edge. Furthermore, these devices operate in a distributed environment that is constrained by high latency, slow connectivity, privacy, and sometimes time-critical applications. The traditional distributed machine learning methods are designed to operate in a centralized manner, assuming data is stored on cloud storage. The operating environment of edge devices is impractical for transferring data to cloud storage, rendering centralized approaches impractical for training machine learning models on edge devices.
Decentralized Machine Learning techniques are designed to enable learning-at-the-edge without requiring data to leave the edge device. The main principle in decentralized learning is to build consensus on a global model among distributed devices while keeping the communication requirements as low as possible. The consensus-building process requires averaging local models to reach a global model agreed upon by all workers. The exact averaging schemes are efficient in quickly reaching global consensus but are communication inefficient. Decentralized approaches employ in-exact averaging schemes that generally reduce communication by communicating in the immediate neighborhood. However, in-exact averaging introduces variance in each worker's local values, requiring extra iterations to reach a global solution.
This thesis addresses the problem of learning-at-the-edge devices, which is generally referred to as decentralized machine learning or Edge Machine Learning. More specifically, we will focus on the Decentralized Parallel Stochastic Gradient Descent (DPSGD) learning algorithm, which can be formulated as a consensus-building process among distributed workers or fast linear iteration for decentralized model averaging. The consensus-building process in decentralized learning depends on the efficacy of in-exact averaging schemes, which have two main factors, i.e., convergence time and communication. Therefore, a good solution should keep communication as low as possible without sacrificing convergence time. An in-exact averaging solution consists of a connectivity structure (topology) between workers and weightage for each link. We formulate an optimization problem with the objective of finding an in-exact averaging solution that can achieve fast consensus (convergence time) among distributed workers keeping the communication cost low. Since direct optimization of the objective function is infeasible, a local search algorithm guided by the objective function is proposed. Extensive empirical evaluations on image classification tasks show that the in-exact averaging solutions constructed through the proposed method outperform state-of-the-art solutions.
Next, we investigate the problem of learning in a decentralized network of edge devices, where a subset of devices are close to each other in that subset but further apart from other devices not in the subset. Closeness specifically refers to geographical proximity or fast communication links.
We proposed a hierarchical two-layer sparse communication topology that localizes dense communication among a subgroup of workers and builds consensus through a sparse inter-subgroup communication scheme. We also provide empirical evidence of the proposed solution scaling better on Machine Learning tasks than competing methods.
Finally, we address scalability issues of a pairwise ranking algorithm that forms an important class of problem in online recommender systems. The existing solutions based on a parallel stochastic gradient descent algorithm define a static model parameter partitioning scheme, creating an imbalance of work distribution among distributed workers. We propose a dynamic block partitioning and exchange strategy for the model parameters resulting in work balance among distributed workers. Empirical evidence on publicly available benchmark datasets indicates that the proposed method scales better than the static block-based methods and outperforms competing state-of-the-art methods.
Supervised learning, the standard paradigm in machine learning, only works well if a sufficiently large, diverse, and cleanly-annotated dataset is available. Unfortunately, this is often not the case. In fact, the lack of labeled data is an omnipresent issue in machine learning. The problem is particularly prevalent in computer vision, where unlabeled images or videos can often be acquired at a low cost, whereas labeling them is time-consuming and expensive. To address the issue, this thesis focuses on developing new methods that aim at reducing annotation costs in computer vision by leveraging unlabeled and partially labeled data.
In the first part, we provide an overview of previous research directions and discuss their strengths and weaknesses. Thereby, we identify particularly promising research areas. The subsequent chapters which form the central part of this thesis aim at developing algorithmic improvements in these especially attractive fields. Among them is self-supervised learning, which aims at learning transferable representations given a large number of unlabeled images. We find that existing self supervised methods are optimized for image classification tasks, only compute global per-image feature vectors, and are designed for object-centric datasets like ImageNet. To address these issues, we propose a method that is particularly suited for object detection downstream tasks and works well if multiple objects are present per image like in video data for autonomous driving. Another core downside of self-supervised learning algorithms is that they depend on very large batch sizes with batch norm statistics synchronized across GPUs and also require many epochs of training until convergence. We find that stabilizing the self-supervised training target substantially speeds up convergence and allows for training with much smaller batch sizes. Our method matches ImageNet weights after 25 epochs of training with a batch size of only 32.
Finally, we investigate supervised pretraining. We find that state-of-the-art self-supervised methods match ImageNet weights only in classification or detection but not in both. In addition, we show that more sophisticated supervised training strategies significantly improve upon ImageNet weights.
The second part of the thesis deals with partially labeled data for object detection. We propose to label only large, easy-to-spot objects given a limited budget. We argue that these contain more pixels and therefore usually more information about the underlying object class than small ones. At the same time, they are easier to spot and hence cheaper to label. Because conventional supervised learning algorithms do not work well given this annotation protocol, we develop our own method with does, by combining pseudo-labels, output consistency across scales, and an anchor scale-dependent ignore strategy. Furthermore, many object detection datasets such as MS COCO and CityPersons include group annotations, i.e., bounding boxes that contain multiple objects of a single class. We find that pseudo-labeling instances within a group box is superior to the commonly used training strategies.
In the third part of the thesis, we cover semi-supervised object detection where a subset of the images is fully labeled whereas the remaining ones are unlabeled. We show that existing methods that are almost exclusively developed for Faster R-CNN work much less well if applied to architectures that are sensitive to missing annotations. In the prefinal chapter, we investigate the interaction between data and computer vision algorithms. This is in contrast to the vast majority of research which considers the data to be fixed. We provide computer vision practitioners and researchers with guidelines about what to do in typical situations.
In the final part of the thesis, we discuss the overall findings and investigate if research should put greater weight on acquiring and labeling data. Finally, we discuss options of mimicking human learning with machines, which might eventually result in human-level intelligence. After all, humans are living proof that this kind of learning works, if done properly.
Automated machine learning represents the next generation of machine learning that involves efficiently identifying model hyperparameters and configurations that ensure decent generalization behavior beyond the training data. With a proper setup in place, considerable resources can be saved by practitioners and academics. Beyond naive approaches, e.g. random sampling or grid search, sequential model-based optimization has been at the forefront of solutions that attempt to optimize the black-box function representing the generalization surface, for example, the validation loss.
With the abundance of data and algorithm evaluations being available, transfer learning techniques and meta-knowledge can be utilized to further expedite hyperparameter optimization. In this thesis, we cover 4 ways in which meta-knowledge can be leveraged to improve hyperparameter optimization.
In the first part, we present two large-scale meta-datasets, i.e. a collection of hyperparameters and their respective response for a machine learning algorithm trained on several datasets. We describe in detail the implementation details and descriptive analytics that highlight the heterogeneity of the resulting response surface. The two meta-datasets are used as benchmark datasets upon which the subsequent methods developed in this thesis have been empirically evaluated.
In the second part, we introduce the first work that automates the process of learning meta-features, i.e. dataset characteristics, directly from the dataset distribution. Previously, meta-features required expert-domain knowledge and a lot of engineering to properly represent datasets as entities for a meta-learning task. Following this work, we integrate the meta-feature extractor as a module in the machine learning algorithm, and optimize it jointly for the meta-learning task, further promoting the benefits of differentiable meta-features. Finally, we carry over the concept of meta-feature learning in the absence of the underlying dataset. Specifically, we design a deep Gaussian kernel that allows for a richer representation of the attributes via non-linear transformation. The resulting surrogate is conditioned on landmark meta-features extracted from the history of task-specific evaluations.
In the third part, we formulate the problem of hyperparameter optimization as a Markov Decision Process. As such, we introduce the first paper on hyperparameter optimization in a reinforcement learning framework and define a novel transferable policy that acts as an acquisition function for hyperparameter optimization. Furthermore, we study the impact of planning in hyperparameter optimization through a novel non-myopic acquisition function.
Finally, we present hyperparameter optimization in a zero-shot setting. In contrast to sequential model-based optimization, the fastest way for HPO is by learning a zero-shot approach, that identifies the best configuration with a single trial. Our Zap-HPO approach outperforms the state-of-the-art in algorithm selection for deep learning pipelines that comprise a machine learning algorithm and its associated hyperparameters, given simple meta-features.
Recommender systems have been deployed in many diverse settings, and they aim to provide a personalized ranked list of items to users that they are likely to interact with. In order to provide an accurate list of items, models need to capture various aspects of the users' profiles, behaviors, and items' dynamics. Depending on the recommendation settings, these aspects can be mined from the different auxiliary information sources that might be readily available in these settings as side information. The more aspects being covered, the more accurate the learned user and item representations will be, improving prediction performance and overcoming various challenges such as sparse interaction data.
These auxiliary information sources might contain static attributes related to the users' and items' profiles or contain historical multi-relational implicit interactions between users and items, users and users, and items and items such as clicks, views, bought-together, and friendships. These interactions can be exploited to capture complex implicit relations that are usually not visible if the model only focuses on one user-item relationship.
Besides attributes and interaction data, auxiliary information might also contain contextual information that accompanies the interaction data, such as timestamps and locations. Incorporating such contextual information allows the models to comprehend the dynamics of users and items and learn the influence of time and environment.
In this thesis, we present four ways in which auxiliary information can be leveraged to improve the prediction performance of recommender systems and allow them to overcome many challenges.
Firstly we introduce an attribute-ware co-embedding model that can leverage user and item attributes along with a set of graph-based features for rating prediction. In particular, the model treats the user-item relation as a bipartite graph and constructs generic user and item attributes via the Laplacian of the co-occurrence graph. We also demonstrate that our attribute-ware model outperforms existing state-of-the-art attribute-aware recommender systems.
Next, we extend the model to handle different multi-relational interactions to overcome the challenges of having few and sparse interaction data between users and items. First, we extend the model by adding the capability to capture multi-relational interactions between the same entity types, particularly between users and users and between items and items. This is done by simultaneously scoring the different relations using a weighted joint loss. Later, we extend the model further by including the ability to accommodate different user and item interactions simultaneously by having an independent scoring function for each interaction type. The later extension allows the model to be employed in scenarios where the main relation between users and items is extremely sparse such as in auction settings which pose a significant challenge to traditional and state-of-the-art models.
Additionally, we introduce a sequential context and attribute-aware model that captures users' and items' dynamics through their sequential interaction patterns and their timestamps. The model can also capture various aspects of the users' and items' profiles through their static attributes and content information.
Finally, in the end, we also present a framework for optimizing ranking metrics directly, such as the Normalized Discounted Cumulative Gain (NDCG) using surrogate losses as an additional way of improving the models' performance.