Supervised learning, the standard paradigm in machine learning, works well only if a sufficiently large, diverse, and cleanly annotated dataset is available. Unfortunately, this is often not the case: the lack of labeled data is an omnipresent issue in machine learning. The problem is particularly prevalent in computer vision, where unlabeled images or videos can often be acquired at low cost, whereas labeling them is time-consuming and expensive. To address this issue, this thesis develops new methods that reduce annotation costs in computer vision by leveraging unlabeled and partially labeled data.
In the first part, we provide an overview of previous research directions, discuss their strengths and weaknesses, and thereby identify particularly promising research areas. The subsequent chapters, which form the central part of this thesis, develop algorithmic improvements in these fields. Among them is self-supervised learning, which aims at learning transferable representations from a large number of unlabeled images. We find that existing self-supervised methods are optimized for image classification tasks, only compute global per-image feature vectors, and are designed for object-centric datasets like ImageNet. To address these issues, we propose a method that is particularly suited for object detection downstream tasks and works well when multiple objects are present per image, as in video data for autonomous driving. Another core downside of self-supervised learning algorithms is that they depend on very large batch sizes, with batch norm statistics synchronized across GPUs, and require many epochs of training until convergence. We find that stabilizing the self-supervised training target substantially speeds up convergence and allows training with much smaller batch sizes; our method matches ImageNet weights after 25 epochs of training with a batch size of only 32.
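To make the idea of a stabilized training target concrete, the sketch below shows one standard way to stabilize a self-supervised target: an exponential-moving-average (EMA) teacher in the spirit of BYOL-style methods. The class name, momentum value, and loss form are illustrative assumptions, not the thesis's exact formulation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMATarget(nn.Module):
    """Online encoder trained by gradient descent; the target encoder is
    a slowly moving exponential average of it, which stabilizes the
    training signal and permits small batch sizes."""

    def __init__(self, encoder: nn.Module, momentum: float = 0.99):
        super().__init__()
        self.online = encoder
        self.target = copy.deepcopy(encoder)
        for p in self.target.parameters():
            p.requires_grad = False
        self.momentum = momentum

    @torch.no_grad()
    def update_target(self):
        # EMA update: target <- m * target + (1 - m) * online
        for po, pt in zip(self.online.parameters(),
                          self.target.parameters()):
            pt.mul_(self.momentum).add_(po.detach(), alpha=1 - self.momentum)

    def loss(self, view1: torch.Tensor, view2: torch.Tensor) -> torch.Tensor:
        # Negative cosine similarity between the online representation of
        # one augmented view and the stop-gradient target representation
        # of another view of the same image.
        z1 = F.normalize(self.online(view1), dim=-1)
        with torch.no_grad():
            z2 = F.normalize(self.target(view2), dim=-1)
        return -(z1 * z2).sum(dim=-1).mean()
```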
Finally, we investigate supervised pretraining. We find that state-of-the-art self-supervised methods match ImageNet weights in either classification or detection, but not in both. In addition, we show that more sophisticated supervised training strategies significantly improve upon standard ImageNet weights.
The second part of the thesis deals with partially labeled data for object detection. Given a limited budget, we propose to label only large, easy-to-spot objects. We argue that these contain more pixels, and therefore usually more information about the underlying object class, than small ones; at the same time, they are easier to spot and hence cheaper to label. Because conventional supervised learning algorithms do not work well under this annotation protocol, we develop a method which does, by combining pseudo-labels, output consistency across scales, and an anchor scale-dependent ignore strategy. Furthermore, many object detection datasets such as MS COCO and CityPersons include group annotations, i.e., bounding boxes that contain multiple objects of a single class. We find that pseudo-labeling instances within a group box is superior to the commonly used training strategies.
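As an illustration of the group-box idea, the following sketch turns a group annotation into per-instance pseudo-labels by keeping confident detector outputs that lie almost entirely inside the group box. The thresholds, helper names, and the containment criterion are assumptions for illustration, not the thesis's exact rule.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def inside_fraction(box: Box, group: Box) -> float:
    """Fraction of `box` area contained in `group`."""
    x1 = max(box[0], group[0]); y1 = max(box[1], group[1])
    x2 = min(box[2], group[2]); y2 = min(box[3], group[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = max(1e-6, (box[2] - box[0]) * (box[3] - box[1]))
    return inter / area

def pseudo_label_group(
    group_box: Box,
    detections: List[Tuple[Box, float]],  # (box, confidence)
    score_thr: float = 0.7,
    containment_thr: float = 0.9,
) -> List[Box]:
    """Keep confident detections that are mostly contained in the
    group box and treat them as per-instance pseudo-labels."""
    return [
        box for box, score in detections
        if score >= score_thr
        and inside_fraction(box, group_box) >= containment_thr
    ]
```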
In the third part of the thesis, we cover semi-supervised object detection, where a subset of the images is fully labeled and the remaining ones are unlabeled. We show that existing methods, which are almost exclusively developed for Faster R-CNN, perform considerably worse when applied to architectures that are sensitive to missing annotations. In the penultimate chapter, we investigate the interaction between data and computer vision algorithms, in contrast to the vast majority of research, which considers the data to be fixed. We provide computer vision practitioners and researchers with guidelines on what to do in typical situations.
In the final part of the thesis, we discuss the overall findings and investigate whether research should put greater weight on acquiring and labeling data. Finally, we discuss options for mimicking human learning with machines, which might eventually result in human-level intelligence. After all, humans are living proof that this kind of learning works, if done properly.
Recommender systems have been deployed in many diverse settings, and they aim to provide users with a personalized ranked list of items they are likely to interact with. To provide an accurate list, models need to capture various aspects of the users' profiles and behaviors and of the items' dynamics. Depending on the recommendation setting, these aspects can be mined from the auxiliary information sources that are often readily available as side information. The more aspects covered, the more accurate the learned user and item representations will be, improving prediction performance and helping to overcome challenges such as sparse interaction data.
These auxiliary information sources might contain static attributes related to the users' and items' profiles, or historical multi-relational implicit interactions between users and items, users and users, and items and items, such as clicks, views, bought-together relations, and friendships. These interactions can be exploited to capture complex implicit relations that usually remain invisible to a model that focuses on a single user-item relation.
Besides attributes and interaction data, auxiliary information might also contain contextual information that accompanies the interaction data, such as timestamps and locations. Incorporating such contextual information allows the models to comprehend the dynamics of users and items and learn the influence of time and environment.
In this thesis, we present four ways in which auxiliary information can be leveraged to improve the prediction performance of recommender systems and allow them to overcome many challenges.
Firstly, we introduce an attribute-aware co-embedding model that can leverage user and item attributes along with a set of graph-based features for rating prediction. In particular, the model treats the user-item relation as a bipartite graph and constructs generic user and item attributes via the Laplacian of the co-occurrence graph. We also demonstrate that our model outperforms existing state-of-the-art attribute-aware recommender systems.
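To make the Laplacian-based feature construction concrete, here is a minimal sketch, assuming a binary user-item interaction matrix and using the leading eigenvectors of the normalized Laplacian of the item-item co-occurrence graph as generic item attributes; the thesis's exact construction may differ.

```python
import numpy as np

def laplacian_features(R: np.ndarray, k: int = 8) -> np.ndarray:
    """R: (num_users, num_items) binary interaction matrix.
    Returns a k-dimensional spectral feature vector per item."""
    C = (R.T @ R).astype(float)      # item-item co-occurrence counts
    np.fill_diagonal(C, 0.0)
    d = C.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} C D^{-1/2}
    L = np.eye(C.shape[0]) - d_inv_sqrt[:, None] * C * d_inv_sqrt[None, :]
    # Eigenvectors of the smallest nonzero eigenvalues act as smooth
    # coordinates on the graph (a spectral embedding).
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:k + 1]
```

The same recipe applied to the user-user co-occurrence graph (R @ R.T) would yield generic user attributes.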
Next, we extend the model to handle different multi-relational interactions, in order to overcome the challenge of few and sparse interactions between users and items. First, we add the capability to capture multi-relational interactions between entities of the same type, particularly between users and users and between items and items, by simultaneously scoring the different relations using a weighted joint loss. Later, we extend the model further to accommodate different user and item interactions simultaneously, with an independent scoring function for each interaction type. The latter extension allows the model to be employed in scenarios where the main relation between users and items is extremely sparse, such as in auction settings, which pose a significant challenge to traditional and state-of-the-art models.
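The weighted joint loss can be sketched as follows; the relation names, weights, and the use of a per-relation binary cross-entropy objective are illustrative assumptions rather than the thesis's exact loss.

```python
import torch
import torch.nn as nn

class MultiRelationLoss(nn.Module):
    """Weighted joint loss over several relation types, e.g.
    user-item ("ui"), user-user ("uu"), and item-item ("ii")."""

    def __init__(self, weights: dict):
        super().__init__()
        self.weights = weights          # e.g. {"ui": 1.0, "uu": 0.3, "ii": 0.3}
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, scores: dict, targets: dict) -> torch.Tensor:
        # Each relation has its own scoring head; the joint loss is a
        # weighted sum, so a sparse main relation can borrow training
        # signal from denser auxiliary relations.
        total = 0.0
        for rel, w in self.weights.items():
            total = total + w * self.bce(scores[rel], targets[rel])
        return total
```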
Additionally, we introduce a sequential context and attribute-aware model that captures users' and items' dynamics through their sequential interaction patterns and their timestamps. The model can also capture various aspects of the users' and items' profiles through their static attributes and content information.
Finally, we present a framework for directly optimizing ranking metrics such as the Normalized Discounted Cumulative Gain (NDCG) using surrogate losses, as an additional way of improving the models' performance.
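As an illustration of the surrogate-loss idea, the sketch below smooths the rank operator with a sigmoid so that an NDCG-like objective becomes differentiable (in the spirit of ApproxNDCG); the temperature and exact form are assumptions, not necessarily the thesis's framework.

```python
import torch

def approx_ndcg_loss(scores: torch.Tensor, relevance: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    """scores, relevance: (batch, num_items). Returns 1 - smoothed NDCG."""
    # Smooth rank of item i: 0.5 + sum_j sigmoid((s_j - s_i) / t),
    # which approaches 1 + |{j != i : s_j > s_i}| as t -> 0.
    diff = scores.unsqueeze(2) - scores.unsqueeze(1)  # diff[b, j, i] = s_j - s_i
    soft_rank = 0.5 + torch.sigmoid(diff / temperature).sum(dim=1)
    gains = 2.0 ** relevance - 1.0
    dcg = (gains / torch.log2(1.0 + soft_rank)).sum(dim=-1)
    # Ideal DCG: gains sorted in descending order at positions 1..N.
    ideal_gains, _ = gains.sort(dim=-1, descending=True)
    positions = torch.arange(1, scores.size(-1) + 1,
                             device=scores.device, dtype=scores.dtype)
    idcg = (ideal_gains / torch.log2(1.0 + positions)).sum(dim=-1)
    return (1.0 - dcg / idcg.clamp(min=1e-8)).mean()
```

Minimizing this loss pushes high-relevance items toward small smooth ranks, which directly increases the (smoothed) NDCG rather than a pointwise proxy.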