Open research topics
for bachelor and master students

Unfortunately, due to other commitments, I cannot accept new students at the moment. Thank you for your understanding.


Dear Students,
Welcome! I offer and supervise a wide range of student research topics in the fields of multimedia, NLP, and general deep learning. Some of these topics belong to our current research projects, some deepen previous work, and others are of general interest to me. I would also like to encourage you to approach me with your own idea(s) in these fields. Both more theoretical and more empirical studies are available. Typically, a project consists of at least a literature review to familiarise yourself with the topic, data preparation, and experimental work in which you evaluate baseline models and develop an individual concept. More information can be found in my supervision guidelines.

If you have any questions about the topics, please feel free to contact me.

Below is a list of thesis topics:

Preferred: I welcome suggestions for topics that transfer the latest methods (e.g. new neural network architectures and mechanisms) from the top-tier deep learning conferences (CVPR, NeurIPS, ICCV, AAAI, ICML, ICLR) to emotion recognition. To get started, simply browse the programme of the most recent edition of one of these conferences, and if you spot a paper that interests you, email me with your idea on how to transfer it. Most papers provide code, which can be very helpful for the transfer.

Also check out the seminar topics below for inspiration; some of them can also be framed as thesis topics.

Banner and Advertisement Detection and Localisation in YouTube Videos Utilising Pseudo-Supervised Deep Learning

Description

The use of large in-the-wild datasets is beneficial for research and industry. In-the-wild data, however, have higher granularity and noise than laboratory data. To simplify their joint use, noise and particularly disturbing training influences have to be detected, extracted, and removed automatically. YouTube represents a very good data source due to its public availability and extensive content. Its videos, however, often include banners highlighting additional information in textual form. These video elements are disturbing training influences that can confuse feature extraction frameworks trained with deep learning models. Removing these banners by hand would mean extra effort for the creators, reducing the chance of receiving their permission to use the videos for research purposes.

The aim of this study is to automatically detect and localise distracting elements in videos utilising state-of-the-art (SOTA) deep learning algorithms. For this purpose, a label generator has to be developed that projects realistic boxes and text into the video at random positions and in different sizes. These elements are used as pseudo labels in the subsequent training process: the developed neural network should learn to predict these elements and their positions in a video sequence (see Pixel CNNs).
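As a starting point, here is a minimal sketch of such a pseudo-label generator: it overlays a synthetic banner (a filled box with placeholder text) at a random position on a single frame and returns the bounding box as a pseudo label. All size ranges, colours, and the banner text are illustrative assumptions.

```python
import random

from PIL import Image, ImageDraw

def add_synthetic_banner(frame: Image.Image):
    """Overlay a random banner on `frame` and return (frame, bbox pseudo label)."""
    w, h = frame.size
    # Sample banner size and position; these ranges are arbitrary assumptions.
    bw = random.randint(w // 5, w // 2)
    bh = random.randint(h // 12, h // 6)
    x0 = random.randint(0, w - bw)
    y0 = random.randint(0, h - bh)
    draw = ImageDraw.Draw(frame)
    # A crude banner: filled box with a border and placeholder text.
    draw.rectangle([x0, y0, x0 + bw, y0 + bh], fill=(255, 255, 255), outline=(0, 0, 0))
    draw.text((x0 + 5, y0 + 5), "SUBSCRIBE NOW", fill=(200, 0, 0))
    # Pseudo label in (x_min, y_min, x_max, y_max) format for detector training.
    return frame, (x0, y0, x0 + bw, y0 + bh)

# Usage (hypothetical file name):
# frame, bbox = add_synthetic_banner(Image.open("frame.jpg").convert("RGB"))
```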

Task

In this thesis, the student(s) will develop a state-of-the-art data generator and deep learning method for banner detection.

Utilises

Advanced Data Augmentation, Video/Image Segmentation/Masking, R-CNN

Requirements

Practical experience with GANs (a must!), solid knowledge of Deep Learning and Computer Vision, good programming skills

Languages

German or English

Denoising Audio Signals from in-the-wild YouTube Videos Utilising Deep Learning

Description

In recent years, the use of deep learning has increased rapidly in many research areas and in industry, pushing the boundaries of automated data analysis. Large data companies (e.g. Google, Facebook) have huge amounts of data to train stable and versatile models and thus inspire many fields and architectures in deep learning. Academic research, in contrast, is tailored to very specific areas, such as emotion recognition, where models have been trained under laboratory conditions on academic datasets to learn domain-specific, valuable features.

The use of large in-the-wild datasets is beneficial for both sides. On the one hand, from a purely research perspective, they enable specific and, at the same time, stable models. On the other hand, industry can transfer pre-trained models, architectures, and feature extraction frameworks to new applications. In-the-wild data, however, have higher granularity and noise than laboratory data. To facilitate their use in both sectors, noise and particularly deleterious training influences have to be detected, extracted, and removed automatically.

The aim of this study is to adapt one or more deep learning architectures for audio denoising, enhance them for a specific domain, and tune them by identifying appropriate parameters. Recently, WaveNet [1] showed promising performance on a similar task [2] and will be analysed regarding its usability. Audio examples [3] and a first implementation [4] are also available. The dataset used in this project comprises YouTube videos capturing emotional car reviews (EmoCaR). Further data, e.g. for adding natural noise, are available from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND). Typical noise patterns in the original videos are background music or car sounds. A sketch of how noisy training pairs could be created follows the references below.

[1] https://deepmind.com/blog/wavenet-generative-model-raw-audio/

[2] https://arxiv.org/pdf/1706.07162.pdf

[3] http://www.jordipons.me/apps/speech-denoising-wavenet/25.html

[4] https://github.com/drethage/speech-denoising-wavenet
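The following is a minimal sketch, not taken from the referenced implementation [4], of how clean/noisy training pairs could be created by mixing speech with DEMAND noise at a target signal-to-noise ratio. The file names and the SNR value are assumptions, and mono audio is assumed throughout.

```python
import numpy as np
import soundfile as sf

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture reaches the requested SNR, then add it (mono assumed)."""
    noise = np.resize(noise, clean.shape)  # loop/trim the noise to match the clean length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Hypothetical file names for a clean EmoCaR utterance and a DEMAND recording.
clean, sr = sf.read("emocar_clean.wav")
noise, _ = sf.read("demand_car_noise.wav")
sf.write("train_noisy.wav", mix_at_snr(clean, noise, snr_db=5.0), sr)
```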

Task

In this thesis, the student(s) will develop a state-of-the-art deep learning audio denoising technique.

Utilises

Audio, deep neural networks, WaveNet, encoder-decoder, CNN-based architectures

Requirements

Preliminary knowledge in deep learning and audio processing, good programming skills (e.g. Python, C++).

Languages

German or English

Below is a list of seminar topics:

Topics marked with letters are divided into several seminar sub-topics with different focuses. For example, two letters a) and b) mean that two seminar works on the same main topic are available, with either focus a) or focus b). In addition to each empirical work, a purely theoretical seminar work in the form of an in-depth literature review of the latest advances is also available. Depending on the topic, it is possible to fuse all sub-tasks into one thesis topic.

Unsupervised Emotion Diarization (mapping) and Uncertainty Prediction in-the-wild

Description:

A) Mapping: In emotion recognition, most data are annotated following Russell's circumplex model of emotion (valence, arousal), i.e. with continuous values. For humans, emotion classes (e.g. "happy" instead of the corresponding arousal and valence labels) are easier to understand than score values. To date, no model has been established that can dynamically map continuous values to classes. The question arises whether such a mapping can be derived using dynamic approaches such as machine learning.

B) Time-continuous to discrete: An emotion is an intense, short-term feeling that is typically directed at a source/topic. While emotion annotations are usually continuous, an aggregated assessment towards a source/topic is often more informative for humans and, thus, for the practical use of these systems. Continuous emotions can be aggregated on a) segment level, b) sentence level, or c) self-derived shifts (e.g. aspects) in a time-continuous manner.

Task:

A) Empirical term work: Develop algorithms and/or use machine learning to combine valence and arousal annotations into single and combined classes. The class shifts can also be learnt.

B) Models for the prediction of emotions are trained which, instead of the continuous annotation, predict the minimum, maximum, quantile, and mean values on a) sentence level or b) segment level.
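The sketch below illustrates naive baselines for both sub-tasks: a hand-crafted quadrant rule as a stand-in for the learned mapping in A), and sentence-level aggregation of the continuous annotation as prediction targets for B). Column names, thresholds, and class labels are illustrative assumptions.

```python
import pandas as pd

# Hypothetical frame-level annotations: one row per timestamp.
df = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1],
    "valence":     [0.4, 0.5, 0.3, -0.6, -0.2],
    "arousal":     [0.7, 0.6, 0.8, -0.1, -0.3],
})

# A) Hand-crafted quadrant rule; a learned mapping would replace this baseline.
def quadrant(v: float, a: float) -> str:
    if v >= 0:
        return "happy/excited" if a >= 0 else "calm/content"
    return "angry/tense" if a >= 0 else "sad/bored"

df["emotion_class"] = [quadrant(v, a) for v, a in zip(df["valence"], df["arousal"])]

# B) Aggregate the continuous annotation to sentence level as prediction targets.
targets = df.groupby("sentence_id")["arousal"].agg(
    ["min", "max", lambda s: s.quantile(0.9), "mean"]
)
print(df, targets, sep="\n\n")
```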

Utilises

Multimodal, Annotations, Unsupervised, Emotion neural networks

Requirements

Python, preliminary knowledge in data processing and deep learning

Languages

German or English

It is never too dark to be modelled - Investigation of the Effect of (Sun-)Glasses on Emotion Recognition

Description:

While in most lab settings it is rather unusual for participants to wear glasses or sunglasses, this is more often the case for in-the-wild data. In our MuSe-CAR data set we have therefore deliberately collected a number of videos of speakers wearing glasses or sunglasses.

Task:

Empirical term work: a) Investigate the effect of (sun)glasses on the extraction of facial features (Facial Action Units, 2D facial landmarks, 3D facial landmarks, VGGFace): can these feature extractors extract important facial features at all? Which extractor is affected most/least negatively? And how does this influence the prediction result? Experimental setup: mixed, glasses on glasses, no glasses on no glasses, and cross conditions (see the sketch below).

Theoretical seminar work: b) Literature review on the latest advances regarding the influence of glasses on emotion recognition, with a focus on facial features.
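A minimal sketch of how the train/test conditions of the empirical setup in a) could be organised is given below. The metadata file and its wears_glasses column are assumptions about how the MuSe-CAR speaker metadata might be structured.

```python
import pandas as pd

# Hypothetical metadata file with a boolean `wears_glasses` column per clip.
meta = pd.read_csv("muse_car_metadata.csv")
g = meta[meta["wears_glasses"] == 1]
ng = meta[meta["wears_glasses"] == 0]

conditions = {
    "mix":   (meta, meta),  # train and test on all speakers
    "g_g":   (g, g),        # glasses on glasses
    "ng_ng": (ng, ng),      # no glasses on no glasses
    "g_ng":  (g, ng),       # cross: train with glasses, test without
    "ng_g":  (ng, g),       # cross: train without glasses, test with
}
for name, (train, test) in conditions.items():
    print(name, len(train), len(test))  # replace with train_and_evaluate(train, test)
```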

Utilises

Multimodal, Vision, Facial features, Emotion neural networks

Requirements

Python, preliminary knowledge in data processing and deep learning

Languages

German or English

Investigation of Gender-(un)biased Emotion Modelling

Description:

When developing models for emotion recognition, we aim to work with a dataset that is as balanced as possible with regard to the distinctive characteristics of the recorded people. This avoids bias in the modelling, which would otherwise lead to worse results when predicting data with an unknown (gender) distribution. Hence, it is common practice to strive for an even distribution of the genders in the training set. When collecting data in-the-wild, however, this is sometimes not possible for certain domains, e.g. automobile or beauty reviews. For this reason, we want to explore the effects of gender-wise unbalanced training data and develop appropriate counter-strategies.

Task:

Empirical seminar work: a) Suitable models should be trained and compared using different training sets. Typical training sets include one gender (male or female), both genders (unbalanced), or both genders balanced utilising simple resampling methods (see the sketch below).

Theoretical seminar work: b) Literature review on gender-neutral emotion modelling for distant and similar tasks.
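A minimal sketch of how the training-set variants in a) could be constructed is shown below. The annotation file and column names are hypothetical, and scikit-learn's resample stands in for "simple resampling methods".

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical annotation file with a `gender` column ("m"/"f").
data = pd.read_csv("train_annotations.csv")
male, female = data[data["gender"] == "m"], data[data["gender"] == "f"]

# Upsample the minority gender so both are equally represented.
minority, majority = (male, female) if len(male) < len(female) else (female, male)
balanced = pd.concat([
    majority,
    resample(minority, replace=True, n_samples=len(majority), random_state=42),
])

train_sets = {
    "male_only": male,      # single gender
    "female_only": female,  # single gender
    "unbalanced": data,     # both genders, natural distribution
    "balanced": balanced,   # both genders, resampled to parity
}
```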

Utilises

Multimodal, Emotion Neural Networks

Requirements

Python, preliminary knowledge in data processing and deep learning

Languages

German or English

Investigation of Multiple Uni-modality Fusion Methods

Description:

From each modality (acoustic, visual, text), different sets of features can be extracted with uni-modal feature extractors. These feature extractors have been developed with different concepts in mind (manual, deep) and are differently successful for different tasks (e.g. robustness against audio noise).

Task:

Empirical term work: Investigate uni-modal fusion strategies (early fusion, unsupervised compression via encoder-decoder) for the a) acoustic, b) vision, and c) text modality (a sketch follows below).

Theoretical seminar work: d) Literature review on uni-modal feature extraction strategies.
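The sketch below illustrates both fusion strategies for a single (here: acoustic) modality: early fusion concatenates two feature sets, and a small encoder-decoder compresses the fused vector unsupervised. The feature dimensions and the compression size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Two hypothetical acoustic feature sets for the same 100 segments,
# e.g. a hand-crafted set (88-dim) and a deep extractor (512-dim).
handcrafted = torch.randn(100, 88)
deep = torch.randn(100, 512)

# Early fusion: simple concatenation along the feature dimension.
fused = torch.cat([handcrafted, deep], dim=1)  # shape (100, 600)

class AE(nn.Module):
    """Encoder-decoder that compresses the fused features unsupervised."""
    def __init__(self, d_in: int, d_code: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                 nn.Linear(256, d_code))
        self.dec = nn.Sequential(nn.Linear(d_code, 256), nn.ReLU(),
                                 nn.Linear(256, d_in))
    def forward(self, x):
        return self.dec(self.enc(x))

model = AE(fused.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):  # a few reconstruction steps for illustration
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(fused), fused)
    loss.backward()
    opt.step()
compressed = model.enc(fused).detach()  # the unsupervised compressed representation
```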

Utilises

Multimodal, Fusion, Emotion neural networks

Requirements

Python, preliminary knowledge in data processing and deep learning

Languages

German or English