Open topics for theses and practical courses

All topics are supervised by Claudia Plant. For more information on the topics, please contact the mentioned person.


Practical Course, Bachelor or Master Thesis: Kernelization for the Maximum Common Subgraph Problem

Given two graphs, the maximum common subgraph problem asks for the largest graph that is contained in both as a subgraph. This NP-hard problem is highly relevant in many applications such as computational drug discovery. The goal of this project is to develop scalable algorithms following the concept of kernelization, i.e., the (iterative) reduction of the problem to smaller instances.

A classical technique reduces the maximum common subgraph problem to finding a maximum clique in the product graph of the two input graphs. Equivalently, a maximum independent set in the complement of the product graph can be determined instead. For this problem, algorithms based on kernelization have recently been shown to be highly efficient in practice. In this project, the properties of the product graph and its complement should be studied theoretically, and the performance of the reduction should be investigated in practice. The kernelization could be further improved using specific properties of the product graph.
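To give an idea of the reduction, the following minimal sketch (illustrative only, and in Python rather than the C++ expected for the project) builds the modular product of two graphs given as adjacency dictionaries; cliques in this product correspond to common induced subgraphs of the two inputs.

```python
from itertools import combinations

def modular_product(g1, g2):
    """Modular product of two graphs given as {vertex: set_of_neighbors}.

    Two product vertices (u1, v1) and (u2, v2) are adjacent iff
    u1 != u2, v1 != v2, and the pair u1,u2 is an edge in g1 exactly
    when v1,v2 is an edge in g2 (edges map to edges, non-edges to
    non-edges). Cliques then encode common induced subgraphs.
    """
    nodes = [(u, v) for u in g1 for v in g2]
    adj = {n: set() for n in nodes}
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue
        if (u2 in g1[u1]) == (v2 in g2[v1]):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj
```

A maximum clique solver (or a maximum independent set solver on the complement) would then be applied to the returned graph; the kernelization rules studied in the project operate on exactly this instance.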

Students wanting to work on this topic are expected to have experience in graph algorithms and solid programming skills in C++.

Contact: Nils Kriege

Master Thesis: Reward Inference for Sequential Decision Making from Diverse and Implicit Feedback

Automated sequential decision making is an important application of machine learning systems in which such a system needs to select a sequence of actions step by step to optimize a reward/utility function. For instance, in autonomous driving, such a system needs to execute a sequence of steering, braking and acceleration actions, or in a medical intensive care setting, such a system needs to execute a sequence of measurement and treatment actions.

One challenge in realizing such automated sequential decision making systems is the definition of the reward/utility function. For example, in autonomous driving it is hard to specify all the factors which define good driving behavior. In such settings, automatically inferring the reward/utility function from users’ feedback can be beneficial.

This project investigates approaches for reward/utility inference from diverse and implicit feedback, building on ideas for inverse reinforcement learning, active learning, implicit feedback, etc.

Interested students are expected to have solid mathematical and machine learning skills, and have experience in Python and deep learning (using PyTorch or TensorFlow).

Contact: Sebastian Tschiatschek

Practical Course, Bachelor or Master Thesis: Imitation Learning Under Domain Mismatch

Reinforcement learning has been successfully used to solve certain challenging sequential decision making problems in recent years. The employed techniques commonly require (i) huge amounts of interactions with the environment and (ii) clearly specified reward signals to work well. In many applications, however, one or both of these requirements are not met. In such cases, imitation learning can be an efficient approach to sequential decision making problems: an expert demonstrates near-optimal behavior and a learning agent attempts to mimic this behavior.

This project considers imitation learning in settings in which there is some form of mismatch between the expert demonstrator and the learning agent. The scope of the project is to study how existing algorithms perform in this setting and to propose modifications to existing algorithms to achieve better performance.

Students wanting to work on this topic are expected to have a basic understanding of machine learning techniques, solid knowledge of Python and basic knowledge of deep learning libraries (PyTorch or TensorFlow).

Contact: Sebastian Tschiatschek

Bachelor or Master Thesis: Selecting Sequences of Items for Non-monotone Functions

Many applications involve the selection of sequences of items, e.g., recommender systems and personalization in MOOCs. Common to many of these applications is that the order of the selected items is not arbitrary: items selected early influence which items are selected later. The problem of selecting sequences of items has been studied in various settings, including one in which dependencies between items are modeled using graphs and monotone submodular functions.

This project aims at extending these settings to cover the case of non-monotone submodular functions, by proposing new algorithms and analyzing their properties. The findings are to be validated by implementing the proposed algorithm(s) and comparing them against reasonable baselines.
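For orientation, here is a minimal sketch of the classic greedy baseline for cardinality-constrained submodular maximization, which the project would extend toward sequence selection and non-monotone objectives (where plain greedy loses its approximation guarantee and variants such as random greedy are needed). The objective `f` below is supplied by the caller; the coverage function in the usage example is purely hypothetical.

```python
def greedy_select(items, f, k):
    """Classic greedy: repeatedly add the item with the largest
    marginal gain f(S + [x]) - f(S).

    For monotone submodular f this yields a (1 - 1/e)-approximation;
    for non-monotone f it can fail, which motivates randomized
    variants studied in this project.
    """
    selected = []
    for _ in range(k):
        best, best_gain = None, float("-inf")
        for x in items:
            if x in selected:
                continue
            gain = f(selected + [x]) - f(selected)
            if gain > best_gain:
                best, best_gain = x, gain
        selected.append(best)
    return selected
```

As a toy usage example, with a set-coverage objective `f(S) = |union of cover[x] for x in S|` the algorithm picks the items whose covered elements add the most on top of what is already covered.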

Students wanting to work on this topic are expected to have solid mathematical skills, a basic understanding of machine learning techniques, solid knowledge of Python and deep learning libraries (PyTorch or TensorFlow).

Contact: Sebastian Tschiatschek

Practical Course, Bachelor or Master Thesis: Posterior Consistency in Partial Variational Autoencoders

Variational Autoencoders (VAEs) are powerful deep generative models that have been successfully applied in a wide range of machine learning applications. Recently, the Partial VAE (PVAE), a variant of VAEs that can process partially observed inputs, has been proposed and its effectiveness for data imputation has been demonstrated. Key to the fast training of VAEs and PVAEs is the amortized prediction of posterior distributions from observations. In PVAEs, these posterior distributions are predicted from partial observations.

This project aims at studying the consistency of these posterior distributions for different patterns of missing data. The insights are used to create/train better inference models and thereby improve the quality of PVAEs.

Students wanting to work on this topic are expected to have solid mathematical skills, a basic understanding of machine learning techniques and good programming skills in Python.

Contact: Sebastian Tschiatschek

Practical Course: Pharmacoinformatics Research Group

The Pharmacoinformatics Research Group is led by Univ.-Prof. Dr. Gerhard Ecker at the Department of Pharmaceutical Chemistry.

Following a holistic pharmacoinformatic approach we combine structural modeling of proteins, structure-based drug design, chemometric and in silico chemogenomic methods, statistical modeling and machine learning approaches to develop predictive computational systems for transporters and ion channels.

We work with workflow management systems like KNIME for data integration, do statistical analysis in R, and program predictive models in Python; at times we offer these tools to fellow researchers, and this is where you come in: we often need help making our tools openly accessible, e.g., by translating them into a web service or turning them into standalone software.

For a recent example, take a look at our LiverTox Workspace.

Contact: Jana Gurinova

Practical Course, Master or Bachelor Thesis: Evaluation of different nowcasting techniques and data sources for temperature nowcasting

Nowcasting in meteorology refers to the short-term prediction of time series, e.g., a few minutes to hours into the future. For such short-term predictions, the most recent observations are important, as are the spatial relationships to upstream observations.

Different methods for nowcasting exist in the meteorological world, often physics-based or combined physics-statistical approaches. Only a few studies have focused on applying machine learning and data mining techniques. For the latter, crowd-sourced data can nowadays provide valuable information; here, data from NetAtmo stations could be a useful source. As a first step, a previous study already investigated different machine learning algorithms for nowcasting of temperature using data from the Austrian Met-Service (ZAMG). The idea of the proposed thesis is to combine the already existing algorithms with the NetAtmo data: first cluster the ZAMG and NetAtmo data, then develop an algorithm which incorporates the NetAtmo data. Note that NetAtmo sites are prone to errors, so their data needs to be cleaned beforehand.

The developed algorithm will be evaluated against two statistical forecasting methods: an analogue-search-based method and a model output statistics method.

Contact: Irene Schicker

Master or Bachelor Thesis: Exploratory Data Analysis on the GPU - CUDA Warp Level Primitives and Independent Thread Scheduling

Volta’s independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads, and its combined L1 data cache and shared memory subsystem significantly improves performance while also simplifying programming. In this project, we will enhance traditional data mining algorithms with the use of the GPU and its independent thread scheduling based on CUDA intrinsics. Candidate algorithms are k-means, DBSCAN, the Apriori algorithm, or dimensionality reduction techniques such as SVD or PCA.

Contact: Martin Perdacher

Master or Bachelor Thesis: Exploratory Data Analysis with Google TPU

The Tensor Processing Unit (TPU) was announced in 2016 at Google I/O, when the company said that the TPU had already been used inside their data centers for over a year. The chip has been specifically designed for Google’s TensorFlow framework, a symbolic math library which is used for machine learning applications such as neural networks. In this project, we will enhance traditional data mining algorithms with the use of the TPU. Candidate algorithms are k-means, DBSCAN, the Apriori algorithm, or dimensionality reduction techniques such as SVD or PCA.
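To illustrate why an algorithm like k-means maps well onto accelerators such as TPUs or GPUs, here is a hypothetical sketch of one Lloyd iteration written purely as dense array operations (NumPy stands in for TensorFlow here); every step is a tensor computation the hardware can parallelize.

```python
import numpy as np

def kmeans_step(X, centers):
    """One Lloyd iteration of k-means as pure array operations.

    X: (n, d) data matrix, centers: (k, d) current centroids.
    Returns the cluster assignment of each point and the updated
    centroids (empty clusters keep their old centroid).
    """
    # (n, k) squared Euclidean distances via broadcasting
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)
    new_centers = np.array([
        X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
        for j in range(len(centers))
    ])
    return labels, new_centers
```

On a TPU the broadcasted distance matrix and the argmin/reduction would be expressed as TensorFlow ops and executed on the matrix units; the Python loop over clusters would likewise be replaced by a masked matrix product.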

Contact: Martin Perdacher

Practical Course or Master Thesis: Deep Learning based Clustering

Deep embedded clustering or deep clustering is a growing field that combines ideas from clustering and deep learning. The integration of these techniques makes it possible to learn features automatically from the data and to jointly improve clustering performance. Deep clustering covers different areas of machine learning and data mining, namely deep learning, clustering, dimensionality reduction, matrix factorization and representation learning, which makes it an intriguing research direction. Interested students can work on the latest research by implementing recently published algorithms and/or supporting the development of new algorithms.

The scope of the project can be discussed and decided based on individual strengths and preferences. Students wanting to work on this topic are expected to have a basic understanding of machine learning and data mining techniques and a solid knowledge of Python, preferably including its data science stack (pandas, scikit-learn, numpy, …) and deep learning libraries (pytorch, tensorflow, …).

Contact: Lukas Miklautz

Practical Course or Bachelor Thesis: Causality Among Heterogeneous Processes by MDL

The minimum description length (MDL) principle can be used for causal inference. The goal of this practical course/thesis is to run experiments with an existing toolbox for MDL-based causal inference among heterogeneous random processes, or to perform an experimental comparison with a rival method. No previous knowledge of causal inference, MDL or advanced statistical methods is required for this work.

Contact: Katerina Schindlerova

Master or Bachelor Thesis: Implementation and Testing of an Algorithm for Anomaly Detection in Gait Time Series

The goal of this thesis is to implement a temporal anomaly detection algorithm on the MoCap motion capture database of human gait time series. Depending on whether it is a bachelor or master thesis, the work will be extended with queries in gait recognition.

Contact: Katerina Schindlerova

Master or Bachelor Thesis: Mining in time series: Nonlinear and non-additive causal modeling

The goal of this thesis is to develop models and algorithms able to detect causal patterns in time series which are of nonlinear or non-additive nature. The models and algorithms will be validated against state-of-the-art causal models.

Contact: Katerina Schindlerova

Master Thesis: Data Mining on Real World Accelerometer Time Series

In data mining and machine learning, the choice of distance measure is a crucial design decision that strongly depends on the data structure and application. There are plenty of distance measures: Euclidean, Manhattan, edit distance, Dynamic Time Warping, SAX, Hamming, and many more. When analyzing time series data, key information is given by the ordering of the observations, so the distance measure of choice should also respect this ordering, as Dynamic Time Warping (DTW) does.
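As a minimal illustration of the kind of measure involved, the following is the textbook O(nm) dynamic program for DTW on univariate series (a simplifying assumption; accelerometer data is typically multivariate, in which case the pointwise cost would become a vector norm).

```python
def dtw(a, b):
    """Dynamic Time Warping distance between two sequences.

    Unlike the Euclidean distance, DTW allows a non-linear alignment
    along the time axis, so locally stretched or compressed patterns
    still match. D[i][j] holds the cost of the best alignment of the
    first i elements of a with the first j elements of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0 because the repeated value can be absorbed by the warping path, whereas the Euclidean distance is not even defined for sequences of unequal length.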

Another research question is: How much data is enough? How many features are enough? What sampling rate is high enough?

You will start with literature research about time series distance measures and test them on accelerometer time series for supervised and unsupervised data mining tasks.

Contact: Maximilian Leodolter

Implementation of a Data Mining Approach for Short-range Temperature Forecasts

Short-range forecasts of wind speeds (i.e., 1-72 hours into the future) and in particular nowcasting (i.e., very short-range forecasts with a time horizon of up to 6 hours) are vital for a wide range of applications. In contrast to wind, temperature typically changes gradually and may thus need a different setup than wind speed. Temperature involves daily fluctuations, which are well predictable but depend on the time of day, which must be considered in the training data set. Depending on the location, rapid temperature changes may also occur, related, for example, to cold air pools or Föhn. Thus, temperature also depends highly on the location (specific topography, prevailing weather conditions, and atmospheric dynamics). Depending on the application, we can give either point-based or area-based predictions. Point forecasts refer to a particular location (e.g., a weather station), whereas spatial forecasts typically give a forecast for each grid cell over a region (i.e., each forecast is valid for the whole cell).

ZAMG (Zentralanstalt für Meteorologie und Geodynamik) employs (gridded) numerical weather prediction models in conjunction with observation data for short-range forecasting, and a nowcasting system, INCA, for the prediction of meteorological parameters. In addition, machine learning methods are now being implemented: an artificial neural network (ANN) and random forests (RF) are used in an experimental setup to show the skill of these methods and, possibly, to serve as an additional point forecasting method for the 10-meter wind speed. The existing methods can be used for temperature as well with the same training set, but the setup needs to be adapted.

The proposed student project shall address temperature forecasting (at two meters height) at meteorological observation sites by machine learning and data mining methods (e.g., random forests, feed-forward/backpropagation artificial neural networks, kernel methods, etc.) and input feature selection for the training data set. It is possible to experiment with related meteorological parameters as well (e.g., dew point temperature and relative humidity). Related work suggests using a backpropagation neural network for predicting the 2-m dew point temperature and the 2-m temperature. This can serve as a starting point to find a suitable selection of the training data for the current model and then extend our current approaches by a backpropagation neural network in order to set up a first prototype for temperature prediction by machine learning methods. The new model shall be tested on various scenarios (e.g., different prevailing weather conditions, locations, seasons) in order to compare the new data-mining-based model with the currently employed nowcasting system INCA.

The work is co-supervised by ZAMG's section for numerical weather prediction (NWP) applications. The developed method shall have a Python-based frontend and a C/C++ backend, and use CSV- or SQLite-based meteorological data (provided by ZAMG) in order to align with other machine learning implementations running in our IT environment. Finally, the developed method will be set up in our development environment (Python 2.7/Linux 64-bit, multi-core shared-memory machine) to provide forecasts and a validation of the method for selected test scenarios.

Contact: Petrina Papazek

SIGMOD programming contest

The ACM Special Interest Group on Management of Data (SIGMOD) organizes an annual programming contest. The contest and its task are announced in February on the SIGMOD website. The prize pool is often up to $5,000 for computing resources (cloud access). For this "Praktikum" we organize a team of 2 to 4 people.

Contact: Martin Perdacher


Completed theses and projects

  • Thomas Spendlhofer, Bachelor Thesis: "Evaluating the usage of Tensor Processing Units (TPUs) for unsupervised learning on the example of the k-means algorithm", summer term 2019
  • Ernst Naschenweng, Bachelor Thesis: "A cache optimized implementation of the Floyd-Warshall Algorithm", summer term 2018
  • Hermann Hinterhauser, Bachelor Thesis: "ITGC: Information-theoretic grid-based clustering", summer term 2018, accepted paper in EDBT 2019 (download available here)
  • Mahmoud A. Ibrahim, Bachelor Thesis: "Parameter Free Mixed-Type Density-Based Clustering", winter term 2017/2018, accepted paper in DEXA 2018 (download available here)
  • Markus Tschlatscher: "Space-Filling Curves for Cache Efficient LU Decomposition", winter term 2017/18
  • Theresa Fruhwuerth, Master Thesis: "Uncovering High Resolution Mass Spectrometry Patterns through Audio Fingerprinting and Periodicity Mining Algorithms: An Exploratory Analysis", summer term 2017
  • Robert Fritze, PR1 "Combining spatial information and optimization for locating emergency medical service stations: A case study for Lower Austria", summer term 2017
  • Alexander Pfundner, PR2 "Integration of Density-based and Partitioning-based Clustering Methods", summer term 2017
  • Anton Kováč, Katerina Hlaváčková-Schindler, Erasmus project, "Graphical Granger Causality for Detection Temporal Anomalies in EEG Data", winter term 2016/17 (download available here)