Open topics for theses and practical courses
Unless otherwise specified, all topics are available as a practical course (P1/P2), a bachelor thesis, or a master thesis.
Interpreting Deep Clustering Results [Practical Course or Master Thesis]
Deep embedded clustering, also called deep clustering, is a growing field that combines ideas from clustering and deep learning. Integrating these techniques makes it possible to learn features automatically from the data and thereby increase clustering performance. Current deep clustering methods are hard to interpret, making it difficult to understand how a clustering result was reached.
The goal of this project is to develop an interactive visualization tool, e.g. a web-based application, for exploring the predictions of deep clustering algorithms and helping to understand their decision-making process.
The student is expected to conduct a literature review of existing visualization techniques developed for (supervised) deep learning, e.g. feature visualizations, that could be applicable to interpreting unsupervised deep clustering algorithms. The identified methods should then be applied to (and, where necessary, adapted for) existing deep clustering algorithms and compared.
Some research questions of interest that should be considered during the project: How suitable are existing visualization techniques for interpreting deep clustering results? How do the different parts of the multi-objective loss of deep clustering techniques relate to each other? Considering multiple clustering models, e.g. k-means vs. DBSCAN, how do the neural network visualizations differ for each of them?
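As a first concrete handle on interpretation, the soft cluster assignments computed by many deep clustering methods (e.g. Deep Embedded Clustering) are a natural per-point quantity for such a tool to visualize. A minimal sketch, assuming 2-D embeddings and plain Python for readability:

```python
import math

def soft_assignments(embeddings, centroids, alpha=1.0):
    """Student-t soft cluster assignments as used in Deep Embedded
    Clustering (DEC): each point gets a probability per cluster,
    which a visualization tool could map to e.g. color intensity."""
    result = []
    for z in embeddings:
        weights = []
        for mu in centroids:
            dist2 = sum((zi - mi) ** 2 for zi, mi in zip(z, mu))
            weights.append((1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        result.append([w / total for w in weights])
    return result

# Toy example: two well-separated 2-D embeddings, two centroids.
q = soft_assignments([(0.0, 0.0), (5.0, 5.0)], [(0.0, 0.0), (5.0, 5.0)])
```

Each row of q sums to one; points near a centroid receive a confident assignment, and exactly this kind of uncertainty information is worth surfacing in the tool.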
Students working on this project need basic background knowledge in machine learning (e.g. Foundations of Data Analysis), visualisation (e.g. Visualisation and Visual Data Analysis), solid programming skills in Python, and ideally some background in PyTorch, deep learning, and a visualization framework such as D3.
Supervisors: Claudia Plant, Lukas Miklautz in collaboration with Aleksandar Doknic and Torsten Möller
Contact: Lukas Miklautz, Claudia Plant
Dynamic Orbit Partition [Practical Course or Master Thesis]
A graph isomorphism is a mapping between the nodes of two graphs that preserves the edges. An automorphism is an isomorphism of a graph to itself. The (non-trivial) automorphisms intuitively represent the symmetries of a graph. Two vertices are considered equivalent if there is an automorphism that maps one vertex to the other. The corresponding equivalence classes are referred to as the orbit partition. The orbit partition is of interest for symmetry-breaking techniques for various combinatorial problems: identifying and pruning symmetric branches in the search tree based on the symmetries of the input instance can lead to drastic speedups. This project aims to develop dynamic algorithms for maintaining the orbit partition when the input graph changes. The focus of the project can be on the development and theoretical analysis of algorithms or on their implementation and experimental evaluation.
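To make the definitions concrete, the following deliberately naive sketch computes the orbit partition by enumerating all vertex permutations; it only illustrates the concept and is factorial-time, whereas the project would build on efficient (dynamic) symmetry-detection machinery:

```python
from itertools import permutations

def orbit_partition(n, edges):
    """Brute-force orbit partition of a graph on vertices 0..n-1 by
    enumerating all n! candidate automorphisms; illustration only."""
    edge_set = {frozenset(e) for e in edges}
    parent = list(range(n))                      # union-find over vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for perm in permutations(range(n)):
        mapped = {frozenset(perm[v] for v in e) for e in edge_set}
        if mapped == edge_set:                   # perm is an automorphism
            for v in range(n):
                parent[find(v)] = find(perm[v])  # merge v with its image
    classes = {}
    for v in range(n):
        classes.setdefault(find(v), []).append(v)
    return sorted(sorted(c) for c in classes.values())

# Path 0-1-2: the endpoints 0 and 2 are symmetric, the centre 1 is not.
path_orbits = orbit_partition(3, [(0, 1), (1, 2)])
```

On a 4-cycle, all vertices fall into one orbit; on the path above, the orbits are {0, 2} and {1}.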
Requirements: Strong interest in algorithms
Contact: Nils Kriege, Christian Schulz, Kathrin Hanauer
Kernelization for Clique Solvers [Practical Course or Master Thesis]
A clique C in an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges, is a subset of V such that all its vertices are pairwise connected. The size of C is its cardinality. The maximum clique problem is to find a clique of maximum size in G. An important generalization is the maximum weight clique problem, in which the graph has a weight function w that assigns a positive integer weight to each vertex, and the weight of a clique C is defined as the total weight of the vertices in C. Clique problems have many important applications, in particular in chemistry and bioinformatics: they are used, for example, to compare molecular structures, to predict protein structures, or to analyze interactions. The goal of the project is to transfer reduction rules that have recently been developed for the dual weighted independent set problem to the maximum clique problem, thereby improving the state of the art for the maximum weight clique problem. On the reduced problem, one can then apply an exact or heuristic algorithm to solve the problem.
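Note that a maximum weight clique in G is exactly a maximum weight independent set in the complement of G, which is why reduction rules can be transferred between the two problems. A naive exact solver, to which reduced instances could be handed, might look like the following sketch (exponential in the worst case, illustration only):

```python
def max_weight_clique(adj, w):
    """Exact maximum weight clique by exhaustive recursion; exponential
    in the worst case, so purely a reference for testing reductions.
    adj: dict vertex -> set of neighbours, w: dict vertex -> weight."""
    best = [0, frozenset()]

    def extend(clique, candidates, weight):
        if weight > best[0]:
            best[0], best[1] = weight, clique
        for v in list(candidates):
            candidates = candidates - {v}        # siblings must not revisit v
            extend(clique | {v}, candidates & adj[v], weight + w[v])

    extend(frozenset(), frozenset(adj), 0)
    return best[0], best[1]

# Triangle 0-1-2 (weight 1 each) plus a heavy pendant vertex 3 on 0:
# the heaviest clique is {0, 3} with weight 6, not the triangle.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
w = {0: 1, 1: 1, 2: 1, 3: 5}
weight, clique = max_weight_clique(adj, w)
```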
Requirements: Strong interest in algorithms, strong programming skills, C++
Contact: Nils Kriege, Christian Schulz, Kathrin Hanauer
Interpretable and Explainable Deep Learning
In this project you will study novel approaches for learning interpretable and explainable deep learning models with the goal of supporting users with different skill levels and knowledge. The models will be evaluated on applications with large societal impact, e.g., labour market data.
Contact: Sebastian Tschiatschek
Dynamic Information Acquisition in Questionnaires
In many areas of our lives, questionnaires or collections of measurements are used to gather information for decision making, e.g., questionnaires are used to measure the success of marketing campaigns or customer satisfaction, and medical tests are conducted to decide on the best treatment for a patient. Common to these applications is that many questions need to be answered or many tests must be conducted. Often this is not because all the information is needed for deriving a decision, but because the course of action is standardized (e.g., a survey always consists of the same questions).
In this project you will evaluate and develop methods for adaptive information acquisition that acquire the information necessary for decision making sequentially and adaptively, e.g., not all questions in a questionnaire are asked in all cases and the order of the questions can change. You will consider and evaluate recent machine learning models for this problem, but also investigate how these methods can be extended to account for dependencies in the information that can be acquired, e.g., dependencies that only allow for a specific order of some questions of a questionnaire, the influence of how questions are asked on the answers, etc.
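The greedy flavour of such adaptive acquisition can be sketched as picking the next question with maximal expected information gain about the target; the data layout and column names below are purely hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def next_question(rows, label, asked):
    """Greedy adaptive acquisition: among the not-yet-asked questions,
    pick the one whose answer maximally reduces entropy of `label`,
    restricted to the rows consistent with the answers so far."""
    pool = [r for r in rows if all(r[q] == a for q, a in asked.items())]
    base = entropy([r[label] for r in pool])
    best_q, best_gain = None, 0.0
    for q in rows[0]:
        if q == label or q in asked:
            continue
        remainder = 0.0
        for val in {r[q] for r in pool}:
            sub = [r[label] for r in pool if r[q] == val]
            remainder += len(sub) / len(pool) * entropy(sub)
        if base - remainder > best_gain:
            best_q, best_gain = q, base - remainder
    return best_q

# Hypothetical answer table: question q1 perfectly predicts the label y.
rows = [{"q1": 0, "q2": 0, "y": 0}, {"q1": 0, "q2": 1, "y": 0},
        {"q1": 1, "q2": 0, "y": 1}, {"q1": 1, "q2": 1, "y": 1}]
first = next_question(rows, "y", {})
```

Here the greedy rule asks q1 first, since q2 carries no information about y; the project would replace this table-based heuristic with learned models.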
Contact: Sebastian Tschiatschek
Deep Probabilistic Clustering for Heterogeneous and Incomplete Data
Clustering is the task of finding groups in a data set. As an unsupervised learning approach it is important for exploratory data analysis, e.g., for finding commonalities between patients suffering from some disease. Often such data is high-dimensional and contains complicated non-linear relationships, which makes it difficult to apply standard clustering approaches. Deep clustering algorithms have been proposed recently to approach this kind of setting by combining clustering and deep learning in a common framework. An open problem that has not been sufficiently addressed in recent deep clustering research is incomplete and heterogeneous data (e.g. survey data, blood tests, MRI images, etc.). Incomplete data is often an issue in practice, especially for health data where not everything about a patient’s history is known.
In this project you will study the clustering of incomplete data with heterogeneous data types, with a particular focus on missingness and on determining which information is relevant for cluster assignments. You will develop an algorithm based on previous work and evaluate it on synthetic and real-world data. Some research questions of interest would be, for instance: How do cluster assignments change when given new information? Or, given an uncertain cluster assignment (e.g. sick or healthy patient), which information would we need to query to make a decision?
Contact: Sebastian Tschiatschek
Multiagent Reinforcement Learning under Mismatch
Enabling intelligent agents to collaborate effectively despite mismatch in perception or capabilities.
Contact: Sebastian Tschiatschek
Multiagent Teaching Primitives
Enabling multiple interacting agents to convey their knowledge and skills through teaching.
Contact: Sebastian Tschiatschek
Abstraction in Reinforcement Learning
Learning abstractions of large state and action spaces in order to facilitate more efficient planning and exploration.
Contact: Sebastian Tschiatschek
Data mining approaches to EEG time series for analysis of antidepressant treatment response [Master Thesis]
A deep learning model for antidepressant treatment analysis based on EEG measurements will be developed. An ANN-based classifier and a clustering algorithm will be proposed to discover the synchronized structure of the brains of treated patients. After familiarizing yourself with the central literature on the topic, you will decide together with your supervisor whether you want to focus on the clustering or on the classification topic.
Contact: Katerina Schindlerova, Claudia Plant
Bivariate causal measures: Experiments with existing causal methods in python on synthetic and real data [Practical Course]
Contact: Katerina Schindlerova
GPU based data mining on Android devices
Energy efficiency for high-performance computing is a hot topic in current scientific research. Many Android™ devices have an integrated GPU and a vendor-supplied OpenCL shared C library on the device. Although not directly supported by Android, these OpenCL libraries allow application developers to address on-board GPUs.
In a former project, a convenient development environment for Android Studio was created. A C wrapper library serves as the link between the Java/JNI/C part of an application and the OpenCL library on the device. Locking mechanisms allow the GPU to be used safely and free resources promptly after the application has been stopped externally (by the user or the operating system).
In this project, the wrapper library should be used to implement data mining algorithms with OpenCL and to compare runtimes on the GPU with those achieved on the device's CPU (Java and C with SIMD instructions). Candidate algorithms are DBSCAN, k-NN, SUBCLU, and OPTICS. Students may implement further algorithms if they wish. An application stub with the OpenCL wrapper library will be provided.
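For validating the OpenCL implementations, a plain (and deliberately slow, O(n²)) Python reference such as the following DBSCAN sketch can serve as a correctness baseline:

```python
import math

def dbscan(points, eps, min_pts, dist):
    """Reference DBSCAN returning one label per point
    (-1 = noise, 0..k-1 = cluster id); O(n^2) neighbour queries."""
    n = len(points)
    labels = [None] * n

    def region(i):
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neigh = region(i)
        if len(neigh) < min_pts:
            labels[i] = -1                  # noise (may become a border point later)
            continue
        labels[i] = cluster
        seeds = list(neigh)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster         # absorb former noise as border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = region(j)
            if len(jn) >= min_pts:          # core point: keep expanding
                seeds.extend(jn)
        cluster += 1
    return labels

points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (10, 10)]
labels = dbscan(points, eps=0.5, min_pts=3,
                dist=lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1]))
```

Comparing the OpenCL kernel's labels against such a reference on small inputs catches correctness bugs before any performance measurements.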
Prerequisites:
 Good knowledge of Java, C and OpenCL (Kotlin only upon agreement)
 Knowledge of GPU design
 The student should be familiar with the IDE (Android Studio) for the application development.
 The student must have an Android device (tablet or mobile phone) with an integrated GPU, and the OpenCL library must be available on the device (please check the directories /system/lib, /system/lib64, /vendor/lib, /vendor/lib64 and the technical documentation of the device). The application can NOT be developed with the AVD manager (no GPU support)! If the Android OS version on the device is 7.0 or later (API >= 24), please check that the name of the OpenCL library is listed in /vendor/etc/public.libraries.txt or /system/etc/public.libraries-COMPANYNAME.txt on the device; if it is not, Android does not allow the library to be loaded dynamically at runtime. This is a mandatory prerequisite for the use of the wrapper library.
 For the use of SIMD instructions: either knowledge of (inline) assembler (depending on the student's device CPU architecture: armv7, aarch64, x86, or x86_64) or knowledge of the use of SIMD intrinsics (NEON or SSE) in C.
Contact: Robert Fritze, Claudia Plant
Learning with Reduced Molecular Graphs
Drug discovery at its early stages can greatly benefit from machine learning methods. As molecules are structured objects consisting of atoms and chemical bonds, they cannot be represented by vectors in a straightforward way, but are adequately modeled as graphs. Recent advances in machine learning with graphs have led to well-engineered methods applicable to graphs annotated with node and edge attributes, e.g., graph neural networks. For molecules, different graph representations exist. The most natural approach is to represent atoms as nodes and bonds as edges. However, different so-called reduced graph models exist, where groups of atoms are represented by a single node and their properties by node attributes. The goal of this project is to compare the performance of various graph learning techniques with different graph models. Based on the results, tailored combinations of methods and representations should be developed.
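The reduction step itself amounts to a simple graph contraction once the grouping of atoms (e.g. into rings or functional groups) is given; a minimal sketch:

```python
def reduce_graph(edges, node_group):
    """Contract each group of atoms into one super-node; the grouping
    (e.g. by ring membership or functional group) is assumed given.
    Returns the edge set of the reduced graph between group labels."""
    reduced = set()
    for u, v in edges:
        gu, gv = node_group[u], node_group[v]
        if gu != gv:                      # edges inside a group disappear
            reduced.add(frozenset((gu, gv)))
    return reduced

# Phenol-like toy molecule: a six-ring (atoms 0-5) with an OH on atom 0.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
node_group = {0: "ring", 1: "ring", 2: "ring", 3: "ring",
              4: "ring", 5: "ring", 6: "OH"}
reduced = reduce_graph(edges, node_group)
```

In the reduced model, the whole ring collapses to one node connected to the OH node; a real pipeline would additionally attach node attributes describing each group.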
Students wanting to work on this topic are expected to have experience in machine learning and basic knowledge of graph theory and algorithms. Solid programming skills, preferably in Python, are required.
Contact: Nils Kriege, Franka Bause
Oracle analysis of distant supervision errors
Please see [Moodle] for a detailed description of this and other Natural Language Processing topics.
Contact: Benjamin Roth
Weakly supervised discourse relation prediction
Please see [Moodle] for a detailed description of this and other Natural Language Processing topics.
Contact: Benjamin Roth
Incomplete Schema Relation Clustering
Please see [Moodle] for a detailed description of this and other Natural Language Processing topics.
Contact: Benjamin Roth
Weakly supervised learning with latent class predictions
Please see [Moodle] for a detailed description of this and other Natural Language Processing topics.
Contact: Benjamin Roth
Gradient matching for semi-supervised learning
Please see [Moodle] for a detailed description of this and other Natural Language Processing topics.
Contact: Benjamin Roth
Threshold finding for knowledge-base completion using Gaussian processes
Please see [Moodle] for a detailed description of this and other Natural Language Processing topics.
Contact: Benjamin Roth
Path-based knowledge-base completion
Please see [Moodle] for a detailed description of this and other Natural Language Processing topics.
Contact: Benjamin Roth
Better sentence representations based on BERT
Please see [Moodle] for a detailed description of this and other Natural Language Processing topics.
Contact: Benjamin Roth
Explainable Policies for Game Play [Master Thesis]
Reinforcement learning has shown remarkable successes in playing board games like Go but also in playing computer games like DOTA. While these results are impressive, the learned strategies (policies) are typically not directly interpretable by humans. However, interpretability is important for instance to enable better collaboration of humans with AI agents or to assess safety of the learned behavior.
In this project you will make such learned policies easier to understand by humans by incorporating constraints on their structure. You will develop and implement a novel policy training algorithm and evaluate it on simple computer games.
Students wanting to work on this topic are expected to have solid coding skills in Python, basic knowledge in PyTorch or TensorFlow, and be curious to learn about reinforcement learning and its applications.
Contact: Sebastian Tschiatschek
Learning Composable Policies [Master Thesis]
Reinforcement learning has shown remarkable successes in application domains like game play and robotics but suffers from large sample complexity, limiting its broader usage. A promising approach for reducing the sample complexity is to learn skills that can be reused when solving a series of related tasks. In the context of robotics, such skills could for instance correspond to lifting a leg or grasping an object.
In this project you will develop and implement a learning algorithm for such composable policies based on the recently proposed Successor Options. You will evaluate this algorithm on several benchmark tasks and compare it to other state-of-the-art algorithms.
Students wanting to work on this topic are expected to have solid coding skills in Python, basic knowledge in PyTorch or TensorFlow, and be curious to learn about reinforcement learning and its applications.
Contact: Sebastian Tschiatschek
The Cost of Feedback
Reinforcement learning has shown remarkable successes in application domains like game play and robotics. These successes were achieved by performing large numbers of interactions of a learning agent with the environment. In these interactions, the agent constantly receives information about rewarding behavior. However, in many applications involving human users (e.g., personal assistants), the reward information is explicitly or implicitly provided by these human users. Therefore, this information must be treated as a scarce and valuable resource, and the (cognitive) cost of providing this information should be accounted for.
In this project you will study the cost of providing different types of feedback used for reinforcement learning in user studies. To this end, you will implement a simple reinforcement learning environment for which a human user can provide reward information in various forms (e.g., comparisons, sorted lists, etc.) and compare the cost of providing this feedback as well as the usefulness of the feedback for training the reinforcement learning agent. Depending on whether you are doing a P1/P2/Bachelor Thesis/Master Thesis, the scope of the project will be adjusted.
Students wanting to work on this topic are expected to have solid coding skills in Python, basic knowledge about the development of interactive web pages, and be curious to learn about reinforcement learning and its applications.
Contact: Sebastian Tschiatschek
Reinforcement Learning for Optimizing Electronic Circuit Design [Practical Course]
Electronic circuits are part of all electric devices we use on a daily basis. When developing a new electric device, tailored electronic circuits have to be constructed for the specific use case, an often time-consuming process. It is therefore appealing to automate this process. Reinforcement learning is a promising approach for this: it has shown remarkable successes in application domains like game play and robotics but has so far rarely been used for designing and optimizing electronic circuits.
In this project you will build the basis for using reinforcement learning for electronic circuit design and optimization. In particular, you will wrap an existing electronic circuit simulator in a reinforcement learning environment such that it can be easily interacted with by standard reinforcement learning algorithms. You will then implement and evaluate standard reinforcement learning algorithms for designing and optimizing simple electronic circuits, e.g., current sources.
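A gym-style wrapper around a circuit simulator could follow the sketch below; the `simulate` placeholder and all parameter names are assumptions for illustration, not the API of any existing simulator:

```python
class CircuitEnv:
    """Minimal gym-style environment sketch around a circuit simulator.
    `simulate` is a toy placeholder (two series resistors at 1 V), not
    the API of any real simulator."""

    def __init__(self, target_current=1.0):
        self.target = target_current
        self.params = None

    def reset(self):
        self.params = {"r1": 1.0, "r2": 1.0}    # component values form the state
        return dict(self.params)

    def step(self, action):
        name, factor = action                   # (component name, multiplicative change)
        self.params[name] *= factor
        current = self.simulate(self.params)
        reward = -abs(current - self.target)    # closer to the spec = higher reward
        done = abs(current - self.target) < 1e-2
        return dict(self.params), reward, done, {}

    def simulate(self, params):
        # Placeholder physics: current through two series resistors at 1 V.
        return 1.0 / (params["r1"] + params["r2"])

env = CircuitEnv(target_current=0.25)
env.reset()
state, reward, done, info = env.step(("r1", 3.0))   # r1: 1.0 -> 3.0
```

Once the real simulator is wrapped behind such a reset/step interface, any standard RL algorithm can interact with it unchanged.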
Students wanting to work on this topic are expected to have solid coding skills in Python and be curious to learn about reinforcement learning and apply it in a realworld application. Some background in electronic circuit design is advantageous but not a prerequisite.
Contact: Sebastian Tschiatschek
Algorithms for Finding Matched Molecular Pairs
The concept of matched molecular pairs (MMP) refers to two molecules which have a large common substructure and differ only by a single local modification. Finding such pairs and analyzing their physical properties, such as biological activity, makes it possible to attribute differences in these properties to the specific structural change.
The problem can be formalized in terms of graph theory and is then closely related to variants of the classical graph isomorphism problem such as the maximum common subgraph problem and graph canonization. In this project, you will implement, develop and experimentally evaluate graph algorithms for finding MMPs. Several algorithmic challenges can be studied in more detail, e.g., the efficient computation of canonical forms in a dynamic setting or the use of index data structures and hashing techniques for acceleration.
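As a reference point for the canonization aspect, the following brute-force canonical form (lexicographically smallest edge list over all relabellings) is useful for validating faster algorithms on tiny graphs:

```python
from itertools import permutations

def canonical_form(n, edges):
    """Naive canonical form: the lexicographically smallest sorted edge
    list over all vertex relabellings. Factorial time, so only a
    reference for validating faster canonization code on tiny graphs."""
    edge_set = {frozenset(e) for e in edges}
    best = None
    for perm in permutations(range(n)):
        relabelled = sorted(tuple(sorted(perm[v] for v in e)) for e in edge_set)
        if best is None or relabelled < best:
            best = relabelled
    return tuple(best)

# Two drawings of the same path graph share one canonical form.
form_a = canonical_form(3, [(0, 1), (1, 2)])
form_b = canonical_form(3, [(2, 0), (0, 1)])
```

Because isomorphic graphs map to the same canonical form, such forms can serve as hash keys for bucketing candidate common substructures when searching for MMPs.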
Requirements: C++ or Java (no experience in chemistry necessary!)
Contact: Nils Kriege, Steffen Hirte
2D pharmacophore descriptors for machine learning and similarity searches
Pharmacophores are an abstract description of the molecular features that are necessary for the molecular recognition of drugs by a biological macromolecule (usually a protein). The molecular features constituting a pharmacophore can be divided into different classes according to the nature of their non-bonding interaction with complementary parts of the macromolecular target receptor. Every molecule has a characteristic pharmacophoric feature pattern in terms of the number, type, and mutual distance of its features. Different molecules that are recognized by the same biological target usually show a high similarity regarding their feature pattern. A database search for molecules whose pharmacophoric profile is similar to a given query pharmacophore can thus reveal novel molecules that have a high chance of binding to the same receptor for which the query was created. To be able to search millions of molecules in acceptable time, a suitable in-memory pharmacophore representation is required which allows for a fast similarity assessment. In cheminformatics, an often-employed representation of molecular structures which fulfills these requirements is fixed-size binary fingerprints (bit strings), where similarity calculations essentially boil down to fast logical bitset operations.
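The core similarity computation on such binary fingerprints is the Tanimoto coefficient, which on bit strings reduces to logical operations and popcounts; a minimal sketch with fingerprints stored as Python integers:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints stored as Python ints
    interpreted as bit strings: |A AND B| / |A OR B|."""
    inter = bin(fp_a & fp_b).count("1")
    union = bin(fp_a | fp_b).count("1")
    return inter / union if union else 1.0

sim = tanimoto(0b101101, 0b100111)   # 3 common bits out of 5 set in total
```

In a C++ implementation on top of CDPKit, the same computation would use fixed-size bitsets and hardware popcount instructions for speed.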
The goal of this project is to implement such a binary encoding also for pharmacophores that can be used for fast pharmacophore similarity searches and for the development of machine learning models. The intended algorithm for converting an arbitrary pharmacophore into a fixed-size bit representation is exemplified in the following figure:
The implementation will be carried out on top of the open source cheminformatics toolkit CDPKit (https://github.com/aglanger/CDPKit) which already provides all required functionality for the I/O of molecular structures and pharmacophore generation.
Requirements: C++ (preferable) or Python, basic knowledge about the fundamentals of organic chemistry advantageous
Contact: Thomas Seidel, Nils Kriege
Detecting errors in protein crystal structures
In this project, students are provided with the electron density of a protein: a three-dimensional grid with a number assigned to each point. Additionally, a molecule is specified through the type and position of each individual atom (on the same grid). Each atom should be associated with a specific amount of electron density in its surroundings. Your goal is to implement an algorithm that identifies unassigned or over-allocated electron density.
The story behind this is the following: proteins are the machinery of the human body. In order to understand the function of these important molecules, a clever technique is used to infer their three-dimensional structure. Shooting an X-ray beam at a protein crystal reveals the electron density, which gives hints on the type and position of the individual atoms. Additional human effort is necessary to fit the protein into this density, but this last step is subjective and error-prone: atoms are overlooked or placed in regions without supporting electron density. For example, in the illustration above, one oxygen is modeled despite the lack of density, and one hydrogen on the top right could just as well be of a different type. Since a medicinal chemist depends on the correctness of the protein structure, a quick way to assess the compatibility of the measured electron density and the inferred atom positions would be highly valuable!
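A starting point could be to sum, for each atom, the density of all grid points within some radius and flag atoms whose support falls below a threshold; the radius and threshold below are arbitrary illustration values:

```python
def density_around(grid, center, radius):
    """Sum the density of all grid points within `radius` of a
    position; `grid` maps (x, y, z) grid coordinates to density."""
    cx, cy, cz = center
    return sum(rho for (x, y, z), rho in grid.items()
               if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= radius ** 2)

def flag_unsupported_atoms(grid, atoms, radius, threshold):
    """Return the atoms whose surrounding density is below `threshold`,
    i.e. candidates for modelling errors."""
    return [a for a in atoms
            if density_around(grid, a["pos"], radius) < threshold]

# Toy grid: strong density at the origin, almost none at (3, 0, 0).
grid = {(0, 0, 0): 5.0, (3, 0, 0): 0.1}
atoms = [{"name": "O1", "pos": (0.0, 0.0, 0.0)},
         {"name": "O2", "pos": (3.0, 0.0, 0.0)}]
flagged = flag_unsupported_atoms(grid, atoms, radius=1.0, threshold=1.0)
```

The converse check (grid regions with high density but no nearby atom) would flag unassigned density in the same spirit.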
Requirements: Javascript or Typescript or similar (no experience in chemistry necessary!)
Contact: Johannes Kirchmair, Nils Kriege
Molecular Puzzle
In this project students apply techniques of non-linear optimization to find the optimal position and rotation of a given molecule in order to optimize its binding to another, static molecular structure. The score of a specific position is calculated using a given function that mimics the forces of physics.
The story behind this is the following: designing a medicine for a given disease can often be broken down into solving this kind of three-dimensional puzzle. Oftentimes, the static molecules are so-called proteins, which control and execute all functions of the human body. These large structures communicate using small molecules as signals. Preventing a disease is often associated with breaking this communication by designing other molecules that block specific proteins. The blocking molecule has to bind well in order to fulfil this task, and this can be evaluated by simulating the optimal position and orientation of the molecule with respect to the given protein. In the simplified 2D illustration above, a large and static protein is indicated by the gray area with some relevant atoms sticking out. The orange glowing molecule can be moved freely, which affects the score. In this example, beneficial interactions of molecule and protein are indicated using green lines, and the orientation on the right would reflect reality more appropriately.
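A toy version of the task (rigid 2-D pose optimization against a stand-in scoring function) can be sketched as follows; the Lennard-Jones-like score and the derivative-free random search are placeholders for the provided scoring function and a proper nonlinear optimizer:

```python
import math
import random

def transform(coords, angle, tx, ty):
    """Rigid 2-D transform: rotate by `angle`, then translate by (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in coords]

def score(pose, protein_atoms):
    """Toy Lennard-Jones-like score (minimum -1 at distance 1 per pair);
    a stand-in for the scoring function the project provides."""
    total = 0.0
    for x, y in pose:
        for px, py in protein_atoms:
            r = max(math.hypot(x - px, y - py), 1e-6)
            total += (1.0 / r) ** 12 - 2.0 * (1.0 / r) ** 6
    return total

def random_search(mol, protein_atoms, iters=2000, seed=0):
    """Derivative-free random search over (angle, tx, ty); a real
    project would use a proper nonlinear optimizer instead."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(iters):
        params = (rng.uniform(0.0, 2.0 * math.pi),
                  rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0))
        s = score(transform(mol, *params), protein_atoms)
        if s < best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Single-atom "molecule" vs. a single protein atom at (2, 0): good poses
# place the atom near the score minimum at distance 1.
best_params, best_score = random_search([(0.0, 0.0)], [(2.0, 0.0)])
```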
Requirements: Python or C++ (no experience in chemistry necessary!)
Contact: Steffen Hirte, Nils Kriege
Causal inference for wind turbine extreme events [Bachelor or Master thesis]
This work will mainly focus on the implementation of algorithms for causal detection among extreme event processes and their testing on real data. The causal inference will be tested on wind turbine time series as well as other relevant meteorological time series, as provided by the Austrian ZAMG (Zentralanstalt für Meteorologie und Geodynamik). More details will be provided to interested students on request.
Contact: Katerina Schindlerova
Clustering of physiological, behavioral and climatic data of wild boars (Sus scrofa) [Bachelor or Master thesis]
This thesis will mainly focus on developing a database combining data retrieved from biologgers (body temperature, heart rate, acceleration data, location data), reproductive success, and climate data collected during a research project on wild boars at the Research Institute of Wildlife Ecology, University of Veterinary Medicine. The aim of this work is a functional database that would allow researchers to access and evaluate data on different aspects of climate change effects on wild boar ecology in the long term.
If the student is interested, this basic approach can be broadened by using and adapting, e.g., a machine learning process to allow the interpretation of acceleration data. We have already used machine learning to classify videotaped behaviors (lying, walking, trotting, lactating) based on acceleration data (collected via ear tags). However, it would be very interesting to see whether machine learning could help us detect daily and annual patterns of activity in these data, affected by, e.g., high ambient temperatures.
We would be happy to find an interested student to support the ecological evaluation of these data.
Contact: Claudia Bieber, Claudia Plant
Clustering of spatiotemporal climatological data [Master thesis]
This work will focus on the integration of massive amounts of spatiotemporal climatological data of heterogeneous spatial and temporal resolution, provided by ZAMG. An effective information-theoretic objective function supporting different lengths of time series and different spatial resolutions will be applied. A joint low-dimensional vector space embedding of all data will be learned, which will then be used as a model for data compression by clustering. Measurement sites with similar characteristics will be close to each other in this space, while sites with different characteristics will be far apart. Deep neural nets and greedy optimization techniques, such as iterative least squares methods, will be considered to learn this space. A clustering algorithm will be developed which works with this representation of the data and iteratively refines it during the clustering process.
Contact: Katerina Schindlerova, Claudia Plant
Reinforcement Learning from Implicit and Explicit Feedback
Recent advances in reinforcement learning have enabled superhuman performance of AI agents in many applications, e.g., game play. In this project you will work on building theory and/or models for reinforcement learning from implicit and explicit feedback. In particular, you will consider the setting in which no rewards are observed during execution of the agent but only cumulative reward information is obtained at the end of the execution. As a practical example, assume playing a computer game in which you cannot observe the score while you play but only see your achieved score at the very end. This setting has important applications in human-in-the-loop settings in which feedback can be expensive to obtain.
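The setting can be illustrated with a REINFORCE-style agent on a toy bandit problem where the per-step rewards stay hidden and only their sum is revealed at episode end; all numbers below are arbitrary illustration values:

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def train(episodes=5000, horizon=5, lr=0.05, seed=0):
    """REINFORCE with only a cumulative terminal reward: the per-step
    rewards stay hidden and the agent observes just their sum at the
    end of each episode. Toy 2-armed bandit; arm 1 pays more."""
    rng = random.Random(seed)
    prefs = [0.0, 0.0]                  # action preferences (logits)
    arm_mean = [0.2, 0.8]               # hidden per-step expected rewards
    baseline = 0.0                      # running average of returns
    for _ in range(episodes):
        actions, ret = [], 0.0
        for _ in range(horizon):
            a = 0 if rng.random() < softmax(prefs)[0] else 1
            actions.append(a)
            ret += arm_mean[a] + rng.gauss(0.0, 0.1)   # hidden until episode end
        adv = ret - baseline            # only the terminal return is used
        for a in actions:
            p = softmax(prefs)
            for i in range(2):
                prefs[i] += lr * adv * ((1.0 if i == a else 0.0) - p[i])
        baseline += 0.05 * (ret - baseline)
    return softmax(prefs)

probs = train()
```

Despite never seeing individual rewards, the agent learns to prefer the better arm; the credit assignment across steps is exactly what makes this setting harder than standard RL.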
Students working on this project need basic background knowledge in machine learning, solid programming skills in Python, and the desire to work on reinforcement learning.
Contact: Sebastian Tschiatschek
Machine Learning for Personalized Education [Practical Course or Bachelor Thesis]
In this project you will work on predicting students' responses to mathematical questions using machine learning models. This is important to assess a student's skills and provide personalized educational resources. We use realworld data provided by the "Diagnostic Questions: Predicting Student Responses and Measuring Question Quality" challenge at NeurIPS'20. The challenge consists of four different tasks from which a subset can be selected for the project, depending on background knowledge and programming skills.
Students working on this project need basic background knowledge in machine learning, programming skills in Python, and the desire to develop and test machine learning models.
Contact: Sebastian Tschiatschek
Probabilistic Models of Multimodal Distributions
Many real-world phenomena can only be accurately described by multimodal distributions, e.g., the hourly power consumption of a household is (typically) bimodal, with peaks in the morning and the evening. However, learning deep generative models of these multimodal distributions, e.g., using variational autoencoders, remains challenging. In this project you will develop theory and/or models for multimodal distributions based on variational autoencoders. You will evaluate these models on benchmark and real-world datasets.
Students working on this project need basic background knowledge in machine learning, programming skills in Python, and the desire to develop and test machine learning models.
Contact: Sebastian Tschiatschek
Reward Inference for Sequential Decision Making from Diverse and Implicit Feedback [Master Thesis]
Automated sequential decision making is an important application of machine learning systems, in which such a system needs to select a sequence of actions step by step to optimize a reward/utility function. For instance, in autonomous driving, such a system needs to execute a sequence of steering, braking, and acceleration actions, or in a medical intensive care setting, such a system needs to execute a sequence of measurement and treatment actions.
One challenge in realizing such automated sequential decision making systems is the definition of the reward/utility function. For example, in autonomous driving it is hard to specify all the factors which define good driving behavior. In such settings, automatically inferring the reward/utility function from users’ feedback can be beneficial.
This project investigates approaches for reward/utility inference from diverse and implicit feedback, building on ideas for inverse reinforcement learning, active learning, implicit feedback, etc.
Interested students are expected to have solid mathematical and machine learning skills, and have experience in Python and deep learning (using PyTorch or TensorFlow).
Contact: Sebastian Tschiatschek
Imitation Learning Under Domain Mismatch
Reinforcement learning has been successfully used to solve certain challenging sequential decision making problems in recent years. The employed techniques commonly require (i) huge amounts of interactions with the environment and (ii) clearly specified reward signals to work well. In many applications, however, one or both of these requirements are not met. In such cases, imitation learning can be an efficient approach to sequential decision making problems: an expert demonstrates near-optimal behavior and a learning agent attempts to mimic this behavior.
This project considers imitation learning in settings in which there is some form of mismatch between the expert demonstrator and the learning agent. The scope of the project is to study how existing algorithms perform in this setting and to propose modifications to existing algorithms to achieve better performance.
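The simplest instantiation of this idea is behavioral cloning: fit a policy to the expert's (state, action) pairs by supervised learning. A hypothetical linear toy example:

```python
import numpy as np

# Behavioral cloning, the simplest imitation-learning baseline: fit a
# policy to expert (state, action) demonstrations by regression.
# Toy setup (all names hypothetical): a linear expert on 3-d states.
rng = np.random.default_rng(1)
K_expert = np.array([[0.5, -1.0, 0.2]])   # the expert's (unknown) linear policy
states = rng.normal(size=(100, 3))
actions = states @ K_expert.T             # expert demonstrations

# Least-squares fit of a linear policy to the demonstrations
K_hat, *_ = np.linalg.lstsq(states, actions, rcond=None)
print(np.round(K_hat.T, 3))               # recovers K_expert on noiseless data
```

Under domain mismatch (e.g. the expert and agent observe different features or have different dynamics), this naive fit degrades, which is exactly the effect the project would study.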
Students wanting to work on this topic are expected to have a basic understanding of machine learning techniques, solid knowledge of Python and basic knowledge of deep learning libraries (PyTorch or TensorFlow).
Contact: Sebastian Tschiatschek
Selecting Sequences of Items for Non-monotone Functions [Bachelor or Master Thesis]
Many applications involve the selection of sequences of items, e.g., recommender systems and personalization in MOOCs. Common to many of these applications is that the order of the selected items is not random, but that items selected early influence which items are selected later. The problem of selecting sequences of items has been studied in various settings, including that in which dependencies between items are modeled using graphs and monotone submodular functions.
This project aims at extending these settings to cover the case of non-monotone submodular functions by proposing new algorithms and analyzing their properties. The findings are validated by an implementation of the proposed algorithm(s) and a comparison against reasonable baselines.
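For the monotone case, the standard greedy algorithm (with its (1 - 1/e) approximation guarantee under a cardinality constraint) is the natural baseline to start from; a toy coverage example, with hypothetical items and topics:

```python
# Greedy baseline for monotone submodular maximization, here on a
# set-coverage objective; the thesis would extend this to non-monotone
# objectives, e.g. via a randomized greedy rule. Items are hypothetical.
items = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}

def coverage(chosen):
    """Monotone submodular objective: number of topics covered."""
    return len(set().union(*(items[i] for i in chosen))) if chosen else 0

selected, k = [], 2
for _ in range(k):
    # pick the item with the largest marginal gain
    gains = {i: coverage(selected + [i]) - coverage(selected)
             for i in items if i not in selected}
    selected.append(max(gains, key=gains.get))

print(selected, coverage(selected))  # → ['a', 'c'] 6
```

For non-monotone functions, marginal gains can be negative, so plain greedy loses its guarantee; that is precisely where new algorithms and analysis are needed.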
Students wanting to work on this topic are expected to have solid mathematical skills, a basic understanding of machine learning techniques, solid knowledge of Python and deep learning libraries (PyTorch or TensorFlow).
Contact: Sebastian Tschiatschek
Posterior Consistency in Partial Variational Autoencoders
Variational Autoencoders (VAEs) are powerful deep generative models that have been successfully applied in a wide range of machine learning applications. Recently, the Partial VAE (PVAE), a variant of VAEs that can process partially observed inputs, has been proposed, and its effectiveness for data imputation has been demonstrated. Key to the fast training of VAEs and PVAEs is the amortized prediction of posterior distributions from observations. In PVAEs, these posterior distributions are predicted from partial observations.
This project aims at studying the consistency of these posterior distributions for different patterns of missing data. The insights are used to create/train better inference models and thereby improve the quality of PVAEs.
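A minimal sketch (not the actual PVAE set encoder) of amortized inference from partial observations: missing entries are zero-imputed and the observation mask is appended before a linear map predicts the Gaussian posterior parameters. Comparing posteriors under different masks is the consistency question of the project:

```python
import numpy as np

# Sketch of an amortized encoder for partially observed inputs
# (hypothetical weights, single linear layer instead of a deep net):
# zero-impute missing entries, concatenate the observation mask, and
# predict Gaussian posterior parameters (mu, log_var).
rng = np.random.default_rng(2)
d_in, d_z = 4, 2
W = rng.normal(scale=0.1, size=(2 * d_in, 2 * d_z))  # encoder weights

def encode(x, mask):
    """x: data vector; mask: 1 where observed, 0 where missing."""
    h = np.concatenate([np.where(mask == 1, x, 0.0), mask])
    out = h @ W
    return out[:d_z], out[d_z:]          # mu, log_var

x = np.array([0.5, -1.0, 2.0, 0.3])
mu_full, _ = encode(x, np.ones(4))                 # fully observed
mu_part, _ = encode(x, np.array([1.0, 1.0, 0.0, 0.0]))  # half observed
# Posterior consistency: how far apart are the two predicted posteriors?
print(np.round(mu_full - mu_part, 3))
```

Systematically measuring such discrepancies across missingness patterns, and regularizing the inference network to reduce them, is one way the project's goal could be operationalized.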
Students wanting to work on this topic are expected to have solid mathematical skills, a basic understanding of machine learning techniques and good programming skills in Python.
Contact: Sebastian Tschiatschek
Evaluation of different nowcasting techniques and data sources for temperature nowcasting
Nowcasting in meteorology refers to short-term prediction of time series, e.g., a few minutes to hours into the future. For these short-term predictions, the most recent observations are important, as are spatial relationships to upstream observations.
Different methods for nowcasting exist in the meteorological world, often physics-driven or physics-statistical. Only a few studies have focused on applying machine learning and data mining techniques. For the latter, crowdsourced data can nowadays also contain important information; here, data from NetAtmo stations could be valuable. A previous study already investigated different machine learning algorithms for nowcasting of temperature using data from the Austrian MetService (ZAMG). The idea of the proposed thesis is to combine the already existing algorithms with the NetAtmo data by first clustering the ZAMG and NetAtmo data and then developing an algorithm that incorporates the NetAtmo data. Importantly, NetAtmo sites are prone to errors and thus need to be cleaned beforehand.
The developed algorithm will be evaluated against two statistical forecasting methods: an analogue-search-based method and a model-output-statistics method.
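As a minimal, self-contained illustration of the nowcasting task (on synthetic data, since the ZAMG/NetAtmo observations are not public), a linear autoregressive model on recent observations can be compared against a simple persistence baseline:

```python
import numpy as np

# Purely illustrative nowcasting baseline on synthetic temperature data
# (a daily-cycle sinusoid plus noise, standing in for station data):
# predict the next value from the k most recent observations with a
# linear autoregressive model, versus persistence (last value repeated).
rng = np.random.default_rng(3)
t = np.arange(500)
temp = 10 + 8 * np.sin(2 * np.pi * t / 144) + rng.normal(0, 0.3, 500)

k = 6                                              # last 6 observations as features
X = np.stack([temp[i:i + k] for i in range(len(temp) - k)])
y = temp[k:]
w, *_ = np.linalg.lstsq(X[:400], y[:400], rcond=None)  # train on first 400 steps

mae_ar = np.abs(X[400:] @ w - y[400:]).mean()          # autoregressive model
mae_persist = np.abs(X[400:, -1] - y[400:]).mean()     # persistence baseline
print(round(float(mae_ar), 3), round(float(mae_persist), 3))
```

A real evaluation would add spatial features from neighboring (and cleaned NetAtmo) stations and compare against the statistical reference methods named above.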
Contact: Irene Schicker, Claudia Plant
Exploratory Data Analysis on the GPU - CUDA Warp-Level Primitives and Independent Thread Scheduling [Bachelor or Master Thesis]
Volta’s new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads, and its new combined L1 data cache and shared memory subsystem significantly improves performance while also simplifying programming. Here, we will enhance traditional data mining algorithms with the use of the GPU and its independent thread scheduling based on CUDA intrinsics. Candidate algorithms are KMeans, DBSCAN, the Apriori algorithm, or dimensionality reduction techniques such as SVD or PCA.
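A minimal CPU reference of KMeans, one of the candidate algorithms, shows the two steps that map naturally onto GPU parallelism: the per-point assignment and the per-cluster reduction. This is a sketch on toy data, not the CUDA implementation itself:

```python
import numpy as np

# CPU reference of KMeans on two synthetic Gaussian blobs; the project
# would map the assignment step (parallel over points) and the update
# step (a reduction per cluster) onto CUDA warp-level primitives.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)), rng.normal(3.0, 0.2, (50, 2))])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])     # fixed initial centers

for _ in range(10):
    # assignment step: nearest center per point (embarrassingly parallel)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(1)
    # update step: per-cluster mean (a parallel reduction on the GPU)
    centers = np.stack([X[labels == k].mean(0) for k in range(2)])

print(np.round(np.sort(centers[:, 0]), 1))       # centers near the two blobs
```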
Contact: Martin Perdacher
Exploratory Data Analysis with Google TPU [Bachelor or Master Thesis]
The Tensor Processing Unit (TPU) was announced in 2016 at Google I/O, when the company said that the TPU had already been used inside their data centers for over a year. The chip has been specifically designed for Google’s TensorFlow framework, a symbolic math library used for machine learning applications such as neural networks. Here, we will enhance traditional data mining algorithms with the use of the TPU. Candidate algorithms are KMeans, DBSCAN, the Apriori algorithm, or dimensionality reduction techniques such as SVD or PCA.
Contact: Martin Perdacher
Causal inference among climatological time series with extreme events [Master thesis]
This work will focus mainly on the implementation of an algorithm for causal detection among processes, as well as on testing it on heavy-tailed probability distributions and on climatological data provided by ZAMG (Zentralanstalt für Meteorologie und Geodynamik) following such distributions. More details will be provided to interested students on request.
Contact: Katerina Schindlerova
Data Mining on Real World Accelerometer Time Series [Master thesis]
In data mining and machine learning, the choice of distance measure is a crucial design decision that strongly depends on the data structure and application. There are plenty of distance measures: Euclidean, Manhattan, edit distance, Dynamic Time Warping, SAX, Hamming, and many more. When analyzing time series data, key information is given by the ordering of the observations, so the distance measure of choice should also regard this ordering, as Dynamic Time Warping (DTW) does.
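As a starting point, DTW can be implemented with the standard O(nm) dynamic program; the toy series below (hypothetical data) show how DTW, unlike a point-wise Euclidean comparison, is invariant to a small time shift:

```python
import math

# Dynamic Time Warping distance between two 1-d series, via the
# standard O(n*m) dynamic program with squared point-wise costs.
def dtw(a, b):
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# The same shape shifted by one step: DTW absorbs the shift, the
# point-wise squared Euclidean distance does not.
a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]
print(dtw(a, b), sum((x - y) ** 2 for x, y in zip(a, b)))  # → 0.0 4
```

For accelerometer data, one would typically add a warping-window constraint (e.g. Sakoe-Chiba) to keep the comparison both fast and physically plausible.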
Further research questions are: How much data is enough? How many features are enough? What sampling rate is high enough?
You will start with a literature review of time series distance measures and test them on accelerometer time series for supervised and unsupervised data mining tasks.
Contact: Maximilian Leodolter
Implementation of a Data Mining Approach for Short-range Temperature Forecasts
Short-range forecasts of wind speeds (i.e., 1-72 hours into the future) and in particular nowcasting (i.e., very short-range forecasts with a time horizon of up to 6 hours) are vital for a wide range of applications. In contrast to wind, temperature typically changes gradually and may thus need a different setup than wind speed. Temperature involves daily fluctuations, which are well predictable but dependent on the time of day, which must be considered in the training dataset. Depending on the location, rapid temperature changes may also occur, related, for example, to cold air pools or Föhn. Thus, temperature also highly depends on the location (specific topography, prevailing weather conditions, and atmospheric dynamics). Depending on the application, we can give either point-based or area-based predictions. Points refer to a particular location (e.g., a weather station), whereas spatial forecasts typically give a forecast for each grid cell over a region (i.e., each forecast is valid for the whole cell).
ZAMG (Zentralanstalt für Meteorologie und Geodynamik) employs (gridded) numerical weather prediction models in conjunction with observation data for short-range forecasting, and a nowcasting system, INCA, for the prediction of meteorological parameters. Alternatively, machine learning methods are now being implemented. In particular, an artificial neural network (ANN) and a random forest (RF) are used in an experimental setup to show the skill of these methods and, possibly, to serve as an additional point forecasting method for the 10-meter wind speed. The existing methods can be used for temperature as well with the same training set, but the setup needs to be adapted.
The proposed student project shall address temperature forecasting (at two meters height) at meteorological observation sites by machine learning and data mining methods (e.g., random forests, feed-forward/back-propagation artificial neural networks, kernel methods, etc.) and input feature selection for the training data set. It is possible to experiment with related meteorological parameters as well (e.g., dew point temperature and relative humidity). Related work suggests using a back-propagation neural network for predicting the 2 m dew point temperature and the 2 m temperature. This can be used as a starting point to find a suitable selection of the training data for the current model and then to extend our current approaches with a back-propagation neural network in order to set up a first prototype for temperature prediction by machine learning methods. The new model shall be tested on various scenarios (e.g., different prevailing weather conditions, locations, seasons) in order to compare the new data mining based model with the currently employed nowcasting system INCA.
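To make the back-propagation baseline concrete, here is a minimal feed-forward network trained with plain gradient descent in NumPy; the data are synthetic stand-ins, not ZAMG features:

```python
import numpy as np

# Minimal one-hidden-layer feed-forward network with back-propagation,
# of the kind mentioned above for 2 m temperature regression.
# Toy regression data (hypothetical) stand in for meteorological inputs.
rng = np.random.default_rng(5)
X = rng.normal(size=(256, 3))                    # toy input features
y = (np.sin(X[:, 0]) + 0.5 * X[:, 1])[:, None]   # toy regression target

W1, b1 = rng.normal(scale=0.5, size=(3, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)

for _ in range(2000):                            # full-batch gradient descent
    h = np.tanh(X @ W1 + b1)                     # forward pass
    pred = h @ W2 + b2
    g = 2 * (pred - y) / len(X)                  # d(MSE)/d(pred)
    gW2, gb2 = h.T @ g, g.sum(0)                 # backprop through layer 2
    gh = (g @ W2.T) * (1 - h ** 2)               # backprop through tanh
    gW1, gb1 = X.T @ gh, gh.sum(0)               # backprop through layer 1
    W1 -= 0.05 * gW1; b1 -= 0.05 * gb1
    W2 -= 0.05 * gW2; b2 -= 0.05 * gb2

print(round(float(((pred - y) ** 2).mean()), 3))  # training MSE after fitting
```

In the actual project, such a network would be trained on the selected ZAMG input features and validated against the INCA nowcasting system.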
The work is co-supervised by ZAMG's section for numerical weather prediction (NWP) applications. The developed method shall have a Python-based front end and a C/C++ back end, and use CSV- or SQLite-based meteorological data (provided by ZAMG) in order to align with other machine learning implementations running in our IT environment. Finally, the developed method will be set up in our development environment (Python 2.7/Linux 64-bit, multi-core shared-memory machine) to provide forecasts and validation of the method for selected test scenarios.
Contact: Petrina Papazek, Claudia Plant
SIGMOD programming contest
The ACM Special Interest Group on Management of Data organizes an annual programming contest. The contest and its content will be announced in February on the SIGMOD website, see: https://sigmod2020.org/. The prize pool is often up to $5000 for computing resources (cloud access). For this "Praktikum" we organize a team of 2 to 4 people.
Contact: Martin Perdacher
Completed
Yigit Berkay Bozkurt, Bachelor Thesis: "Anomaly Detection by Heterogeneous Graphical Granger Causality and its Application to Climate Data", 2019
Christina Pacher, Bachelor Thesis: "Clustering Weather Stations: A Clustering Application for Meteorological Data", summer term 2019
Thomas Spendlhofer, Bachelor Thesis: "Evaluating the usage of Tensor Processing Units (TPUs) for unsupervised learning on the example of the k-means algorithm", summer term 2019
Ernst Naschenweng, Bachelor Thesis: "A cache optimized implementation of the Floyd-Warshall Algorithm", summer term 2018
Hermann Hinterhauser, Bachelor Thesis: "ITGC: Information-theoretic grid-based clustering", summer term 2018, accepted paper in EDBT 2019 (download available here)
Mahmoud A. Ibrahim, Bachelor Thesis: "Parameter-Free Mixed-Type Density-Based Clustering", winter term 2017/2018, accepted paper in DEXA 2018 (download available here)
Markus Tschlatscher: "Space-Filling Curves for Cache-Efficient LU Decomposition", winter term 2017/18
Theresa Fruhwuerth, Master Thesis: "Uncovering High Resolution Mass Spectrometry Patterns through Audio Fingerprinting and Periodicity Mining Algorithms: An Exploratory Analysis", summer term 2017
Robert Fritze, PR1 "Combining spatial information and optimization for locating emergency medical service stations: A case study for Lower Austria", summer term 2017
Alexander Pfundner, PR2 "Integration of Density-based and Partitioning-based Clustering Methods", summer term 2017
Anton Kovác, Katerina Hlavácková-Schindler, Erasmus project, "Graphical Granger Causality for Detection of Temporal Anomalies in EEG Data", winter term 2016/17 (download available here)