Open topics for theses and practical courses

All topics are supervised by Claudia Plant. For more information on a topic, please contact the person listed under it.

 

Practical Course: Pharmacoinformatics Research Group

The Pharmacoinformatics Research Group is led by Univ.-Prof. Dr. Gerhard Ecker at the Department of Pharmaceutical Chemistry.

Following a holistic pharmacoinformatic approach, we combine structural modeling of proteins, structure-based drug design, chemometric and in silico chemogenomic methods, statistical modeling and machine learning approaches to develop predictive computational systems for transporters and ion channels.

We work with workflow management systems such as KNIME for data integration, do statistical analysis in R, and program predictive models in Python. At times we offer these tools to fellow researchers, and this is where you come in: we often need help making our tools openly accessible, for example by turning them into a web service or into standalone software.

For a recent tool take a look at our LiverTox Workspace.

Contact: Jana Gurinova


Practical Course, Master or Bachelor Thesis: Evaluation of different nowcasting techniques and data bases for temperature nowcasting

Nowcasting in meteorology refers to short-term prediction of time series, e.g. from a few minutes to a few hours into the future. For these short-term predictions, the most recent observations are important, as are spatial relationships with upstream observations.

Different methods for nowcasting exist in the meteorological world, often physics-driven or physics-statistics-driven. Only a few studies have focused on applying machine learning and data mining techniques. For the latter, crowd-sourced data can nowadays also contain important information; here, data from NetAtmo stations could prove valuable. A previous study has already investigated different machine learning algorithms for temperature nowcasting using data from the Austrian Met-Service (ZAMG). The idea of the proposed thesis is to combine these existing algorithms with the NetAtmo data by first clustering the ZAMG and NetAtmo data and then developing an algorithm that incorporates the NetAtmo data. Note that NetAtmo sites are prone to errors, so their data need to be cleaned beforehand.

The developed algorithm will be evaluated against two statistical forecasting methods: an analogue-search-based method and a model output statistics method.

Contact: Irene Schicker


Practical Course, Master or Bachelor Thesis: IoT data outlier detection for urban meteorology

Crowd-sourced data are rapidly gaining importance for a wide range of meteorological applications, such as urban heat island (UHI) studies. Differences in land cover between a city and the surrounding areas (e.g. concrete and asphalt vs trees and meadows) result in temperatures generally being higher in cities (referred to as the UHI effect), with additional intra-urban hot spots caused by local differences in urban fabric. Detailed information about the temperature distribution is important, as heat directly influences human health and wellbeing.

Traditionally, high-quality weather stations that are part of the operational observational network have been used to study this effect. However, this network is too coarse to obtain information at the high spatial resolution needed for the identification of these hot spots. Through the mainstreaming of Internet-of-Things (IoT) applications, the number of personal weather stations has increased dramatically in recent years, providing a cost-effective opportunity to obtain higher-resolution information.

For Vienna, this results in over 1400 data points (a number that is expected to continue to grow) in addition to the nine stations in the operational network, albeit with unknown quality. Therefore, a stringent outlier detection is necessary, which should ideally be fast and retain as much useful information as possible.
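One simple baseline for such a filter, given here only as a sketch and not as the method intended for the thesis, is a robust z-score based on the median and the median absolute deviation (MAD) of simultaneous readings. The function name `mad_outliers`, the threshold of 3.5 and the sample readings are illustrative assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose robust z-score exceeds the threshold.

    The robust z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: the score is undefined, flag nothing.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical simultaneous temperature readings in degrees Celsius.
readings = [21.3, 21.8, 22.1, 21.5, 35.0, 21.9, 22.4, 8.2, 21.7]
print(mad_outliers(readings))  # [4, 7]
```

Because the median and MAD are barely affected by a few extreme values, the two implausible readings do not distort the scale used to judge the others. A real filter would additionally have to account for spatial and temporal context, which this sketch ignores.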

Contact: Rosmarie de Wit


Master or Bachelor Thesis: Exploratory Data Analysis on the GPU - CUDA Warp Level Primitives and Independent Thread Scheduling

The independent thread scheduling capability introduced with NVIDIA's Volta architecture enables finer-grained synchronization and cooperation between parallel threads. In addition, a combined L1 data cache and shared memory subsystem significantly improves performance while also simplifying programming. Here, we will accelerate traditional data mining algorithms on the GPU using its independent thread scheduling and CUDA warp-level intrinsics. Candidate algorithms are K-Means, DBSCAN, the Apriori algorithm, or dimensionality reduction techniques such as SVD or PCA.

Contact: Martin Perdacher


Master or Bachelor Thesis: Exploratory Data Analysis with Google TPU

The Tensor Processing Unit (TPU) was announced in 2016 at Google I/O, when the company said that the TPU had already been used inside their data centers for over a year. The chip has been specifically designed for Google’s TensorFlow framework, a symbolic math library which is used for machine learning applications such as neural networks. Here, we will accelerate traditional data mining algorithms using the TPU. Candidate algorithms are K-Means, DBSCAN, the Apriori algorithm, or dimensionality reduction techniques such as SVD or PCA.

Contact: Martin Perdacher


Master or Bachelor Thesis: Predicting driving behavior based on smartphone data

Identifying drivers who have a higher risk of getting involved in traffic accidents can have several real-world benefits. Drivers can be made aware of their risky behavior, other road users can be warned early, and insurance companies can improve risk assessments and reduce information asymmetry. The models used for risk prediction consist of many submodules, which are embedded in a larger pipeline. The aim of this project is to predict whether a car is being driven in a rural or an urban area, based on data from smartphone sensors. This context information is an important feature for risk prediction models and is currently only available by labeling the data by hand.

The project focuses on the evaluation of more than one million collected user trips from different countries. The data consists of time series from smartphone sensors and cross-sectional metadata. During the project, the entire KDD process is applied, from pre-processing and feature engineering to modelling and evaluation. The scope of the thesis can be discussed and decided based on individual strengths and preferences. Students wanting to work on this project are expected to have a basic understanding of machine learning and data mining techniques and a solid knowledge of Python, preferably including its data science stack (pandas, scikit-learn, numpy, ...).

Contact: Lukas Miklautz


Practical Course or Master Thesis: Deep Learning based Clustering

Deep embedded clustering, or deep clustering, is a growing field that combines ideas from clustering and deep learning. Integrating these techniques makes it possible to learn features automatically from the data and to improve clustering performance at the same time. Deep clustering covers different areas of machine learning and data mining, namely deep learning, clustering, dimensionality reduction, matrix factorization and representation learning, which makes it an intriguing research direction. Interested students can work on the latest research by implementing recently published algorithms and/or supporting the development of new algorithms.

The scope of the project can be discussed and decided based on individual strengths and preferences. Students wanting to work on this topic are expected to have a basic understanding of machine learning and data mining techniques and a solid knowledge of Python, preferably including its data science stack (pandas, scikit-learn, numpy, …) and deep learning libraries (pytorch, tensorflow, …).

Contact: Lukas Miklautz


Practical Course or Bachelor Thesis: Causality Among Heterogeneous Processes by MDL

The minimum description length (MDL) principle can be used for causal inference. The goal of this practical course/thesis is to run experiments with an existing toolbox for causal inference by MDL among heterogeneous random processes, or to make an experimental comparison with a rival method. No previous knowledge of causal inference, MDL or advanced statistical methods is necessary for this work.

Contact: Katerina Schindlerova


Master or Bachelor Thesis: Implementation of an Algorithm for Anomaly Detection in Gait Time Series and its Testing

The goal of this thesis is to implement a temporal anomaly detection algorithm and test it on time series of human gait from the motion capture database MoCap. Depending on whether it is a bachelor or a master thesis, the work will be extended based on queries in gait recognition.

Contact: Katerina Schindlerova


Master or Bachelor Thesis: Mining in time series: Nonlinear and non-additive causal modeling

The goal of this thesis is to develop models and algorithms able to detect causal patterns in time series which are of a nonlinear or non-additive nature. The models and algorithms will be validated against state-of-the-art causal models.

Contact: Katerina Schindlerova


Master or Bachelor Thesis: Spectral clustering and graph sparsification

Spectral clustering is a well-known graph-clustering approach that is both efficient and effective because it does not cluster data points directly in their native data space but instead works from a similarity matrix. Graph sparsification, i.e. approximating a given graph by a graph with fewer edges or vertices, is a useful and versatile primitive in designing efficient graph algorithms. There are various criteria for obtaining a sparse graph. In this project we aim to sparsify complex graphs while preserving their clustering structure.
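To illustrate how clustering from a similarity matrix works, the following minimal sketch splits a small graph into two clusters using the sign of the Fiedler vector, i.e. the eigenvector belonging to the second-smallest eigenvalue of the unnormalized graph Laplacian. It assumes NumPy is available; the function name `spectral_bipartition` and the toy similarity matrix are illustrative assumptions, and the sketch does not perform any sparsification.

```python
import numpy as np


def spectral_bipartition(similarity):
    """Split a graph into two clusters using the sign of the Fiedler vector."""
    w = np.asarray(similarity, dtype=float)
    degree = np.diag(w.sum(axis=1))
    laplacian = degree - w  # unnormalized graph Laplacian L = D - W
    # For symmetric matrices, eigh returns eigenvalues in ascending order.
    _, vectors = np.linalg.eigh(laplacian)
    fiedler = vectors[:, 1]  # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).astype(int)


# Hypothetical similarity matrix: two triangles joined by one weak edge.
W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
labels = spectral_bipartition(W)
print(labels)
```

Which cluster is labeled 0 and which 1 depends on the sign of the eigenvector returned by the solver, so only the grouping is meaningful. In this toy graph, removing the weak edge between the two triangles would leave the grouping unchanged, which is the kind of structure a cluster-preserving sparsifier would aim to keep.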

Contact: Sahar Behzadi Soheil


Master Thesis: Data Mining on Real World Accelerometer Time Series

In Data Mining and Machine Learning, the choice of distance measure is a crucial design decision that strongly depends on the data structure and application. There are plenty of distance measures: Euclidean, Manhattan, edit distance, Dynamic Time Warping, SAX, Hamming, and many more. When analyzing time series data, key information is given by the ordering of the observations, so the distance measure of choice should also take this ordering into account, as Dynamic Time Warping (DTW) does.
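As a rough illustration of how DTW respects the ordering of observations, the sketch below computes the classic dynamic-programming DTW distance with the absolute difference as local cost. The function name `dtw_distance` and the two toy sequences are illustrative assumptions; practical implementations usually add a warping window and other optimizations.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Uses the absolute difference as local cost and no warping window.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


x = [0, 1, 2, 1, 0]
y = [0, 0, 1, 2, 1, 0]  # same shape as x, delayed by one step
print(dtw_distance(x, y))  # 0.0
```

The two sequences have different lengths, so a point-wise Euclidean distance is not even defined for them, whereas DTW aligns the delayed peak and reports a distance of zero.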

Further research questions are: How much data is enough? How many features are enough? What sampling rate is high enough?

You will start with a literature review of time series distance measures and then test them on accelerometer time series for supervised and unsupervised data mining tasks.

Contact: Maximilian Leodolter


Master Thesis: Predicting urban heat island intensity in a changing climate

Cities are critical areas where climate change is expected to have severe impacts. One of the well-known problems is the increase in temperatures in cities as compared to their surroundings caused by the modification of the energy balance in the built-up urban environment. This phenomenon is called the urban heat island effect, which is responsible for enhanced heat stress, health risks and reduced quality of life for the urban population. Although urban heat islands are primarily characterized by the structure of the urban environment, their intensity and pattern depend on the large-scale atmospheric conditions too. Combining weather pattern statistics and high-resolution urban heat island information, this connection can be investigated in detail.

The aim of this project is to determine the relationship between large-scale weather patterns and the urban heat island intensity and pattern of Vienna using machine learning techniques. As a second step, predictions of the urban heat island intensity in a changing climate can be performed using those results combined with climate model data for past and future climatic periods.

The work is co-supervised by the Urban Modelling team (part of the Section for Numerical Weather Prediction Applications) of the ZAMG (Zentralanstalt für Meteorologie und Geodynamik, Austria’s national weather service), and is part of the team’s research activities. As a result, the programming environment used within the project should be compatible with ZAMG’s IT infrastructure.

Contact: Rosmarie de Wit


Implementation of Data Mining Approach for Short-range Temperature Forecasts

Short-range forecasts of wind speeds (i.e., 1 - 72 hours into the future) and in particular nowcasting (i.e., very short-range forecasts with a time horizon of up to 6 hours) are vital for a wide range of applications. In contrast to wind, temperature typically changes gradually and may thus need a different setup than wind speed. Temperature follows a daily cycle, which is well predictable but depends on the time of day; this must be taken into account in the training dataset. Depending on the location, rapid temperature changes may also occur, related, for example, to cold air pools or Föhn. Thus, temperature also depends strongly on location (specific topography, prevailing weather conditions, and atmospheric dynamics). Depending on the application, we can give either point-based or area-based predictions. Point forecasts refer to a particular location (e.g., a weather station), whereas spatial forecasts typically give a forecast for each grid cell over a region (i.e., each forecast is valid for the whole cell).
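One common way to make the time of day usable as an input feature, given here only as a sketch and not as the setup used at ZAMG, is to map the hour onto the unit circle with a sine and a cosine component, so that times just before and just after midnight end up close together. The function names `encode_hour` and `circular_gap` are illustrative assumptions.

```python
import math


def encode_hour(hour):
    """Map an hour of day to a (sin, cos) point on the unit circle."""
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    return math.sin(angle), math.cos(angle)


def circular_gap(h1, h2):
    """Euclidean distance between the encodings of two hours of day."""
    s1, c1 = encode_hour(h1)
    s2, c2 = encode_hour(h2)
    return math.hypot(s1 - s2, c1 - c2)


# 23:00 and 01:00 are two hours apart, just like 11:00 and 13:00.
print(abs(circular_gap(23, 1) - circular_gap(11, 13)) < 1e-9)  # True
```

With the raw hour as a feature, 23:00 and 01:00 would appear 22 units apart, although the corresponding points of the daily temperature cycle are only two hours from each other.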

ZAMG (Zentralanstalt für Meteorologie und Geodynamik) employs (gridded) numerical weather prediction models in conjunction with observation data for short-range forecasting and a nowcasting system, INCA, for the prediction of meteorological parameters. Alternatively, machine learning methods are now being implemented. In particular, an artificial neural network (ANN) and a random forest (RF) are used in an experimental setup to demonstrate the skill of these methods and, possibly, to serve as an additional point forecasting method for the 10-meter wind speed. The existing methods can be used for temperature as well with the same training set, but the setup needs to be adapted.

The proposed student project shall address temperature forecasting (at a height of two meters) at meteorological observation sites using machine learning and data mining methods (e.g., random forests, feed-forward/back-propagation artificial neural networks, kernel methods, etc.) and input feature selection for the training data set. It is also possible to experiment with related meteorological parameters (e.g., dew point temperature and relative humidity). Related work suggests using a back-propagation neural network to predict the 2-m dew point temperature and the 2-m temperature. This can be used as a starting point to find a suitable selection of training data for the current model and then to extend our current approaches by a back-propagation neural network, in order to set up a first prototype for temperature prediction by machine learning methods. The new model shall be tested on various scenarios (e.g., different prevailing weather conditions, locations, seasons) in order to compare the new data-mining-based model with the currently employed nowcasting system INCA.

The work is co-supervised by ZAMG's Section for Numerical Weather Prediction (NWP) Applications. The developed method shall have a Python-based frontend and a C/C++ backend, and use csv- or sqlite-based meteorological data (provided by ZAMG) in order to align with other machine learning implementations running in our IT environment. Finally, the developed method will be set up in our development environment (Python 2.7/Linux 64-bit, multi-core shared-memory machine) to provide forecasts and validation of the method for selected test scenarios.

Contact: Petrina Papazek

 

Completed

     

  • Hermann Hinterhauser, Bachelor Thesis: "ITGC: Information-theoretic grid-based clustering", summer term 2018, accepted paper in EDBT 2019 (download available here)

  • Mahmoud A. Ibrahim, Bachelor Thesis: "Parameter Free Mixed-Type Density-Based Clustering", winter term 2017/2018, accepted paper in DEXA 2018 (download available here)

  • Markus Tschlatscher: "Space-Filling Curves for Cache Efficient LU Decomposition", winter term 2017/18

  • Theresa Fruhwuerth, Master Thesis: "Uncovering High Resolution Mass Spectrometry Patterns through Audio Fingerprinting and Periodicity Mining Algorithms: An Exploratory Analysis", summer term 2017

  • Robert Fritze, PR1 "Combining spatial information and optimization for locating emergency medical service stations: A case study for Lower Austria", summer term 2017

  • Alexander Pfundner, PR2 "Integration of Density-based and Partitioning-based Clustering Methods", summer term 2017

  • Anton Kovác, Katerina Hlavácková-Schindler, Erasmus project, "Graphical Granger Causality for Detection Temporal Anomalies in EEG Data", winter term 2016/17 (download available here)