Open topics for theses and practical courses

 

Practical Course: Pharmacoinformatics Research Group

The Pharmacoinformatics Research Group is led by Univ.-Prof. Dr. Gerhard Ecker at the Department of Pharmaceutical Chemistry.

Following a holistic pharmacoinformatic approach we combine structural modeling of proteins, structure-based drug design, chemometric and in silico chemogenomic methods, statistical modeling and machine learning approaches to develop predictive computational systems for transporters and ion channels.

We work with workflow management systems like KNIME for data integration, do statistical analysis in R, and program predictive models in Python; at times we offer these tools to fellow researchers. This is where you come in: we often need help in making our tools openly accessible, for example by translating them into a web service or turning them into standalone software.

For a recent tool, take a look at our LiverTox Workspace.
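To illustrate the kind of task, the sketch below wraps a pickled scikit-learn model in a small Flask web service. The model file, feature format, and endpoint are invented placeholders, not an existing group tool.

```python
# Minimal sketch: exposing a pre-trained predictive model as a REST endpoint.
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical pre-trained scikit-learn model

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()              # e.g. {"features": [[0.1, 2.3, 0.7]]}
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```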

Contact: Claudia Plant, Jana Gurinova


Master or Bachelor Thesis: Exploratory Data Analysis on the GPU - CUDA Warp-Level Primitives and Independent Thread Scheduling

Volta’s new independent thread scheduling capability enables finer-grain synchronization and cooperation between parallel threads. In addition, a new combined L1 data cache and shared memory subsystem significantly improves performance while also simplifying programming. Here, we will enhance traditional data mining algorithms with the use of the GPU and its independent thread scheduling based on CUDA intrinsics. Candidate algorithms are k-Means, DBSCAN, the Apriori algorithm, or dimensionality reduction techniques such as SVD or PCA.
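To give a flavour of the warp-level primitives involved, here is a minimal sketch of a warp shuffle sum reduction written with Numba's CUDA bindings (a Python stand-in for native CUDA C). It assumes a CUDA-capable GPU and a Numba version that exposes the shfl_*_sync intrinsics; a kernel like this would be one building block inside a GPU k-Means or DBSCAN implementation.

```python
import numpy as np
from numba import cuda

FULL_MASK = 0xFFFFFFFF  # all 32 lanes of a warp take part in the shuffle

@cuda.jit
def warp_sum(data, out):
    tid = cuda.grid(1)
    val = 0.0
    if tid < data.shape[0]:
        val = data[tid]
    # Warp-level reduction: each step pulls a partial sum from the lane
    # `offset` positions higher, halving the number of live partial sums
    # without touching shared memory.
    offset = 16
    while offset > 0:
        val += cuda.shfl_down_sync(FULL_MASK, val, offset)
        offset //= 2
    # Lane 0 of every warp adds its warp's total to the global result.
    if cuda.threadIdx.x % 32 == 0:
        cuda.atomic.add(out, 0, val)

x = np.random.rand(1 << 20)
d_out = cuda.to_device(np.zeros(1))
threads = 256
warp_sum[(x.size + threads - 1) // threads, threads](cuda.to_device(x), d_out)
print(d_out.copy_to_host()[0], x.sum())  # the two sums should agree closely
```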

Contact: Martin Perdacher


Master or Bachelor Thesis: Exploratory Data Analysis with Google TPU

The Tensor Processing Unit (TPU) was announced in 2016 at Google I/O, when the company said that the TPU had already been used inside its data centers for over a year. The chip has been specifically designed for Google’s TensorFlow framework, a symbolic math library which is used for machine learning applications such as neural networks. Here, we will enhance traditional data mining algorithms with the use of the TPU. Candidate algorithms are k-Means, DBSCAN, the Apriori algorithm, or dimensionality reduction techniques such as SVD or PCA.
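As an illustration of the style of computation involved, the sketch below expresses one k-Means iteration as dense TensorFlow tensor ops, which is the kind of code a TPU accelerates. It runs on CPU or GPU as written; actual TPU execution would additionally require a tf.distribute.TPUStrategy setup.

```python
import tensorflow as tf

def kmeans_step(points, centroids):
    # Squared Euclidean distances via broadcasting: shape (n, k).
    d2 = tf.reduce_sum(tf.square(points[:, None, :] - centroids[None, :, :]), axis=-1)
    assignment = tf.argmin(d2, axis=1)  # nearest centroid per point
    # New centroids as per-cluster means (empty clusters fall back to 0).
    new_centroids = tf.math.unsorted_segment_mean(points, assignment,
                                                  tf.shape(centroids)[0])
    return assignment, new_centroids

points = tf.random.uniform((10000, 2))
centroids = tf.random.shuffle(points)[:5]          # random initialization
for _ in range(10):
    assignment, centroids = kmeans_step(points, centroids)
print(centroids.numpy())
```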

Contact: Martin Perdacher


Master or Bachelor Thesis: Predicting driving behavior based on smartphone data

Identifying drivers who have a higher risk of getting involved in traffic accidents can have several real-world benefits. The driver can be made aware of his/her risky behavior, other road users can be warned early, or insurance companies can improve risk assessments and reduce information asymmetry. The models used for risk prediction consist of many submodules, which are embedded in a larger pipeline. The aim of this project is to predict whether a car drives in a rural or urban area, based on data from smartphone sensors. This context information is an important feature for risk prediction models and is currently only available by labeling the data by hand.

The project focuses on the evaluation of more than one million collected user trips from different countries. The data consists of time series from smartphone sensors and cross-sectional metadata. During the project, the entire KDD process is applied, from pre-processing and feature engineering to modelling and evaluation. The scope of the thesis can be discussed and decided based on individual strengths and preferences. Students wanting to work on this project are expected to have a basic understanding of machine learning and data mining techniques and a solid knowledge of Python, preferably including its data science stack (pandas, scikit-learn, numpy, ...).
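A minimal end-to-end sketch of the task with pandas and scikit-learn is given below; the file name and column names (trip_id, speed, acc_z, label) are invented for illustration and will differ from the real sensor schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

trips = pd.read_csv("trips.csv")  # one row per sensor sample (hypothetical file)

# Simple per-trip feature engineering: summary statistics over the time series.
features = trips.groupby("trip_id").agg(
    speed_mean=("speed", "mean"),
    speed_std=("speed", "std"),
    acc_z_std=("acc_z", "std"),
    stops=("speed", lambda s: (s < 1.0).sum()),  # rough proxy for stop-and-go traffic
).fillna(0.0)
labels = trips.groupby("trip_id")["label"].first()  # "urban" or "rural", hand-labelled

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, features, labels, cv=5).mean())
```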

Contact: Lukas Miklautz


Master or Bachelor Thesis: Causality by compression schemes

The minimum description length (MDL) principle can be used to assess causal direction. The goal of this thesis is to develop models and algorithms based on compression schemes. The models and algorithms will be validated against state-of-the-art models.
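As a toy illustration of the compression idea (not the model class to be developed), the sketch below compares the description length of modelling y from x with that of modelling x from y, using zlib on discretized residuals as a crude stand-in for a proper MDL score.

```python
import zlib
import numpy as np

def description_length(cause, effect, bins=32):
    # Fit a simple conditional model (here: linear) and compress the residuals;
    # under an MDL-style score, the direction with the shorter description is preferred.
    slope, intercept = np.polyfit(cause, effect, 1)
    residuals = effect - (slope * cause + intercept)
    codes = np.digitize(residuals, np.histogram_bin_edges(residuals, bins)).astype(np.uint8)
    return len(zlib.compress(codes.tobytes()))

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = np.tanh(x) + 0.1 * rng.normal(size=5000)   # toy data where x drives y

print("DL(x -> y):", description_length(x, y))
print("DL(y -> x):", description_length(y, x))
```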

Contact: Katerina Schindlerova


Master Thesis: Temporal anomaly detection in meteorological time series

Having time series of wind speed and direction at various sites in Austria, we are interested in detecting temporal anomalies at meteorological sites over a long observation period. The goal of this thesis is to develop an algorithm tailored to the specifics of the data and to compare and validate it against state-of-the-art methods.
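A simple baseline along these lines is a rolling z-score against a seasonal window, sketched below with pandas; the file and column names are hypothetical, and the thesis algorithm would go well beyond this baseline.

```python
import pandas as pd

df = pd.read_csv("site_wind.csv", parse_dates=["timestamp"]).set_index("timestamp")
window = 24 * 30  # roughly one month, assuming hourly observations

rolling_mean = df["wind_speed"].rolling(window, min_periods=window // 2).mean()
rolling_std = df["wind_speed"].rolling(window, min_periods=window // 2).std()
zscore = (df["wind_speed"] - rolling_mean) / rolling_std

anomalies = df[zscore.abs() > 4]  # threshold chosen arbitrarily for the sketch
print(anomalies.head())
```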

Contact: Katerina Schindlerova


Master or Bachelor Thesis: Mining in time series: Nonlinear and non-additive causal modeling

The goal of this thesis is to develop models and algorithms able to detect causal patterns in time series which are of a nonlinear or non-additive nature. The models and algorithms will be validated against state-of-the-art causal models.
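One of many possible starting points is a nonlinear Granger-style test: check whether adding lags of x reduces the error of a nonlinear model predicting y. The sketch below uses gradient boosting on toy data and is only meant to make the question concrete, not to prescribe the model class.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, lags = 3000, 3
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):                       # toy ground truth: x drives y nonlinearly
    y[t] = 0.5 * y[t - 1] + np.tanh(x[t - 1]) + 0.1 * rng.normal()

def lagged(series, lags):
    # Stack series[t-1], ..., series[t-lags] as feature columns for targets y[t].
    return np.column_stack([series[lags - k - 1:len(series) - k - 1] for k in range(lags)])

target = y[lags:]
own = lagged(y, lags)
full = np.hstack([own, lagged(x, lags)])

for name, feats in [("y-lags only", own), ("y-lags + x-lags", full)]:
    score = cross_val_score(GradientBoostingRegressor(), feats, target,
                            scoring="neg_mean_squared_error", cv=5).mean()
    print(name, -score)
```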

Contact: Katerina Schindlerova


Master or Bachelor Thesis: Spectral clustering and graph sparsification

Spectral clustering is a well-known graph-clustering approach that is both efficient and effective: data points are not clustered directly in their native data space, but via a similarity matrix. Graph sparsification, i.e. approximating a given graph by a graph with fewer edges or vertices, is on the other hand a useful and versatile primitive for designing efficient graph algorithms. There are various criteria to achieve a sparse graph. In this project we aim at sparsifying complex graphs while preserving the clustering structure.
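The scikit-learn sketch below contrasts spectral clustering on a dense RBF similarity graph with the same clustering on a k-nearest-neighbour graph, which is only one simple sparsification baseline among the criteria mentioned above.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

# Dense similarity graph (full RBF affinity matrix) vs. sparsified kNN graph.
dense = SpectralClustering(n_clusters=2, affinity="rbf", gamma=20,
                           random_state=0).fit_predict(X)
sparse = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0).fit_predict(X)

print("dense graph ARI:   ", adjusted_rand_score(y, dense))
print("kNN-sparsified ARI:", adjusted_rand_score(y, sparse))
```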

Contact: Sahar Behzadi Soheil


Master Thesis: Data Mining on Real World Accelerometer Time Series

In Data Mining and Machine Learning the choice of distance measure is a crucial design decision that strongly depends on the data structure and application. There are plenty of distance measures: Euclidean, Manhattan, edit distance, Dynamic Time Warping, SAX, Hamming, and many more. When analyzing time series data, key information is given by the ordering of the observations, so the distance measure of choice should also respect this ordering, as Dynamic Time Warping (DTW) does.
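For reference, a plain dynamic-programming implementation of DTW is sketched below; in practice, libraries such as dtaidistance or tslearn provide faster and constrained variants.

```python
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Warping step: match, insertion, or deletion.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

t = np.linspace(0, 2 * np.pi, 100)
print(dtw_distance(np.sin(t), np.sin(t + 0.5)))       # small: shifted but similar shape
print(np.linalg.norm(np.sin(t) - np.sin(t + 0.5)))    # Euclidean penalizes the shift more
```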

Further research questions are: How much data is enough? How many features are enough? What sampling rate is high enough?

You will start with literature research about time series distance measures and test them on accelerometer time series for supervised and unsupervised data mining tasks.

Contact: Maximilian Leodolter


Master Thesis: Predicting urban heat island intensity in a changing climate

Cities are critical areas where climate change is expected to have severe impacts. One of the well-known problems is the increase in temperatures in cities as compared to their surroundings caused by the modification of the energy balance in the built-up urban environment. This phenomenon is called the urban heat island effect, which is responsible for enhanced heat stress, health risks and reduced quality of life for the urban population. Although urban heat islands are primarily characterized by the structure of the urban environment, their intensity and pattern depend on the large-scale atmospheric conditions too. Combining weather pattern statistics and high-resolution urban heat island information, this connection can be investigated in detail.

The aim of this project is to determine the relationship between large-scale weather patterns and the urban heat island intensity and pattern of Vienna using machine learning techniques. As a second step, predictions of the urban heat island intensity in a changing climate can be performed using those results combined with climate model data for past and future climatic periods.
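A hedged sketch of the first step is given below: a random forest relating a categorical large-scale weather pattern and a few meteorological predictors to an urban heat island intensity. The file and column names are placeholders; the real predictors come from the ZAMG datasets.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("vienna_uhi.csv")  # hypothetical: one row per day or night
X = pd.get_dummies(df[["weather_pattern", "wind_speed", "cloud_cover"]],
                   columns=["weather_pattern"])  # one-hot encode the pattern class
y = df["uhi_intensity"]  # e.g. urban-rural temperature difference in K

model = RandomForestRegressor(n_estimators=300, random_state=0)
print("cross-validated R^2:", cross_val_score(model, X, y, cv=5).mean())
```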

The work is co-supervised by the Urban Modelling team (part of the Section for Numerical Weather Prediction Applications) of the ZAMG (Zentralanstalt für Meteorologie und Geodynamik, Austria’s national weather service), and is part of the team’s research activities. As a result, the programming environment used within the project should be compatible with ZAMG’s IT infrastructure.

Contact: Rosmarie de Wit


Implementation of a Data Mining Approach for Short-range Temperature Forecasts

Short-range forecasts of wind speeds (i.e., 1 - 72 hours into the future) and in particular nowcasting (i.e., very short-range forecasts with a time horizon of up to 6 hours) are vital for a wide range of applications. In contrast to wind, temperature typically changes gradually and may thus need a different setup than wind speed. Temperature exhibits daily fluctuations, which are well predictable but depend on the time of day, and this must be considered in the training dataset. Depending on the location, rapid temperature changes may also occur, related, for example, to cold air pools or Föhn. Thus, temperature also depends highly on location (specific topography, prevailing weather conditions, and atmospheric dynamics). Depending on the application, we can give either point-based or area-based predictions. Points refer to a particular location (e.g., a weather station), whereas spatial forecasts typically give a forecast for each grid cell over a region (i.e., each forecast is valid for the whole cell).

ZAMG (Zentralanstalt für Meteorologie und Geodynamik) employs (gridded) numerical weather prediction models in conjunction with observation data for short-range forecasting, and a nowcasting system, INCA, for the prediction of meteorological parameters. Alternatively, machine learning methods are now being implemented. In particular, an artificial neural network (ANN) and a random forest (RF) are used in an experimental setup to demonstrate the skill of these methods and, possibly, serve as an additional point forecasting method for the 10-meter wind. The existing methods can be used for temperature as well with the same training set, but the setup needs to be adapted.

The proposed student project shall address temperature forecasting (at two meters height) at meteorological observation sites by machine learning and data mining methods (e.g., random forests, feed-forward/backpropagation artificial neural networks, kernel methods, etc.) and input feature selection for the training data set. It is possible to experiment with related meteorological parameters as well (e.g., dew point temperature and relative humidity). Related work suggests using a backpropagation neural network for predicting the 2-m dew point temperature and the 2-m temperature. This can be used as a starting point to find a suitable selection of the training data for the current model and then to extend our current approaches with a backpropagation neural network in order to set up a first prototype for temperature prediction by machine learning methods. The new model shall be tested on various scenarios (e.g., different prevailing weather conditions, locations, seasons) in order to compare the new data-mining-based model with the currently employed nowcasting system INCA.
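To make the starting point concrete, the sketch below fits a random forest that maps lagged station observations and the hour of day to the 2-m temperature a few hours ahead. The file and column names are placeholders; the real features would come from the ZAMG observation and NWP/INCA data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

obs = pd.read_csv("station_obs.csv", parse_dates=["time"]).set_index("time")
horizon = 3  # forecast lead time in hours, assuming hourly observations

features = pd.DataFrame({
    "t2m_lag0": obs["t2m"],
    "t2m_lag1": obs["t2m"].shift(1),
    "t2m_lag2": obs["t2m"].shift(2),
    "hour": pd.Series(obs.index.hour, index=obs.index),  # captures the daily cycle
})
data = features.join(obs["t2m"].shift(-horizon).rename("target")).dropna()

model = RandomForestRegressor(n_estimators=300, random_state=0)
cv = TimeSeriesSplit(n_splits=5)  # respect the temporal ordering during validation
scores = cross_val_score(model, data.drop(columns="target"), data["target"],
                         cv=cv, scoring="neg_mean_absolute_error")
print("MAE per fold [K]:", -scores)
```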

The work is co-supervised by ZAMG's Section for Numerical Weather Prediction (NWP) Applications. The developed method shall have a Python-based frontend and a C/C++ backend, and use CSV- or SQLite-based meteorological data (provided by ZAMG) in order to align with other machine learning implementations running in our IT environment. Finally, the developed method will be set up in our development environment (Python 2.7/Linux 64-bit, multi-core shared-memory machine) to provide forecasts and a validation of the method for selected test scenarios.

Contact: Petrina Papazek

 

Completed