Johann Gamper
Free University of Bozen-Bolzano
Faculty of Computer Science
Dominikanerplatz 3, 39100 Bozen-Bolzano, Italy
email: gamper at inf dot unibz dot it
phone: +39-0471-016140
PostDoc/PhD position in spatio-temporal databases available! More information is here.

Projects for Internships and BSc/MSc Theses

This page describes projects for BSc/MSc internships and theses in the area of databases, in particular temporal and spatio-temporal databases as well as data warehousing and data analysis. A project typically consists of the development, implementation, and evaluation of algorithmic solutions. Some projects are embedded in a collaboration with external partners. For more information or new project proposals, please contact me at gamper_at_inf.unibz.it.

A Tool to Support Quality Control in GWAS

Genome-wide association studies (GWAS) scan the entire human genome to assess whether any specific genetic variation is associated with a disease. The output is a relational table with millions of rows and dozens of columns. Each row corresponds to a different genetic marker. Columns store various marker attributes (e.g., id, chromosomal position, genotype) and statistics about the association between markers and a disease (e.g., effect size, standard error, p-value). With current technologies, a single GWAS result table typically contains up to 10 million rows (or > 1GB of data).

To increase the power of association tests, the results of different GWASs are combined into GWAS meta-analyses, typically involving from 10 to 100 studies that provide hundreds of gigabytes of data stored in tabular files. To ensure high-quality results, accurate quality control (QC) of all GWAS data files is important. QC involves single-file checks for, e.g., correct formatting, duplicates, and the distribution of summary statistics. The GWAtoolbox (C. Fuchsberger, D. Taliun, P. P. Pramstaller, and C. Pattaro on behalf of the CKDGen consortium. GWAtoolbox: an R package for fast quality control and handling of GWAS meta-analysis data. Bioinformatics 28(3): 444-445, 2012) is the first software that supports time- and memory-efficient QC of massive data from GWAS and provides visual and textual data quality reports. The current limitation of GWAtoolbox is the absence of between-file comparisons, in which the data in every single GWAS file is checked against the data in all other GWAS files (see also http://www.eurac.edu/en/research/institutes/geneticmedicine/Software/GWAtoolbox.html).
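The difference between a single-file check and a between-file comparison can be illustrated with a minimal Python sketch. It is not part of GWAtoolbox: the function names, the in-memory row format, and the choice of "number of shared marker ids" as the between-file statistic are assumptions made only for illustration, and real GWAS files would be far too large to handle this naively.

```python
from collections import Counter
from itertools import combinations


def duplicate_ids(rows):
    """Single-file check: return marker ids that occur more than once."""
    counts = Counter(row["id"] for row in rows)
    return sorted(marker for marker, n in counts.items() if n > 1)


def shared_marker_counts(files):
    """Between-file check: count the markers shared by every pair of files.

    `files` maps a (hypothetical) file name to its list of rows.
    """
    ids = {name: {row["id"] for row in rows} for name, rows in files.items()}
    return {
        (a, b): len(ids[a] & ids[b])
        for a, b in combinations(sorted(ids), 2)
    }


# Toy data (assumed); real GWAS result tables hold millions of rows.
study_a = [{"id": "rs1"}, {"id": "rs2"}, {"id": "rs2"}]
study_b = [{"id": "rs2"}, {"id": "rs3"}]
files = {"study_a": study_a, "study_b": study_b}

print(duplicate_ids(study_a))
print(shared_marker_counts(files))
```

The pairwise loop makes the cost of between-file comparisons visible: with 100 studies there are 4950 pairs, which is one reason why time and memory efficiency matter for this kind of QC.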

The aim of this thesis is to extend the GWAtoolbox from unidimensional file checking to multidimensional QC by including systematic between-file comparisons that support both textual and visual reports. Given the large amounts of data, the time and memory efficiency of the developed solution is crucial.

The student should have C/C++ programming experience. Some experience in R would be beneficial, but is not mandatory.

Comparison of SES Pattern Matching with Regular Expression Matching

Event pattern matching tries to match input events against a complex query pattern that specifies constraints on extent, order, and values of matching events. It is applied in different areas, including finance, click stream analysis, and RFID-based tracking and monitoring. Past research is limited to patterns consisting of sequences of single events. We proposed sequenced event set (SES) pattern matching, which allows matching sequences of sets of events. While the order of events within each set is irrelevant, events in distinct sets must follow the specified order. For the evaluation, we proposed an automaton-based algorithm. For more details see: B. Cadonna, J. Gamper, M.H. Böhlen. Sequenced Event Set Pattern Matching. In Proc. of EDBT-11, pages 33-44, 2011.
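The core idea of matching a sequence of event sets can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the automaton-based algorithm from the paper: it assumes that events are plain type labels, that each set must be matched by exactly one event of each type, and that the input must match the pattern contiguously and completely. Constraints on extent and values are ignored, and the function name is hypothetical.

```python
def matches_ses(pattern, events):
    """Return True if `events` matches `pattern` exactly.

    `pattern` is a list of sets of event types. Events belonging to the
    same set may arrive in any order, but the sets themselves must be
    matched in the order in which they appear in the pattern.
    """
    position = 0
    for event_set in pattern:
        chunk = events[position:position + len(event_set)]
        # The chunk must be a permutation of the current event set.
        if len(chunk) != len(event_set) or set(chunk) != event_set:
            return False
        position += len(event_set)
    # No unmatched events may remain after the last set.
    return position == len(events)


pattern = [{"a", "b"}, {"c"}]
print(matches_ses(pattern, ["b", "a", "c"]))  # order inside a set is free
print(matches_ses(pattern, ["a", "c", "b"]))  # order between sets is violated
```

A conventional regular expression would have to enumerate every permutation of each set (here `(ab|ba)c`), which grows factorially with the set size and is one motivation for comparing the two approaches.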

The aim of this thesis is to analytically and experimentally compare SES pattern matching with regular expression matching (i.e., SES automata vs. conventional automata).

The student should have some interest in regular expressions, automata, and event pattern matching as well as willingness to code in C (some experience in C would be beneficial, but is NOT mandatory).

Incremental Computation of Isochrones in Multimodal Networks

In urban planning, it is important to assess the coverage of the city by various kinds of public services. An effective way to do so is to compute isochrones. An isochrone for a given query point, q, is defined as the set of all points on a road network from which q is reachable in a given timespan. A preliminary solution for the computation of isochrones is available here.
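The definition can be made concrete with a minimal Python sketch. It is not the preliminary solution mentioned above: it assumes a single-mode network given as directed edges with fixed travel times in minutes, and it returns only network nodes, whereas a full isochrone also covers partially reachable edges and, in the multimodal case, depends on schedules. The function name and the toy network are hypothetical.

```python
import heapq


def isochrone(edges, q, budget):
    """Return {node: travel time to q} for all nodes that reach q in time.

    `edges` is a list of directed (source, target, minutes) triples.
    """
    # Reverse the edges so that a search from q finds nodes that can reach q.
    incoming = {}
    for source, target, minutes in edges:
        incoming.setdefault(target, []).append((source, minutes))

    best = {q: 0}
    queue = [(0, q)]
    while queue:
        time, node = heapq.heappop(queue)
        if time > best.get(node, float("inf")):
            continue  # stale queue entry
        for source, minutes in incoming.get(node, []):
            candidate = time + minutes
            # Expand only within the time budget (Dijkstra with cutoff).
            if candidate <= budget and candidate < best.get(source, float("inf")):
                best[source] = candidate
                heapq.heappush(queue, (candidate, source))
    return best


# Toy network (assumed): c -> b -> a -> q, plus an edge leaving q.
edges = [("a", "q", 5), ("b", "a", 4), ("c", "b", 10), ("q", "d", 1)]
reachable = isochrone(edges, "q", 10)
print(reachable)
```

With a budget of 10 minutes, node c is excluded because its shortest travel time to q is 19 minutes, and d is excluded because no path leads from d to q.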

The aim of this project is to develop an efficient algorithm that incrementally computes isochrones when the values of one or more input parameters change, e.g., when they are controlled by a slider. This makes it possible, for example, to visualize in steps of 5 minutes how an isochrone changes during a day, depending on the availability of buses.

Medical Data Warehousing and Analysis

In the MEDAN project, we investigate in close collaboration with the Hospital of Meran-Merano efficient methods for the maintenance and analysis of huge amounts of medical data. MEDAN offers a number of different BSc/MSc projects (for more details see the MEDAN page):

  • Efficient generation of patient histories in OncoNet
  • Historical queries in OncoNet
  • Trend recognition
  • Analysis of blood thinning therapies in TaoNet