Anton Dignös

Publications

This page contains a list of research publications with abstracts and links to online versions. If you cannot access a file you are interested in, please feel free to contact me. Also, have a look at my dblp, Google Scholar, and ResearchGate pages.

◼ Journal ◼ Conference proceedings ◼ Workshop proceedings ◼ Editor ◼ Book or book chapter ◼ Other

2025

Mohamed Sabri Hafidi, Ozan Kahramanogullari, Anton Dignös, and Johann Gamper: “Relational Data Models for Genetic VCF data”, in Proceedings of the VLDB Endowment (PVLDB), 18(11): 4045-4053, July 2025.

Abstract The Variant Call Format (VCF) and its binary counterpart (BCF) are commonly used in bioinformatics for storing gene sequence data. While VCF files provide compact storage, they require specific tools and scripts for querying, thereby missing the rich functionality arsenal of database management systems and their potential for integration in multiomics pipelines. In this paper, we leverage Relational Database Management Systems (RDBMS) to enhance efficiency and flexibility in storing and querying large-scale genetic datasets. We map the VCF file structure to narrow, wide, and array-based data models that are further refined using JSON data structures, resulting in eight data models. Our experimental evaluation shows that RDBMS provide competitive performance in comparison with specialized state-of-the-art tools while making full-fledged database capabilities available for genetic data analysis.

Paper | Slides | Poster | Code
Muhammad Adnan, Diego Calvanese, Julien Corman, Anton Dignös, Werner Nutt, and Ognjen Savkovic: “Compact Answers to Temporal Path Queries”, in Proceedings of the 24th International Semantic Web Conference (ISWC), Nara, Japan, accepted for publication, November 2-6, 2025.

Abstract We study path-based graph queries that in addition to navigation over edges also perform navigation over time. This allows one to ask questions about the dynamics of networks, like traffic movement, cause-effect relationships, or spread of diseases. In this setting, graphs consist of triples annotated with validity intervals while a query produces pairs of nodes where each pair is associated with a binary relation over time. For instance, such a pair could be two airports and the relation could map potential departure times to possible arrival times. An open question is how to represent such a relation in a compact form and how to maintain such properties during query evaluation. We investigate four compact representations of answers to such a query that rely on alternative ways of encoding sets of intervals over both discrete and dense time. We discuss their respective advantages and drawbacks, in terms of conciseness, uniqueness, and computational cost. Notably, the most refined encoding guarantees that query answers over dense time can be finitely represented.

Paper | Technical report (extended version)
Sameen Mustafa, Elias Ganthaler, Attaullah Buriro, Anton Dignös, Thomas Villgrattner, Franco Concli, and Angelika Peer: “Crack Detection in Powder Compacts using Machine Learning Models”, in International Journal of Structural Integrity, pp. 1-19, September 2025.

Abstract Cracks in powder metallurgy components are a common problem that poses significant manufacturing challenges, and they must be detected because they affect the material's mechanical properties. To detect these cracks, Non-Destructive Testing (NDT) methods are often used, but they come with high costs and time delays as samples need to be extracted from production in given intervals. To overcome these limitations, an indirect method of crack detection, that is, modelling it as a binary classification, is explored in this work. This study introduces a supervised machine learning approach using force signal feature extraction to detect cracks. More specifically, a supervised learning algorithm is developed and validated for the classification of samples into samples with or without crack based on the sensory data of the hydraulic press used for production. We compare different ensemble classifiers including random forest (RF), AdaBoost (ADA), Bagging, gradient boosting (GB), and extra trees (ET) in terms of their ability to classify workpieces using a dataset from real production. To this end, the present study deals with experimental workpieces of a specific type produced by manually adjusting the press parameters to artificially induce cracks in parts of the workpieces. The best performing model resulted in a classification accuracy as high as 99%, offering a cost-effective and efficient alternative to traditional NDT methods. This study provides a novel and indirect method for detecting cracks in powder metallurgy components using machine learning models trained on press sensor data, which can significantly reduce the need for costly and time-consuming NDT techniques.

Paper
Muhammad Adnan, Diego Calvanese, Julien Corman, Anton Dignös, Werner Nutt, and Ognjen Savkovic: “Computing Compact Answers to Temporal Path Queries Using SQL”, in Companion Proceedings of the 9th International Joint Conference on Rules and Reasoning (RuleML+RR), Istanbul, Turkey, accepted for publication, September 22-24, 2025.

Abstract Temporal Regular Path Queries (TRPQs) are a recent extension of regular path queries over a graph where facts are annotated with time intervals. They enable navigation both in time and over the structure of the graph. A TRPQ returns pairs of entities, each associated with a binary temporal relation, which relates the two entities through time. This allows modelling phenomena such as the propagation of a virus, or mapping the possible departures of a trip to its possible arrival times, when there is uncertainty about traffic. A key challenge of TRPQs is representing binary temporal relations in a compact way, and ensuring that these compact representations can be computed efficiently. While these problems have been recently investigated from the theoretical side, little attention has been paid to corresponding implementation techniques. In this work, we address this gap by introducing the first SQL-based implementation of TRPQ answering that produces compact answers. We investigate two alternative formats for compact answers. For each format, we first lay the foundations for an efficient implementation by translating TRPQ operations into operations over compact answers, thus preserving compactness during the evaluation process. In addition, we apply state-of-the-art interval coalescing techniques to reduce the cost of temporal joins and ensure that our results have minimal cardinality. We also present a dedicated benchmark and parameterized experiments that illustrate the trade-offs between the two compact representations, depending on the length of intervals in the input data and query. Our empirical findings also reveal the critical role of coalescing for efficient query answering.

Paper | Code
Saifullah Burero, Anton Dignös, Jerry W. Sangma, and Johann Gamper: “Blending Contextual Data with Heterogeneous Time Dimensions for Improved Time Series Analysis”, in Proceedings of the 36th International Conference on Database and Expert Systems Applications (DEXA), Bangkok, Thailand, pp. 23-34, August 25-27, 2025.

Abstract In modern industrial environments, sensors play a crucial role for automation by continuously analyzing large volumes of time series data vital for process optimization. However, analyzing this data in isolation poses significant challenges, particularly in time series analysis, due to the influence of external contextual factors that are not always directly observable. Integrating these is essential for time series analysis. While data fusion is a technique that aims at integrating or blending data with different modalities for time series analysis, such as images or videos, contextual factors may not always be heterogeneous in modality, but rather heterogeneous in time dimension, which makes their integration challenging. Therefore, we identified four different types of time dimensions that often appear in industrial environments, namely constant, time series, events, and intervals, and we aim at introducing the foundation towards a systematic approach for integrating contextual factors with heterogeneous time dimensions. This enables the transformation of data with heterogeneous time dimensions into a format that can be effectively processed by traditional machine learning models for time series analysis.

Paper | Slides
Maryam Mozaffari, Anton Dignös, Oswald Lanz, Dominik Matt, Gabriele Pasetti Monizza, Matthias Gauly, and Johann Gamper: “Onfoods: A Substitute Recommendation System in Food Recipes”, in Proceedings of the 36th International Conference on Database and Expert Systems Applications (DEXA), Bangkok, Thailand, pp. 66-79, August 25-27, 2025.

Abstract Food waste is a serious problem in modern society. A specific aspect of food waste concerns meat consumption in gastronomy, where typically only prime cuts of meat are used in the kitchen. To facilitate the usage of all parts of animals and thereby reducing food waste, we present Onfoods, a system that recommends alternative meat cuts in recipes and integrates inventory data to help with the creation of menus. Onfoods uses an ontology and a knowledge graph to model recipes, meat cuts and the relationships between the two, similarity measures to find candidates for alternative meat cuts, and inventory data to track the availability of different meat cuts. An intuitive user interface allows the user on one hand to update the knowledge graph and inventory data, and on the other hand to navigate through recipes and choose alternative meat cuts.

Paper | Slides | Demo
Fabio Persia, Anton Dignös, Sven Helmer, Johann Gamper, and Daniela D'Auria: “Modeling and Detecting High-Level Events in Healthcare Applications Exploiting ISEQL+”, in Soft Computing, in press, 2025.

Abstract Modeling and automatically detecting complex events in different domains, such as video surveillance and healthcare, is becoming an increasingly topical issue nowadays. In fact, deriving knowledge at a higher level from low-level events by combining the latter into complex structures is the task of an Event Query Language (EQL), whose main issue is the lack of formal semantics. Consequently, in order to cope with this issue, in this paper we propose ISEQL+, an extension of ISEQL (an Interval-based Surveillance Event Query Language that we previously defined), aimed at further improving its expressiveness. More specifically, we provide formal proofs demonstrating that the language fully covers the well-known Allen's interval relationships, additionally supports conditional overlap ratio and conditional cardinality constraints over the interval relationships, provides robustness with respect to small variations in the intervals, and can be formalized as a relational algebra extension, which will in turn allow a very efficient implementation exploiting an existing algorithm. Finally, we also show how typical events in the healthcare domain can be easily expressed via ISEQL+.

Paper
Matteo Ceccarello, Anton Dignös, Johann Gamper, and Christina Khnaisser: “Indexing Temporal Relations for Range-Duration Queries”, in Distributed and Parallel Databases, Volume 43, Number 7, 2025.

Abstract Temporal information plays a crucial role in many database applications; however, support for queries on such data is limited. We present an index structure, termed RD-INDEX, to support range-duration queries over interval timestamped relations, which constrain both the range of the tuples’ positions on the timeline and their duration. RD-INDEX is a grid structure in the two-dimensional space, representing the position on the timeline and the duration of timestamps, respectively. Instead of using a regular grid, we consider the data distribution for the construction of the grid in order to ensure that each grid cell contains approximately the same number of intervals. RD-INDEX features provable bounds on the running time of all the operations, allows for a simple implementation, supports very predictable query performance, and can be constructed and queried in parallel using multithreading. We benchmark our solution on a variety of datasets and query workloads, investigating both the query rate and the behavior of the individual queries. The results show that RD-INDEX performs better than the baselines on range-duration queries, for which it is explicitly designed. Furthermore, it outperforms state-of-the-art indexes also on mixed workloads containing queries that constrain either only the duration or the range along with range-duration queries. Finally, the size of the RD-INDEX is in all settings smaller than the competitors, its construction scales with the number of threads, and parallelization helps improve the runtime of expensive moderate and lowly selective queries.

Paper | Code

2024

Maryam Mozaffari, Anton Dignös, Johann Gamper, and Uta Störl: “Self-tuning Database Systems: A Systematic Literature Review of Automatic Schema Design and Tuning”, in ACM Computing Surveys, 56(11), June 2024.

Abstract Self-tuning is a feature of autonomic databases that includes the problem of automatic schema design. It aims at providing an optimized schema that increases the overall database performance. While in relational databases automatic schema design focuses on the automated design of the physical schema, in NoSQL databases all levels of representation are considered: conceptual, logical, and physical. This is mainly because the latter are mostly schema-less and lack a standard schema design procedure as is the case for SQL databases. In this work, we carry out a systematic literature survey on automatic schema design in both SQL and NoSQL databases. We identify the levels of representation and the methods that are used for the schema design problem, and we present a novel taxonomy to classify and compare different schema design solutions. Our comprehensive analysis demonstrates that, despite substantial progress that has been made, schema design is still a developing field and considerable challenges need to be addressed, notably for NoSQL databases. We highlight the most important findings from the results of our analysis and identify areas for future research work.

Paper
Luca Althaus, Mourad Khayati, Abdelouahab Khelifati, Anton Dignös, Djellel Difallah, and Philippe Cudré-Mauroux: “SEER: An End-to-End Toolkit for Benchmarking Time Series Database Systems in Monitoring Applications”, in Proceedings of the VLDB Endowment (PVLDB), 17(12): 4361-4364, August 2024.

Abstract Time series database systems (TSDBs) are prevalent in many applications ranging from monitoring and IoT devices to scientific research. Those systems are specifically designed to efficiently manage data indexed by time. Because of the variety of workloads, the diversity of time series features, and the sophistication of existing TSDBs, there is no clear way to pick the most suitable system. In this demo, we introduce SEER, an automated, configurable, and interactive toolkit to evaluate TSDBs. SEER is based on TSM-Bench, a benchmark tailored for time series database systems used in monitoring applications. It implements an end-to-end pipeline for database benchmarking from data generation and feature contamination to workload evaluation. Users can define their portfolio by configuring and parameterizing their own queries, specifying their frequencies, controlling the type and level of data features, and indicating the type of workloads. Moreover, they can deploy new systems and/or reconfigure the pre-installed ones. SEER would process users' requests and gracefully recommend the best system on a use-case basis.

Paper | Online demo
Ioannis Reppas, Meghdad Mirabi, Leila Fathi, Carsten Binnig, Anton Dignös, and Johann Gamper: “Parallel Processing of Temporal Anti-Joins in Memory”, in Proceedings of the 29th International Conference on Database Systems for Advanced Applications (DASFAA), Gifu, Japan, pp. 86-102, July 2-5, 2024.

Abstract Efficient and scalable processing of temporal anti-joins remains a significant research challenge in temporal databases. To address this issue, this paper introduces a novel temporal primitive designed for transforming a temporal anti-join, including conjunctive equality predicates on non-temporal attributes, into an equivalent algebraic expression involving a temporal inner join. The rationale behind this transformation is that the new expression can be decomposed into subtasks, allowing for parallel execution across multiple CPUs. Experimental results using real-world datasets demonstrate the superior efficiency and scalability of our solution for in-memory processing compared to existing solutions.

Paper | Code
Christina Khnaisser, Hind Hamrouni, David B. Blumenthal, Anton Dignös, and Johann Gamper: “Efficiently labeling and retrieving temporal anomalies in relational databases”, in Information Systems Frontiers, 2024.

Abstract Time and temporal constraints are implicit in most databases. To facilitate data analysis and quality assessment, a database should provide explicit operations to identify the violation of temporal constraints. Against this background, the purpose of this paper is threefold: (1) we identify and provide a formal definition of five common anomalies in temporal databases, (2) we propose two new relational operations that allow, respectively, to label anomalous tuples in and to retrieve the anomalous tuples from a dataset, and (3) we provide three different SQL implementations of these operations for current relational database management systems. The healthcare domain is used to illustrate the usage and utility of the temporal anomalies. Finally, an experimental evaluation on real-world and synthetic data analyses the performance of the different implementations of the anomaly operators.

Paper | Code

2023

Abdelouahab Khelifati, Mourad Khayati, Anton Dignös, Djellel Difallah, and Philippe Cudré-Mauroux: “TSM-Bench: Benchmarking Time Series Database Systems for Monitoring Applications”, in Proceedings of the VLDB Endowment (PVLDB), 16(11): 3363-3376, August 2023.

Abstract Time series databases are essential for the large-scale deployment of many critical industrial applications. In infrastructure monitoring, for instance, a database system should be able to process large amounts of sensor data in real-time, execute continuous queries, and handle complex analytical queries such as anomaly detection or forecasting. Several benchmarks have been proposed to evaluate and understand how existing systems and design choices handle specific use cases and workloads. Unfortunately, none of them fully covers the peculiar requirements of monitoring applications. Furthermore, they fall short of providing an automated way to generate representative real-world data and workloads for testing and evaluating these systems. We present TSM-Bench, a benchmark tailored for time series database systems used in monitoring applications. Our key contributions consist of (1) representative queries that meet the requirements that we collected from a water monitoring use case, and (2) a new scalable data generator method based on Generative Adversarial Networks (GAN) and Locality Sensitive Hashing (LSH). We demonstrate, through an extensive set of experiments, how TSM-Bench provides a comprehensive evaluation of the performance of seven leading time series database systems while offering a detailed characterization of their capabilities and trade-offs.

Paper | Poster | Slides | Code
Zhifeng Bao, Panagiotis Bouros, Reynold Cheng, Byron Choi, Anton Dignös, Wei Ding, Yixiang Fang, Boyang Han, Jilin Hu, Arijit Khan, Wenqing Lin, Xuemin Lin, Cheng Long, Nikos Mamoulis, Jian Pei, Matthias Renz, Shashi Shekhar, Jieming Shi, Eleni Tzirita Zacharatou, Sibo Wang, Xiao Wang, Xue Wang, Raymond Chi-Wing Wong, Da Yan, Xifeng Yan, Bin Yang, Dezhong Yao, Ce Zhang, Peilin Zhao, and Rong Zhu: “A Summary of ICDE 2022 Research Session Panels”, in IEEE Data Engineering Bulletin, 47(4), December 2023.

Abstract In the 38th IEEE International Conference on Data Engineering (ICDE), 2022, panel discussions were introduced after paper presentations to facilitate in-depth exploration of research topics and encourage participation. These discussions, enriched by diverse perspectives from experts and active audience involvement, provided fresh insights and a broader understanding of each topic. The introduction of panel discussions exceeded expectations, attracting a larger number of participants to the virtual sessions. This article summarizes the virtual panels held during ICDE’22, focusing on sessions such as Data Mining and Knowledge Discovery, Federated Learning, Graph Data Management, Graph Neural Networks, Spatial and Temporal Data Management, and Spatial and Temporal Data Mining. By showcasing the success of panel discussions in generating inspiring discussions and promoting participation, this article aims to benefit the data engineering community, providing a valuable resource for researchers and suggesting a compelling format of holding research sessions for future conferences.

Paper
Matteo Ceccarello, Anton Dignös, Johann Gamper, and Christina Khnaisser: “Indexing Temporal Relations for Range-Duration Queries”, in Proceedings of the 35th International Conference on Scientific and Statistical Database Management (SSDBM), Los Angeles, CA, USA, pp. 3:1-3:12, July 10-12, 2023.

Abstract Temporal information plays a crucial role in many database applications; however, support for queries on such data is limited. We present an index structure, termed RD-index, to support range-duration queries over interval timestamped relations, which constrain both the range of the tuples’ positions on the timeline and their duration. RD-index is a grid structure in the two-dimensional space, representing the position on the timeline and the duration of timestamps, respectively. Instead of using a regular grid, we consider the data distribution for the construction of the grid in order to ensure that each grid cell contains approximately the same number of intervals. RD-index features provable bounds on the running time of all the operations, allows for a simple implementation, and supports very predictable query performance. We benchmark our solution on a variety of datasets and query workloads, investigating both the query rate and the behavior of the individual queries. The results show that RD-index performs better than the baselines on range-duration queries, for which it is explicitly designed. Furthermore, it outperforms state-of-the-art indexes also on mixed workloads containing queries that constrain either only the duration or the range along with range-duration queries. Finally, the size of the RD-index is in all settings smaller than the competitors.

Paper | Slides | Code
Adam Przybylek, Aleksandra Karpus, Allel Hadjali, Anton Dignös, Carmem S. Hara, Danae Pla Karidi, Ester Zumpano, Fabio Persia, Genoveva Vargas-Solar, George Papastefanatos, Giancarlo Sperli, Giorgos Giannopoulos, Ivan Lukovic, Julien Aligon, Manolis Terrovitis, Marek Grzegorowski, Mariella Bonomo, Mirian Halfeld Ferrari Alves, Nicolas Labroche, Paul Monsarrat, Richard Chbeir, Sana Sellami, Seshu Tirupathi, Simona E. Rombo, Slavica Kordic, Sonja Ristic, Tommaso Di Noia, Torben Bach Pedersen, and Vincenzo Moscato: “Databases and Information Systems: Contributions from ADBIS 2023 Workshops and Doctoral Consortium”, in Proceedings of the 27th European Conference on Advances in Databases and Information Systems (ADBIS - Short Papers), Barcelona, Spain, pp. 293-311, September 4-7, 2023.

Abstract The 27th European Conference on Advances in Databases and Information Systems (ADBIS) aims at providing a forum where researchers and practitioners in the fields of databases and information systems can interact, exchange ideas and disseminate their accomplishments and visions.

Paper
Maryam Mozaffari, Anton Dignös, Hind Hamrouni, and Johann Gamper: “NoSQL Schema Extraction from Temporal Conceptual Model: A Case for Cassandra”, in New Trends in Database and Information Systems (ADBIS 2023). Communications in Computer and Information Science, Barcelona, Spain, pp. 280-290, September 4-7, 2023.

Abstract NoSQL data stores have been proposed to handle the different breed of scale and challenges caused by Big Data. While a suitable schema design is of vital importance in NoSQL databases, in contrast to relational databases no standard schema design procedure exists yet. Instead, manual schema design is applied by using often vague and generic rules of thumb, which must be adapted to each application. Additionally, many applications require the management and processing of temporal data, for which NoSQL databases lack explicit support. To overcome such limitations, in this paper we propose an MDA-approach for mapping an existing conceptual UML class extension with temporal features into a NoSQL wide-column store schema. Then, we evaluate the schemas generated by our approach with the Cassandra wide-column store.

Paper
Meghdad Mirabi, Leila Fathi, Anton Dignös, Johann Gamper, and Carsten Binnig: “A New Primitive for Processing Temporal Joins”, in Proceedings of the 18th International Symposium on Spatial and Temporal Databases (SSTD), Calgary, AB, Canada, pp. 106-109, August 23-25, 2023.

Abstract This paper presents the extended temporal aligner as a temporal primitive, and proposes a set of reduction rules that employ this primitive to convert a temporal join operator to its non-temporal equivalent. The rules cover all types of temporal joins, including inner join, outer joins, and anti-join. Preliminary experimental results demonstrate that the integration of the extended temporal aligner and the reduction rules can efficiently process temporal join queries.

Paper
David Massimo, Elias Ganthaler, Attaullah Buriro, Francesco Barile, Marco Moraschini, Anton Dignös, Thomas Villgrattner, Angelika Peer, and Francesco Ricci: “Estimation of Mass and Lengths of Sintered Workpieces using Machine Learning Models”, in IEEE Transactions on Instrumentation and Measurement, Volume 72, pp. 1-14, 2023.

Abstract Powder Metallurgy (PM) is the branch of Metallurgy that deals with the design/production of near net-shaped sintered workpieces with different shapes and characteristics. The produced sintered workpieces are used in automotive, aviation, and aerospace industries, just to name a few. The quality of the produced sintered workpieces largely depends on powder compaction techniques and the accurate adjustments of process parameters. Currently, adjustments of these process parameters are done manually, resulting in laborious and time-intensive effort. To this end, this paper explores the use of Machine Learning (ML) in the compaction process and proposes an accurate and light-weight ML-based pipeline to estimate the quality characteristics of the produced workpieces in the powder metallurgy domain. More specifically, it presents a pipeline for workpiece’s mass and lengths estimation by exploiting some novel hand-crafted features and comparing well-selected ML prediction models, namely, Random Forest (RF), Adaboost (ADA), and Gradient Boosting (GB). The chosen models are trained on a combination of features extracted from environmental and sensory raw data to estimate the mass and lengths of the next produced workpiece. We have implemented and evaluated our scheme on a dataset collected in a real production environment and we have found that GB is the most consistent and accurate one with the lowest Root Mean Squared Error (≈ 0.0886%). The results of an extensive experimentation have proven the relevance of the selected features and the accuracy of GB.

Paper

2022

Johann Gamper, Matteo Ceccarello, and Anton Dignös: “What's New in Temporal Databases?”, in Proceedings of the 26th European Conference on Advances in Databases and Information Systems (ADBIS), Turin, Italy, pp. 45-58, September 5-8, 2022.

Abstract Temporal databases have been an active research area for many decades, ranging from research work on query processing, most dominantly on selection and join queries, to new directions in models and semantics, such as temporal probabilistic or streaming data. At the same time more database vendors have been integrating temporal features into their systems, most notably, the temporal features of the SQL standard. In this paper, we summarize the latest research developments as presented in 30 research papers over the last five years in the context of temporal relational databases. Additionally, we also describe the developments of industrial database systems and vendors.

Paper
Christina Khnaisser, Hind Hamrouni, David B. Blumenthal, Anton Dignös, and Johann Gamper: “Querying Temporal Anomalies in Healthcare Information Systems and Beyond”, in Proceedings of the 26th European Conference on Advances in Databases and Information Systems (ADBIS), Turin, Italy, pp. 209-222, September 5-8, 2022.

Abstract Finding anomalies in temporal relational databases is a difficult and challenging task, in particular if data is integrated from different sources. The problem is especially pressing in healthcare information systems, where temporal anomalies can pinpoint critical events such as erroneous drug administration or prescription. In this paper, we define three different temporal anomalies, which we call temporal redundancy, contradiction, and incompleteness. We define two different operators for each of these anomalies: the retrieval operator to retrieve all tuples of a relation that cause anomalous behaviour, and the labelling operator to annotate a temporal relation with additional information that marks normal and anomalous tuples. Finally, we present and evaluate different implementation techniques for the two operators for relational database systems.

Paper | Slides
David B. Blumenthal, Sébastien Bougleux, Anton Dignös, and Johann Gamper: “Enumerating dissimilar minimum cost perfect and error-correcting bipartite matchings for robust data matching”, in Information Sciences, Volume 596, pp. 202-221, June 2022.

Abstract Matchings between objects from two datasets, domains, or ontologies have to be computed in various application scenarios. One often used meta-approach - which we call bipartite data matching - is to leverage domain knowledge for defining costs between the objects that should be matched, and to then use the classical Hungarian algorithm to compute a minimum cost bipartite matching. In this paper, we introduce and study the problem of enumerating K dissimilar minimum cost bipartite matchings. We formalize this problem, prove that it is NP-hard, and present heuristics based on greedy dynamic programming. The presented enumeration techniques are not only interesting in themselves, but also mitigate an often overlooked shortcoming of bipartite data matching, namely, that it is sensitive w.r.t. the storage order of the input data. Extensive experiments show that our enumeration heuristics clearly outperform existing algorithms in terms of dissimilarity of the obtained matchings, that they are effective at rendering bipartite data matching approaches more robust w.r.t. random storage order, and that they significantly improve the upper bounds of state-of-the-art algorithms for graph edit distance computation that are based on bipartite data matching.

Paper
Yuri Borgianni, Lorenzo Maccioni, Anton Dignös, and Demis Basso: “A Framework to Evaluate Areas of Interest for Sustainable Products and Designs”, in Sustainability, Volume 14, Issue 13, 2022.
View info

Abstract Experience and evaluation research on sustainable products’ design is increasingly supported by eye-tracking tools. In particular, many studies have investigated the effect of gazing at or fixating on Areas of Interest on products’ evaluations, and in a number of cases, they have inferred the critical graphical elements leading to the preference of sustainable products. This paper is motivated by the lack of generalizability of the results of these studies, which have predominantly targeted specific products and Areas of Interest. In addition, it has also been overlooked that the observation of some Areas of Interest, despite not specifically targeting sustainable aspects, can lead consumers to prefer or appreciate sustainable products in any case. Furthermore, it has to be noted that sustainable products can be recognized based on their design (shape, material, lack of waste generated) and/or, more diffusedly, information clearly delivered on packaging and in advertising. With reference to the latter, this paper collected and classified Areas of Interest dealt with in past studies, markedly in eco-design and green consumption, and characterized by their potential generalizability. Specifically, the identified classes of Areas of Interest are not peculiar to specific products or economic sectors. These classes were further distinguished into “Content”, i.e., the quality aspect they intend to highlight, and “Form”, i.e., the graphical element used as a form of communication. This framework of Areas of Interest is the major contribution of the paper. Such a framework is needed to study regularities across multiple product categories in terms of how the observation of Areas of Interest leads to product appreciation and value perception. In addition, the potential significant differences between sustainable and commonplace products can be better investigated.

Paper
Anton Dignös, Michael H. Böhlen, Johann Gamper, Christian S. Jensen, and Peter Moser: “Leveraging Range Joins for the Computation of Overlap Joins”, in The VLDB Journal, Volume 31, Issue 1, pp. 75-99, January 2022.
View info

Abstract Joins are essential and potentially expensive operations in database management systems. When data is associated with time periods, joins commonly include predicates that require pairs of argument tuples to overlap in order to qualify for the result. Our goal is to enable built-in systems support for such joins. In particular, we present an approach where overlap joins are formulated as unions of range joins, which are more general purpose joins compared to overlap joins, i.e., are useful in their own right, and are supported well by B+-trees. The approach is sufficiently flexible that it also supports joins with additional equality predicates, as well as open, closed, and half-open time periods over discrete and continuous domains, thus offering both generality and simplicity, which is important in a system setting. We provide both a stand-alone solution that performs on par with the state-of-the-art and a DBMS embedded solution that is able to exploit standard indexing and clearly outperforms existing DBMS solutions that depend on specialized indexing techniques. We offer both analytical and empirical evaluations of the proposals. The empirical study includes comparisons with pertinent existing proposals and offers detailed insight into the performance characteristics of the proposals.

Paper | Poster
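
The decomposition described in the abstract above can be illustrated with a minimal sketch. It assumes non-empty half-open intervals stored as (start, end) tuples, and it uses sorted lists with binary search as a stand-in for a B+-tree range scan. The function names range_join and overlap_join are hypothetical and not taken from the paper. Two intervals overlap exactly when the start of one falls inside the other, which yields two disjoint range joins whose union is the overlap join.

```python
from bisect import bisect_left, bisect_right


def range_join(outer, inner_sorted, inner_starts, include_lower):
    """Yield (o, i) for every inner tuple i whose start lies inside o.

    With include_lower=True the range is [o.start, o.end),
    otherwise it is (o.start, o.end).
    """
    for o in outer:
        if include_lower:
            lo = bisect_left(inner_starts, o[0])   # first start >= o.start
        else:
            lo = bisect_right(inner_starts, o[0])  # first start > o.start
        hi = bisect_left(inner_starts, o[1])       # first start >= o.end
        for i in inner_sorted[lo:hi]:
            yield o, i


def overlap_join(r, s):
    """Overlap join of two lists of non-empty half-open intervals."""
    r_sorted, s_sorted = sorted(r), sorted(s)
    r_starts = [t[0] for t in r_sorted]
    s_starts = [t[0] for t in s_sorted]
    # Case 1: the s interval starts inside the r interval (ties included).
    result = list(range_join(r, s_sorted, s_starts, True))
    # Case 2: the r interval starts strictly inside the s interval.
    result += [(a, b) for b, a in range_join(s, r_sorted, r_starts, False)]
    return result
```

Because ties on the start point are assigned to the first case only, no pair is reported twice and no duplicate elimination is needed after the union.
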
Adam Charane, Matteo Ceccarello, Anton Dignös, and Johann Gamper: “Efficient Computation of All-Window Length Correlations”, in Proceedings of the 15th International Baltic Conference on Digital Business and Intelligent Systems (DB&IS), Riga, Latvia, pp. 251-266, July 4-6, 2022.
View info

Abstract The interactive exploration of time series is an important task in data analysis. In this paper, we concentrate on the investigation of linear correlations between time series. Since the correlation of time series might change over time, we consider the analysis of all possible subsequences of two time series. Such an approach allows identifying, at different levels of window length, periods over which two time series correlate and periods over which they do not correlate. We provide a solution to compute the correlations over all window lengths in O(n²) time, which is the size of the output and hence the best we can achieve. Furthermore, we propose a visualization of the result in the form of a heatmap, which provides a compact overview of the structure of the correlations amenable for a data analyst. An experimental evaluation shows that the tool is efficient enough to allow for interactive data exploration.

Paper | Slides
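
One way to reach the O(n²) bound mentioned in the abstract above is sketched below. The abstract does not state how the paper achieves it, so the use of prefix sums here is an assumption, and the function names are hypothetical. With prefix sums of x, y, x², y², and xy, the Pearson correlation of any window follows from a constant number of lookups, so all n(n-1)/2 windows of length at least two take O(n²) time in total.

```python
from itertools import accumulate
from math import sqrt


def prefix(values):
    """Prefix sums with a leading zero: prefix(v)[j] = sum of v[0:j]."""
    return [0.0] + list(accumulate(values))


def all_window_correlations(x, y):
    """Map (start, length) to the Pearson correlation of x and y
    on that window, or None when one of the windows is constant."""
    n = len(x)
    sx, sy = prefix(x), prefix(y)
    sxx = prefix(a * a for a in x)
    syy = prefix(b * b for b in y)
    sxy = prefix(a * b for a, b in zip(x, y))
    result = {}
    for i in range(n):
        for j in range(i + 2, n + 1):      # window covers indices i..j-1
            w = j - i
            dx, dy = sx[j] - sx[i], sy[j] - sy[i]
            cov = w * (sxy[j] - sxy[i]) - dx * dy
            var_x = w * (sxx[j] - sxx[i]) - dx * dx
            var_y = w * (syy[j] - syy[i]) - dy * dy
            if var_x > 0 and var_y > 0:
                result[(i, w)] = cov / sqrt(var_x * var_y)
            else:
                result[(i, w)] = None      # correlation undefined
    return result
```

This formulation can lose precision on long series with large values, since it subtracts large sums; a production implementation would likely need a numerically more careful variant.
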

2021

Michael Shekelyan, Anton Dignös, Johann Gamper, and Minos Garofalakis: “Approximating Multidimensional Range Counts with Maximum Error Guarantees”, in Proceedings of the 37th IEEE International Conference on Data Engineering (ICDE), Chania, Crete, Greece, pp. 1595-1606, April 19-22, 2021.
View info

Abstract We address the problem of compactly approximating multidimensional range counts with a guaranteed maximum error and propose a novel histogram-based summary structure, termed SliceHist. The key idea is to operate a grid histogram in an approximately rank-transformed space, where the data points are more uniformly distributed and each grid slice contains only a small number of points. Then, the points of each slice are summarised again using the same technique. As each query box partially intersects only few slices and each grid slice has few data points, the summary is able to achieve tight error guarantees. In experiments and through analysis of non-asymptotic formulas we show that SliceHist is not only competitive with existing heuristics in terms of performance, but additionally offers tight error guarantees.

Paper
Tong Liu, Paolo Coletti, Anton Dignös, Johann Gamper, and Maurizio Murgia: “Correlation graph analytics for stock time series data”, in Proceedings of the 24th International Conference on Extending Database Technology (EDBT), Demo track, Nicosia, Cyprus, pp. 666-669, March 23-26, 2021.
View info

Abstract Stock market events are hard to model. In recent years, one approach that has been receiving increasing attention is to analyze graphs induced by price correlations of different stock companies. By analyzing the structure of such graphs, it is possible to identify critical events, e.g., market crises. To the best of our knowledge, there are no tools available that offer comprehensive support for such analyses. This paper introduces a novel tool that offers in-depth analysis with the ability of fine tuning parameters with an intuitive user interface. With a proposed workflow to handle time series data, the tool becomes versatile and it can analyze correlation graphs of different semantics: minimum spanning tree, graphs with edge thresholds, and evolving graphs. It also provides a rich set of functions that enable users to explore easily, interactively and systematically the correlation graphs starting from a file of raw time series data. With real-world stock data, we demonstrate how straightforward yet effective it is to accomplish various analytical tasks with the proposed tool.

Paper | Video | Online demo
Danila Piatov, Sven Helmer, Anton Dignös, and Fabio Persia: “Cache-Efficient Sweeping-Based Interval Joins for Extended Allen Relation Predicates”, in The VLDB Journal, Volume 30, Issue 3, pp. 379-402, May 2021.
View info

Abstract We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen's relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.

Paper | Technical report (extended version)

2020

Johann Gamper and Anton Dignös: “Processing Temporal and Time Series Data: Present State and Future Challenges”, in Proceedings of the 24th European Conference on Advances In Databases and Information Systems (ADBIS), Lyon, France, pp. 8-14, August 25-28, 2020.
View info

Abstract Temporal data is ubiquitous, and its importance has been witnessed by the research efforts for several decades as well as by the increased interest in the last years from both academia and industry. Two prominent research directions in this context are the field of temporal databases and the field of time series data. This extended abstract aims at providing a concise overview about the state of the art in processing temporal and time series data as well as to discuss open research problems and challenges.

Paper

2019

Vincenzo Del Fatto, Anton Dignös, Guerriero Raimato, Lorenzo Maccioni, Yuri Borgianni, and Johann Gamper: “Visual Time Period Analysis: a Multimedia Analytics Application for Summarizing and Analyzing Eye-tracking Experiments”, in Multimedia Tools and Applications, Volume 78, Issue 23, pp. 32779-32804, December 2019.
View info

Abstract Recently, an increasing need for sophisticated multimedia analytics tools has been observed, which is triggered by a rapid growth of multimedia collections and by an increasing number of scientific fields embedding images in their studies. Although temporal data is ubiquitous and crucial in many applications, such tools typically do not support the analysis of data along the temporal dimension, especially for time periods. An appropriate visualization and comparison of period data associated with multimedia collections would help users to infer new information from such collections. In this paper, we present a novel multimedia analytics application for summarizing and analyzing temporal data from eye-tracking experiments. The application combines three different visual approaches: Time°diff, visual-information-seeking mantra, and multi-viewpoint. A qualitative evaluation with domain experts confirmed that our application helps decision makers to summarize and analyze multimedia collections containing period data.

Paper
Andreas Behrend, Anton Dignös, Johann Gamper, Philip Schmiegelt, Hannes Voigt, Matthias Rottmann, and Karsten Kahl: “Period Index: A Learned 2D Hash Index for Range and Duration Queries”, in Proceedings of the 16th International Symposium on Spatial and Temporal Databases (SSTD), Vienna, Austria, pp. 100-109, August 19-21, 2019.
View info

Abstract Today, most commercial database systems provide some support for the management of temporal data, but the index support for efficiently accessing such data is rather limited. Existing access paths neglect the fact that time intervals are located on the timeline and have a duration, two important pieces of information for querying temporal data. In this paper, we tackle this problem and introduce a novel index structure, termed Period Index, for efficiently accessing temporal data based on these two pieces of information. The index supports temporal queries that constrain the position of an interval on the timeline (range queries), its interval duration (duration queries), or both (range-duration queries). The key idea of the new index is to split the timeline into fixed-length buckets, each of which is divided into a set of cells that are organized in levels. The cells encode the position of intervals on the timeline, whereas the levels encode their duration. This grid-based index is well-suited for parallelization and non-uniform memory access (NUMA) architectures as it is common for modern hardware with large main-memories and multi-core servers. The Period Index is independent of the physical order of the data and has predictable performance due to the underlying hashing approach. We also propose an enhanced version of our index structure, termed Period Index∗, which continuously adapts the optimal bucket length to the distribution of the data. Our experiments show that Period Index∗ significantly beats other indexes for the class of queries that constrain both the position and the length of the time intervals, and it is competitive for queries that involve solely one temporal dimension.

Paper | Slides
Necati Duran, Giovanni Mahlknecht, Anton Dignös, and Johann Gamper: “HotPeriods: Visual Correlation Analysis of Interval Data”, in Proceedings of the 16th International Symposium on Spatial and Temporal Databases (SSTD), Demo track, Vienna, Austria, pp. 178-181, August 19-21, 2019.
View info

Abstract With the ever increasing amount and complexity of data, visual analysis becomes a fundamental tool to spot correlations and other relationships in data. Most of the previous techniques (e.g., scatter plots or heatmaps) focus on point data, i.e., data with point measures, such as prices or volumes. In this demo paper, we focus on data with interval measures, that is, data where measures consist of an interval or range of values, such as price ranges or time intervals. We present a tool, termed HOTPERIODS, which allows users to visualize correlations between two interval measures in the two-dimensional space, where the two measures represent a rectangle. To visualize such data, we first perform a rectangle aggregation. The result of this aggregation is a density matrix, where each cell stores the number of rectangles that cover the corresponding points in space. For the visualization of the density matrix, color-coding is used to represent different density values similar to heatmaps. We illustrate the usefulness of HOTPERIODS for the analysis of stock market data and tourism data, both of which show interval measures.

Paper | Poster | Online demo
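
The rectangle aggregation described in the abstract above can be sketched as follows. The abstract does not say how the tool computes the density matrix, so the two-dimensional difference array used here is an assumption, as are the function name density_matrix and the integer grid with half-open rectangles (x1, y1, x2, y2). Each rectangle touches four cells of the difference array, and a final prefix-sum pass turns the differences into coverage counts.

```python
def density_matrix(rects, width, height):
    """Count, for every grid cell, how many rectangles cover it.

    rects holds tuples (x1, y1, x2, y2) covering the cells with
    x1 <= x < x2 and y1 <= y < y2, all within the width x height grid.
    """
    diff = [[0] * (width + 1) for _ in range(height + 1)]
    for x1, y1, x2, y2 in rects:
        diff[y1][x1] += 1       # coverage starts at the lower-left corner
        diff[y1][x2] -= 1       # and stops right of the rectangle
        diff[y2][x1] -= 1       # and above it
        diff[y2][x2] += 1       # undo the double subtraction
    for row in diff:            # prefix sums along each row
        for x in range(1, width + 1):
            row[x] += row[x - 1]
    for y in range(1, height + 1):   # prefix sums along each column
        for x in range(width + 1):
            diff[y][x] += diff[y - 1][x]
    # Drop the extra guard row and column.
    return [row[:width] for row in diff[:height]]
```

The cost is proportional to the number of rectangles plus the number of grid cells, independent of how large the individual rectangles are.
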
Anton Dignös, Boris Glavic, Xing Niu, Michael H. Böhlen, and Johann Gamper: “Snapshot Semantics for Temporal Multiset Relations”, in Proceedings of the VLDB Endowment (PVLDB), 12(6): 639-652, February 2019.
View info

Abstract Snapshot semantics is widely used for evaluating queries over temporal data: temporal relations are seen as sequences of snapshot relations, and queries are evaluated at each snapshot. In this work, we demonstrate that current approaches for snapshot semantics over interval-timestamped multiset relations are subject to two bugs regarding snapshot aggregation and bag difference. We introduce a novel temporal data model based on K-relations that overcomes these bugs and prove it to correctly encode snapshot semantics. Furthermore, we present an efficient implementation of our model as a database middleware and demonstrate experimentally that our approach is competitive with native implementations.

Paper | Technical report (extended version) | Poster | Slides | Website and code | Reproducibility
Giovanni Mahlknecht, Anton Dignös, and Natalija Kozmina: “Modeling and querying facts with period timestamps in data warehouses”, in International Journal of Applied Mathematics and Computer Science, Volume 29, Number 1, pp. 31-49, March 2019.
View info

Abstract In this paper, we study different ways of representing and querying fact data that is time-stamped with a time period in a data warehouse. The main focus is on how to represent the time periods that are associated with the facts in order to support convenient and efficient aggregations over time. We propose three distinct logical models that represent time periods, respectively, as sets of all time points in a period (instant model), as pairs of start and end time points of a period (period model), and as atomic units that are explicitly stored in a new period dimension (period* model). The period dimension is enriched with information about the days of each period, thereby combining the two former models. We use four different classes of aggregation queries to analyze query formulation, query execution, and query performance over the three models. An extensive empirical evaluation on synthetic and real-world datasets and the analysis of the query execution plans reveal that the period model is the best choice in terms of runtime and space for all four query classes.

Paper
Danila Piatov, Sven Helmer, Anton Dignös, and Johann Gamper: “Interactive and space-efficient multi-dimensional time series subsequence matching”, in Information Systems, Volume 82, pp. 121-135, May 2019.
View info

Abstract We develop a highly efficient access method, called Delta-Top-Index, to answer top-k subsequence matching queries over a multi-dimensional time series data set. Compared to a naive implementation, our index has a storage cost that is up to two orders of magnitude smaller, while providing answers within microseconds. Additionally, we apply cache optimization techniques to speed up the construction of the index. Finally, we demonstrate the efficiency and effectiveness of our technique in an experimental evaluation with real-world data.

Paper
Michael Shekelyan, Anton Dignös, and Johann Gamper: “Sparse prefix sums: constant-time range sum queries over sparse multidimensional data cubes”, in Information Systems, Volume 82, pp. 136-147, May 2019.
View info

Abstract Prefix sums are a powerful technique to answer range-sum queries over multi-dimensional arrays in O(1) time by looking up a constant number of values in an array of size O(N) where N is the number of cells in the multi-dimensional array. However, the technique suffers from O(N) update and storage costs. Relative prefix sums address the high update costs by partitioning the array into blocks, thereby breaking the dependency between cells. In this paper, we present sparse prefix sums that exploit data sparsity to reduce the high storage costs of relative prefix sums. By building upon relative prefix sums, sparse prefix sums achieve the same update complexity as relative prefix sums. The authors of relative prefix sums erroneously claimed that the update complexity is O(sqrt(N)) for any number of dimensions. We show that this claim holds only for two dimensions, whereas the correct complexity for an arbitrary number d of dimensions is O(N^((d-1)/d)). To reduce the storage costs, the sparse prefix sums technique exploits sparsity in the data and avoids materializing prefix sums for empty rows and columns in the data grid; instead, look-up tables are used to preserve constant query time. Sparse prefix sums are the first approach to achieve O(1) query time with sub-linear storage costs for range-sum queries over sparse low-dimensional arrays. A thorough experimental evaluation shows that the approach works very well in practice. On the tested real-world data sets the storage costs are reduced by an order of magnitude with only a small overhead in query time, thus preserving microsecond-fast query answering.

Paper
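
The classical prefix sum technique that the abstract above starts from can be sketched for two dimensions as follows. This is the plain, dense variant, not the relative or sparse variant proposed in the paper, and the function names are hypothetical. A range sum is answered with four lookups by inclusion-exclusion, while the prefix array has as many entries as the data array (plus a guard row and column), which is the O(N) storage cost the paper sets out to reduce.

```python
def build_prefix_sums(grid):
    """P[i][j] holds the sum of grid[0:i][0:j]; row 0 and column 0 are zero."""
    rows, cols = len(grid), len(grid[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = grid[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P


def range_sum(P, r1, c1, r2, c2):
    """Sum of grid[r1..r2][c1..c2] (inclusive bounds) using four lookups."""
    return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]
```

Changing a single cell of the grid can invalidate every prefix sum to its lower right, which is the O(N) update cost mentioned in the abstract.
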

2018

Michael H. Böhlen, Anton Dignös, Johann Gamper, and Christian S. Jensen: “Database technology for processing temporal data (invited paper)”, in Proceedings of the 25th International Symposium on Temporal Representation and Reasoning (TIME), Warsaw, Poland, pp. 2:1-2:7, October 15-17, 2018.
View info

Abstract Despite the ubiquity of temporal data and considerable research on processing such data, database systems largely remain designed for processing the current state of some modeled reality. More recently, we have seen an increasing interest in processing historical or temporal data. The SQL:2011 standard introduced some temporal features, and commercial database management systems have started to offer temporal functionalities in a step-by-step manner. There has also been a proposal for a more fundamental and comprehensive solution for sequenced temporal queries, which allows a tight integration into relational database systems, thereby taking advantage of existing query optimization and evaluation technologies. New challenges for processing temporal data arise with multiple dimensions of time and the increasing amounts of data, including time series data that represent a special kind of temporal data.

Paper
Michael H. Böhlen, Anton Dignös, Johann Gamper, and Christian S. Jensen: “Temporal data management - an overview”, in Business Intelligence and Big Data (eBISS), Lecture Notes in Business Information Processing, Volume 324, pp. 51-83, 2018.
View info

Abstract Despite the ubiquity of temporal data and considerable research on the effective and efficient processing of such data, database systems largely remain designed for processing the current state of some modeled reality. More recently, we have seen an increasing interest in the processing of temporal data that captures multiple states of reality. The SQL:2011 standard incorporates some temporal support, and commercial DBMSs have started to offer temporal functionality in a step-by-step manner, such as the representation of temporal intervals, temporal primary and foreign keys, and the support for so-called time-travel queries that enable access to past states. This tutorial gives an overview of state-of-the-art research results and technologies for storing, managing, and processing temporal data in relational database management systems. Following an introduction that offers a historical perspective, we provide an overview of basic temporal database concepts. Then we survey the state-of-the-art in temporal database research, followed by a coverage of the support for temporal data in the current SQL standard and the extent to which the temporal aspects of the standard are supported by existing systems. The tutorial ends by covering a recently proposed framework that provides comprehensive support for processing temporal data and that has been implemented in PostgreSQL.

Paper
Vincenzo Del Fatto, Anton Dignös, and Johann Gamper: “Time°diff: a visual approach to compare period data”, in Proceedings of the 22nd International Conference on Information Visualisation (IV), Salerno, Italy, pp. 38-43, July 10-13, 2018.
View info

Abstract Temporal data, and in particular time periods, are crucial to many applications in different sectors, such as industry, medicine, insurance, finance, tourism, and management. Such applications often consult historical information in order to compare and optimize processes. Generally, the time periods in this data represent the period of validity in the real-world, such as the period of a specific assignment, but may also represent the periods when the data was stored, i.e., believed to be true. Inferring new information from this data is eased by visualizing and comparing their different time periods. In this paper, we present Time°diff, a novel visualization approach based on timebar charts, which is suitable for comparing data with time periods and enabling decision makers to easily analyze information containing period data.

Paper | Slides

2017

Michael Shekelyan, Anton Dignös, and Johann Gamper: “DigitHist: a histogram-based data summary with tight error bounds”, in Proceedings of the VLDB Endowment (PVLDB), 10(11): 1514-1525, August 2017.
View info

Abstract We propose DigitHist, a histogram summary for selectivity estimation on multidimensional data with tight error bounds. By combining multidimensional and one-dimensional histograms along regular grids of different resolutions, DigitHist provides an accurate and reliable histogram approach for multidimensional data. To achieve a compact summary, we use a sparse representation combined with a novel histogram compression technique that chooses a higher resolution in dense regions and a lower resolution elsewhere. For the construction of DigitHist, we propose a new error measure, termed u-error, which minimizes the width between the guaranteed upper and lower bounds of the selectivity estimate. The construction algorithm performs a single data scan and has linear time complexity. An in-depth experimental evaluation shows that DigitHist delivers superior precision and error bounds than state-of-the-art competitors at a comparable query time.

Paper | Poster | Slides
Michael Shekelyan, Anton Dignös, and Johann Gamper: “Sparse prefix sums”, in Proceedings of the 21st European Conference on Advances In Databases and Information Systems (ADBIS), Nicosia, Cyprus, pp. 120-135, September 24-27, 2017.
View info

Abstract The prefix sum approach is a powerful technique to answer range-sum queries over multi-dimensional arrays in constant time by requiring only a few look-ups in an array of precomputed prefix sums. In this paper, we propose the sparse prefix sum approach that is based on relative prefix sums and exploits sparsity in the data to vastly reduce the storage costs for the prefix sums. The proposed approach has desirable theoretical properties and works well in practice. It is the first approach achieving constant query time with sub-linear update costs and storage costs for range-sum queries over sparse low-dimensional arrays. Experiments on real-world data sets show that the approach reduces storage costs by an order of magnitude with only a small overhead in query time, thus preserving microsecond-fast query answering.

Paper | Slides
Giovanni Mahlknecht, Michael H. Böhlen, Anton Dignös, and Johann Gamper: “VISOR: visualizing summaries of ordered data”, in Proceedings of the 29th International Conference on Scientific and Statistical Database Management (SSDBM), Demo track, Chicago, IL, USA, pp. 40:1-40:5, June 27-29, 2017.
View info

Abstract In this paper, we present the VISOR tool, which helps the user to explore data and their summary structures by visualizing the relationships between the size k of a data summary and the induced error. Given an ordered dataset, VISOR allows the user to vary the size k of a data summary and to immediately see the effect on the induced error, by visualizing the error and its dependency on k in an epsilon-graph and delta-graph, respectively. The user can easily explore different values of k and determine the best value for the summary size. VISOR also allows the user to compare different summarization methods, such as piecewise constant approximation, piecewise aggregation approximation or V-optimal histograms. We show several demonstration scenarios, including how to determine an appropriate value for the summary size and how to compare different summarization techniques.

Paper | Poster | Code
Kevin Wellenzohn, Michael H. Böhlen, Anton Dignös, Johann Gamper, and Hannes Mitterer: “Continuous imputation of missing values in streams of pattern-determining time series”, in Proceedings of the 20th International Conference on Extending Database Technology (EDBT), Venice, Italy, pp. 330-341, March 21-24, 2017.
View info

Abstract Time series data is ubiquitous but often incomplete, e.g., due to sensor failures and transmission errors. Since many applications require complete data, missing values must be imputed before further data processing is possible. We propose Top-k Case Matching (TKCM) to impute missing values in streams of time series data. TKCM defines for each time series a set of reference time series and exploits similar historical situations in the reference time series for the imputation. A situation is characterized by the anchor point of a pattern that consists of l consecutive measurements over the reference time series. A missing value in a time series s is derived from the values of s at the anchor points of the k most similar patterns. We show that TKCM imputes missing values consistently if the reference time series pattern-determine time series s, i.e., the pattern of length l at time t_n is repeated at least k times in the reference time series and the corresponding values of s at the anchor time points are similar to each other. In contrast to previous work, we support time series that are not linearly correlated but, e.g., phase shifted. TKCM is resilient to consecutively missing values, and the accuracy of the imputed values does not decrease if blocks of values are missing. The results of an exhaustive experimental evaluation using real-world and synthetic data show that we outperform the state-of-the-art solutions.

Paper | Poster | Slides | Code
Giovanni Mahlknecht, Anton Dignös, and Johann Gamper: “A scalable dynamic programming scheme for the computation of optimal k-segments for ordered data”, in Information Systems, Volume 70, pp. 2-17, October 2017.
View info

Abstract The optimal k-segments of an ordered dataset of size n consist of k tuples that are obtained by merging consecutive tuples such that a given error metric is minimized. The problem is general and has been studied in various flavors, e.g., piecewise-constant approximation, parsimonious temporal aggregation, and v-optimal histograms. A well-known computation scheme for the optimal k-segments is based on dynamic programming, which computes a k * n error matrix E and a corresponding split point matrix J of the same size. This yields O(n * k) space and O(n^2 * k) runtime complexity. In this article, we propose three optimization techniques for the runtime complexity and one for the space complexity. First, diagonal pruning identifies regions of the error matrix E that need not be computed since they cannot lead to a valid solution. Second, for those cells in E that are computed, we provide a heuristic to determine a better seed value, which in turn leads to a tighter lower bound for the potential split points to be considered for the calculation of the minimal error. Third, we show how the algorithm can be effectively parallelized. The space complexity is dominated by the split point matrix J, which needs to be kept till the end. To tackle this problem, we replace the split point matrix by a dynamic split point graph, which eliminates entries that are not needed to retrieve the optimal solution. A detailed experimental evaluation shows the effectiveness of the proposed solutions. Our optimization techniques significantly improve the runtime of state-of-the-art matrix implementations, and they guarantee a comparable performance of an implementation that uses the split point graph. The split point graph reduces the memory consumption by up to two orders of magnitude and allows us to process large datasets for which the memory explodes if the matrix is used.

Paper
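
The baseline dynamic programming scheme that the abstract above refers to can be sketched as follows. This is the unoptimized scheme with an error matrix E and a split point matrix J, without diagonal pruning, seed values, parallelization, or the split point graph. The choice of the sum of squared errors around the segment mean as the error metric (as in piecewise-constant approximation) and the function name optimal_k_segments are assumptions made for illustration.

```python
from itertools import accumulate


def optimal_k_segments(values, k):
    """Split values into k consecutive segments with minimal total
    squared error; requires 1 <= k <= len(values).

    Returns (error, segments), where segments are half-open index pairs.
    """
    n = len(values)
    s = [0] + list(accumulate(values))
    ss = [0] + list(accumulate(v * v for v in values))

    def sse(i, j):
        # Squared error of replacing values[i:j] by their mean.
        total = s[j] - s[i]
        return (ss[j] - ss[i]) - total * total / (j - i)

    inf = float("inf")
    E = [[inf] * (n + 1) for _ in range(k + 1)]   # error matrix
    J = [[0] * (n + 1) for _ in range(k + 1)]     # split point matrix
    E[0][0] = 0.0
    for m in range(1, k + 1):                     # number of segments
        for i in range(m, n + 1):                 # prefix length
            for j in range(m - 1, i):             # start of last segment
                cand = E[m - 1][j] + sse(j, i)
                if cand < E[m][i]:
                    E[m][i], J[m][i] = cand, j
    # Follow the split points backwards to recover the segments.
    segments, i = [], n
    for m in range(k, 0, -1):
        j = J[m][i]
        segments.append((j, i))
        i = j
    segments.reverse()
    return E[k][n], segments
```

The three nested loops give the O(n^2 * k) runtime and the two matrices the O(n * k) space stated in the abstract; J must be kept until the end because the backtracking step reads one entry per row.
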

2016

Anton Dignös, Michael H. Böhlen, Johann Gamper, and Christian S. Jensen: “Extending the kernel of a relational DBMS with comprehensive support for sequenced temporal queries”, in ACM Transactions on Database Systems (TODS), 41(4), Article 26, 46 pages, November 2016.
View info

Abstract Many databases contain temporal, or time-referenced, data and use intervals to capture the temporal aspect. While SQL-based database management systems (DBMSs) are capable of supporting the management of interval data, the support they offer can be improved considerably. A range of proposed temporal data models and query languages offer ample evidence to this effect. Natural queries that are very difficult to formulate in SQL are easy to formulate in these temporal query languages. The increased focus on analytics over historical data where queries are generally more complex exacerbates the difficulties and thus the potential benefits of a temporal query language. Commercial DBMSs have recently started to offer limited temporal functionality in a step-by-step manner, focusing on the representation of intervals and neglecting the implementation of the query evaluation engine. This paper demonstrates how it is possible to extend the relational database engine to achieve a full-fledged, industrial-strength implementation of sequenced temporal queries, which intuitively are queries that are evaluated at each time point. Our approach reduces temporal queries to nontemporal queries over data with adjusted intervals, and it leaves the processing of nontemporal queries unaffected. Specifically, the approach hinges on three concepts: interval adjustment, timestamp propagation, and attribute scaling. Interval adjustment is enabled by introducing two new relational operators, a temporal normalizer and a temporal aligner, and the latter two concepts are enabled by the replication of timestamp attributes and the use of so-called scaling functions. By providing a set of reduction rules, we can transform any temporal query, expressed in terms of temporal relational operators, to a query expressed in terms of relational operators and the two new operators. We prove that the size of a transformed query is linear in the number of temporal operators in the original query. 
An integration of the new operators and the transformation rules, along with query optimization rules, into the kernel of PostgreSQL is reported. Empirical studies with the resulting temporal DBMS are covered that offer insights into pertinent design properties of the paper's proposal. The new system is available as open source software.
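
The reduction idea can be illustrated with a minimal Python sketch. It is not the PostgreSQL implementation: the dictionary layout, the half-open `[ts, te)` intervals, and the names `normalize`, `sequenced_count`, `orig_ts` and `orig_te` are assumptions made for illustration. The sketch splits intervals at the boundaries of tuples in the same group (interval adjustment), keeps a copy of the original interval (timestamp propagation), and then applies an ordinary, nontemporal grouping.

```python
from collections import defaultdict


def normalize(rows, key):
    """Split each row's interval at every start/end point of rows in the same group.

    Rows are dicts with half-open intervals [ts, te). The original interval is
    copied into orig_ts/orig_te so later operators can still refer to it.
    """
    points = defaultdict(set)
    for row in rows:
        points[key(row)].update((row["ts"], row["te"]))

    out = []
    for row in rows:
        cuts = sorted(p for p in points[key(row)] if row["ts"] <= p <= row["te"])
        for ts, te in zip(cuts, cuts[1:]):
            out.append({**row, "ts": ts, "te": te,
                        "orig_ts": row["ts"], "orig_te": row["te"]})
    return out


def sequenced_count(rows, key):
    """Count rows per group at each time point via a plain group-by on adjusted intervals."""
    counts = defaultdict(int)
    for row in normalize(rows, key):
        counts[(key(row), row["ts"], row["te"])] += 1
    return dict(counts)


employees = [
    {"dept": "A", "ts": 1, "te": 5},
    {"dept": "A", "ts": 3, "te": 8},
]
result = sequenced_count(employees, key=lambda row: row["dept"])
```

After normalization, tuples of the same group either share an identical interval or are disjoint, which is why a grouping on the interval columns is sufficient.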

Paper | Online demo and code
Danila Piatov, Sven Helmer, and Anton Dignös: “An interval join optimized for modern hardware”, in Proceedings of the 32nd IEEE International Conference on Data Engineering (ICDE), Helsinki, Finland, pp. 1098-1109, May 16-20, 2016.

Abstract We develop an algorithm for efficiently joining relations on interval-based attributes with overlap predicates, which, for example, are commonly found in temporal databases. Using a new data structure and a lazy evaluation technique, we are able to achieve impressive performance gains by optimizing memory accesses and exploiting features of modern CPU architectures. In an experimental evaluation with real-world datasets, our algorithm is able to outperform the state of the art by an order of magnitude.
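
For readers unfamiliar with the operation, the following Python sketch shows what an overlap join computes. It is a plain sweep over sorted endpoints and does not reproduce the data structure or the lazy evaluation technique of the paper. The function name `overlap_join` and the representation of intervals as half-open `(start, end)` pairs with `start < end` are assumptions.

```python
def overlap_join(r, s):
    """Return all index pairs (i, j) such that r[i] and s[j] overlap.

    Intervals are half-open (start, end) pairs with start < end.
    """
    events = []
    for i, (start, end) in enumerate(r):
        events.append((start, 1, 0, i))  # (time, is_start, side, index)
        events.append((end, 0, 0, i))
    for j, (start, end) in enumerate(s):
        events.append((start, 1, 1, j))
        events.append((end, 0, 1, j))
    # At equal timestamps, end events (0) sort before start events (1),
    # so [1, 3) and [3, 5) are not reported as overlapping.
    events.sort()

    active = (set(), set())  # currently open intervals of r and of s
    result = []
    for _, is_start, side, idx in events:
        if is_start:
            for other in active[1 - side]:
                result.append((idx, other) if side == 0 else (other, idx))
            active[side].add(idx)
        else:
            active[side].discard(idx)
    return result


pairs = overlap_join([(1, 5), (6, 9)], [(4, 7), (9, 12)])
```

Each result pair is produced exactly once, at the moment the later of the two intervals starts.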

Paper | Poster | Slides

2015

Giovanni Mahlknecht, Anton Dignös, and Johann Gamper: “Efficient computation of parsimonious temporal aggregation”, in Proceedings of the 19th East-European Conference on Advances In Databases and Information Systems (ADBIS), Poitiers, France, pp. 320-333, September 8-11, 2015.

Abstract Parsimonious temporal aggregation (PTA) has been introduced to overcome limitations of previous temporal aggregation operators, namely to provide a concise yet data-sensitive summary of temporal data. The basic idea of PTA is to first compute instant temporal aggregation (ITA) as an intermediate result and then to merge similar adjacent tuples in order to reduce the final result size. The best known algorithm to compute a correct PTA result is based on dynamic programming (DP) and requires O(n^2) space to store a so-called split point matrix, where n is the size of the intermediate data. The matrix stores the split points between which the intermediate tuples are merged. In this paper, we propose two optimizations of the DP algorithm for PTA queries. The first optimization is termed diagonal pruning and identifies regions of the matrix that need not be computed. This reduces the runtime complexity. The second optimization addresses the space complexity. We observed that only a subset of the elements in the split point matrix is actually needed. Therefore, we propose to replace the split point matrix by a so-called split point graph, which stores only those split points that are needed to restore the optimal PTA solution. This step reduces the memory consumption. An empirical evaluation shows the effectiveness of the two optimizations both in terms of runtime and memory consumption.
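
The baseline DP can be sketched in Python as follows. This is a simplified illustration, not the paper's algorithm: it assumes a single group of gapless adjacent ITA tuples, a duration-weighted squared error as the merge cost, and a user-given result size `c` with `1 <= c <= n`. Neither diagonal pruning nor the split point graph is shown, and the function name `pta` is hypothetical.

```python
def pta(values, durations, c):
    """Merge n adjacent tuples into c tuples with minimal duration-weighted squared error.

    Returns (merged, error), where merged is a list of (mean value, duration) pairs.
    Assumes 1 <= c <= len(values) and positive durations.
    """
    n = len(values)
    # Prefix sums of durations, weighted values and weighted squares.
    W, S, Q = [0.0], [0.0], [0.0]
    for v, d in zip(values, durations):
        W.append(W[-1] + d)
        S.append(S[-1] + d * v)
        Q.append(Q[-1] + d * v * v)

    def sse(i, j):
        """Error of replacing tuples i..j-1 by one tuple holding their weighted mean."""
        w, s, q = W[j] - W[i], S[j] - S[i], Q[j] - Q[i]
        return q - s * s / w

    inf = float("inf")
    err = [[inf] * (n + 1) for _ in range(c + 1)]
    split = [[0] * (n + 1) for _ in range(c + 1)]  # split point matrix
    err[0][0] = 0.0
    for k in range(1, c + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cand = err[k - 1][j] + sse(j, i)
                if cand < err[k][i]:
                    err[k][i], split[k][i] = cand, j

    # Walk the split points backwards to restore the optimal merge.
    bounds, i = [], n
    for k in range(c, 0, -1):
        j = split[k][i]
        bounds.append((j, i))
        i = j
    bounds.reverse()
    merged = [((S[j] - S[i]) / (W[j] - W[i]), W[j] - W[i]) for i, j in bounds]
    return merged, err[c][n]


merged, error = pta([10, 10, 50, 52], [1, 1, 1, 1], c=2)
```

The `split` table here plays the role of the split point matrix mentioned in the abstract; only the entries visited during the backward walk are needed to restore the solution.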

Paper | Slides

2014

Anton Dignös, Michael H. Böhlen, and Johann Gamper: “Overlap interval partition join”, in Proceedings of the 2014 ACM SIGMOD International Conference on the Management of Data (SIGMOD), Snowbird, UT, USA, pp. 1459-1470, June 22-27, 2014.

Abstract Each tuple in a valid-time relation includes an interval attribute T that represents the tuple's valid time. The overlap join between two valid-time relations determines all pairs of tuples with overlapping intervals. Although overlap joins are common, existing partitioning and indexing schemes are inefficient if the data includes long-lived tuples or if intervals intersect partition boundaries. We propose Overlap Interval Partitioning (OIP), a new partitioning approach for data with an interval. OIP divides the time range of a relation into k base granules and defines overlapping partitions for sequences of contiguous granules. OIP is the first partitioning method for interval data that gives a constant clustering guarantee: the difference in duration between the interval of a tuple and the interval of its partition is independent of the duration of the tuple's interval. We offer a detailed analysis of the average false hit ratio and the average number of partition accesses for queries with overlap predicates, and we prove that the average false hit ratio is independent of the number of short- and long-lived tuples. To compute the overlap join, we propose the Overlap Interval Partition Join (OIPJoin), which uses OIP to partition the input relations on-the-fly. Only the tuples from overlapping partitions have to be joined to compute the result. We analytically derive the optimal number of granules, k, for partitioning the two input relations, from the size of the data, the cost of CPU operations, and the cost of main memory or disk IOs. Our experiments confirm the analytical results and show that the OIPJoin outperforms state-of-the-art techniques for the overlap join.
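
A rough Python sketch of the partitioning idea follows. It assumes integer timestamps, half-open intervals, and tuples of the form `(start, end, payload)`. The names `oip_partition` and `oip_join` are hypothetical, `k` is passed in rather than derived analytically, and the partition pairs are enumerated naively, which the actual OIPJoin avoids.

```python
from collections import defaultdict


def oip_partition(tuples, origin, granule):
    """Assign each tuple to the partition (first granule, last granule) it spans."""
    parts = defaultdict(list)
    for start, end, payload in tuples:
        first = (start - origin) // granule
        last = (end - 1 - origin) // granule  # end is exclusive
        parts[(first, last)].append((start, end, payload))
    return parts


def oip_join(r, s, k):
    """Overlap join of two lists of (start, end, payload) tuples using k base granules."""
    lo = min(t[0] for t in r + s)
    hi = max(t[1] for t in r + s)
    granule = max(1, -(-(hi - lo) // k))  # granule duration, rounded up
    parts_r = oip_partition(r, lo, granule)
    parts_s = oip_partition(s, lo, granule)

    out = []
    for (i1, j1), r_tuples in parts_r.items():
        for (i2, j2), s_tuples in parts_s.items():
            if i1 <= j2 and i2 <= j1:  # only overlapping partitions are joined
                for a in r_tuples:
                    for b in s_tuples:
                        if a[0] < b[1] and b[0] < a[1]:  # filter out false hits
                            out.append((a[2], b[2]))
    return out


r = [(0, 10, "r1"), (40, 45, "r2")]
s = [(5, 8, "s1"), (9, 41, "s2"), (60, 100, "s3")]
joined = oip_join(r, s, k=10)
```

Because a partition spans exactly the granules its tuples touch, a long-lived tuple such as `s2` is stored once, in partition `(0, 4)`, instead of being replicated across granules.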

Paper | Poster | Slides
Anton Dignös: “Interval-Dependent Attributes in Relational Database Systems”, PhD Thesis, University of Zurich, 2014.

Abstract Data with time intervals is prominently present in finance, accounting, medicine and many other application domains. When querying such data, it is important to perform operations on aligned intervals, i.e., data is processed together only for the common interval where it is valid in the real world. For instance, an employee contributed to a project only for the time period where both the project was running and the employee was employed by the company, i.e., the employee contributed to the project only over their aligned time interval. A temporal join is thus only evaluated over the aligned interval of an employee and a project. The problem of performing temporal operations, such as temporal aggregation or temporal joins, on data with time intervals using relational database systems can be attributed to the lack of primitives for the alignment of intervals. Even more challenges arise when the data includes attribute values that are interval-dependent, such as project budgets or cumulative costs, and need to be scaled along with the alignment of intervals during processing. The goal of this thesis is to provide systematic and built-in support for querying data with intervals in relational database systems. The solution we propose uses two temporal primitives, a temporal normalizer and a temporal aligner, for the alignment of intervals. Temporal operators on interval data are defined by reduction rules that map a temporal operator to an operation with a temporal primitive followed by the corresponding traditional non-temporal operator that uses equality on aligned intervals. A key feature of our approach is that operators can access the original time intervals in predicates and functions, such as join conditions and aggregation functions, using timestamp propagation. Our approach, through timestamp propagation, supports the scaling of attribute values that are interval-dependent.
When intervals are aligned during query processing, scaling can be performed at query time with the help of user-defined functions. This allows users to choose whether and how attribute values should be scaled. This is necessary since they may be interested in the total value in one query and the scaled value according to days or even working days in another query. We integrated our solution into the kernel of the open source database system PostgreSQL, which allows us to leverage existing query optimization techniques and algorithms.
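
The choice between scaling functions can be illustrated with a small Python sketch. In the thesis, scaling is done with user-defined functions inside PostgreSQL. The Python functions `days`, `working_days` and `scale` below are hypothetical stand-ins, and intervals are assumed to be half-open pairs of dates.

```python
from datetime import date, timedelta


def days(start, end):
    """Number of calendar days in the half-open interval [start, end)."""
    return (end - start).days


def working_days(start, end):
    """Number of Monday-to-Friday days in the half-open interval [start, end)."""
    count, day = 0, start
    while day < end:
        if day.weekday() < 5:
            count += 1
        day += timedelta(days=1)
    return count


def scale(value, original, adjusted, measure=days):
    """Scale value from the original interval to the adjusted one using the given measure."""
    total = measure(*original)
    return value * measure(*adjusted) / total if total else 0.0


budget = 7000
original = (date(2024, 1, 1), date(2024, 1, 15))  # propagated original interval
adjusted = (date(2024, 1, 6), date(2024, 1, 15))  # interval after alignment

by_days = scale(budget, original, adjusted)
by_working_days = scale(budget, original, adjusted, measure=working_days)
```

The same adjusted interval yields different scaled budgets depending on the measure, which is why the choice is left to the user at query time. Both calls need the original interval, which timestamp propagation makes available.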

Thesis

2013

Anton Dignös, Michael H. Böhlen, and Johann Gamper: “Query time scaling of attribute values in interval timestamped databases”, in Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE), Demo track, Brisbane, QLD, Australia, pp. 1304-1307, April 8-11, 2013.

Abstract In valid-time databases with interval timestamping, each tuple is associated with a time interval over which the recorded fact is true in the modeled reality. The adjustment of these intervals is an essential part of processing interval timestamped data. Some attribute values remain valid if the associated interval changes, whereas others have to be scaled along with the time interval. For example, attributes that record total (cumulative) quantities over time, such as project budgets, total sales or total costs, often must be scaled if the timestamp is adjusted. The goal of this demo is to show how to support the scaling of attribute values in SQL at query time.

Paper | Poster

2012

Anton Dignös, Michael H. Böhlen, and Johann Gamper: “Temporal alignment”, in Proceedings of the 2012 ACM SIGMOD International Conference on the Management of Data (SIGMOD), Scottsdale, AZ, USA, pp. 433-444, May 20-24, 2012.

Abstract In order to process interval timestamped data, the sequenced semantics has been proposed. This paper presents a relational algebra solution that provides native support for the three properties of the sequenced semantics: snapshot reducibility, extended snapshot reducibility, and change preservation. We introduce two temporal primitives, temporal splitter and temporal aligner, and define rules that use these primitives to reduce the operators of a temporal algebra to their nontemporal counterparts. Our solution supports the three properties of the sequenced semantics through interval adjustment and timestamp propagation. We have implemented the temporal primitives and reduction rules in the kernel of PostgreSQL to get native database support for processing interval timestamped data. The support is comprehensive and includes outer joins, antijoins, and aggregations with predicates and functions over the time intervals of argument relations. The implementation and empirical evaluation confirm the effectiveness and scalability of our solution, which leverages existing database query optimization techniques.
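
A simplified Python sketch of the aligner primitive is given below. It assumes tuples of the form `(start, end, *attributes)` with half-open intervals. The function name `align` and the optional predicate `theta` are hypothetical, and the sketch is not the PostgreSQL implementation. For each tuple of `r`, it emits the intersection with every overlapping tuple of `s`, plus the parts of the interval that no tuple of `s` covers.

```python
def align(r, s, theta=lambda a, b: True):
    """Adjust the intervals of r with respect to the matching, overlapping tuples of s."""
    out = []
    for a in r:
        a_start, a_end = a[0], a[1]
        # Intersections with every matching tuple of s that overlaps a.
        matches = sorted(
            (max(a_start, b[0]), min(a_end, b[1]))
            for b in s
            if theta(a, b) and b[0] < a_end and a_start < b[1]
        )
        pieces = set(matches)
        # Maximal sub-intervals of a not covered by any matching tuple of s.
        cursor = a_start
        for m_start, m_end in matches:
            if m_start > cursor:
                pieces.add((cursor, m_start))
            cursor = max(cursor, m_end)
        if cursor < a_end:
            pieces.add((cursor, a_end))
        out.extend((p_start, p_end) + tuple(a[2:]) for p_start, p_end in sorted(pieces))
    return out


employees = [(1, 10, "ann")]
projects = [(3, 5, "p1"), (4, 12, "p2")]
aligned = align(employees, projects)
```

Once both inputs are aligned against each other, a nontemporal join with equality on the interval columns pairs up exactly the tuples that are valid together. The uncovered pieces are what make reductions of outer joins and antijoins possible.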

Paper | Poster 1 | Poster 2 | Slides