This page contains a list of research publications with abstracts and links to online versions. If you cannot access the file you are interested in, please, feel free to contact me. Also, have a look at my dblp, google scholar and researchgate pages.
◼ Journal ◼ Conference proceedings ◼ Workshop proceedings ◼ Editor ◼ Book or book chapter ◼ Other
Abstract Matchings between objects from two datasets, domains, or ontologies have to be computed in various application scenarios. One often used meta-approach - which we call bipartite data matching - is to leverage domain knowledge for defining costs between the objects that should be matched, and to then use the classical Hungarian algorithm to compute a minimum cost bipartite matching. In this paper, we introduce and study the problem of enumerating K dissimilar minimum cost bipartite matchings. We formalize this problem, prove that it is NP-hard, and present heuristics based on greedy dynamic programming. The presented enumeration techniques are not only interesting in themselves, but also mitigate an often overlooked shortcoming of bipartite data matching, namely, that it is sensitive w. r. t. the storage order of the input data. Extensive experiments show that our enumeration heuristics clearly outperform existing algorithms in terms of dissimilarity of the obtained matchings, that they are effective at rendering bipartite data matching approaches more robust w. r. t. random storage order, and that they significantly improve the upper bounds of state-of-the art algorithms for graph edit distance computation that are based on bipartite data matching.
Abstract Joins are essential and potentially expensive operations in database management systems. When data is associated with time periods, joins commonly include predicates that require pairs of argument tuples to overlap in order to qualify for the result. Our goal is to enable built-in systems support for such joins. In particular, we present an approach where overlap joins are formulated as unions of range joins, which are more general purpose joins compared to overlap joins, i.e., are useful in their own right, and are supported well by B+-trees. The approach is sufficiently flexible that it also supports joins with additional equality predicates, as well as open, closed, and half-open time periods over discrete and continuous domains, thus offering both generality and simplicity, which is important in a system setting. We provide both a stand-alone solution that performs on par with the state-of-the-art and a DBMS embedded solution that is able to exploit standard indexing and clearly outperforms existing DBMS solutions that depend on specialized indexing techniques. We offer both analytical and empirical evaluations of the proposals. The empirical study includes comparisons with pertinent existing proposals and offers detailed insight into the performance characteristics of the proposals.
Abstract We address the problem of compactly approximating multidimensional range counts with a guaranteed maximum error and propose a novel histogram-based summary structure, termed SliceHist. The key idea is to operate a grid histogram in an approximately rank-transformed space, where the data points are more uniformly distributed and each grid slice contains only a small number of points. Then, the points of each slice are summarised again using the same technique. As each query box partially intersects only few slices and each grid slice has few data points, the summary is able to achieve tight error guarantees. In experiments and through analysis of non-asymptotic formulas we show that SliceHist is not only competitive with existing heuristics in terms of performance, but additionally offers tight error guarantees.
Abstract Stock market events are hard to model. In recent years, one approach that has been receiving increasing attention is to analyze graphs induced by price correlations of different stock companies. By analyzing the structure of such graphs, it is possible to identify critical events, e.g., market crises. To the best of our knowledge, there are no tools available that offer comprehensive support for such analyses. This paper introduces a novel tool that offers in-depth analysis with the ability of fine tuning parameters with an intuitive user interface. With a proposed workflow to handle time series data, the tool becomes versatile and it can analyze correlation graphs of different semantics: minimum spanning tree, graphs with edge thresholds, and evolving graphs. It also provides a rich set of functions that enable users to explore easily, interactively and systematically the correlation graphs starting from a file of raw time series data. With real-world stock data, we demonstrate how straightforward yet effective it is to accomplish various analytical tasks with the proposed tool.
Abstract We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen's relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
Abstract Temporal data is ubiquitous, and its importance has been witnessed by the research efforts for several decades as well as by the increased interest in the last years from both academia and industry. Two prominent research directions in this context are the field of temporal databases and the field of time series data. This extended abstract aims at providing a concise overview about the state of the art in processing temporal and time series data as well as to discuss open research problems and challenges.
Abstract Recently, an increasing need for sophisticated multimedia analytics tools has been observed, which is triggered by a rapid growth of multimedia collections and by an increasing number of scientific fields embedding images in their studies. Although temporal data is ubiquitous and crucial in many applications, such tools typically do not support the analysis of data along the temporal dimension, especially for time periods. An appropriate visualization and comparison of period data associated with multimedia collections would help users to infer new information from such collections. In this pa- per, we present a novel multimedia analytics application for summarizing and analyzing temporal data from eye-tracking experiments. The application combines three different visual approaches: Time°diff, visual-information-seeking mantra, and multi-viewpoint. A qualitative evaluation with domain experts confirmed that our application helps decision makers to summarize and analyze multimedia collections containing period data.
Abstract Today, most commercial database systems provide some support for the management of temporal data, but the index support for efficiently accessing such data is rather limited. Existing access paths neglect the fact that time intervals are located on the timeline and have a duration, two important pieces of information for querying temporal data. In this paper, we tackle this problem and introduce a novel index structure, termed Period Index, for efficiently accessing temporal data based on these two pieces of information. The index supports temporal queries that constrain the position of an interval on the timeline (range queries), its interval duration (duration queries), or both (range-duration queries). The key idea of the new index is to split the timeline into fixed-length buckets, each of which is divided into a set of cells that are organized in levels. The cells encode the position of intervals on the timeline, whereas the levels encode their duration. This grid-based index is well-suited for parallelization and non-uniform memory access (NUMA) architectures as it is common for modern hardware with large main-memories and multi-core servers. The Period Index is independent of the physical order of the data and has predictable performance due to the underlying hashing approach. We also propose an enhanced version of our index structure, termed Period Index∗, which continuously adapts the optimal bucket length to the distribution of the data. Our experiments show that Period Index∗ significantly beats other indexes for the class of queries that constrain both the position and the length of the time intervals, and it is competitive for queries that involve solely one temporal dimension.
Abstract With the ever increasing amount and complexity of data, visual analysis becomes a fundamental tool to spot correlations and other relationships in data. Most of the previous techniques (e.g., scatter plots or heatmaps) focus on point data, i.e., data with point measures, such as prices or volumes. In this demo paper, we focus on data with interval measures, that is data where measures consist of an interval or range of values, such as price ranges or time intervals. We present a tool, termed HOTPERIODS, which allows to visualize correlations between two interval measures in the two-dimensional space, where the two measures represent a rectangle. To visualize such data, we first perform a rectangle aggregation. The result of this aggregation is a density matrix, where each cell stores the number of rectangles that cover the corresponding points in space. For the visualization of the density matrix, color-coding is used to represent different density values similar to heatmaps. We illustrate the usefulness of HOTPERIODS for the analysis of stock market data and tourism data, both of which show interval measures.
Abstract Snapshot semantics is widely used for evaluating queries over temporal data: temporal relations are seen as sequences of snapshot relations, and queries are evaluated at each snapshot. In this work, we demonstrate that current approaches for snapshot semantics over interval-timestamped multiset relations are subject to two bugs regarding snapshot aggregation and bag difference. We introduce a novel temporal data model based on K-relations that overcomes these bugs and prove it to correctly encode snapshot semantics. Furthermore, we present an efficient implementation of our model as a database middleware and demonstrate experimentally that our approach is competitive with native implementations.
Abstract In this paper, we study different ways of representing and querying fact data that is time-stamped with a time period in a data warehouse. The main focus is on how to represent the time periods that are associated with the facts in order to support convenient and efficient aggregations over time. We propose three distinct logical models that represent time periods, respectively, as sets of all time points in a period (instant model), as pairs of start and end time points of a period (period model), and as atomic units that are explicitly stored in a new period dimension (period*model). The period dimension is enriched with information about the days of each period, thereby combining the two former models. We use four different classes of aggregation queries to analyze query formulation, query execution, and query performance over the three models. An extensive empirical evaluation on synthetic and real-world datasets and the analysis of the query execution plans reveals that the period model is the best choice in terms of runtime and space for all four query classes.
Abstract We develop a highly efficient access method, called Delta-Top-Index, to answer top-k subsequence matching queries over a multi-dimensional time series data set. Compared to a naive implementation, our index has a storage cost that is up to two orders of magnitude smaller, while providing answers within microseconds. Additionally, we apply cache optimization techniques to speed up the construction of the index. Finally, we demonstrate the efficiency and effectiveness of our technique in an experimental evaluation with real-world data.
Abstract Prefix sums are a powerful technique to answer range-sum queries over multi-dimensional arrays in O(1) time by looking up a constant number of values in an array of size O(N) where N is the number of cells in the multi-dimensional array. However, the technique suffers from O(N) update and storage costs. Relative prefix sums address the high update costs by partitioning the array into blocks, thereby breaking the dependency between cells. In this paper, we present sparse prefix sums that exploit data sparsity to reduce the high storage costs of relative prefix sums. By building upon relative prefix sums, sparse prefix sums achieve the same update complexity as relative prefix sums. The authors of relative prefix sums erroneously claimed that the update complexity is O(sqrt(N)) for any number of dimensions. We show that this claim holds only for two dimensions, whereas the correct complexity for an arbitrary number of d dimensions is O(N^((d-1)/d)). To reduce the storage costs, the sparse prefix sums technique exploits sparsity in the data and avoids to materialize prefix sums for empty rows and columns in the data grid; instead, look-up tables are used to preserve constant query time. Sparse prefix sums are the first approach to achieve O(1) query time with sub-linear storage costs for range-sum queries over sparse low-dimensional arrays. A thorough experimental evaluation shows that the approach works very well in practice. On the tested real-world data sets the storage costs are reduced by an order of magnitude with only a small overhead in query time, thus preserving microsecond-fast query answering.
Abstract Despite the ubiquity of temporal data and considerable research on processing such data, database systems largely remain designed for processing the current state of some modeled reality. More recently, we have seen an increasing interest in processing historical or temporal data. The SQL:2011 standard introduced some temporal features, and commercial database management systems have started to offer temporal functionalities in a step-by-step manner. There has also been a proposal for a more fundamental and comprehensive solution for sequenced temporal queries, which allows a tight integration into relational database systems, thereby taking advantage of existing query optimization and evaluation technologies. New challenges for processing temporal data arise with multiple dimensions of time and the increasing amounts of data, including time series data that represent a special kind of temporal data.
Abstract Despite the ubiquity of temporal data and considerable research on the effective and efficient processing of such data, database systems largely remain designed for processing the current state of some modeled reality. More recently, we have seen an increasing interest in the processing of temporal data that captures multiple states of reality. The SQL:2011 standard incorporates some temporal support, and commercial DBMSs have started to offer temporal functionality in a step-by-step manner, such as the representation of temporal intervals, temporal primary and foreign keys, and the support for so-called time-travel queries that enable access to past states. This tutorial gives an overview of state-of-the-art research results and technologies for storing, managing, and processing temporal data in relational database management systems. Following an introduction that offers a historical perspective, we provide an overview of basic temporal database concepts. Then we survey the state-of-the-art in temporal database research, followed by a coverage of the support for temporal data in the current SQL standard and the extent to which the temporal aspects of the standard are supported by existing systems. The tutorial ends by covering a recently proposed framework that provides comprehensive support for processing temporal data and that has been implemented in PostgreSQL.
Abstract Temporal data, and in particular time periods, are crucial to many applications in different sectors, such as industry, medicine, insurance, finance, tourism, and management. Such applications often consult historical information in order to compare and optimize processes. Generally, the time periods in this data represent the period of validity in the real-world, such as the period of a specific assignment, but may also represent the periods when the data was stored, i.e., believed to be true. Inferring new information from this data is eased by visualizing and comparing their different time periods. In this paper, we present Time°diff, a novel visualization approach based on timebar charts, which is suitable for comparing data with time periods and enabling decision makers to easily analyze information containing period data.
Abstract We propose DigitHist, a histogram summary for selectivity estimation on multidimensional data with tight error bounds. By combining multidimensional and one-dimensional histograms along regular grids of different resolutions, DigitHist provides an accurate and reliable histogram approach for multidimensional data. To achieve a compact summary, we use a sparse representation combined with a novel histogram compression technique that chooses a higher resolution in dense regions and a lower resolution elsewhere. For the construction of DigitHist, we propose a new error measure, termed u-error, which minimizes the width between the guaranteed upper and lower bounds of the selectivity estimate. The construction algorithm performs a single data scan and has linear time complexity. An in-depth experimental evaluation shows that DigitHist delivers superior precision and error bounds than state-of-the-art competitors at a comparable query time.
Abstract The prefix sum approach is a powerful technique to answer range-sum queries over multi-dimensional arrays in constant time by requiring only a few look-ups in an array of precomputed prefix sums. In this paper, we propose the sparse prefix sum approach that is based on relative prefix sums and exploits sparsity in the data to vastly reduce the storage costs for the prefix sums. The proposed approach has desirable theoretical properties and works well in practice. It is the first approach achieving constant query time with sub-linear update costs and storage costs for range-sum queries over sparse low-dimensional arrays. Experiments on real-world data sets show that the approach reduces storage costs by an order of magnitude with only a small overhead in query time, thus preserving microsecond-fast query answering.
Abstract In this paper, we present the VISOR tool, which helps the user to explore data and their summary structures by visualizing the relationships between the size k of a data summary and the induced error. Given an ordered dataset, VISOR allows to vary the size k of a data summary and to immediately see the effect on the induced error, by visualizing the error and its dependency on k in an epsilon-graph and delta-graph, respectively. The user can easily explore different values of k and determine the best value for the summary size. VISOR allows also to compare different summarization methods, such as piecewise constant approximation, piecewise aggregation approximation or V-optimal histograms. We show several demonstration scenarios, including how to determine an appropriate value for the summary size and comparing different summarization techniques.
Abstract Time series data is ubiquitous but often incomplete, e.g., due to sensor failures and transmission errors. Since many applications require complete data, missing values must be imputed before further data processing is possible. We propose Top-k Case Matching (TKCM) to impute missing values in streams of time series data. TKCM defines for each time series a set of reference time series and exploits similar historical situations in the reference time series for the imputation. A situation is characterized by the anchor point of a pattern that consists of l consecutive measurements over the reference time series. A missing value in a time series s is derived from the values of s at the anchor points of the k most similar patterns. We show that TKCM imputes missing values consistently if the reference time series pattern-determine time series s, i.e., the pattern of length l at time tn is repeated at least k times in the reference time series and the corresponding values of s at the anchor time points are similar to each other. In contrast to previous work, we support time series that are not linearly correlated but, e.g., phase shifted. TKCM is resilient to consecutively missing values, and the accuracy of the imputed values does not decrease if blocks of values are missing. The results of an exhaustive experimental evaluation using real-world and synthetic data shows that we outperform the state-of-the-art solutions.
Abstract The optimal k-segments of an ordered dataset of size n consists of k tuples that are obtained by merging consecutive tuples such that a given error metric is minimized. The problem is general and has been studied in various flavors, e.g., piecewise-constant approximation, parsimonious temporal aggregation, and v-optimal histograms. A well-known computation scheme for the optimal k-segments is based on dynamic programming, which computes a k * n error matrix E and a corresponding split point matrix J of the same size. This yields O(n * k) space and O(n^2 * k) runtime complexity. In this article, we propose three optimization techniques for the runtime complexity and one for the space complexity. First, diagonal pruning identifies regions of the error matrix E that need not to be computed since they cannot lead to a valid solution. Second, for those cells in E that are computed, we provide a heuristic to determine a better seed value, which in turn leads to a tighter lower bound for the potential split points to be considered for the calculation of the minimal error. Third, we show how the algorithm can be effectively parallelized. The space complexity is dominated by the split point matrix J, which needs to be kept till the end. To tackle this problem, we replace the split point matrix by a dynamic split point graph, which eliminates entries that are not needed to retrieve the optimal solution. A detailed experimental evaluation shows the effectiveness of the proposed solutions. Our optimization techniques significantly improve the runtime of state-of-the-art matrix implementations, and they guarantee a comparable performance of an implementation that uses the split point graph. The split point graph reduces the memory consumption up to two orders of magnitude and allows us to process large datasets for which the memory explodes if the matrix is used.
Abstract Many databases contain temporal, or time-referenced, data and use intervals to capture the temporal aspect. While SQL-based database management systems (DBMSs) are capable of supporting the management of interval data, the support they offer can be improved considerably. A range of proposed temporal data models and query languages offer ample evidence to this effect. Natural queries that are very difficult to formulate in SQL are easy to formulate in these temporal query languages. The increased focus on analytics over historical data where queries are generally more complex exacerbates the difficulties and thus the potential benefits of a temporal query language. Commercial DBMSs have recently started to offer limited temporal functionality in a step-by-step manner, focusing on the representation of intervals and neglecting the implementation of the query evaluation engine. This paper demonstrates how it is possible to extend the relational database engine to achieve a full-fledged, industrial-strength implementation of sequenced temporal queries, which intuitively are queries that are evaluated at each time point. Our approach reduces temporal queries to nontemporal queries over data with adjusted intervals, and it leaves the processing of nontemporal queries unaffected. Specifically, the approach hinges on three concepts: interval adjustment, timestamp propagation, and attribute scaling. Interval adjustment is enabled by introducing two new relational operators, a temporal normalizer and a temporal aligner, and the latter two concepts are enabled by the replication of timestamp attributes and the use of so-called scaling functions. By providing a set of reduction rules, we can transform any temporal query, expressed in terms of temporal relational operators, to a query expressed in terms of relational operators and the two new operators. We prove that the size of a transformed query is linear in the number of temporal operators in the original query. An integration of the new operators and the transformation rules, along with query optimization rules, into the kernel of PostgreSQL is reported. Empirical studies with the resulting temporal DBMS are covered that offer insights into pertinent design properties of the paper's proposal. The new system is available as open source software.
Abstract We develop an algorithm for efficiently joining relations on interval-based attributes with overlap predicates, which, for example, are commonly found in temporal databases. Using a new data structure and a lazy evaluation technique, we are able to achieve impressive performance gains by optimizing memory accesses exploiting features of modern CPU architectures. In an experimental evaluation with real-world datasets our algorithm is able to outperform the state-of-the-art by an order of magnitude.
Abstract Parsimonious temporal aggregation (PTA) has been introduced to overcome limitations of previous temporal aggregation operators, namely to provide a concise yet data sensitive summary of temporal data. The basic idea of PTA is to first compute instant temporal aggregation (ITA) as an intermediate result and then to merge similar adjacent tuples in order to reduce the final result size. The best known algorithm to compute a correct PTA result is based on dynamic programming (DP) and requires O(n^2) space to store a so-called split point matrix, where n is the size of the intermediate data. The matrix stores the split points between which the intermediate tuples are merged. In this paper, we propose two optimizations of the DP algorithm for PTA queries. The first optimization is termed diagonal pruning and identifies regions of the matrix that need not to be computed. This reduces the runtime complexity. The second optimization addresses the space complexity. We observed that only a subset of the elements in the split point matrix are actually needed. Therefore, we propose to replace the split point matrix by a so-called split point graph, which stores only those split points that are needed to restore the optimal PTA solution. This step reduces the memory consumption. An empirical evaluation shows the effectiveness of the two optimizations both in terms of runtime and memory consumption.
Abstract Each tuple in a valid-time relation includes an interval attribute T that represents the tuple's valid time. The overlap join between two valid-time relations determines all pairs of tuples with overlapping intervals. Although overlap joins are common, existing partitioning and indexing schemes are inefficient if the data includes long-lived tuples or if intervals intersect partition boundaries. We propose Overlap Interval Partitioning (OIP), a new partitioning approach for data with an interval. OIP divides the time range of a relation into k base granules and defines overlapping partitions for sequences of contiguous granules. OIP is the first partitioning method for interval data that gives a constant clustering guarantee: the difference in duration between the interval of a tuple and the interval of its partition is independent of the duration of the tuple's interval. We offer a detailed analysis of the average false hit ratio and the average number of partition accesses for queries with overlap predicates, and we prove that the average false hit ratio is independent of the number of short- and long-lived tuples. To compute the overlap join, we propose the Overlap Interval Partition Join (OIPJoin), which uses OIP to partition the input relations on-the-fly. Only the tuples from overlapping partitions have to be joined to compute the result. We analytically derive the optimal number of granules, k, for partitioning the two input relations, from the size of the data, the cost of CPU operations, and the cost of main memory or disk IOs. Our experiments confirm the analytical results and show that the OIPJoin outperforms state-of-the-art techniques for the overlap join.
Abstract Data with time intervals is prominently present in finance, accounting, medicine and many other application domains. When querying such data, it is important to perform operations on aligned intervals, i.e., data is processed together only for the common interval where it is valid in the real world. For instance, an employee contributed to a project only for the time period where both the project was running and the employee was employed by the company, i.e., the employee contributed to the project only over their aligned time interval. A temporal join is thus only evaluated over the aligned interval of an employee and a project. The problem of performing temporal operations, such as temporal aggregation or temporal joins, on data with time intervals using relational database systems can be attributed to the lack of primitives for the alignment of intervals. Even more challenges arise, when the data includes attribute values that are interval-dependent, such as project budgets or cumulative costs, and need to be scaled along with the alignment of intervals during processing. The goal of this thesis is to provide systematic and built-in support for querying data with intervals in relational database systems. The solution we propose uses two temporal primitives a temporal normalizer and a temporal aligner for the alignment of intervals. Temporal operators on interval data are defined by reduction rules that map a temporal operator to an operation with a temporal primitive followed by the corresponding traditional non-temporal operator that uses equality on aligned intervals. A key feature of our approach is that operators can access the original time intervals in predicates and functions, such as join conditions and aggregation functions, using timestamp propagation. Our approach, through timestamp propagation, supports the scaling of attribute values that are interval-dependent. When intervals are aligned during query processing, scaling can be performed at query time with the help of user-defined functions. This allows users to choose whether and how attribute values should be scaled. This is necessary since they may be interested in the total value in one query and the scaled value according to days or even working days in another query. We integrated our solution into the kernel of the open source database system PostgreSQL, which allows to leverage existing query optimization techniques and algorithms.
Abstract In valid-time databases with interval timestamping each tuple is associated with a time interval over which the recorded fact is true in the modeled reality. The adjustment of these intervals is an essential part of processing interval timestamped data. Some attribute values remain valid if the associated interval changes, whereas others have to be scaled along with the time interval. For example, attributes that record total (cumulative) quantities over time, such as project budgets, total sales or total costs, often must be scaled if the timestamp is adjusted. The goal of this demo is to show how to support the scaling of attribute values in SQL at query time.
Abstract In order to process interval timestamped data, the sequenced semantics has been proposed. This paper presents a relational algebra solution that provides native support for the three properties of the sequenced semantics: snapshot reducibility, extended snapshot reducibility, and change preservation. We introduce two temporal primitives, temporal splitter and temporal aligner, and define rules that use these primitives to reduce the operators of a temporal algebra to their nontemporal counterparts. Our solution supports the three properties of the sequenced semantics through interval adjustment and timestamp propagation. We have implemented the temporal primitives and reduction rules in the kernel of PostgreSQL to get native database support for processing interval timestamped data. The support is comprehensive and includes outer joins, antijoins, and aggregations with predicates and functions over the time intervals of argument relations. The implementation and empirical evaluation confirms effectiveness and scalability of our solution that leverages existing database query optimization techniques.