Anton Dignös  Publications
This page contains a list of research publications with abstracts and links to online versions. If you cannot access the file you are interested in, please, feel free to contact me. Also, have a look at my dblp and google scholar pages.

"DigitHist: a HistogramBased Data Summary with Tight Error Bounds", in Proceedings of the VLDB Endowment (PVLDB), 10(11): 15141525, August 2017. ,
We propose DigitHist, a histogram summary for selectivity estimation on multidimensional data with tight error bounds. By combining multidimensional and onedimensional histograms along regular grids of different resolutions, DigitHist provides an accurate and reliable histogram approach for multidimensional data. To achieve a compact summary, we use a sparse representation combined with a novel histogram compression technique that chooses a higher resolution in dense regions and a lower resolution elsewhere. For the construction of DigitHist, we propose a new error measure, termed uerror, which minimizes the width between the guaranteed upper and lower bounds of the selectivity estimate. The construction algorithm performs a single data scan and has linear time complexity. An indepth experimental evaluation shows that DigitHist delivers superior precision and error bounds than stateoftheart competitors at a comparable query time.
Paper  Poster  Slides 
"Sparse Prefix Sums", in Proceedings of the 21st European Conference on Advances In Databases and Information Systems (ADBIS), Nicosia, Cyprus, pp. 120135, September 2427, 2017. , Best paper award.
The prefix sum approach is a powerful technique to answer rangesum queries over multidimensional arrays in constant time by requiring only a few lookups in an array of precomputed prefix sums. In this paper, we propose the sparse prefix sum approach that is based on relative prefix sums and exploits sparsity in the data to vastly reduce the storage costs for the prefix sums. The proposed approach has desirable theoretical properties and works well in practice. It is the first approach achieving constant query time with sublinear update costs and storage costs for rangesum queries over sparse lowdimensional arrays. Experiments on realworld data sets show that the approach reduces storage costs by an order of magnitude with only a small overhead in query time, thus preserving microsecondfast query answering.
Paper 
"VISOR: Visualizing Summaries of Ordered Data", in Proceedings of the 29th International Conference on Scientific and Statistical Database Management (SSDBM), Demo track, Chicago, IL, USA, pp. 40:140:5, June 2729, 2017. ,
In this paper, we present the VISOR tool, which helps the user to explore data and their summary structures by visualizing the relationships between the size k of a data summary and the induced error. Given an ordered dataset, VISOR allows to vary the size k of a data summary and to immediately see the effect on the induced error, by visualizing the error and its dependency on k in an epsilongraph and deltagraph, respectively. The user can easily explore different values of k and determine the best value for the summary size. VISOR allows also to compare different summarization methods, such as piecewise constant approximation, piecewise aggregation approximation or Voptimal histograms. We show several demonstration scenarios, including how to determine an appropriate value for the summary size and comparing different summarization techniques.
Paper  Poster 
"Continuous Imputation of Missing Values in Streams of PatternDetermining Time Series", in Proceedings of the 20th International Conference on Extending Database Technology (EDBT), Venice, Italy, pp. 330341, March 21–24, 2017. ,
Time series data is ubiquitous but often incomplete, e.g., due to sensor failures and transmission errors. Since many applications require complete data, missing values must be imputed before further data processing is possible. We propose Topk Case Matching (TKCM) to impute missing values in streams of time series data. TKCM defines for each time series a set of reference time series and exploits similar historical situations in the reference time series for the imputation. A situation is characterized by the anchor point of a pattern that consists of l consecutive measurements over the reference time series. A missing value in a time series s is derived from the values of s at the anchor points of the k most similar patterns. We show that TKCM imputes missing values consistently if the reference time series patterndetermine time series s, i.e., the pattern of length l at time tn is repeated at least k times in the reference time series and the corresponding values of s at the anchor time points are similar to each other. In contrast to previous work, we support time series that are not linearly correlated but, e.g., phase shifted. TKCM is resilient to consecutively missing values, and the accuracy of the imputed values does not decrease if blocks of values are missing. The results of an exhaustive experimental evaluation using realworld and synthetic data shows that we outperform the stateoftheart solutions.
Paper  Poster  Slides 
"A scalable dynamic programming scheme for the computation of optimal ksegments for ordered data", in Information Systems, Volume 70, pp. 217, October 2017. ,
The optimal ksegments of an ordered dataset of size n consists of k tuples that are obtained by merging consecutive tuples such that a given error metric is minimized. The problem is general and has been studied in various flavors, e.g., piecewiseconstant approximation, parsimonious temporal aggregation, and voptimal histograms. A wellknown computation scheme for the optimal ksegments is based on dynamic programming, which computes a k * n error matrix E and a corresponding split point matrix J of the same size. This yields O(n * k) space and O(n^2 * k) runtime complexity. In this article, we propose three optimization techniques for the runtime complexity and one for the space complexity. First, diagonal pruning identifies regions of the error matrix E that need not to be computed since they cannot lead to a valid solution. Second, for those cells in E that are computed, we provide a heuristic to determine a better seed value, which in turn leads to a tighter lower bound for the potential split points to be considered for the calculation of the minimal error. Third, we show how the algorithm can be effectively parallelized. The space complexity is dominated by the split point matrix J, which needs to be kept till the end. To tackle this problem, we replace the split point matrix by a dynamic split point graph, which eliminates entries that are not needed to retrieve the optimal solution. A detailed experimental evaluation shows the effectiveness of the proposed solutions. Our optimization techniques significantly improve the runtime of stateoftheart matrix implementations, and they guarantee a comparable performance of an implementation that uses the split point graph. The split point graph reduces the memory consumption up to two orders of magnitude and allows us to process large datasets for which the memory explodes if the matrix is used.
Paper 
"Extending the kernel of a relational DBMS with comprehensive support for sequenced temporal queries", in ACM Transactions on Database Systems (TODS), 41(4), Article 26, 46 pages, November 2016. , Selected as part of the 21st Annual Best of Computing by ACM Computing Reviews.
Many databases contain temporal, or timereferenced, data and use intervals to capture the temporal aspect. While SQLbased database management systems (DBMSs) are capable of supporting the management of interval data, the support they offer can be improved considerably. A range of proposed temporal data models and query languages offer ample evidence to this effect. Natural queries that are very difficult to formulate in SQL are easy to formulate in these temporal query languages. The increased focus on analytics over historical data where queries are generally more complex exacerbates the difficulties and thus the potential benefits of a temporal query language. Commercial DBMSs have recently started to offer limited temporal functionality in a stepbystep manner, focusing on the representation of intervals and neglecting the implementation of the query evaluation engine. This paper demonstrates how it is possible to extend the relational database engine to achieve a fullfledged, industrialstrength implementation of sequenced temporal queries, which intuitively are queries that are evaluated at each time point. Our approach reduces temporal queries to nontemporal queries over data with adjusted intervals, and it leaves the processing of nontemporal queries unaffected. Specifically, the approach hinges on three concepts: interval adjustment, timestamp propagation, and attribute scaling. Interval adjustment is enabled by introducing two new relational operators, a temporal normalizer and a temporal aligner, and the latter two concepts are enabled by the replication of timestamp attributes and the use of socalled scaling functions. By providing a set of reduction rules, we can transform any temporal query, expressed in terms of temporal relational operators, to a query expressed in terms of relational operators and the two new operators. We prove that the size of a transformed query is linear in the number of temporal operators in the original query. An integration of the new operators and the transformation rules, along with query optimization rules, into the kernel of PostgreSQL is reported. Empirical studies with the resulting temporal DBMS are covered that offer insights into pertinent design properties of the paper's proposal. The new system is available as open source software.
Paper 
"An interval join optimized for modern hardware", in Proceedings of the 32nd IEEE International Conference on Data Engineering (ICDE), Helsinki, Finland, pp. 10981109, May 16–20, 2016. ,
We develop an algorithm for efficiently joining relations on intervalbased attributes with overlap predicates, which, for example, are commonly found in temporal databases. Using a new data structure and a lazy evaluation technique, we are able to achieve impressive performance gains by optimizing memory accesses exploiting features of modern CPU architectures. In an experimental evaluation with realworld datasets our algorithm is able to outperform the stateoftheart by an order of magnitude.
Paper  Poster  Slides 
"Efficient computation of parsimonious temporal aggregation", in Proceedings of the 19th EastEuropean Conference on Advances In Databases and Information Systems (ADBIS), Poitiers, France, pp. 320333, September 811, 2015. ,
Parsimonious temporal aggregation (PTA) has been introduced to overcome limitations of previous temporal aggregation operators, namely to provide a concise yet data sensitive summary of temporal data. The basic idea of PTA is to first compute instant temporal aggregation (ITA) as an intermediate result and then to merge similar adjacent tuples in order to reduce the final result size. The best known algorithm to compute a correct PTA result is based on dynamic programming (DP) and requires O(n^2) space to store a socalled split point matrix, where n is the size of the intermediate data. The matrix stores the split points between which the intermediate tuples are merged. In this paper, we propose two optimizations of the DP algorithm for PTA queries. The first optimization is termed diagonal pruning and identifies regions of the matrix that need not to be computed. This reduces the runtime complexity. The second optimization addresses the space complexity. We observed that only a subset of the elements in the split point matrix are actually needed. Therefore, we propose to replace the split point matrix by a socalled split point graph, which stores only those split points that are needed to restore the optimal PTA solution. This step reduces the memory consumption. An empirical evaluation shows the effectiveness of the two optimizations both in terms of runtime and memory consumption.
Paper  Slides 
"Overlap interval partition join", in Proceedings of the 2014 ACM SIGMOD International Conference on the Management of Data (SIGMOD), Snowbird, UT, USA, pp. 14591470, June 2227, 2014. ,
Each tuple in a validtime relation includes an interval attribute T that represents the tuple's valid time. The overlap join between two validtime relations determines all pairs of tuples with overlapping intervals. Although overlap joins are common, existing partitioning and indexing schemes are inefficient if the data includes longlived tuples or if intervals intersect partition boundaries. We propose Overlap Interval Partitioning (OIP), a new partitioning approach for data with an interval. OIP divides the time range of a relation into k base granules and defines overlapping partitions for sequences of contiguous granules. OIP is the first partitioning method for interval data that gives a constant clustering guarantee: the difference in duration between the interval of a tuple and the interval of its partition is independent of the duration of the tuple's interval. We offer a detailed analysis of the average false hit ratio and the average number of partition accesses for queries with overlap predicates, and we prove that the average false hit ratio is independent of the number of short and longlived tuples. To compute the overlap join, we propose the Overlap Interval Partition Join (OIPJoin), which uses OIP to partition the input relations onthefly. Only the tuples from overlapping partitions have to be joined to compute the result. We analytically derive the optimal number of granules, k, for partitioning the two input relations, from the size of the data, the cost of CPU operations, and the cost of main memory or disk IOs. Our experiments confirm the analytical results and show that the OIPJoin outperforms stateoftheart techniques for the overlap join.
Paper  Poster  Slides 
"Query time scaling of attribute values in interval timestamped databases", in Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE), Demo track, Brisbane, QLD, Australia, pp. 13041307, April 8–11, 2013. ,
In validtime databases with interval timestamping each tuple is associated with a time interval over which the recorded fact is true in the modeled reality. The adjustment of these intervals is an essential part of processing interval timestamped data. Some attribute values remain valid if the associated interval changes, whereas others have to be scaled along with the time interval. For example, attributes that record total (cumulative) quantities over time, such as project budgets, total sales or total costs, often must be scaled if the timestamp is adjusted. The goal of this demo is to show how to support the scaling of attribute values in SQL at query time.
Paper  Poster 
"Temporal alignment", in Proceedings of the 2012 ACM SIGMOD International Conference on the Management of Data (SIGMOD), Scottsdale, AZ, USA, pp. 433444, May 2024, 2012. ,
In order to process interval timestamped data, the sequenced semantics has been proposed. This paper presents a relational algebra solution that provides native support for the three properties of the sequenced semantics: snapshot reducibility, extended snapshot reducibility, and change preservation. We introduce two temporal primitives, temporal splitter and temporal aligner, and define rules that use these primitives to reduce the operators of a temporal algebra to their nontemporal counterparts. Our solution supports the three properties of the sequenced semantics through interval adjustment and timestamp propagation. We have implemented the temporal primitives and reduction rules in the kernel of PostgreSQL to get native database support for processing interval timestamped data. The support is comprehensive and includes outer joins, antijoins, and aggregations with predicates and functions over the time intervals of argument relations. The implementation and empirical evaluation confirms effectiveness and scalability of our solution that leverages existing database query optimization techniques.
Paper  Poster 1  Poster 2  Slides