This page contains a list of research publications with abstracts and links to online versions. If you cannot access a file you are interested in, please feel free to contact me. Also, have a look at my DBLP, Google Scholar, and ResearchGate pages.
◼ Journal ◼ Conference proceedings ◼ Workshop proceedings ◼ Editor ◼ Book or book chapter ◼ Other
2024
Abstract Self-tuning is a feature of autonomic databases that includes the problem of automatic schema design. It aims at providing an optimized schema that increases the overall database performance. While in relational databases automatic schema design focuses on the automated design of the physical schema, in NoSQL databases all levels of representation are considered: conceptual, logical, and physical. This is mainly because the latter are mostly schema-less and lack a standard schema design procedure as is the case for SQL databases. In this work, we carry out a systematic literature survey on automatic schema design in both SQL and NoSQL databases. We identify the levels of representation and the methods that are used for the schema design problem, and we present a novel taxonomy to classify and compare different schema design solutions. Our comprehensive analysis demonstrates that, despite substantial progress that has been made, schema design is still a developing field and considerable challenges need to be addressed, notably for NoSQL databases. We highlight the most important findings from the results of our analysis and identify areas for future research work.
Abstract Time series database systems (TSDBs) are prevalent in many applications ranging from monitoring and IoT devices to scientific research. Those systems are specifically designed to efficiently manage data indexed by time. Because of the variety of workloads, the diversity of time series features, and the sophistication of existing TSDBs, there is no clear way to pick the most suitable system. In this demo, we introduce SEER, an automated, configurable, and interactive toolkit to evaluate TSDBs. SEER is based on TSM-Bench, a benchmark tailored for time series database systems used in monitoring applications. It implements an end-to-end pipeline for database benchmarking from data generation and feature contamination to workload evaluation. Users can define their portfolio by configuring and parameterizing their own queries, specifying their frequencies, controlling the type and level of data features, and indicating the type of workloads. Moreover, they can deploy new systems and/or reconfigure the pre-installed ones. SEER would process users' requests and gracefully recommend the best system on a use-case basis.
Abstract Efficient and scalable processing of temporal anti-joins remains a significant research challenge in temporal databases. To address this issue, this paper introduces a novel temporal primitive designed for transforming a temporal anti-join, including conjunctive equality predicates on non-temporal attributes, into an equivalent algebraic expression involving a temporal inner join. The rationale behind this transformation is that the new expression can be decomposed into subtasks, allowing for parallel execution across multiple CPUs. Experimental results using real-world datasets demonstrate the superior efficiency and scalability of our solution for in-memory processing compared to existing solutions.
Abstract Time and temporal constraints are implicit in most databases. To facilitate data analysis and quality assessment, a database should provide explicit operations to identify the violation of temporal constraints. Against this background, the purpose of this paper is threefold: (1) we identify and provide a formal definition of five common anomalies in temporal databases, (2) we propose two new relational operations that allow, respectively, to label anomalous tuples in and to retrieve the anomalous tuples from a dataset, and (3) we provide three different SQL implementations of these operations for current relational database management systems. The healthcare domain is used to illustrate the usage and utility of the temporal anomalies. Finally, an experimental evaluation on real-world and synthetic data analyses the performance of the different implementations of the anomaly operators.
2023
Abstract Time series databases are essential for the large-scale deployment of many critical industrial applications. In infrastructure monitoring, for instance, a database system should be able to process large amounts of sensor data in real-time, execute continuous queries, and handle complex analytical queries such as anomaly detection or forecasting. Several benchmarks have been proposed to evaluate and understand how existing systems and design choices handle specific use cases and workloads. Unfortunately, none of them fully covers the peculiar requirements of monitoring applications. Furthermore, they fall short of providing an automated way to generate representative real-world data and workloads for testing and evaluating these systems. We present TSM-Bench, a benchmark tailored for time series database systems used in monitoring applications. Our key contributions consist of (1) representative queries that meet the requirements that we collected from a water monitoring use case, and (2) a new scalable data generator method based on Generative Adversarial Networks (GAN) and Locality Sensitive Hashing (LSH). We demonstrate, through an extensive set of experiments, how TSM-Bench provides a comprehensive evaluation of the performance of seven leading time series database systems while offering a detailed characterization of their capabilities and trade-offs.
Abstract In the 38th IEEE International Conference on Data Engineering (ICDE), 2022, panel discussions were introduced after paper presentations to facilitate in-depth exploration of research topics and encourage participation. These discussions, enriched by diverse perspectives from experts and active audience involvement, provided fresh insights and a broader understanding of each topic. The introduction of panel discussions exceeded expectations, attracting a larger number of participants to the virtual sessions. This article summarizes the virtual panels held during ICDE’22, focusing on sessions such as Data Mining and Knowledge Discovery, Federated Learning, Graph Data Management, Graph Neural Networks, Spatial and Temporal Data Management, and Spatial and Temporal Data Mining. By showcasing the success of panel discussions in generating inspiring discussions and promoting participation, this article aims to benefit the data engineering community, providing a valuable resource for researchers and suggesting a compelling format of holding research sessions for future conferences.
Abstract Temporal information plays a crucial role in many database applications; however, support for queries on such data is limited. We present an index structure, termed RD-index, to support range-duration queries over interval timestamped relations, which constrain both the range of the tuples’ positions on the timeline and their duration. RD-index is a grid structure in the two-dimensional space, representing the position on the timeline and the duration of timestamps, respectively. Instead of using a regular grid, we consider the data distribution for the construction of the grid in order to ensure that each grid cell contains approximately the same number of intervals. RD-index features provable bounds on the running time of all the operations, allows for a simple implementation, and supports very predictable query performance. We benchmark our solution on a variety of datasets and query workloads, investigating both the query rate and the behavior of the individual queries. The results show that RD-index performs better than the baselines on range-duration queries, for which it is explicitly designed. Furthermore, it also outperforms state-of-the-art indexes on mixed workloads containing queries that constrain either only the duration or the range along with range-duration queries. Finally, the size of the RD-index is in all settings smaller than that of the competitors.
Abstract The 27th European Conference on Advances in Databases and Information Systems (ADBIS) aims at providing a forum where researchers and practitioners in the fields of databases and information systems can interact, exchange ideas and disseminate their accomplishments and visions.
Abstract NoSQL data stores have been proposed to handle the different breed of scale and challenges caused by Big Data. While a suitable schema design is of vital importance in NoSQL databases, in contrast to relational databases no standard schema design procedure exists yet. Instead, manual schema design is applied by using often vague and generic rules of thumb, which must be adapted to each application. Additionally, many applications require the management and processing of temporal data, for which NoSQL databases lack explicit support. To overcome such limitations, in this paper we propose an MDA-approach for mapping an existing conceptual UML class extension with temporal features into a NoSQL wide-column store schema. Then, we evaluate the schemas generated by our approach with the Cassandra wide-column store.
Abstract This paper presents the extended temporal aligner as a temporal primitive, and proposes a set of reduction rules that employ this primitive to convert a temporal join operator to its non-temporal equivalent. The rules cover all types of temporal joins, including inner join, outer joins, and anti-join. Preliminary experimental results demonstrate that the integration of the extended temporal aligner and the reduction rules can efficiently process temporal join queries.
Abstract Powder Metallurgy (PM) is the branch of Metallurgy that deals with the design/production of near net-shaped sintered workpieces with different shapes and characteristics. The produced sintered workpieces are used in automotive, aviation, and aerospace industries, just to name a few. The quality of the produced sintered workpieces largely depends on powder compaction techniques and the accurate adjustments of process parameters. Currently, adjustments of these process parameters are done manually and thus resulting in laborious and time-intensive effort. To this end, this paper explores the use of Machine Learning (ML) in the compaction process and proposes an accurate and light-weight ML-based pipeline to estimate the quality characteristics of the produced workpieces in the powder metallurgy domain. More specifically, it presents a pipeline for workpiece’s mass and lengths estimation by exploiting some novel hand-crafted features and comparing well selected ML prediction models, namely, Random Forest (RF), Adaboost (ADA), and Gradient Boosting (GB). The chosen models are trained on a combination of features extracted from environmental and sensory raw data to estimate the mass and lengths of the next produced workpiece. We have implemented and evaluated our scheme on a dataset collected in a real production environment and we have found that GB is the most consistent and accurate one with the lowest Root Mean Squared Error (≈ 0.0886%). The results of an extensive experimentation have proven the relevance of the selected features and the accuracy of GB.
2022
Abstract Temporal databases have been an active research area for many decades, ranging from research work on query processing, most dominantly on selection and join queries, to new directions in models and semantics, such as temporal probabilistic or streaming data. At the same time, more database vendors have been integrating temporal features into their systems, most notably the temporal features of the SQL standard. In this paper, we summarize the latest research developments as presented in 30 research papers over the last five years in the context of temporal relational databases. Additionally, we describe the developments of industrial database systems and vendors.
Abstract Finding anomalies in temporal relational databases is a difficult and challenging task, in particular if data is integrated from different sources. The problem is especially pressing in healthcare information systems, where temporal anomalies can pinpoint critical events such as erroneous drug administration or prescription. In this paper, we define three different temporal anomalies, which we call temporal redundancy, contradiction, and incompleteness. We define two different operators for each of these anomalies: the retrieval operator to retrieve all tuples of a relation that cause anomalous behaviour, and the labelling operator to annotate a temporal relation with additional information that marks normal and anomalous tuples. Finally, we present and evaluate different implementation techniques for the two operators for relational database systems.
Abstract Matchings between objects from two datasets, domains, or ontologies have to be computed in various application scenarios. One often used meta-approach - which we call bipartite data matching - is to leverage domain knowledge for defining costs between the objects that should be matched, and to then use the classical Hungarian algorithm to compute a minimum cost bipartite matching. In this paper, we introduce and study the problem of enumerating K dissimilar minimum cost bipartite matchings. We formalize this problem, prove that it is NP-hard, and present heuristics based on greedy dynamic programming. The presented enumeration techniques are not only interesting in themselves, but also mitigate an often overlooked shortcoming of bipartite data matching, namely, that it is sensitive w.r.t. the storage order of the input data. Extensive experiments show that our enumeration heuristics clearly outperform existing algorithms in terms of dissimilarity of the obtained matchings, that they are effective at rendering bipartite data matching approaches more robust w.r.t. random storage order, and that they significantly improve the upper bounds of state-of-the-art algorithms for graph edit distance computation that are based on bipartite data matching.
Abstract Experience and evaluation research on sustainable products’ design is increasingly supported by eye-tracking tools. In particular, many studies have investigated the effect of gazing at or fixating on Areas of Interest on products’ evaluations, and in a number of cases, they have inferred the critical graphical elements leading to the preference of sustainable products. This paper is motivated by the lack of generalizability of the results of these studies, which have predominantly targeted specific products and Areas of Interest. In addition, it has also been overlooked that the observation of some Areas of Interest, despite not specifically targeting sustainable aspects, can lead consumers to prefer or appreciate sustainable products in any case. Furthermore, it has to be noted that sustainable products can be recognized based on their design (shape, material, lack of waste generated) and/or, more diffusedly, information clearly delivered on packaging and in advertising. With reference to the latter, this paper collected and classified Areas of Interest dealt with in past studies, markedly in eco-design and green consumption, and characterized by their potential generalizability. Specifically, the identified classes of Areas of Interest are not peculiar to specific products or economic sectors. These classes were further distinguished into “Content”, i.e., the quality aspect they intend to highlight, and “Form”, i.e., the graphical element used as a form of communication. This framework of Areas of Interest is the major contribution of the paper. Such a framework is needed to study regularities across multiple product categories in terms of how the observation of Areas of Interest leads to product appreciation and value perception. In addition, the potential significant differences between sustainable and commonplace products can be better investigated.
Abstract Joins are essential and potentially expensive operations in database management systems. When data is associated with time periods, joins commonly include predicates that require pairs of argument tuples to overlap in order to qualify for the result. Our goal is to enable built-in systems support for such joins. In particular, we present an approach where overlap joins are formulated as unions of range joins, which are more general purpose joins compared to overlap joins, i.e., are useful in their own right, and are supported well by B+-trees. The approach is sufficiently flexible that it also supports joins with additional equality predicates, as well as open, closed, and half-open time periods over discrete and continuous domains, thus offering both generality and simplicity, which is important in a system setting. We provide both a stand-alone solution that performs on par with the state-of-the-art and a DBMS embedded solution that is able to exploit standard indexing and clearly outperforms existing DBMS solutions that depend on specialized indexing techniques. We offer both analytical and empirical evaluations of the proposals. The empirical study includes comparisons with pertinent existing proposals and offers detailed insight into the performance characteristics of the proposals.
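The idea of expressing an overlap join as a union of range joins can be sketched in a few lines. The sketch below is only an illustration under stated assumptions, not the paper's implementation: it assumes non-empty half-open periods [start, end), uses sorted Python lists with binary search as a stand-in for a B+-tree, and the function name overlap_join is hypothetical. Two periods overlap exactly when either r.start lies in [s.start, s.end) or s.start lies in (r.start, r.end); the two cases are disjoint, so the union produces no duplicates.

```python
from bisect import bisect_left, bisect_right

def overlap_join(R, S):
    """Overlap join of two lists of non-empty half-open periods (start, end).

    Computed as the union of two range joins, each answered by a binary
    search on a list sorted by start point (a stand-in for a B+-tree).
    """
    r_sorted = sorted(R)
    s_sorted = sorted(S)
    r_starts = [r[0] for r in r_sorted]
    s_starts = [s[0] for s in s_sorted]
    result = []

    # Range join 1: s.start <= r.start < s.end
    for s in s_sorted:
        lo = bisect_left(r_starts, s[0])   # first r with r.start >= s.start
        hi = bisect_left(r_starts, s[1])   # first r with r.start >= s.end
        for r in r_sorted[lo:hi]:
            result.append((r, s))

    # Range join 2: r.start < s.start < r.end
    for r in r_sorted:
        lo = bisect_right(s_starts, r[0])  # first s with s.start > r.start
        hi = bisect_left(s_starts, r[1])   # first s with s.start >= r.end
        for s in s_sorted[lo:hi]:
            result.append((r, s))

    return result
```

Closed or open periods would only change which of the bisect_left and bisect_right calls are used at the boundaries.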
Abstract The interactive exploration of time series is an important task in data analysis. In this paper, we concentrate on the investigation of linear correlations between time series. Since the correlation of time series might change over time, we consider the analysis of all possible subsequences of two time series. Such an approach allows identifying, at different levels of window length, periods over which two time series correlate and periods over which they do not correlate. We provide a solution to compute the correlations over all window lengths in O(n^2) time, which is the size of the output and hence the best we can achieve. Furthermore, we propose a visualization of the result in the form of a heatmap, which provides a compact overview on the structure of the correlations amenable for a data analyst. An experimental evaluation shows that the tool is efficient enough to allow for interactive data exploration.
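One well-known way to reach quadratic time for all windows is to precompute running sums, so that the Pearson correlation of any window follows in constant time. The sketch below illustrates that idea only; it is an assumption that it resembles the paper's algorithm, and the function name all_window_correlations and the min_len parameter are hypothetical.

```python
from itertools import accumulate
from math import sqrt

def all_window_correlations(x, y, min_len=2):
    """Pearson correlation of x[i:j] and y[i:j] for every window [i, j).

    Running sums give each window in O(1), so the total cost is
    proportional to the number of windows, i.e. O(n^2).
    Windows where either series is constant map to None.
    """
    n = len(x)

    def prefix(values):
        # prefix(v)[k] is the sum of the first k values
        return [0.0] + list(accumulate(values))

    sx, sy = prefix(x), prefix(y)
    sxx = prefix(a * a for a in x)
    syy = prefix(b * b for b in y)
    sxy = prefix(a * b for a, b in zip(x, y))

    result = {}
    for i in range(n):
        for j in range(i + min_len, n + 1):
            m = j - i
            a = sx[j] - sx[i]
            b = sy[j] - sy[i]
            cov = m * (sxy[j] - sxy[i]) - a * b
            var_x = m * (sxx[j] - sxx[i]) - a * a
            var_y = m * (syy[j] - syy[i]) - b * b
            if var_x > 0 and var_y > 0:
                result[(i, j)] = cov / sqrt(var_x * var_y)
            else:
                result[(i, j)] = None
    return result
```

The resulting dictionary, keyed by window start and end, can be rendered directly as a triangular heatmap. For long series with large values, a numerically more stable formulation than raw running sums would be preferable.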
2021
Abstract We address the problem of compactly approximating multidimensional range counts with a guaranteed maximum error and propose a novel histogram-based summary structure, termed SliceHist. The key idea is to operate a grid histogram in an approximately rank-transformed space, where the data points are more uniformly distributed and each grid slice contains only a small number of points. Then, the points of each slice are summarised again using the same technique. As each query box partially intersects only few slices and each grid slice has few data points, the summary is able to achieve tight error guarantees. In experiments and through analysis of non-asymptotic formulas we show that SliceHist is not only competitive with existing heuristics in terms of performance, but additionally offers tight error guarantees.
Abstract Stock market events are hard to model. In recent years, one approach that has been receiving increasing attention is to analyze graphs induced by price correlations of different stock companies. By analyzing the structure of such graphs, it is possible to identify critical events, e.g., market crises. To the best of our knowledge, there are no tools available that offer comprehensive support for such analyses. This paper introduces a novel tool that offers in-depth analysis with the ability of fine tuning parameters with an intuitive user interface. With a proposed workflow to handle time series data, the tool becomes versatile and it can analyze correlation graphs of different semantics: minimum spanning tree, graphs with edge thresholds, and evolving graphs. It also provides a rich set of functions that enable users to explore easily, interactively and systematically the correlation graphs starting from a file of raw time series data. With real-world stock data, we demonstrate how straightforward yet effective it is to accomplish various analytical tasks with the proposed tool.
Paper | Video | Online demo
Abstract We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen's relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
2020
Abstract Temporal data is ubiquitous, and its importance has been witnessed by the research efforts for several decades as well as by the increased interest in the last years from both academia and industry. Two prominent research directions in this context are the field of temporal databases and the field of time series data. This extended abstract aims at providing a concise overview about the state of the art in processing temporal and time series data as well as to discuss open research problems and challenges.
2019
Abstract Recently, an increasing need for sophisticated multimedia analytics tools has been observed, which is triggered by a rapid growth of multimedia collections and by an increasing number of scientific fields embedding images in their studies. Although temporal data is ubiquitous and crucial in many applications, such tools typically do not support the analysis of data along the temporal dimension, especially for time periods. An appropriate visualization and comparison of period data associated with multimedia collections would help users to infer new information from such collections. In this paper, we present a novel multimedia analytics application for summarizing and analyzing temporal data from eye-tracking experiments. The application combines three different visual approaches: Time°diff, visual-information-seeking mantra, and multi-viewpoint. A qualitative evaluation with domain experts confirmed that our application helps decision makers to summarize and analyze multimedia collections containing period data.
Abstract Today, most commercial database systems provide some support for the management of temporal data, but the index support for efficiently accessing such data is rather limited. Existing access paths neglect the fact that time intervals are located on the timeline and have a duration, two important pieces of information for querying temporal data. In this paper, we tackle this problem and introduce a novel index structure, termed Period Index, for efficiently accessing temporal data based on these two pieces of information. The index supports temporal queries that constrain the position of an interval on the timeline (range queries), its interval duration (duration queries), or both (range-duration queries). The key idea of the new index is to split the timeline into fixed-length buckets, each of which is divided into a set of cells that are organized in levels. The cells encode the position of intervals on the timeline, whereas the levels encode their duration. This grid-based index is well-suited for parallelization and non-uniform memory access (NUMA) architectures as it is common for modern hardware with large main-memories and multi-core servers. The Period Index is independent of the physical order of the data and has predictable performance due to the underlying hashing approach. We also propose an enhanced version of our index structure, termed Period Index∗, which continuously adapts the optimal bucket length to the distribution of the data. Our experiments show that Period Index∗ significantly beats other indexes for the class of queries that constrain both the position and the length of the time intervals, and it is competitive for queries that involve solely one temporal dimension.
Abstract With the ever increasing amount and complexity of data, visual analysis becomes a fundamental tool to spot correlations and other relationships in data. Most of the previous techniques (e.g., scatter plots or heatmaps) focus on point data, i.e., data with point measures, such as prices or volumes. In this demo paper, we focus on data with interval measures, that is data where measures consist of an interval or range of values, such as price ranges or time intervals. We present a tool, termed HOTPERIODS, which allows to visualize correlations between two interval measures in the two-dimensional space, where the two measures represent a rectangle. To visualize such data, we first perform a rectangle aggregation. The result of this aggregation is a density matrix, where each cell stores the number of rectangles that cover the corresponding points in space. For the visualization of the density matrix, color-coding is used to represent different density values similar to heatmaps. We illustrate the usefulness of HOTPERIODS for the analysis of stock market data and tourism data, both of which show interval measures.
Paper | Poster | Online demo
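The rectangle aggregation described in the HOTPERIODS abstract can be illustrated with a small sketch. It assumes an integer grid and rectangles given as inclusive cell coordinates (x1, y1, x2, y2), and it uses a two-dimensional difference array followed by a prefix-sum pass; this is a standard technique chosen here for illustration, not necessarily what the tool does. The function name density_matrix is hypothetical.

```python
def density_matrix(rects, width, height):
    """Count, for every grid cell, how many rectangles cover it.

    rects: iterable of (x1, y1, x2, y2) with inclusive cell coordinates,
           0 <= x1 <= x2 < width and 0 <= y1 <= y2 < height.
    Returns a height x width matrix (list of rows).
    """
    # Difference array with one extra row and column for the "-1" markers.
    diff = [[0] * (width + 1) for _ in range(height + 1)]
    for x1, y1, x2, y2 in rects:
        diff[y1][x1] += 1
        diff[y1][x2 + 1] -= 1
        diff[y2 + 1][x1] -= 1
        diff[y2 + 1][x2 + 1] += 1

    # A 2D prefix sum over the difference array yields the densities.
    dens = [[0] * width for _ in range(height)]
    for y in range(height):
        running = 0
        for x in range(width):
            running += diff[y][x]
            dens[y][x] = running + (dens[y - 1][x] if y > 0 else 0)
    return dens
```

Each rectangle costs a constant number of updates regardless of its size, and the final pass touches every cell once; the matrix can then be color-coded like a heatmap.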
Abstract Snapshot semantics is widely used for evaluating queries over temporal data: temporal relations are seen as sequences of snapshot relations, and queries are evaluated at each snapshot. In this work, we demonstrate that current approaches for snapshot semantics over interval-timestamped multiset relations are subject to two bugs regarding snapshot aggregation and bag difference. We introduce a novel temporal data model based on K-relations that overcomes these bugs and prove it to correctly encode snapshot semantics. Furthermore, we present an efficient implementation of our model as a database middleware and demonstrate experimentally that our approach is competitive with native implementations.
Paper | Technical report (extended version) | Poster | Slides | Website and code | Reproducibility
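To make the notion of snapshot semantics concrete, the sketch below computes a temporal count over interval-timestamped tuples so that the result at every time point equals a plain count over the snapshot at that point. It is an illustration only and does not use the K-relation model of the paper: it assumes half-open integer intervals [start, end) inside a fixed time domain [t_min, t_max), and the function name snapshot_count is hypothetical. Note that periods where no tuple is valid are reported with count 0 rather than being omitted.

```python
from collections import defaultdict

def snapshot_count(intervals, t_min, t_max):
    """Temporal count under snapshot semantics.

    intervals: half-open periods (start, end) with t_min <= start < end <= t_max.
    Returns maximal periods (start, end, count) covering [t_min, t_max).
    """
    # +1 when a tuple becomes valid, -1 when it stops being valid.
    delta = defaultdict(int)
    for start, end in intervals:
        delta[start] += 1
        delta[end] -= 1

    points = sorted(set(delta) | {t_min, t_max})
    result = []
    count = 0
    for t, nxt in zip(points, points[1:]):
        count += delta.get(t, 0)
        if result and result[-1][2] == count:
            # Same count as the previous period: extend it (coalescing).
            result[-1] = (result[-1][0], nxt, count)
        else:
            result.append((t, nxt, count))
    return result
```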
Abstract In this paper, we study different ways of representing and querying fact data that is time-stamped with a time period in a data warehouse. The main focus is on how to represent the time periods that are associated with the facts in order to support convenient and efficient aggregations over time. We propose three distinct logical models that represent time periods, respectively, as sets of all time points in a period (instant model), as pairs of start and end time points of a period (period model), and as atomic units that are explicitly stored in a new period dimension (period* model). The period dimension is enriched with information about the days of each period, thereby combining the two former models. We use four different classes of aggregation queries to analyze query formulation, query execution, and query performance over the three models. An extensive empirical evaluation on synthetic and real-world datasets and the analysis of the query execution plans reveal that the period model is the best choice in terms of runtime and space for all four query classes.
Abstract We develop a highly efficient access method, called Delta-Top-Index, to answer top-k subsequence matching queries over a multi-dimensional time series data set. Compared to a naive implementation, our index has a storage cost that is up to two orders of magnitude smaller, while providing answers within microseconds. Additionally, we apply cache optimization techniques to speed up the construction of the index. Finally, we demonstrate the efficiency and effectiveness of our technique in an experimental evaluation with real-world data.
Abstract Prefix sums are a powerful technique to answer range-sum queries over multi-dimensional arrays in O(1) time by looking up a constant number of values in an array of size O(N), where N is the number of cells in the multi-dimensional array. However, the technique suffers from O(N) update and storage costs. Relative prefix sums address the high update costs by partitioning the array into blocks, thereby breaking the dependency between cells. In this paper, we present sparse prefix sums that exploit data sparsity to reduce the high storage costs of relative prefix sums. By building upon relative prefix sums, sparse prefix sums achieve the same update complexity as relative prefix sums. The authors of relative prefix sums erroneously claimed that the update complexity is O(sqrt(N)) for any number of dimensions. We show that this claim holds only for two dimensions, whereas the correct complexity for an arbitrary number of d dimensions is O(N^((d-1)/d)). To reduce the storage costs, the sparse prefix sums technique exploits sparsity in the data and avoids materializing prefix sums for empty rows and columns in the data grid; instead, look-up tables are used to preserve constant query time. Sparse prefix sums are the first approach to achieve O(1) query time with sub-linear storage costs for range-sum queries over sparse low-dimensional arrays. A thorough experimental evaluation shows that the approach works very well in practice. On the tested real-world data sets the storage costs are reduced by an order of magnitude with only a small overhead in query time, thus preserving microsecond-fast query answering.
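For readers unfamiliar with the baseline technique, the following sketch shows plain two-dimensional prefix sums, the starting point that relative and sparse prefix sums refine. It does not implement the blocking or the sparse storage of the paper; the function names build_prefix_sums and range_sum are hypothetical. A range-sum query needs four look-ups, while a single cell update would require rebuilding up to all O(N) precomputed values.

```python
def build_prefix_sums(grid):
    """Return P with P[i][j] = sum of grid[0..i-1][0..j-1].

    P has one extra leading row and column of zeros, which removes
    boundary checks from the query.
    """
    rows, cols = len(grid), len(grid[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = grid[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def range_sum(P, r1, c1, r2, c2):
    """Sum of grid[r1..r2][c1..c2] (inclusive bounds) using four look-ups."""
    return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]

if __name__ == "__main__":
    grid = [[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]
    P = build_prefix_sums(grid)
    print(range_sum(P, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```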
2018
Abstract Despite the ubiquity of temporal data and considerable research on processing such data, database systems largely remain designed for processing the current state of some modeled reality. More recently, we have seen an increasing interest in processing historical or temporal data. The SQL:2011 standard introduced some temporal features, and commercial database management systems have started to offer temporal functionalities in a step-by-step manner. There has also been a proposal for a more fundamental and comprehensive solution for sequenced temporal queries, which allows a tight integration into relational database systems, thereby taking advantage of existing query optimization and evaluation technologies. New challenges for processing temporal data arise with multiple dimensions of time and the increasing amounts of data, including time series data that represent a special kind of temporal data.
Abstract Despite the ubiquity of temporal data and considerable research on the effective and efficient processing of such data, database systems largely remain designed for processing the current state of some modeled reality. More recently, we have seen an increasing interest in the processing of temporal data that captures multiple states of reality. The SQL:2011 standard incorporates some temporal support, and commercial DBMSs have started to offer temporal functionality in a step-by-step manner, such as the representation of temporal intervals, temporal primary and foreign keys, and the support for so-called time-travel queries that enable access to past states. This tutorial gives an overview of state-of-the-art research results and technologies for storing, managing, and processing temporal data in relational database management systems. Following an introduction that offers a historical perspective, we provide an overview of basic temporal database concepts. Then we survey the state-of-the-art in temporal database research, followed by a coverage of the support for temporal data in the current SQL standard and the extent to which the temporal aspects of the standard are supported by existing systems. The tutorial ends by covering a recently proposed framework that provides comprehensive support for processing temporal data and that has been implemented in PostgreSQL.
Abstract Temporal data, and in particular time periods, are crucial to many applications in different sectors, such as industry, medicine, insurance, finance, tourism, and management. Such applications often consult historical information in order to compare and optimize processes. Generally, the time periods in this data represent the period of validity in the real world, such as the period of a specific assignment, but may also represent the periods when the data was stored, i.e., believed to be true. Inferring new information from this data is eased by visualizing and comparing its different time periods. In this paper, we present Time°diff, a novel visualization approach based on timebar charts, which is suitable for comparing data with time periods and enabling decision makers to easily analyze information containing period data.
2017
Abstract We propose DigitHist, a histogram summary for selectivity estimation on multidimensional data with tight error bounds. By combining multidimensional and one-dimensional histograms along regular grids of different resolutions, DigitHist provides an accurate and reliable histogram approach for multidimensional data. To achieve a compact summary, we use a sparse representation combined with a novel histogram compression technique that chooses a higher resolution in dense regions and a lower resolution elsewhere. For the construction of DigitHist, we propose a new error measure, termed u-error, which minimizes the width between the guaranteed upper and lower bounds of the selectivity estimate. The construction algorithm performs a single data scan and has linear time complexity. An in-depth experimental evaluation shows that DigitHist delivers better precision and tighter error bounds than state-of-the-art competitors at a comparable query time.
Abstract The prefix sum approach is a powerful technique to answer range-sum queries over multi-dimensional arrays in constant time by requiring only a few look-ups in an array of precomputed prefix sums. In this paper, we propose the sparse prefix sum approach that is based on relative prefix sums and exploits sparsity in the data to vastly reduce the storage costs for the prefix sums. The proposed approach has desirable theoretical properties and works well in practice. It is the first approach achieving constant query time with sub-linear update costs and storage costs for range-sum queries over sparse low-dimensional arrays. Experiments on real-world data sets show that the approach reduces storage costs by an order of magnitude with only a small overhead in query time, thus preserving microsecond-fast query answering.
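To illustrate the classic technique that this abstract builds on, here is a minimal sketch of a dense two-dimensional prefix-sum array and a constant-time range-sum query. It is not the sparse approach proposed in the paper; the function names `build_prefix_sums` and `range_sum` and the sample grid are assumptions made for illustration.

```python
def build_prefix_sums(grid):
    """Return P with P[i][j] = sum of grid[0..i-1][0..j-1].

    P has one extra row and column of zeros so queries need no bounds checks.
    """
    rows, cols = len(grid), len(grid[0])
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            prefix[i + 1][j + 1] = (grid[i][j] + prefix[i][j + 1]
                                    + prefix[i + 1][j] - prefix[i][j])
    return prefix


def range_sum(prefix, r1, c1, r2, c2):
    """Sum of grid[r1..r2][c1..c2] (inclusive) using four look-ups."""
    return (prefix[r2 + 1][c2 + 1] - prefix[r1][c2 + 1]
            - prefix[r2 + 1][c1] + prefix[r1][c1])


# Hypothetical sample data.
grid = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
prefix = build_prefix_sums(grid)
print(range_sum(prefix, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

The dense array stores one prefix sum per cell, and a single cell update can invalidate many of them. These storage and update costs are what the abstract's sparse variant targets.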
Abstract In this paper, we present the VISOR tool, which helps the user to explore data and their summary structures by visualizing the relationships between the size k of a data summary and the induced error. Given an ordered dataset, VISOR allows the user to vary the size k of a data summary and to immediately see the effect on the induced error, by visualizing the error and its dependency on k in an epsilon-graph and delta-graph, respectively. The user can easily explore different values of k and determine the best value for the summary size. VISOR also allows the user to compare different summarization methods, such as piecewise constant approximation, piecewise aggregation approximation or V-optimal histograms. We show several demonstration scenarios, including how to determine an appropriate value for the summary size and how to compare different summarization techniques.
Abstract Time series data is ubiquitous but often incomplete, e.g., due to sensor failures and transmission errors. Since many applications require complete data, missing values must be imputed before further data processing is possible. We propose Top-k Case Matching (TKCM) to impute missing values in streams of time series data. TKCM defines for each time series a set of reference time series and exploits similar historical situations in the reference time series for the imputation. A situation is characterized by the anchor point of a pattern that consists of l consecutive measurements over the reference time series. A missing value in a time series s is derived from the values of s at the anchor points of the k most similar patterns. We show that TKCM imputes missing values consistently if the reference time series pattern-determine time series s, i.e., the pattern of length l at time tn is repeated at least k times in the reference time series and the corresponding values of s at the anchor time points are similar to each other. In contrast to previous work, we support time series that are not linearly correlated but, e.g., phase shifted. TKCM is resilient to consecutively missing values, and the accuracy of the imputed values does not decrease if blocks of values are missing. The results of an exhaustive experimental evaluation using real-world and synthetic data show that we outperform the state-of-the-art solutions.
Abstract The optimal k-segments of an ordered dataset of size n consists of k tuples that are obtained by merging consecutive tuples such that a given error metric is minimized. The problem is general and has been studied in various flavors, e.g., piecewise-constant approximation, parsimonious temporal aggregation, and v-optimal histograms. A well-known computation scheme for the optimal k-segments is based on dynamic programming, which computes a k * n error matrix E and a corresponding split point matrix J of the same size. This yields O(n * k) space and O(n^2 * k) runtime complexity. In this article, we propose three optimization techniques for the runtime complexity and one for the space complexity. First, diagonal pruning identifies regions of the error matrix E that need not be computed since they cannot lead to a valid solution. Second, for those cells in E that are computed, we provide a heuristic to determine a better seed value, which in turn leads to a tighter lower bound for the potential split points to be considered for the calculation of the minimal error. Third, we show how the algorithm can be effectively parallelized. The space complexity is dominated by the split point matrix J, which needs to be kept till the end. To tackle this problem, we replace the split point matrix by a dynamic split point graph, which eliminates entries that are not needed to retrieve the optimal solution. A detailed experimental evaluation shows the effectiveness of the proposed solutions. Our optimization techniques significantly improve the runtime of state-of-the-art matrix implementations, and they guarantee a comparable performance of an implementation that uses the split point graph. The split point graph reduces the memory consumption by up to two orders of magnitude and allows us to process large datasets for which the memory explodes if the matrix is used.
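The baseline dynamic programming scheme mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes the sum of squared errors around the segment mean as the error metric; it implements none of the paper's optimizations (diagonal pruning, seed heuristic, parallelization, split point graph). The function name `optimal_k_segments` and the sample data are hypothetical, and the matrices carry one extra row and column for the empty prefix.

```python
def optimal_k_segments(values, k):
    """Split `values` into k contiguous segments minimizing the total SSE.

    Returns (total_error, segments), where each segment is a
    (start, end) index pair with `end` exclusive.
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums of values and squares give the SSE of any segment in O(1).
    s = [0.0] * (n + 1)
    sq = [0.0] * (n + 1)
    for i, v in enumerate(values):
        s[i + 1] = s[i] + v
        sq[i + 1] = sq[i] + v * v

    def sse(i, j):
        """SSE of values[i:j] around its mean."""
        total = s[j] - s[i]
        return (sq[j] - sq[i]) - total * total / (j - i)

    inf = float("inf")
    # E[m][j]: minimal error of covering the first j values with m segments.
    # J[m][j]: start index of the last segment in that optimal solution.
    E = [[inf] * (n + 1) for _ in range(k + 1)]
    J = [[0] * (n + 1) for _ in range(k + 1)]
    E[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):  # candidate split points
                err = E[m - 1][i] + sse(i, j)
                if err < E[m][j]:
                    E[m][j], J[m][j] = err, i

    # Walk the split points backwards to recover the segments.
    segments, j = [], n
    for m in range(k, 0, -1):
        i = J[m][j]
        segments.append((i, j))
        j = i
    return E[k][n], segments[::-1]


# Hypothetical sample data with three obvious plateaus.
error, segments = optimal_k_segments([1, 1, 1, 5, 5, 9, 9, 9], 3)
print(error, segments)
```

The three nested loops give the O(n^2 * k) runtime, and the two matrices give the O(n * k) space stated in the abstract. The backward walk at the end is the reason the split point matrix has to be kept until the computation finishes.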
2016
Abstract Many databases contain temporal, or time-referenced, data and use intervals to capture the temporal aspect. While SQL-based database management systems (DBMSs) are capable of supporting the management of interval data, the support they offer can be improved considerably. A range of proposed temporal data models and query languages offer ample evidence to this effect. Natural queries that are very difficult to formulate in SQL are easy to formulate in these temporal query languages. The increased focus on analytics over historical data where queries are generally more complex exacerbates the difficulties and thus the potential benefits of a temporal query language. Commercial DBMSs have recently started to offer limited temporal functionality in a step-by-step manner, focusing on the representation of intervals and neglecting the implementation of the query evaluation engine. This paper demonstrates how it is possible to extend the relational database engine to achieve a full-fledged, industrial-strength implementation of sequenced temporal queries, which intuitively are queries that are evaluated at each time point. Our approach reduces temporal queries to nontemporal queries over data with adjusted intervals, and it leaves the processing of nontemporal queries unaffected. Specifically, the approach hinges on three concepts: interval adjustment, timestamp propagation, and attribute scaling. Interval adjustment is enabled by introducing two new relational operators, a temporal normalizer and a temporal aligner, and the latter two concepts are enabled by the replication of timestamp attributes and the use of so-called scaling functions. By providing a set of reduction rules, we can transform any temporal query, expressed in terms of temporal relational operators, to a query expressed in terms of relational operators and the two new operators. We prove that the size of a transformed query is linear in the number of temporal operators in the original query. 
An integration of the new operators and the transformation rules, along with query optimization rules, into the kernel of PostgreSQL is reported. Empirical studies with the resulting temporal DBMS are covered that offer insights into pertinent design properties of the paper's proposal. The new system is available as open source software.
Abstract We develop an algorithm for efficiently joining relations on interval-based attributes with overlap predicates, which, for example, are commonly found in temporal databases. Using a new data structure and a lazy evaluation technique, we are able to achieve impressive performance gains by optimizing memory accesses exploiting features of modern CPU architectures. In an experimental evaluation with real-world datasets our algorithm is able to outperform the state-of-the-art by an order of magnitude.
2015
Abstract Parsimonious temporal aggregation (PTA) has been introduced to overcome limitations of previous temporal aggregation operators, namely to provide a concise yet data sensitive summary of temporal data. The basic idea of PTA is to first compute instant temporal aggregation (ITA) as an intermediate result and then to merge similar adjacent tuples in order to reduce the final result size. The best known algorithm to compute a correct PTA result is based on dynamic programming (DP) and requires O(n^2) space to store a so-called split point matrix, where n is the size of the intermediate data. The matrix stores the split points between which the intermediate tuples are merged. In this paper, we propose two optimizations of the DP algorithm for PTA queries. The first optimization is termed diagonal pruning and identifies regions of the matrix that need not be computed. This reduces the runtime complexity. The second optimization addresses the space complexity. We observed that only a subset of the elements in the split point matrix is actually needed. Therefore, we propose to replace the split point matrix by a so-called split point graph, which stores only those split points that are needed to restore the optimal PTA solution. This step reduces the memory consumption. An empirical evaluation shows the effectiveness of the two optimizations both in terms of runtime and memory consumption.
2014
Abstract Each tuple in a valid-time relation includes an interval attribute T that represents the tuple's valid time. The overlap join between two valid-time relations determines all pairs of tuples with overlapping intervals. Although overlap joins are common, existing partitioning and indexing schemes are inefficient if the data includes long-lived tuples or if intervals intersect partition boundaries. We propose Overlap Interval Partitioning (OIP), a new partitioning approach for data with an interval. OIP divides the time range of a relation into k base granules and defines overlapping partitions for sequences of contiguous granules. OIP is the first partitioning method for interval data that gives a constant clustering guarantee: the difference in duration between the interval of a tuple and the interval of its partition is independent of the duration of the tuple's interval. We offer a detailed analysis of the average false hit ratio and the average number of partition accesses for queries with overlap predicates, and we prove that the average false hit ratio is independent of the number of short- and long-lived tuples. To compute the overlap join, we propose the Overlap Interval Partition Join (OIPJoin), which uses OIP to partition the input relations on-the-fly. Only the tuples from overlapping partitions have to be joined to compute the result. We analytically derive the optimal number of granules, k, for partitioning the two input relations, from the size of the data, the cost of CPU operations, and the cost of main memory or disk IOs. Our experiments confirm the analytical results and show that the OIPJoin outperforms state-of-the-art techniques for the overlap join.
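To make the overlap join defined in this abstract concrete, here is a minimal nested-loop baseline. It is not OIP or OIPJoin and does no partitioning at all. The sketch assumes half-open intervals `[start, end)`; the `Fact` type and the sample relations are hypothetical.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """A tuple of a valid-time relation with a half-open interval [start, end)."""
    name: str
    start: int
    end: int


def overlaps(a, b):
    """True if the half-open intervals of a and b share at least one time point."""
    return a.start < b.end and b.start < a.end


def overlap_join(r, s):
    """Naive overlap join: compare every pair of tuples, O(|r| * |s|)."""
    return [(x, y) for x in r for y in s if overlaps(x, y)]


# Hypothetical sample relations.
employees = [Fact("ann", 1, 10), Fact("bob", 12, 20)]
projects = [Fact("p1", 5, 8), Fact("p2", 9, 15), Fact("p3", 20, 25)]

for emp, proj in overlap_join(employees, projects):
    print(emp.name, proj.name)
```

The baseline compares every pair of tuples regardless of how far apart their intervals are. Partitioning schemes such as the one in the abstract aim to avoid most of these comparisons.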
Abstract Data with time intervals is prominently present in finance, accounting, medicine and many other application domains. When querying such data, it is important to perform operations on aligned intervals, i.e., data is processed together only for the common interval where it is valid in the real world. For instance, an employee contributed to a project only for the time period where both the project was running and the employee was employed by the company, i.e., the employee contributed to the project only over their aligned time interval. A temporal join is thus only evaluated over the aligned interval of an employee and a project. The problem of performing temporal operations, such as temporal aggregation or temporal joins, on data with time intervals using relational database systems can be attributed to the lack of primitives for the alignment of intervals. Even more challenges arise when the data includes attribute values that are interval-dependent, such as project budgets or cumulative costs, and need to be scaled along with the alignment of intervals during processing. The goal of this thesis is to provide systematic and built-in support for querying data with intervals in relational database systems. The solution we propose uses two temporal primitives, a temporal normalizer and a temporal aligner, for the alignment of intervals. Temporal operators on interval data are defined by reduction rules that map a temporal operator to an operation with a temporal primitive followed by the corresponding traditional non-temporal operator that uses equality on aligned intervals. A key feature of our approach is that operators can access the original time intervals in predicates and functions, such as join conditions and aggregation functions, using timestamp propagation. Our approach, through timestamp propagation, supports the scaling of attribute values that are interval-dependent. 
When intervals are aligned during query processing, scaling can be performed at query time with the help of user-defined functions. This allows users to choose whether and how attribute values should be scaled. This is necessary since they may be interested in the total value in one query and the scaled value according to days or even working days in another query. We integrated our solution into the kernel of the open source database system PostgreSQL, which allows us to leverage existing query optimization techniques and algorithms.
2013
Abstract In valid-time databases with interval timestamping each tuple is associated with a time interval over which the recorded fact is true in the modeled reality. The adjustment of these intervals is an essential part of processing interval timestamped data. Some attribute values remain valid if the associated interval changes, whereas others have to be scaled along with the time interval. For example, attributes that record total (cumulative) quantities over time, such as project budgets, total sales or total costs, often must be scaled if the timestamp is adjusted. The goal of this demo is to show how to support the scaling of attribute values in SQL at query time.
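The idea of scaling a cumulative attribute when its interval is adjusted can be sketched as follows. The demo itself targets SQL; this Python sketch only illustrates one possible scaling function, namely scaling proportionally to the number of days. The function name `scale_by_days`, the half-open interval convention, and the sample budget are assumptions.

```python
from datetime import date


def scale_by_days(value, original, adjusted):
    """Scale a cumulative `value` from `original` to the `adjusted` sub-interval.

    Both intervals are half-open (start, end) pairs of dates, and the value is
    assumed to be spread uniformly over the days of the original interval.
    """
    orig_start, orig_end = original
    adj_start, adj_end = adjusted
    if not (orig_start <= adj_start < adj_end <= orig_end):
        raise ValueError("adjusted interval must lie within the original one")
    return value * (adj_end - adj_start).days / (orig_end - orig_start).days


# Hypothetical example: a budget of 3000 recorded for 30 days,
# adjusted to the first 10 days of that period.
budget = scale_by_days(
    3000,
    (date(2013, 1, 1), date(2013, 1, 31)),
    (date(2013, 1, 1), date(2013, 1, 11)),
)
print(budget)  # 1000.0
```

Other scaling functions, for example one that counts only working days, would have the same signature but weight the days differently.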
2012
Abstract In order to process interval timestamped data, the sequenced semantics has been proposed. This paper presents a relational algebra solution that provides native support for the three properties of the sequenced semantics: snapshot reducibility, extended snapshot reducibility, and change preservation. We introduce two temporal primitives, temporal splitter and temporal aligner, and define rules that use these primitives to reduce the operators of a temporal algebra to their nontemporal counterparts. Our solution supports the three properties of the sequenced semantics through interval adjustment and timestamp propagation. We have implemented the temporal primitives and reduction rules in the kernel of PostgreSQL to get native database support for processing interval timestamped data. The support is comprehensive and includes outer joins, antijoins, and aggregations with predicates and functions over the time intervals of argument relations. The implementation and empirical evaluation confirm the effectiveness and scalability of our solution that leverages existing database query optimization techniques.
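As a rough illustration of what interval adjustment can look like, the sketch below splits one interval at every start or end point of a set of other intervals that falls strictly inside it. This is a simplified reading for illustration only and is not the paper's definition of the temporal splitter or aligner. The function name `split_interval`, the half-open interval convention, and the sample values are assumptions.

```python
def split_interval(interval, others):
    """Split half-open `interval` at each boundary of `others` strictly inside it.

    Returns the resulting sub-intervals in ascending order.
    """
    start, end = interval
    # Collect the distinct boundary points that fall strictly inside the interval.
    cuts = sorted({p for other in others for p in other if start < p < end})
    points = [start, *cuts, end]
    return list(zip(points, points[1:]))


# Hypothetical example: [1, 10) split by the boundaries of [3, 6) and [8, 12).
print(split_interval((1, 10), [(3, 6), (8, 12)]))
# [(1, 3), (3, 6), (6, 8), (8, 10)]
```

After such a split, two adjusted intervals are either equal or disjoint. This is the property that lets the abstract's reduction rules fall back on nontemporal operators.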