coDBz
a Peer Database Management System

Free University of Bozen-Bolzano, Italy


   

Project Description

Context

The coDBz proposal shares the spirit of the Piazza system [Halevy et al., 2003; Tatarinov and Halevy, 2004]. The vision of the Piazza peer data management system (PDMS) project is to provide semantic mediation between an environment of autonomous and heterogeneous peers, each with its own schema. Rather than requiring the use of a single, uniform, centralised mediated schema to share data between peers, Piazza allows peers to define schema mappings between pairs of peers (or among small subsets of peers). In turn, transitive relationships among the schemas of the peers are exploited so the entire resources of the PDMS can be used. The original Piazza system is limited in the fact that it does not allow complex mapping rules (i.e., schema mappings must be safe rules with atomic heads), it does not allow for fully cyclic mapping rules, and it does not allow for dynamic networks (i.e., networks where peers may join or leave anytime).

Comparison with DHT

The problem of a PDMS with autonomous and heterogeneous peers exemplified by Piazza is different from the structured P2P systems, such as [Stoica et al., 2001; Ratnasamy et al., 2001]. The peers in structured P2P systems form an overlay network that has some superimposed logical structure, such as a ring in Chord, a d-dimensional coordinate space in CAN or a Skip Graph data structure. This logical overlay structure among the peers can be utilised to efficiently route a query. This is achieved by means of distributed hash tables (DHT) mapping data and queries to the logical structure of the network. For example, in Chord and Skip Graphs the queries can be routed in O(log n) overlay hops in a network of n peers. This can be achieved only if the schemas of the peers are (almost) homogeneous and known in advance, so that the DHT can be effectively computed. This is not the case of the scenario of a PDMS with autonomous and heterogeneous peers, where peers may join the network by autonomously deciding how to map their own heterogeneous schema with other schemas of arbitrarily chosen acquaintances, each one possibly with a different schema. As a consequence, it is evident that in each communication step between two peers the data is transformed according to the specific semantic schema mapping relating them; therefore, when some remote data is required to answer a query, the data has to flow through all the intermediate peers in order to be interpreted correctly by appropriate transformations.

Similar Approaches

Together with the work presented in [Halevy et al., 2003; Tatarinov and Halevy, 2004], other researchers investigated the theoretical underpinnings of peer database management systems. The work presented in [Calvanese et al., 2004] proposes a logical analysis of the theory behind a PDMS, but it lacks a distributed algorithm: it assumes that nodes may exchange both data and mappings, so that only the query node will eventually evaluate the query answer in one go - there is no distributed computation and the network may be flooded with data. The work presented in [Bernstein et al., 2002; Serafini et al., 2003] proposes a very general theoretical framework for PDMS, with expressive schema mapping languages (up to first order logic) and constraint languages (up to first order logic) applied to single peers. However, no computational characterisation is given. The paper [Serafini and Ghidini, 2000] describes a local algorithm to compute query answers in a P2P network, but it allows only safe schema mapping rules with atomic heads. The algorithm is exponential in the number of nodes and it floods the network with messages during query evaluation if the network contains cycles. None of the above PDMS approaches supports dynamic networks: in the case of peers joining or leaving the network during the computation, neither the termination of the query answering algorithm nor the properties of the possible query answer are guaranteed.

The coDBz Proposal

Considering the general ideas sketched above, the basic ideas of the coDBz PDMS were presented in the paper [Franconi et al., 2003], which introduces a general logical and computational characterisation of P2P networks of autonomous and heterogeneous databases, interconnected by means of schema mapping rules between pairs of peers. This paper defines a precise model-theoretic semantics of a PDMS (fully compatible with Piazza and the other PDMS framework presented above), it characterises the general computational properties for the problem of answering queries to a PDMS, and it presents tight complexity bounds and basic distributed procedures for important special cases. The paper [Franconi et al., 2004d] analyses a distributed procedure for the problem of local database update in a network of database peers. The problem of local database update is different from the problem of query answering. Given a PDMS, the answer to a local query may involve data that is distributed over the network, and this may requiring the participation of many the nodes at query time. On the other hand, given a PDMS, a ``batch'' update algorithm will be such that all the nodes consistently and optimally propagate all the relevant data to their neighbours, allowing for subsequent local queries to be answered locally within a node, without fetching data from other nodes at query time. The update problem has been considered important by the P2P literature; most notably, recent papers focused on the importance of data exchange and materialisation for a stable P2P network [Fagin et al., 2003; Daswani et al., 2003]. The papers [Franconi et al., 2004e; Franconi et al., 2004b,] introduce a basic distributed algorithm for query answering in a PDMS, together with a prototypical implementation in the JXTA framework. The proposed algorithm is polynomial in data complexity, but it is still exponential in the dimension of the network. These papers consider a network of databases, possibly with different schemas, interconnected by means of mapping rules having conjunctive queries both in the body and in the head, with possibly existential variables both in the body and in the head (called GLAV rules) as first suggested by [Calvanese et al., 2004]. Each node can be queried with a conjunctive query over its schema, for data which the node can possibly fetch from its neighbours using appropriate mapping rules. Unrestricted cyclic topologies of the network are allowed. The proposed PDMS framework is robust in the sense that it supports dynamic networks: even if nodes and mapping rules appear or disappear during the computation, the proposed algorithm will eventually terminate with a provably sound and complete result. The paper [Franconi et al., 2004a] extends the results presented above, by introducing and evaluating experimentally a fully distributed query processing algorithm for a PDMS, which is polynomial both in data complexity and in the dimension of the network. The algorithm handles dynamic networks, and it works without any starting assumption about the topology of the network. The paper presents also an experimental evaluation of the algorithm on some real and random cases.

Further Comparisons

Some work on distributed data replication uses similar techniques, such as ``lazy replication'' and ``epidemic algorithms'' [Holliday et al., 2003], but these are not directly applicable to the kind of PDMS considered by Piazza and by the other references presented above, since they rely on a different semantics of the mappings. Similarly, the routing indexes technology presented by [Crespo and Garcia-Molina, 2002] can not be applied to the kind of PDMS with cyclic mappings and dynamic networks, since the presence of cycles in the network together with the a-priori ignorance of the (dynamic) network topology invalidates any careful selection of neighbours providing answers. On the positive side, we are considering to integrate into our PDMS framework the idea of data mappings as described in [Kementsietsidis et al.2003b,Kementsietsidis et al.2003a], which introduces a table that maintains mappings to the neighbour's data, to mimic a sort of extensional constraint.
Another line of research that is necessary to compare with the PDMS framework proposed here, is the standard classical logic-based data integration technology, which has been summarised in a very clear way by [Lenzerini, 2002]; successful examples of classical logic-based data integration technology are the Information Manifold [Kirk et al.1995] and Tsimmis [Garcia-Molina et al.1997]. The main difference is in the role of the schema mapping rules between nodes: in a PDMS a schema mapping rule is intended for data migration and transformation between neighbours, as opposed to the role of global logical constraints in classical data integration systems. It can be proved (see, e.g., [Franconi et al., 2003]) that by adopting a PDMS semantics the complexity of query answering is reduced from exponential (or undecidable for GLAV) down to polynomial.

Bibliography

  • Bernstein et al., 2002

    P. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, and I. Zaihrayeu.
    Data management for peer-to-peer computing: A vision.
    In Workshop on the Web and Databases, WebDB 2002, 2002.

  • Calvanese et al., 2004

    Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati.
    Logical foundations of peer-to-peer data integration.
    In Proc. of the 23rd ACM SIGACT SIGMOD SIGART Sym. on Principles of Database Systems (PODS-2004), 2004.

  • Crespo and Garcia-Molina, 2002

    Arturo Crespo and Hector Garcia-Molina.
    Routing indices for peer-to-peer systems.
    In Proceedings of the 22 nd International Conference on Distributed Computing Systems (ICDCS'02), page 23. IEEE Computer Society, 2002.

  • Daswani et al., 2003

    Neil Daswani, Hector Garcia-Molina, and Beverly Yang.
    Open problems in data-sharing peer-to-peer systems.
    In ICDT 2003, 2003.

  • Fagin et al., 2003

    Ronald Fagin, Phokion G. Kolaitis, R. J. Miller, and Lucian Popa.
    Data exchange: Semantics and query answering.
    In Proceedings of the 9th International Conference on Database Theory, pages 207-224. Springer-Verlag, 2003.

  • Franconi et al., 2003 (download)

    Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, and Luciano Serafini.
    A robust logical and computational characterisation of peer-to-peer database systems.
    In Proceedings of the VLDB International Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P'03), 2003.

  • Franconi et al., 2004a (download)

    Enrico Franconi, Gabriel Kuper, and Andrei Lopatenko.
    Efficient query processing in p2p dynamic networks of autonomous and heterogeneous databases.
    Technical report, Free University of Bozen-Bolzano, Italy, 2004.

  • Franconi et al., 2004b (download)

    Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, and Ilya Zaihrayeu.
    The coDB robust peer-to-peer database system.
    In Proc. of the 2nd Workshop on Semantics in Peer-to-Peer and Grid Computing (SemPGrid'04), 2004.

  • Franconi et al., 2004c

    Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, and Ilya Zaihrayeu.
    The coDB robust peer-to-peer database system.
    In Proc. of the 11th Italian Database Conference (SEBD'2004), 2004.

  • Franconi et al., 2004d (download)

    Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, and Ilya Zaihrayeu.
    A distributed algorithm for robust data sharing and updates in p2p database networks.
    In Proceedings of the EDBT International Workshop on Peer-to-peer Computing and Databases (P2P&DB'04), March 2004.

  • Franconi et al., 2004e (download)

    Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, and Ilya Zaihrayeu.
    Queries and updates in the coDB peer to peer database system.
    In Proc. of the 30th International Conference on Very Large Data Bases (VLDB'04), 2004.

  • Garcia-Molina et al.1997

    Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, Vasilis Vassalos, and Jennifer Widom.
    The TSIMMIS approach to mediation: Data models and languages.
    Journal of Intelligent Information Systems, 8(2):117-132, 1997.

  • Halevy et al., 2003

    Alon Halevy, Zachary Ives, Dan Suciu, and Igor Tatarinov.
    Schema mediation in peer data management systems.
    In Proceedings of the 19th International Conference on Data Engineering (ICDE'03), 2003.

  • Holliday et al., 2003

    JoAnne Holliday, Robert C. Steinke, Divyakant Agrawal, and Amr El Abbadi.
    Epidemic algorithms for replicated databases.
    IEEE Trans. Knowl. Data Eng., 15(5):1218-1238, 2003.

  • Kementsietsidis et al.2003a
    Anastasios Kementsietsidis, Marcelo Arenas, and Renée Miller.
    Managing data mappings in the hyperion project.
    In Proceedings of the 19th International Conference on Data Engineering (ICDE'03), 2003.

  • Kementsietsidis et al.2003b
    Anastasios Kementsietsidis, Marcelo Arenas, and Renee Miller.
    Mapping data in peer-to-peer systems: Semantics and algorithmic issues.
    In Proceedings of the SIGMOD International Conference on Management of Data (SIGMOD'03), 2003.

  • Kirk et al.1995

    T. Kirk, A. Y. Levy, Y. Sagiv, and D. Srivastava.
    The Information Manifold.
    In C. Knoblock and A. Levy, editors, Proceedings of the AAAI 1995 Spring Symp. on Information Gathering from Heterogeneous, Distributed Environments, Stanford University, Stanford, California, 1995.

  • Lenzerini, 2002

    Maurizio Lenzerini.
    Data integration: a theoretical perspective.
    In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 233-246. ACM Press, 2002.

  • Ratnasamy et al., 2001

    S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker.
    A scalable content addressable network.
    In Proceedings of the ACM SIGCOMM '01 Conference, San Diego, California, August 2001.

  • Serafini and Ghidini, 2000

    Luciano Serafini and Chiara Ghidini.
    Using wrapper agents to answer queries in distributed information systems.
    In Proceedings of the First Biennial Int. Conf. on Advances in Information Systems (ADVIS-2000), 2000.

  • Serafini et al., 2003

    Luciano Serafini, Fausto Giunchiglia, John Mylopoulos, and Philip A. Bernstein.
    Local relational model: A logical formalization of database coordination.
    In CONTEXT 2003, pages 286-299, 2003.

  • Stoica et al., 2001

    Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan.
    Chord: A scalable peer-to-peer lookup service for internet applications.
    In Proceedings of the ACM SIGCOMM '01 Conference, San Diego, California, August 2001.

  • Tatarinov and Halevy, 2004

    Igor Tatarinov and Alon Halevy.
    Efficient query reformulation in peer data management systems.
    In Proceedings of the SIGMOD International Conference on Management of Data (SIGMOD'04), 2004.


   
Enrico Franconi