|
Project Description
Context
The coDBz proposal shares the spirit of the Piazza
system [Halevy et
al., 2003; Tatarinov and
Halevy, 2004]. The
vision of the Piazza peer data management system (PDMS) project is to
provide semantic mediation between an environment of autonomous and
heterogeneous peers, each with its own schema. Rather than requiring
the use of a single, uniform, centralised mediated schema to share
data between peers, Piazza allows peers to define schema mappings
between pairs of peers (or among small subsets of peers). In turn,
transitive relationships among the schemas of the peers are exploited
so the entire resources of the PDMS can be used. The original Piazza
system is limited in the fact that it does not allow complex mapping
rules (i.e., schema mappings must be safe rules with atomic heads), it
does not allow for fully cyclic mapping rules, and it does not allow
for dynamic networks (i.e., networks where peers may join or leave
anytime).
Comparison with DHT
The problem of a PDMS with autonomous and heterogeneous peers
exemplified by Piazza is different from the structured P2P
systems, such as [Stoica et
al., 2001; Ratnasamy et
al., 2001]. The peers
in structured P2P systems form an overlay network that has some
superimposed logical structure, such as a ring in Chord, a
d-dimensional coordinate space in CAN or a Skip Graph data structure.
This logical overlay structure among the peers can be utilised to
efficiently route a query. This is achieved by means of distributed
hash tables (DHT) mapping data and queries to the logical structure of
the network. For example, in Chord and Skip Graphs the queries can be
routed in O(log n) overlay hops in a network of n peers. This can be
achieved only if the schemas of the peers are (almost) homogeneous and
known in advance, so that the DHT can be effectively computed. This is
not the case of the scenario of a PDMS with autonomous and
heterogeneous peers, where peers may join the network by autonomously
deciding how to map their own heterogeneous schema with other schemas
of arbitrarily chosen acquaintances, each one possibly with a
different schema. As a consequence, it is evident that in each
communication step between two peers the data is transformed according
to the specific semantic schema mapping relating them; therefore, when
some remote data is required to answer a query, the data has to flow
through all the intermediate peers in order to be interpreted
correctly by appropriate transformations.
Similar Approaches
Together with the work presented
in [Halevy et
al., 2003; Tatarinov and
Halevy, 2004], other
researchers investigated the theoretical underpinnings of peer
database management systems. The work presented in
[Calvanese et
al., 2004] proposes a logical analysis of the theory
behind a PDMS, but it lacks a distributed algorithm: it assumes that
nodes may exchange both data and mappings, so that only the
query node will eventually evaluate the query answer in one go -
there is no distributed computation and the network may be flooded
with data. The work presented
in [Bernstein et
al., 2002; Serafini et
al., 2003] proposes
a very general theoretical framework for PDMS, with expressive schema
mapping languages (up to first order logic) and constraint languages
(up to first order logic) applied to single peers. However, no
computational characterisation is given. The
paper [Serafini and
Ghidini, 2000] describes a local algorithm to
compute query answers in a P2P network, but it allows only safe schema
mapping rules with atomic heads. The algorithm is exponential in the
number of nodes and it floods the network with messages during query
evaluation if the network contains cycles. None of the above PDMS
approaches supports dynamic networks: in the case of peers joining or
leaving the network during the computation, neither the termination of
the query answering algorithm nor the properties of the possible query
answer are guaranteed.
The coDBz Proposal
Considering the general ideas sketched above, the basic ideas of the
coDBz PDMS were presented in the
paper [Franconi et
al., 2003], which introduces a general logical
and computational characterisation of P2P networks of autonomous and
heterogeneous databases, interconnected by means of schema mapping
rules between pairs of peers. This paper defines a precise
model-theoretic semantics of a PDMS (fully compatible with Piazza and
the other PDMS framework presented above), it characterises the
general computational properties for the problem of answering queries
to a PDMS, and it presents tight complexity bounds and basic
distributed procedures for important special cases. The
paper [Franconi et
al., 2004d] analyses a distributed procedure
for the problem of local database update in a network of database
peers. The problem of local database update is different from the
problem of query answering. Given a PDMS, the answer to a local query
may involve data that is distributed over the network, and this may
requiring the participation of many the nodes at query time. On the
other hand, given a PDMS, a ``batch'' update algorithm will be such
that all the nodes consistently and optimally propagate all the
relevant data to their neighbours, allowing for subsequent local
queries to be answered locally within a node, without fetching data
from other nodes at query time. The update problem has been considered
important by the P2P literature; most notably, recent papers focused
on the importance of data exchange and materialisation for a stable
P2P network [Fagin et
al., 2003; Daswani et
al., 2003]. The
papers [Franconi et
al., 2004e; Franconi et
al., 2004b,] introduce a basic distributed algorithm for
query answering in a PDMS, together with a prototypical
implementation in the JXTA framework. The proposed algorithm is
polynomial in data complexity, but it is still exponential in the
dimension of the network. These papers consider a network of
databases, possibly with different schemas, interconnected by means of
mapping rules having conjunctive queries both in the body and in the
head, with possibly existential variables both in the body and in the
head (called GLAV rules) as first suggested by [Calvanese et
al., 2004]. Each
node can be queried with a conjunctive query over its schema, for data
which the node can possibly fetch from its neighbours using
appropriate mapping rules. Unrestricted cyclic topologies of the
network are allowed. The proposed PDMS framework is robust in the
sense that it supports dynamic networks: even if nodes and
mapping rules appear or disappear during the computation, the proposed
algorithm will eventually terminate with a provably sound and complete
result.
The paper [Franconi et
al., 2004a] extends the results
presented above, by introducing and evaluating experimentally a fully
distributed query processing algorithm for a PDMS, which is
polynomial both in data complexity and in the dimension of
the network. The algorithm handles dynamic networks, and it works
without any starting assumption about the topology of the
network. The paper presents also an experimental evaluation of the
algorithm on some real and random cases.
|
Further Comparisons
Some work on distributed data replication uses similar techniques,
such as ``lazy replication'' and ``epidemic
algorithms'' [Holliday et
al., 2003], but these are not directly applicable to the kind of PDMS
considered by Piazza and by the other references presented above,
since they rely on a different semantics of the mappings. Similarly,
the routing indexes technology presented
by [Crespo and
Garcia-Molina, 2002] can not be applied to the kind of
PDMS with cyclic mappings and dynamic networks, since the presence of
cycles in the network together with the a-priori ignorance of the
(dynamic) network topology invalidates any careful selection of
neighbours providing answers. On the positive side, we are
considering to integrate into our PDMS framework the idea of
data mappings as described
in [Kementsietsidis et
al.2003b,Kementsietsidis et
al.2003a], which introduces
a table that maintains mappings to the neighbour's data, to mimic a
sort of extensional constraint.
Another line of research that is necessary to compare with the PDMS
framework proposed here, is the standard classical logic-based data
integration technology, which has been summarised in a very clear way
by [Lenzerini, 2002]; successful examples of classical
logic-based data integration technology are the Information
Manifold [Kirk et
al.1995] and
Tsimmis [Garcia-Molina et
al.1997]. The main difference is in
the role of the schema mapping rules between nodes: in a PDMS a schema
mapping rule is intended for data migration and transformation between
neighbours, as opposed to the role of global logical constraints in
classical data integration systems. It can be proved (see,
e.g., [Franconi et
al., 2003]) that by adopting a PDMS
semantics the complexity of query answering is reduced from
exponential (or undecidable for GLAV) down to polynomial.
-
- Bernstein et
al., 2002
P. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini,
and I. Zaihrayeu.
Data management for peer-to-peer computing: A vision.
In Workshop on the Web and Databases, WebDB 2002, 2002.
- Calvanese et
al., 2004
Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati.
Logical foundations of peer-to-peer data integration.
In Proc. of the 23rd ACM SIGACT SIGMOD SIGART Sym. on Principles
of Database Systems (PODS-2004), 2004.
- Crespo and
Garcia-Molina, 2002
Arturo Crespo and Hector Garcia-Molina.
Routing indices for peer-to-peer systems.
In Proceedings of the 22 nd International Conference on
Distributed Computing Systems (ICDCS'02), page 23. IEEE Computer Society,
2002.
- Daswani et
al., 2003
Neil Daswani, Hector Garcia-Molina, and Beverly Yang.
Open problems in data-sharing peer-to-peer systems.
In ICDT 2003, 2003.
- Fagin et
al., 2003
Ronald Fagin, Phokion G. Kolaitis, R. J. Miller, and Lucian Popa.
Data exchange: Semantics and query answering.
In Proceedings of the 9th International Conference on Database
Theory, pages 207-224. Springer-Verlag, 2003.
- Franconi et
al., 2003 (download)
Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, and Luciano Serafini.
A robust logical and computational characterisation of peer-to-peer
database systems.
In Proceedings of the VLDB International Workshop on Databases,
Information Systems and Peer-to-Peer Computing (DBISP2P'03), 2003.
- Franconi et
al., 2004a (download)
Enrico Franconi, Gabriel Kuper, and Andrei Lopatenko.
Efficient query processing in p2p dynamic networks of autonomous and
heterogeneous databases.
Technical report, Free University of Bozen-Bolzano, Italy, 2004.
- Franconi et
al., 2004b (download)
Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, and Ilya Zaihrayeu.
The coDB robust peer-to-peer database system.
In Proc. of the 2nd Workshop on Semantics in Peer-to-Peer and
Grid Computing (SemPGrid'04), 2004.
- Franconi et
al., 2004c
Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, and Ilya Zaihrayeu.
The coDB robust peer-to-peer database system.
In Proc. of the 11th Italian Database Conference (SEBD'2004),
2004.
- Franconi et
al., 2004d (download)
Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, and Ilya Zaihrayeu.
A distributed algorithm for robust data sharing and updates in p2p
database networks.
In Proceedings of the EDBT International Workshop on
Peer-to-peer Computing and Databases (P2P&DB'04), March 2004.
- Franconi et
al., 2004e (download)
Enrico Franconi, Gabriel Kuper, Andrei Lopatenko, and Ilya Zaihrayeu.
Queries and updates in the coDB peer to peer database system.
In Proc. of the 30th International Conference on Very Large Data
Bases (VLDB'04), 2004.
- Garcia-Molina et
al.1997
Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman,
Yehoshua Sagiv, Jeffrey D. Ullman, Vasilis Vassalos, and Jennifer Widom.
The TSIMMIS approach to mediation: Data models and languages.
Journal of Intelligent Information Systems, 8(2):117-132,
1997.
- Halevy et
al., 2003
Alon Halevy, Zachary Ives, Dan Suciu, and Igor Tatarinov.
Schema mediation in peer data management systems.
In Proceedings of the 19th International Conference on Data
Engineering (ICDE'03), 2003.
- Holliday et
al., 2003
JoAnne Holliday, Robert C. Steinke, Divyakant Agrawal, and Amr El Abbadi.
Epidemic algorithms for replicated databases.
IEEE Trans. Knowl. Data Eng., 15(5):1218-1238, 2003.
- Kementsietsidis et
al.2003a
-
Anastasios Kementsietsidis, Marcelo Arenas, and Renée Miller.
Managing data mappings in the hyperion project.
In Proceedings of the 19th International Conference on Data
Engineering (ICDE'03), 2003.
- Kementsietsidis et
al.2003b
-
Anastasios Kementsietsidis, Marcelo Arenas, and Renee Miller.
Mapping data in peer-to-peer systems: Semantics and algorithmic
issues.
In Proceedings of the SIGMOD International Conference on
Management of Data (SIGMOD'03), 2003.
- Kirk et
al.1995
T. Kirk, A. Y. Levy, Y. Sagiv, and D. Srivastava.
The Information Manifold.
In C. Knoblock and A. Levy, editors, Proceedings of the AAAI
1995 Spring Symp. on Information Gathering from Heterogeneous, Distributed
Environments, Stanford University, Stanford, California, 1995.
- Lenzerini, 2002
Maurizio Lenzerini.
Data integration: a theoretical perspective.
In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART
symposium on Principles of database systems, pages 233-246. ACM Press,
2002.
- Ratnasamy et
al., 2001
S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker.
A scalable content addressable network.
In Proceedings of the ACM SIGCOMM '01 Conference, San
Diego, California, August 2001.
- Serafini and
Ghidini, 2000
Luciano Serafini and Chiara Ghidini.
Using wrapper agents to answer queries in distributed information
systems.
In Proceedings of the First Biennial Int. Conf. on Advances in
Information Systems (ADVIS-2000), 2000.
- Serafini et
al., 2003
Luciano Serafini, Fausto Giunchiglia, John Mylopoulos, and Philip A. Bernstein.
Local relational model: A logical formalization of database
coordination.
In CONTEXT 2003, pages 286-299, 2003.
- Stoica et
al., 2001
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari
Balakrishnan.
Chord: A scalable peer-to-peer lookup service for internet
applications.
In Proceedings of the ACM SIGCOMM '01 Conference, San
Diego, California, August 2001.
- Tatarinov and
Halevy, 2004
Igor Tatarinov and Alon Halevy.
Efficient query reformulation in peer data management systems.
In Proceedings of the SIGMOD International Conference on
Management of Data (SIGMOD'04), 2004.
|