The goal of data integration is to provide a uniform access to a set of autonomous and possibly heterogeneous data sources in a particular application domain. This is typically what we need when, for instance, querying the deep web that is composed of a plethora of databases accessible through Web forms. We would like to be able with a single query to find relevant data no matter which database provides it.
A first issue for data integration (that will be ignored here) is social: The owners of some data set may be unwilling to fully share it and be reluctant to participate in a data integration system. Also, from a technical viewpoint, the difficulty comes from the lack of interoperability between the data sources, that may use a variety of formats, specific query processing capabilities, different protocols. However, the real bottleneck for data integration is logical. It comes from the so-called semantic heterogeneity between the data sources. They typically organize data using different schemas even in the same application domain. For instance, each university or educational institution may choose to model students and teaching programs in its own way. A French university may use the social security number to identify students and the attributes NOM, PRENOM, whereas the Erasmus database about European students may use a European student number and the attributes fiRSTNAME, LASTNAME and HOME UNIVERSITY.
In this chapter, we study data integration in the mediator approach. In this approach, data remain exclusively in data sources and are obtained when the system is queried. One sometimes use the term virtual data integration. This is in contrast to a warehousing approach where the data is extracted from the data sources ahead of query time, transformed, and loaded in the warehouse. At query time, the warehouse is accessed but not the data sources. Warehouses approaches are typically preferred for very complex queries, e.g., for data mining. On the other hand, to have access to “fresh” information, a mediator approach is preferred since it avoids having to propagate in real time, data source updates to the warehouse. Figure 9.1 illustrates these two approaches of data integration.
In the mediator approach, one starts by designing a global schema (also called mediated schema) that serves as a unique entry point on which global queries are posed by users. A main issue is then to specify the relationships, namely semantic mappings, between the schemas of the data sources and the global schema. Based on these mappings, one can answer queries over the global schema using queries over the data sources. Typically, query answering in the mediator approach is performed as follows. First, independently of the data in the sources, the user’s query posed over the global schema is transformed into local queries that refer to the schemas of the data sources. A global query combines the data provided by sources. Queries are optimized and transformed into query plans. The local query plans are executed and their results combined by the global query plan.
In the following, for presentation purposes, we consider that the global schema and the schemas of the data sources to integrate are all relational. In practice, each non-relational data source (e.g., XML or HTML) is abstracted as a relational database with the help of a wrapper. Wrappers are small programs that translate local relational queries into appropriate requests understood by specific data sources, and transform their results into relations. The role of wrappers is to allow the mediator to see each data source as relational, no matter which actual format it uses.
Let us consider in more detail the specification of semantic mappings between the data sources and the global schema. Let S1, ..., Sn be the local schemas of n pre-existing data sources. To simplify the presentation, let us assume that each local schema Si is made of a single relation that we denote also Si. The relations S1, ..., Sn are called the local relations. Suppose the global schema G consists of the global relations G1, ..., Gm. The goal is to specify semantic relations between the local relations Si and the global relations Gj. The Gj are logically (intentionally) defined by the Si.
An example of simple relationship (not very interesting) is based on equality, e.g., G1 = S1. One can find more complicated relationships, e.g., G2 = S1 ∪S2 or G3 = S1⋈S3. In these last two examples, a global relation is defined as a query over the local relations. In other words, the global relation is a view of the local relations. Indeed, one typically prefers more flexible constraints such as G3 ⊇S1⋈S3. Using containment instead of equality leaves open the possibility for other sources of providing data about G3, e.g., G3 ⊇σA=”yes”(S4). Because global relations are constrained by views of the local relations, one uses the term global-as-view for such specifications.
In a somewhat dual manner, one can use local-as-view constraints such as: S4 ⊆G1⋈G3. This leaves even more flexibility since the contribution of each data source can be specified (e.g., by its owner) independently of the other sources of the system. This kind of autonomy is typically well-adapted to a Web setting.
More generally, to express semantic mappings between {S1,...,Sn} and {G1,...,Gm}, one can use inclusion statements, i.e., logical constraints, of the form v(S1,...,Sn) ⊆v′(G1,...,Gm), where v and v′ are query expressions called views. All the constraints we consider in this chapter will be of this general form. Now, given an instance I of {S1,...,Sn} (i.e., an instance of the data sources), we don’t know the instance J of the global schema. But we know that:
also equivalently denoted
where each Vi is a view over the local schemas, i.e., a query built on local relations.
In our development, we will consider conjunctive queries. Using negation in queries greatly complicates the issues. In the next section, we recall some standard material on containment of conjunctive queries, i.e., of the queries at the heart of our formal development. In Sections 9.3 and 9.4, we study GAV and LAV mediators, respectively. For each of these languages, we describe appropriate query rewriting algorithms. In Section 9.5, we show the impact on query rewriting of adding DL-LITE constraints in the global schema. Finally, in Section 9.6, we lay the basis of a peer-to-peer approach for data integration. In contrast with the mediator approach which offers a unique entry point to data, peer-to-peer data management systems (PDMS for short) are decentralized data integration systems.
In this section, we recall some basic notions on comparing conjunctive queries that we will use in the following.
We recall that a conjunctive query is an expression of the form:
Given an instance I of the relations appearing in the body of the query, an answer is a tuple ⟨ν(x1),...,ν(xn)⟩ for some valuation ν of the variables in the query, such that for each i, Ai(ν(ui)) holds in I. We denote q(I) the set of answers.
We sometimes denote this query q(x1,...xn) when its body is understood. Observe that the interpretation of such a conjunctive query in logical terms is:
The data integration techniques rely on conjunctive query containment. This problem has been extensively studied because it is at the core of query optimization. We use known techniques that we recall next.
A query q1 is contained in q2, denoted q1 ⊆q2, if for each I, q1(I) ⊆q2(I). It is known that the containment between a conjunctive query q1 and a conjunctive query q2 can be tested by finding a “homomorphism” from q2 to q1.
Definition 9.2.1 Let q1(x1,...,xn) and q2(y1,...,yn) be two conjunctive queries. A (conjunctive query) homomorphism from q2 to q1 is a mapping ψ from the variables of q2 to the variables of q1 such that:
Example 9.2.2 Consider the following queries:
Consider a mapping ψ such that ψ(yi) = xi for each i, ψ(y′1) = x′1 and ψ(y′3) = x3. Then the required conditions hold, and it follows that q1 ⊆q2. Intuitively, q2 joins A1 and A2 on the second attribute, whereas q1 also joins on the third one. The additional condition induces the containment.
The following proposition states that the existence of a homomorphism is a necessary and sufficient condition for query containment.
Proposition 9.2.3 (Homomorphism theorem) Let q1 and q2 be two conjunctive queries. Then q1 is contained in q2 if and only if there exists a homomorphism from q2 to q1.
This provides a simple algorithm for testing conjunctive query containment. In the general case, deciding whether a conjunctive query is contained in another one is NP-complete in the size of the two queries. In fact, in many practical cases, there are polynomial-time algorithms for query containment.
Algorithm 4 checks whether a query q1 is contained in a query q2.
Input: Two conjunctive queries:
q1(x) :- g1(x1),…,gn(xn)
q2(y) :- h1(y1),…,hm(ym)
Output: Yes if q1 ⊆q2; no otherwise
(1)freeze q1: construct a canonical instance Dcan = {gi(ν(xi))∣1 ≤i≤n}
(2) for some valuation ν mapping each variable in q1
(4)if ν(x) ∈q2(Dcan) return yes
(5)else return no.
Example 9.2.4 Consider the queries of Example 9.2.2. The canonical instance Dcan is A1(a,b,c),A2(a′,b,c). It is easily verified that q2(Dcan) = (a,a′), which is ν(x,x′).
The main advantage of GAV is its conceptual and algorithmic simplicity. The global schema is simply defined using views over the data sources and specifies how to obtain tuples of the global relation Gi from tuples in the sources.
Definition 9.3.1 (GAV mapping) A GAV mapping is an expression of the form: R(x1,...,xn) ⊇ q(x1,...,xn), where q(x1,...,xn) :- A1(u1),...,Ak(uk) is a conjunctive query of the same arity as R. The semantics of this mapping is:
We write alternatively this GAV mapping as:
Example 9.3.2 Consider the following four data sources:
S1.Catalogue(nomUniv, programme).
S2.Erasmus(student, course, univ).
S3.CampusFr(student, program, university).
S4.Mundus(program,course)
Now, suppose we define a global schema with the following unary and binary relations:
MasterStudent(studentName), | University(uniName), |
MasterProgram(title), | MasterCourse(code), |
EnrolledIn(studentName,title), | RegisteredTo(studentName, uniName). |
These relations are defined in terms of the local relations by the following GAV mappings:
MasterStudent(N)⊇S2.Erasmus(N,C,U), S4.Mundus(P,C) |
MasterStudent(N)⊇S3.CampusFr(N,P,U), S4.Mundus(P,C) |
University(U)⊇S1.Catalogue(U,P) |
University(U)⊇S2.Erasmus(N,C,U) |
University(U)⊇S3.CampusFr(N,P,U) |
MasterProgram(T)⊇S4.Mundus(T,C) |
MasterCourse(C)⊇S4.Mundus(T,C) |
EnrolledIn(N,T)⊇S2.Erasmus(N,C,U), S4.Mundus(T,C) |
EnrolledIn(N,T)⊇S3.CampusFr(N,T,U), S4.Mundus(T,C) |
RegisteredTo(N,U)⊇S3.CampusFr(N,T,U) |
Note that in a warehousing approach, one would simply evaluate all the queries that define the global view, and populate the warehouse using standard relational query evaluation. In a mediator approach, we try to only derive data that is relevant to a specific query posed on the global view by a user. We show how to rewrite a global query into queries over the local relations and combine their results. This is achieved by a technical trick that consists in unfolding the atoms of the global query.
Observe the first two mappings. They specify (using joins) how to obtain tuples of the unary relation MasterStudent. Now consider the following global query asking for universities with registered master students:
The rewriting of this query into source queries is obtained by unfolding, i.e., by replacing each atom which can be matched with the head of some view, by the body of the corresponding view. (For readers familiar with logic programming, this is some very simple form of resolution.)
In the example, there is a single mapping whose head can be matched with RegisteredTo(s,x), and two mappings that match MasterStudent(s). Thus, we obtain the following two unfoldings:
Observe that q2 can be simplified. Replacing the conjunction of its first two atoms by the single atom S3.CampusFr(s,v6,x) leads to an equivalent query. We thus obtain the following two GAV rewritings of the initial query:
The result is obtained by computing r1∪r2. Now, observe that each rℓ is a conjunctive query. It can be optimized using standard query optimization to obtain an optimized physical query plan. Of course, the choice of the particular physical query plan that is selected depends on the statistics that are available and the capabilities of the sources. For instance, a plan may consist in querying S3 and then for each value a of v6 (i.e., a particular university program), asking the query q(X) :- S4.Mundus(a,X) to S4.
We now formalize the simple and intuitive notion of query unfolding.
Definition 9.3.3 (Query unfolding) Let q(x) :- G1(z1),…,Gn(zn) be a query and for each i, Gi(xi) ⊇qi(xi,yi) be a GAV mapping. An unfolding of qis the query uobtained from qby replacing, for each i, each conjunct Gi(zi) by qi(ψi(xi,yi)) where ψi is a function that maps xi to zi, and the existential variables yi to new fresh variables.
The renaming of the existential variables into fresh ones is necessary to avoid the introduction of unnecessary constraints in the unfolding. Indeed, consider an existential variable y occurring in two distinct atoms, say Gi and Gj. Then, the two atoms should be understood as ∃...y...Gi(zi) and ∃...y...Gj(zj). The scopes of y in both are disjoint and nothing requires that the two occurrences of y take the same value. Hence the renaming using fresh variables.
Example 9.3.4 Suppose we have the two mappings:
We next establish that each unfolding of a query computes a part of the desired results, and that their union computes the whole set of answers. To do so, we use two propositions. The first one ignores unfolding and focuses on the “materialization” of the global relations.
Proposition 9.3.5 Let S1,...,Sn be a set of source relations; G1,...,Gm a global schema defined by a set of GAV mappings over S1,...,Sn; and I be an instance over S1,...,Sn. Let J be the instance over G1,...,Gm defined by, for each j,
Proof (sketch). Let u be an answer. Then, by definition, q(u) is true in each instance J′ over G1,...,Gm such that I and J′ together satisfy the mappings. In particular, u belongs to q(J). Conversely, let u be in q(J). Let J′ be an instance such that I and J′ together satisfy the mappings. Since J′ satisfies the mappings, J ⊆J′. Since conjunctive queries are monotone, q(J) ⊆q(J′). Thus u∈J′. Since u belongs to all such J′, u is an answer. □
The second proposition deals with unfoldings.
Proposition 9.3.6 Let S be a set of source relations and G a set of global relations defined by a set of GAV mappings over S. Consider the query q(z) :- Gi1(zi1),…,Gin(zin) over G and the set {rℓ} of unfoldings of qgiven . Then for each I over S1,...,Sn, the answer of qis given by ∪rℓ(I).
Proof (sketch). Let J be as in Proposition 9.3.5. By the same proposition, it suffices to show that q(J) = ∪rℓ(I).
We can compute the answer using the unfoldings (also called the GAV rewritings). These unfoldings can be simplified by removing redundant conjuncts that may have been introduced by the technique. This simplification relies on checking conjunctive query containment. Given a conjunctive query with body A1(u1),...,Am(um), we verify whether each query obtained by removing some Ai(ui) is equivalent to the initial one. If yes, the atom is redundant and can be removed. We keep doing this until the query is “minimal”. This simplification test is costly but the resulting query may be much less expensive to evaluate that the initial one.
We must evaluate all the unfoldings to obtain the complete answer. If we are aware of some constraints on the local schemas or on the global one, this can be further simplified. For instance, the constraints may imply that the result of a particular unfolding is empty, in which case this particular unfolding needs not be evaluated. Also, the constraints may imply that the result of some unfolding, say rℓ, is always included in another one. Then rℓ needs not be evaluated. For instance, in the previous example, if it is known that students obtained from the source S2 are European students, while those obtained from the source S3 are non European students, we can be sure that the GAV rewriting rℓ obtained by unfolding will not provide any answer. This requires expressing and exploiting disjointness constraints over the local relations. Inclusion constraints on local relations would, on the other hand, permit to detect in advance that a given query plan provides answers that are redundant with those obtained by another query plan.
A main limitation of GAV is that adding or removing data sources to the integration system may require deeply revising all the views defining the global schema. In a Web context where sources may come and go, e.g., because of (non) availability of servers, this is really too constraining. The LAV approach does not suffer from this disadvantage.
The LAV approach takes a dual approach. The local relations are defined as views over global relations. The goal is to define the global schema in such a way that individual definitions do not change when data sources join or leave the integration system except for the definitions of the sources that are involved in the change.
Definition 9.4.1 (LAV mapping) A LAV mapping is a mapping of the form: S ⊆ q, for some conjunctive query q(x1,...,xn) :- A1(u1),...,Ak(uk) over the global relations. Its semantics is:
Again, S(x1,...,xn) is called the head of the view, whereas A1(u1),...,Ak(uk) is called the body of the view.
Example 9.4.2 We define the global schema as consisting of the following relations:
Student(studentName), | EuropeanStudent(studentName), |
University(uniName), | NonEuropeanStudent(studentName), |
FrenchUniversity(uniName), | EuropeanUniversity(uniName), |
NonEuropeanUniversity(uniName), | Program(title), |
MasterProgram(title), | EnrolledInProgram(studentName,title), |
Course(code), | EnrolledInCourse(studentName,code), |
PartOf(code, title), | RegisteredTo(studentName, uniName), |
OfferedBy(title, uniName). |
The four data sources considered in the previous example can be described by the following LAV mappings:
m1: S1.Catalogue(U,P)⊆ | FrenchUniversity(U), Program(P), OfferedBy(P,U), |
OfferedBy(P’,U), MasterProgram(P’) | |
m2: S2.Erasmus(S,C,U)⊆ | Student(S), EnrolledInCourse(S,C), PartOf(C,P), |
OfferedBy(P,U), EuropeanUniversity(U), | |
EuropeanUniversity(U’)RegisteredTo(S,U’), | |
U≠ U’ | |
m3: S3.CampusFr(S,P,U)⊆ | NonEuropeanStudent(S), Program(P), |
EnrolledInProgram(S,P), OfferedBy(P,U), | |
FrenchUniversity(U), RegisteredTo(S,U) | |
m4: S4.Mundus(P,C)⊆ | MasterProgram(P), OfferedBy(P,U), |
OfferedBy(P,U’), EuropeanUniversity(U), | |
NonEuropeanUniversity(U’), PartOf(C,P) |
LAV mappings enable quite fine-grained descriptions of the contents of data sources. For example, we are able to specify precisely the students that can be found in the Erasmus source: they are European students enrolled in courses of a given (European) university that is different from their home (European) University in which they remain registered.
LAV mappings express loose coupling between local and global relations, which is important for flexibility and robustness when the participating data sources change frequently. If we are interested in Master students, we do not need to know in advance (unlike the GAV approach) how to join two sources. We just define them as a global query:
MasterStudent(E) :- | Student(E), EnrolledInProgram(E,M), |
MasterProgram(M). |
The local sources that must be queried and combined to get the Master students will be discovered by the rewriting process. Recall that, in the GAV approach, they were predefined by the two mappings given in Example 9.3.2.
The price to pay for the flexibility of LAV compared to GAV is that the rewritings are more complicated to find. We describe three algorithms that achieve this rewriting. The Bucket algorithm and the Minicon algorithm follow the same approach. They first determine the local relations that are relevant to the query, then consider their combinations as candidate rewritings and verify whether they are indeed correct. Minicon is actually an optimization of Bucket that avoids the last verification step by a trickier first step. The third algorithm, namely the Inverse-rules algorithm, follows a completely different approach: it consists in transforming the logical rules supporting the LAV mappings (which are unsafe rules) into a set of safe rules with a single global relation. The global query is unfolded using these rules.
The principle of the Bucket algorithm is quite simple. It proceeds in three steps:
Let g be a query atom. The atoms in bucket(g) are the heads of mappings having in their body an atom from which g can be inferred. Intuitively, data comes from source relations, and a (global) query atom is satisfied by (local) data only if it can be matched to a (global) atom in the body of a mapping whose head can be matched to source facts. A match between g and some atom in the body of a mapping is thus an indication that the corresponding data source provides a relevant information for this particular query.
There is an extra constraint that has to be considered to guarantee that g can indeed be logically inferred, as illustrated next. In fact, the bucket of a query atom g includes a view atom v only if an atom in the body of v can be matched with g by a variable mapping such that the variables mapped to the distinguished variables of g are also distinguished variables in the view defining the mapping.
Let us illustrate this on an example. Consider the LAV mappings of Example 9.4.2, and the global query:
Let us consider the query atom g= RegisteredTo(s,x), in which the variable x is distinguished.
We can find two mappings (m2 and m3) in which a body atom can be matched to
RegisteredTo(s,x).
First, consider the mapping m3:
m3: S3.CampusFr(S,P,U)⊆ | NonEuropeanStudent(S), Program(P), |
EnrolledInProgram(S,P), OfferedBy(P,U), | |
FrenchUniversity(U), RegisteredTo(S,U) | |
The atom RegisteredTo(s,x) matches the atom RegisteredTo(S,U) with the variable mapping {S/s,U/x}, where U is distinguished in the view defining the mapping (it occurs in the head of this LAV mapping).
Therefore, applying the variable mapping {S/s,U/x} to the head S3.CampusFr(S,P,U) of the mapping m3 enforces the matching of RegisteredTo(S,U) with the query atom RegisteredTo(s,x), and then:
S3.CampusFr(s,v1,x)∧FOL(m3)⊧ ∃sRegisteredTo(s,x)
Thus S3.CampusFr(s,v1,x) is added in Bucket(g). Note that v1 is simply a fresh variable mapped to the variable P appearing in S3.CampusFr(S,P,U) but not in the variable mapping {S/s,U/x}.
On the other hand, consider the mapping m2:
m2: S2.Erasmus(S,C,U)⊆ | Student(S), EnrolledInCourse(S,C), PartOf(C,P), |
OfferedBy(P,U), EuropeanUniversity(U), | |
EuropeanUniversity(U’)RegisteredTo(S,U’), U≠ U’ |
The match this time is between g = RegisteredTo(s,x) and RegisteredTo(S,U’) by the variable mapping {S/s,U′/x}. The difference with the previous situation is that the variable U’ is existentially quantified in the view defining the mapping. Applying the variable mapping {S/s,U′/x} to the head S2.Erasmus(S,C,U) of the mapping m2do not enforce the matching of RegisteredTo(S,U’) in its body with the query atom RegisteredTo(s,x).
More formally:
S2.Erasmus(s,v2,v3)∧FOL(m2)⊭ ∃sRegisteredTo(s,x).
To see why, consider the LAV mapping m2 and its logical meaning FOL(m2):
FOL(m2): | ∀S∀C∀U [ S2.Erasmus(S,C,U)⇒∃P∃U’ ( |
EuropeanStudent(S)∧EnrolledInCourse(S,C)∧ | |
PartOf(C,P)∧OfferedBy(P,U) | |
∧EuropeanUniversity(U)∧RegisteredTo(S,U’)∧U≠ U’) ] |
From the fact that S2.Erasmus(s,v2,v3), it follows that:
∃s∃U’RegisteredTo(s,U’).
However, this is a strictly weaker statement than ∃sRegisteredTo(s,x) where x is fixed. We prove this next. Consider an instance I over the domain Δ = {s,v2,v3,v4,v5,x} defined by:
As a consequence, S2.Erasmus(s,v2,v3) does not belong to the bucket and:
Input: An atom g = G(u1,...,um) of the query q and a set of LAV mappings
Output: The set of view atoms from which g can be inferred
(2)for each LAV mapping S(x) ⊆q(x,y)
(3) if there exists in q(x,y) an atom G(z1,...,zm) such that
(4) zi is distinguished for each i such that ui is distinguished in q;
(5) let ψ the variable mapping {z1/u1,....,zm/um}
(6) extended by mapping the head variables in x not
(7) appearing in {z1,...,zm} to new fresh variables;
(9)return Bucket(g);
Algorithm 5 constructs the buckets. Proposition 9.4.3 is a logical characterization of the view atoms put in the buckets of the atoms of the global query.
Proposition 9.4.3 Let G(u1,...,um) be an atom of the global query. Let u be the (possibly empty) subset of existential variables in {u1,...,um}. Let m: S(x) ⊆q(x,y) be a LAV mapping. Then
S(v), FOL(m)⊧∃uG(u1,...,um)
iff there exists a view atom in Bucket(g) that is equal to S(v) (up to a renaming of the fresh variables).
The proof is tedious and left as exercise.
In the worst-case, the Bucket algorithm applied to each atom of a query has a time complexity in O(N ×M×V) and produces N buckets containing each at most M×V view atoms, where N is the size of the query, M is the maximal size of the LAV mappings and V is the number of LAV mappings.
Returning to the example, we obtain by the Bucket algorithm, the following buckets for the three atoms of the query q.
RegisteredTo(s,x) | EnrolledInProgram(s,p) | MasterProgram(p) |
S3.CampusFr(s,v1,x) | S3.CampusFr(s,p,v2) | S1.Catalogue(v3,v4) |
S4.Mundus(p,v5) | ||
The candidate rewritings of the initial global query are then obtained by combining the view atoms of each bucket. In the worst-case, the number of candidate rewritings is in O((M×V)N). For instance, in our example, we obtain two candidate rewritings for the query q:
A candidate rewriting may not be a valid rewriting of the query. By Proposition 9.4.3, we only know that each candidate rewriting entails each atom of the query in isolation, i.e., without taking into account the possible bindings of the existential variables within the query.
It turns out that, in our example, the first candidate rewriting r1 is not a valid rewriting of the query q: the body of q is not logically entailed by the conjunction of the view atoms in the body of r1.
To see why, we first apply to each view atom in the body of r1 the corresponding LAV mapping to obtain the logical global expression (i.e., built on global relations). This step is called expanding r1, and its result, the expansion of r1. In our case, the expansion of r1 is the following query expression:
Note that new existential variables may be introduced by the expansion of some view atoms. For instance, the LAV mapping defining S1.Catalogue(v3,v4) contains the existential variable denoted P’ in the LAV mapping definition. Such variables are renamed with new fresh variables to avoid unnecessary constraints between the variables. In our example, this corresponds to variable v5 in the body of Exp_r1(x).
To check whether a rewriting is correct, it suffices to check with the Conjunctive Query Containment algorithm whether the query Exp_r1(x) is contained in the query q(x). For each variable v, let the corresponding constant, i.e., ψ(v), be "v". The canonical database obtained from r1 is given in Figure 9.2.
NonEuropean- | Program | EnrolledIn- | OfferedBy | French- | RegisteredTo | Master- |
Student | Program | University | Program | |||
"s" | "v1" | ("s", "v1") | ("v1", "x") | "x" | ("s", "x") | "v5" |
"p" | ("s", "p") | ("p", "v2") | "v2" | ("s", "v2") | ||
"v4" | ("v4", "v3") | "v3" | ||||
("v5", "v3") | ||||||
The evaluation of q(x) over this canonical database yields an empty result because there is no way of assigning the existential variables s and p to constants of the canonical database which satisfies the binding of the existential variable p between the two last atoms of the body of the query.
Expanding the rewriting r2 and checking that it is contained into the query q is left in exercise. This shows that among the two candidate rewritings, only r2 is a valid rewriting:
Remark 9.4.4 In spite of the apparent redundancy of the two first atoms, this rewriting cannot be simplified to
It is true that r2.1(x) is contained into r2(x). However, the two queries are not equivalent. For some data sets, it may be the case that there is a student s and there is a university x such that (based on S3.CampusFr), s is registered in x and also enrolled in a Mundus master program offered by another university. The containment would hold under a constraint that would forbid a student to be registered in more than one universities.
One can prove that each rewriting finally obtained does indeed provide answers and that their union constitutes the complete answer.
The idea underlying Minicon is to avoid putting in a bucket an atom that will only generate invalid rewritings. As we saw in the discussion of Bucket, the reason for an atom to be useless is that its binding of a variable does not match with the binding of other occurrences of that variable. This explains why a candidate rewriting (like r1) is not valid.
We now illustrate the Minicon algorithm by example. Consider the query q:
and the two LAV mappings:
V1(u,v)⊆T(w,u), U(v,w), R(v,u) |
V2(u,v,v’)⊆T(w,u), U(v,w), R(v’,w) |
Minicon proceeds in two steps that correspond to the first two steps of Bucket.
Minicon scans each atom in the query, but instead of creating buckets for them, it builds MCDs (short name for Minicon Descriptions). The first iteration of Minicon determines the relevance of the different LAV mappings to rewrite the first query atom U(y,z):
Minicon does not consider the query atom U(y,z) in isolation. Instead, since the variable w is existential in the view defining the mapping, and mapped to the variable z that has several occurrences in the query, it checks whether the variable mapping {v/y,w/z} also covers all the query atoms involving variable z, i.e., can be extended to also match R(x,z) and T(z,y). Because variable w is existential in the expansion of V1(u,v) (i.e., w does not appear in the head of the mapping), it is the only way to enforce the several occurrences of z in the query to be mapped to by the same variable w. Here, matching the query atom R(x,z) with an atom of the form R(_,w) in the expansion of V1(v1,y) is not possible: there does not exist such an atom in the expansion of V1(v1,y). Therefore, no MCD is created from V1 for covering the query atoms including an occurrence of the variable z.
MCD1 = (V2(y,y,x)) , {1,2,3} )
The last iteration of building MCDs corresponds to the last query atom: R(y’,x). The LAV mapping V1 has in its expansion the atom R(v,u) that can be matched to it by the variable mapping {v/y′,u/x)}. Since the distinguished variable x in the query is assigned to the distinguished variable (same condition as for adding to a bucket), and since the existential variable y’ of the query atom has a single occurrence in the query, the following MCD is created:
MCD2 = ((V1(x,y’), {4})
In contrast, there is no MCD created for R(y’,x) with the second LAV mapping: in the variable mapping {v′/y′,w/x)} that allows to match the query atom R(y’,x) with the atom R(v’,w) in the expansion of V2, the distinguished variable x in the query is assigned to the variable w which is not distinguished in the expansion of of V2. As for adding a view atom in a bucket, a MCD is created for a query atom g only if the variables mapped to the distinguished variables of g are also distinguished variables in the view defining the mapping.
.
The second step of Minicon replaces the combination of the buckets by the combination of the MCDs. More precisely, the rewritings are obtained by combining MCDs that cover mutually disjoint subsets of query atoms, while together covering all the atoms of the query.
Because of the way in which the MCDs were constructed, the rewritings obtained that way are guaranteed to be valid. No containment checking is needed, unlike in the Bucket Algorithm. In our example, we would therefore obtain as single rewriting of q(x):
This algorithm takes a radically different approach. It transforms the LAV mappings into GAV mappings (called inverse rules) so that the complex operation of query rewriting using LAV mappings can then be replaced by the much simpler operation of query unfolding. A LAV mapping is replaced by several GAV mappings, one for each atom in the body of the rule. The subtlety is to keep bindings between the different occurrences of the same existential variable in the body. This is realized using a simple trick from first-order logic, namely by introducing Skolem functions.
Let us explain the Inverse-rules algorithm on the example we used for Minicon. A first important point that distinguishes it from the Bucket and Minicon algorithms is that the Inverse-rules algorithm is independent of the query. It only considers as input the set of LAV mappings:
V1(u,v)⊆T(w,u), U(v,w), R(v,u) |
V2(u,v,v’)⊆T(w,u), U(v,w), R(v’,w). |
Consider the first LAV mapping and recall that its logical meaning mapping is the formula:
But we can infer much more. We can also infer that there exists some constant d1 such that T(d1,a) and U(b,d1) are both true. We do not know the exact value of that constant d1, but we know it exists and that, in some way, it depends on the constants a,b. Since this dependency comes from the first rule, we denote this unknown d1 value: f1(a,b).
Creating the inverse rules This motivates the construction of three following GAV mappings for which we give also the FOL translation.
In the previous rules, the symbol f1 is a Skolem function of arity 2, and f1(u,v) is a Skolem term denoting some constant that depends on the values instantiating the variables u,v. Given two distinct Skolem terms, e.g. f1(1,2) and f1(2,v3), we cannot tell whether they refer to the same constant or not.
The Inverse-rules algorithm just scans the LAV mappings and creates n GAV mappings for each LAV mapping having n atoms. The result of this algorithm applied to the second LAV mappings in the example is:
Obtaining the rewritings by unfolding: The rewritings of any global query is now obtained by unfolding the query atoms using the (Inverse-rules) GAV mappings corresponding to the initial set of LAV mappings. The unfolding operation here is a bit trickier than the unfolding defined in Definition 9.3.3, because of the Skolem terms. In Definition 9.3.3, the unfolding was based on matching each query atom G(x1, .., xm) with an atom (in the right-hand side of a GAV mapping) of the form G(z1, .., zm) by equating each pair (zi, xi) of variables. Proposition 9.3.6 showed that unfolding each atom of the query in isolation builds valid rewritings of the query, i.e., conjunctions of view atoms which logically implies the conjunction of the query atoms. It is not the case anymore when atoms in the right-hand side of GAV mappings contain Skolem terms.
The unification of two atoms with functions is more complex than just equating variables, and it may fail. It may require the substitution of some variables with functional terms (in our case, Skolem terms). This may make impossible to unify the other atoms of the query with atoms in the right-hand side of GAV mappings.
Let us illustrate on our example the subtleties of unfolding queries in presence of functional terms. Consider again the same query q:
The query atom U(y,z) can be unified with the atom U(v,f1(u,v)) in the right-hand side of the GAV mappings IN12 using a so-called most general unifier (mgu). In this case, the mgu is the substitution:
This substitution that makes the unfolding of the first query atom possible, now constrains the other occurrences in the query of the variables y and z for the unfolding of the other query atoms. After the application of σ to the whole body of the query and the unfolding of the first query atom made possible by σ, we obtain the following (partial) query rewriting:
The unfolding of the second atom R(x,f1(v2,v1)) yields V1(f1(v2,v1),x), and we obtain the (partial) rewriting:
It is useless to continue unfolding the remaining query atoms of pr2(x). As soon as a given unfolding has produced a view atom with Skolem terms, we can be sure that the evaluation of the query plan under construction will not produce any answer: there is no way to match V1(f1(v2,v1),x) with any fact in the data source which are of the form V1(a,b) where a,b are constants. Since we don’t know f1(v2,v1), there is absolutely no reason to believe that it is equal to a.
Using the inverse rule IN23 to unfold R(x,f1(v2,v1)) does not help because unifying R(x,f1(v2,v1)) and R(v’,f2(u,v,v’)) fails because of the two different Skolem functions. Thus, the (partial) rewriting issued from unfolding U(y,z) using the inverse rule IN12 is abandoned.
Let us try now to unfold U(y,z) using IN22 made possible by the substitution
Now, unfolding R(x,f2(v2,v1,v3)) using the inverse rule IN23 is possible thanks to the substitution
in which one of the first two atoms can be dropped.
Now, we examine the unfolding of the query atom T(f2(v2,v1,x),v1), which requires checking whether T(f2(v2,v1,x),v1) and T(f2(u,v,v’),u) are unifiable. This is the case thanks to the substitution {v2/v3,u/v3,v1/v3,v/v3,v′/x}, which leads to the (partial) rewriting:
Again, we can remove the first atom that is redundant and obtain the equivalent (partial) rewriting:
Finally the unfolding of R(y’,x) using IN23 leads to the final rewriting:
The three algorithms have the same (worst-case) complexity and they guarantee to provide the correct answer. Some experiments have shown that in practice Minicon outperforms both Bucket and Inverse-rules. The main advantage of the Inverse-rules algorithm over the Bucket and Minicon algorithms is that the step producing the inverse rules is done independently of the queries. Furthermore, the unfolding step can also be applied to Datalog queries, i.e., to recursive queries.
The common limitation of the three algorithms is that they do not handle additional knowledge (ontology statements) that can be known about the domain of application. In the next section, we see how to extend both the Local-As-Views and Global-As-Views approaches with DL-LITE ontologies, i.e., we consider global schemas that include constraints expressed as DL-LITE axioms.
We first show a negative result: as soon as we add functionality constraints over the global schema, the number of conjunctive rewritings of a query to be considered, may become infinite. This is a severe limitation for extending the LAV or GAV approaches since such constraints are rather natural. So these approaches to data integration fail when we consider the DL-LITE dialect of previous chapters. On the positive side, we show how to extend the GAV and LAV approaches to constraints expressible in DL-LITE.
We illustrate on an example the problem raised by taking into account functionality constraints in the global schema. Let us consider a global schema with one unary relation C and two binary relations R and R’. In both R and R1, we impose that the first attribute is a key. Let us consider two LAV mappings:
To prove it, we expand r1(x) and show that the resulting expansion together with the logical axiom expressing the functionality of R1 logically implies the conjunction of atoms in the body of the query. The expansion of r1(x) is:
It is important to note that to properly check containment, the standard query containment algorithm seen in the previous section would have to be modified in a standard manner to take into account functional dependencies. Intuitively, one would have to proceed pretty much as we did in the example, equating variables as implied by the functional dependencies.
It turns out that the situation is even more subtle. Surprisingly, this rewriting r1(x) is not the only one. In fact there exists an infinite number of different rewritings for q(x). Let k ≥ 2. The following query is a valid rewriting of q(x):
The expansion of rk(x) is:
One can also show that for each k, each such rewriting may return answers that are not returned with k- 1. Thus, there exists an infinite number of non redundant conjunctive rewritings. The reader familiar with Datalog will observe that this infinite collection of rewritings can be captured in Datalog by the following recursive rewriting:
Querying data through DL-LITE ontologies has been detailed in the previous chapter. It has been shown how the positive and negative constraints expressed in the ontology are exploited both for data consistency checking and for query answering. In particular, the first step of query answering is the query reformulation step which is performed by the PerfectRef algorithm: using the positive constraints, called the PIs, it computes a set of reformulations, which are then evaluated over the data to produce the answer set of the original query. The negative constraints, called the NIs, are used to check data consistency, by translating each (declared or entailed) NI into a Boolean conjunctive query qunsat that must be evaluated over the data.
In this section, we show how to extend both the LAV and GAV approaches to rewrite queries in term of views when the global schema includes some DL-LITE Tbox.
Two observations explain how this can be realized:
These two observations follow from the completeness of the PerfectRef and Consistent algorithms for DL-LITE presented in the previous chapter, and that of the rewriting algorithms of Sections 9.3 and 9.4; namely Unfolding for GAV, Minicon, Bucket or Inverse-rules for LAV.
The argument may be somewhat too abstract for some readers. We next illustrate these two points with examples. We use the global schema considered in Example 9.4.2 page 439, enriched with the DL-LITE Tbox of Figure 9.3. Note in particular that we add the subclass College of the class University, the subproperty EnrolledInCollege of the property RegisteredTo, for which the domain is declared as being the class MasterStudent. In addition, we add the property EnrolledInMasterProgram that we declare as a subproperty of the property EnrolledInProgram. Finally, we declare a mandatory property for the class College: any college must have students enrolled in it.
DL notation | FOL notation |
PIs: | |
MasterStudent ⊑Student | MasterStudent(X) ⇒Student(X) |
EuropeanStudent ⊑Student | EuropeanStudent(X) ⇒Student(X) |
NonEuropeanStudent ⊑Student | NonEuropeanStudent(X) ⇒Student(X) |
College ⊑University | College(X) ⇒University(X) |
FrenchUniversity ⊑University | FrenchUniversity(X) ⇒University(X) |
EuropeanUniversity ⊑University | EuropeanUniversity(X) ⇒University(X) |
NonEuropeanUniversity ⊑University | NonEuropeanUniversity(X) ⇒University(X) |
∃EnrolledInCollege ⊑MasterStudent | EnrolledInCollege(X,Y) ⇒MasterStudent(X) |
College ⊑∃EnrolledInCollege- | College(X) ⇒∃YEnrolledInCollege(Y,X) |
EnrolledInCollege ⊑RegisteredTo | EnrolledInCollege(X,Y) ⇒RegisteredTo(X,Y) |
MasterStudent ⊑∃EnrolledInMasterProgram | MasterStudent(X) ⇒∃YEnrolledInMasterProgram(X,Y) |
∃EnrolledInMasterProgram-⊑MasterProgram | EnrolledInMasterProgram(X,Y) ⇒MasterProgram(Y) |
EnrolledInMasterProgram ⊑EnrolledInProgram | EnrolledInMasterProgram(X,Y) ⇒EnrolledInProgram(X,Y) |
NIs: | |
NonEuropeanStudent ⊑¬EuropeanStudent | NonEuropeanStudent(X) ⇒¬EuropeanStudent(X) |
NonEuropeanUniversity ⊑¬EuropeanUniversity | NonEuropeanUniversity(X) ⇒¬EuropeanUniversity(X) |
NonEuropeanUniversity ⊑¬FrenchUniversity | NonEuropeanUniversity(X) ⇒¬FrenchUniversity(X) |
We revisit Example 9.3.2 by adding the data source S5 giving the list of French so-called Grandes Ecoles. Its local schema is made of the local relation: S5.GrandeEcole(nomEcole). According to this new source and also to the enriched global schema of Figure 9.3, we add the following GAV mappings to the ones already considered in Example 9.3.2:
College(U)⊇S5.GrandeEcole(U) |
EuropeanStudent(N)⊇S2.Erasmus(N,C,U) |
NonEuropeanStudent(N)⊇S3.CampusFr(N,P,U) |
Consider again the global query looking for universities with registered master students:
It is left as an exercise to show that the application of the PerfectRef(q(x), PI) algorithm returns, in addition to q(x) itself, the reformulation:
By unfolding q(x), we obtain the same two rewritings as in Example 9.3.2:
By unfolding the reformulation q1(x), we get the additional rewriting:
It is important to note that even if we had the GAV mapping
College(U)⊇S5.GrandeEcole(U),
the rewriting r3(x) would not have been obtained without reformulated first the initial query q(x) into q1(x).
Now, in contrast with the standard GAV approach, we have to check the consistency of each of these rewritings. To do so:
We illustrate the consistency check by checking the consistency of the rewriting r1(x). First, its expansion replaces each of its local atoms S(z) with the conjunction of global atoms of the form G(z) that can be produced by a GAV mapping G(x) ⊇S(x), if such GAV mappings exist. For expanding r1(x), we apply the following GAV mappings:
NonEuropeanStudent(N)⊇S3.CampusFr(N,P,U) |
University(U)⊇S3.CampusFr(N,P,U) |
RegisteredTo(N,U)⊇S3.CampusFr(N,P,U) |
EuropeanStudent(N)⊇S2.Erasmus(N,C,U) |
University(U)⊇S2.Erasmus(N,C,U) |
MasterProgram(T)⊇S4.Mundus(T,C) |
MasterCourse(C)⊇S4.Mundus(T,C) |
As a result, we obtain the following expansion for r1(x):
We then apply the Consistent algorithm. For this, we evaluate qunsat1, qunsat2 and qunsat3 over the body of Exp_r1(x) seen as a relational database; i.e., we freeze its atoms to obtain a canonical instance. Query qunsat1 returns true, so an inconsistency has been detected and the rewriting r1(x) is rejected.
We revisit Example 9.4.2 by adding the same data source S5 as in Section 9.5.2. The GAV mapping is also a LAV mapping: S5.GrandeEcole(U)⊆College(U)
Consider again the global query considered in Section 9.4.1:
It is left as an exercise to show that the application of the PerfectRef(q(x), PI) algorithm returns, in addition to q(x) itself, the following reformulation:
By applying the Minicon algorithm1 to the initial query q(x), we obtain the following rewriting (as shown in Section 9.4.1):
By applying the Minicon algorithm to the reformulation q1(x) of the initial query, we obtain the additional rewriting:
As for the extended GAV approach, the consistency of LAV rewritings is not guaranteed because of the NIs in the Tbox. We follow the same approach: at compile time, the closure of the NIs is computed and each (declared or inferred ) NI is compiled into a Boolean query qunsat. At query time, each of these qunsat queries is evaluated over the canonical instance corresponding to each rewriting.
In contrast with the centralized mediator model, a Peer-to-Peer data management system (PDMS for short) implements a decentralized view of data integration, in which data sources collaborate without any central authority. In a PDMS, each collaborating data source can also play the role of a mediator, so is at the same time a data server and a client for other data sources. Thus each participant to the system is a peer and there are mappings relating data from the different peers. A PDMS architecture is therefore very flexible in the sense that there is no need for a global schema defining in advance a set of terms to which each data source needs to adhere. Over time, data sources can join or leave the PDMS just by adding or removing mappings between them. PDMS are inspired by P2P file sharing systems but they enable answering fine-grained queries. Like in the mediator model, answering queries is performed by reformulating queries based on the mappings, but in a decentralized manner.
Each peer in a PDMS has a peer schema composed of peer relations and peer mappings that relate its schema to the schemas of other peers. To avoid confusing relations from different peers, we assume that each relation of peer p is of the form r@p for some local relation name r. A query to a PDMS is posed using the peer schema of one of the peers. A query is asked to a particular peer, as a query over his particular schema. It is reformulated using the peer mappings into a set of queries that may refer to other peer relations. This captures the intuition that we want to use the information available in the entire P2P system to answer the query.
For designing the mappings, the distinction made in the mediator model between local and global relations does not make sense anymore, since each peer relation may play the role at different times both of a local relation and of a global relation. Therefore, the notions of GAV and LAV mappings are relaxed to the more appropriate symmetric notion of GLAV mappings.
Definition 9.6.1 (GLAV mapping) Let S@i and S@j be the peer schemas of two peers i and j. A GLAV mapping between these two peers is an inclusion axiom of the form: qi(x) ⊆qj(x), where qi(x) and qj(x) are conjunctive queries over the peer schema S@i, S@j, respectively.
Let qi(x,yi) and qj(x,yj) be the bodies of qi(x) and qj(x)), respectively. The semantics of the GLAV mapping qi(x) ⊆qj(x) is: ∀x[∃yiqi(x,yi) ⇒∃yjqj(x,yj)].
In database terms, a GLAV mapping qi(x) ⊆qj(x) expresses that answers obtained by asking qi(x) at peer i should also be considered as answers to qj(x) asked at peer j. Note that with this semantics, each local query is assumed to be incompletely answered with local data since external data may bring in new information to it. As already mentioned, such an open-world assumption is fully appropriate for Web data.
We next show one negative and one positive result for PDMSs. In Section 9.6.1, we show that in general, answering queries with GLAV mappings is undecidable, so without further restriction, answering queries in a PDMS is undecidable. In Section 9.6.2, we show that if we restrict the peer mappings to be DL-LITE inclusion axioms, a decentralized version of the algorithm for DL-LITE can be used to answer queries in DL-LITE PDMSs.
We show that the Dependency Implication Problem (more precisely, the problem of the implication of an inclusion dependency from a set of inclusion and functional dependencies) can be reduced to the GLAV Query Answering Problem, i.e., the problem of answering queries in presence of GLAV mappings. Since the Dependency Implication Problem is known to be undecidable, this shows that the GLAV Query Answering Problem is also undecidable.
The reduction technique is standard for proving undecidability results. We first recall how it works and also recall the Dependency Implication Problem. We believe that these notions are important to know beyond the scope of this book. Finally, we use a reduction to show the undecidability of answering queries using GLAV mappings.
Let B be a Boolean function over a set X. The decision problem B is decidable if there exists an algorithm (in any computation model equivalent to Turing machines) that terminates on any input x∈B and returns “true” if and only if B(x) is true.
Let B,B′ be two decision problems. A reduction from B to B’ is an algorithm f computing a function (also denoted f ) from X to X’ such that: B(x) is true ⇔B’(f(x)) is true.
It is immediate to see that if there is a reduction f from B to B′:
We recall the class of dependencies that are used. Let R be a relation of arity n. Then:
if for each k∈ [1..m], aik = bik, then aj = bj.
We will use the following known result:
Theorem 9.6.2 (Undecidability of the Dependency Implication Problem). Let R = {R1,...,Rn} be a relational schema. Given a set Σ of functional and inclusion dependencies and an inclusion dependency σ over relations in R, one cannot decide whether Σ⊧σ (i.e., whether each instance over R satisfying Σ also satisfies σ).
The problem is undecidable when Σ contains both functional and inclusion dependencies. Note that the implication problem is decidable for functional dependencies alone, and for inclusion dependencies alone. Undecidability arises when they are considered together.
The GLAV Query Answering Problem is to decide, given a PDMS defined using a set of GLAV mappings and a query asked at one of the peers whether some particular tuple is in its answer.
Let us define a reduction from the Dependency Implication Problem to the GLAV Query Answering Problem. If we show that such a reduction exists, since the Dependency Implication Problem is undecidable, this will show that the GLAV Query Answering Problem is undecidable.
Surprisingly, we can show the reduction for a PDMS with a single peer. To do that, we will use some GLAV mapping of the form q@P ⊇q′@P, where both sides of the mapping involve the same peer. Note that the undecidability still holds if such “self” mappings are forbidden. Indeed, we can simulate such a mapping by using “clones” of relations. For instance, suppose that we want to enforce the mapping R@P(x1,...,xn) ⊇R′@P(y1,...,yn). Then we can use a dummy site P and a copy R@P of R@P with the mappings:
Let (Σ,σ) be an instance over {R1,...,Rn} of the Dependency Implication Problem with Σ a finite set of functional and inclusion dependencies, and σ an inclusion dependency. We build a PDMS defined as follows:
It is easy to see that the GLAV mappings force each Ri to satisfy the functional dependencies of Ri, and each Ri,Rj to satisfy the inclusion dependencies between Ri and Rj.
Let us assume that σ = Ri : i1 ⊆Rj : j1 for some Ri of arity n. (This is without loss of generality since the implication is already undecidable when σ is unary).
Let Ext(Ri) be the set of tuples t of arity n with values in [1..n] such that:
For instance if n = 3 and i1 = 2, Ext(Ri) = {⟨1,1,1⟩,⟨1,1,2⟩,⟨2,1,1⟩,⟨2,1,2⟩,⟨2,1,3⟩}.
We construct an instance ( ,Ext(Ri),q) of the GLAV Query Answering Problem where q is the query defined by q(x) :- Rj(y1,...,x,...yk) where the distinguished variable x is in position j1, and the existential variables yi are pairwise distinct.
We show that Σ⊧σ iff 1 is an answer to q in the PDMS in which the only data is Ext(Ri).
(⇒) Suppose that Σ⊧σ. Let I be a model of Ext(Ri) satisfying the GLAV mappings of . By construction of those GLAV mappings, I is a model of Σ. Because Σ⊧σ, I is a model of σ, and thus for each tuple ⟨a1,...,an⟩ in I(Ri), there exists a tuple ⟨b1,...,bk⟩ in I(Rj) such that ai1 = 1 = bj1. Therefore, I⊧∃y1,...,ykRj(y1,...,1,...yk), i.e., I⊧q(1). Thus 1 is an answer to q given the GLAV mapping of and the extension Ext(Ri).
(⇐) Conversely, suppose that 1 is an answer to q given the GLAV mapping of and the extension Ext(Ri). Note that 1 is also an answer to q if the extension of Ri is reduced to any tuple of the original Ext(Ri). Suppose that Σ⊭σ: there exists an interpretation I that satisfies Σ in which σ is not satisfied. This means that there exists a tuple ⟨e1,...,e,...en⟩ (where e is in position i1) in I(Ri) such that there does not exists a tuple in I(Rj) with the value e in position j1. Let t be the tuple of Ext(Ri) which corresponds to the equality pattern between values of ⟨e1,...,e,...en⟩. By extending I to interpret each value of t by the element ei at the same position in ⟨e1,...,e,...en⟩, we obtain a new interpretation I′ that satisfies Σ and thus each GLAV mapping of , and Ri(t). Since 1 is an answer to q given the GLAV mapping of and Ri(t), I′⊧q(1), i.e., I′(1) ∈I′(Rj[j1]). Since I′(1) = e and I′(Rj) = I(Rj), it means that there exists a tuple in I(Rj) with the value e in position j1, which contradicts our assumption that σ is not satisfied in I. Hence Σ⊧σ. □
If we restrict the GLAV mappings in a PDMS to be inclusion statements that are expressible in DL-LITE, we get what we will call a DL-LITE PDMS. The decidability of query answering over a DL-LITE PDMS results from the algorithmic machinery described in the previous chapter for answering queries over DL-LITE knowledge bases. Given a query posed to a given peer, the application of the PerfectRef algorithm to the set of all the GLAV mappings in the PDMS provides a set of reformulations. The union of the answer sets obtained by evaluating each reformulation provides the answer set of the initial query. Note that a reformulation is of the form:
R1@i1(z1), …, Rk@ik(zk)
where the different conjuncts Rj@ij(zj) may refer to relations of different peer schemas. Therefore, the evaluation of each reformulation may require the interrogation of different peers and the combination of the answers returned by each such sub-queries.
This provides a centralized algorithm for computing the reformulations of answering queries over a decentralized DL-LITE knowledge base. We next present a decentralized algorithm that computes exactly the same thing, i.e., we present a decentralized version of the PerfectRef algorithm seen in the previous chapter in order to deploy effectively DL-LITE PDMSs that avoids having to centralize all the GLAV mappings.
We denote PerfectRef i(q) the reformulation algorithm running on the peer Pi applied to a query q (asked to the peer Pi). The main procedure is the decentralized reformulation of each atom of the query using the positive inclusion statements that are distributed over the whole PDMS. Let us denote AtomRef i(g) the reformulation algorithm running on the peer Pi to reformulate the atom g (built on a relation of the schema of the peer Pi).
Within each peer Pi we distinguish the local positive inclusion axioms of the form Ci ⊆Di where Ci and Di are built over relations in the schema of the peer Pi, from the mappings which are positive inclusion mappings of the form Cj ⊆Di or Di ⊆Cj where Cj denotes a relation of another peer Pj (while Di refers to a relation in the schema of the peer Pi).
Let us denote LocalRef(g,PIi) the result of the reformulation of the atom g using the set PIi of local positive inclusion atoms of the peer Pi. We refer to the previous chapter (Definition 8.4.7, Section 8.4) for the computation of LocalRef(g,PIi) by backward application of the local PIs.
We just recall here that gr(g,I) denotes the reformulation of the atom g using the positive inclusion axiom I. We also recall that the atoms g that can be found as conjunct of a query q over a DL-LITE PDMS are of the following forms:
Running the algorithm AtomRef i on the peer Pi for reformulating the atom g consists first in computing the set LocalRef(g,PIi) of local reformulations of g, and then, for each mapping m with a peer Pj applicable to a local reformulation g′, in triggering the application of AtomRef j(gr(g′,m)) on Pj (by sending a message to Pj). Other peers Pk may be solicited in turn to run locally AtomRef k.
A loop may occur if a request of reformulation of an atom g initiated by a given peer P generates a branch of requests reaching a peer P′ which in turn requests P to reformulate g. Such loops can be easily handled by transmitting with every request the history of the current reasoning branch. More precisely, an history hist is a sequence [(gk,Pk),…,(g1,P1)] of pairs (gi,Pi) where gi is an atom of a peer Pi such that for each i∈ [1..k- 1], gi+1 is a reformulation of gi using a mapping between Pi and Pi+1.
This is summarized in Algorithm 6, which is the atom reformulation algorithm with history running on Peer i.
Input: An atom g in the vocabulary of the peer Pi, an history hist
Output: The set of its reformulations in the PDMS: R
(4) Let PIi be the local PIs of the peer Pi
(5) Let Mi be the mappings between the peer Pi and other peers
(6) for each g′∈LocalRef(g,PIi)
(7) for each mapping m∈Mi between Pi and a peer Pj applicable to g′
(8) R←R∪AtomRefHistj(gr(g′,m),[(g,Pi)|hist])
Algorithm 7 is the atom reformulation algorithm (denoted AtomRef i) running on peer Pi, which just calls AtomRefHisti with an empty history.
Input: An atom g in the vocabulary of the peer Pi
Output: The set of its reformulations in the PDMS
(1)AtomRefHisti(g,∅)
The decentralized version of the PerfectRef algorithm that computes all the reformulations of a conjunctive query q is provided in Algorithm 8. The main difference with the centralized version is that the simplification of the produced reformulations (which is required for making some PIs applicable) are delayed after (decentralized) computation of the reformulations of all the atoms in the query.
We recall here the notation used for denoting the simplification of some atoms within a query under reformulation, which were introduced in the previous chapter when describing the PerfectRef algorithm:
For each atom in the query, it computes first (in the decentralized manner explained previously) the set of all of its reformulations, and then a first set of reformulations of the original query by building all the conjunctions between the atomic reformulations (denoted ⊕i=1nAtomRefi(gi) at Line 5). These reformulations are then possibly simplified by unifying some of their atoms (Lines 8 to 11), and the reformulation process is iterated on these newly produced reformulations until no simplification is possible (general loop starting on Line 4).
Input: a conjunctive query q over the schema of the peer Pi
Output: a set of reformulations of the query using the union of PIs and mappings in the PDMS
(4) (a) foreach q′ = g1 ∧g2 ∧…∧gn ∈PR′
(10)PR′ := PR′∪{τ(reduce(q′′,g′1,g′2))}
(12)return PR
One can prove that the decentralized algorithm computes the same set of facts as the centralized one, and thus is correct. The proof results (1) from the observation that the centralized version of PerfectRefi (in which AtomRefi(gi) is computed by iterating the one-step application of PIs on each atom gi of the query) produces the same results than the original PerfectRef, and (2) from the completeness of AtomRefi(gi) ensuring the decentralized computation of all the reformulations of gi.
The Bucket and Minicon algorithms can be extended ( [113, 137]) to handle (union of) conjunctives queries with interpreted predicates. When a query q includes interpreted predicates, finding all answers to q given the LAV mappings is co-NP hard in the size of the data. This complexity result shows that answering such queries cannot be fully realized with a finite set of conjunctive rewriting (unlike what we showed here in absence of interpreted predicates). The Inverse-rule algorithm does not handle interpreted predicates but is able to build recursive query plans for data integration [59].
A survey on answering queries using views can be found in [87], and a survey on query containment for data integration systems in [124].
More material can be found on PDMS in [88, 86].
Distributed reasoning in a peer to peer setting has been investigated in [14] as a basis for querying distributed data through distributed ontologies [73, 1]. The subtle point that we have not treated in this chapter concerns consistency checking. In contrast with the centralized case, the global consistency of the PDMS cannot be checked at query time since the queried peer does not know all the peers in the PDMS. However, it can get the identifiers of the peers involved in a reformulation of the query. Then the (local) consistency of the union of the corresponding knowledge bases can be checked in a decentralized manner. The important point is that it can be shown that this local consistency is sufficient to guarantee that the answers obtained by evaluating the reformulations (computed by the decentralized algorithm that we have described) are well-founded.
The undecidability of the Dependency Implication Problem is shown in [42] even if σ is a unary inclusion dependency. More on this topic may be found in [9])
Exercise 9.8.1 By applying the query containment algorithm (see Algorithm 4), determine which query is contained in which one among the three following queries. Are there equivalent queries ? (two queries qand q′are equivalent if qis contained in q′and q′is contained in q).
Exercise 9.8.2 Consider a global schema defined by the following relations:
emp(E): E is an employee
phone(E,P): E has P as phone number
office(E,O): E has O as office
manager(E,M): M is the manager of E
dept(E,D): D is the department of E
Suppose that the three following data sources are available for providing data:
Source1 provides the phone number and the manager for some employees. It is modeled by the local relation s1(E,P,M).
Source2 provides the office and the department for some employees. It is modeled by the local relation s2(E,O,D).
Source3 provides the phone number of employees of the ’toy’ department. It is modeled by the local relation s3(E,P).
Compute the reformulation of the query in terms of local relations:
Which algorithm is easier ?
Exercise 9.8.3 Consider the three following LAV mappings:
V1(x) ⊆cite(x,y), cite(y,x)
V2(x,y) ⊆sameTopic(x,y)
V3(x,y) ⊆cite(x,z), cite(z,x), sameTopic(x,z)
1The same holds for the Bucket or Inverse-rules algorithm.