9.1 Introduction

9.2 Containment of conjunctive queries

9.3 Global-as-view mediation

9.4 Local-as-view mediation

9.4.1 The Bucket algorithm

9.4.2 The Minicon algorithm

9.4.3 The Inverse-rules algorithm

9.4.4 Discussion

9.5 Ontology-based mediators

9.5.1 Adding functionality constraints

9.5.2 Query rewriting using views in DL-LITE

9.6 Peer-to-Peer Data Management Systems

9.6.1 Answering queries using GLAV mappings is undecidable

9.6.2 Decentralized DL-LITE

9.7 Further reading

9.8 Exercices

9.2 Containment of conjunctive queries

9.3 Global-as-view mediation

9.4 Local-as-view mediation

9.4.1 The Bucket algorithm

9.4.2 The Minicon algorithm

9.4.3 The Inverse-rules algorithm

9.4.4 Discussion

9.5 Ontology-based mediators

9.5.1 Adding functionality constraints

9.5.2 Query rewriting using views in DL-LITE

9.6 Peer-to-Peer Data Management Systems

9.6.1 Answering queries using GLAV mappings is undecidable

9.6.2 Decentralized DL-LITE

9.7 Further reading

9.8 Exercices

The goal of data integration is to provide a uniform access to a set of autonomous and possibly heterogeneous data sources in a particular application domain. This is typically what we need when, for instance, querying the deep web that is composed of a plethora of databases accessible through Web forms. We would like to be able with a single query to find relevant data no matter which database provides it.

A first issue for data integration (that will be ignored here) is social: The owners of some data set may be unwilling to fully share it and be reluctant to participate in a data integration system. Also, from a technical viewpoint, the difficulty comes from the lack of interoperability between the data sources, that may use a variety of formats, specific query processing capabilities, different protocols. However, the real bottleneck for data integration is logical. It comes from the so-called semantic heterogeneity between the data sources. They typically organize data using different schemas even in the same application domain. For instance, each university or educational institution may choose to model students and teaching programs in its own way. A French university may use the social security number to identify students and the attributes NOM, PRENOM, whereas the Erasmus database about European students may use a European student number and the attributes fiRSTNAME, LASTNAME and HOME UNIVERSITY.

In this chapter, we study data integration in the mediator approach. In this approach, data remain exclusively in data sources and are obtained when the system is queried. One sometimes use the term virtual data integration. This is in contrast to a warehousing approach where the data is extracted from the data sources ahead of query time, transformed, and loaded in the warehouse. At query time, the warehouse is accessed but not the data sources. Warehouses approaches are typically preferred for very complex queries, e.g., for data mining. On the other hand, to have access to “fresh” information, a mediator approach is preferred since it avoids having to propagate in real time, data source updates to the warehouse. Figure 9.1 illustrates these two approaches of data integration.

In the mediator approach, one starts by designing a global schema (also called mediated schema) that serves as a unique entry point on which global queries are posed by users. A main issue is then to specify the relationships, namely semantic mappings, between the schemas of the data sources and the global schema. Based on these mappings, one can answer queries over the global schema using queries over the data sources. Typically, query answering in the mediator approach is performed as follows. First, independently of the data in the sources, the user’s query posed over the global schema is transformed into local queries that refer to the schemas of the data sources. A global query combines the data provided by sources. Queries are optimized and transformed into query plans. The local query plans are executed and their results combined by the global query plan.

In the following, for presentation purposes, we consider that the global schema and the schemas of the data sources to integrate are all relational. In practice, each non-relational data source (e.g., XML or HTML) is abstracted as a relational database with the help of a wrapper. Wrappers are small programs that translate local relational queries into appropriate requests understood by specific data sources, and transform their results into relations. The role of wrappers is to allow the mediator to see each data source as relational, no matter which actual format it uses.

Let us consider in more detail the specification of semantic mappings between the data sources
and the global schema. Let S_{1}, ..., S_{n} be the local schemas of n pre-existing data sources. To simplify
the presentation, let us assume that each local schema S_{i} is made of a single relation that we denote
also S_{i}. The relations S_{1}, ..., S_{n} are called the local relations. Suppose the global schema G consists of
the global relations G_{1}, ..., G_{m}. The goal is to specify semantic relations between the local
relations S_{i} and the global relations G_{j}. The G_{j} are logically (intentionally) defined by the
S_{i}.

An example of simple relationship (not very interesting) is based on equality, e.g., G_{1} = S_{1}. One
can find more complicated relationships, e.g., G_{2} = S_{1} ∪S_{2} or G_{3} = S_{1}⋈S_{3}. In these last two
examples, a global relation is defined as a query over the local relations. In other words, the global
relation is a view of the local relations. Indeed, one typically prefers more flexible constraints such
as G_{3} ⊇S_{1}⋈S_{3}. Using containment instead of equality leaves open the possibility for other
sources of providing data about G_{3}, e.g., G_{3} ⊇σ_{A=”yes”}(S_{4}). Because global relations are
constrained by views of the local relations, one uses the term global-as-view for such
specifications.

In a somewhat dual manner, one can use local-as-view constraints such as: S_{4} ⊆G_{1}⋈G_{3}. This
leaves even more flexibility since the contribution of each data source can be specified (e.g., by its
owner) independently of the other sources of the system. This kind of autonomy is typically
well-adapted to a Web setting.

More generally, to express semantic mappings between {S_{1},...,S_{n}} and {G_{1},...,G_{m}}, one can use
inclusion statements, i.e., logical constraints, of the form v(S_{1},...,S_{n}) ⊆v′(G_{1},...,G_{m}), where v and
v′ are query expressions called views. All the constraints we consider in this chapter
will be of this general form. Now, given an instance I of {S_{1},...,S_{n}} (i.e., an instance
of the data sources), we don’t know the instance J of the global schema. But we know
that:

- Global-As-View (GAV for short).
- The semantic mappings are of the form
also equivalently denoted

where each V

_{i}is a view over the local schemas, i.e., a query built on local relations. - Local-As-View (LAV for short).
- The semantic mappings are of the form
_{i}is a view over the global schema, i.e., a query built on global relations.

In our development, we will consider conjunctive queries. Using negation in queries greatly complicates the issues. In the next section, we recall some standard material on containment of conjunctive queries, i.e., of the queries at the heart of our formal development. In Sections 9.3 and 9.4, we study GAV and LAV mediators, respectively. For each of these languages, we describe appropriate query rewriting algorithms. In Section 9.5, we show the impact on query rewriting of adding DL-LITE constraints in the global schema. Finally, in Section 9.6, we lay the basis of a peer-to-peer approach for data integration. In contrast with the mediator approach which offers a unique entry point to data, peer-to-peer data management systems (PDMS for short) are decentralized data integration systems.

In this section, we recall some basic notions on comparing conjunctive queries that we will use in the following.

We recall that a conjunctive query is an expression of the form:

Given an instance I of the relations appearing in the body of the query, an answer is a tuple
⟨ν(x_{1}),...,ν(x_{n})⟩ for some valuation ν of the variables in the query, such that for each i, A_{i}(ν(u_{i}))
holds in I. We denote q(I) the set of answers.

We sometimes denote this query q(x_{1},...x_{n}) when its body is understood. Observe that the
interpretation of such a conjunctive query in logical terms is:

The data integration techniques rely on conjunctive query containment. This problem has been extensively studied because it is at the core of query optimization. We use known techniques that we recall next.

A query q_{1} is contained in q_{2}, denoted q_{1} ⊆q_{2}, if for each I, q_{1}(I) ⊆q_{2}(I). It is known that the
containment between a conjunctive query q_{1} and a conjunctive query q_{2} can be tested by finding a
“homomorphism” from q_{2} to q_{1}.

Definition 9.2.1 Let q_{1}(x_{1},...,x_{n}) and q_{2}(y_{1},...,y_{n}) be two conjunctive queries. A (conjunctive query)
homomorphism from q_{2} to q_{1} is a mapping ψ from the variables of q_{2} to the variables of q_{1} such
that:

- For each i, ψ(y
_{i}) = x_{i}; and - For each atom R(u
_{i}) in the body of q_{2}, R(ψ(u_{i})) is in the body of q_{1}.

Example 9.2.2 Consider the following queries:

- q
_{1}(x_{1},x′_{1}) : -A_{1}(x_{1},x_{2},x_{3}),A_{2}(x′_{1},x_{2},x_{3}) - q
_{2}(y_{1},y′_{1}) : -A_{1}(y_{1},y_{2},y_{3}),A_{2}(y′_{1},y_{2},y′_{3})

Consider a mapping ψ such that ψ(y_{i}) = x_{i} for each i, ψ(y′_{1}) = x′_{1} and ψ(y′_{3}) = x_{3}. Then the
required conditions hold, and it follows that q_{1} ⊆q_{2}. Intuitively, q_{2} joins A_{1} and A_{2} on the
second attribute, whereas q_{1} also joins on the third one. The additional condition induces the
containment.

The following proposition states that the existence of a homomorphism is a necessary and sufficient condition for query containment.

Proposition 9.2.3 (Homomorphism theorem) Let q_{1} and q_{2} be two conjunctive queries. Then q_{1}
is contained in q_{2} if and only if there exists a homomorphism from q_{2} to q_{1}.

This provides a simple algorithm for testing conjunctive query containment. In the general case, deciding whether a conjunctive query is contained in another one is NP-complete in the size of the two queries. In fact, in many practical cases, there are polynomial-time algorithms for query containment.

Algorithm 4 checks whether a query q_{1} is contained in a query q_{2}.

QC(q1,q_{2})

Input: Two conjunctive queries:

q_{1}(x) :- g_{1}(x_{1}),…,g_{n}(x_{n})

q_{2}(y) :- h_{1}(y_{1}),…,h_{m}(y_{m})

Output: Yes if q_{1} ⊆q_{2}; no otherwise

(1)freeze q_{1}: construct a canonical instance D_{can} = {g_{i}(ν(x_{i}))∣1 ≤i≤n}

(2) for some valuation ν mapping each variable in q_{1}

(4)if ν(x) ∈q_{2}(D_{can}) return yes

(5)else return no.

Example 9.2.4 Consider the queries of Example 9.2.2. The canonical instance D_{can} is
A_{1}(a,b,c),A_{2}(a′,b,c). It is easily verified that q_{2}(D_{can}) = (a,a′), which is ν(x,x′).

The main advantage of GAV is its conceptual and algorithmic simplicity. The global schema is
simply defined using views over the data sources and specifies how to obtain tuples of the global
relation G_{i} from tuples in the sources.

Definition 9.3.1 (GAV mapping) A GAV mapping is an expression of the form: R(x_{1},...,x_{n}) ⊇
q(x_{1},...,x_{n}), where q(x_{1},...,x_{n}) :- A_{1}(u_{1}),...,A_{k}(u_{k}) is a conjunctive query of the same arity as R.
The semantics of this mapping is:

We write alternatively this GAV mapping as:

Example 9.3.2 Consider the following four data sources:

- The source relation S1is a catalog of teaching programs offered in different French universities
with master programs.
S1.Catalogue(nomUniv, programme).

- The source relation S2 provides the names of European students enrolled in courses at some
university within the Erasmus exchange program:
S2.Erasmus(student, course, univ).

- The source relation S3 provides the names of foreign students enrolled in programs of some
French university:
S3.CampusFr(student, program, university).

- The source relation S4provides the course contents of international master programs:
S4.Mundus(program,course)

Now, suppose we define a global schema with the following unary and binary relations:

MasterStudent(studentName), | University(uniName), |

MasterProgram(title), | MasterCourse(code), |

EnrolledIn(studentName,title), | RegisteredTo(studentName, uniName). |

These relations are defined in terms of the local relations by the following GAV mappings:

MasterStudent(N)⊇S2.Erasmus(N,C,U), S4.Mundus(P,C) |

MasterStudent(N)⊇S3.CampusFr(N,P,U), S4.Mundus(P,C) |

University(U)⊇S1.Catalogue(U,P) |

University(U)⊇S2.Erasmus(N,C,U) |

University(U)⊇S3.CampusFr(N,P,U) |

MasterProgram(T)⊇S4.Mundus(T,C) |

MasterCourse(C)⊇S4.Mundus(T,C) |

EnrolledIn(N,T)⊇S2.Erasmus(N,C,U), S4.Mundus(T,C) |

EnrolledIn(N,T)⊇S3.CampusFr(N,T,U), S4.Mundus(T,C) |

RegisteredTo(N,U)⊇S3.CampusFr(N,T,U) |

Note that in a warehousing approach, one would simply evaluate all the queries that define the global view, and populate the warehouse using standard relational query evaluation. In a mediator approach, we try to only derive data that is relevant to a specific query posed on the global view by a user. We show how to rewrite a global query into queries over the local relations and combine their results. This is achieved by a technical trick that consists in unfolding the atoms of the global query.

Observe the first two mappings. They specify (using joins) how to obtain tuples of the unary relation MasterStudent. Now consider the following global query asking for universities with registered master students:

The rewriting of this query into source queries is obtained by unfolding, i.e., by replacing each atom which can be matched with the head of some view, by the body of the corresponding view. (For readers familiar with logic programming, this is some very simple form of resolution.)

In the example, there is a single mapping whose head can be matched with RegisteredTo(s,x), and two mappings that match MasterStudent(s). Thus, we obtain the following two unfoldings:

q1(x) :- S3.CampusFr(s,v1,x), S2.Erasmus(s,v2,v3), S4.Mundus(v4,v2)

q2(x) :- S3.CampusFr(s,v5,x), S3.CampusFr(s,v6,v7), S4.Mundus(v6,v8)

q2(x) :- S3.CampusFr(s,v5,x), S3.CampusFr(s,v6,v7), S4.Mundus(v6,v8)

Observe that q2 can be simplified. Replacing the conjunction of its first two atoms by the single atom S3.CampusFr(s,v6,x) leads to an equivalent query. We thus obtain the following two GAV rewritings of the initial query:

r1(x) :- S3.CampusFr(s,v1,x), S2.Erasmus(s,v2,v3), S4.Mundus(v4,v2)

r2(x) :- S3.CampusFr(s,v6,x), S4.Mundus(v6,v8)

r2(x) :- S3.CampusFr(s,v6,x), S4.Mundus(v6,v8)

The result is obtained by computing r1∪r2. Now, observe that each r_{ℓ} is a conjunctive
query. It can be optimized using standard query optimization to obtain an optimized
physical query plan. Of course, the choice of the particular physical query plan that is
selected depends on the statistics that are available and the capabilities of the sources.
For instance, a plan may consist in querying S3 and then for each value a of v6 (i.e.,
a particular university program), asking the query q(X) :- S4.Mundus(a,X) to
S4.

We now formalize the simple and intuitive notion of query unfolding.

Definition 9.3.3 (Query unfolding) Let q(x) :- G_{1}(z_{1}),…,G_{n}(z_{n}) be a query and for each i,
G_{i}(x_{i}) ⊇q_{i}(x_{i},y_{i}) be a GAV mapping. An unfolding of qis the query uobtained from qby replacing,
for each i, each conjunct G_{i}(z_{i}) by q_{i}(ψ_{i}(x_{i},y_{i})) where ψ_{i} is a function that maps x_{i} to z_{i}, and the
existential variables y_{i} to new fresh variables.

The renaming of the existential variables into fresh ones is necessary to avoid the
introduction of unnecessary constraints in the unfolding. Indeed, consider an existential
variable y occurring in two distinct atoms, say G_{i} and G_{j}. Then, the two atoms should be
understood as ∃...y...G_{i}(z_{i}) and ∃...y...G_{j}(z_{j}). The scopes of y in both are disjoint and nothing
requires that the two occurrences of y take the same value. Hence the renaming using fresh
variables.

Example 9.3.4 Suppose we have the two mappings:

We next establish that each unfolding of a query computes a part of the desired results, and that their union computes the whole set of answers. To do so, we use two propositions. The first one ignores unfolding and focuses on the “materialization” of the global relations.

Proposition 9.3.5 Let S_{1},...,S_{n} be a set of source relations; G_{1},...,G_{m} a global schema defined by a
set of GAV mappings over S_{1},...,S_{n}; and I be an instance over S_{1},...,S_{n}. Let J be the instance over
G_{1},...,G_{m} defined by, for each j,

Proof (sketch). Let u be an answer. Then, by definition, q(u) is true in each instance J′ over
G_{1},...,G_{m} such that I and J′ together satisfy the mappings. In particular, u belongs to q(J).
Conversely, let u be in q(J). Let J′ be an instance such that I and J′ together satisfy the mappings.
Since J′ satisfies the mappings, J ⊆J′. Since conjunctive queries are monotone, q(J) ⊆q(J′). Thus
u∈J′. Since u belongs to all such J′, u is an answer. □

The second proposition deals with unfoldings.

Proposition 9.3.6 Let S be a set of source relations and G a set of global relations defined by a set
of GAV mappings over S. Consider the query q(z) :- G_{i1}(z_{i1}),…,G_{in}(z_{in}) over G and the set {r_{ℓ}}
of unfoldings of qgiven . Then for each I over S_{1},...,S_{n}, the answer of qis given by ∪r_{ℓ}(I).

Proof (sketch). Let J be as in Proposition 9.3.5. By the same proposition, it suffices to show that
q(J) = ∪r_{ℓ}(I).

- Soundness.
- Let u∈∪r
_{ℓ}(I). Then u is r_{ℓ}(I) for some unfolding r_{ℓ}. Suppose r_{ℓ}results from the unfolding defined by selecting for each j, the mapping G_{ij}(x_{ij}) ⊇q_{ij}(x_{ij},y_{ij}). It follows that u∈q({u_{1}},...,{u_{n}}) where for each j, u_{j}is derived by G_{ij}(x_{ij}) ⊇q_{ij}(x_{ij},y_{ij}). Thus, each u_{j}is in J(G_{ij}) and u∈q(J(G_{i1}),...,J(G_{in})) = q(J). Therefore, ∪r_{ℓ}(I) ⊆q(J). - Completeness.
- Conversely, consider u in q(J). Then, there exists u
_{1}in J(G_{i1}), ..., u_{j}in J(G_{ij}), ...u_{n}in J(G_{in}) such that u ∈ q({u_{1}},...,{u_{n}}). By construction of J, for each j there is some mapping G_{ij}(x_{ij}) ⊇ q_{ij}(x_{ij},y_{i-1}) such that u_{j}is in q_{ij}(x_{ij},y_{i-1}). Consider the unfolding r_{ℓ}defined by selecting for each j, this particular mapping. One can verify that u is r_{ℓ}(I). Hence, u∈∪r_{ℓ}(I) and q(J) ⊆∪r_{ℓ}(I).

We can compute the answer using the unfoldings (also called the GAV rewritings). These
unfoldings can be simplified by removing redundant conjuncts that may have been introduced by
the technique. This simplification relies on checking conjunctive query containment. Given a
conjunctive query with body A_{1}(u_{1}),...,A_{m}(u_{m}), we verify whether each query obtained by
removing some A_{i}(u_{i}) is equivalent to the initial one. If yes, the atom is redundant and can be
removed. We keep doing this until the query is “minimal”. This simplification test is
costly but the resulting query may be much less expensive to evaluate that the initial
one.

We must evaluate all the unfoldings to obtain the complete answer. If we are aware of some
constraints on the local schemas or on the global one, this can be further simplified. For instance,
the constraints may imply that the result of a particular unfolding is empty, in which case this
particular unfolding needs not be evaluated. Also, the constraints may imply that the result of
some unfolding, say r_{ℓ}, is always included in another one. Then r_{ℓ} needs not be evaluated. For
instance, in the previous example, if it is known that students obtained from the source S2 are
European students, while those obtained from the source S3 are non European students, we can be
sure that the GAV rewriting r_{ℓ} obtained by unfolding will not provide any answer. This requires
expressing and exploiting disjointness constraints over the local relations. Inclusion constraints
on local relations would, on the other hand, permit to detect in advance that a given
query plan provides answers that are redundant with those obtained by another query
plan.

A main limitation of GAV is that adding or removing data sources to the integration system may require deeply revising all the views defining the global schema. In a Web context where sources may come and go, e.g., because of (non) availability of servers, this is really too constraining. The LAV approach does not suffer from this disadvantage.

The LAV approach takes a dual approach. The local relations are defined as views over global relations. The goal is to define the global schema in such a way that individual definitions do not change when data sources join or leave the integration system except for the definitions of the sources that are involved in the change.

Definition 9.4.1 (LAV mapping) A LAV mapping is a mapping of the form: S ⊆ q, for some
conjunctive query q(x_{1},...,x_{n}) :- A_{1}(u_{1}),...,A_{k}(u_{k}) over the global relations. Its semantics is:

Again, S(x_{1},...,x_{n}) is called the head of the view, whereas A_{1}(u_{1}),...,A_{k}(u_{k}) is called the body of
the view.

Example 9.4.2 We define the global schema as consisting of the following relations:

Student(studentName), | EuropeanStudent(studentName), |

University(uniName), | NonEuropeanStudent(studentName), |

FrenchUniversity(uniName), | EuropeanUniversity(uniName), |

NonEuropeanUniversity(uniName), | Program(title), |

MasterProgram(title), | EnrolledInProgram(studentName,title), |

Course(code), | EnrolledInCourse(studentName,code), |

PartOf(code, title), | RegisteredTo(studentName, uniName), |

OfferedBy(title, uniName). |

The four data sources considered in the previous example can be described by the following LAV mappings:

m1: S1.Catalogue(U,P)⊆ | FrenchUniversity(U), Program(P), OfferedBy(P,U), |

OfferedBy(P’,U), MasterProgram(P’) | |

m2: S2.Erasmus(S,C,U)⊆ | Student(S), EnrolledInCourse(S,C), PartOf(C,P), |

OfferedBy(P,U), EuropeanUniversity(U), | |

EuropeanUniversity(U’)RegisteredTo(S,U’), | |

U≠ U’ | |

m3: S3.CampusFr(S,P,U)⊆ | NonEuropeanStudent(S), Program(P), |

EnrolledInProgram(S,P), OfferedBy(P,U), | |

FrenchUniversity(U), RegisteredTo(S,U) | |

m4: S4.Mundus(P,C)⊆ | MasterProgram(P), OfferedBy(P,U), |

OfferedBy(P,U’), EuropeanUniversity(U), | |

NonEuropeanUniversity(U’), PartOf(C,P) |

LAV mappings enable quite fine-grained descriptions of the contents of data sources. For example, we are able to specify precisely the students that can be found in the Erasmus source: they are European students enrolled in courses of a given (European) university that is different from their home (European) University in which they remain registered.

LAV mappings express loose coupling between local and global relations, which is important for flexibility and robustness when the participating data sources change frequently. If we are interested in Master students, we do not need to know in advance (unlike the GAV approach) how to join two sources. We just define them as a global query:

MasterStudent(E) :- | Student(E), EnrolledInProgram(E,M), |

MasterProgram(M). |

The local sources that must be queried and combined to get the Master students will be discovered by the rewriting process. Recall that, in the GAV approach, they were predefined by the two mappings given in Example 9.3.2.

The price to pay for the flexibility of LAV compared to GAV is that the rewritings are more complicated to find. We describe three algorithms that achieve this rewriting. The Bucket algorithm and the Minicon algorithm follow the same approach. They first determine the local relations that are relevant to the query, then consider their combinations as candidate rewritings and verify whether they are indeed correct. Minicon is actually an optimization of Bucket that avoids the last verification step by a trickier first step. The third algorithm, namely the Inverse-rules algorithm, follows a completely different approach: it consists in transforming the logical rules supporting the LAV mappings (which are unsafe rules) into a set of safe rules with a single global relation. The global query is unfolded using these rules.

The principle of the Bucket algorithm is quite simple. It proceeds in three steps:

- the first step constructs for each atom g of the global query body its bucket, which groups the view atoms from which g can be inferred;
- the second step consists in building a set of candidate rewritings that are obtained by combining the view atoms of each bucket;
- in a last step, we check whether each candidate rewriting is valid.

Let g be a query atom. The atoms in bucket(g) are the heads of mappings having in their body an atom from which g can be inferred. Intuitively, data comes from source relations, and a (global) query atom is satisfied by (local) data only if it can be matched to a (global) atom in the body of a mapping whose head can be matched to source facts. A match between g and some atom in the body of a mapping is thus an indication that the corresponding data source provides a relevant information for this particular query.

There is an extra constraint that has to be considered to guarantee that g can indeed be logically inferred, as illustrated next. In fact, the bucket of a query atom g includes a view atom v only if an atom in the body of v can be matched with g by a variable mapping such that the variables mapped to the distinguished variables of g are also distinguished variables in the view defining the mapping.

Let us illustrate this on an example. Consider the LAV mappings of Example 9.4.2, and the global query:

Let us consider the query atom g= RegisteredTo(s,x), in which the variable x is distinguished.

We can find two mappings (m2 and m3) in which a body atom can be matched to

RegisteredTo(s,x).

First, consider the mapping m3:

m3: S3.CampusFr(S,P,U)⊆ | NonEuropeanStudent(S), Program(P), |

EnrolledInProgram(S,P), OfferedBy(P,U), | |

FrenchUniversity(U), RegisteredTo(S,U) | |

The atom RegisteredTo(s,x) matches the atom RegisteredTo(S,U) with the variable mapping {S/s,U/x}, where U is distinguished in the view defining the mapping (it occurs in the head of this LAV mapping).

Therefore, applying the variable mapping {S/s,U/x} to the head S3.CampusFr(S,P,U) of the mapping m3 enforces the matching of RegisteredTo(S,U) with the query atom RegisteredTo(s,x), and then:

S3.CampusFr(s,v1,x)∧FOL(m3)⊧ ∃sRegisteredTo(s,x)

Thus S3.CampusFr(s,v1,x) is added in Bucket(g). Note that v1 is simply a fresh variable mapped to the variable P appearing in S3.CampusFr(S,P,U) but not in the variable mapping {S/s,U/x}.

On the other hand, consider the mapping m2:

m2: S2.Erasmus(S,C,U)⊆ | Student(S), EnrolledInCourse(S,C), PartOf(C,P), |

OfferedBy(P,U), EuropeanUniversity(U), | |

EuropeanUniversity(U’)RegisteredTo(S,U’), U≠ U’ |

The match this time is between g = RegisteredTo(s,x) and RegisteredTo(S,U’) by the variable mapping {S/s,U′/x}. The difference with the previous situation is that the variable U’ is existentially quantified in the view defining the mapping. Applying the variable mapping {S/s,U′/x} to the head S2.Erasmus(S,C,U) of the mapping m2do not enforce the matching of RegisteredTo(S,U’) in its body with the query atom RegisteredTo(s,x).

More formally:

S2.Erasmus(s,v2,v3)∧FOL(m2)⊭ ∃sRegisteredTo(s,x).

To see why, consider the LAV mapping m2 and its logical meaning FOL(m2):

FOL(m2): | ∀S∀C∀U [ S2.Erasmus(S,C,U)⇒∃P∃U’ ( |

EuropeanStudent(S)∧EnrolledInCourse(S,C)∧ | |

PartOf(C,P)∧OfferedBy(P,U) | |

∧EuropeanUniversity(U)∧RegisteredTo(S,U’)∧U≠ U’) ] |

From the fact that S2.Erasmus(s,v2,v3), it follows that:

∃s∃U’RegisteredTo(s,U’).

However, this is a strictly weaker statement than ∃sRegisteredTo(s,x) where x is fixed. We prove this next. Consider an instance I over the domain Δ = {s,v2,v3,v4,v5,x} defined by:

As a consequence, S2.Erasmus(s,v2,v3) does not belong to the bucket and:

Bucket(g, q, M)

Input: An atom g = G(u_{1},...,u_{m}) of the query q and a set of LAV mappings

Output: The set of view atoms from which g can be inferred

(2)for each LAV mapping S(x) ⊆q(x,y)

(3) if there exists in q(x,y) an atom G(z_{1},...,z_{m}) such that

(4) z_{i} is distinguished for each i such that u_{i} is distinguished in
q;

(5) let ψ the variable mapping {z_{1}/u_{1},....,z_{m}/u_{m}}

(6) extended by mapping the head variables in x not

(7) appearing in {z_{1},...,z_{m}} to new fresh variables;

(9)return Bucket(g);

Algorithm 5 constructs the buckets. Proposition 9.4.3 is a logical characterization of the view atoms put in the buckets of the atoms of the global query.

Proposition 9.4.3 Let G(u_{1},...,u_{m}) be an atom of the global query. Let u be the (possibly
empty) subset of existential variables in {u_{1},...,u_{m}}. Let m: S(x) ⊆q(x,y) be a LAV mapping.
Then

S(v), FOL(m)⊧∃uG(u_{1},...,u_{m})

iff there exists a view atom in Bucket(g) that is equal to S(v) (up to a renaming of the fresh variables).

The proof is tedious and left as exercise.

In the worst-case, the Bucket algorithm applied to each atom of a query has a time complexity in O(N ×M×V) and produces N buckets containing each at most M×V view atoms, where N is the size of the query, M is the maximal size of the LAV mappings and V is the number of LAV mappings.

Returning to the example, we obtain by the Bucket algorithm, the following buckets for the three atoms of the query q.

RegisteredTo(s,x) | EnrolledInProgram(s,p) | MasterProgram(p) |

S3.CampusFr(s,v1,x) | S3.CampusFr(s,p,v2) | S1.Catalogue(v3,v4) |

S4.Mundus(p,v5) | ||

The candidate rewritings of the initial global query are then obtained by combining the
view atoms of each bucket. In the worst-case, the number of candidate rewritings is in
O((M×V)^{N}). For instance, in our example, we obtain two candidate rewritings for the query
q:

r1(x) :- S3.CampusFr(s,v1,x), S3.CampusFr(s,p,v2), S1.Catalogue(v3,v4)

r2(x) :- S3.CampusFr(s,v1,x), S3.CampusFr(s,p,v2), S4.Mundus(p,v5)

r2(x) :- S3.CampusFr(s,v1,x), S3.CampusFr(s,p,v2), S4.Mundus(p,v5)

A candidate rewriting may not be a valid rewriting of the query. By Proposition 9.4.3, we only know that each candidate rewriting entails each atom of the query in isolation, i.e., without taking into account the possible bindings of the existential variables within the query.

It turns out that, in our example, the first candidate rewriting r1 is not a valid rewriting of the query q: the body of q is not logically entailed by the conjunction of the view atoms in the body of r1.

To see why, we first apply to each view atom in the body of r1 the corresponding LAV mapping to obtain the logical global expression (i.e., built on global relations). This step is called expanding r1, and its result, the expansion of r1. In our case, the expansion of r1 is the following query expression:

Exp_r1(x) :- NonEuropeanStudent(s), Program(v1), EnrolledInProgram(s,v1),

OfferedBy(v1,x), FrenchUniversity(x), RegisteredTo(s,x),

Program(p), EnrolledInProgram(s,p), OfferedBy(p,v2),

FrenchUniversity(v2), RegisteredTo(s,v2),

FrenchUniversity(v3), Program(v4),OfferedBy(v4,v3),

OfferedBy(v5,v3), MasterProgram(v5)

OfferedBy(v1,x), FrenchUniversity(x), RegisteredTo(s,x),

Program(p), EnrolledInProgram(s,p), OfferedBy(p,v2),

FrenchUniversity(v2), RegisteredTo(s,v2),

FrenchUniversity(v3), Program(v4),OfferedBy(v4,v3),

OfferedBy(v5,v3), MasterProgram(v5)

Note that new existential variables may be introduced by the expansion of some view atoms. For instance, the LAV mapping defining S1.Catalogue(v3,v4) contains the existential variable denoted P’ in the LAV mapping definition. Such variables are renamed with new fresh variables to avoid unnecessary constraints between the variables. In our example, this corresponds to variable v5 in the body of Exp_r1(x).

To check whether a rewriting is correct, it suffices to check with the Conjunctive Query Containment algorithm whether the query Exp_r1(x) is contained in the query q(x). For each variable v, let the corresponding constant, i.e., ψ(v), be "v". The canonical database obtained from r1 is given in Figure 9.2.

NonEuropean- | Program | EnrolledIn- | OfferedBy | French- | RegisteredTo | Master- |

Student | Program | University | Program | |||

"s" | "v1" | ("s", "v1") | ("v1", "x") | "x" | ("s", "x") | "v5" |

"p" | ("s", "p") | ("p", "v2") | "v2" | ("s", "v2") | ||

"v4" | ("v4", "v3") | "v3" | ||||

("v5", "v3") | ||||||

The evaluation of q(x) over this canonical database yields an empty result because there is no way of assigning the existential variables s and p to constants of the canonical database which satisfies the binding of the existential variable p between the two last atoms of the body of the query.

Expanding the rewriting r2 and checking that it is contained into the query q is left in exercise. This shows that among the two candidate rewritings, only r2 is a valid rewriting:

Remark 9.4.4 In spite of the apparent redundancy of the two first atoms, this rewriting cannot be simplified to

It is true that r2.1(x) is contained into r2(x). However, the two queries are not equivalent. For some data sets, it may be the case that there is a student s and there is a university x such that (based on S3.CampusFr), s is registered in x and also enrolled in a Mundus master program offered by another university. The containment would hold under a constraint that would forbid a student to be registered in more than one universities.

One can prove that each rewriting finally obtained does indeed provide answers and that their union constitutes the complete answer.

The idea underlying Minicon is to avoid putting in a bucket an atom that will only generate invalid rewritings. As we saw in the discussion of Bucket, the reason for an atom to be useless is that its binding of a variable does not match with the binding of other occurrences of that variable. This explains why a candidate rewriting (like r1) is not valid.

We now illustrate the Minicon algorithm by example. Consider the query q:

and the two LAV mappings:

V1(u,v)⊆T(w,u), U(v,w), R(v,u) |

V2(u,v,v’)⊆T(w,u), U(v,w), R(v’,w) |

Minicon proceeds in two steps that correspond to the first two steps of Bucket.

Minicon scans each atom in the query, but instead of creating buckets for them, it builds MCDs (short name for Minicon Descriptions). The first iteration of Minicon determines the relevance of the different LAV mappings to rewrite the first query atom U(y,z):

- The Bucket algorithm would put V1(v1,y) in the bucket of U(y,z) (where v1 is a
fresh variable), because the variable mapping {v/y,w/z} allows the match between
the atom U(v,w) in the expansion of V1(u,v) and the query atom U(y,z).
Minicon does not consider the query atom U(y,z) in isolation. Instead, since the variable w is existential in the view defining the mapping, and mapped to the variable z that has several occurrences in the query, it checks whether the variable mapping {v/y,w/z} also covers all the query atoms involving variable z, i.e., can be extended to also match R(x,z) and T(z,y). Because variable w is existential in the expansion of V1(u,v) (i.e., w does not appear in the head of the mapping), it is the only way to enforce the several occurrences of z in the query to be mapped to by the same variable w. Here, matching the query atom R(x,z) with an atom of the form R(_,w) in the expansion of V1(v1,y) is not possible: there does not exist such an atom in the expansion of V1(v1,y). Therefore, no MCD is created from V1 for covering the query atoms including an occurrence of the variable z.

- When considering V2, though Minicon starts from the same variable mapping
{v/y,w/z} to match U(v,w) in the expansion of V2(u,v,v’) and the query atom
U(y,z), the situation is different for checking whether it can be extended to cover
the other query atoms R(x,z) and T(z,y) containing occurrences of the variable z.
Extending the variable mapping {v/y,w/z} to match R(x,z) with the atom R(v’,w)
is possible by adding the variable mapping v′/x. Now, extending the variable mapping
{v/y,w/z,v′/x} to match T(z,y) with the atom T(w,u) is also possible by adding
the variable mapping u/y. The resulting variable mapping is: {v/y,w/z,v′/x,u/y}.
And, V2(y,y,x) is retained as a rewriting of the corresponding part of the query: a
MCD is created for it, with in addition the positions of the atoms in the query it covers:
MCD1 = (V2(y,y,x)) , {1,2,3} )

The last iteration of building MCDs corresponds to the last query atom: R(y’,x). The LAV mapping V1 has in its expansion the atom R(v,u) that can be matched to it by the variable mapping {v/y′,u/x)}. Since the distinguished variable x in the query is assigned to the distinguished variable (same condition as for adding to a bucket), and since the existential variable y’ of the query atom has a single occurrence in the query, the following MCD is created:

MCD2 = ((V1(x,y’), {4})

In contrast, there is no MCD created for R(y’,x) with the second LAV mapping: in the variable mapping {v′/y′,w/x)} that allows to match the query atom R(y’,x) with the atom R(v’,w) in the expansion of V2, the distinguished variable x in the query is assigned to the variable w which is not distinguished in the expansion of of V2. As for adding a view atom in a bucket, a MCD is created for a query atom g only if the variables mapped to the distinguished variables of g are also distinguished variables in the view defining the mapping.

.

The second step of Minicon replaces the combination of the buckets by the combination of the MCDs. More precisely, the rewritings are obtained by combining MCDs that cover mutually disjoint subsets of query atoms, while together covering all the atoms of the query.

Because of the way in which the MCDs were constructed, the rewritings obtained that way are guaranteed to be valid. No containment checking is needed, unlike in the Bucket Algorithm. In our example, we would therefore obtain as single rewriting of q(x):

This algorithm takes a radically different approach. It transforms the LAV mappings into GAV mappings (called inverse rules) so that the complex operation of query rewriting using LAV mappings can then be replaced by the much simpler operation of query unfolding. A LAV mapping is replaced by several GAV mappings, one for each atom in the body of the rule. The subtlety is to keep bindings between the different occurrences of the same existential variable in the body. This is realized using a simple trick from first-order logic, namely by introducing Skolem functions.

Let us explain the Inverse-rules algorithm on the example we used for Minicon. A first important point that distinguishes it from the Bucket and Minicon algorithms is that the Inverse-rules algorithm is independent of the query. It only considers as input the set of LAV mappings:

V1(u,v)⊆T(w,u), U(v,w), R(v,u) |

V2(u,v,v’)⊆T(w,u), U(v,w), R(v’,w). |

Consider the first LAV mapping and recall that its logical meaning mapping is the formula:

But we can infer much more. We can also infer that there exists some constant d1 such that T(d1,a) and U(b,d1) are both true. We do not know the exact value of that constant d1, but we know it exists and that, in some way, it depends on the constants a,b. Since this dependency comes from the first rule, we denote this unknown d1 value: f1(a,b).

Creating the inverse rules This motivates the construction of three following GAV mappings for which we give also the FOL translation.

In the previous rules, the symbol f1 is a Skolem function of arity 2, and f1(u,v) is a Skolem term denoting some constant that depends on the values instantiating the variables u,v. Given two distinct Skolem terms, e.g. f1(1,2) and f1(2,v3), we cannot tell whether they refer to the same constant or not.

The Inverse-rules algorithm just scans the LAV mappings and creates n GAV mappings for each LAV mapping having n atoms. The result of this algorithm applied to the second LAV mappings in the example is:

Obtaining the rewritings by unfolding: The rewritings of any global query is now obtained by unfolding the query atoms using the (Inverse-rules) GAV mappings corresponding to the initial set of LAV mappings. The unfolding operation here is a bit trickier than the unfolding defined in Definition 9.3.3, because of the Skolem terms. In Definition 9.3.3, the unfolding was based on matching each query atom G(x1, .., xm) with an atom (in the right-hand side of a GAV mapping) of the form G(z1, .., zm) by equating each pair (zi, xi) of variables. Proposition 9.3.6 showed that unfolding each atom of the query in isolation builds valid rewritings of the query, i.e., conjunctions of view atoms which logically implies the conjunction of the query atoms. It is not the case anymore when atoms in the right-hand side of GAV mappings contain Skolem terms.

The unification of two atoms with functions is more complex than just equating variables, and it may fail. It may require the substitution of some variables with functional terms (in our case, Skolem terms). This may make impossible to unify the other atoms of the query with atoms in the right-hand side of GAV mappings.

Let us illustrate on our example the subtleties of unfolding queries in presence of functional terms. Consider again the same query q:

The query atom U(y,z) can be unified with the atom U(v,f1(u,v)) in the right-hand side of the GAV mappings IN12 using a so-called most general unifier (mgu). In this case, the mgu is the substitution:

This substitution that makes the unfolding of the first query atom possible, now constrains the other occurrences in the query of the variables y and z for the unfolding of the other query atoms. After the application of σ to the whole body of the query and the unfolding of the first query atom made possible by σ, we obtain the following (partial) query rewriting:

The unfolding of the second atom R(x,f1(v2,v1)) yields V1(f1(v2,v1),x), and we obtain the (partial) rewriting:

It is useless to continue unfolding the remaining query atoms of pr2(x). As soon as a given unfolding has produced a view atom with Skolem terms, we can be sure that the evaluation of the query plan under construction will not produce any answer: there is no way to match V1(f1(v2,v1),x) with any fact in the data source which are of the form V1(a,b) where a,b are constants. Since we don’t know f1(v2,v1), there is absolutely no reason to believe that it is equal to a.

Using the inverse rule IN23 to unfold R(x,f1(v2,v1)) does not help because unifying R(x,f1(v2,v1)) and R(v’,f2(u,v,v’)) fails because of the two different Skolem functions. Thus, the (partial) rewriting issued from unfolding U(y,z) using the inverse rule IN12 is abandoned.

Let us try now to unfold U(y,z) using IN22 made possible by the substitution

Now, unfolding R(x,f2(v2,v1,v3)) using the inverse rule IN23 is possible thanks to the substitution

in which one of the first two atoms can be dropped.

Now, we examine the unfolding of the query atom T(f2(v2,v1,x),v1), which requires checking whether T(f2(v2,v1,x),v1) and T(f2(u,v,v’),u) are unifiable. This is the case thanks to the substitution {v2/v3,u/v3,v1/v3,v/v3,v′/x}, which leads to the (partial) rewriting:

Again, we can remove the first atom that is redundant and obtain the equivalent (partial) rewriting:

Finally the unfolding of R(y’,x) using IN23 leads to the final rewriting:

The three algorithms have the same (worst-case) complexity and they guarantee to provide the correct answer. Some experiments have shown that in practice Minicon outperforms both Bucket and Inverse-rules. The main advantage of the Inverse-rules algorithm over the Bucket and Minicon algorithms is that the step producing the inverse rules is done independently of the queries. Furthermore, the unfolding step can also be applied to Datalog queries, i.e., to recursive queries.

The common limitation of the three algorithms is that they do not handle additional knowledge (ontology statements) that can be known about the domain of application. In the next section, we see how to extend both the Local-As-Views and Global-As-Views approaches with DL-LITE ontologies, i.e., we consider global schemas that include constraints expressed as DL-LITE axioms.

We first show a negative result: as soon as we add functionality constraints over the global schema, the number of conjunctive rewritings of a query to be considered, may become infinite. This is a severe limitation for extending the LAV or GAV approaches since such constraints are rather natural. So these approaches to data integration fail when we consider the DL-LITE dialect of previous chapters. On the positive side, we show how to extend the GAV and LAV approaches to constraints expressible in DL-LITE.

We illustrate on an example the problem raised by taking into account functionality constraints in the global schema. Let us consider a global schema with one unary relation C and two binary relations R and R’. In both R and R1, we impose that the first attribute is a key. Let us consider two LAV mappings:

To prove it, we expand r_{1}(x) and show that the resulting expansion together with the logical axiom
expressing the functionality of R1 logically implies the conjunction of atoms in the body of the query.
The expansion of r_{1}(x) is:

It is important to note that to properly check containment, the standard query containment algorithm seen in the previous section would have to be modified in a standard manner to take into account functional dependencies. Intuitively, one would have to proceed pretty much as we did in the example, equating variables as implied by the functional dependencies.

It turns out that the situation is even more subtle. Surprisingly, this rewriting r_{1}(x) is not the only
one. In fact there exists an infinite number of different rewritings for q(x). Let k ≥ 2. The following
query is a valid rewriting of q(x):

The expansion of r_{k}(x) is:

One can also show that for each k, each such rewriting may return answers that are not returned with k- 1. Thus, there exists an infinite number of non redundant conjunctive rewritings. The reader familiar with Datalog will observe that this infinite collection of rewritings can be captured in Datalog by the following recursive rewriting:

Querying data through DL-LITE ontologies has been detailed in the previous chapter. It has
been shown how the positive and negative constraints expressed in the ontology are exploited both
for data consistency checking and for query answering. In particular, the first step of query
answering is the query reformulation step which is performed by the PerfectRef algorithm: using the
positive constraints, called the PIs, it computes a set of reformulations, which are then
evaluated over the data to produce the answer set of the original query. The negative
constraints, called the NIs, are used to check data consistency, by translating each (declared or
entailed) NI into a Boolean conjunctive query q_{unsat} that must be evaluated over the
data.

In this section, we show how to extend both the LAV and GAV approaches to rewrite queries in term of views when the global schema includes some DL-LITE Tbox.

Two observations explain how this can be realized:

- First, one can obtain the answer set of a query q(x) by computing the union of the answer sets returned by the evaluation over the local data sources of the (GAV or LAV) relational rewritings of each reformulation of q(x) as computed by PerfectRef(q(x),PI).
- The rewritings that are obtained may be inconsistent with the negative constraints NI
declared or inferred in the Tbox. Therefore, the consistency of each rewriting r(x) has
to be checked. This can be done by checking containment between the Boolean query
∃xExp_r(x) (where Exp_r(x) is the expansion of r(x)) and each of the Boolean queries
q
_{unsat}obtained from the NIs.

These two observations follow from the completeness of the PerfectRef and Consistent algorithms for DL-LITE presented in the previous chapter, and that of the rewriting algorithms of Sections 9.3 and 9.4; namely Unfolding for GAV, Minicon, Bucket or Inverse-rules for LAV.

The argument may be somewhat too abstract for some readers. We next illustrate these two points with examples. We use the global schema considered in Example 9.4.2 page 439, enriched with the DL-LITE Tbox of Figure 9.3. Note in particular that we add the subclass College of the class University, the subproperty EnrolledInCollege of the property RegisteredTo, for which the domain is declared as being the class MasterStudent. In addition, we add the property EnrolledInMasterProgram that we declare as a subproperty of the property EnrolledInProgram. Finally, we declare a mandatory property for the class College: any college must have students enrolled in it.

DL notation | FOL notation |

PIs: | |

MasterStudent ⊑Student | MasterStudent(X) ⇒Student(X) |

EuropeanStudent ⊑Student | EuropeanStudent(X) ⇒Student(X) |

NonEuropeanStudent ⊑Student | NonEuropeanStudent(X) ⇒Student(X) |

College ⊑University | College(X) ⇒University(X) |

FrenchUniversity ⊑University | FrenchUniversity(X) ⇒University(X) |

EuropeanUniversity ⊑University | EuropeanUniversity(X) ⇒University(X) |

NonEuropeanUniversity ⊑University | NonEuropeanUniversity(X) ⇒University(X) |

∃EnrolledInCollege ⊑MasterStudent | EnrolledInCollege(X,Y) ⇒MasterStudent(X) |

College ⊑∃EnrolledInCollege^{-} | College(X) ⇒∃YEnrolledInCollege(Y,X) |

EnrolledInCollege ⊑RegisteredTo | EnrolledInCollege(X,Y) ⇒RegisteredTo(X,Y) |

MasterStudent ⊑∃EnrolledInMasterProgram | MasterStudent(X) ⇒∃YEnrolledInMasterProgram(X,Y) |

∃EnrolledInMasterProgram^{-}⊑MasterProgram | EnrolledInMasterProgram(X,Y) ⇒MasterProgram(Y) |

EnrolledInMasterProgram ⊑EnrolledInProgram | EnrolledInMasterProgram(X,Y) ⇒EnrolledInProgram(X,Y) |

NIs: | |

NonEuropeanStudent ⊑¬EuropeanStudent | NonEuropeanStudent(X) ⇒¬EuropeanStudent(X) |

NonEuropeanUniversity ⊑¬EuropeanUniversity | NonEuropeanUniversity(X) ⇒¬EuropeanUniversity(X) |

NonEuropeanUniversity ⊑¬FrenchUniversity | NonEuropeanUniversity(X) ⇒¬FrenchUniversity(X) |

We revisit Example 9.3.2 by adding the data source S5 giving the list of French so-called Grandes Ecoles. Its local schema is made of the local relation: S5.GrandeEcole(nomEcole). According to this new source and also to the enriched global schema of Figure 9.3, we add the following GAV mappings to the ones already considered in Example 9.3.2:

College(U)⊇S5.GrandeEcole(U) |

EuropeanStudent(N)⊇S2.Erasmus(N,C,U) |

NonEuropeanStudent(N)⊇S3.CampusFr(N,P,U) |

Consider again the global query looking for universities with registered master students:

It is left as an exercise to show that the application of the PerfectRef(q(x), PI) algorithm returns, in addition to q(x) itself, the reformulation:

By unfolding q(x), we obtain the same two rewritings as in Example 9.3.2:

r1(x) :- S3.CampusFr(s,v1,x), S2.Erasmus(s,v2,v3), S4.Mundus(v4,v2)

r2(x) :- S3.CampusFr(s,v6,x), S4.Mundus(v6,v8)

r2(x) :- S3.CampusFr(s,v6,x), S4.Mundus(v6,v8)

By unfolding the reformulation q1(x), we get the additional rewriting:

It is important to note that even if we had the GAV mapping

College(U)⊇S5.GrandeEcole(U),

the rewriting r3(x) would not have been obtained without reformulated first the initial query q(x) into q1(x).

Now, in contrast with the standard GAV approach, we have to check the consistency of each of these rewritings. To do so:

- We first compute the closure of the NI and we translate them into Boolean queries
q
_{unsat}(as explained in detail in Section 8.4 of Chapter 8). This is independent of the rewritings and can be performed at compile time given the Tbox. From the Tbox in Figure 9.3, we obtain only three Boolean queries q_{unsat}: - At query time, we check the consistency of each rewriting by applying the Consistent algorithm to the canonical instance obtained by expanding each rewriting and freezing its variables (as explained in detail in Section 8.4 of Chapter8).

We illustrate the consistency check by checking the consistency of the rewriting r1(x). First, its expansion replaces each of its local atoms S(z) with the conjunction of global atoms of the form G(z) that can be produced by a GAV mapping G(x) ⊇S(x), if such GAV mappings exist. For expanding r1(x), we apply the following GAV mappings:

NonEuropeanStudent(N)⊇S3.CampusFr(N,P,U) |

University(U)⊇S3.CampusFr(N,P,U) |

RegisteredTo(N,U)⊇S3.CampusFr(N,P,U) |

EuropeanStudent(N)⊇S2.Erasmus(N,C,U) |

University(U)⊇S2.Erasmus(N,C,U) |

MasterProgram(T)⊇S4.Mundus(T,C) |

MasterCourse(C)⊇S4.Mundus(T,C) |

As a result, we obtain the following expansion for r1(x):

Exp_r1(x) :- NonEuropeanStudent(s), University(x), RegisteredTo(s,x),

EuropeanStudent(s), University(x), MasterProgram(v4),

MasterCourse(v2)

EuropeanStudent(s), University(x), MasterProgram(v4),

MasterCourse(v2)

We then apply the Consistent algorithm. For this, we evaluate q_{unsat}^{1}, q_{unsat}^{2} and q_{unsat}^{3} over the
body of Exp_r1(x) seen as a relational database; i.e., we freeze its atoms to obtain a canonical
instance. Query q_{unsat}^{1} returns true, so an inconsistency has been detected and the rewriting r1(x)
is rejected.

We revisit Example 9.4.2 by adding the same data source S5 as in Section 9.5.2. The GAV mapping is also a LAV mapping: S5.GrandeEcole(U)⊆College(U)

Consider again the global query considered in Section 9.4.1:

It is left as an exercise to show that the application of the PerfectRef(q(x), PI) algorithm returns, in addition to q(x) itself, the following reformulation:

By applying the Minicon algorithm^{1}
to the initial query q(x), we obtain the following rewriting (as shown in Section 9.4.1):

By applying the Minicon algorithm to the reformulation q1(x) of the initial query, we obtain the additional rewriting:

As for the extended GAV approach, the consistency of LAV rewritings is not guaranteed because
of the NIs in the Tbox. We follow the same approach: at compile time, the closure of the NIs is
computed and each (declared or inferred ) NI is compiled into a Boolean query q_{unsat}. At query
time, each of these q_{unsat} queries is evaluated over the canonical instance corresponding to each
rewriting.

In contrast with the centralized mediator model, a Peer-to-Peer data management system (PDMS for short) implements a decentralized view of data integration, in which data sources collaborate without any central authority. In a PDMS, each collaborating data source can also play the role of a mediator, so is at the same time a data server and a client for other data sources. Thus each participant to the system is a peer and there are mappings relating data from the different peers. A PDMS architecture is therefore very flexible in the sense that there is no need for a global schema defining in advance a set of terms to which each data source needs to adhere. Over time, data sources can join or leave the PDMS just by adding or removing mappings between them. PDMS are inspired by P2P file sharing systems but they enable answering fine-grained queries. Like in the mediator model, answering queries is performed by reformulating queries based on the mappings, but in a decentralized manner.

Each peer in a PDMS has a peer schema composed of peer relations and peer mappings that relate its schema to the schemas of other peers. To avoid confusing relations from different peers, we assume that each relation of peer p is of the form r@p for some local relation name r. A query to a PDMS is posed using the peer schema of one of the peers. A query is asked to a particular peer, as a query over his particular schema. It is reformulated using the peer mappings into a set of queries that may refer to other peer relations. This captures the intuition that we want to use the information available in the entire P2P system to answer the query.

For designing the mappings, the distinction made in the mediator model between local and global relations does not make sense anymore, since each peer relation may play the role at different times both of a local relation and of a global relation. Therefore, the notions of GAV and LAV mappings are relaxed to the more appropriate symmetric notion of GLAV mappings.

Definition 9.6.1 (GLAV mapping) Let S@i and S@j be the peer schemas of two peers i and j.
A GLAV mapping between these two peers is an inclusion axiom of the form: q_{i}(x) ⊆q_{j}(x), where
q_{i}(x) and q_{j}(x) are conjunctive queries over the peer schema S@i, S@j, respectively.

Let q_{i}(x,y_{i}) and q_{j}(x,y_{j}) be the bodies of q_{i}(x) and q_{j}(x)), respectively. The semantics of the GLAV
mapping q_{i}(x) ⊆q_{j}(x) is: ∀x[∃y_{i}q_{i}(x,y_{i}) ⇒∃y_{j}q_{j}(x,y_{j})].

In database terms, a GLAV mapping q_{i}(x) ⊆q_{j}(x) expresses that answers obtained by asking
q_{i}(x) at peer i should also be considered as answers to q_{j}(x) asked at peer j. Note that with this
semantics, each local query is assumed to be incompletely answered with local data since external
data may bring in new information to it. As already mentioned, such an open-world assumption is
fully appropriate for Web data.

We next show one negative and one positive result for PDMSs. In Section 9.6.1, we show that in general, answering queries with GLAV mappings is undecidable, so without further restriction, answering queries in a PDMS is undecidable. In Section 9.6.2, we show that if we restrict the peer mappings to be DL-LITE inclusion axioms, a decentralized version of the algorithm for DL-LITE can be used to answer queries in DL-LITE PDMSs.

We show that the Dependency Implication Problem (more precisely, the problem of the implication of an inclusion dependency from a set of inclusion and functional dependencies) can be reduced to the GLAV Query Answering Problem, i.e., the problem of answering queries in presence of GLAV mappings. Since the Dependency Implication Problem is known to be undecidable, this shows that the GLAV Query Answering Problem is also undecidable.

The reduction technique is standard for proving undecidability results. We first recall how it works and also recall the Dependency Implication Problem. We believe that these notions are important to know beyond the scope of this book. Finally, we use a reduction to show the undecidability of answering queries using GLAV mappings.

Let B be a Boolean function over a set X. The decision problem B is decidable if there exists an algorithm (in any computation model equivalent to Turing machines) that terminates on any input x∈B and returns “true” if and only if B(x) is true.

Let B,B′ be two decision problems. A reduction from B to B’ is an algorithm f computing a function (also denoted f ) from X to X’ such that: B(x) is true ⇔B’(f(x)) is true.

It is immediate to see that if there is a reduction f from B to B′:

- if B′ is decidable then B is. Suppose B′ is decidable. Let f
_{B′}be an algorithm that given some x′∈X′, decides whether B′(x′) holds. Then for each x, B(x) is true if f_{B′}(f(x)) is true. This provides an algorithm for deciding for any x if B(x) is true. - (The contraposite) if B is undecidable, then B′ is also undecidable.

We recall the class of dependencies that are used. Let R be a relation of arity n. Then:

- Functional dependencies.
- A functional dependency over R is an expression R : i
_{1}...i_{m}→j, where 1 ≤i_{1},...,i_{m},j≤n, for n = arity(R). An instance I over R satisfies R : i_{1}...i_{m}→j if for each tuples ⟨a_{1},...a_{n}⟩, ⟨b_{1},...b_{n}⟩ in I,if for each k∈ [1..m], a

_{ik}= b_{ik}, then a_{j}= b_{j}. - Inclusion dependencies.
- An inclusion dependency over R
_{1},R_{2}is an expression R_{1}: i_{1}...i_{m}⊆R_{2}: j_{1}...j_{m}, where the i_{k}are distinct, the j_{k}are distinct, 1 ≤i_{1},...,i_{m}≤arity(R_{1}), 1 ≤j_{1},...,j_{m}≤arity(R_{2}). An instance I over {R_{1},R_{2}} satisfies R_{1}: i_{1}...i_{m}⊆R_{2}: j_{1}...j_{m}if for each tuple ⟨a_{1},...,a_{n}⟩ in I(R_{1}), there exists a tuple ⟨b_{1},...,b_{n′}⟩ in I(R_{2}) such that for each k, 1 ≤k≤m, a_{ik}= b_{jk}.

We will use the following known result:

Theorem 9.6.2 (Undecidability of the Dependency Implication Problem). Let R = {R_{1},...,R_{n}}
be a relational schema. Given a set Σ of functional and inclusion dependencies and an
inclusion dependency σ over relations in R, one cannot decide whether Σ⊧σ (i.e., whether
each instance over R satisfying Σ also satisfies σ).

The problem is undecidable when Σ contains both functional and inclusion dependencies. Note that the implication problem is decidable for functional dependencies alone, and for inclusion dependencies alone. Undecidability arises when they are considered together.

The GLAV Query Answering Problem is to decide, given a PDMS defined using a set of GLAV mappings and a query asked at one of the peers whether some particular tuple is in its answer.

Let us define a reduction from the Dependency Implication Problem to the GLAV Query Answering Problem. If we show that such a reduction exists, since the Dependency Implication Problem is undecidable, this will show that the GLAV Query Answering Problem is undecidable.

Surprisingly, we can show the reduction for a PDMS with a single peer. To do that, we will use
some GLAV mapping of the form q@P ⊇q′@P, where both sides of the mapping involve the same
peer. Note that the undecidability still holds if such “self” mappings are forbidden. Indeed, we can
simulate such a mapping by using “clones” of relations. For instance, suppose that we want to
enforce the mapping R@P(x_{1},...,x_{n}) ⊇R′@P(y_{1},...,y_{n}). Then we can use a dummy site P and a copy
R@P of R@P with the mappings:

Let (Σ,σ) be an instance over {R_{1},...,R_{n}} of the Dependency Implication Problem with Σ a finite set
of functional and inclusion dependencies, and σ an inclusion dependency. We build a PDMS
defined as follows:

- For each relation R
_{i}, the peer P has a relation R_{i}. - For each inclusion dependency R
_{1}: i_{1}...i_{m}⊆R_{2}: j_{1}...j_{m}in Σ, we add the GLAV mapping q_{1}⊆q_{2}, where:_{k}in position i_{k}for each k and some existential variable x_{j}^{i}in each other position j; and similarly for v and j_{k}. - For each functional dependency R
_{i}: i_{1}...i_{m}→j in Σ, we add the GLAV mapping q⊆q′ where q,q′ are defined by:_{1},...,x_{k}and x′_{1},...,x′_{k}of variables.

It is easy to see that the GLAV mappings force each R_{i} to satisfy the functional dependencies of
R_{i}, and each R_{i},R_{j} to satisfy the inclusion dependencies between R_{i} and R_{j}.

Let us assume that σ = R_{i} : i_{1} ⊆R_{j} : j_{1} for some R_{i} of arity n. (This is without loss of generality
since the implication is already undecidable when σ is unary).

Let Ext(R_{i}) be the set of tuples t of arity n with values in [1..n] such that:

- t[i
_{1}] = 1, for every tuple t in Ext(R_{i}), - each tuple t in Ext(R
_{i}) represents an equality pattern between values in tuples of size n.

For instance if n = 3 and i_{1} = 2, Ext(R_{i}) = {⟨1,1,1⟩,⟨1,1,2⟩,⟨2,1,1⟩,⟨2,1,2⟩,⟨2,1,3⟩}.

We construct an instance ( ,Ext(R_{i}),q) of the GLAV Query Answering Problem where q is the
query defined by q(x) :- R_{j}(y_{1},...,x,...y_{k}) where the distinguished variable x is in position j_{1}, and
the existential variables y_{i} are pairwise distinct.

We show that Σ⊧σ iff 1 is an answer to q in the PDMS in which the only data is
Ext(R_{i}).

(⇒) Suppose that Σ⊧σ. Let I be a model of Ext(R_{i}) satisfying the GLAV mappings of . By
construction of those GLAV mappings, I is a model of Σ. Because Σ⊧σ, I is a model of σ, and thus
for each tuple ⟨a_{1},...,a_{n}⟩ in I(R_{i}), there exists a tuple ⟨b_{1},...,b_{k}⟩ in I(R_{j}) such that a_{i1} = 1 = b_{j1}.
Therefore, I⊧∃y_{1},...,y_{k}R_{j}(y_{1},...,1,...y_{k}), i.e., I⊧q(1). Thus 1 is an answer to q given the GLAV
mapping of and the extension Ext(R_{i}).

(⇐) Conversely, suppose that 1 is an answer to q given the GLAV mapping of and the extension
Ext(R_{i}). Note that 1 is also an answer to q if the extension of R_{i} is reduced to any tuple of the
original Ext(R_{i}). Suppose that Σ⊭σ: there exists an interpretation I that satisfies Σ in which σ is not
satisfied. This means that there exists a tuple ⟨e_{1},...,e,...e_{n}⟩ (where e is in position i_{1}) in I(R_{i}) such
that there does not exists a tuple in I(R_{j}) with the value e in position j_{1}. Let t be the tuple of Ext(R_{i})
which corresponds to the equality pattern between values of ⟨e_{1},...,e,...e_{n}⟩. By extending I to
interpret each value of t by the element e_{i} at the same position in ⟨e_{1},...,e,...e_{n}⟩, we obtain a new
interpretation I′ that satisfies Σ and thus each GLAV mapping of , and R_{i}(t). Since 1 is an
answer to q given the GLAV mapping of and R_{i}(t), I′⊧q(1), i.e., I′(1) ∈I′(R_{j}[j_{1}]). Since
I′(1) = e and I′(R_{j}) = I(R_{j}), it means that there exists a tuple in I(R_{j}) with the value e in
position j_{1}, which contradicts our assumption that σ is not satisfied in I. Hence Σ⊧σ.
□

If we restrict the GLAV mappings in a PDMS to be inclusion statements that are expressible in DL-LITE, we get what we will call a DL-LITE PDMS. The decidability of query answering over a DL-LITE PDMS results from the algorithmic machinery described in the previous chapter for answering queries over DL-LITE knowledge bases. Given a query posed to a given peer, the application of the PerfectRef algorithm to the set of all the GLAV mappings in the PDMS provides a set of reformulations. The union of the answer sets obtained by evaluating each reformulation provides the answer set of the initial query. Note that a reformulation is of the form:

R_{1}@i_{1}(z_{1}), …, R_{k}@i_{k}(z_{k})

where the different conjuncts R_{j}@i_{j}(z_{j}) may refer to relations of different peer schemas. Therefore, the
evaluation of each reformulation may require the interrogation of different peers and the
combination of the answers returned by each such sub-queries.

This provides a centralized algorithm for computing the reformulations of answering queries over a decentralized DL-LITE knowledge base. We next present a decentralized algorithm that computes exactly the same thing, i.e., we present a decentralized version of the PerfectRef algorithm seen in the previous chapter in order to deploy effectively DL-LITE PDMSs that avoids having to centralize all the GLAV mappings.

We denote PerfectRef ^{i}(q) the reformulation algorithm running on the peer P_{i} applied to a
query q (asked to the peer P_{i}). The main procedure is the decentralized reformulation
of each atom of the query using the positive inclusion statements that are distributed
over the whole PDMS. Let us denote AtomRef ^{i}(g) the reformulation algorithm running
on the peer P_{i} to reformulate the atom g (built on a relation of the schema of the peer
P_{i}).

Within each peer P_{i} we distinguish the local positive inclusion axioms of the form
C_{i} ⊆D_{i} where C_{i} and D_{i} are built over relations in the schema of the peer P_{i}, from the
mappings which are positive inclusion mappings of the form C_{j} ⊆D_{i} or D_{i} ⊆C_{j} where C_{j}
denotes a relation of another peer P_{j} (while D_{i} refers to a relation in the schema of the peer
P_{i}).

Let us denote LocalRef(g,PI_{i}) the result of the reformulation of the atom g using the set PI_{i} of
local positive inclusion atoms of the peer P_{i}. We refer to the previous chapter (Definition 8.4.7,
Section 8.4) for the computation of LocalRef(g,PI_{i}) by backward application of the local
PIs.

We just recall here that gr(g,I) denotes the reformulation of the atom g using the positive inclusion axiom I. We also recall that the atoms g that can be found as conjunct of a query q over a DL-LITE PDMS are of the following forms:

- A@i(x) where A@i is a unary relation in the schema of a peer P
_{i}and x an (existential or qualified) variable - P@i(x,_), P@i(_,x) or P@i(x,y) where P@i is a binary relation in the schema of a peer
P
_{i}, and _ denotes an unbounded existential variable of the query, while x and y denote qualified variables or existential variables which are bounded in the query.

Running the algorithm AtomRef ^{i} on the peer P_{i} for reformulating the atom g consists first in
computing the set LocalRef(g,PI_{i}) of local reformulations of g, and then, for each mapping m with
a peer P_{j} applicable to a local reformulation g′, in triggering the application of AtomRef ^{j}(gr(g′,m))
on P_{j} (by sending a message to P_{j}). Other peers P_{k} may be solicited in turn to run locally
AtomRef ^{k}.

A loop may occur if a request of reformulation of an atom g initiated by a given peer P generates
a branch of requests reaching a peer P′ which in turn requests P to reformulate g. Such loops can be
easily handled by transmitting with every request the history of the current reasoning branch. More
precisely, an history hist is a sequence [(g_{k},P_{k}),…,(g_{1},P_{1})] of pairs (g_{i},P_{i}) where g_{i} is an atom of a
peer P_{i} such that for each i∈ [1..k- 1], g_{i+1} is a reformulation of g_{i} using a mapping between P_{i}
and P_{i+1}.

This is summarized in Algorithm 6, which is the atom reformulation algorithm with history running on Peer i.

AtomRefHist^{i}(g,hist)

Input: An atom g in the vocabulary of the peer P_{i}, an history hist

Output: The set of its reformulations in the PDMS: R

(4) Let PI_{i} be the local PIs of the peer P_{i}

(5) Let M_{i} be the mappings between the peer P_{i} and other peers

(6) for each g′∈LocalRef(g,PI_{i})

(7) for each mapping m∈M_{i} between P_{i} and a peer P_{j} applicable to
g′

(8) R←R∪AtomRefHist^{j}(gr(g′,m),[(g,P_{i})|hist])

Algorithm 7 is the atom reformulation algorithm (denoted AtomRef ^{i}) running on peer P_{i}, which
just calls AtomRefHist^{i} with an empty history.

AtomRef ^{i}(g)

Input: An atom g in the vocabulary of the peer P_{i}

Output: The set of its reformulations in the PDMS

(1)AtomRefHist^{i}(g,∅)

The decentralized version of the PerfectRef algorithm that computes all the reformulations of a conjunctive query q is provided in Algorithm 8. The main difference with the centralized version is that the simplification of the produced reformulations (which is required for making some PIs applicable) are delayed after (decentralized) computation of the reformulations of all the atoms in the query.

We recall here the notation used for denoting the simplification of some atoms within a query under reformulation, which were introduced in the previous chapter when describing the PerfectRef algorithm:

- The notation q[g/gr(g,I)] denotes the replacement of the atom g in the body of the query q with the result gr(g,I) of the backward application of the PI I to the atom g,
- The operator reduce(q,g,g′) denotes the simplification of the body of q obtained by replacing the conjunction of its two atoms g and g’ with their most general unifier (if g and g’ can be unified),
- The operator τ replaces in the body of a query all the possibly new unbounded existential variables with the anonymous variable denoted _.

For each atom in the query, it computes first (in the decentralized manner explained previously)
the set of all of its reformulations, and then a first set of reformulations of the original
query by building all the conjunctions between the atomic reformulations (denoted
⊕_{i=1}^{n}AtomRef^{i}(g_{i}) at Line 5). These reformulations are then possibly simplified by unifying some of
their atoms (Lines 8 to 11), and the reformulation process is iterated on these newly
produced reformulations until no simplification is possible (general loop starting on Line
4).

PerfectRef^{i}(q)

Input: a conjunctive query q over the schema of the peer P_{i}

Output: a set of reformulations of the query using the union of PIs and mappings in the PDMS

(4) (a) foreach q′ = g_{1} ∧g_{2} ∧…∧g_{n} ∈PR′

(5)PR′′ := ⊕_{i=1}^{n}AtomRef^{i}(g_{i})

(8) foreach g′_{1},g′_{2} ∈q′′

(10)PR′ := PR′∪{τ(reduce(q′′,g′_{1},g′_{2}))}

(12)return PR

One can prove that the decentralized algorithm computes the same set of facts as the centralized
one, and thus is correct. The proof results (1) from the observation that the centralized version of
PerfectRef^{i} (in which AtomRef^{i}(g_{i}) is computed by iterating the one-step application of PIs on each
atom g_{i} of the query) produces the same results than the original PerfectRef, and (2) from the
completeness of AtomRef^{i}(g_{i}) ensuring the decentralized computation of all the reformulations of
g_{i}.

The Bucket and Minicon algorithms can be extended ( [113, 137]) to handle (union of) conjunctives queries with interpreted predicates. When a query q includes interpreted predicates, finding all answers to q given the LAV mappings is co-NP hard in the size of the data. This complexity result shows that answering such queries cannot be fully realized with a finite set of conjunctive rewriting (unlike what we showed here in absence of interpreted predicates). The Inverse-rule algorithm does not handle interpreted predicates but is able to build recursive query plans for data integration [59].

A survey on answering queries using views can be found in [87], and a survey on query containment for data integration systems in [124].

More material can be found on PDMS in [88, 86].

Distributed reasoning in a peer to peer setting has been investigated in [14] as a basis for querying distributed data through distributed ontologies [73, 1]. The subtle point that we have not treated in this chapter concerns consistency checking. In contrast with the centralized case, the global consistency of the PDMS cannot be checked at query time since the queried peer does not know all the peers in the PDMS. However, it can get the identifiers of the peers involved in a reformulation of the query. Then the (local) consistency of the union of the corresponding knowledge bases can be checked in a decentralized manner. The important point is that it can be shown that this local consistency is sufficient to guarantee that the answers obtained by evaluating the reformulations (computed by the decentralized algorithm that we have described) are well-founded.

The undecidability of the Dependency Implication Problem is shown in [42] even if σ is a unary inclusion dependency. More on this topic may be found in [9])

Exercise 9.8.1 By applying the query containment algorithm (see Algorithm 4), determine which query is contained in which one among the three following queries. Are there equivalent queries ? (two queries qand q′are equivalent if qis contained in q′and q′is contained in q).

Exercise 9.8.2 Consider a global schema defined by the following relations:

emp(E): E is an employee

phone(E,P): E has P as phone number

office(E,O): E has O as office

manager(E,M): M is the manager of E

dept(E,D): D is the department of E

Suppose that the three following data sources are available for providing data:

Source1 provides the phone number and the manager for some employees. It is modeled by the local relation s1(E,P,M).

Source2 provides the office and the department for some employees. It is modeled by the local relation s2(E,O,D).

Source3 provides the phone number of employees of the ’toy’ department. It is modeled by the local relation s3(E,P).

- Model the content of these sources by GAV mappings.
- Model the content of these sources by LAV mappings.
- Consider the global query asking for Sally’s phone number and office:
Compute the reformulation of the query in terms of local relations:

- by applying the query unfolding algorithmto the GAV mappings of Question 1,
- by applying the Bucket algorithm to LAV mappings of Question 2.

Which algorithm is easier ?

- Now Source1 disappears (becomes unavailable) and a new source comes in, that provides the phone number of their manager for some employees. Do the updates in the GAV and LAV mappings that are required to take into account these changes. What is the approach (GAV or LAV) for which updating the mappings between the global and local relations is easier ?

Exercise 9.8.3 Consider the three following LAV mappings:

V1(x) ⊆cite(x,y), cite(y,x)

V2(x,y) ⊆sameTopic(x,y)

V3(x,y) ⊆cite(x,z), cite(z,x), sameTopic(x,z)

- Provide the FOL semantics of these LAV mappings
- Suppose that the global relation cite(x,y) means that the paper x cites the paper y, and that the global relation sameTopic(x,y) means that the two papers x and y have the same topic. Suppose that each LAV mapping models the content of different available data sources. Express with an english sentence which information on papers each data source provides.
- Apply in turn the Bucket, Minicon and Inverse-rule algorithms to compute the different rewritings of the following query asking for papers that cite and are cited by a paper having the same topic:

^{1}The same holds for the Bucket or Inverse-rules algorithm.