9 Data Integration

The goal of data integration is to provide a uniform access to a set of autonomous and possibly heterogeneous data sources in a particular application domain. This is typically what we need when, for instance, querying the deep web that is composed of a plethora of databases accessible through Web forms. We would like to be able with a single query to find relevant data no matter which database provides it.

A first issue for data integration (that will be ignored here) is social: The owners of some data set may be unwilling to fully share it and be reluctant to participate in a data integration system. Also, from a technical viewpoint, the difficulty comes from the lack of interoperability between the data sources, that may use a variety of formats, specific query processing capabilities, different protocols. However, the real bottleneck for data integration is logical. It comes from the so-called semantic heterogeneity between the data sources. They typically organize data using different schemas even in the same application domain. For instance, each university or educational institution may choose to model students and teaching programs in its own way. A French university may use the social security number to identify students and the attributes NOM, PRENOM, whereas the Erasmus database about European students may use a European student number and the attributes fiRSTNAME, LASTNAME and HOME UNIVERSITY.

In this chapter, we study data integration in the mediator approach. In this approach, data remain exclusively in data sources and are obtained when the system is queried. One sometimes use the term virtual data integration. This is in contrast to a warehousing approach where the data is extracted from the data sources ahead of query time, transformed, and loaded in the warehouse. At query time, the warehouse is accessed but not the data sources. Warehouses approaches are typically preferred for very complex queries, e.g., for data mining. On the other hand, to have access to “fresh” information, a mediator approach is preferred since it avoids having to propagate in real time, data source updates to the warehouse. Figure 9.1 illustrates these two approaches of data integration.

In the mediator approach, one starts by designing a global schema (also called mediated schema) that serves as a unique entry point on which global queries are posed by users. A main issue is then to specify the relationships, namely semantic mappings, between the schemas of the data sources and the global schema. Based on these mappings, one can answer queries over the global schema using queries over the data sources. Typically, query answering in the mediator approach is performed as follows. First, independently of the data in the sources, the user’s query posed over the global schema is transformed into local queries that refer to the schemas of the data sources. A global query combines the data provided by sources. Queries are optimized and transformed into query plans. The local query plans are executed and their results combined by the global query plan.

In the following, for presentation purposes, we consider that the global schema and the schemas of the data sources to integrate are all relational. In practice, each non-relational data source (e.g., XML or HTML) is abstracted as a relational database with the help of a wrapper. Wrappers are small programs that translate local relational queries into appropriate requests understood by specific data sources, and transform their results into relations. The role of wrappers is to allow the mediator to see each data source as relational, no matter which actual format it uses.

Let us consider in more detail the specification of semantic mappings between the data sources and the global schema. Let S₁, ..., S_n be the local schemas of n pre-existing data sources. To simplify the presentation, let us assume that each local schema S_i is made of a single relation that we denote also S_i. The relations S₁, ..., S_n are called the local relations. Suppose the global schema G consists of the global relations G₁, ..., G_m. The goal is to specify semantic relations between the local relations S_i and the global relations G_j. The G_j are logically (intentionally) defined by the S_i.

An example of simple relationship (not very interesting) is based on equality, e.g., G₁ = S₁. One can find more complicated relationships, e.g., G₂ = S₁ ∪S₂ or G₃ = S₁⋈S₃. In these last two examples, a global relation is defined as a query over the local relations. In other words, the global relation is a view of the local relations. Indeed, one typically prefers more flexible constraints such as G₃ ⊇S₁⋈S₃. Using containment instead of equality leaves open the possibility for other sources of providing data about G₃, e.g., G₃ ⊇σ_A=”yes”(S₄). Because global relations are constrained by views of the local relations, one uses the term global-as-view for such specifications.

In a somewhat dual manner, one can use local-as-view constraints such as: S₄ ⊆G₁⋈G₃. This leaves even more flexibility since the contribution of each data source can be specified (e.g., by its owner) independently of the other sources of the system. This kind of autonomy is typically well-adapted to a Web setting.

More generally, to express semantic mappings between {S₁,...,S_n} and {G₁,...,G_m}, one can use inclusion statements, i.e., logical constraints, of the form v(S₁,...,S_n) ⊆v′(G₁,...,G_m), where v and v′ are query expressions called views. All the constraints we consider in this chapter will be of this general form. Now, given an instance I of {S₁,...,S_n} (i.e., an instance of the data sources), we don’t know the instance J of the global schema. But we know that:

In our development, we will consider conjunctive queries. Using negation in queries greatly complicates the issues. In the next section, we recall some standard material on containment of conjunctive queries, i.e., of the queries at the heart of our formal development. In Sections 9.3 and 9.4, we study GAV and LAV mediators, respectively. For each of these languages, we describe appropriate query rewriting algorithms. In Section 9.5, we show the impact on query rewriting of adding DL-LITE constraints in the global schema. Finally, in Section 9.6, we lay the basis of a peer-to-peer approach for data integration. In contrast with the mediator approach which offers a unique entry point to data, peer-to-peer data management systems (PDMS for short) are decentralized data integration systems.

9.2 Containment of conjunctive queries

In this section, we recall some basic notions on comparing conjunctive queries that we will use in the following.

Given an instance I of the relations appearing in the body of the query, an answer is a tuple ⟨ν(x₁),...,ν(x_n)⟩ for some valuation ν of the variables in the query, such that for each i, A_i(ν(u_i)) holds in I. We denote q(I) the set of answers.

We sometimes denote this query q(x₁,...x_n) when its body is understood. Observe that the interpretation of such a conjunctive query in logical terms is:

The data integration techniques rely on conjunctive query containment. This problem has been extensively studied because it is at the core of query optimization. We use known techniques that we recall next.

A query q₁ is contained in q₂, denoted q₁ ⊆q₂, if for each I, q₁(I) ⊆q₂(I). It is known that the containment between a conjunctive query q₁ and a conjunctive query q₂ can be tested by finding a “homomorphism” from q₂ to q₁.

Definition 9.2.1 Let q₁(x₁,...,x_n) and q₂(y₁,...,y_n) be two conjunctive queries. A (conjunctive query) homomorphism from q₂ to q₁ is a mapping ψ from the variables of q₂ to the variables of q₁ such that:

For each i, ψ(y_i) = x_i; and
For each atom R(u_i) in the body of q₂, R(ψ(u_i)) is in the body of q₁.

Example 9.2.2 Consider the following queries:

q₁(x₁,x′₁) : -A₁(x₁,x₂,x₃),A₂(x′₁,x₂,x₃)
q₂(y₁,y′₁) : -A₁(y₁,y₂,y₃),A₂(y′₁,y₂,y′₃)

Consider a mapping ψ such that ψ(y_i) = x_i for each i, ψ(y′₁) = x′₁ and ψ(y′₃) = x₃. Then the required conditions hold, and it follows that q₁ ⊆q₂. Intuitively, q₂ joins A₁ and A₂ on the second attribute, whereas q₁ also joins on the third one. The additional condition induces the containment.

The following proposition states that the existence of a homomorphism is a necessary and sufficient condition for query containment.

Proposition 9.2.3 (Homomorphism theorem) Let q₁ and q₂ be two conjunctive queries. Then q₁ is contained in q₂ if and only if there exists a homomorphism from q₂ to q₁.

This provides a simple algorithm for testing conjunctive query containment. In the general case, deciding whether a conjunctive query is contained in another one is NP-complete in the size of the two queries. In fact, in many practical cases, there are polynomial-time algorithms for query containment.

QC(q1,q₂)

Input: Two conjunctive queries:

q₁(x) :- g₁(x₁),…,g_n(x_n)

q₂(y) :- h₁(y₁),…,h_m(y_m)

Output: Yes if q₁ ⊆q₂; no otherwise

(1)freeze q₁: construct a canonical instance D_can = {g_i(ν(x_i))∣1 ≤i≤n}

(2) for some valuation ν mapping each variable in q₁

(3) to a distinct constant

(4)if ν(x) ∈q₂(D_can) return yes

(5)else return no.

Algorithm 4:

The Query containment algorithm

Example 9.2.4 Consider the queries of Example 9.2.2. The canonical instance D_can is A₁(a,b,c),A₂(a′,b,c). It is easily verified that q₂(D_can) = (a,a′), which is ν(x,x′).

9.3 Global-as-view mediation

The main advantage of GAV is its conceptual and algorithmic simplicity. The global schema is simply defined using views over the data sources and specifies how to obtain tuples of the global relation G_i from tuples in the sources.

Definition 9.3.1 (GAV mapping) A GAV mapping is an expression of the form: R(x₁,...,x_n) ⊇ q(x₁,...,x_n), where q(x₁,...,x_n) :- A₁(u₁),...,A_k(u_k) is a conjunctive query of the same arity as R. The semantics of this mapping is:

∀x ,...,x (∃y ,...,y (A (u⃗),...,A (⃗u )⇒ R(⃗u))) 1 n 1 m 1 1 k k

where y₁,...,y_m are of variables occurring in the body of the rule and not its head.

Example 9.3.2 Consider the following four data sources:

The source relation S1is a catalog of teaching programs offered in different French universities with master programs.
S1.Catalogue(nomUniv, programme).
The source relation S2 provides the names of European students enrolled in courses at some university within the Erasmus exchange program:
S2.Erasmus(student, course, univ).
The source relation S3 provides the names of foreign students enrolled in programs of some French university:
S3.CampusFr(student, program, university).
The source relation S4provides the course contents of international master programs:
S4.Mundus(program,course)

Now, suppose we define a global schema with the following unary and binary relations:

MasterStudent(studentName),	University(uniName),
MasterProgram(title),	MasterCourse(code),
EnrolledIn(studentName,title),	RegisteredTo(studentName, uniName).

These relations are defined in terms of the local relations by the following GAV mappings:

MasterStudent(N)⊇S2.Erasmus(N,C,U), S4.Mundus(P,C)

MasterStudent(N)⊇S3.CampusFr(N,P,U), S4.Mundus(P,C)

University(U)⊇S1.Catalogue(U,P)

University(U)⊇S2.Erasmus(N,C,U)

University(U)⊇S3.CampusFr(N,P,U)

MasterProgram(T)⊇S4.Mundus(T,C)

MasterCourse(C)⊇S4.Mundus(T,C)

EnrolledIn(N,T)⊇S2.Erasmus(N,C,U), S4.Mundus(T,C)

EnrolledIn(N,T)⊇S3.CampusFr(N,T,U), S4.Mundus(T,C)

RegisteredTo(N,U)⊇S3.CampusFr(N,T,U)

Note that in a warehousing approach, one would simply evaluate all the queries that define the global view, and populate the warehouse using standard relational query evaluation. In a mediator approach, we try to only derive data that is relevant to a specific query posed on the global view by a user. We show how to rewrite a global query into queries over the local relations and combine their results. This is achieved by a technical trick that consists in unfolding the atoms of the global query.

Observe the first two mappings. They specify (using joins) how to obtain tuples of the unary relation MasterStudent. Now consider the following global query asking for universities with registered master students:

The rewriting of this query into source queries is obtained by unfolding, i.e., by replacing each atom which can be matched with the head of some view, by the body of the corresponding view. (For readers familiar with logic programming, this is some very simple form of resolution.)

In the example, there is a single mapping whose head can be matched with RegisteredTo(s,x), and two mappings that match MasterStudent(s). Thus, we obtain the following two unfoldings:

Observe that q2 can be simplified. Replacing the conjunction of its first two atoms by the single atom S3.CampusFr(s,v6,x) leads to an equivalent query. We thus obtain the following two GAV rewritings of the initial query:

The result is obtained by computing r1∪r2. Now, observe that each r_ℓ is a conjunctive query. It can be optimized using standard query optimization to obtain an optimized physical query plan. Of course, the choice of the particular physical query plan that is selected depends on the statistics that are available and the capabilities of the sources. For instance, a plan may consist in querying S3 and then for each value a of v6 (i.e., a particular university program), asking the query q(X) :- S4.Mundus(a,X) to S4.

Definition 9.3.3 (Query unfolding) Let q(x) :- G₁(z₁),…,G_n(z_n) be a query and for each i, G_i(x_i) ⊇q_i(x_i,y_i) be a GAV mapping. An unfolding of qis the query uobtained from qby replacing, for each i, each conjunct G_i(z_i) by q_i(ψ_i(x_i,y_i)) where ψ_i is a function that maps x_i to z_i, and the existential variables y_i to new fresh variables.

The renaming of the existential variables into fresh ones is necessary to avoid the introduction of unnecessary constraints in the unfolding. Indeed, consider an existential variable y occurring in two distinct atoms, say G_i and G_j. Then, the two atoms should be understood as ∃...y...G_i(z_i) and ∃...y...G_j(z_j). The scopes of y in both are disjoint and nothing requires that the two occurrences of y take the same value. Hence the renaming using fresh variables.

Example 9.3.4 Suppose we have the two mappings:

F (x,y) ⊇ S(x,z),S(y,z) G(x) ⊇ S(x,y)

and the query q(x) :- F(x,y),G(y). Then we get the following unfolding:

q(x) :- S(x,v1),S(y,v1),S(y,v2)

The variable v₁ corresponds to the renaming of the existential variable z in the view defining F, whereas v₂ comes from the renaming of the existential variable yin the view defining G.

We next establish that each unfolding of a query computes a part of the desired results, and that their union computes the whole set of answers. To do so, we use two propositions. The first one ignores unfolding and focuses on the “materialization” of the global relations.

Proposition 9.3.5 Let S₁,...,S_n be a set of source relations; G₁,...,G_m a global schema defined by a set of GAV mappings over S₁,...,S_n; and I be an instance over S₁,...,S_n. Let J be the instance over G₁,...,G_m defined by, for each j,

J(Gj )= ∪ {V(I) |Gj ⊇ V(S1,...,Sn) ∈ G}

Then for each query qover G₁,...,G_m, the answer of qis q(J).

Proof (sketch). Let u be an answer. Then, by definition, q(u) is true in each instance J′ over G₁,...,G_m such that I and J′ together satisfy the mappings. In particular, u belongs to q(J). Conversely, let u be in q(J). Let J′ be an instance such that I and J′ together satisfy the mappings. Since J′ satisfies the mappings, J ⊆J′. Since conjunctive queries are monotone, q(J) ⊆q(J′). Thus u∈J′. Since u belongs to all such J′, u is an answer. □

Proposition 9.3.6 Let S be a set of source relations and G a set of global relations defined by a set of GAV mappings over S. Consider the query q(z) :- G_i₁(z_i₁),…,G_{i_n}(z_{i_n}) over G and the set {r_ℓ} of unfoldings of qgiven . Then for each I over S₁,...,S_n, the answer of qis given by ∪r_ℓ(I).

Proof (sketch). Let J be as in Proposition 9.3.5. By the same proposition, it suffices to show that q(J) = ∪r_ℓ(I).

We can compute the answer using the unfoldings (also called the GAV rewritings). These unfoldings can be simplified by removing redundant conjuncts that may have been introduced by the technique. This simplification relies on checking conjunctive query containment. Given a conjunctive query with body A₁(u₁),...,A_m(u_m), we verify whether each query obtained by removing some A_i(u_i) is equivalent to the initial one. If yes, the atom is redundant and can be removed. We keep doing this until the query is “minimal”. This simplification test is costly but the resulting query may be much less expensive to evaluate that the initial one.

We must evaluate all the unfoldings to obtain the complete answer. If we are aware of some constraints on the local schemas or on the global one, this can be further simplified. For instance, the constraints may imply that the result of a particular unfolding is empty, in which case this particular unfolding needs not be evaluated. Also, the constraints may imply that the result of some unfolding, say r_ℓ, is always included in another one. Then r_ℓ needs not be evaluated. For instance, in the previous example, if it is known that students obtained from the source S2 are European students, while those obtained from the source S3 are non European students, we can be sure that the GAV rewriting r_ℓ obtained by unfolding will not provide any answer. This requires expressing and exploiting disjointness constraints over the local relations. Inclusion constraints on local relations would, on the other hand, permit to detect in advance that a given query plan provides answers that are redundant with those obtained by another query plan.

A main limitation of GAV is that adding or removing data sources to the integration system may require deeply revising all the views defining the global schema. In a Web context where sources may come and go, e.g., because of (non) availability of servers, this is really too constraining. The LAV approach does not suffer from this disadvantage.

9.4 Local-as-view mediation

The LAV approach takes a dual approach. The local relations are defined as views over global relations. The goal is to define the global schema in such a way that individual definitions do not change when data sources join or leave the integration system except for the definitions of the sources that are involved in the change.

Definition 9.4.1 (LAV mapping) A LAV mapping is a mapping of the form: S ⊆ q, for some conjunctive query q(x₁,...,x_n) :- A₁(u₁),...,A_k(u_k) over the global relations. Its semantics is:

∀x 1,...,xn[S (x 1,...,xn)⇒ (∃y1,...,ymA 1(⃗u1),...,Ak(⃗uk))]

where y₁,...,y_m are the existential variables.

Again, S(x₁,...,x_n) is called the head of the view, whereas A₁(u₁),...,A_k(u_k) is called the body of the view.

Example 9.4.2 We define the global schema as consisting of the following relations:

Student(studentName),	EuropeanStudent(studentName),
University(uniName),	NonEuropeanStudent(studentName),
FrenchUniversity(uniName),	EuropeanUniversity(uniName),
NonEuropeanUniversity(uniName),	Program(title),
MasterProgram(title),	EnrolledInProgram(studentName,title),
Course(code),	EnrolledInCourse(studentName,code),
PartOf(code, title),	RegisteredTo(studentName, uniName),
OfferedBy(title, uniName).

The four data sources considered in the previous example can be described by the following LAV mappings:

m1: S1.Catalogue(U,P)⊆	FrenchUniversity(U), Program(P), OfferedBy(P,U),
	OfferedBy(P’,U), MasterProgram(P’)
m2: S2.Erasmus(S,C,U)⊆	Student(S), EnrolledInCourse(S,C), PartOf(C,P),
	OfferedBy(P,U), EuropeanUniversity(U),
	EuropeanUniversity(U’)RegisteredTo(S,U’),
	U≠ U’
m3: S3.CampusFr(S,P,U)⊆	NonEuropeanStudent(S), Program(P),
	EnrolledInProgram(S,P), OfferedBy(P,U),
	FrenchUniversity(U), RegisteredTo(S,U)
m4: S4.Mundus(P,C)⊆	MasterProgram(P), OfferedBy(P,U),
	OfferedBy(P,U’), EuropeanUniversity(U),
	NonEuropeanUniversity(U’), PartOf(C,P)

LAV mappings enable quite fine-grained descriptions of the contents of data sources. For example, we are able to specify precisely the students that can be found in the Erasmus source: they are European students enrolled in courses of a given (European) university that is different from their home (European) University in which they remain registered.

LAV mappings express loose coupling between local and global relations, which is important for flexibility and robustness when the participating data sources change frequently. If we are interested in Master students, we do not need to know in advance (unlike the GAV approach) how to join two sources. We just define them as a global query:

The local sources that must be queried and combined to get the Master students will be discovered by the rewriting process. Recall that, in the GAV approach, they were predefined by the two mappings given in Example 9.3.2.

The price to pay for the flexibility of LAV compared to GAV is that the rewritings are more complicated to find. We describe three algorithms that achieve this rewriting. The Bucket algorithm and the Minicon algorithm follow the same approach. They first determine the local relations that are relevant to the query, then consider their combinations as candidate rewritings and verify whether they are indeed correct. Minicon is actually an optimization of Bucket that avoids the last verification step by a trickier first step. The third algorithm, namely the Inverse-rules algorithm, follows a completely different approach: it consists in transforming the logical rules supporting the LAV mappings (which are unsafe rules) into a set of safe rules with a single global relation. The global query is unfolded using these rules.

9.4.1 The Bucket algorithm

The principle of the Bucket algorithm is quite simple. It proceeds in three steps:

Bucket creation

Let g be a query atom. The atoms in bucket(g) are the heads of mappings having in their body an atom from which g can be inferred. Intuitively, data comes from source relations, and a (global) query atom is satisfied by (local) data only if it can be matched to a (global) atom in the body of a mapping whose head can be matched to source facts. A match between g and some atom in the body of a mapping is thus an indication that the corresponding data source provides a relevant information for this particular query.

There is an extra constraint that has to be considered to guarantee that g can indeed be logically inferred, as illustrated next. In fact, the bucket of a query atom g includes a view atom v only if an atom in the body of v can be matched with g by a variable mapping such that the variables mapped to the distinguished variables of g are also distinguished variables in the view defining the mapping.

Let us illustrate this on an example. Consider the LAV mappings of Example 9.4.2, and the global query:

Let us consider the query atom g= RegisteredTo(s,x), in which the variable x is distinguished.

We can find two mappings (m2 and m3) in which a body atom can be matched to
RegisteredTo(s,x).

The atom RegisteredTo(s,x) matches the atom RegisteredTo(S,U) with the variable mapping {S/s,U/x}, where U is distinguished in the view defining the mapping (it occurs in the head of this LAV mapping).

Therefore, applying the variable mapping {S/s,U/x} to the head S3.CampusFr(S,P,U) of the mapping m3 enforces the matching of RegisteredTo(S,U) with the query atom RegisteredTo(s,x), and then:

Thus S3.CampusFr(s,v1,x) is added in Bucket(g). Note that v1 is simply a fresh variable mapped to the variable P appearing in S3.CampusFr(S,P,U) but not in the variable mapping {S/s,U/x}.

The match this time is between g = RegisteredTo(s,x) and RegisteredTo(S,U’) by the variable mapping {S/s,U′/x}. The difference with the previous situation is that the variable U’ is existentially quantified in the view defining the mapping. Applying the variable mapping {S/s,U′/x} to the head S2.Erasmus(S,C,U) of the mapping m2do not enforce the matching of RegisteredTo(S,U’) in its body with the query atom RegisteredTo(s,x).

However, this is a strictly weaker statement than ∃sRegisteredTo(s,x) where x is fixed. We prove this next. Consider an instance I over the domain Δ = {s,v2,v3,v4,v5,x} defined by:

Bucket(g, q, M)

Input: An atom g = G(u₁,...,u_m) of the query q and a set of LAV mappings

Output: The set of view atoms from which g can be inferred

(1)Bucket(g) : ∅

(2)for each LAV mapping S(x) ⊆q(x,y)

(3) if there exists in q(x,y) an atom G(z₁,...,z_m) such that

(4) z_i is distinguished for each i such that u_i is distinguished in q;

(5) let ψ the variable mapping {z₁/u₁,....,z_m/u_m}

(6) extended by mapping the head variables in x not

(7) appearing in {z₁,...,z_m} to new fresh variables;

(8) add S(ψ(x)) to Bucket(g);

(9)return Bucket(g);

Algorithm 5:

The Bucket algorithm

Algorithm 5 constructs the buckets. Proposition 9.4.3 is a logical characterization of the view atoms put in the buckets of the atoms of the global query.

Proposition 9.4.3 Let G(u₁,...,u_m) be an atom of the global query. Let u be the (possibly empty) subset of existential variables in {u₁,...,u_m}. Let m: S(x) ⊆q(x,y) be a LAV mapping. Then

S(v), FOL(m)⊧∃uG(u₁,...,u_m)

iff there exists a view atom in Bucket(g) that is equal to S(v) (up to a renaming of the fresh variables).

In the worst-case, the Bucket algorithm applied to each atom of a query has a time complexity in O(N ×M×V) and produces N buckets containing each at most M×V view atoms, where N is the size of the query, M is the maximal size of the LAV mappings and V is the number of LAV mappings.

Returning to the example, we obtain by the Bucket algorithm, the following buckets for the three atoms of the query q.

Construction of candidate rewritings

The candidate rewritings of the initial global query are then obtained by combining the view atoms of each bucket. In the worst-case, the number of candidate rewritings is in O((M×V)^N). For instance, in our example, we obtain two candidate rewritings for the query q:

A candidate rewriting may not be a valid rewriting of the query. By Proposition 9.4.3, we only know that each candidate rewriting entails each atom of the query in isolation, i.e., without taking into account the possible bindings of the existential variables within the query.

It turns out that, in our example, the first candidate rewriting r1 is not a valid rewriting of the query q: the body of q is not logically entailed by the conjunction of the view atoms in the body of r1.

To see why, we first apply to each view atom in the body of r1 the corresponding LAV mapping to obtain the logical global expression (i.e., built on global relations). This step is called expanding r1, and its result, the expansion of r1. In our case, the expansion of r1 is the following query expression:

Note that new existential variables may be introduced by the expansion of some view atoms. For instance, the LAV mapping defining S1.Catalogue(v3,v4) contains the existential variable denoted P’ in the LAV mapping definition. Such variables are renamed with new fresh variables to avoid unnecessary constraints between the variables. In our example, this corresponds to variable v5 in the body of Exp_r1(x).

To check whether a rewriting is correct, it suffices to check with the Conjunctive Query Containment algorithm whether the query Exp_r1(x) is contained in the query q(x). For each variable v, let the corresponding constant, i.e., ψ(v), be "v". The canonical database obtained from r1 is given in Figure 9.2.

NonEuropean-	Program	EnrolledIn-	OfferedBy	French-	RegisteredTo	Master-
Student		Program		University		Program

"s"	"v1"	("s", "v1")	("v1", "x")	"x"	("s", "x")	"v5"
	"p"	("s", "p")	("p", "v2")	"v2"	("s", "v2")
	"v4"		("v4", "v3")	"v3"
			("v5", "v3")

Figure 9.2:

The canonical database resulting from freezing r1

The evaluation of q(x) over this canonical database yields an empty result because there is no way of assigning the existential variables s and p to constants of the canonical database which satisfies the binding of the existential variable p between the two last atoms of the body of the query.

Expanding the rewriting r2 and checking that it is contained into the query q is left in exercise. This shows that among the two candidate rewritings, only r2 is a valid rewriting:

Remark 9.4.4 In spite of the apparent redundancy of the two first atoms, this rewriting cannot be simplified to

r2.1(x) :- S3.CampusFr(s,p,x), S4.Mundus(p,v5)

It is true that r2.1(x) is contained into r2(x). However, the two queries are not equivalent. For some data sets, it may be the case that there is a student s and there is a university x such that (based on S3.CampusFr), s is registered in x and also enrolled in a Mundus master program offered by another university. The containment would hold under a constraint that would forbid a student to be registered in more than one universities.

One can prove that each rewriting finally obtained does indeed provide answers and that their union constitutes the complete answer.

9.4.2 The Minicon algorithm

The idea underlying Minicon is to avoid putting in a bucket an atom that will only generate invalid rewritings. As we saw in the discussion of Bucket, the reason for an atom to be useless is that its binding of a variable does not match with the binding of other occurrences of that variable. This explains why a candidate rewriting (like r1) is not valid.

First step of Minicon: creation of MCDs

Minicon scans each atom in the query, but instead of creating buckets for them, it builds MCDs (short name for Minicon Descriptions). The first iteration of Minicon determines the relevance of the different LAV mappings to rewrite the first query atom U(y,z):

The last iteration of building MCDs corresponds to the last query atom: R(y’,x). The LAV mapping V1 has in its expansion the atom R(v,u) that can be matched to it by the variable mapping {v/y′,u/x)}. Since the distinguished variable x in the query is assigned to the distinguished variable (same condition as for adding to a bucket), and since the existential variable y’ of the query atom has a single occurrence in the query, the following MCD is created:

In contrast, there is no MCD created for R(y’,x) with the second LAV mapping: in the variable mapping {v′/y′,w/x)} that allows to match the query atom R(y’,x) with the atom R(v’,w) in the expansion of V2, the distinguished variable x in the query is assigned to the variable w which is not distinguished in the expansion of of V2. As for adding a view atom in a bucket, a MCD is created for a query atom g only if the variables mapped to the distinguished variables of g are also distinguished variables in the view defining the mapping.

Second step of Minicon: combination of the MCDs

The second step of Minicon replaces the combination of the buckets by the combination of the MCDs. More precisely, the rewritings are obtained by combining MCDs that cover mutually disjoint subsets of query atoms, while together covering all the atoms of the query.

Because of the way in which the MCDs were constructed, the rewritings obtained that way are guaranteed to be valid. No containment checking is needed, unlike in the Bucket Algorithm. In our example, we would therefore obtain as single rewriting of q(x):

9.4.3 The Inverse-rules algorithm

This algorithm takes a radically different approach. It transforms the LAV mappings into GAV mappings (called inverse rules) so that the complex operation of query rewriting using LAV mappings can then be replaced by the much simpler operation of query unfolding. A LAV mapping is replaced by several GAV mappings, one for each atom in the body of the rule. The subtlety is to keep bindings between the different occurrences of the same existential variable in the body. This is realized using a simple trick from first-order logic, namely by introducing Skolem functions.

Let us explain the Inverse-rules algorithm on the example we used for Minicon. A first important point that distinguishes it from the Bucket and Minicon algorithms is that the Inverse-rules algorithm is independent of the query. It only considers as input the set of LAV mappings:

Consider the first LAV mapping and recall that its logical meaning mapping is the formula:

But we can infer much more. We can also infer that there exists some constant d1 such that T(d1,a) and U(b,d1) are both true. We do not know the exact value of that constant d1, but we know it exists and that, in some way, it depends on the constants a,b. Since this dependency comes from the first rule, we denote this unknown d1 value: f1(a,b).

Creating the inverse rules This motivates the construction of three following GAV mappings for which we give also the FOL translation.

In the previous rules, the symbol f1 is a Skolem function of arity 2, and f1(u,v) is a Skolem term denoting some constant that depends on the values instantiating the variables u,v. Given two distinct Skolem terms, e.g. f1(1,2) and f1(2,v3), we cannot tell whether they refer to the same constant or not.

The Inverse-rules algorithm just scans the LAV mappings and creates n GAV mappings for each LAV mapping having n atoms. The result of this algorithm applied to the second LAV mappings in the example is:

Obtaining the rewritings by unfolding: The rewritings of any global query is now obtained by unfolding the query atoms using the (Inverse-rules) GAV mappings corresponding to the initial set of LAV mappings. The unfolding operation here is a bit trickier than the unfolding defined in Definition 9.3.3, because of the Skolem terms. In Definition 9.3.3, the unfolding was based on matching each query atom G(x1, .., xm) with an atom (in the right-hand side of a GAV mapping) of the form G(z1, .., zm) by equating each pair (zi, xi) of variables. Proposition 9.3.6 showed that unfolding each atom of the query in isolation builds valid rewritings of the query, i.e., conjunctions of view atoms which logically implies the conjunction of the query atoms. It is not the case anymore when atoms in the right-hand side of GAV mappings contain Skolem terms.

The unification of two atoms with functions is more complex than just equating variables, and it may fail. It may require the substitution of some variables with functional terms (in our case, Skolem terms). This may make impossible to unify the other atoms of the query with atoms in the right-hand side of GAV mappings.

Let us illustrate on our example the subtleties of unfolding queries in presence of functional terms. Consider again the same query q:

The query atom U(y,z) can be unified with the atom U(v,f1(u,v)) in the right-hand side of the GAV mappings IN12 using a so-called most general unifier (mgu). In this case, the mgu is the substitution:

This substitution that makes the unfolding of the first query atom possible, now constrains the other occurrences in the query of the variables y and z for the unfolding of the other query atoms. After the application of σ to the whole body of the query and the unfolding of the first query atom made possible by σ, we obtain the following (partial) query rewriting:

The unfolding of the second atom R(x,f1(v2,v1)) yields V1(f1(v2,v1),x), and we obtain the (partial) rewriting:

It is useless to continue unfolding the remaining query atoms of pr2(x). As soon as a given unfolding has produced a view atom with Skolem terms, we can be sure that the evaluation of the query plan under construction will not produce any answer: there is no way to match V1(f1(v2,v1),x) with any fact in the data source which are of the form V1(a,b) where a,b are constants. Since we don’t know f1(v2,v1), there is absolutely no reason to believe that it is equal to a.

Using the inverse rule IN23 to unfold R(x,f1(v2,v1)) does not help because unifying R(x,f1(v2,v1)) and R(v’,f2(u,v,v’)) fails because of the two different Skolem functions. Thus, the (partial) rewriting issued from unfolding U(y,z) using the inverse rule IN12 is abandoned.

Now, unfolding R(x,f2(v2,v1,v3)) using the inverse rule IN23 is possible thanks to the substitution

Now, we examine the unfolding of the query atom T(f2(v2,v1,x),v1), which requires checking whether T(f2(v2,v1,x),v1) and T(f2(u,v,v’),u) are unifiable. This is the case thanks to the substitution {v2/v3,u/v3,v1/v3,v/v3,v′/x}, which leads to the (partial) rewriting:

Again, we can remove the first atom that is redundant and obtain the equivalent (partial) rewriting:

9.4.4 Discussion

The three algorithms have the same (worst-case) complexity and they guarantee to provide the correct answer. Some experiments have shown that in practice Minicon outperforms both Bucket and Inverse-rules. The main advantage of the Inverse-rules algorithm over the Bucket and Minicon algorithms is that the step producing the inverse rules is done independently of the queries. Furthermore, the unfolding step can also be applied to Datalog queries, i.e., to recursive queries.

The common limitation of the three algorithms is that they do not handle additional knowledge (ontology statements) that can be known about the domain of application. In the next section, we see how to extend both the Local-As-Views and Global-As-Views approaches with DL-LITE ontologies, i.e., we consider global schemas that include constraints expressed as DL-LITE axioms.

9.5 Ontology-based mediators

We first show a negative result: as soon as we add functionality constraints over the global schema, the number of conjunctive rewritings of a query to be considered, may become infinite. This is a severe limitation for extending the LAV or GAV approaches since such constraints are rather natural. So these approaches to data integration fail when we consider the DL-LITE $F$ dialect of previous chapters. On the positive side, we show how to extend the GAV and LAV approaches to constraints expressible in DL-LITE $R$ .

9.5.1 Adding functionality constraints

We illustrate on an example the problem raised by taking into account functionality constraints in the global schema. Let us consider a global schema with one unary relation C and two binary relations R and R’. In both R and R1, we impose that the first attribute is a key. Let us consider two LAV mappings:

To prove it, we expand r₁(x) and show that the resulting expansion together with the logical axiom expressing the functionality of R1 logically implies the conjunction of atoms in the body of the query. The expansion of r₁(x) is:

It is important to note that to properly check containment, the standard query containment algorithm seen in the previous section would have to be modified in a standard manner to take into account functional dependencies. Intuitively, one would have to proceed pretty much as we did in the example, equating variables as implied by the functional dependencies.

It turns out that the situation is even more subtle. Surprisingly, this rewriting r₁(x) is not the only one. In fact there exists an infinite number of different rewritings for q(x). Let k ≥ 2. The following query is a valid rewriting of q(x):

One can also show that for each k, each such rewriting may return answers that are not returned with k- 1. Thus, there exists an infinite number of non redundant conjunctive rewritings. The reader familiar with Datalog will observe that this infinite collection of rewritings can be captured in Datalog by the following recursive rewriting:

9.5.2 Query rewriting using views in DL-LITE $R$

Querying data through DL-LITE $R$ ontologies has been detailed in the previous chapter. It has been shown how the positive and negative constraints expressed in the ontology are exploited both for data consistency checking and for query answering. In particular, the first step of query answering is the query reformulation step which is performed by the PerfectRef algorithm: using the positive constraints, called the PIs, it computes a set of reformulations, which are then evaluated over the data to produce the answer set of the original query. The negative constraints, called the NIs, are used to check data consistency, by translating each (declared or entailed) NI into a Boolean conjunctive query q_unsat that must be evaluated over the data.

In this section, we show how to extend both the LAV and GAV approaches to rewrite queries in term of views when the global schema includes some DL-LITE $R$ Tbox.

These two observations follow from the completeness of the PerfectRef and Consistent algorithms for DL-LITE $R$ presented in the previous chapter, and that of the rewriting algorithms of Sections 9.3 and 9.4; namely Unfolding for GAV, Minicon, Bucket or Inverse-rules for LAV.

The argument may be somewhat too abstract for some readers. We next illustrate these two points with examples. We use the global schema considered in Example 9.4.2 page 439, enriched with the DL-LITE $R$ Tbox of Figure 9.3. Note in particular that we add the subclass College of the class University, the subproperty EnrolledInCollege of the property RegisteredTo, for which the domain is declared as being the class MasterStudent. In addition, we add the property EnrolledInMasterProgram that we declare as a subproperty of the property EnrolledInProgram. Finally, we declare a mandatory property for the class College: any college must have students enrolled in it.

DL notation	FOL notation

PIs:

MasterStudent ⊑Student	MasterStudent(X) ⇒Student(X)
EuropeanStudent ⊑Student	EuropeanStudent(X) ⇒Student(X)
NonEuropeanStudent ⊑Student	NonEuropeanStudent(X) ⇒Student(X)
College ⊑University	College(X) ⇒University(X)
FrenchUniversity ⊑University	FrenchUniversity(X) ⇒University(X)
EuropeanUniversity ⊑University	EuropeanUniversity(X) ⇒University(X)
NonEuropeanUniversity ⊑University	NonEuropeanUniversity(X) ⇒University(X)
∃EnrolledInCollege ⊑MasterStudent	EnrolledInCollege(X,Y) ⇒MasterStudent(X)
College ⊑∃EnrolledInCollege^-	College(X) ⇒∃YEnrolledInCollege(Y,X)
EnrolledInCollege ⊑RegisteredTo	EnrolledInCollege(X,Y) ⇒RegisteredTo(X,Y)
MasterStudent ⊑∃EnrolledInMasterProgram	MasterStudent(X) ⇒∃YEnrolledInMasterProgram(X,Y)
∃EnrolledInMasterProgram^-⊑MasterProgram	EnrolledInMasterProgram(X,Y) ⇒MasterProgram(Y)
EnrolledInMasterProgram ⊑EnrolledInProgram	EnrolledInMasterProgram(X,Y) ⇒EnrolledInProgram(X,Y)

NIs:

NonEuropeanStudent ⊑¬EuropeanStudent	NonEuropeanStudent(X) ⇒¬EuropeanStudent(X)
NonEuropeanUniversity ⊑¬EuropeanUniversity	NonEuropeanUniversity(X) ⇒¬EuropeanUniversity(X)
NonEuropeanUniversity ⊑¬FrenchUniversity	NonEuropeanUniversity(X) ⇒¬FrenchUniversity(X)

Figure 9.3:

A DL-LITE $R$ Tbox enriching the global schema of Example 9.4.2

GAV and DL-LITE $R$

We revisit Example 9.3.2 by adding the data source S5 giving the list of French so-called Grandes Ecoles. Its local schema is made of the local relation: S5.GrandeEcole(nomEcole). According to this new source and also to the enriched global schema of Figure 9.3, we add the following GAV mappings to the ones already considered in Example 9.3.2:

Consider again the global query looking for universities with registered master students:

It is left as an exercise to show that the application of the PerfectRef(q(x), PI) algorithm returns, in addition to q(x) itself, the reformulation:

the rewriting r3(x) would not have been obtained without reformulated first the initial query q(x) into q1(x).

Now, in contrast with the standard GAV approach, we have to check the consistency of each of these rewritings. To do so:

We illustrate the consistency check by checking the consistency of the rewriting r1(x). First, its expansion replaces each of its local atoms S(z) with the conjunction of global atoms of the form G(z) that can be produced by a GAV mapping G(x) ⊇S(x), if such GAV mappings exist. For expanding r1(x), we apply the following GAV mappings:

We then apply the Consistent algorithm. For this, we evaluate q_unsat¹, q_unsat² and q_unsat³ over the body of Exp_r1(x) seen as a relational database; i.e., we freeze its atoms to obtain a canonical instance. Query q_unsat¹ returns true, so an inconsistency has been detected and the rewriting r1(x) is rejected.

LAV and DL-LITE $R$

We revisit Example 9.4.2 by adding the same data source S5 as in Section 9.5.2. The GAV mapping is also a LAV mapping: S5.GrandeEcole(U)⊆College(U)

It is left as an exercise to show that the application of the PerfectRef(q(x), PI) algorithm returns, in addition to q(x) itself, the following reformulation:

By applying the Minicon algorithm¹ to the initial query q(x), we obtain the following rewriting (as shown in Section 9.4.1):

By applying the Minicon algorithm to the reformulation q1(x) of the initial query, we obtain the additional rewriting:

As for the extended GAV approach, the consistency of LAV rewritings is not guaranteed because of the NIs in the Tbox. We follow the same approach: at compile time, the closure of the NIs is computed and each (declared or inferred ) NI is compiled into a Boolean query q_unsat. At query time, each of these q_unsat queries is evaluated over the canonical instance corresponding to each rewriting.

9.6 Peer-to-Peer Data Management Systems

In contrast with the centralized mediator model, a Peer-to-Peer data management system (PDMS for short) implements a decentralized view of data integration, in which data sources collaborate without any central authority. In a PDMS, each collaborating data source can also play the role of a mediator, so is at the same time a data server and a client for other data sources. Thus each participant to the system is a peer and there are mappings relating data from the different peers. A PDMS architecture is therefore very flexible in the sense that there is no need for a global schema defining in advance a set of terms to which each data source needs to adhere. Over time, data sources can join or leave the PDMS just by adding or removing mappings between them. PDMS are inspired by P2P file sharing systems but they enable answering fine-grained queries. Like in the mediator model, answering queries is performed by reformulating queries based on the mappings, but in a decentralized manner.

Each peer in a PDMS has a peer schema composed of peer relations and peer mappings that relate its schema to the schemas of other peers. To avoid confusing relations from different peers, we assume that each relation of peer p is of the form r@p for some local relation name r. A query to a PDMS is posed using the peer schema of one of the peers. A query is asked to a particular peer, as a query over his particular schema. It is reformulated using the peer mappings into a set of queries that may refer to other peer relations. This captures the intuition that we want to use the information available in the entire P2P system to answer the query.

For designing the mappings, the distinction made in the mediator model between local and global relations does not make sense anymore, since each peer relation may play the role at different times both of a local relation and of a global relation. Therefore, the notions of GAV and LAV mappings are relaxed to the more appropriate symmetric notion of GLAV mappings.

Definition 9.6.1 (GLAV mapping) Let S@i and S@j be the peer schemas of two peers i and j. A GLAV mapping between these two peers is an inclusion axiom of the form: q_i(x) ⊆q_j(x), where q_i(x) and q_j(x) are conjunctive queries over the peer schema S@i, S@j, respectively.

Let q_i(x,y_i) and q_j(x,y_j) be the bodies of q_i(x) and q_j(x)), respectively. The semantics of the GLAV mapping q_i(x) ⊆q_j(x) is: ∀x[∃y_iq_i(x,y_i) ⇒∃y_jq_j(x,y_j)].

In database terms, a GLAV mapping q_i(x) ⊆q_j(x) expresses that answers obtained by asking q_i(x) at peer i should also be considered as answers to q_j(x) asked at peer j. Note that with this semantics, each local query is assumed to be incompletely answered with local data since external data may bring in new information to it. As already mentioned, such an open-world assumption is fully appropriate for Web data.

We next show one negative and one positive result for PDMSs. In Section 9.6.1, we show that in general, answering queries with GLAV mappings is undecidable, so without further restriction, answering queries in a PDMS is undecidable. In Section 9.6.2, we show that if we restrict the peer mappings to be DL-LITE $R$ inclusion axioms, a decentralized version of the algorithm for DL-LITE $R$ can be used to answer queries in DL-LITE $R$ PDMSs.

9.6.1 Answering queries using GLAV mappings is undecidable

We show that the Dependency Implication Problem (more precisely, the problem of the implication of an inclusion dependency from a set of inclusion and functional dependencies) can be reduced to the GLAV Query Answering Problem, i.e., the problem of answering queries in presence of GLAV mappings. Since the Dependency Implication Problem is known to be undecidable, this shows that the GLAV Query Answering Problem is also undecidable.

The reduction technique is standard for proving undecidability results. We first recall how it works and also recall the Dependency Implication Problem. We believe that these notions are important to know beyond the scope of this book. Finally, we use a reduction to show the undecidability of answering queries using GLAV mappings.

Reduction from a decision problem B to a decision problem B’

Let B be a Boolean function over a set X. The decision problem B is decidable if there exists an algorithm (in any computation model equivalent to Turing machines) that terminates on any input x∈B and returns “true” if and only if B(x) is true.

Let B,B′ be two decision problems. A reduction from B to B’ is an algorithm f computing a function (also denoted f ) from X to X’ such that: B(x) is true ⇔B’(f(x)) is true.

The Dependency Implication Problem

We recall the class of dependencies that are used. Let R be a relation of arity n. Then:

Theorem 9.6.2 (Undecidability of the Dependency Implication Problem). Let R = {R₁,...,R_n} be a relational schema. Given a set Σ of functional and inclusion dependencies and an inclusion dependency σ over relations in R, one cannot decide whether Σ⊧σ (i.e., whether each instance over R satisfying Σ also satisfies σ).

The problem is undecidable when Σ contains both functional and inclusion dependencies. Note that the implication problem is decidable for functional dependencies alone, and for inclusion dependencies alone. Undecidability arises when they are considered together.

Undecidability of the GLAV Query Answering Problem

The GLAV Query Answering Problem is to decide, given a PDMS

defined using a set of GLAV mappings and a query asked at one of the peers whether some particular tuple is in its answer.

Let us define a reduction from the Dependency Implication Problem to the GLAV Query Answering Problem. If we show that such a reduction exists, since the Dependency Implication Problem is undecidable, this will show that the GLAV Query Answering Problem is undecidable.

Surprisingly, we can show the reduction for a PDMS with a single peer. To do that, we will use some GLAV mapping of the form q@P ⊇q′@P, where both sides of the mapping involve the same peer. Note that the undecidability still holds if such “self” mappings are forbidden. Indeed, we can simulate such a mapping by using “clones” of relations. For instance, suppose that we want to enforce the mapping R@P(x₁,...,x_n) ⊇R′@P(y₁,...,y_n). Then we can use a dummy site P and a copy R@P of R@P with the mappings:

Let (Σ,σ) be an instance over {R₁,...,R_n} of the Dependency Implication Problem with Σ a finite set of functional and inclusion dependencies, and σ an inclusion dependency. We build a PDMS

defined as follows:

It is easy to see that the GLAV mappings force each R_i to satisfy the functional dependencies of R_i, and each R_i,R_j to satisfy the inclusion dependencies between R_i and R_j.

Let us assume that σ = R_i : i₁ ⊆R_j : j₁ for some R_i of arity n. (This is without loss of generality since the implication is already undecidable when σ is unary).

For instance if n = 3 and i₁ = 2, Ext(R_i) = {⟨1,1,1⟩,⟨1,1,2⟩,⟨2,1,1⟩,⟨2,1,2⟩,⟨2,1,3⟩}.

We construct an instance (

,Ext(R_i),q) of the GLAV Query Answering Problem where q is the query defined by q(x) :- R_j(y₁,...,x,...y_k) where the distinguished variable x is in position j₁, and the existential variables y_i are pairwise distinct.

We show that Σ⊧σ iff 1 is an answer to q in the PDMS

in which the only data is Ext(R_i).

(⇒) Suppose that Σ⊧σ. Let I be a model of Ext(R_i) satisfying the GLAV mappings of

. By construction of those GLAV mappings, I is a model of Σ. Because Σ⊧σ, I is a model of σ, and thus for each tuple ⟨a₁,...,a_n⟩ in I(R_i), there exists a tuple ⟨b₁,...,b_k⟩ in I(R_j) such that a_i₁ = 1 = b_j₁. Therefore, I⊧∃y₁,...,y_kR_j(y₁,...,1,...y_k), i.e., I⊧q(1). Thus 1 is an answer to q given the GLAV mapping of

and the extension Ext(R_i).

(⇐) Conversely, suppose that 1 is an answer to q given the GLAV mapping of

and the extension Ext(R_i). Note that 1 is also an answer to q if the extension of R_i is reduced to any tuple of the original Ext(R_i). Suppose that Σ⊭σ: there exists an interpretation I that satisfies Σ in which σ is not satisfied. This means that there exists a tuple ⟨e₁,...,e,...e_n⟩ (where e is in position i₁) in I(R_i) such that there does not exists a tuple in I(R_j) with the value e in position j₁. Let t be the tuple of Ext(R_i) which corresponds to the equality pattern between values of ⟨e₁,...,e,...e_n⟩. By extending I to interpret each value of t by the element e_i at the same position in ⟨e₁,...,e,...e_n⟩, we obtain a new interpretation I′ that satisfies Σ and thus each GLAV mapping of

, and R_i(t). Since 1 is an answer to q given the GLAV mapping of

and R_i(t), I′⊧q(1), i.e., I′(1) ∈I′(R_j[j₁]). Since I′(1) = e and I′(R_j) = I(R_j), it means that there exists a tuple in I(R_j) with the value e in position j₁, which contradicts our assumption that σ is not satisfied in I. Hence Σ⊧σ. □

9.6.2 Decentralized DL-LITE $R$

If we restrict the GLAV mappings in a PDMS to be inclusion statements that are expressible in DL-LITE $R$ , we get what we will call a DL-LITE $R$ PDMS. The decidability of query answering over a DL-LITE $R$ PDMS results from the algorithmic machinery described in the previous chapter for answering queries over DL-LITE $R$ knowledge bases. Given a query posed to a given peer, the application of the PerfectRef algorithm to the set of all the GLAV mappings in the PDMS provides a set of reformulations. The union of the answer sets obtained by evaluating each reformulation provides the answer set of the initial query. Note that a reformulation is of the form:

where the different conjuncts R_j@i_j(z_j) may refer to relations of different peer schemas. Therefore, the evaluation of each reformulation may require the interrogation of different peers and the combination of the answers returned by each such sub-queries.

This provides a centralized algorithm for computing the reformulations of answering queries over a decentralized DL-LITE $R$ knowledge base. We next present a decentralized algorithm that computes exactly the same thing, i.e., we present a decentralized version of the PerfectRef algorithm seen in the previous chapter in order to deploy effectively DL-LITE $R$ PDMSs that avoids having to centralize all the GLAV mappings.

We denote PerfectRef ⁱ(q) the reformulation algorithm running on the peer P_i applied to a query q (asked to the peer P_i). The main procedure is the decentralized reformulation of each atom of the query using the positive inclusion statements that are distributed over the whole PDMS. Let us denote AtomRef ⁱ(g) the reformulation algorithm running on the peer P_i to reformulate the atom g (built on a relation of the schema of the peer P_i).

Within each peer P_i we distinguish the local positive inclusion axioms of the form C_i ⊆D_i where C_i and D_i are built over relations in the schema of the peer P_i, from the mappings which are positive inclusion mappings of the form C_j ⊆D_i or D_i ⊆C_j where C_j denotes a relation of another peer P_j (while D_i refers to a relation in the schema of the peer P_i).

Let us denote LocalRef(g,PI_i) the result of the reformulation of the atom g using the set PI_i of local positive inclusion atoms of the peer P_i. We refer to the previous chapter (Definition 8.4.7, Section 8.4) for the computation of LocalRef(g,PI_i) by backward application of the local PIs.

We just recall here that gr(g,I) denotes the reformulation of the atom g using the positive inclusion axiom I. We also recall that the atoms g that can be found as conjunct of a query q over a DL-LITE $R$ PDMS are of the following forms:

Running the algorithm AtomRef ⁱ on the peer P_i for reformulating the atom g consists first in computing the set LocalRef(g,PI_i) of local reformulations of g, and then, for each mapping m with a peer P_j applicable to a local reformulation g′, in triggering the application of AtomRef ^j(gr(g′,m)) on P_j (by sending a message to P_j). Other peers P_k may be solicited in turn to run locally AtomRef ^k.

A loop may occur if a request of reformulation of an atom g initiated by a given peer P generates a branch of requests reaching a peer P′ which in turn requests P to reformulate g. Such loops can be easily handled by transmitting with every request the history of the current reasoning branch. More precisely, an history hist is a sequence [(g_k,P_k),…,(g₁,P₁)] of pairs (g_i,P_i) where g_i is an atom of a peer P_i such that for each i∈ [1..k- 1], g_i+1 is a reformulation of g_i using a mapping between P_i and P_i+1.

This is summarized in Algorithm 6, which is the atom reformulation algorithm with history running on Peer i.

AtomRefHistⁱ(g,hist)

Input: An atom g in the vocabulary of the peer P_i, an history hist

Output: The set of its reformulations in the PDMS: R

(1)R←∅

(2)if (g,P_i) ∈histreturn R

(3)else

(4) Let PI_i be the local PIs of the peer P_i

(5) Let M_i be the mappings between the peer P_i and other peers

(6) for each g′∈LocalRef(g,PI_i)

(7) for each mapping m∈M_i between P_i and a peer P_j applicable to g′

(8) R←R∪AtomRefHist^j(gr(g′,m),[(g,P_i)|hist])

Algorithm 6:

The decentralized algorithm with history for reformulating atoms

Algorithm 7 is the atom reformulation algorithm (denoted AtomRef ⁱ) running on peer P_i, which just calls AtomRefHistⁱ with an empty history.

The decentralized version of the PerfectRef algorithm that computes all the reformulations of a conjunctive query q is provided in Algorithm 8. The main difference with the centralized version is that the simplification of the produced reformulations (which is required for making some PIs applicable) are delayed after (decentralized) computation of the reformulations of all the atoms in the query.

We recall here the notation used for denoting the simplification of some atoms within a query under reformulation, which were introduced in the previous chapter when describing the PerfectRef algorithm:

For each atom in the query, it computes first (in the decentralized manner explained previously) the set of all of its reformulations, and then a first set of reformulations of the original query by building all the conjunctions between the atomic reformulations (denoted ⊕_i=1ⁿAtomRefⁱ(g_i) at Line 5). These reformulations are then possibly simplified by unifying some of their atoms (Lines 8 to 11), and the reformulation process is iterated on these newly produced reformulations until no simplification is possible (general loop starting on Line 4).

PerfectRefⁱ(q)

Input: a conjunctive query q over the schema of the peer P_i

Output: a set of reformulations of the query using the union of PIs and mappings in the PDMS

(1)PR := {q}

(2)PR′ := PR

(3)while PR′≠∅

(4) (a) foreach q′ = g₁ ∧g₂ ∧…∧g_n ∈PR′

(5)PR′′ := ⊕_i=1ⁿAtomRefⁱ(g_i)

(6)PR′ := ∅

(7)(b) foreach q′′∈PR′′

(8) foreach g′₁,g′₂ ∈q′′

(9)if g′₁ and g′₂ unify

(10)PR′ := PR′∪{τ(reduce(q′′,g′₁,g′₂))}

(11)PR := PR∪PR′∪PR′′

(12)return PR

Algorithm 8:

The decentralized PerfectRef algorithm running on the peer P_i

One can prove that the decentralized algorithm computes the same set of facts as the centralized one, and thus is correct. The proof results (1) from the observation that the centralized version of PerfectRefⁱ (in which AtomRefⁱ(g_i) is computed by iterating the one-step application of PIs on each atom g_i of the query) produces the same results than the original PerfectRef, and (2) from the completeness of AtomRefⁱ(g_i) ensuring the decentralized computation of all the reformulations of g_i.

9.7 Further reading

The Bucket and Minicon algorithms can be extended ( [113, 137]) to handle (union of) conjunctives queries with interpreted predicates. When a query q includes interpreted predicates, finding all answers to q given the LAV mappings is co-NP hard in the size of the data. This complexity result shows that answering such queries cannot be fully realized with a finite set of conjunctive rewriting (unlike what we showed here in absence of interpreted predicates). The Inverse-rule algorithm does not handle interpreted predicates but is able to build recursive query plans for data integration [59].

A survey on answering queries using views can be found in [87], and a survey on query containment for data integration systems in [124].

Distributed reasoning in a peer to peer setting has been investigated in [14] as a basis for querying distributed data through distributed ontologies [73, 1]. The subtle point that we have not treated in this chapter concerns consistency checking. In contrast with the centralized case, the global consistency of the PDMS cannot be checked at query time since the queried peer does not know all the peers in the PDMS. However, it can get the identifiers of the peers involved in a reformulation of the query. Then the (local) consistency of the union of the corresponding knowledge bases can be checked in a decentralized manner. The important point is that it can be shown that this local consistency is sufficient to guarantee that the answers obtained by evaluating the reformulations (computed by the decentralized algorithm that we have described) are well-founded.

The undecidability of the Dependency Implication Problem is shown in [42] even if σ is a unary inclusion dependency. More on this topic may be found in [9])

9.8 Exercices

Exercise 9.8.1 By applying the query containment algorithm (see Algorithm 4), determine which query is contained in which one among the three following queries. Are there equivalent queries ? (two queries qand q′are equivalent if qis contained in q′and q′is contained in q).

q1(x) :- A(x,y), B(x,y’), A(y,z’)

q2(x) :- A(x,y’), A(y’,z), B(x,x)

q3(x) :- B(x,y), A(x,y’), B(z,z’), A(y’,u)

Exercise 9.8.2 Consider a global schema defined by the following relations:

emp(E): E is an employee

phone(E,P): E has P as phone number

office(E,O): E has O as office

manager(E,M): M is the manager of E

dept(E,D): D is the department of E

Suppose that the three following data sources are available for providing data:

Source1 provides the phone number and the manager for some employees. It is modeled by the local relation s1(E,P,M).

Source2 provides the office and the department for some employees. It is modeled by the local relation s2(E,O,D).

Source3 provides the phone number of employees of the ’toy’ department. It is modeled by the local relation s3(E,P).

Model the content of these sources by GAV mappings.
Model the content of these sources by LAV mappings.
Consider the global query asking for Sally’s phone number and office:
q(x,y) :- phone(’sally’, x), office(’sally’, y)

Compute the reformulation of the query in terms of local relations:
- by applying the query unfolding algorithmto the GAV mappings of Question 1,
- by applying the Bucket algorithm to LAV mappings of Question 2.
Which algorithm is easier ?
Now Source1 disappears (becomes unavailable) and a new source comes in, that provides the phone number of their manager for some employees. Do the updates in the GAV and LAV mappings that are required to take into account these changes. What is the approach (GAV or LAV) for which updating the mappings between the global and local relations is easier ?

Exercise 9.8.3 Consider the three following LAV mappings:

V1(x) ⊆cite(x,y), cite(y,x)

V2(x,y) ⊆sameTopic(x,y)

V3(x,y) ⊆cite(x,z), cite(z,x), sameTopic(x,z)

Provide the FOL semantics of these LAV mappings
Suppose that the global relation cite(x,y) means that the paper x cites the paper y, and that the global relation sameTopic(x,y) means that the two papers x and y have the same topic. Suppose that each LAV mapping models the content of different available data sources. Express with an english sentence which information on papers each data source provides.
Apply in turn the Bucket, Minicon and Inverse-rule algorithms to compute the different rewritings of the following query asking for papers that cite and are cited by a paper having the same topic:
q(u) :- cite(u,v), cite(v,u), sameTopic(u,v)

¹The same holds for the Bucket or Inverse-rules algorithm.

FOL(m2):	∀S∀C∀U [ S2.Erasmus(S,C,U)⇒∃P∃U’ (
	EuropeanStudent(S)∧EnrolledInCourse(S,C)∧
	PartOf(C,P)∧OfferedBy(P,U)
	∧EuropeanUniversity(U)∧RegisteredTo(S,U’)∧U≠ U’) ]

MasterStudent(E) :-	Student(E), EnrolledInProgram(E,M),
	MasterProgram(M).

RegisteredTo(s,x)	EnrolledInProgram(s,p)	MasterProgram(p)

S3.CampusFr(s,v1,x)	S3.CampusFr(s,p,v2)	S1.Catalogue(v3,v4)
		S4.Mundus(p,v5)