9  Data Integration

 9.1  Introduction
 9.2  Containment of conjunctive queries
 9.3  Global-as-view mediation
 9.4  Local-as-view mediation
  9.4.1  The Bucket algorithm
  9.4.2  The Minicon algorithm
  9.4.3  The Inverse-rules algorithm
  9.4.4  Discussion
 9.5  Ontology-based mediators
  9.5.1  Adding functionality constraints
  9.5.2  Query rewriting using views in DL-LITER
 9.6  Peer-to-Peer Data Management Systems
  9.6.1  Answering queries using GLAV mappings is undecidable
  9.6.2  Decentralized DL-LITER
 9.7  Further reading
 9.8  Exercices

9.1  Introduction

The goal of data integration is to provide a uniform access to a set of autonomous and possibly heterogeneous data sources in a particular application domain. This is typically what we need when, for instance, querying the deep web that is composed of a plethora of databases accessible through Web forms. We would like to be able with a single query to find relevant data no matter which database provides it.

A first issue for data integration (that will be ignored here) is social: The owners of some data set may be unwilling to fully share it and be reluctant to participate in a data integration system. Also, from a technical viewpoint, the difficulty comes from the lack of interoperability between the data sources, that may use a variety of formats, specific query processing capabilities, different protocols. However, the real bottleneck for data integration is logical. It comes from the so-called semantic heterogeneity between the data sources. They typically organize data using different schemas even in the same application domain. For instance, each university or educational institution may choose to model students and teaching programs in its own way. A French university may use the social security number to identify students and the attributes NOM, PRENOM, whereas the Erasmus database about European students may use a European student number and the attributes fiRSTNAME, LASTNAME and HOME UNIVERSITY.

In this chapter, we study data integration in the mediator approach. In this approach, data remain exclusively in data sources and are obtained when the system is queried. One sometimes use the term virtual data integration. This is in contrast to a warehousing approach where the data is extracted from the data sources ahead of query time, transformed, and loaded in the warehouse. At query time, the warehouse is accessed but not the data sources. Warehouses approaches are typically preferred for very complex queries, e.g., for data mining. On the other hand, to have access to “fresh” information, a mediator approach is preferred since it avoids having to propagate in real time, data source updates to the warehouse. Figure 9.1 illustrates these two approaches of data integration.


PIC

Figure 9.1: Virtual versus Materialized data integration


In the mediator approach, one starts by designing a global schema (also called mediated schema) that serves as a unique entry point on which global queries are posed by users. A main issue is then to specify the relationships, namely semantic mappings, between the schemas of the data sources and the global schema. Based on these mappings, one can answer queries over the global schema using queries over the data sources. Typically, query answering in the mediator approach is performed as follows. First, independently of the data in the sources, the user’s query posed over the global schema is transformed into local queries that refer to the schemas of the data sources. A global query combines the data provided by sources. Queries are optimized and transformed into query plans. The local query plans are executed and their results combined by the global query plan.

In the following, for presentation purposes, we consider that the global schema and the schemas of the data sources to integrate are all relational. In practice, each non-relational data source (e.g., XML or HTML) is abstracted as a relational database with the help of a wrapper. Wrappers are small programs that translate local relational queries into appropriate requests understood by specific data sources, and transform their results into relations. The role of wrappers is to allow the mediator to see each data source as relational, no matter which actual format it uses.

Let us consider in more detail the specification of semantic mappings between the data sources and the global schema. Let S1, ..., Sn be the local schemas of n pre-existing data sources. To simplify the presentation, let us assume that each local schema Si is made of a single relation that we denote also Si. The relations S1, ..., Sn are called the local relations. Suppose the global schema G consists of the global relations G1, ..., Gm. The goal is to specify semantic relations between the local relations Si and the global relations Gj. The Gj are logically (intentionally) defined by the Si.

An example of simple relationship (not very interesting) is based on equality, e.g., G1 = S1. One can find more complicated relationships, e.g., G2 = S1 S2 or G3 = S1S3. In these last two examples, a global relation is defined as a query over the local relations. In other words, the global relation is a view of the local relations. Indeed, one typically prefers more flexible constraints such as G3 S1S3. Using containment instead of equality leaves open the possibility for other sources of providing data about G3, e.g., G3 σA=yes(S4). Because global relations are constrained by views of the local relations, one uses the term global-as-view for such specifications.

In a somewhat dual manner, one can use local-as-view constraints such as: S4 G1G3. This leaves even more flexibility since the contribution of each data source can be specified (e.g., by its owner) independently of the other sources of the system. This kind of autonomy is typically well-adapted to a Web setting.

More generally, to express semantic mappings between {S1,...,Sn} and {G1,...,Gm}, one can use inclusion statements, i.e., logical constraints, of the form v(S1,...,Sn) v(G1,...,Gm), where v and vare query expressions called views. All the constraints we consider in this chapter will be of this general form. Now, given an instance I of {S1,...,Sn} (i.e., an instance of the data sources), we don’t know the instance J of the global schema. But we know that:

                    ′
v(I(S1),...,I(Sn))⊆ v (J(G1),...,J(Gm ))
So, the story of mediator systems is essentially a story of logical constraints and incomplete information. In this general setting, given I, an answer to a global query q is a fact q(a) that is true in any instance J that together with I satisfies the mapping constraints, i.e., a fact we can be sure of as a logical consequence of both the data stored in I and of the logical constraints expressed by the mappings. Not surprisingly, query answering is thus a complex reasoning problem that in general may be undecidable. We focus on two particular decidable cases, for which rewriting algorithms have been designed and implemented. They are based on semantic mappings that capture typical constraints found in many applications:
Global-As-View (GAV for short).
The semantic mappings are of the form
Vi(S1,...,Sn) ⊆ Gi

also equivalently denoted

G  ⊇ V (S ,...,S )
  i   i  1    n

where each Vi is a view over the local schemas, i.e., a query built on local relations.

Local-As-View (LAV for short).
The semantic mappings are of the form
Si ⊆ Vi(G 1,...,Gm)
where each Vi is a view over the global schema, i.e., a query built on global relations.

In our development, we will consider conjunctive queries. Using negation in queries greatly complicates the issues. In the next section, we recall some standard material on containment of conjunctive queries, i.e., of the queries at the heart of our formal development. In Sections 9.3 and 9.4, we study GAV and LAV mediators, respectively. For each of these languages, we describe appropriate query rewriting algorithms. In Section 9.5, we show the impact on query rewriting of adding DL-LITE constraints in the global schema. Finally, in Section 9.6, we lay the basis of a peer-to-peer approach for data integration. In contrast with the mediator approach which offers a unique entry point to data, peer-to-peer data management systems (PDMS for short) are decentralized data integration systems.

9.2  Containment of conjunctive queries

In this section, we recall some basic notions on comparing conjunctive queries that we will use in the following.

We recall that a conjunctive query is an expression of the form:

q(x1,...xn) :-A1(u⃗1),...,Ak (⃗uk)
where each Ai is an relation, u1,...,uk are vectors of constants and variables. Furthermore, we require that each xi occurs in some ui. q(x1,...xn) is called the head and A1(u1),...,Ak(uk) the body of the query. The xi variables are called distinguished. The other variables are called existential.

Given an instance I of the relations appearing in the body of the query, an answer is a tuple ν(x1),...,ν(xn)for some valuation ν of the variables in the query, such that for each i, Ai(ν(ui)) holds in I. We denote q(I) the set of answers.

We sometimes denote this query q(x1,...xn) when its body is understood. Observe that the interpretation of such a conjunctive query in logical terms is:

{x1,...,xn |∃y1,...,∃ym (A 1(u⃗1) ∧ ...∧Ak (⃗uk))}
where y1,...,ym are the variables not occurring in the head.

The data integration techniques rely on conjunctive query containment. This problem has been extensively studied because it is at the core of query optimization. We use known techniques that we recall next.

A query q1 is contained in q2, denoted q1 q2, if for each I, q1(I) q2(I). It is known that the containment between a conjunctive query q1 and a conjunctive query q2 can be tested by finding a “homomorphism” from q2 to q1.

Definition 9.2.1 Let q1(x1,...,xn) and q2(y1,...,yn) be two conjunctive queries. A (conjunctive query) homomorphism from q2 to q1 is a mapping ψ from the variables of q2 to the variables of q1 such that:

  1. For each i, ψ(yi) = xi; and
  2. For each atom R(ui) in the body of q2, R(ψ(ui)) is in the body of q1.

Example 9.2.2 Consider the following queries:

Consider a mapping ψ such that ψ(yi) = xi for each i, ψ(y1) = x1 and ψ(y3) = x3. Then the required conditions hold, and it follows that q1 q2. Intuitively, q2 joins A1 and A2 on the second attribute, whereas q1 also joins on the third one. The additional condition induces the containment.

The following proposition states that the existence of a homomorphism is a necessary and sufficient condition for query containment.

Proposition 9.2.3 (Homomorphism theorem) Let q1 and q2 be two conjunctive queries. Then q1 is contained in q2 if and only if there exists a homomorphism from q2 to q1.

This provides a simple algorithm for testing conjunctive query containment. In the general case, deciding whether a conjunctive query is contained in another one is NP-complete in the size of the two queries. In fact, in many practical cases, there are polynomial-time algorithms for query containment.

Algorithm 4 checks whether a query q1 is contained in a query q2.


QC(q1,q2)

Input: Two conjunctive queries:

     q1(x) :- g1(x1),,gn(xn)

     q2(y) :- h1(y1),,hm(ym)

Output: Yes if q1 q2; no otherwise

(1)freeze q1: construct a canonical instance Dcan = {gi(ν(xi))1 in}

(2)     for some valuation ν mapping each variable in q1

(3)     to a distinct constant

(4)if ν(x) q2(Dcan) return yes

(5)else return no.

Algorithm 4: The Query containment algorithm

Example 9.2.4 Consider the queries of Example 9.2.2. The canonical instance Dcan is A1(a,b,c),A2(a,b,c). It is easily verified that q2(Dcan) = (a,a), which is ν(x,x).

9.3  Global-as-view mediation

The main advantage of GAV is its conceptual and algorithmic simplicity. The global schema is simply defined using views over the data sources and specifies how to obtain tuples of the global relation Gi from tuples in the sources.

Definition 9.3.1 (GAV mapping) A GAV mapping is an expression of the form: R(x1,...,xn) q(x1,...,xn), where q(x1,...,xn) :- A1(u1),...,Ak(uk) is a conjunctive query of the same arity as R. The semantics of this mapping is:

∀x ,...,x (∃y ,...,y  (A (u⃗),...,A  (⃗u )⇒  R(⃗u)))
  1    n    1    m  1  1      k  k
where y1,...,ym are of variables occurring in the body of the rule and not its head.

We write alternatively this GAV mapping as:

R (x1,...,xn)⊇  A1(⃗u1),...,Ak (u⃗k)
R (x1,...,xn)⊇  q(x1,...,xn)
R ⊇  q
by omitting information that is either not needed or that is clear from the context. When we want to stress which are the existential variables, we write it R(x) q(x,y) where y is the vector of existential variables.

Example 9.3.2 Consider the following four data sources:

Now, suppose we define a global schema with the following unary and binary relations:

MasterStudent(studentName), University(uniName),
MasterProgram(title), MasterCourse(code),
EnrolledIn(studentName,title), RegisteredTo(studentName, uniName).

These relations are defined in terms of the local relations by the following GAV mappings:

MasterStudent(N)S2.Erasmus(N,C,U), S4.Mundus(P,C)
MasterStudent(N)S3.CampusFr(N,P,U), S4.Mundus(P,C)
University(U)S1.Catalogue(U,P)
University(U)S2.Erasmus(N,C,U)
University(U)S3.CampusFr(N,P,U)
MasterProgram(T)S4.Mundus(T,C)
MasterCourse(C)S4.Mundus(T,C)
EnrolledIn(N,T)S2.Erasmus(N,C,U), S4.Mundus(T,C)
EnrolledIn(N,T)S3.CampusFr(N,T,U), S4.Mundus(T,C)
RegisteredTo(N,U)S3.CampusFr(N,T,U)

Note that in a warehousing approach, one would simply evaluate all the queries that define the global view, and populate the warehouse using standard relational query evaluation. In a mediator approach, we try to only derive data that is relevant to a specific query posed on the global view by a user. We show how to rewrite a global query into queries over the local relations and combine their results. This is achieved by a technical trick that consists in unfolding the atoms of the global query.

Observe the first two mappings. They specify (using joins) how to obtain tuples of the unary relation MasterStudent. Now consider the following global query asking for universities with registered master students:

q(x) :- RegisteredTo(s,x), MasterStudent(s)

The rewriting of this query into source queries is obtained by unfolding, i.e., by replacing each atom which can be matched with the head of some view, by the body of the corresponding view. (For readers familiar with logic programming, this is some very simple form of resolution.)

In the example, there is a single mapping whose head can be matched with RegisteredTo(s,x), and two mappings that match MasterStudent(s). Thus, we obtain the following two unfoldings:

q1(x) :- S3.CampusFr(s,v1,x), S2.Erasmus(s,v2,v3), S4.Mundus(v4,v2) 
q2(x) :- S3.CampusFr(s,v5,x), S3.CampusFr(s,v6,v7), S4.Mundus(v6,v8)

Observe that q2 can be simplified. Replacing the conjunction of its first two atoms by the single atom S3.CampusFr(s,v6,x) leads to an equivalent query. We thus obtain the following two GAV rewritings of the initial query:

r1(x) :- S3.CampusFr(s,v1,x), S2.Erasmus(s,v2,v3), S4.Mundus(v4,v2) 
r2(x) :- S3.CampusFr(s,v6,x), S4.Mundus(v6,v8)

The result is obtained by computing r1r2. Now, observe that each r is a conjunctive query. It can be optimized using standard query optimization to obtain an optimized physical query plan. Of course, the choice of the particular physical query plan that is selected depends on the statistics that are available and the capabilities of the sources. For instance, a plan may consist in querying S3 and then for each value a of v6 (i.e., a particular university program), asking the query q(X) :- S4.Mundus(a,X) to S4.

We now formalize the simple and intuitive notion of query unfolding.

Definition 9.3.3 (Query unfolding) Let q(x) :- G1(z1),,Gn(zn) be a query and for each i, Gi(xi) qi(xi,yi) be a GAV mapping. An unfolding of qis the query uobtained from qby replacing, for each i, each conjunct Gi(zi) by qi(ψi(xi,yi)) where ψi is a function that maps xi to zi, and the existential variables yi to new fresh variables.

The renaming of the existential variables into fresh ones is necessary to avoid the introduction of unnecessary constraints in the unfolding. Indeed, consider an existential variable y occurring in two distinct atoms, say Gi and Gj. Then, the two atoms should be understood as ...y...Gi(zi) and ...y...Gj(zj). The scopes of y in both are disjoint and nothing requires that the two occurrences of y take the same value. Hence the renaming using fresh variables.

Example 9.3.4 Suppose we have the two mappings:

F (x,y) ⊇ S(x,z),S(y,z)    G(x) ⊇ S(x,y)
and the query q(x) :- F(x,y),G(y). Then we get the following unfolding:
q(x) :- S(x,v1),S(y,v1),S(y,v2)
The variable v1 corresponds to the renaming of the existential variable z in the view defining F, whereas v2 comes from the renaming of the existential variable yin the view defining G.

We next establish that each unfolding of a query computes a part of the desired results, and that their union computes the whole set of answers. To do so, we use two propositions. The first one ignores unfolding and focuses on the “materialization” of the global relations.

Proposition 9.3.5 Let S1,...,Sn be a set of source relations; G1,...,Gm a global schema defined by a set Gof GAV mappings over S1,...,Sn; and I be an instance over S1,...,Sn. Let J be the instance over G1,...,Gm defined by, for each j,

J(Gj )= ∪ {V(I) |Gj ⊇ V(S1,...,Sn) ∈ G}
Then for each query qover G1,...,Gm, the answer of qis q(J).

Proof (sketch). Let u be an answer. Then, by definition, q(u) is true in each instance Jover G1,...,Gm such that I and Jtogether satisfy the mappings. In particular, u belongs to q(J). Conversely, let u be in q(J). Let Jbe an instance such that I and Jtogether satisfy the mappings. Since Jsatisfies the mappings, J J. Since conjunctive queries are monotone, q(J) q(J). Thus uJ. Since u belongs to all such J, u is an answer.

The second proposition deals with unfoldings.

Proposition 9.3.6 Let S be a set of source relations and G a set of global relations defined by a set Gof GAV mappings over S. Consider the query q(z) :- Gi1(zi1),,Gin(zin) over G and the set {r} of unfoldings of qgiven G. Then for each I over S1,...,Sn, the answer of qis given by r(I).

Proof (sketch). Let J be as in Proposition 9.3.5. By the same proposition, it suffices to show that q(J) = r(I).

Soundness.
Let u∈∪r(I). Then u is r(I) for some unfolding r. Suppose r results from the unfolding defined by selecting for each j, the mapping Gij(xij) qij(xij,yij). It follows that uq({u1},...,{un}) where for each j, uj is derived by Gij(xij) qij(xij,yij). Thus, each uj is in J(Gij) and uq(J(Gi1),...,J(Gin)) = q(J). Therefore, r(I) q(J).
Completeness.
Conversely, consider u in q(J). Then, there exists u1 in J(Gi1), ..., uj in J(Gij), ...un in J(Gin) such that u q({u1},...,{un}). By construction of J, for each j there is some mapping Gij(xij) qij(xij,yi-1) such that uj is in qij(xij,yi-1). Consider the unfolding r defined by selecting for each j, this particular mapping. One can verify that u is r(I). Hence, u∈∪r(I) and q(J) ⊆∪r(I).

We can compute the answer using the unfoldings (also called the GAV rewritings). These unfoldings can be simplified by removing redundant conjuncts that may have been introduced by the technique. This simplification relies on checking conjunctive query containment. Given a conjunctive query with body A1(u1),...,Am(um), we verify whether each query obtained by removing some Ai(ui) is equivalent to the initial one. If yes, the atom is redundant and can be removed. We keep doing this until the query is “minimal”. This simplification test is costly but the resulting query may be much less expensive to evaluate that the initial one.

We must evaluate all the unfoldings to obtain the complete answer. If we are aware of some constraints on the local schemas or on the global one, this can be further simplified. For instance, the constraints may imply that the result of a particular unfolding is empty, in which case this particular unfolding needs not be evaluated. Also, the constraints may imply that the result of some unfolding, say r, is always included in another one. Then r needs not be evaluated. For instance, in the previous example, if it is known that students obtained from the source S2 are European students, while those obtained from the source S3 are non European students, we can be sure that the GAV rewriting r obtained by unfolding will not provide any answer. This requires expressing and exploiting disjointness constraints over the local relations. Inclusion constraints on local relations would, on the other hand, permit to detect in advance that a given query plan provides answers that are redundant with those obtained by another query plan.

A main limitation of GAV is that adding or removing data sources to the integration system may require deeply revising all the views defining the global schema. In a Web context where sources may come and go, e.g., because of (non) availability of servers, this is really too constraining. The LAV approach does not suffer from this disadvantage.

9.4  Local-as-view mediation

The LAV approach takes a dual approach. The local relations are defined as views over global relations. The goal is to define the global schema in such a way that individual definitions do not change when data sources join or leave the integration system except for the definitions of the sources that are involved in the change.

Definition 9.4.1 (LAV mapping) A LAV mapping is a mapping of the form: S q, for some conjunctive query q(x1,...,xn) :- A1(u1),...,Ak(uk) over the global relations. Its semantics is:

∀x 1,...,xn[S (x 1,...,xn)⇒  (∃y1,...,ymA 1(⃗u1),...,Ak(⃗uk))]
where y1,...,ym are the existential variables.

Again, S(x1,...,xn) is called the head of the view, whereas A1(u1),...,Ak(uk) is called the body of the view.

Example 9.4.2 We define the global schema as consisting of the following relations:

Student(studentName), EuropeanStudent(studentName),
University(uniName), NonEuropeanStudent(studentName),
FrenchUniversity(uniName), EuropeanUniversity(uniName),
NonEuropeanUniversity(uniName), Program(title),
MasterProgram(title), EnrolledInProgram(studentName,title),
Course(code), EnrolledInCourse(studentName,code),
PartOf(code, title), RegisteredTo(studentName, uniName),
OfferedBy(title, uniName).

The four data sources considered in the previous example can be described by the following LAV mappings:

m1: S1.Catalogue(U,P) FrenchUniversity(U), Program(P), OfferedBy(P,U),
OfferedBy(P’,U), MasterProgram(P’)
m2: S2.Erasmus(S,C,U) Student(S), EnrolledInCourse(S,C), PartOf(C,P),
OfferedBy(P,U), EuropeanUniversity(U),
EuropeanUniversity(U’)RegisteredTo(S,U’),
U U’
m3: S3.CampusFr(S,P,U) NonEuropeanStudent(S), Program(P),
EnrolledInProgram(S,P), OfferedBy(P,U),
FrenchUniversity(U), RegisteredTo(S,U)
m4: S4.Mundus(P,C) MasterProgram(P), OfferedBy(P,U),
OfferedBy(P,U’), EuropeanUniversity(U),
NonEuropeanUniversity(U’), PartOf(C,P)

LAV mappings enable quite fine-grained descriptions of the contents of data sources. For example, we are able to specify precisely the students that can be found in the Erasmus source: they are European students enrolled in courses of a given (European) university that is different from their home (European) University in which they remain registered.

LAV mappings express loose coupling between local and global relations, which is important for flexibility and robustness when the participating data sources change frequently. If we are interested in Master students, we do not need to know in advance (unlike the GAV approach) how to join two sources. We just define them as a global query:

MasterStudent(E) :-  Student(E), EnrolledInProgram(E,M),
MasterProgram(M).

The local sources that must be queried and combined to get the Master students will be discovered by the rewriting process. Recall that, in the GAV approach, they were predefined by the two mappings given in Example 9.3.2.

The price to pay for the flexibility of LAV compared to GAV is that the rewritings are more complicated to find. We describe three algorithms that achieve this rewriting. The Bucket algorithm and the Minicon algorithm follow the same approach. They first determine the local relations that are relevant to the query, then consider their combinations as candidate rewritings and verify whether they are indeed correct. Minicon is actually an optimization of Bucket that avoids the last verification step by a trickier first step. The third algorithm, namely the Inverse-rules algorithm, follows a completely different approach: it consists in transforming the logical rules supporting the LAV mappings (which are unsafe rules) into a set of safe rules with a single global relation. The global query is unfolded using these rules.

9.4.1  The Bucket algorithm

The principle of the Bucket algorithm is quite simple. It proceeds in three steps:

  1. the first step constructs for each atom g of the global query body its bucket, which groups the view atoms from which g can be inferred;
  2. the second step consists in building a set of candidate rewritings that are obtained by combining the view atoms of each bucket;
  3. in a last step, we check whether each candidate rewriting is valid.

Bucket creation

Let g be a query atom. The atoms in bucket(g) are the heads of mappings having in their body an atom from which g can be inferred. Intuitively, data comes from source relations, and a (global) query atom is satisfied by (local) data only if it can be matched to a (global) atom in the body of a mapping whose head can be matched to source facts. A match between g and some atom in the body of a mapping is thus an indication that the corresponding data source provides a relevant information for this particular query.

There is an extra constraint that has to be considered to guarantee that g can indeed be logically inferred, as illustrated next. In fact, the bucket of a query atom g includes a view atom v only if an atom in the body of v can be matched with g by a variable mapping such that the variables mapped to the distinguished variables of g are also distinguished variables in the view defining the mapping.

Let us illustrate this on an example. Consider the LAV mappings of Example 9.4.2, and the global query:

q(x) :- RegisteredTo(s,x), EnrolledInProgram(s,p), MasterProgram(p)

Let us consider the query atom g= RegisteredTo(s,x), in which the variable x is distinguished.

We can find two mappings (m2 and m3) in which a body atom can be matched to
RegisteredTo(s,x).

First, consider the mapping m3:

m3: S3.CampusFr(S,P,U) NonEuropeanStudent(S), Program(P),
EnrolledInProgram(S,P), OfferedBy(P,U),
FrenchUniversity(U), RegisteredTo(S,U)

The atom RegisteredTo(s,x) matches the atom RegisteredTo(S,U) with the variable mapping {S/s,U/x}, where U is distinguished in the view defining the mapping (it occurs in the head of this LAV mapping).

Therefore, applying the variable mapping {S/s,U/x} to the head S3.CampusFr(S,P,U) of the mapping m3 enforces the matching of RegisteredTo(S,U) with the query atom RegisteredTo(s,x), and then:

S3.CampusFr(s,v1,x)FOL(m3) sRegisteredTo(s,x)

Thus S3.CampusFr(s,v1,x) is added in Bucket(g). Note that v1 is simply a fresh variable mapped to the variable P appearing in S3.CampusFr(S,P,U) but not in the variable mapping {S/s,U/x}.

On the other hand, consider the mapping m2:

m2: S2.Erasmus(S,C,U) Student(S), EnrolledInCourse(S,C), PartOf(C,P),
OfferedBy(P,U), EuropeanUniversity(U),
EuropeanUniversity(U’)RegisteredTo(S,U’), U U’

The match this time is between g = RegisteredTo(s,x) and RegisteredTo(S,U’) by the variable mapping {S/s,U/x}. The difference with the previous situation is that the variable U’ is existentially quantified in the view defining the mapping. Applying the variable mapping {S/s,U/x} to the head S2.Erasmus(S,C,U) of the mapping m2do not enforce the matching of RegisteredTo(S,U’) in its body with the query atom RegisteredTo(s,x).

More formally:

S2.Erasmus(s,v2,v3)FOL(m2) sRegisteredTo(s,x).

To see why, consider the LAV mapping m2 and its logical meaning FOL(m2):

FOL(m2): SCU [ S2.Erasmus(S,C,U)⇒∃PU’ (
EuropeanStudent(S)EnrolledInCourse(S,C)
PartOf(C,P)OfferedBy(P,U)
EuropeanUniversity(U)RegisteredTo(S,U’)U U’) ]

From the fact that S2.Erasmus(s,v2,v3), it follows that:

sU’RegisteredTo(s,U’).

However, this is a strictly weaker statement than sRegisteredTo(s,x) where x is fixed. We prove this next. Consider an instance I over the domain Δ = {s,v2,v3,v4,v5,x} defined by:

I(S2 .Erasmus  )= { ⟨s,  v2,  v3 ⟩}     I(EuropeanStudent    )= {s}
I(EnrolledInCourse    )=  {⟨s, v2 ⟩}  I(PartOf ) = {⟨v2,  v4⟩}
I(OfferedBy  ) = {⟨v4,  v3⟩}          I(EuropeanUniversity    ) = {v3,  v5}
I(RegisteredTo   )=  {⟨s, v5 ⟩}
By the valuation that instantiates respectively the variables S to the constant s, C to the constant v2, U to the constant v3, P to the constant v4 and U’ to the constant v5, we see that I satisfies the fact S2.Erasmus(s,v2,v3) and the formula FOL(m2), but that sRegisteredTo(s,x) is not satisfied in I.

As a consequence, S2.Erasmus(s,v2,v3) does not belong to the bucket and:

Bucket  (RegisteredTo   (s,x))=  {S3.CampusFr  (s ,v1 ,x)}.


Bucket(g, q, M)

Input: An atom g = G(u1,...,um) of the query q and a set of LAV mappings

Output: The set of view atoms from which g can be inferred

(1)Bucket(g) :

(2)for each LAV mapping S(x) q(x,y)

(3)    if there exists in q(x,y) an atom G(z1,...,zm) such that

(4)               zi is distinguished for each i such that ui is distinguished in q;

(5)    let ψ the variable mapping {z1/u1,....,zm/um}

(6)               extended by mapping the head variables in x not

(7)               appearing in {z1,...,zm} to new fresh variables;

(8)     add S(ψ(x)) to Bucket(g);

(9)return Bucket(g);

Algorithm 5: The Bucket algorithm

Algorithm 5 constructs the buckets. Proposition 9.4.3 is a logical characterization of the view atoms put in the buckets of the atoms of the global query.

Proposition 9.4.3 Let G(u1,...,um) be an atom of the global query. Let u be the (possibly empty) subset of existential variables in {u1,...,um}. Let m: S(x) q(x,y) be a LAV mapping. Then

S(v), FOL(m)uG(u1,...,um)

iff there exists a view atom in Bucket(g) that is equal to S(v) (up to a renaming of the fresh variables).

The proof is tedious and left as exercise.

In the worst-case, the Bucket algorithm applied to each atom of a query has a time complexity in O(N ×M×V) and produces N buckets containing each at most M×V view atoms, where N is the size of the query, M is the maximal size of the LAV mappings and V is the number of LAV mappings.

Returning to the example, we obtain by the Bucket algorithm, the following buckets for the three atoms of the query q.

RegisteredTo(s,x) EnrolledInProgram(s,p) MasterProgram(p)



S3.CampusFr(s,v1,x) S3.CampusFr(s,p,v2) S1.Catalogue(v3,v4)
S4.Mundus(p,v5)



Construction of candidate rewritings

The candidate rewritings of the initial global query are then obtained by combining the view atoms of each bucket. In the worst-case, the number of candidate rewritings is in O((M×V)N). For instance, in our example, we obtain two candidate rewritings for the query q:

r1(x) :- S3.CampusFr(s,v1,x), S3.CampusFr(s,p,v2), S1.Catalogue(v3,v4) 
r2(x) :- S3.CampusFr(s,v1,x), S3.CampusFr(s,p,v2), S4.Mundus(p,v5)

A candidate rewriting may not be a valid rewriting of the query. By Proposition 9.4.3, we only know that each candidate rewriting entails each atom of the query in isolation, i.e., without taking into account the possible bindings of the existential variables within the query.

It turns out that, in our example, the first candidate rewriting r1 is not a valid rewriting of the query q: the body of q is not logically entailed by the conjunction of the view atoms in the body of r1.

To see why, we first apply to each view atom in the body of r1 the corresponding LAV mapping to obtain the logical global expression (i.e., built on global relations). This step is called expanding r1, and its result, the expansion of r1. In our case, the expansion of r1 is the following query expression:

Exp_r1(x) :- NonEuropeanStudent(s), Program(v1), EnrolledInProgram(s,v1), 
             OfferedBy(v1,x), FrenchUniversity(x), RegisteredTo(s,x), 
             Program(p), EnrolledInProgram(s,p), OfferedBy(p,v2), 
             FrenchUniversity(v2), RegisteredTo(s,v2), 
             FrenchUniversity(v3), Program(v4),OfferedBy(v4,v3), 
             OfferedBy(v5,v3), MasterProgram(v5)

Note that new existential variables may be introduced by the expansion of some view atoms. For instance, the LAV mapping defining S1.Catalogue(v3,v4) contains the existential variable denoted P’ in the LAV mapping definition. Such variables are renamed with new fresh variables to avoid unnecessary constraints between the variables. In our example, this corresponds to variable v5 in the body of Exp_r1(x).

To check whether a rewriting is correct, it suffices to check with the Conjunctive Query Containment algorithm whether the query Exp_r1(x) is contained in the query q(x). For each variable v, let the corresponding constant, i.e., ψ(v), be "v". The canonical database obtained from r1 is given in Figure 9.2.


NonEuropean- Program EnrolledIn- OfferedBy French- RegisteredTo Master-
Student Program University Program







"s" "v1" ("s", "v1") ("v1", "x") "x" ("s", "x") "v5"
"p" ("s", "p") ("p", "v2") "v2" ("s", "v2")
"v4" ("v4", "v3") "v3"
("v5", "v3")








Figure 9.2: The canonical database resulting from freezing r1


The evaluation of q(x) over this canonical database yields an empty result because there is no way of assigning the existential variables s and p to constants of the canonical database which satisfies the binding of the existential variable p between the two last atoms of the body of the query.

Expanding the rewriting r2 and checking that it is contained into the query q is left in exercise. This shows that among the two candidate rewritings, only r2 is a valid rewriting:

r2(x) :- S3.CampusFr(s,v1,x), S3.CampusFr(s,p,v2),S4.Mundus(p,v5)

Remark 9.4.4 In spite of the apparent redundancy of the two first atoms, this rewriting cannot be simplified to

  r2.1(x) :- S3.CampusFr(s,p,x), S4.Mundus(p,v5)

It is true that r2.1(x) is contained into r2(x). However, the two queries are not equivalent. For some data sets, it may be the case that there is a student s and there is a university x such that (based on S3.CampusFr), s is registered in x and also enrolled in a Mundus master program offered by another university. The containment would hold under a constraint that would forbid a student to be registered in more than one universities.

One can prove that each rewriting finally obtained does indeed provide answers and that their union constitutes the complete answer.

9.4.2  The Minicon algorithm

The idea underlying Minicon is to avoid putting in a bucket an atom that will only generate invalid rewritings. As we saw in the discussion of Bucket, the reason for an atom to be useless is that its binding of a variable does not match with the binding of other occurrences of that variable. This explains why a candidate rewriting (like r1) is not valid.

We now illustrate the Minicon algorithm by example. Consider the query q:

q(x) :- U(y,z), R(x,z), T(z,y), R(y’,x)

and the two LAV mappings:

V1(u,v)T(w,u), U(v,w), R(v,u)
V2(u,v,v’)T(w,u), U(v,w), R(v’,w)

Minicon proceeds in two steps that correspond to the first two steps of Bucket.

First step of Minicon: creation of MCDs

Minicon scans each atom in the query, but instead of creating buckets for them, it builds MCDs (short name for Minicon Descriptions). The first iteration of Minicon determines the relevance of the different LAV mappings to rewrite the first query atom U(y,z):

The last iteration of building MCDs corresponds to the last query atom: R(y’,x). The LAV mapping V1 has in its expansion the atom R(v,u) that can be matched to it by the variable mapping {v/y,u/x)}. Since the distinguished variable x in the query is assigned to the distinguished variable (same condition as for adding to a bucket), and since the existential variable y’ of the query atom has a single occurrence in the query, the following MCD is created:

MCD2 = ((V1(x,y’), {4})

In contrast, there is no MCD created for R(y’,x) with the second LAV mapping: in the variable mapping {v/y,w/x)} that allows to match the query atom R(y’,x) with the atom R(v’,w) in the expansion of V2, the distinguished variable x in the query is assigned to the variable w which is not distinguished in the expansion of of V2. As for adding a view atom in a bucket, a MCD is created for a query atom g only if the variables mapped to the distinguished variables of g are also distinguished variables in the view defining the mapping.

.

Second step of Minicon: combination of the MCDs

The second step of Minicon replaces the combination of the buckets by the combination of the MCDs. More precisely, the rewritings are obtained by combining MCDs that cover mutually disjoint subsets of query atoms, while together covering all the atoms of the query.

Because of the way in which the MCDs were constructed, the rewritings obtained that way are guaranteed to be valid. No containment checking is needed, unlike in the Bucket Algorithm. In our example, we would therefore obtain as single rewriting of q(x):

r(x) :- V2(y,y,x), V1(x,y’)

9.4.3  The Inverse-rules algorithm

This algorithm takes a radically different approach. It transforms the LAV mappings into GAV mappings (called inverse rules) so that the complex operation of query rewriting using LAV mappings can then be replaced by the much simpler operation of query unfolding. A LAV mapping is replaced by several GAV mappings, one for each atom in the body of the rule. The subtlety is to keep bindings between the different occurrences of the same existential variable in the body. This is realized using a simple trick from first-order logic, namely by introducing Skolem functions.

Let us explain the Inverse-rules algorithm on the example we used for Minicon. A first important point that distinguishes it from the Bucket and Minicon algorithms is that the Inverse-rules algorithm is independent of the query. It only considers as input the set of LAV mappings:

V1(u,v)T(w,u), U(v,w), R(v,u)
V2(u,v,v’)T(w,u), U(v,w), R(v’,w).

Consider the first LAV mapping and recall that its logical meaning mapping is the formula:

∀u ∀v [V1 (u,v )⇒  ∃w (T(w,u ) ∧ U(v,w ))∧ R (v,u ))]
Suppose we know that (a,b) belongs to the source relation V1. From the fact V1(a,b), we can infer the fact R(b,a), i.e., that the tuple (b,a) is in the extension of the global relation R, and thus that, for instance, b is an answer for the global query q(x) :- R(x,y).

But we can infer much more. We can also infer that there exists some constant d1 such that T(d1,a) and U(b,d1) are both true. We do not know the exact value of that constant d1, but we know it exists and that, in some way, it depends on the constants a,b. Since this dependency comes from the first rule, we denote this unknown d1 value: f1(a,b).

Creating the inverse rules This motivates the construction of three following GAV mappings for which we give also the FOL translation.

IN11 :V1 (u,v )⊆  T(f1 (u,v ),u)  FOL (IN11 ) :∀u ∀v[V1(u,v ) ⇒  T(f1 (u,v ),u) ]

IN12 :V1 (u,v )⊆  U(v,f1 (u,v ))  FOL (IN12 ) :∀u ∀v[V1(u,v ) ⇒  U(v,f1 (u,v )) ]
IN13 :V1 (u,v )⊆  R(v,u )         FOL (IN13 ) :∀u ∀v[V1(u,v ) ⇒  R(v,u )]
They are called the inverse rules of the corresponding LAV mapping.

In the previous rules, the symbol f1 is a Skolem function of arity 2, and f1(u,v) is a Skolem term denoting some constant that depends on the values instantiating the variables u,v. Given two distinct Skolem terms, e.g. f1(1,2) and f1(2,v3), we cannot tell whether they refer to the same constant or not.

The Inverse-rules algorithm just scans the LAV mappings and creates n GAV mappings for each LAV mapping having n atoms. The result of this algorithm applied to the second LAV mappings in the example is:

IN21  :V2(u,v,v  ’) ⊆ T(f2 (u,v,v ’ ),u)
IN22  :V2(u,v,v  ’) ⊆ U(v,f2 (u,v,v  ’))
IN23  :V2(u,v,v  ’) ⊆ R(v ’,f2 (u,v,v ’) )

Obtaining the rewritings by unfolding: The rewritings of any global query is now obtained by unfolding the query atoms using the (Inverse-rules) GAV mappings corresponding to the initial set of LAV mappings. The unfolding operation here is a bit trickier than the unfolding defined in Definition 9.3.3, because of the Skolem terms. In Definition 9.3.3, the unfolding was based on matching each query atom G(x1, .., xm) with an atom (in the right-hand side of a GAV mapping) of the form G(z1, .., zm) by equating each pair (zi, xi) of variables. Proposition 9.3.6 showed that unfolding each atom of the query in isolation builds valid rewritings of the query, i.e., conjunctions of view atoms which logically implies the conjunction of the query atoms. It is not the case anymore when atoms in the right-hand side of GAV mappings contain Skolem terms.

The unification of two atoms with functions is more complex than just equating variables, and it may fail. It may require the substitution of some variables with functional terms (in our case, Skolem terms). This may make impossible to unify the other atoms of the query with atoms in the right-hand side of GAV mappings.

Let us illustrate on our example the subtleties of unfolding queries in presence of functional terms. Consider again the same query q:

q(x) :- U(y,z), R(x,z), T(z,y), R(y’,x).

The query atom U(y,z) can be unified with the atom U(v,f1(u,v)) in the right-hand side of the GAV mappings IN12 using a so-called most general unifier (mgu). In this case, the mgu is the substitution:

σ = {y/v1,  v/v1,   z/f1 (v2,v1 ),  u/v2 }
where v1 and v2 are new fresh variables introduced in order to avoid name conflict between variables that would generate unnecessary constraints. The substitution σ is a unifier of the two expressions U(y,z) and U(v,f1(u,v)) because the replacement in the two expressions of the occurrences of the variables y, v, z and u by the corresponding term (variable or Skolem term) in σ results in two identical expressions:
σ (U (y,z ))=  σ(U(v,f1 (u,v )) )

This substitution that makes the unfolding of the first query atom possible, now constrains the other occurrences in the query of the variables y and z for the unfolding of the other query atoms. After the application of σ to the whole body of the query and the unfolding of the first query atom made possible by σ, we obtain the following (partial) query rewriting:

pr1(x) :- V1(v2,v1), R(x,f1(v2,v1)), T(f1(v2,v1),v1), R(y’,x).

The unfolding of the second atom R(x,f1(v2,v1)) yields V1(f1(v2,v1),x), and we obtain the (partial) rewriting:

pr2(x) :- V1(v2,v1), V1(f1(v2,v1),x), T(f1(v2,v1),v1), R(y’,x).

It is useless to continue unfolding the remaining query atoms of pr2(x). As soon as a given unfolding has produced a view atom with Skolem terms, we can be sure that the evaluation of the query plan under construction will not produce any answer: there is no way to match V1(f1(v2,v1),x) with any fact in the data source which are of the form V1(a,b) where a,b are constants. Since we don’t know f1(v2,v1), there is absolutely no reason to believe that it is equal to a.

Using the inverse rule IN23 to unfold R(x,f1(v2,v1)) does not help because unifying R(x,f1(v2,v1)) and R(v’,f2(u,v,v’)) fails because of the two different Skolem functions. Thus, the (partial) rewriting issued from unfolding U(y,z) using the inverse rule IN12 is abandoned.

Let us try now to unfold U(y,z) using IN22 made possible by the substitution

 ′
σ = {y/v1,   v/v1,   z/f2 (v2,v1,v3  ),  u/v2,  v’/v3 }.
We obtain the following (partial) query rewriting:
pr’1(x) :- V2(v2,v1,v3), R(x,f2(v2,v1,v3)), T(f2(v2,v1,v3),v1), R(y’,x).

Now, unfolding R(x,f2(v2,v1,v3)) using the inverse rule IN23 is possible thanks to the substitution

σ′′ = {v ’/x,v3/x,u/v2,v/v1   }.
This leads to the (partial) rewriting:
pr’2(x) :- V2(v2,v1,x), V2(v2,v1,x), T(f2(v2,v1,x),v1), R(y’,x),

in which one of the first two atoms can be dropped.

Now, we examine the unfolding of the query atom T(f2(v2,v1,x),v1), which requires checking whether T(f2(v2,v1,x),v1) and T(f2(u,v,v’),u) are unifiable. This is the case thanks to the substitution {v2/v3,u/v3,v1/v3,v/v3,v/x}, which leads to the (partial) rewriting:

pr’3(x) :- V2(v2,v1,x), V2(v3,v3,x), R(y’,x),

Again, we can remove the first atom that is redundant and obtain the equivalent (partial) rewriting:

pr’4(x) :- V2(v3,v3,x), R(y’,x).

Finally the unfolding of R(y’,x) using IN23 leads to the final rewriting:

r1(x) :- V2(v3,v3,x), V1(x,y’).

9.4.4  Discussion

The three algorithms have the same (worst-case) complexity and they guarantee to provide the correct answer. Some experiments have shown that in practice Minicon outperforms both Bucket and Inverse-rules. The main advantage of the Inverse-rules algorithm over the Bucket and Minicon algorithms is that the step producing the inverse rules is done independently of the queries. Furthermore, the unfolding step can also be applied to Datalog queries, i.e., to recursive queries.

The common limitation of the three algorithms is that they do not handle additional knowledge (ontology statements) that can be known about the domain of application. In the next section, we see how to extend both the Local-As-Views and Global-As-Views approaches with DL-LITE ontologies, i.e., we consider global schemas that include constraints expressed as DL-LITE axioms.

9.5  Ontology-based mediators

We first show a negative result: as soon as we add functionality constraints over the global schema, the number of conjunctive rewritings of a query to be considered, may become infinite. This is a severe limitation for extending the LAV or GAV approaches since such constraints are rather natural. So these approaches to data integration fail when we consider the DL-LITEF dialect of previous chapters. On the positive side, we show how to extend the GAV and LAV approaches to constraints expressible in DL-LITER .

9.5.1  Adding functionality constraints

We illustrate on an example the problem raised by taking into account functionality constraints in the global schema. Let us consider a global schema with one unary relation C and two binary relations R and R’. In both R and R1, we impose that the first attribute is a key. Let us consider two LAV mappings:

V1:    S(P,N ) ⊆ R(P,A ),  R1(N,A )
V2:    V(P) ⊆ C (P)
and the following query:
q(x) :-R(x,z),R(x ,z),C (x ).
                1       1
The three previous algorithms (Bucket, Minicon, and Inverse-rules) would return no rewriting at all for q. The proof is left as an exercise. However, we next show that the following rewriting is valid:
r1(x) :-S(x,v1),S(x1,v1),V(x1)

To prove it, we expand r1(x) and show that the resulting expansion together with the logical axiom expressing the functionality of R1 logically implies the conjunction of atoms in the body of the query. The expansion of r1(x) is:

Exp_r (x) :-R(x,y ),R1(v ,y ),R (x,y′),R1(v ,y′),C(x )
     1          1      1  1     1 1      1  1     1
Now, if we ignore the functional dependencies, it is not true that Exp_r1 q. But knowing them, the inclusion holds. Indeed, the logical axiom expressing the functionality of R1 is:
∀y∀z1∀z2[R 1(y,z1) ∧ R1(y,z2)⇒  z1=  z2]
Therefore, it can be inferred from R1(v1,y1) and R1(v1,y1) in the body of Exp_r1(x) that y1 = y1, and thus:
Ex p_r1(x) :- R (x,y 1),R1(v1,y1),R(x1,y 1),R1(v1,y1),C(x1)
Hence Exp_r1 q with ψ mapping x,x1,z to x,x1,y1, respectively. Thus r1(x) is a valid rewriting of q(x).

It is important to note that to properly check containment, the standard query containment algorithm seen in the previous section would have to be modified in a standard manner to take into account functional dependencies. Intuitively, one would have to proceed pretty much as we did in the example, equating variables as implied by the functional dependencies.

It turns out that the situation is even more subtle. Surprisingly, this rewriting r1(x) is not the only one. In fact there exists an infinite number of different rewritings for q(x). Let k 2. The following query is a valid rewriting of q(x):

r (x):S(x,v ),S(x ,v ),S(x ,v  ),S(x   ,v   ),...,S(x ,v1),S(x ,v1),V (x )
 k         k     k k     k  k- 1    k-1  k-1        2       1        1

The expansion of rk(x) is:

Exp_rk(x) :- R(x,yk),      R 1(vk,yk),
             R(xk,y′k),     R 1(vk,y′k),
             R(xk,yk-1),   R 1(vk- 1,yk-1),
             R(xk-1,y′k-1), R 1(vk- 1,y′k-1),
             ...,           ...
             R(x ,y ),     R 1(v ,y ),
                2  1′            1 1′
             R(x1,y1),     R 1(v 1,y1),     C (x1).
To show that this expansion is logically contained in q, we exploit the axioms of functionality of both R and R1. Since R1 is functional, we get: yk = yk, and since R is functional, we get: yk = yk-1. By induction, we obtain yk = yk = yk-1 = yk-1 = = y1 = y1, and in particular: yk = y1. Thus Exp_rk q(x). This implies that rk(x) is a valid rewriting of q(x).

One can also show that for each k, each such rewriting may return answers that are not returned with k- 1. Thus, there exists an infinite number of non redundant conjunctive rewritings. The reader familiar with Datalog will observe that this infinite collection of rewritings can be captured in Datalog by the following recursive rewriting:

r(x) :- S(x,v),S (x 1,v),V (x 1)
r(x) :- S(x,v),P (v,u ),S(x 1,u),V (x 1)
P (v,u) :-S(z,v),S(z,u )
P (v,u) :-S(z,v),S(z,w ),P(w,u)
The question of building automatically such conjunctive rewritings is out of the scope of this book (see Section 9.7).

9.5.2  Query rewriting using views in DL-LITER

Querying data through DL-LITER ontologies has been detailed in the previous chapter. It has been shown how the positive and negative constraints expressed in the ontology are exploited both for data consistency checking and for query answering. In particular, the first step of query answering is the query reformulation step which is performed by the PerfectRef algorithm: using the positive constraints, called the PIs, it computes a set of reformulations, which are then evaluated over the data to produce the answer set of the original query. The negative constraints, called the NIs, are used to check data consistency, by translating each (declared or entailed) NI into a Boolean conjunctive query qunsat that must be evaluated over the data.

In this section, we show how to extend both the LAV and GAV approaches to rewrite queries in term of views when the global schema includes some DL-LITER Tbox.

Two observations explain how this can be realized:

  1. First, one can obtain the answer set of a query q(x) by computing the union of the answer sets returned by the evaluation over the local data sources of the (GAV or LAV) relational rewritings of each reformulation of q(x) as computed by PerfectRef(q(x),PI).
  2. The rewritings that are obtained may be inconsistent with the negative constraints NI declared or inferred in the Tbox. Therefore, the consistency of each rewriting r(x) has to be checked. This can be done by checking containment between the Boolean query xExp_r(x) (where Exp_r(x) is the expansion of r(x)) and each of the Boolean queries qunsat obtained from the NIs.

These two observations follow from the completeness of the PerfectRef and Consistent algorithms for DL-LITER presented in the previous chapter, and that of the rewriting algorithms of Sections 9.3 and 9.4; namely Unfolding for GAV, Minicon, Bucket or Inverse-rules for LAV.

The argument may be somewhat too abstract for some readers. We next illustrate these two points with examples. We use the global schema considered in Example 9.4.2 page 439, enriched with the DL-LITER Tbox of Figure 9.3. Note in particular that we add the subclass College of the class University, the subproperty EnrolledInCollege of the property RegisteredTo, for which the domain is declared as being the class MasterStudent. In addition, we add the property EnrolledInMasterProgram that we declare as a subproperty of the property EnrolledInProgram. Finally, we declare a mandatory property for the class College: any college must have students enrolled in it.


DL notation FOL notation


PIs:


MasterStudent Student MasterStudent(X) Student(X)
EuropeanStudent Student EuropeanStudent(X) Student(X)
NonEuropeanStudent Student NonEuropeanStudent(X) Student(X)
College University College(X) University(X)
FrenchUniversity University FrenchUniversity(X) University(X)
EuropeanUniversity University EuropeanUniversity(X) University(X)
NonEuropeanUniversity University NonEuropeanUniversity(X) University(X)
EnrolledInCollege MasterStudent EnrolledInCollege(X,Y) MasterStudent(X)
College ⊑∃EnrolledInCollege- College(X) ⇒∃YEnrolledInCollege(Y,X)
EnrolledInCollege RegisteredTo EnrolledInCollege(X,Y) RegisteredTo(X,Y)
MasterStudent ⊑∃EnrolledInMasterProgram MasterStudent(X) ⇒∃YEnrolledInMasterProgram(X,Y)
EnrolledInMasterProgram-MasterProgram EnrolledInMasterProgram(X,Y) MasterProgram(Y)
EnrolledInMasterProgram EnrolledInProgram EnrolledInMasterProgram(X,Y) EnrolledInProgram(X,Y)


NIs:


NonEuropeanStudent EuropeanStudent NonEuropeanStudent(X) EuropeanStudent(X)
NonEuropeanUniversity EuropeanUniversity NonEuropeanUniversity(X) EuropeanUniversity(X)
NonEuropeanUniversity FrenchUniversity NonEuropeanUniversity(X) FrenchUniversity(X)



Figure 9.3: A DL-LITER Tbox enriching the global schema of Example 9.4.2


GAV and DL-LITER

We revisit Example 9.3.2 by adding the data source S5 giving the list of French so-called Grandes Ecoles. Its local schema is made of the local relation: S5.GrandeEcole(nomEcole). According to this new source and also to the enriched global schema of Figure 9.3, we add the following GAV mappings to the ones already considered in Example 9.3.2:

College(U)S5.GrandeEcole(U)
EuropeanStudent(N)S2.Erasmus(N,C,U)
NonEuropeanStudent(N)S3.CampusFr(N,P,U)

Consider again the global query looking for universities with registered master students:

q(x) :- RegisteredTo(s,x), MasterStudent(s)

It is left as an exercise to show that the application of the PerfectRef(q(x), PI) algorithm returns, in addition to q(x) itself, the reformulation:

q1(x) :- College(x)

By unfolding q(x), we obtain the same two rewritings as in Example 9.3.2:

r1(x) :- S3.CampusFr(s,v1,x), S2.Erasmus(s,v2,v3), S4.Mundus(v4,v2) 
r2(x) :- S3.CampusFr(s,v6,x), S4.Mundus(v6,v8)

By unfolding the reformulation q1(x), we get the additional rewriting:

r3(x) :- S5.GrandeEcole(x)

It is important to note that even if we had the GAV mapping

College(U)S5.GrandeEcole(U),

the rewriting r3(x) would not have been obtained without reformulated first the initial query q(x) into q1(x).

Now, in contrast with the standard GAV approach, we have to check the consistency of each of these rewritings. To do so:

We illustrate the consistency check by checking the consistency of the rewriting r1(x). First, its expansion replaces each of its local atoms S(z) with the conjunction of global atoms of the form G(z) that can be produced by a GAV mapping G(x) S(x), if such GAV mappings exist. For expanding r1(x), we apply the following GAV mappings:

NonEuropeanStudent(N)S3.CampusFr(N,P,U)
University(U)S3.CampusFr(N,P,U)
RegisteredTo(N,U)S3.CampusFr(N,P,U)
EuropeanStudent(N)S2.Erasmus(N,C,U)
University(U)S2.Erasmus(N,C,U)
MasterProgram(T)S4.Mundus(T,C)
MasterCourse(C)S4.Mundus(T,C)

As a result, we obtain the following expansion for r1(x):

Exp_r1(x) :- NonEuropeanStudent(s), University(x), RegisteredTo(s,x), 
             EuropeanStudent(s), University(x), MasterProgram(v4), 
             MasterCourse(v2)

We then apply the Consistent algorithm. For this, we evaluate qunsat1, qunsat2 and qunsat3 over the body of Exp_r1(x) seen as a relational database; i.e., we freeze its atoms to obtain a canonical instance. Query qunsat1 returns true, so an inconsistency has been detected and the rewriting r1(x) is rejected.

LAV and DL-LITER

We revisit Example 9.4.2 by adding the same data source S5 as in Section 9.5.2. The GAV mapping is also a LAV mapping: S5.GrandeEcole(U)College(U)

Consider again the global query considered in Section 9.4.1:

q(x) :- RegisteredTo(s,x), EnrolledInProgram(s,p), MasterProgram(p)

It is left as an exercise to show that the application of the PerfectRef(q(x), PI) algorithm returns, in addition to q(x) itself, the following reformulation:

q1(x) :- College(x)

By applying the Minicon algorithm1 to the initial query q(x), we obtain the following rewriting (as shown in Section 9.4.1):

r2(x) :- S3.CampusFr(s,v1,x), S3.CampusFr(s,p,v2),S4.Mundus(p,v5)

By applying the Minicon algorithm to the reformulation q1(x) of the initial query, we obtain the additional rewriting:

r3(x) :- S5.GrandeEcole(x)

As for the extended GAV approach, the consistency of LAV rewritings is not guaranteed because of the NIs in the Tbox. We follow the same approach: at compile time, the closure of the NIs is computed and each (declared or inferred ) NI is compiled into a Boolean query qunsat. At query time, each of these qunsat queries is evaluated over the canonical instance corresponding to each rewriting.

9.6  Peer-to-Peer Data Management Systems

In contrast with the centralized mediator model, a Peer-to-Peer data management system (PDMS for short) implements a decentralized view of data integration, in which data sources collaborate without any central authority. In a PDMS, each collaborating data source can also play the role of a mediator, so is at the same time a data server and a client for other data sources. Thus each participant to the system is a peer and there are mappings relating data from the different peers. A PDMS architecture is therefore very flexible in the sense that there is no need for a global schema defining in advance a set of terms to which each data source needs to adhere. Over time, data sources can join or leave the PDMS just by adding or removing mappings between them. PDMS are inspired by P2P file sharing systems but they enable answering fine-grained queries. Like in the mediator model, answering queries is performed by reformulating queries based on the mappings, but in a decentralized manner.

Each peer in a PDMS has a peer schema composed of peer relations and peer mappings that relate its schema to the schemas of other peers. To avoid confusing relations from different peers, we assume that each relation of peer p is of the form r@p for some local relation name r. A query to a PDMS is posed using the peer schema of one of the peers. A query is asked to a particular peer, as a query over his particular schema. It is reformulated using the peer mappings into a set of queries that may refer to other peer relations. This captures the intuition that we want to use the information available in the entire P2P system to answer the query.

For designing the mappings, the distinction made in the mediator model between local and global relations does not make sense anymore, since each peer relation may play the role at different times both of a local relation and of a global relation. Therefore, the notions of GAV and LAV mappings are relaxed to the more appropriate symmetric notion of GLAV mappings.

Definition 9.6.1 (GLAV mapping) Let S@i and S@j be the peer schemas of two peers i and j. A GLAV mapping between these two peers is an inclusion axiom of the form: qi(x) qj(x), where qi(x) and qj(x) are conjunctive queries over the peer schema S@i, S@j, respectively.

Let qi(x,yi) and qj(x,yj) be the bodies of qi(x) and qj(x)), respectively. The semantics of the GLAV mapping qi(x) qj(x) is: x[yiqi(x,yi) ⇒∃yjqj(x,yj)].

In database terms, a GLAV mapping qi(x) qj(x) expresses that answers obtained by asking qi(x) at peer i should also be considered as answers to qj(x) asked at peer j. Note that with this semantics, each local query is assumed to be incompletely answered with local data since external data may bring in new information to it. As already mentioned, such an open-world assumption is fully appropriate for Web data.

We next show one negative and one positive result for PDMSs. In Section 9.6.1, we show that in general, answering queries with GLAV mappings is undecidable, so without further restriction, answering queries in a PDMS is undecidable. In Section 9.6.2, we show that if we restrict the peer mappings to be DL-LITER inclusion axioms, a decentralized version of the algorithm for DL-LITER can be used to answer queries in DL-LITER PDMSs.

9.6.1  Answering queries using GLAV mappings is undecidable

We show that the Dependency Implication Problem (more precisely, the problem of the implication of an inclusion dependency from a set of inclusion and functional dependencies) can be reduced to the GLAV Query Answering Problem, i.e., the problem of answering queries in presence of GLAV mappings. Since the Dependency Implication Problem is known to be undecidable, this shows that the GLAV Query Answering Problem is also undecidable.

The reduction technique is standard for proving undecidability results. We first recall how it works and also recall the Dependency Implication Problem. We believe that these notions are important to know beyond the scope of this book. Finally, we use a reduction to show the undecidability of answering queries using GLAV mappings.

Reduction from a decision problem B to a decision problem B’

Let B be a Boolean function over a set X. The decision problem B is decidable if there exists an algorithm (in any computation model equivalent to Turing machines) that terminates on any input xB and returns “true” if and only if B(x) is true.

Let B,Bbe two decision problems. A reduction from B to B’ is an algorithm f computing a function (also denoted f ) from X to X’ such that: B(x) is true B’(f(x)) is true.

It is immediate to see that if there is a reduction f from B to B:

The Dependency Implication Problem

We recall the class of dependencies that are used. Let R be a relation of arity n. Then:

Functional dependencies.
A functional dependency over R is an expression R : i1...im j, where 1 i1,...,im,jn, for n = arity(R). An instance I over R satisfies R : i1...im j if for each tuples a1,...an, b1,...bnin I,

if for each k[1..m], aik = bik, then aj = bj.

Inclusion dependencies.
An inclusion dependency over R1,R2 is an expression R1 : i1...im R2 : j1...jm, where the ik are distinct, the jk are distinct, 1 i1,...,im arity(R1), 1 j1,...,jm arity(R2). An instance I over {R1,R2} satisfies R1 : i1...im R2 : j1...jm if for each tuple a1,...,an in I(R1), there exists a tuple b1,...,bnin I(R2) such that for each k, 1 km, aik = bjk.

We will use the following known result:

Theorem 9.6.2 (Undecidability of the Dependency Implication Problem). Let R = {R1,...,Rn} be a relational schema. Given a set Σ of functional and inclusion dependencies and an inclusion dependency σ over relations in R, one cannot decide whether Σσ (i.e., whether each instance over R satisfying Σ also satisfies σ).

The problem is undecidable when Σ contains both functional and inclusion dependencies. Note that the implication problem is decidable for functional dependencies alone, and for inclusion dependencies alone. Undecidability arises when they are considered together.

Undecidability of the GLAV Query Answering Problem

The GLAV Query Answering Problem is to decide, given a PDMS N defined using a set of GLAV mappings and a query asked at one of the peers whether some particular tuple is in its answer.

Let us define a reduction from the Dependency Implication Problem to the GLAV Query Answering Problem. If we show that such a reduction exists, since the Dependency Implication Problem is undecidable, this will show that the GLAV Query Answering Problem is undecidable.

Surprisingly, we can show the reduction for a PDMS with a single peer. To do that, we will use some GLAV mapping of the form q@P q@P, where both sides of the mapping involve the same peer. Note that the undecidability still holds if such “self” mappings are forbidden. Indeed, we can simulate such a mapping by using “clones” of relations. For instance, suppose that we want to enforce the mapping R@P(x1,...,xn) R@P(y1,...,yn). Then we can use a dummy site P and a copy R@P of R@P with the mappings:

R@P (x1,...,xn) ⊇ ^R@ ^P(x1,...,xn)
^  ^
R@ P(x1,...,xn) ⊇ R@′P (x1,...,xn)
^R@ ^P(x1,...,xn) ⊇ R @P (y1,...,yn)
So, in the rest of this proof, we consider a single peer, say P, with possibly self mappings. To simplify a relation R@P is simply denoted R.

Let ,σ) be an instance over {R1,...,Rn} of the Dependency Implication Problem with Σ a finite set of functional and inclusion dependencies, and σ an inclusion dependency. We build a PDMS N defined as follows:

It is easy to see that the GLAV mappings force each Ri to satisfy the functional dependencies of Ri, and each Ri,Rj to satisfy the inclusion dependencies between Ri and Rj.

Let us assume that σ = Ri : i1 Rj : j1 for some Ri of arity n. (This is without loss of generality since the implication is already undecidable when σ is unary).

Let Ext(Ri) be the set of tuples t of arity n with values in [1..n] such that:

For instance if n = 3 and i1 = 2, Ext(Ri) = {⟨1,1,1,1,1,2,2,1,1,2,1,2,2,1,3⟩}.

We construct an instance (N ,Ext(Ri),q) of the GLAV Query Answering Problem where q is the query defined by q(x) :- Rj(y1,...,x,...yk) where the distinguished variable x is in position j1, and the existential variables yi are pairwise distinct.

We show that Σσ iff 1 is an answer to q in the PDMS N in which the only data is Ext(Ri).

() Suppose that Σσ. Let I be a model of Ext(Ri) satisfying the GLAV mappings of N . By construction of those GLAV mappings, I is a model of Σ. Because Σσ, I is a model of σ, and thus for each tuple a1,...,anin I(Ri), there exists a tuple b1,...,bkin I(Rj) such that ai1 = 1 = bj1. Therefore, Iy1,...,ykRj(y1,...,1,...yk), i.e., Iq(1). Thus 1 is an answer to q given the GLAV mapping of N and the extension Ext(Ri).

() Conversely, suppose that 1 is an answer to q given the GLAV mapping of N and the extension Ext(Ri). Note that 1 is also an answer to q if the extension of Ri is reduced to any tuple of the original Ext(Ri). Suppose that Σσ: there exists an interpretation I that satisfies Σ in which σ is not satisfied. This means that there exists a tuple e1,...,e,...en(where e is in position i1) in I(Ri) such that there does not exists a tuple in I(Rj) with the value e in position j1. Let t be the tuple of Ext(Ri) which corresponds to the equality pattern between values of e1,...,e,...en. By extending I to interpret each value of t by the element ei at the same position in e1,...,e,...en, we obtain a new interpretation Ithat satisfies Σ and thus each GLAV mapping of N , and Ri(t). Since 1 is an answer to q given the GLAV mapping of N and Ri(t), Iq(1), i.e., I(1) I(Rj[j1]). Since I(1) = e and I(Rj) = I(Rj), it means that there exists a tuple in I(Rj) with the value e in position j1, which contradicts our assumption that σ is not satisfied in I. Hence Σσ.

9.6.2  Decentralized DL-LITER

If we restrict the GLAV mappings in a PDMS to be inclusion statements that are expressible in DL-LITER , we get what we will call a DL-LITER PDMS. The decidability of query answering over a DL-LITER PDMS results from the algorithmic machinery described in the previous chapter for answering queries over DL-LITER knowledge bases. Given a query posed to a given peer, the application of the PerfectRef algorithm to the set of all the GLAV mappings in the PDMS provides a set of reformulations. The union of the answer sets obtained by evaluating each reformulation provides the answer set of the initial query. Note that a reformulation is of the form:

R1@i1(z1), , Rk@ik(zk)

where the different conjuncts Rj@ij(zj) may refer to relations of different peer schemas. Therefore, the evaluation of each reformulation may require the interrogation of different peers and the combination of the answers returned by each such sub-queries.

This provides a centralized algorithm for computing the reformulations of answering queries over a decentralized DL-LITER knowledge base. We next present a decentralized algorithm that computes exactly the same thing, i.e., we present a decentralized version of the PerfectRef algorithm seen in the previous chapter in order to deploy effectively DL-LITER PDMSs that avoids having to centralize all the GLAV mappings.

We denote PerfectRef i(q) the reformulation algorithm running on the peer Pi applied to a query q (asked to the peer Pi). The main procedure is the decentralized reformulation of each atom of the query using the positive inclusion statements that are distributed over the whole PDMS. Let us denote AtomRef i(g) the reformulation algorithm running on the peer Pi to reformulate the atom g (built on a relation of the schema of the peer Pi).

Within each peer Pi we distinguish the local positive inclusion axioms of the form Ci Di where Ci and Di are built over relations in the schema of the peer Pi, from the mappings which are positive inclusion mappings of the form Cj Di or Di Cj where Cj denotes a relation of another peer Pj (while Di refers to a relation in the schema of the peer Pi).

Let us denote LocalRef(g,PIi) the result of the reformulation of the atom g using the set PIi of local positive inclusion atoms of the peer Pi. We refer to the previous chapter (Definition 8.4.7, Section 8.4) for the computation of LocalRef(g,PIi) by backward application of the local PIs.

We just recall here that gr(g,I) denotes the reformulation of the atom g using the positive inclusion axiom I. We also recall that the atoms g that can be found as conjunct of a query q over a DL-LITER PDMS are of the following forms:

Running the algorithm AtomRef i on the peer Pi for reformulating the atom g consists first in computing the set LocalRef(g,PIi) of local reformulations of g, and then, for each mapping m with a peer Pj applicable to a local reformulation g, in triggering the application of AtomRef j(gr(g,m)) on Pj (by sending a message to Pj). Other peers Pk may be solicited in turn to run locally AtomRef k.

A loop may occur if a request of reformulation of an atom g initiated by a given peer P generates a branch of requests reaching a peer Pwhich in turn requests P to reformulate g. Such loops can be easily handled by transmitting with every request the history of the current reasoning branch. More precisely, an history hist is a sequence [(gk,Pk),,(g1,P1)] of pairs (gi,Pi) where gi is an atom of a peer Pi such that for each i[1..k- 1], gi+1 is a reformulation of gi using a mapping between Pi and Pi+1.

This is summarized in Algorithm 6, which is the atom reformulation algorithm with history running on Peer i.


AtomRefHisti(g,hist)

Input: An atom g in the vocabulary of the peer Pi, an history hist

Output: The set of its reformulations in the PDMS: R

(1)R←∅

(2)if (g,Pi) histreturn R

(3)else

(4)     Let PIi be the local PIs of the peer Pi

(5)     Let Mi be the mappings between the peer Pi and other peers

(6)     for each g′∈LocalRef(g,PIi)

(7)       for each mapping mMi between Pi and a peer Pj applicable to g

(8)        RRAtomRefHistj(gr(g,m),[(g,Pi)|hist])

Algorithm 6: The decentralized algorithm with history for reformulating atoms

Algorithm 7 is the atom reformulation algorithm (denoted AtomRef i) running on peer Pi, which just calls AtomRefHisti with an empty history.


AtomRef i(g)

Input: An atom g in the vocabulary of the peer Pi

Output: The set of its reformulations in the PDMS

(1)AtomRefHisti(g,)

Algorithm 7: The decentralized algorithm for reformulating atoms

The decentralized version of the PerfectRef algorithm that computes all the reformulations of a conjunctive query q is provided in Algorithm 8. The main difference with the centralized version is that the simplification of the produced reformulations (which is required for making some PIs applicable) are delayed after (decentralized) computation of the reformulations of all the atoms in the query.

We recall here the notation used for denoting the simplification of some atoms within a query under reformulation, which were introduced in the previous chapter when describing the PerfectRef algorithm:

For each atom in the query, it computes first (in the decentralized manner explained previously) the set of all of its reformulations, and then a first set of reformulations of the original query by building all the conjunctions between the atomic reformulations (denoted i=1nAtomRefi(gi) at Line 5). These reformulations are then possibly simplified by unifying some of their atoms (Lines 8 to 11), and the reformulation process is iterated on these newly produced reformulations until no simplification is possible (general loop starting on Line 4).


PerfectRefi(q)

Input: a conjunctive query q over the schema of the peer Pi

Output: a set of reformulations of the query using the union of PIs and mappings in the PDMS

(1)PR := {q}

(2)PR := PR

(3)while PR

(4) (a) foreach q= g1 g2 gn PR

(5)PR′′ :=  i=1nAtomRefi(gi)

(6)PR := 

(7)(b) foreach q′′PR′′

(8) foreach g1,g2 q′′

(9)if g1 and g2 unify

(10)PR := PR′∪{τ(reduce(q′′,g1,g2))}

(11)PR := PRPR′∪PR′′

(12)return PR

Algorithm 8: The decentralized PerfectRef algorithm running on the peer Pi

One can prove that the decentralized algorithm computes the same set of facts as the centralized one, and thus is correct. The proof results (1) from the observation that the centralized version of PerfectRefi (in which AtomRefi(gi) is computed by iterating the one-step application of PIs on each atom gi of the query) produces the same results than the original PerfectRef, and (2) from the completeness of AtomRefi(gi) ensuring the decentralized computation of all the reformulations of gi.

9.7  Further reading

The Bucket and Minicon algorithms can be extended ( [113137]) to handle (union of) conjunctives queries with interpreted predicates. When a query q includes interpreted predicates, finding all answers to q given the LAV mappings is co-NP hard in the size of the data. This complexity result shows that answering such queries cannot be fully realized with a finite set of conjunctive rewriting (unlike what we showed here in absence of interpreted predicates). The Inverse-rule algorithm does not handle interpreted predicates but is able to build recursive query plans for data integration [59].

A survey on answering queries using views can be found in [87], and a survey on query containment for data integration systems in [124].

More material can be found on PDMS in [8886].

Distributed reasoning in a peer to peer setting has been investigated in [14] as a basis for querying distributed data through distributed ontologies [731]. The subtle point that we have not treated in this chapter concerns consistency checking. In contrast with the centralized case, the global consistency of the PDMS cannot be checked at query time since the queried peer does not know all the peers in the PDMS. However, it can get the identifiers of the peers involved in a reformulation of the query. Then the (local) consistency of the union of the corresponding knowledge bases can be checked in a decentralized manner. The important point is that it can be shown that this local consistency is sufficient to guarantee that the answers obtained by evaluating the reformulations (computed by the decentralized algorithm that we have described) are well-founded.

The undecidability of the Dependency Implication Problem is shown in [42] even if σ is a unary inclusion dependency. More on this topic may be found in [9])

9.8  Exercices

Exercise 9.8.1 By applying the query containment algorithm (see Algorithm 4), determine which query is contained in which one among the three following queries. Are there equivalent queries ? (two queries qand qare equivalent if qis contained in qand qis contained in q).

q1(x) :- A(x,y), B(x,y’), A(y,z’) 
 
q2(x) :- A(x,y’), A(y’,z), B(x,x) 
 
q3(x) :- B(x,y), A(x,y’), B(z,z’), A(y’,u)

Exercise 9.8.2 Consider a global schema defined by the following relations:

emp(E): E is an employee

phone(E,P): E has P as phone number

office(E,O): E has O as office

manager(E,M): M is the manager of E

dept(E,D): D is the department of E

Suppose that the three following data sources are available for providing data:

Source1 provides the phone number and the manager for some employees. It is modeled by the local relation s1(E,P,M).

Source2 provides the office and the department for some employees. It is modeled by the local relation s2(E,O,D).

Source3 provides the phone number of employees of the ’toy’ department. It is modeled by the local relation s3(E,P).

  1. Model the content of these sources by GAV mappings.
  2. Model the content of these sources by LAV mappings.
  3. Consider the global query asking for Sally’s phone number and office:
    q(x,y) :- phone(’sally’, x), office(’sally’, y)

    Compute the reformulation of the query in terms of local relations:

    • by applying the query unfolding algorithmto the GAV mappings of Question 1,
    • by applying the Bucket algorithm to LAV mappings of Question 2.

    Which algorithm is easier ?

  4. Now Source1 disappears (becomes unavailable) and a new source comes in, that provides the phone number of their manager for some employees. Do the updates in the GAV and LAV mappings that are required to take into account these changes. What is the approach (GAV or LAV) for which updating the mappings between the global and local relations is easier ?

Exercise 9.8.3 Consider the three following LAV mappings:

V1(x) cite(x,y), cite(y,x)

V2(x,y) sameTopic(x,y)

V3(x,y) cite(x,z), cite(z,x), sameTopic(x,z)

  1. Provide the FOL semantics of these LAV mappings
  2. Suppose that the global relation cite(x,y) means that the paper x cites the paper y, and that the global relation sameTopic(x,y) means that the two papers x and y have the same topic. Suppose that each LAV mapping models the content of different available data sources. Express with an english sentence which information on papers each data source provides.
  3. Apply in turn the Bucket, Minicon and Inverse-rule algorithms to compute the different rewritings of the following query asking for papers that cite and are cited by a paper having the same topic:
    q(u) :- cite(u,v), cite(v,u), sameTopic(u,v)

1The same holds for the Bucket or Inverse-rules algorithm.