As we saw in the previous chapter, ontologies form the backbone of the Semantic Web by providing a conceptual view of data and services available worldwide via the Web. We discussed the RDF language for describing knowledge, and a family of languages, called description logics, that provide formal foundations for ontologies and in particular for the OWL ontology language recommended by the W3C.
In this chapter, the focus is on querying RDF data. Since massive volumes of RDF sources are more and more present on the Web, specifying queries for RDF data that can be evaluated efficiently is becoming every day more and more important.
We will see that the set of query answers strongly depends on the semantic context. We will study query answering when the ontology is specified in RDFS and then when the ontology is specified in DL-LITE. The ontology language DL-LITE belongs to the DL family. It has been designed as a trade-off between the ability to describe a wide range of domains of interest and query evaluation efficiency. More precisely, we will focus on two important fragments of DL-LITE, namely DL-LITE and DL-LITE.
We will observe that the problem of answering queries through ontologies is quite different from that of answering database queries for the following two essential reasons:
From this, it should be clear that query answering through ontologies is more complicated that in classical databases. We have to reason to find which inferences may participate in answering a given query. We also have to reason to verify the consistency of our knowledge.
In this section, we set the stage for querying RDF data. We also discuss the impact of ontologies (knowledge on the domain of interest) on the answers. To simplify, we ignore here blank nodes.
Figure 8.1 is an enhanced version of the University example that we will use throughout this chapter. The first column provides RDF data in the triple syntax, while the second column shows the corresponding facts in FOL.
Subject | Predicate | Object | FOL semantics |
dupond | Leads | infoDept | Leads(dupond,infoDept) |
dupond | rdf:type | Professor | Professor(dupond) |
durand | ResponsibleOf | ue111 | ResponsibleOf(durand,ue111) |
durand | Leads | csDept | Leads(durand,csDept) |
paul | TeachesTo | pierre | TeachesTo(paul,pierre) |
paul | rdf:type | PhdStudent | PhDStudent(paul) |
paul | EnrolledIn | infoDept | EnrolledIn(paul, infodept) |
pierre | EnrolledIn | infoDept | EnrolledIn(pierre, infodept) |
pierre | rdf:type | Undergrad | Undergrad(pierre) |
pierre | RegisteredTo | ue111 | Registered(pierre, ue111) |
ue111 | OfferedBy | infoDept | OfferedBy(ue111,infoDept) |
ue111 | rdf:type | CSCourse | CSCourse(ue111) |
jim | EnrolledIn | csDept | EnrolledIn(jim, csDept) |
csDept | rdf:type | TeachingDept | TeachingDept(csDept) |
RDF triples can be asserted in a very flexible way and almost without constraints. The association of some ontology is not a requirement. Users can update a collection of RDF statements freely by just adding/removing triples. The only reserved word in the RDF vocabulary is rdf:type that is used to relate constant names to types, i.e., classes in domain of interest or unary predicates in FOL world.
Let us now consider querying a set of RDF facts, for which the query language SPARQL has been proposed. We briefly consider it next. query language. We briefly consider it next.
SPARQL (pronounced “sparkle”) is a recursive acronym standing for SPARQL Protocol And RDF Query Language. It is a W3C recommendation as of 2008. Although it does borrow some features from XQuery (functions and operators), it is based on the graph model underlying RDF data.
For instance, the following query expresses in SPARQL the search of all the individuals who are enrolled in a department led by a Professor.
We used here an SQL-like syntax. There exists competing syntaxes for expressing SPARQL queries. The corresponding query in FOL notation is:
q(x) : - ∃y∃z[EnrolledIn(x,y) ∧Leads(z,y) ∧Professor(z)]
This is a conjunctive query, i.e., a FOL formula without negation or disjunction, of the form
In the remainder of this chapter, we use conjunctive queries as the query language for RDF. From the example, it should be clear that all we say is relevant to SPARQL. We use a (standard) simplified notation for conjunctive queries. We omit the existential quantifiers and denote the connector ∧ by a comma. Observe that this does not introduce any ambiguity. In particular, all variables not occurring in the “head of the query” (i.e., in q(x1,...,xm)) are understood as existentially quantified. The variables in x1,...,xm are said to be distinguished.
In this simplified form, the example SPARQL query becomes:
Now consider a query in the general form:
An essential issue is in the meaning of “holds” in the previous informal definition.
Inference with data only. In the simplest case, a fact Ri(ν(ui)) holds if it is a known fact (explicitly stated in the data store). For instance, consider the previous conjunctive query in the University example. The evaluation of the query q(x) against the facts of Figure 8.1 returns {paul, pierre} as its answer set. To see why paul is an answer, we just check that, by mapping the distinguished variable to the constant paul, and the existential variables y and z respectively to the constants infoDept and dupond, all the conjuncts in the query definition are satisfied by facts in the database. The same holds if the distinguished variable x is instantiated with the constant pierre.
More inference using an ontology. Now, let us assume that a fact Ri(ν(ui)) holds if it is a known fact or if it is a consequence of the known facts by taking the ontological statements into account. Suppose now that we also have the knowledge that someone responsible for a class has to be a professor, that is in DL syntax:
To see why, we just have to consider the FOL semantics of this DL statement:
This logical implication indeed allows inferring the fact Professor(durand) from the fact ResponsibleOf(durand, ue111).
More subtly, we can get answers from partially instantiated facts that can be logically entailed by the knowledge base. Suppose that we know that a professor teaches at least one course, that is in DL syntax:
From the explicit ground fact Professor(durand) and the contraint Professor ⊑∃TeachesIn, we know that TeachesIn(durand,v) holds for some unknown value v. The valuation of v may vary in the different “worlds” satisfying the constraint. This is however sufficient to infer that answer q(durand) is true in all these possible worlds.
Formal definition of answer set. Recall that φ⊧ψ (i.e., φimplies ψ) if each interpretation making φ true also makes ψ true, or equivalently, every model of φ is a model of ψ. We next provide a formal definition of the answer set of a query for a DL knowledge base, that captures the general setting where data (the Abox ) is associated to an ontology (the Tbox ) to form a DL knowledge base = ∪. A query to the knowledge base is a conjunctive query using class or property predicates from the given knowledge base with the proper arity. (Class predicates have arity one and property predicates arity 2.) A valuation ν of a set of variables {z1,...,zp} is a substitution (denoted {z1/a1,...,zp/ap}) that assigns each variable zi to a constant ai (two distinct variables may be assigned to a same constant). Given two valuations ν and ν′ of two disjoint sets of variables {z1,...,zp} and {v1,...,vk} , ν∘ν′ denotes the valuation assigning the variables zi to the corresponding constants in ν, and the variables vj to the corresponding constants in ν′.
We can now formally define the notion of answers.
Definition 8.2.1 Let q(x1,...,xm) : - R1(u1),...,Rp(up) be a query to a knowledge base . An answer is a ground fact q(ν(x1),...,ν(xm)) for some valuation ν of the distinguished variables such that in every model of there exists a valuation ν′of the existential variables for which Ri(ν∘ν′(ui)) is true for each i. The answer set of qfor is the set of all such answers. It is denoted q().
Consider again the previous University query example. We have seen that its answer set varies depending on the knowledge base against which it is evaluated. In particular, if is the set of facts of Figure 8.1 and is {∃ResponsibleOf ⊑Professor}, we have:
Boolean queries. To conclude this section, we consider a particular interesting case of queries, that of Boolean queries. The arity of a query is the number of its distinguished variables. A query of arity 0, i.e., a query of the form q() : ..., is called a Boolean query. Note that there is a single possible answer, namely q(). In this case, we see that as a positive answer to the query, i.e., as true. If the answer set is empty, we see that as false. To see an example, consider the query
This Boolean query asks whether there exists a student teaching to other students. Suppose ′ = {PhDStudent ⊑Student}. Then we can distinguish two cases:
In the second case, the fact Student(paul), although not in the Abox , can be inferred from the fact PhDStudent(paul) in and the inclusion statement PhDStudent ⊑Student in ′. Together with the fact TeachesTo(paul,pierre) present in the Abox, it makes the body of the query q′ satisfied.
In this section, we consider RDF data without blank nodes (that can be seen as an Abox), associated to an RDFS ontology (that can be seen as a very simple Tbox). RDF data and RDFS statements can be denoted and stored as triples. However, the important point is that RDFS statements have a logical semantics which can be operationalized as a set of inference rules (see Section 7.3 in Chapter 7). We illustrate here how this can be used to answer queries.
Figure 8.2 is an example of an RDFS ontology that can be associated to the RDF data in Figure 8.1. The RDFS statements composing the ontology are given in three notations: the triple notation, the DL notation, and the corresponding FOL notation.
RDFS notation | DL notation | FOL notation |
⟨AcademicStaff rdfs:subClassOfStaff ⟩ | AcademicStaff ⊑Staff | AcademicStaff (X) ⇒Staff (X) |
⟨Professor rdfs:subClassOfAcademicStaff ⟩ | Professor ⊑AcademicStaff | Professor(X) ⇒AcademicStaff (X) |
⟨Lecturer rdfs:subClassOfAcademicStaff ⟩ | Lecturer ⊑AcademicStaff | Lecturer(X) ⇒AcademicStaff (X) |
⟨PhDStudent rdfs:subClassOfLecturer⟩ | PhDStudent ⊑Lecturer | PhDStudent(X) ⇒Lecturer(X) |
⟨PhDStudent rdfs:subClassOfStudent⟩ | PhDStudent ⊑Student | PhDStudent(X) ⇒Student(X) |
⟨TeachesIn rdfs:domainAcademicStaff ⟩ | ∃TeachesIn ⊑AcademicStaff | TeachesIn(X,Y) ⇒AcademicStaff (X) |
⟨TeachesIn rdfs:rangeCourse⟩ | ∃TeachesIn-⊑Course | TeachesIn(X,Y) ⇒Course(Y) |
⟨ResponsibleOf rdfs:domainProfessor⟩ | ∃ResponsibleOf ⊑Professor | ResponsibleOf (X,Y) ⇒Professor(X) |
⟨ResponsibleOf rdfs:rangeCourse⟩ | ∃ResponsibleOf -⊑Course | ResponsibleOf (X,Y) ⇒Course(Y) |
⟨TeachesTo rdfs:domainAcademicStaff ⟩ | ∃TeachesTo ⊑AcademicStaff | TeachesTo(X,Y) ⇒AcademicStaff (X) |
⟨TeachesTo rdfs:rangeStudent⟩ | ∃TeachesTo-⊑Student | TeachesTo(X,Y) ⇒Student(Y) |
⟨Leads rdfs:domainAdminStaff ⟩ | ∃Leads ⊑AdminStaff | Leads(X,Y) ⇒AdminStaff (X) |
⟨Leads rdfs:rangeDept⟩ | ∃Leads-⊑Dept | Leads(X,Y) ⇒Dept(Y) |
⟨RegisteredIn rdfs:domainStudent⟩ | ∃RegisteredIn ⊑Student | RegisteredIn(X,Y) ⇒Student(X) |
⟨RegisteredIn rdfs:rangeCourse⟩ | ∃RegisteredIn-⊑Course | RegisteredIn(X,Y) ⇒Course(Y) |
⟨ResponsibleOf rdfs:subPropertyOfTeachesIn⟩ | ResponsibleOf ⊑TeachesIn | ResponsibleOf (X,Y) ⇒TeachesIn(X,Y) |
As already said, these RDFS statements can be used to infer new triples (i.e., new facts) from the RDF database. For example, the RDF triple ⟨durandResponsibleOf ue111⟩ in Figure 8.1 corresponds to the fact ResponsibleOf (durand,ue111), and the RDFS statement ⟨ResponsibleOf rdfs:domainProfessor⟩ corresponds to the logical rule: ResponsibleOf (X,Y) ⇒Professor(X). The condition of this rule can be mapped with the fact ResponsibleOf (durand,ue111) by the substitution {X/durand, Y/ue111}, and thus the corresponding instantiation of the conclusion Professor(durand) can be inferred. This new fact can in turn trigger a rule such as Professor(X) ⇒AcademicStaff (X), thereby allowing the inference of additional facts such as AcademicStaff (durand).
More generally, RDFS statements correspond to rules that can be applied in a forward-chaining manner to the initial set of facts until saturation, i.e., until no more fact can be inferred. It is important to see that the variables in the head of rule all occur in the body. In other words, no variable is quantified existentially. So rules always infer new ground facts. Such rules are said to be safe. We will use unsafe rules when we consider DL-LITE, which will render query processing more complicated.
The simple forward-chaining Algorithm 1 starts with the set of initial facts and repeats inference steps until saturation.
Input: An Abox and an RDFS Tbox
Output: The set of facts that are inferred: Δ0
(4) foreach rule condition ⇒conclusion in ,
(5) if there exists a substitution σ such that σ.condition ∈ Δ0
(10)until Δ1 = ∅
Figure 8.3 shows the facts resulting from the application of Algorithm 1 to the facts of Figure 8.1 and the rules of Figure 8.2.
Asserted facts | Inferred facts |
Leads(dupond,infoDept) | AdminStaff(dupond) |
Professor(dupond) | Dept(infoDept) |
ResponsibleOf(durand,ue111) | AcademicStaff(dupond) |
Leads(durand,csDept) | Professor(durand) |
TeachesTo(paul,pierre) | Course(ue111) |
PhDStudent(paul) | AcademicStaff(durand) |
EnrolledIn(paul, infodept) | AdminStaff(durand) |
EnrolledIn(pierre, infodept) | Dept(csDept) |
Undergrad(pierre) | AcademicStaff(paul) |
Registered(pierre, ue111) | Student(pierre) |
OfferedBy(ue111,infoDept) | Student(paul) |
CSCourse(ue111) | Student(pierre) |
EnrolledIn(jim, csDept) | Lecturer(paul) |
TeachingDept(csDept) | AcademicStaff(paul) |
Staff(paul) | |
Staff(dupond) | |
Staff(durand) | |
To answer queries from RDF facts associated to an RDFS ontology, one can proceed as follows. First one compute all the inferred facts (in a bottom-up manner) with the previous algorithm. Each step of the loop can be computed, for instance, using a standard relational query engine. This yields a new database consisting of the set of all the facts (asserted or inferred). Then one can evaluate the query directly on that database using a standard relational query engine.
For example, the standard evaluation against the set of (asserted + inferred) facts in Figure 8.3 of the query
Complexity analysis. It is interesting to estimate both the maximum number of inferred triples and the worst-case time complexity for inferring them. Of course, this depends on the number of asserted triples (i.e., the size of the data) and also on the number of axioms in the ontology (i.e., the size of the ontology). Let M be the number of facts in the Abox and N the number of axioms in the Tbox. From the presence of some initial fact C(a), one can derive a number of new facts C′(a) for some class C′. Note that the number of such C′(a) is bounded by the number of axioms in the ontology, i.e., it is less than N. Now consider some initial fact R(a,b). From it, one can derive some facts R′(a,b) or R′(b,a) as well as some facts C′(a) or C′(b) for some R′ and C′. Again, one can observe that for a particular R(a,b), the number of new facts one can derive is bounded by the number of axioms in the ontology, i.e., it is less than N. Since the number of initial facts is M, the number of facts one can derive is bounded by M×N. Observe in particular that it is linear in the number of database facts.
Now consider the worst-case time complexity for inferring them by the Algorithm 1. We have to perform at most M×N iterations. Each iteration can be performed in polynomial time. So the algorithm is in PTIME. One can show more precisely that it is in 0((M×N)2).
In this section, we consider two important fragments of the DL-LITE ontology language of the DL family. As we will see, querying is feasible for these two languages even though they provide a quite rich framework for describing semantics. We study querying through ontologies expressed in these two fragments.
A DL-LITE ontology may contain axioms corresponding (up to the syntax) to those allowed in an RDFS ontology. Besides, it may contain other axioms, of three kinds: positive inclusions (PI for short), negative inclusions (NI) and key constraints (Key). Figure 8.4 shows examples of these three kinds of DL-LITE axioms with their corresponding FOL semantics. These constraints are not expressible in RDFS.
DL notation | Corresponding logical rule | |
PI | Professor ⊑∃TeachesIn | Professor(X) ⇒∃YTeachesIn(X,Y) |
Course ⊑∃RegisteredIn- | Course(X) ⇒∃YRegisteredIn(Y,X) | |
NI | Student ⊑¬Staff | Student(X) ⇒¬Staff (X) |
Key | (functResponsibleOf -) | ResponsibleOf (Y,X) ∧ResponsibleOf (Z,X) ⇒Y = Z |
We next consider in turn these new kinds of axioms.
Positive inclusion and incompleteness. A positive inclusion axiom is an expression of one of the following forms:
DL notation | Corresponding logical rule |
B ⊑∃P | B(X) ⇒∃YP(X,Y) |
∃Q ⊑∃P | Q(X,Y) ⇒∃ZP(X,Z) |
B ⊑∃P- | B(X) ⇒∃YP(Y,X) |
∃Q ⊑∃P- | Q(X,Y) ⇒∃ZP(Z,X) |
P ⊑Q- or P-⊑Q | P(X,Y) ⇒Q(Y,X) |
where P and Q denote properties and B denotes a class. Recall that P- denotes the inverse of P, i.e., P-(x,y) iff P(y,x) for all x,y.
Observe that expressions of the form ∃P ⊑B belong to DL-LITE since they already are in RDFS. Expressions of the form P ⊑Q (so equivalently P-⊑Q-) also belong to DL-LITE for the same reason.
It is important to note that the logical rules corresponding to PI axioms expressible in DL-LITE are not necessarily safe (as opposed to RDFS that uses only safe rules.) Consider the rule
Negative inclusion and inconsistencies. A negative inclusion axioms is an expression that takes one of the forms:
DL notation |
B 1 ⊑¬B2 |
R1 ⊑¬R2 |
where
The corresponding logic rules are left as an exercise. An example of NI (expressing the constraint “Students do not teach courses”) and the corresponding logical rule are as follows:
DL notation | Corresponding logical rule |
Student ⊑¬∃TeachesIn | Student(X) ⇒¬∃YTeachesIn(X,Y) |
or equivalently, ∃YTeachesIn(X,Y) ⇒¬Student(X) | |
NIs express disjointness constraints between classes or between properties, and thus introduce negation in the language. Therefore, the knowledge base against which the queries have to be evaluated may be inconsistent, i.e., a model of the corresponding theory may not exist. Note that this is not possible with RDFS ontologies: we showed an algorithm that computed a model (indeed the smallest model).
For example, adding the NI Student ⊑¬Staff to the ontology of Figure 8.2 leads to the inconsistency of the knowledge base made of the facts in Figure 8.1 and of the axioms in the ontology of Figure 8.2 enriched with that NI. The reason is that from the fact PhDStudent(paul), and the inclusion axiom PhdStudent ⊑Student, we can infer the fact Student(paul), and in turn the literal ¬Staff (paul) from the NI Student ⊑¬Staff . On the other hand, the fact Staff (paul) can be inferred from the fact PhDStudent(paul) and the inclusion axioms PhdStudent ⊑Lecturer, Lecturer ⊑AcademicStaff and AcademicStaff ⊑Staff .
Key constraints and more inconsistencies. Key constraints are expressed by functionality axioms of the form (functP) or (functP-) where P is a property and P- denotes the inverse property of P. Figure 8.5 shows their logical semantics in the form of logical rules.
DL notation | corresponding logical rule |
(functP) | P(X,Y) ∧P(X,Z) ⇒Y = Z |
(functP-) | P(Y,X) ∧P(Z,X) ⇒Y = Z |
Observe that key constraints may also lead to inconsistencies. This is the case if we attempt to equate two distinct constants, e.g., durand and dupond. For instance, the axiom (functResponsibleOf -) expresses that a course must have a unique professor responsible for it. Therefore, a knowledge base containing this axiom and the two facts:
ResponsibleOf (durand,ue111) and ResponsibleOf (dupond,ue111),
would be inconsistent. This is because we assume implicitly that an individual is denoted by a single constant. This natural (in practice) assumption is called in logic the unique name assumption.
We will consider the following two fragments of DL-LITE:
One may wonder why one would choose such a convoluted language. Why not simply extend RDFS with the 3 kinds of axioms? This is because functional constraints interact with inclusion constraints in intricate ways. Query evaluation when they are all present is much more complex. This will be illustrated by an example in Section 8.4.4.
From the previous discussion, there are two fundamental differences between query answering in the context of RDFS and of DL-LITE knowledge bases:
In Section 8.4.2, we show an algorithm for checking consistency of a DL-LITE knowledge base, and in Section 8.4.3 an algorithm for answering conjunctive queries posed to a DL-LITE knowledge base. The particularity of these two algorithms is that they work in two-steps:
Separating ontology reasoning from data processing is typically a desired feature (when possible). In particular, such an approach has the practical interest that it makes it possible to use an SQL engine for the second step, thus taking advantage of well-established query optimization strategies supported by standard relational data management systems. In the first step, we deal with the Tbox only, typically of much smaller size.
In Section 8.4.4, we show by an example that DL-LITE and DL-LITE are two maximal fragments of the DL-Lite family for which reformulating queries into SQL is possible: combining constraints expressible in DL-LITE and DL-LITE may result in an infinite number of non redundant SQL reformulations for some queries.
Towards consistency checking, the first step uses the Tbox alone. It consists in computing the deductive closure of the Tbox, i.e., all the inclusion axioms that are logically entailed by the axioms declared in the Tbox. More precisely, the deductive closure (closure for short) of a DL-LITE Tbox is defined as follows.
Definition 8.4.1 (Closure of a Tbox) Let be a DL-LITEor a DL-LITETbox. The closure of , denoted by cl( ), is inductively defined as follows:
Observe that although all axioms should be considered to construct this closure, only negative inclusions and key constraints can raise an inconsistency. The set of all the negative inclusions and key constraints in cl( ) is called the NI-closure. For example, consider the Tbox of Figure 8.6 made of the RDFS ontology shown in Figure 8.2 enriched with the PIs and NI shown in Figure 8.4. The NI-closure of that Tbox is shown in Figure 8.7.
DL notation | FOL notation |
AcademicStaff ⊑Staff | AcademicStaff (X) ⇒Staff (X) |
Professor ⊑AcademicStaff | Professor(X) ⇒AcademicStaff (X) |
Lecturer ⊑AcademicStaff | Lecturer(X) ⇒AcademicStaff (X) |
PhDStudent ⊑Lecturer | PhDStudent(X) ⇒Lecturer(X) |
PhDStudent ⊑Student | PhDStudent(X) ⇒Student(X) |
∃TeachesIn ⊑AcademicStaff | TeachesIn(X,Y) ⇒AcademicStaff (X) |
∃TeachesIn-⊑Course | TeachesIn(X,Y) ⇒Course(Y) |
∃ResponsibleOf ⊑Professor | ResponsibleOf (X,Y) ⇒Professor(X) |
∃ResponsibleOf -⊑Course | ResponsibleOf (X,Y) ⇒Course(Y) |
∃TeachesTo ⊑AcademicStaff | TeachesTo(X,Y) ⇒AcademicStaff (X) |
∃TeachesTo-⊑Student | TeachesTo(X,Y) ⇒Student(Y) |
∃Leads ⊑AdminStaff | Leads(X,Y) ⇒AdminStaff (X) |
∃Leads-⊑Dept | Leads(X,Y) ⇒Dept(Y) |
∃RegisteredIn ⊑Student | RegisteredIn(X,Y) ⇒Student(X) |
∃RegisteredIn-⊑Course | RegisteredIn(X,Y) ⇒Course(Y) |
ResponsibleOf ⊑TeachesIn | ResponsibleOf (X,Y) ⇒TeachesIn(X,Y) |
Professor ⊑∃TeachesIn | Professor(X) ⇒∃YTeachesIn(X,Y) |
Course ⊑∃RegisteredIn- | Course(X) ⇒∃YRegisteredIn(Y,X) |
Student ⊑¬Staff | Student(X) ⇒¬Staff (X) |
This example shows that it is possible to infer an important number of new NIs. In fact, we have to compute all the consequences. But as we will see there is at most a polynomial number of consequences.
DL notation | FOL notation |
Student ⊑¬Staff | Student(X) ⇒¬Staff (X) |
PhDStudent ⊑¬Staff | PhDStudent(X) ⇒¬Staff (X) |
∃TeachesTo-⊑¬Staff | TeachesTo(Y,X) ⇒¬Staff (X) |
∃RegisteredIn ⊑¬Staff | RegisteredIn(X,Y) ⇒¬Staff (X) |
Lecturer ⊑¬Student | Lecturer(X) ⇒¬Student(X) |
Lecturer ⊑¬PhDStudent | Lecturer(X) ⇒¬PhDStudent(X) |
Lecturer ⊑¬∃TeachesTo- | Lecturer(X) ⇒¬∃Y[TeachesTo(Y,X)] |
Lecturer ⊑¬∃RegisteredIn | Lecturer(X) ⇒¬∃Y[RegisteredIn(X,Y)] |
Professor ⊑¬Student | Professor(X) ⇒¬Student(X) |
Professor ⊑¬PhDStudent | Professor(X) ⇒¬PhDStudent(X) |
Professor ⊑¬∃TeachesTo- | Professor(X) ⇒¬∃Y[TeachesTo(Y,X)] |
Professor ⊑¬∃RegisteredIn | Professor(X) ⇒¬∃Y[RegisteredIn(X,Y)] |
AcademicStaff ⊑¬Student | AcademicStaff (X) ⇒¬Student(X) |
AcademicStaff ⊑¬PhDStudent | AcademicStaff (X) ⇒¬PhDStudent(X) |
AcademicStaff ⊑¬∃TeachesTo- | AcademicStaff (X) ⇒¬∃Y[TeachesTo(Y,X)] |
AcademicStaff ⊑¬∃RegisteredIn | AcademicStaff (X) ⇒¬∃Y[RegisteredIn(X,Y)] |
Staff ⊑¬Student | Staff (X) ⇒¬Student(X) |
Staff ⊑¬PhDStudent | Staff (X) ⇒¬PhDStudent(X) |
Staff ⊑¬∃TeachesTo- | Staff (X) ⇒¬∃Y[TeachesTo(Y,X)] |
Staff ⊑¬∃RegisteredIn | Staff (X) ⇒¬∃Y[RegisteredIn(X,Y)] |
∃TeachesTo ⊑¬Student | TeachesTo(X,Y) ⇒¬Student(X) |
∃TeachesTo ⊑¬PhDStudent | TeachesTo(X,Y) ⇒¬PhDStudent(X) |
∃TeachesTo ⊑¬∃TeachesTo- | TeachesTo(X,Y) ⇒¬∃Z[TeachesTo(Z,X)] |
∃TeachesTo ⊑¬∃RegisteredIn | TeachesTo(X,Y) ⇒¬∃Z[RegisteredIn(X,Z)] |
∃TeachesIn ⊑¬Student | TeachesIn(X,Y) ⇒¬Student(X) |
∃TeachesIn ⊑¬PhDStudent | TeachesIn(X,Y) ⇒¬PhDStudent(X) |
∃TeachesIn ⊑¬∃TeachesTo- | TeachesIn(X,Y) ⇒¬∃Z[TeachesTo(Z,X)] |
∃TeachesIn ⊑¬∃RegisteredIn | TeachesIn(X,Y) ⇒¬∃Z[RegisteredIn(X,Z)] |
We use three propositions for analyzing the consistency problem: one for evaluating the complexity of evaluating the closure and the last two for showing the logical soundness and completeness of this closure. Finally, a fourth proposition will show how to use these results for data consistency checking.
Proposition 8.4.2 (Size of the closure of a Tbox and complexity of its computation) Let be a DL-LITEor a DL-LITETbox.
Proof (sketch). (1.) Follows from the form of the statements that are allowed in a DL-LITE or a DL-LITE Tbox.
For (2.), consider the items (2.) to (12.) in Definition 8.4.1. These are closure rules that are exhaustively applied to the Tbox until saturation. Let 0 = . For each i, let Δi be the set of statements that can be derived from i directly using the closure rules (2.) to (12.). Let i+1 = i ∪ Δi. Clearly, for each i, i ⊆cl( ), so its size is polynomial in the size of .
Now since the size of i is polynomial in the size of , each step of the computation can clearly be performed in PTIME. Since the number of steps is less than the number of statements in cl( ), the entire computation can be performed in PTIME. □
The next proposition states the soundness of the closure.
Proposition 8.4.3 (Soundness of the closure of a Tbox) For each , ≡ cl( ). In other words, for each Abox satisfying a Tbox , also satisfies cl( ).
Proof (sketch). cl( )⊧ , since is included in cl( ). Clearly, the application of each closure rule is sound. So for each i, i⊧i+1. By induction, = 0⊧i for each i. Thus ⊧cl( ), so ≡cl( ). □
The next proposition establishes the completeness of the closure of a Tbox : cl( ) contains all the PIs, NIs and key constraints that are logically entailed by that Tbox (up to equivalence).
Proposition 8.4.4 (Completeness of the closure of a Tbox) Let be a DL-LITEor a DL-LITETbox. Then
then X⊑¬X∈cl( ),
otherwise ⊧X⊑Y iff X⊑Y ∈cl( ) or ¬Y ⊑¬X∈cl( ).
Proof (sketch). We build a canonical interpretation I of the classes and properties appearing in as follows: for each X such that X⊑¬X∈cl( ), I(X) = ∅; for the other classes or properties, we associate a constant (respectively a pair of constants) with each class (respectively each property) and we initialize their interpretations with those (pairs of) constants. Then, we complete these interpretations by applying the positive inclusions in cl( ) as logical rules in a bottom up manner until saturation. For instance, if A⊑∃P is in cl( ), from the initial state where a is I(A), and (p1,p2) is in I(P), we add a new constant p3 in the domain of interpretation and the pair (a,p3) in I(P). Now, if ∃P ⊑∃Q is also in cl( ), we add two new constants p4 and p5 in the domain and the pairs (a,p4) and (p1,p5) to I(Q).
Clearly, by construction, I is a model of each PI in . It is also a model of each NI X⊑¬Y in . Suppose that this not the case: there would exist a constant x which is in I(X) and in I(Y). By construction of I, it would exist a chain of positive inclusions in cl( ) between X and Y and thus X ⊑Y would be in cl( ), and therefore X ⊑¬X would be in cl( ) too, and in this case I(X) would be empty, which contradicts that I is not a model of X⊑¬Y.
To prove (1), if ⊧X ⊑¬X, in every model of , X must be empty, in particular in I. By construction of I, this means that X⊑¬X∈cl( ). Otherwise, consider a PI X⊑Y such that ⊧X ⊑Y. Since I is a model of , I(X) ⊆I(Y). By construction of I, this means that there exists a chain of positive inclusions in cl( ) between X and Y and thus X ⊑Y is in cl( ).
Consider now a NI X⊑¬Y such that neither X⊑¬Y nor Y ⊑¬X belong to cl( ). Let us define the interpretation J such that
In particular J(X) = D (since X⊑¬X is not in cl( )), and J(Y) = D (since neither X⊑¬Y nor Y ⊑¬X belong to cl( )). Clearly, J is a model of , but it is not a model of X⊑¬Y. Therefore, ⊭X⊑¬Y. This ends the proof of (1).
For proving (the contraposite of) (2), we adapt the above canonical interpretation I by initializing with {(p,q),(p,r)} the interpretation of all the properties R such that neither (functR) nor ∃R⊑¬∃R belong to cl( ). And we show that the resulting interpretation I′ is a model of in which the constraints of functionality of such R is not satisfied. □
Finally, the last proposition establishes that checking consistency can be reduced to check whether the data in satisfy every NI in the closure.
Proposition 8.4.5 (Consistency checking using NI-closure) Let be a DL-LITE or a DL-LITETbox. Let be an Abox associated to . ⟨ ,⟩is unsatisfiable iff there exists a NI or a key constraint in the closure of which is violated by some facts of .
Proof (sketch). For every constant a appearing in the Abox , we define (a) as the set of facts extracted from as follows:
We first show that if ⟨ ,⟩ is unsatisfiable, there exists a constant a such that ⟨ ,(a)⟩ is unsatisfiable. In fact we show the contrapositive: suppose that for every constant a, ⟨ ,(a)⟩ is satisfiable: for each a, there exists an interpretation Ia satisfying the inclusions in and all the facts in (a). It is easy to show that the interpretation I defined on the union of domains of interpretations as follows is a model of ⟨ ,⟩ (which is then satisfiable):
Then, since each (a) is a conjunction of facts of the form X(a), if ⟨ ,⟩ is unsatisfiable, there exists a constant a such that ,X1(a) ∧ ... ∧Xn(a) is unsatisfiable. Therefore, ,∃x(X1(x) ∧ ... ∧Xn(x)) is unsatisfiable. This entails: ⊧∀x(¬X1(x) ∨ ... ∨¬Xn(x)), which in DL notation corresponds to: ⊧X1 ⊑¬X2 ⊔ ... ⊔¬Xn. Because of the form of the inclusion allowed in DL-LITE, there must exist i such that ⊧X1 ⊑¬Xi. According to Proposition 8.4.4, this entails that the corresponding NI X1 ⊑¬Xi is in the closure of and that violates it (since it includes X1(a) and Xi(a)).
Conversely, it is easy to show that if a NI in the closure of is violated by some facts in the Abox, then ⟨ ,⟩ is unsatisfiable. If it were not the case, since according Proposition 8.4.3 and cl( ) have the same models, there would be a model in which all the NIs in the closure of would be satisfied by the facts in . □
The second step of consistency checking, after the NI-closure is computed, does not require any further computation on the Tbox . This second step simply consists in evaluating against the Abox (seen as a relational database) a Boolean query corresponding to each negated NI in the NI-closure of the Tbox. If one of those Boolean queries is evaluated to true against , it means that some data in the Abox violates the corresponding NI, and therefore the knowledge base = ⟨ ,⟩ is inconsistent.
For example, consider the NI: ∃TeachesTo ⊑¬PhDStudent. Its corresponding FOL formula φ and its negation are:
Consider the evaluation of the qunsat query against the Abox of Figure 8.1. It evaluates to true: consider valuation ν with ν(x) = pierre, ν(y′) = paul and the facts
The transformation of NIs into Boolean queries that correspond to their negation is described in Definition 8.4.6.
Definition 8.4.6 (Transformation of NIs into Boolean queries) The transformation δ of NIs into Boolean queries corresponding to their negation is defined as follows:
This second step of consistency checking is summarized in the Consistent Algorithm (Algorithm 2). In the algorithm, for each NI clause α, the query qunsat,α is an SQL query computing the Boolean conjunctive queries δ(α). Also, db() denotes the set in a relational database.
Input: a KB = ⟨ ,⟩
Output: true if is satisfiable, false otherwise
(1)qunsat = ∅ (i.e., qunsat is false)
(2)foreach α∈cln( ) let qunsat = qunsat ∪qunsat,α(db())
(5)else return false
It is important to emphasize that this two-step approach for consistency checking does not require any inference on the data. The only inferences concern the Tbox and consist in computing the deductive closure of its axioms, from which the NI-closure (denoted cln( ) in the Algorithm) is extracted.
Consider the Abox ′ obtained from the inconsistent Abox in Figure 8.1 by deleting the fact PhDStudent(paul). The knowledge base made of the Abox ′ in Figure 8.8 and the Tbox in Figure 8.6 is consistent. (See Exercise 8.6.4.)
Subject | Predicate | Object | FOL semantics |
dupond | Leads | infoDept | Leads(dupond,infoDept) |
dupond | rdf:type | Professor | Professor(dupond) |
durand | ResponsibleOf | ue111 | ResponsibleOf(durand,ue111) |
durand | Leads | csDept | Leads(durand,csDept) |
paul | TeachesTo | pierre | TeachesTo(paul,pierre) |
pierre | EnrolledIn | infoDept | EnrolledIn(pierre, infodept) |
pierre | rdf:type | Undergrad | Undergrad(pierre) |
pierre | RegisteredTo | ue111 | Registered(pierre, ue111) |
ue111 | OfferedBy | infoDept | OfferedBy(ue111,infoDept) |
ue111 | rdf:type | CSCourse | CSCourse(ue111) |
jim | EnrolledIn | csDept | EnrolledIn(jim, csDept) |
csDept | rdf:type | TeachingDept | TeachingDept(csDept) |
In the previous section, the negative constraints played the main role. Once we know the knowledge base is consistent and move to query answering, the positive constraints take over.
Answering queries to a DL-LITE knowledge base is done in two steps. The first step is the query reformulation, which consists in translating the original query q into a set Q of queries. The second step consists in evaluating the queries in Q over the Abox (again seen as a relational database). The beauty of the approach is that this will provide the answer set. Of course, simply evaluating q over the Abox would possibly yield an incomplete answer. Completeness is achieved by the “reasoning” in the reformulation step. During this step, we access only the Tbox and not the data.
The query reformulation step is performed by the PerfectRef (Algorithm 3). It consists in reformulating the initial query by using the PIs in as rewriting rules. The intuition is that PIs are seen as logical rules that are applied in backward-chaining to query atoms in order to expand them (in a resolution style). In databases, this is called a chase.
The queries we consider, i.e., the conjunctive queries, consist of several atoms. In general, because of the existential variables, new variables are introduced in queries. So we could be lead to generate more and more queries with new variables. It turns out that we will be able to control this process and generate only a finite number of distinct queries. This is due to the limitations of the constraints allowed in the Tbox. As outlined in Section 8.4.4, as soon as we allow the combination of key constraints with inclusions of properties, we may generate an infinite number of non redundant queries.
Consider a PI rule α⇒β. Applicability of the rule to an atom of a query is defined by:
As usual, _ denotes here an unbounded existential variable of a query.
The following definition defines the result gr(g,I) of the goal reduction of the atom g using the PI I, which is at the core of PerfectRef.
Definition 8.4.7 (Backward application of a PI to an atom) Let I be an inclusion assertion that is applicable to the atom g. Then, gr(g,I) is the atom defined as follows:
if | g = A(x) | and | I = A1 ⊑A, | then | gr(g,I) = A1(x) |
if | g = A(x) | and | I = ∃P ⊑A, | then | gr(g,I) = P(x,_) |
if | g = A(x) | and | I = ∃P-⊑A, | then | gr(g,I) = P(_,x) |
if | g = P(x,_) | and | I = A⊑∃P, | then | gr(g,I) = A(x) |
if | g = P(x,_) | and | I = ∃P1 ⊑∃P, | then | gr(g,I) = P1(x,_) |
if | g = P(x,_) | and | I = ∃P1-⊑∃P, | then | gr(g,I) = P1(_,x) |
if | g = P(_,x) | and | I = A⊑∃P-, | then | gr(g,I) = A(x) |
if | g = P(_,x) | and | I = ∃P1 ⊑∃P-, | then | gr(g,I) = P1(x,_) |
if | g = P(_,x) | and | I = ∃P1-⊑∃P-, | then | gr(g,I) = P1(_,x) |
if | g = P(x1,x2) | and | either I = P1 ⊑P or I = P1-⊑P- | then | gr(g,I) = P1(x1,x2) |
if | g = P(x1,x2) | and | either I = P1 ⊑P-or I = P1-⊑P | then | gr(g,I) = P1(x2,x1) |
The subtle point of PerfectRef is the need of simplifying the produced reformulations, so that some PIs that were not applicable to a reformulation become applicable to its simplifications. A simplification amounts to unify two atoms of a reformulation using their most general unifier and then to switch the possibly new unbounded existential variables to the anonymous variable denoted _.
Let us illustrate the reformulation step of the following query using the PIs in the Tbox of Figure 8.6:
Figure 8.9 shows the result returned by PerfectRef (q(x), ).
Initial query | ||
q(x):- TeachesIn(x,y), RegisteredIn(z,y), Student(z) | ||
Reformulations | applied PI | Reformulated query |
q1(x):- ResponsibleOf(x,y), RegisteredIn(z,y), Student(z) | ResponsibleOf ⊑TeachesIn | q(x) |
q2(x):- TeachesIn(x,y), RegisteredIn(z,y), PhDStudent(z) | PhDStudent ⊑Student | q(x) |
q3(x):- TeachesIn(x,y), RegisteredIn(z,y), TeachesTo(_,z) | ∃TeachesTo-⊑Student | q(x) |
q4(x):- TeachesIn(x,y), RegisteredIn(_,y) | ∃RegisteredIn ⊑Student | q(x) |
q5(x):- ResponsibleOf(x,y), RegisteredIn(z,y), PhDStudent(z) | PhDStudent ⊑Student | q1(x) |
q6(x):- ResponsibleOf(x,y), RegisteredIn(z,y), TeachesTo(_,z) | ∃TeachesTo-⊑Student | q1(x) |
q7(x):- ResponsibleOf(x,y), RegisteredIn(_,y), | ∃RegisteredIn ⊑Student | q1(x) |
q8(x):- ResponsibleOf(x,y), RegisteredIn(z,y), PhDStudent(z) | ResponsibleOf ⊑TeachesIn | q2(x) |
q9(x):- ResponsibleOf(x,y), RegisteredIn(z,y), TeachesTo(_,z) | ResponsibleOf ⊑TeachesIn | q3(x) |
q10(x):- ResponsibleOf(x,y), RegisteredIn(_,y) | ResponsibleOf ⊑TeachesIn | q4(x) |
q11(x):- TeachesIn(x,y), Course(y) | Course ⊑∃RegisteredIn- | q4(x) |
q12(x):- TeachesIn(x,_) | ∃TeachesIn-⊑Course | q11(x) |
q13(x):- ResponsibleOf(x,_) | ResponsibleOf ⊑TeachesIn | q12(x) |
q14(x):- Professor(x) | Professor ⊑∃TeachesIn | q12(x) |
We detail here the inference chain leading to some reformulations that are particularly interesting for getting answers for q from the data in the Abox ′ of Figure 8.8. This also illustrates the need of the simplification step. The reformulation:
is obtained by:
In turn, q4(x) can be reformulated by the backward application of the PI Course ⊑∃RegisteredIn- to the atom RegisteredIn(_,y), which results in the reformulation q11(x):
Then, the reformulation q12(x) is produced by the backward application of the PI ∃ TeachesIn-⊑Course, and the simplification by unification of the two atoms followed by the replacement of the existential variable y, now unbounded, with the anonymous variable _.
Finally, the reformulations q13(x) and q14(x) are obtained from the backward application of the PIs ResponsibleOf ⊑TeachesIn and Professor ⊑∃TeachesIn respectively.
It is important to notice that the answers durand and dupond are obtained for the initial query q(x) thanks to those reformulations q13(x) and q14(x): they would not be returned by the standard evaluation of the query q(x) against the Abox ′ of Figure 8.8.
In the PerfectRef algorithm (Algorithm 3):
Input: a conjunctive query q and a Tbox
Output: a union of conjunctive queries: PR
(6)if a PI I ∈ is applicable to g
(9)if g1 et g2 sont unifiables
(10)PR := PR∪{τ(reduce(q,g1,g2))}
(12)
Figure 8.9 shows the result returned by PerfectRef (q(x), ), where q(x) is the query of the previous example, and is the Tbox of Figure 8.6. The second column makes explicit the PI used for obtaining the corresponding reformulation. Note that equivalent reformulations can be produced by different inferences, such as for example the reformulations q4(x) and q7(x).
Although we will not prove it here, the following properties hold:
In this section, we exhibit an example showing that the interaction of key constraints (the specificity of DL-LITE) with inclusion constraints between properties (the specificity of DL-LITE) may lead to a reformulation of a query into an infinite number of conjunctive rewritings, each one likely to bring additional answers. This makes an algorithmic approach such as the one we used for DL-LITE and DL-LITE in isolation incomplete for query answering when DL-LITE and DL-LITE are combined together.
Consider a Tbox made of the following inclusion axioms, in which R and P are two properties and S is a class:
To see this, we observe that from the fact S(x1) and the PI S ⊑∃R, it can be inferred that there exists y such that R(x1,y) holds, and thus P(x1,y) holds too (since R⊑P). From the functionality constraint on P and the conjunct P(x1,x) in the body of r1, we can now infer that y = x, and thus that R(x1,x) holds. Therefore, ∃zR(z,x) is logically entailed by ∃x1S(x1) ∧P(x1,x), i.e., r1(x) is contained in the query q(x), and thus is a valid reformulation of the query q(x).
It turns out that the situation is even more subtle. Surprisingly, this reformulation r1(x) is not the only one. In fact there exists an infinite number of different reformulations for q(x). Let k ≥ 2. The following query is a valid reformulation of q(x):
To show that rk(x) is logically contained in q(x), we exploit again the axiom of functionality of P and the inclusion axiom between R and P: from the fact S(xk) and the PI S ⊑∃R, it can be inferred that there exists yk such that R(xk,yk) holds, and thus P(xk,yk) holds too (since R⊑P). Since P is functional, we get: yk = xk-1, and thus R(xk,xk-1) holds. Now, based on the PI ∃R-⊑∃R, there exists yk-1 such that R(xk-1,yk-1) holds, and with the same reasoning as before, we get yk-1 = xk-2, and thus R(xk-1,xk-2) holds. By induction, we obtain that R(x1,x) holds, i.e., rk(x) is logically contained in the query q(x).
One can also show that for each k, there exists an Abox such that the reformulation rk returns answers that are not returned by the reformulation rk′ for k′<k. Thus, there exists an infinite number of non redundant conjunctive reformulations.
It can be shown that if we combine key constraints and inclusions of properties in a restricted way, this problem can be avoided. For instance, if key constraints are forbidden on properties involved in right-hand side of an inclusion axiom, there is a finite number of non redundant conjunctive reformulations and they can be found by the PerfectRef algorithm.
The spreading of RDF data on the Web and the importance of queries for such data is illustrated by the Billion Triple Track of the Semantic Web Challenge 1. The idea is to “do something” efficiently with one billion of RDFS triples.
RDFS. Reasoners for RDFS are available on line and can be downloaded, like for instance the Jena2 ontology API [104], which implements the forward-chaining algorithm we described. A SPARQL [175] engine included in the same programmatic Jena environment enables storing and querying data sets made of (asserted + inferred) triples. In fact, since RDFS statements are also stored as RDF triplets, SPARQL can also be used to query the schema, and not only the RDF data. For guaranteeing the completeness of the answers to a schema query, all the inference rules that we have given Figure 7.7 ( section describing RDFS) in Chapter 7 must be taken into account, and not only the subset that we considered in the forward-chaining algorithm we described. We mentioned the practical advantage of separating the computation over a Tbox from that over the Abox. This is useful also from a theoretical point of view. This gives a bound on the data complexity (the complexity in terms of the Abox only) of consistency checking and of query answering. We showed that they can be performed using FOL queries and it is known [159, 9] that evaluating FOL queries over a relational database is in LOGSPACE in the size of the database.
DL-lite. DL-LITE has been recently incorporated into the version OWL2 [176] of OWL as the profile called OWL2 QL. The proof that the PerfectRef Algorithm computes the whole answer is shown in [38]. It follows that the complexity of query answering by reformulation in these fragments of DL-LITE is polynomial in the size of the Tbox, and in LOGSPACE in the size of the Abox.
A major result in [38] is that DL-LITE and DL-LITE are two maximal fragments of the DL-Lite family supporting tractable query answering over large amounts of data. It has been shown in [38] that consistency checking and instance recognition (which a particular case of query answering), while being LOGSPACE both for DL-LITE and DL-LITE Tboxes, are PTIME-COMPLETE for the languages that combine the axioms of both (denoted DL-LITE). This complexity result shows it is unlikely that an approach based on query reformulation would provide a complete query answering algorithm for DL-LITE.
QuOnto ([13]) is a JAVA tool implementing the DL-Lite family of ontology representation languages. It permits the declaration of an ontology as a DL-LITE Tbox, the construction of an associated Abox that can be stored and as a MySQL database. The consistency checking of the resulting DL-LITE knowledge base, and query answering by reformulation are the core functionalities of QuOnto, based on the implementation in Java of the algorithms presented in this chapter.
Datalog+-. Recent research [12, 11] has extended the Datalog database query language towards query answering over ontologies. This has resulted in a unifying framework based on a family of expressive extensions of Datalog, called Datalog+-, that captures DL-LITE and DL-LITE.
Exercise 8.6.1 With the University example, find a new query that has a different answer:
Exercise 8.6.2 Prove that the Saturation algorithm runs in 0((M×N)2).
Exercise 8.6.3 Prove that the rules used for computing the TBox closure are sound.
Exercise 8.6.4 Consider Abox in Figure 8.1 and ′obtained by deleting the fact PhDStudent(paul), and the Tbox in Figure 8.6. Show that:
Exercise 8.6.5 Give the FOL rule corresponding to the different cases of negative inclusion axioms.
1http://challenge.semanticweb.org/