3  Typing

In this chapter, we discuss the typing of semistructured data. Typing is the process of describing, with a set of declarative rules or constraints called a schema, a class of XML documents, and verifying that a given document is valid for that class (we also say that this document is valid against the type defined by the schema). This is for instance used to define a specific XML vocabulary (XHTML, MathML, RDF, etc.), with its specificities in structure and content, that is used for a given application.

We first present motivations and discuss the kind of typing that is needed. XML data typing is typically based on finite-state automata. Therefore, we recall basic notion of automata, first on words, then on ranked trees, finally on unranked trees, i.e., essentially on XML. We also present the main two practical languages for describing XML types, DTDs and XML Schema, both of which endorsed by the W3C. We then briefly describe alternative schema languages with their key features. In a last section, we discuss the typing of graph data.

One can also consider the issue of “type checking a program”, that is, verifying that if the input is of a proper input type, the program produces an output that is of a proper output type. In Section 3.5, we provide references to works on program type checking in the context of XML.

 3.1  Motivating Typing
  Dynamic and Static Typing
 3.2  Automata
  3.2.1  Automata on Words
  3.2.2  Automata on Ranked Trees
  3.2.3  Unranked Trees
  3.2.4  Trees and Monadic Second-Order Logic
 3.3  Schema Languages for XML
  3.3.1  Document Type Definitions
  3.3.2  XML Schema
  3.3.3  Other Schema Languages for XML
 3.4  Typing Graph Data
  3.4.1  Graph Semistructured Data
  3.4.2  Graph Bisimulation
  3.4.3  Data guides
 3.5  Further reading
  Schema Inference and Static Typing
  Automata
  Schema Languages for XML
  Typing languages
  Typing Graph Data
 3.6  Exercises

3.1  Motivating Typing

Perhaps the main difference with typing in relational systems is that typing is not compulsory for XML. It is perfectly fine to have an XML document with no prescribed type. However, when developing and using software, types are essential, for interoperability, consistency, and efficiency. We describe these motivations next and conclude the section by contrasting two kinds of type checking, namely dynamic and static.

Interoperability. Schemas serve to document the interface of software components, and provide therefore a key ingredient for the interoperability between programs: a program that consumes an XML document of a given type can assume that the program that has generated it has produced a document of that type.

Consistency. Similarly to dependencies for the relational model (primary keys, foreign key constraints, etc.), typing an XML document is also useful to protect data against improper updates.

Storage Efficiency. Suppose that some XML document is very regular, say, it contains a list of companies, with, for each, an ID, a name, an address and the name of its CEO. This same information may be stored very compactly, for instance, without repeating the names of elements such as address for each company. Thus, a priori knowledge on the type of the data may help improve its storage.

Query Efficiency. Consider the following XQuery query:

for $b in doc("bib.xml")/bib//* 
where $b/*/zip = ’12345’ 
return $b/title

Knowing that the document consists of a list of books and knowing the exact type of book elements, one may be able to rewrite the query:

for $b in doc("bib.xml")/bib/book 
where $b/address/zip = ’12345’ 
return $b/title

that is typically much cheaper to evaluate. Note that in the absence of a schema, a similar processing is possible by first computing from the document itself a data guide, i.e., a structural summary of all paths from the root in the document. There are also other more involved schema inference techniques that allow attaching such an a posteriori schema to a schemaless document.

Dynamic and Static Typing

Assume that XML documents (at least some of them) are associated with schemas and that programs use these schemas. In particular, they verify that processed documents are valid against them. Most of the time, such verification is dynamic. For instance, a Web server verifies the type when sending an XML document or when receiving it. Indeed, XML data tend to be checked quite often because programs prefer to verify types dynamically (when they transfer data) than risking to run into data of unexpected structure during execution.

It is also interesting, although more complicated, to perform static type checking, i.e., verify that a program receiving data of the proper input type only generates data of the proper output type. More formally, let note dT when a document d is valid against a type T. We say that a type T1 is more specific than a type T2 if all documents valid against T1 are also valid against T2, i.e.,

∀d(d |= T1 ⇒ d |= T 2).
Let Ti be an input type and f be a program or query that takes as input an XML document and returns an XML document. This f might be an XPath or XQuery query, an XSLT stylesheet, or even a program in a classical programming language. Static typing implies either static verification or static inference, defined as follows:
Verification:
Is it true that dTi, f(d)To, for some given output type To?
Inference:
Find the most specific To such that dTi, f(d)To.

Note that in a particular case, we have no knowledge of the input type, and the inference problem becomes: Find the most specific To such that f(d)To.

The notion of smallest output type depends of the schema language considered. Assume for instance that types are described as regular expressions on the tag sequences1 and consider the following XQuery query:

for $p in doc("parts.xml")//part[color="red"] 
return <part>{$p/name, $p/desc}</part>

Assuming no constraint on the input type, it is easy to see that the type of the result is described by the following regular expression:

(<part> (<name> any </name >)* (<desc > an y </desc >)* </part >)*

where any stands for any well-formed tag sequence. The regular language described by this regular expression is also the smallest one that describes the output of the query, since any document in this language can be generated as the output of this query.

Verifying or inferring an output type for a program is in all generality an undecidable problem, even for XPath queries and a simple schema language such as DTDs. Even when the verification problem is decidable, a smallest output type might not exist. Consider for instance the XQuery program:

let $d:=doc("input.xml") 
for $x in $d//a, $y in $d//a 
return <b/>

Suppose that the input is <input><a/><a/></input>. Then the result is:

<b/><b/><b/><b/>

In general, the result consists in n2 b-elements for some n0. Such a type cannot be described by DTDs or XML schemas. One can approximate it by regular expressions but not obtain a “best” result:

<b/ >*
               4     *
ϵ + <b/> + <b/>4<b/ >  9     *
ϵ + <b/> + <b/>  + <b/> <b/ >
...

3.2  Automata

XML was recently introduced. Fortunately, the model can benefit from a theory that is well established, automata theory. We briefly recall some standard definitions and results on automata over words. Then we mention without proof how they extend to ranked trees with some limitations. As previously mentioned, XML is based on unranked trees, so we finally consider unranked trees.

3.2.1  Automata on Words

This is standard material. We recall here briefly some notation and terminology.

Definition 3.2.1 A finite-state word automaton (FSA for short) is a 5-tuple ,Q,q0,F,δ) where

  1. Σ is a finite alphabet;
  2. Qis a finite set of states;
  3. q0 Qis the initial state;
  4. F Qis the set of final states; and
  5. δ, the transition function, is a mapping from ∪{ϵ}) ×Qto 2Q.

Such a nondeterministic automaton accepts or rejects words in Σ*. A word wΣ* is accepted by an automaton if there is a path in the state space of the automaton, leading from the initial state to one of the final states and compatible with the transition functions, such that the concatenation of all labels from Σ ∪{ϵ} along the path is w. The set of words accepted by an automaton A is denoted L(A). A language accepted by an FSA is called a regular language. They can alternatively be described as regular expressions built using concatenation, union and Kleene closure, e.g., a(b + c)*d.

_____________________________________________________________________

Example 3.2.2 Consider the FSA A with Σ = {a,b}, Q = {q0,q1,q2,q3}, F = {q2}, δ(a,q0) = {q0,q1}, δ(b,q1) = {q0}, δ(ϵ,q1) = {q2}, δ(ϵ,q2) = {q3}, δ(ϵ,q3) = {q3}. Then abaab is not in L(A) whereas aba is in L(A).
_____________________________________________________________________

An automaton is deterministic if (i) it has no ϵ-transition, i.e., δ(ϵ,q) = for all q; and (ii) there are no multiple transitions from a single pair of symbol and state, i.e., |δ(a,q)|≤ 1 for all (a,q).

The following important results are known about FSAs:

  1. For each FSA A, one can construct an equivalent deterministic FSA B (i.e., a deterministic FSA accepting exactly the same words). In some cases, the number of states of B is necessarily exponential in that of A. This leads to another fundamental problem that we will ignore here, the problem of state minimization, i.e., the problem of finding an equivalent deterministic automaton with as few states as possible.
  2. There is no FSA accepting the language {aibii0}.
  3. Regular languages are closed under complement.
    (To see this, consider an automaton A. Construct a deterministic automaton that accepts the same language but never “blocks”, i.e., that always reads the entire word. Let Q be the set of states of this automaton and F its set of accepting states. Then the automaton obtained by replac