5  Putting into Practice: Managing an XML Database with EXIST

This chapter proposes some exercises and projects to manipulate and query XML documents in a practical context. The software used in these exercises is EXIST, an open-source native XML database which provides an easy-to-use and powerful environment for learning and applying XML languages. We begin with a brief description on how to install EXIST and execute some simple operations. EXIST provides a graphical interface which is pretty easy to use, so we limit our explanations below to the vital information which can be useful to save some time to the absolute beginner.

 5.1  Pre-requisites
 5.2  Installing EXIST
 5.3  Getting started with EXIST
 5.4  Running XPath and XQuery queries with the sandbox
  5.4.1  XPath
  5.4.2  XQuery
  5.4.3  Complement: XPath and XQuery operators and functions
 5.5  Programming with EXIST
  5.5.1  Using the XML:DB API with EXIST
  5.5.2  Accessing EXIST with Web Services
 5.6  Projects
  5.6.1  Getting started
  5.6.2  Shakespeare Opera Omnia
  5.6.3  MusicXML on line

5.1  Pre-requisites

In the following, we assume that you plan to install EXIST in your Windows or Linux environment. You need a Java Development Kit for running the EXIST java application (version 1.5 at least). If you do not have a JDK already installed, get it from the Sun site (try searching “download JDK 1.5” with Google to obtain an appropriate URL) and follow the instructions to set up your Java environment.

Be sure that you can execute Java applications. This requires the definition of a JAVA_HOME environment variable, pointing to the JDK directory. The PATH variable must also contain an entry to the directory that contain the Java executable, $JAVA_HOME/bin.

  1. Under Windows: load the configuration panel window; run the System application; choose Advanced, then Environment variables. Create a new variable JAVA_HOME with the appropriate location, and add the $JAVA_HOME/bin path to the PATH variable.
  2. Under Linux: same as before, the exact command depending on your shell language. For instance, with bash, put the following in the .bashrc file:
    export JAVA_HOME=your_path_to_java  
    export PATH=$PATH:$JAVA_HOME/bin  
           

  3. Under MacOS X, the Java environment should be natively configured.

Check that you can run the java program. If yes you are ready to install EXIST.

5.2  Installing EXIST

EXIST (software, documentation, and many other things) can be found at http://www.exist-db.org/.

The software consists of a Java archive which can be downloaded from the home page. Assume its name is exist.jar (actually the archive name will contain the version as well). Run the installation package with the following command:

java -jar exist.jar


PIC

Figure 5.1: The home page of a local EXIST server.

Just follow the instructions, which asks in particular for the EXIST installation directory (referred to a ’EXIST home’ in the following). Once the installation is completed, you can start the EXIST server as follows:

Linux, Mac OS X and other Unix systems.
Move to the EXIST home directory, and type either bin/startup.sh or bin/startup.bat. If something goes wrong, look at the README file.
Windows.
The installation procedure creates an EXIST menu and shorcuts in the desktop. Simply use them to start/stop the server.

When the server is up and running, it waits for HTTP requests on the port 80801. So, using any Web browser, you can access to the EXIST interface (Figure 5.1) at the URL http://localhost:8080/exist/.

From this interface, you can carry out administration tasks, get the documentation, and run a few predefined client applications. Look at the QuickStart part of the documentation for further information on the configuration of EXIST.

5.3  Getting started with EXIST

EXIST comes with a predefined set of XML samples which can be loaded in the database. To start using these example applications, you must log in to the EXIST administration page. Enter the username "admin" and leave the password field empty2. Next, choose Examples Setup from the menu on the left. Click on the "Import Data" button to start the setup and begin downloading example data from the Internet (Figure 5.2).


PIC

Figure 5.2: Loading XQuery examples in EXIST.

The import creates so-called "collections". A collection can be used to store a set of documents sharing the same schema, but it can be itself organized recursively in sub-collections for more flexible document sets organizations. Loading the example creates several collections, including:

  1. library: a bibliographic RDF document, biblio.rdf ;
  2. mondial: administrative information on several countries;
  3. shakespeare: a few plays from William Shakespeare.

You can access these collections from the "Browse collections" option of the admin menu. You can also create a new collection with the form at the bottom of the page. Do the following:

  1. create a movies collection;
  2. add to the movies collection the document movies.xml which can be downloaded from our site.

EXIST stores now movies.xml in its repository, and you can search, update or transform the document. Figure 5.3 shows the Web interface that you should obtain.


PIC

Figure 5.3: The movies collection, containing a movies.xml sample document.

Now, get back the home page, and choose the "XQuery Sandbow" option. It provides a simple interface that allows to enter XQuery (or XPath) expressions, and displays the result. Check the following XPath/XQuery query:

/movies

This shows the content of the movies elements found in all the collections stored under the /db root element of EXIST (which plays the role of a global root for all the documents stored in the repository). Figure 5.4 shows how to run XPath queries with the sanbox: the interface shows the result sequence in the bottom window.


PIC

Figure 5.4: Running the /movies//title XPath query in the sandbox.

You can restrict the search to a specific documents (or set of documents) with the document() function. Here is an example:

document(’/db/movies/movies.xml’)/movies/movie[year=2005]

The collection() function allows to refer to a collection, as in:

collection(’movies’)/movies/movie[year=2005]

You are now ready to play with XPath and XQuery, using the sandbox. The next section proposes exercises.

5.4  Running XPath and XQuery queries with the sandbox

5.4.1  XPath

Get the movies.xml and movies_refs.xml documents from the book’s Web site, and insert them into EXIST. Look at the document structure: in the first one, each movie is representated as one element, including the director’s and actors’ names. In the second one the document consists of two lists, one for movies, one for actors, the former referencing the latter.

Express the following queries in XPath 1.0, on both documents (note: movies.xml does not require joining the two lists, which makes expressions easier).

  1. All title elements.
  2. All movie titles (i.e., the textual value of title elements).
  3. Titles of the movies published after 2000.
  4. Summary of “Spider-Man”.
  5. Who is the director of Heat?
  6. Title of the movies featuring Kirsten Dunst.
  7. Which movies have a summary?
  8. Which movies do not have a summary?
  9. Titles of the movies published more than 5 years ago.
  10. What was the role of Clint Eastwood in Unforgiven?
  11. What is the last movie of the document?
  12. Title of the film that immediatly precedes Marie Antoinette in the document?
  13. Get the movies whose title contains a “V” (hint: use the function contains()).
  14. Get the movies whose cast consists of exactly three actors (hint: use the function count()).

5.4.2  XQuery

Exercise 5.4.1 Get the movies_alone.xml and artists_alone.xml documents from the book’s Web site. They contain respectively the list of movies with references to artists, and the list of artists. It might be simpler to first express complex queries over the movies.xml document which contains all the data in one file, before reformulating them over these two documents.

For convenience, you may define a variable for each of these document with a let at the beginning of each query:

let $ms:=doc("movies/movies_alone.xml"), 
    $as:=doc("movies/artists_alone.xml")

Express the following XQuery queries:

  1. List the movies published after 2002, including their title and year.
  2. Create a flat list of all the title-role pairs, with each pair enclosed in a “result” element. Here is an example of the expected result structure:
    <results> 
      <result> 
        <title>Heat</title> 
        <role>Lt. Vincent Hanna</role> 
      </result> 
      <result> 
        <title>Heat</title> 
        <role>Neil McCauley</role> 
      </result> 
    </results>
  3. Give the title of movies where the director is also one of the actors.
  4. Show the movies, grouped by genre. Hint: function distinct-values() removes the duplicates from a sequence. It returns atomic values.
  5. For each distinct actor’s id in movies_alone.xml, show the titles of the movies where this actor plays a role. The format of the result should be:
    <actor>16, 
    <title>Match Point</title> 
    <title>Lost in Translation</title> 
    </actor>

    Variant: show only the actors which play a role in at least two movies (hint: function count() returns the number of nodes in a sequence).

  6. Give the title of each movie, along with the name of its director. Note: this is a join!
  7. Give the title of each movie, and a nested element <actors>giving the list of actors with their role.
  8. For each movie that has at least two actors, list the title and first two actors, and an empty "et-al" element if the movie has additional actors. For instance:
    <result> 
    <title>Unforgiven</title> 
    <actor>Clint Eastwood as William Bill Munny</actor> 
    <actor>Gene Hackman as Little Bill Daggett</actor> 
    <et-al/> 
    </result>
  9. List the titles and years of all movies directed by Clint Eastwood after 1990, in alphabetic order.

5.4.3  Complement: XPath and XQuery operators and functions

XPath proposes many operators and functions to manipulate data values. These operators are mostly used in predicates, but they are also important as part of the XQuery language which inherits all of the XPath language. We briefly describe the most important ones here, all from XPath 1.0. XPath 2.0 widely extends the library of available functions.

Operators.

Note that $a is a reference to the variable a. Variables cannot be defined in XPath 1.0, they can only be referred to.

Node functions.

The following functions apply to node sets.

String functions.

The following functions apply to character strings.

Boolean and number functions.

5.5  Programming with EXIST

We give below some examples of the programming interfaces (API) provided by EXIST. They can be used to access an XML repository from an application program. Two interfaces are shown: the first one, XML:DB, is a JDBC-like component that can be introduced in a Java application as a communication layer with EXIST; the second one are the web services of EXIST.

Along with XPath and XQuery, these APIs constitute a complete programming environment which is the support of projects proposed in the final section of the Chapter.

5.5.1  Using the XML:DB API with EXIST

Documents stored in EXIST can be accessed from a Java application. EXIST provides an implementation of the XML:DB API specification, which is to XML databases what JDBC is to relational databases.

We give below a few examples of Java programs that connect to EXIST, access documents and perform XQuery queries. You can get these files from our site. You need a Java programming environment (Java Development Kit) with the JAVA_HOME environment variable properly set, and JAVA_HOME/bin directory added to the PATH environment variable.

In order to use the EXIST API, the following archives must be put in your CLASSPATH.

  1. exist.jar, found in EXIST_HOME or EXIST_HOME/lib;
  2. xmldb.jar, xmlrpc-1.2-patched.jar, and log4j-1.2.14.jar, found in EXIST_HOME/lib/core.

These archives are sufficient to compile and run our examples. For more complex Java programs, some other archives might be necessary (for instance the XERCES parser or the XALAN XSLT processor). They can usually be found in EXIST_HOME/lib

You can find many explanations and examples on Java programming with EXIST at http://exist.sourceforge.net/devguide.html.


 
import org.xmldb.api.DatabaseManager; 
import org.xmldb.api.base.Collection; 
import org.xmldb.api.base.CompiledExpression; 
import org.xmldb.api.base.Database; 
import org.xmldb.api.base.Resource; 
import org.xmldb.api.base.ResourceIterator; 
import org.xmldb.api.base.ResourceSet; 
import org.xmldb.api.modules.XMLResource; 
 
public class ExistAccess { 
        protected static String DRIVER = "org.exist.xmldb.DatabaseImpl"; 
        protected static String URI = "xmldb:exist://localhost:8080/exist/xmlrpc"; 
        protected static String collectionPath = "/db/movies/"; 
        protected static String resourceName = "Heat.xml"; 
 
        public static void main(String[] args) throws Exception { 
 
                // initialize database driver 
                Class cl = Class.forName(DRIVER); 
                Database database = (Database) cl.newInstance(); 
                DatabaseManager.registerDatabase(database); 
 
                // get the collection 
                Collection col = DatabaseManager.getCollection(URI + collectionPath); 
 
                // get the content of a document 
                System.out.println("Get the content of " + resourceName); 
                XMLResource res = (XMLResource) col.getResource(resourceName); 
                if (res == null) { 
                        System.out.println("document not found!"); 
                } else { 
                        System.out.println(res.getContent()); 
                } 
        } 
}

Figure 5.5: First example: retrieving a document



 
import org.exist.xmldb.XQueryService; 
import org.xmldb.api.DatabaseManager; 
import org.xmldb.api.base.Collection; 
import org.xmldb.api.base.CompiledExpression; 
import org.xmldb.api.base.Database; 
import org.xmldb.api.base.Resource; 
import org.xmldb.api.base.ResourceIterator; 
import org.xmldb.api.base.ResourceSet; 
import org.xmldb.api.modules.XMLResource; 
 
public class ExistQuery { 
        protected static String DRIVER = "org.exist.xmldb.DatabaseImpl"; 
        protected static String URI = "xmldb:exist://localhost:8080/exist/xmlrpc"; 
        protected static String collectionPath = "/db/movies"; 
        protected static String resourceName = "movies.xml"; 
 
        public static void main(String[] args) throws Exception { 
 
                // initialize database driver 
                Class cl = Class.forName(DRIVER); 
                Database database = (Database) cl.newInstance(); 
                DatabaseManager.registerDatabase(database); 
 
                // get the collection 
                Collection col = DatabaseManager.getCollection(URI + collectionPath); 
 
                // query a document 
 
                String xQuery = "for $x in doc(’" + resourceName + "’)//title " 
                                + "return data($x)"; 
                System.out.println("Execute xQuery = " + xQuery); 
 
                // Instantiate a XQuery service 
                XQueryService service = (XQueryService) col.getService("XQueryService", 
                                "1.0"); 
                service.setProperty("indent", "yes"); 
 
                // Execute the query, print the result 
                ResourceSet result = service.query(xQuery); 
                ResourceIterator i = result.getIterator(); 
                while (i.hasMoreResources()) { 
                        Resource r = i.nextResource(); 
                        System.out.println((String) r.getContent()); 
                } 
        } 
}

Figure 5.6: Second example: query a collection


5.5.2  Accessing EXIST with Web Services

If you do not want to use Java, or if the architecture of your application makes the embedding of XQuery calls in your code unsuitable, you can use the Web Service layer of EXIST to send queries and get result. This is actually an excellent example of the advantage of a well-defined query language as an interface to a remote data source, and serves as a (small) introduction to the world of distributed computing with web services.

EXIST runs a server on a machine, and the server opens several communication port to serve requests that might come from distant locations. Several protocols are available, but the simple one is based on REST (Representational state transfer), a service layer implementation that completely relies on HTTP.

Recall that a web service allows a Client to send a function call to a server without having to deal with data conversion and complicated network communication issues. In the case of REST-style services, the function call is encoded as a URL, including the function parameters, and transmitted to the server with an HTTP GET or POST request. The REST servers sends back the result in HTTP. REST services are quite easy to deal with: the client application just needs an HTTP client library, available in all programming languages. Moreover, the service can easily be tested with a browser by sending “manual” URL.

Here is a first example: the following GET query retrieves the movies.xml document from your local EXIST repository:  
http://localhost:8080/exist/rest/db/movies/movies.xml

Note that this is a “pseudo-url”: there is nothing on your local disk that resembles directory path /exist/rest/db/movies/movies.xml. Actually,

  1. the REST service is located at /exist/rest/; thanks to a URL rewriting mechanism, any GET request that begins with this address is redirected to the service;
  2. the remaining part of the URL, db/movies/movies.xml, is a parameter sent to the service; it must denote either a collection or a document.

In summary, a REST query such as the above one is tantamount to a document() or collection() call sent to EXIST. The service accepts other parameters, and in particular the _query parameter whose value may be an XPath or XQuery expression. Try the following URL:  
http://localhost:8080/exist/rest/db/movies?_query=//title

You should get the result of the XPath query collection(’movies’)//title.

Remark 5.5.1 REST services are called via GET or POST requests. In the case of GET, all the values must be URL-encoded. This is automatically done by a browser (or any HTTP client) when the URL is a link in an HTML page, but be careful when you directly enter the URL.

Here is a selected list of parameters accepted by the EXIST REST service (for details, please refer to the Developper’s guide on the EXIST web site).

5.6  Projects

The following projects are intended to let you experiment XML data access in the context of an application that needs to manipulate semi-structured data. In all cases, you must devise an architecture that fulfills the proposed requirements, the only constraint being that the data source model must be XML. Of course you are invited to base your architecture on an appropriate combination of the XML tools and languages presented in this book and during the classes: XPath, XSLT (see the resources available on the companion Web site of this book), XQuery, and Java APIs or REST services.

5.6.1  Getting started

The minimal project is a very simple Web application that allows to search some information in an XML document, and displays this information in a user-friendly way. You can take the movies.xml document, or any other XML resource that can be used for the same purpose.

The application must be accessible from a Web Browser (e.g., Firefox) or from a smartphone browser (e.g., a mobile phone: take one of the many mobile phone simulators on the Web). Here are the requirements:

  1. there must be a form that proposes to the user a list of search criteria: (fragment of) the title, list of genres, director and actors names, years, and key-words that can be matched against the summary of the movie;
  2. when the user submits the form, the application retrieves the relevant movies from the XML repository, and displays the list of these movies in XHTML;
  3. in the previous list, each movie title should be a link that allows to display the full description of the movie.

This is a simple project. It can be achieved by a single person in limited time. In that case you are allowed to omit other markup languages (but doing it will result in a better appreciation!).

5.6.2  Shakespeare Opera Omnia

The project is based on the Shakespeare’s collection of plays shipped with the EXIST software. The application’s purposes can be summarized as follows: browsing through a play in order to analyze its content, read some specific parts and maybe find related information.

Basically, it consists in writing a small Web application devoted to navigating in a play and extracting some useful information. Here are a few precise requirements:

  1. show the part of a given character, for a given act and/or a given scene;
  2. show the table of contents of the play, along with the organization in acts and scenes, and the characters present in each scene;
  3. show a full summary of the play, including the author, list of characters, stages requirements, etc.

The web application should be presented in a consistent way, allowing users to switch from one summary to another. It should be possible to navigate from the table of contents to the full description of a scene, from the description of the scene to a character, and from anywhere to the full summary. This list is by no way restrictive. Any other useful extract you can think of will raise the final appreciation!

Finally, the web application should be available on a traditional browser, as well as on a smartphone.

The project must be conducted in two steps:

  1. write a short description of the architecture and design of your XML application; check with some limited tests than you know how to put each tool at the right place where it communicates correctly with the rest of the application;
  2. once the design has been validated, you can embark in the development.

5.6.3  MusicXML on line

This is an exploratory project, since there are no guarantees that all the ideas presented below can be implemented in a reasonable amount of time. The project is also focused on a specific area: music representation and manipulation. So, it should be chosen only by people with both musical inclination and appetite for not yet explored domains.

Music is traditionally distributed on the Web via audio files, described by a few meta-data (author, style, etc.). A quite different approach consists in distributing an accurate content description based on a digital score. The MusicXML DTD serves this purpose. It allows to represent the music notation of any piece of music (voice, instruments, orchestra, etc.). The goal of the projet is to specify a Web portal distributing music scores, and to investigate the functionalities of such a portal.

The following is a list of suggestions, but the project is really open to new directions. Other ideas are welcome, but please talk first to your advisor before embarking in overcomplicated tasks.

Data

It is not that difficult to find data (e.g., digital music scores). Look for instance at http://icking-music-archive.org/. A collection is also available from your advisor, but you are encouraged to search resources on the Web. Any kind of music could do, from full symphony to voice-piano/guitar reduction of folk songs. You can look for digital score produced by tools such as Finale or Sibelius. From these digital scores it is possible to export MusicXML files.

Core functions

The basic requirement is to be able to store XML music sheets in EXIST, and to display the music on demand. Displaying score can be achieved with the Lilypond software (http://lilypond.org/), along with a convertor (musicxml2ly) from MusicXML to Lilypond format. Putting all these tools together is probably the first thing you should do.

It would be useful to extract some important parts from the MusicXML document. For instance, you can extract the lyrics from a song, as well as the melody. A basic search form to extract lyrics and/or melody based on a simple pattern (i.e., e keyword or a musical fragment) would be welcome.

Advanced options

Here is now a list of the possible additional functionalities that could be envisaged. You are free to limit yourself to a state-of-the-art of the possible solutions, to implement (in a simple way) some of them, or both.

  1. Input: how can we enter music (e.g., a song) in a database, and how can we query music (e.g., with a keybard simulator, by whistling, by putting your iPod in front of a microphone, etc.);
  2. Ouput: OK, we can print a score; but what if we we want to listen music? Can we transform an XML document to a MIDI document? Yes, this is possible with Lilypond: you are encouraged to investigate further.

Finally, the web application should be available on a traditional browser, as well as on a smartphone.

This remains, of course, about XML and its multiple usages. You must devise a convenient architecture, using appropriately all the tools that, together, will enable the functionalities of your application.

1This is the default port. You can change in the jetty.xml file in the sub-directory tools/jetty of EXIST.

2The installation tool sometimes requires a password to set up EXIST, in which case access to the administration page is protected by this password.