This chapter proposes some exercises and projects to manipulate and query XML documents in a practical context. The software used in these exercises is EXIST, an open-source native XML database which provides an easy-to-use and powerful environment for learning and applying XML languages. We begin with a brief description on how to install EXIST and execute some simple operations. EXIST provides a graphical interface which is pretty easy to use, so we limit our explanations below to the vital information which can be useful to save some time to the absolute beginner.
In the following, we assume that you plan to install EXIST in your Windows or Linux environment. You need a Java Development Kit for running the EXIST java application (version 1.5 at least). If you do not have a JDK already installed, get it from the Sun site (try searching “download JDK 1.5” with Google to obtain an appropriate URL) and follow the instructions to set up your Java environment.
Be sure that you can execute Java applications. This requires the definition of a JAVA_HOME environment variable, pointing to the JDK directory. The PATH variable must also contain an entry to the directory that contain the Java executable, $JAVA_HOME/bin.
Check that you can run the java program. If yes you are ready to install EXIST.
EXIST (software, documentation, and many other things) can be found at http://www.exist-db.org/.
The software consists of a Java archive which can be downloaded from the home page. Assume its name is exist.jar (actually the archive name will contain the version as well). Run the installation package with the following command:
java -jar exist.jar
Just follow the instructions, which asks in particular for the EXIST installation directory (referred to a ’EXIST home’ in the following). Once the installation is completed, you can start the EXIST server as follows:
When the server is up and running, it waits for HTTP requests on the port 80801. So, using any Web browser, you can access to the EXIST interface (Figure 5.1) at the URL http://localhost:8080/exist/.
From this interface, you can carry out administration tasks, get the documentation, and run a few predefined client applications. Look at the QuickStart part of the documentation for further information on the configuration of EXIST.
EXIST comes with a predefined set of XML samples which can be loaded in the database. To start using these example applications, you must log in to the EXIST administration page. Enter the username "admin" and leave the password field empty2. Next, choose Examples Setup from the menu on the left. Click on the "Import Data" button to start the setup and begin downloading example data from the Internet (Figure 5.2).
The import creates so-called "collections". A collection can be used to store a set of documents sharing the same schema, but it can be itself organized recursively in sub-collections for more flexible document sets organizations. Loading the example creates several collections, including:
You can access these collections from the "Browse collections" option of the admin menu. You can also create a new collection with the form at the bottom of the page. Do the following:
EXIST stores now movies.xml in its repository, and you can search, update or transform the document. Figure 5.3 shows the Web interface that you should obtain.
Now, get back the home page, and choose the "XQuery Sandbow" option. It provides a simple interface that allows to enter XQuery (or XPath) expressions, and displays the result. Check the following XPath/XQuery query:
/movies
This shows the content of the movies elements found in all the collections stored under the /db root element of EXIST (which plays the role of a global root for all the documents stored in the repository). Figure 5.4 shows how to run XPath queries with the sanbox: the interface shows the result sequence in the bottom window.
You can restrict the search to a specific documents (or set of documents) with the document() function. Here is an example:
document(’/db/movies/movies.xml’)/movies/movie[year=2005]
The collection() function allows to refer to a collection, as in:
collection(’movies’)/movies/movie[year=2005]
You are now ready to play with XPath and XQuery, using the sandbox. The next section proposes exercises.
Get the movies.xml and movies_refs.xml documents from the book’s Web site, and insert them into EXIST. Look at the document structure: in the first one, each movie is representated as one element, including the director’s and actors’ names. In the second one the document consists of two lists, one for movies, one for actors, the former referencing the latter.
Express the following queries in XPath 1.0, on both documents (note: movies.xml does not require joining the two lists, which makes expressions easier).
Exercise 5.4.1 Get the movies_alone.xml and artists_alone.xml documents from the book’s Web site. They contain respectively the list of movies with references to artists, and the list of artists. It might be simpler to first express complex queries over the movies.xml document which contains all the data in one file, before reformulating them over these two documents.
For convenience, you may define a variable for each of these document with a let at the beginning of each query:
Express the following XQuery queries:
Variant: show only the actors which play a role in at least two movies (hint: function count() returns the number of nodes in a sequence).
XPath proposes many operators and functions to manipulate data values. These operators are mostly used in predicates, but they are also important as part of the XQuery language which inherits all of the XPath language. We briefly describe the most important ones here, all from XPath 1.0. XPath 2.0 widely extends the library of available functions.
Warning! div is used instead of the usual /.
Warning! “-” is also a valid character inside XML names; this means an expression a-b tests for an element with name a-b and does not compute the substraction between the values contained in elements a and b. This substration can be expressed as a - b.
Warning! //a!=3 means: there is an a element in the document whose string value is different from 3.
Warning! They can only be used to compare numbers, not strings.
Warning! If an XPath expression is embedded in an XML document (this is the case in XSLT), < must be used in place of <.
Note that $a is a reference to the variable a. Variables cannot be defined in XPath 1.0, they can only be referred to.
The following functions apply to node sets.
The following functions apply to character strings.
We give below some examples of the programming interfaces (API) provided by EXIST. They can be used to access an XML repository from an application program. Two interfaces are shown: the first one, XML:DB, is a JDBC-like component that can be introduced in a Java application as a communication layer with EXIST; the second one are the web services of EXIST.
Along with XPath and XQuery, these APIs constitute a complete programming environment which is the support of projects proposed in the final section of the Chapter.
Documents stored in EXIST can be accessed from a Java application. EXIST provides an implementation of the XML:DB API specification, which is to XML databases what JDBC is to relational databases.
We give below a few examples of Java programs that connect to EXIST, access documents and perform XQuery queries. You can get these files from our site. You need a Java programming environment (Java Development Kit) with the JAVA_HOME environment variable properly set, and JAVA_HOME/bin directory added to the PATH environment variable.
In order to use the EXIST API, the following archives must be put in your CLASSPATH.
These archives are sufficient to compile and run our examples. For more complex Java programs, some other archives might be necessary (for instance the XERCES parser or the XALAN XSLT processor). They can usually be found in EXIST_HOME/lib
You can find many explanations and examples on Java programming with EXIST at http://exist.sourceforge.net/devguide.html.
If you do not want to use Java, or if the architecture of your application makes the embedding of XQuery calls in your code unsuitable, you can use the Web Service layer of EXIST to send queries and get result. This is actually an excellent example of the advantage of a well-defined query language as an interface to a remote data source, and serves as a (small) introduction to the world of distributed computing with web services.
EXIST runs a server on a machine, and the server opens several communication port to serve requests that might come from distant locations. Several protocols are available, but the simple one is based on REST (Representational state transfer), a service layer implementation that completely relies on HTTP.
Recall that a web service allows a Client to send a function call to a server without having to deal with data conversion and complicated network communication issues. In the case of REST-style services, the function call is encoded as a URL, including the function parameters, and transmitted to the server with an HTTP GET or POST request. The REST servers sends back the result in HTTP. REST services are quite easy to deal with: the client application just needs an HTTP client library, available in all programming languages. Moreover, the service can easily be tested with a browser by sending “manual” URL.
Here is a first example: the following GET query retrieves the movies.xml document from your
local EXIST repository:
http://localhost:8080/exist/rest/db/movies/movies.xml
Note that this is a “pseudo-url”: there is nothing on your local disk that resembles directory path /exist/rest/db/movies/movies.xml. Actually,
In summary, a REST query such as the above one is tantamount to a document() or
collection() call sent to EXIST. The service accepts other parameters, and in particular the
_query parameter whose value may be an XPath or XQuery expression. Try the following
URL:
http://localhost:8080/exist/rest/db/movies?_query=//title
You should get the result of the XPath query collection(’movies’)//title.
Remark 5.5.1 REST services are called via GET or POST requests. In the case of GET, all the values must be URL-encoded. This is automatically done by a browser (or any HTTP client) when the URL is a link in an HTML page, but be careful when you directly enter the URL.
Here is a selected list of parameters accepted by the EXIST REST service (for details, please refer to the Developper’s guide on the EXIST web site).
The following projects are intended to let you experiment XML data access in the context of an application that needs to manipulate semi-structured data. In all cases, you must devise an architecture that fulfills the proposed requirements, the only constraint being that the data source model must be XML. Of course you are invited to base your architecture on an appropriate combination of the XML tools and languages presented in this book and during the classes: XPath, XSLT (see the resources available on the companion Web site of this book), XQuery, and Java APIs or REST services.
The minimal project is a very simple Web application that allows to search some information in an XML document, and displays this information in a user-friendly way. You can take the movies.xml document, or any other XML resource that can be used for the same purpose.
The application must be accessible from a Web Browser (e.g., Firefox) or from a smartphone browser (e.g., a mobile phone: take one of the many mobile phone simulators on the Web). Here are the requirements:
This is a simple project. It can be achieved by a single person in limited time. In that case you are allowed to omit other markup languages (but doing it will result in a better appreciation!).
The project is based on the Shakespeare’s collection of plays shipped with the EXIST software. The application’s purposes can be summarized as follows: browsing through a play in order to analyze its content, read some specific parts and maybe find related information.
Basically, it consists in writing a small Web application devoted to navigating in a play and extracting some useful information. Here are a few precise requirements:
The web application should be presented in a consistent way, allowing users to switch from one summary to another. It should be possible to navigate from the table of contents to the full description of a scene, from the description of the scene to a character, and from anywhere to the full summary. This list is by no way restrictive. Any other useful extract you can think of will raise the final appreciation!
Finally, the web application should be available on a traditional browser, as well as on a smartphone.
The project must be conducted in two steps:
This is an exploratory project, since there are no guarantees that all the ideas presented below can be implemented in a reasonable amount of time. The project is also focused on a specific area: music representation and manipulation. So, it should be chosen only by people with both musical inclination and appetite for not yet explored domains.
Music is traditionally distributed on the Web via audio files, described by a few meta-data (author, style, etc.). A quite different approach consists in distributing an accurate content description based on a digital score. The MusicXML DTD serves this purpose. It allows to represent the music notation of any piece of music (voice, instruments, orchestra, etc.). The goal of the projet is to specify a Web portal distributing music scores, and to investigate the functionalities of such a portal.
The following is a list of suggestions, but the project is really open to new directions. Other ideas are welcome, but please talk first to your advisor before embarking in overcomplicated tasks.
It is not that difficult to find data (e.g., digital music scores). Look for instance at http://icking-music-archive.org/. A collection is also available from your advisor, but you are encouraged to search resources on the Web. Any kind of music could do, from full symphony to voice-piano/guitar reduction of folk songs. You can look for digital score produced by tools such as Finale or Sibelius. From these digital scores it is possible to export MusicXML files.
The basic requirement is to be able to store XML music sheets in EXIST, and to display the music on demand. Displaying score can be achieved with the Lilypond software (http://lilypond.org/), along with a convertor (musicxml2ly) from MusicXML to Lilypond format. Putting all these tools together is probably the first thing you should do.
It would be useful to extract some important parts from the MusicXML document. For instance, you can extract the lyrics from a song, as well as the melody. A basic search form to extract lyrics and/or melody based on a simple pattern (i.e., e keyword or a musical fragment) would be welcome.
Here is now a list of the possible additional functionalities that could be envisaged. You are free to limit yourself to a state-of-the-art of the possible solutions, to implement (in a simple way) some of them, or both.
Finally, the web application should be available on a traditional browser, as well as on a smartphone.
This remains, of course, about XML and its multiple usages. You must devise a convenient architecture, using appropriately all the tools that, together, will enable the functionalities of your application.