Contents

  Introduction
I  Modeling Web Data
1  Data Model
 1.1  Semistructured data
 1.2  XML
  1.2.1  XML documents
  1.2.2  Serialized and tree-based forms
  1.2.3  XML syntax
  1.2.4  Typing and namespaces
  1.2.5  To type or not to type
 1.3  Web Data Management with XML
  1.3.1  Data exchange
  1.3.2  Data integration
 1.4  The XML World
  1.4.1  XML dialects
  1.4.2  XML standards
 1.5  Further reading
 1.6  Exercises
  1.6.1  XML documents
  1.6.2  XML standards
2  XPath and XQuery
 2.1  Introduction
 2.2  Basics
  2.2.1  XPath and XQuery data model for documents
  2.2.2  The XQuery model (continued) and sequences
  2.2.3  Specifying paths in a tree: XPath
  2.2.4  A first glance at XQuery expressions
  2.2.5  XQuery vs XSLT
 2.3  XPath
  2.3.1  Steps and path expressions
  2.3.2  Evaluation of path expressions
  2.3.3  Generalities on axes and node tests
  2.3.4  Axes
  2.3.5  Node tests and abbreviations
  2.3.6  Predicates
  2.3.7  XPath 2.0
 2.4  FLWOR expressions in XQuery
  2.4.1  Defining variables: the for and let clauses
  2.4.2  Filtering: the where clause
  2.4.3  The return clause
  2.4.4  Advanced features of XQuery
 2.5  XPath foundations
  2.5.1  A relational view of an XML tree
  2.5.2  Navigational XPath
  2.5.3  Evaluation
  2.5.4  Expressiveness and first-order logic
  2.5.5  Other XPath fragments
 2.6  Further reading
 2.7  Exercises
3  Typing
 3.1  Motivating Typing
 3.2  Automata
  3.2.1  Automata on Words
  3.2.2  Automata on Ranked Trees
  3.2.3  Unranked Trees
  3.2.4  Trees and Monadic Second-Order Logic
 3.3  Schema Languages for XML
  3.3.1  Document Type Definitions
  3.3.2  XML Schema
  3.3.3  Other Schema Languages for XML
 3.4  Typing Graph Data
  3.4.1  Graph Semistructured Data
  3.4.2  Graph Bisimulation
  3.4.3  Data guides
 3.5  Further reading
 3.6  Exercises
4  XML Query Evaluation
 4.1  XML fragmentation
 4.2  XML identifiers
  4.2.1  Region-based identifiers
  4.2.2  Dewey-based identifiers
  4.2.3  Structural identifiers and updates
 4.3  XML evaluation techniques
  4.3.1  Structural join
  4.3.2  Optimizing structural join queries
  4.3.3  Holistic twig joins
 4.4  Further reading
 4.5  Exercises
5  Putting into Practice: Managing an XML Database with EXIST
 5.1  Pre-requisites
 5.2  Installing EXIST
 5.3  Getting started with EXIST
 5.4  Running XPath and XQuery queries with the sandbox
  5.4.1  XPath
  5.4.2  XQuery
  5.4.3  Complement: XPath and XQuery operators and functions
 5.5  Programming with EXIST
  5.5.1  Using the XML:DB API with EXIST
  5.5.2  Accessing EXIST with Web Services
 5.6  Projects
  5.6.1  Getting started
  5.6.2  Shakespeare Opera Omnia
  5.6.3  MusicXML on line
6  Putting into Practice: Tree Pattern Evaluation using SAX
 6.1  Tree-pattern dialects
 6.2  CTP evaluation
 6.3  Extensions
II  Web Data Semantics and Integration
7  Ontologies, RDF, and OWL
 7.1  Introduction
 7.2  Ontologies by example
 7.3  RDF, RDFS, and OWL
  7.3.1  Web resources, URI, namespaces
  7.3.2  RDF
  7.3.3  RDFS: RDF Schema
  7.3.4  OWL
 7.4  Ontologies and (Description) Logics
  7.4.1  Preliminaries: the DL jargon
  7.4.2  ALC: the prototypical DL
  7.4.3  Simple DLs for which reasoning is polynomial
  7.4.4  The DL-LITE family: a good trade-off
 7.5  Further reading
 7.6  Exercises
8  Querying Data through Ontologies
 8.1  Introduction
 8.2  Querying RDF data: notation and semantics
 8.3  Querying through RDFS ontologies
 8.4  Answering queries through DL-LITE ontologies
  8.4.1  DL-LITE
  8.4.2  Consistency checking
  8.4.3  Answer set evaluation
  8.4.4  Impact of combining DL-LITE_R and DL-LITE_F on query answering
 8.5  Further reading
 8.6  Exercises
9  Data Integration
 9.1  Introduction
 9.2  Containment of conjunctive queries
 9.3  Global-as-view mediation
 9.4  Local-as-view mediation
  9.4.1  The Bucket algorithm
  9.4.2  The Minicon algorithm
  9.4.3  The Inverse-rules algorithm
  9.4.4  Discussion
 9.5  Ontology-based mediators
  9.5.1  Adding functionality constraints
  9.5.2  Query rewriting using views in DL-LITE_R
 9.6  Peer-to-Peer Data Management Systems
  9.6.1  Answering queries using GLAV mappings is undecidable
  9.6.2  Decentralized DL-LITE_R
 9.7  Further reading
 9.8  Exercises
10  Putting into Practice: Wrappers and Data Extraction with XSLT
 10.1  Extracting Data from Web Pages
 10.2  Restructuring Data
11  Putting into Practice: Ontologies in Practice (by Fabian M. Suchanek)
 11.1  Exploring and installing YAGO
 11.2  Querying YAGO
 11.3  Web access to ontologies
  11.3.1  Cool URIs
  11.3.2  Linked Data
12  Putting into Practice: Mashups with YAHOO! PIPES and XProc
 12.1  YAHOO! PIPES: A Graphical Mashup Editor
 12.2  XProc: An XML Pipeline Language
III  Building Web Scale Applications
13  Web search
 13.1  The World Wide Web
 13.2  Parsing the Web
  13.2.1  Crawling the Web
  13.2.2  Text Preprocessing
 13.3  Web Information Retrieval
  13.3.1  Inverted Files
  13.3.2  Answering Keyword Queries
  13.3.3  Large-scale Indexing with Inverted Files
  13.3.4  Clustering
  13.3.5  Beyond Classical IR
 13.4  Web Graph Mining
  13.4.1  PageRank
  13.4.2  HITS
  13.4.3  Spamdexing
  13.4.4  Discovering Communities on the Web
 13.5  Hot Topics in Web Search
 13.6  Further Reading
 13.7  Exercises
14  An Introduction to Distributed Systems
 14.1  Basics of distributed systems
  14.1.1  Networking infrastructures
  14.1.2  Performance of a distributed storage system
  14.1.3  Data replication and consistency
 14.2  Failure management
  14.2.1  Failure recovery
  14.2.2  Distributed transactions
 14.3  Required properties of a distributed system
  14.3.1  Reliability
  14.3.2  Scalability
  14.3.3  Availability
  14.3.4  Efficiency
  14.3.5  Putting everything together: the CAP theorem
 14.4  Particularities of P2P networks
 14.5  Case study: a Distributed File System for very large files
  14.5.1  Large scale file system
  14.5.2  Architecture
  14.5.3  Failure handling
 14.6  Further reading
15  Distributed Access Structures
 15.1  Hash-based structures
  15.1.1  Distributed Linear Hashing
  15.1.2  Consistent Hashing
  15.1.3  Case study: CHORD
 15.2  Distributed indexing: Search Trees
  15.2.1  Design issues
  15.2.2  Case study: BATON
  15.2.3  Case Study: BIGTABLE
 15.3  Further reading
 15.4  Exercises
16  Distributed Computing with MAPREDUCE and PIG
 16.1  MAPREDUCE
  16.1.1  Programming model
  16.1.2  The programming environment
  16.1.3  MAPREDUCE internals
 16.2  PIG
  16.2.1  A simple session
  16.2.2  The data model
  16.2.3  The operators
  16.2.4  Using MAPREDUCE to optimize PIG programs
 16.3  Further reading
 16.4  Exercises
17  Putting into Practice: Full-Text Indexing with LUCENE (by Nicolas Travers)
 17.1  Preliminary: a LUCENE sandbox
 17.2  Indexing plain text with LUCENE – A full example
  17.2.1  The main program
  17.2.2  Create the Index
  17.2.3  Adding documents
  17.2.4  Searching the index
  17.2.5  LUCENE querying syntax
 17.3  Put it into practice!
  17.3.1  Indexing a directory content
  17.3.2  Web site indexing (project)
 17.4  LUCENE – Tuning the scoring (project)
18  Putting into Practice: Recommendation Methodologies (by Alban Galland)
 18.1  Introduction to recommendation systems
 18.2  Pre-requisites
 18.3  Data analysis
 18.4  Generating some recommendations
  18.4.1  Global recommendation
  18.4.2  User-based collaborative filtering
  18.4.3  Item-based collaborative filtering
 18.5  Projects
  18.5.1  Scaling
  18.5.2  The probabilistic way
  18.5.3  Improving recommendation
19  Putting into Practice: Large-Scale Data Management with HADOOP
 19.1  Installing and running HADOOP
 19.2  Running MAPREDUCE jobs
 19.3  PIGLATIN scripts
 19.4  Running in cluster mode (optional)
  19.4.1  Configuring HADOOP in cluster mode
  19.4.2  Starting, stopping and managing HADOOP
 19.5  Exercises
20  Putting into Practice: COUCHDB, a JSON Semi-Structured Database
 20.1  Introduction to the COUCHDB document database
  20.1.1  JSON, a lightweight semi-structured format
  20.1.2  COUCHDB, architecture and principles
  20.1.3  Preliminaries: set up your COUCHDB environment
  20.1.4  Adding data
  20.1.5  Views
  20.1.6  Querying views
  20.1.7  Distribution strategies: master-master, master-slave and shared-nothing
 20.2  Putting COUCHDB into Practice!
  20.2.1  Exercises
  20.2.2  Project: build a distributed bibliographic database with COUCHDB
 20.3  Further reading
  References