This chapter is an introduction to very large data management in distributed systems. Here, “very large” means a context where Gigabytes (1,000 MB = 10^9 bytes) constitute the unit size for measuring data volumes. Terabytes (10^12 bytes) are commonly encountered, and many Web companies, scientific or financial institutions must deal with Petabytes (10^15 bytes). In the near future, we can expect Exabyte (10^18 bytes) data sets, with the world-wide digital universe roughly estimated (in 2010) at about 1 Zettabyte (10^21 bytes).
Distribution is the key for handling very large data sets. Distribution is necessary (but not sufficient) to bring scalability, i.e., the means of maintaining stable performance for steadily growing data collections by adding new resources to the system. However, distribution brings a number of technical problems that make the design and implementation of distributed storage, indexing and computing a delicate issue. A prominent concern is the risk of failure. In an environment that consists of hundreds or thousands of computers (a common setting for large Web companies), it becomes very common to face the failure of components (hardware, network, local systems, disks), and the system must be ready to cope with it at any moment.
Our presentation covers principles and techniques that recently emerged to handle Web-scale data sets. We examine the extension of traditional storage and indexing methods to large-scale distributed settings. We describe techniques to efficiently process point queries that aim at retrieving a particular object. Here there typically is a human being waiting for an answer in front of a screen. So, efficient means a response time in the order of a few milliseconds, a difficult challenge in the presence of Terabytes of data. We also consider the batch analysis of large collections of documents to extract statistical or descriptive information. The problem is very different. Possibly Terabytes of data are streamed into a program. Efficient computation now means hours or even days, and a most critical issue is the reliable execution of processes that may run so long, in spite of the many glitches that are likely to affect the infrastructure in such a time frame. We should keep these specificities in mind in the presentation that follows, as they motivate many design choices.
The present chapter introduces the essentials of distributed systems devoted to large scale data sets. Its material by no means represents an in-depth coverage of the topic; it merely aims at supplying the neophyte reader with the minimal background. As usual, the Further Reading section points to complementary references.
A distributed system is a piece of software that serves to coordinate the actions of several computers. This coordination is achieved by exchanging messages, i.e., pieces of data conveying information. The system relies on a network that connects the computers and handles the routing of messages.
We limit the discussion in this chapter to the following two classes of networks: Local Area Networks and P2P Networks.
Local Area Network (LAN). LANs are for instance used in data centers to connect hundreds or even thousands of servers. Figure 14.1 shows the main features of a typical Local Area Network (LAN) in this context. We roughly distinguish three communication levels:
In all cases, servers only communicate via message passing. They do not share storage or computing resources. The architecture is said to be “shared-nothing”.
Example 14.1.1 At the beginning of 2010, a typical Google data center consists of 100-200 racks, each hosting about 40 servers. The number of servers in such a center is roughly estimated around 5,000. The number of data centers is constantly evolving, and the total number of servers is probably already above one million.
Peer-to-Peer Network (P2P). A P2P network is a particular kind of overlay network, a graph structure built over a native physical network. The physical network we consider here is the Internet. Nodes, or “peers”, communicate with messages sent over the Internet. The route that connects two peers on the Internet is typically intricate. Typically (Figure 14.2), a message sent by peer A first reaches a local router, which forwards the message to other routers (local, regional, or world-wide) until it is delivered to peer B. By abstracting this complexity, a P2P network assumes a direct link between A and B, as if they were directly connected, as soon as they know each other's IP addresses. This pseudo-direct connection, which may (physically) consist of 10 or more forwarding hops, is called an overlay link, hence the term overlay network.
Example 14.1.2 If you are connected to the Internet, you can use the traceroute utility program to inspect the routers involved in the connection between your computer and a site of your choice. For instance: traceroute Webdam.inria.fr gives the list of routers on the forwarding Internet path to the Webdam INRIA Web site. Several sites propose a traceroute interface if you do not have access to a console. One can find some, e.g., at traceroute.org.
Next, it is often convenient to distinguish server nodes from client nodes. A server node provides, through cooperation with other server nodes, a service of the distributed system. A client node consumes this service. Nothing prevents a client node from running on the same computer as a server node (this is typically the case in P2P networks), but the point is most often irrelevant to the discussion. In practice, a client node is often a library incorporated in a larger application, which implements the communication protocol with the server nodes. When no ambiguity arises, we will simply use “Client” and “Server” to denote respectively a client node and a server node it communicates with.
Nodes exchange messages following a particular protocol. The Ethernet protocol is the most widely used. It splits messages into small packets of, typically, 1,500 bytes each. At the time of writing, the data transfer rate of a local Ethernet network can (theoretically) reach 1 Gigabyte/s. This bandwidth is higher than the maximal disk transfer rate, which is at most 100 MB/s. Roughly speaking, it is one order of magnitude faster to exchange in-memory data between two computers connected by a high-speed LAN than for a single computer to read the same data from its disk. However, bandwidth is a resource that many participants compete for, which calls for using it with care in data-intensive applications. The latency (time to initiate an operation) is also lower on the network than on disk, although the gap is less impressive.
Internet figures for latency and bandwidth vary widely, as they depend both on the distance between the communicating nodes and on the network devices involved, particularly at the local ends. (For instance, a WiFi connection in an Internet café is a nightmare for data-intensive manipulations!) As an illustration, the latency of a connection between INRIA Paris and Stanford University is less than 200 ms, and the bandwidth is 7 MB/s (download) and 3 MB/s (upload). You are encouraged to test these values on your own infrastructure, with the ping command or one of the numerous Web sites that measure them. For instance, see
http://www.pcpitstop.com/internet/Bandwidth.asp.
The latency of the average Internet path is estimated at 10 ms. The performance of the Internet is definitely at least one order of magnitude worse than that of LANs. Table 14.1 summarizes the values that should be kept in mind.
Type | Latency | Bandwidth |
Disk | ≈ 5 × 10^-3 s (5 ms) | At best 100 MB/s |
LAN | ≈ 1-2 × 10^-3 s (1-2 ms) | ≈ 1 GB/s (single rack); ≈ 100 MB/s (switched) |
Internet | Highly variable; typically 10-100 ms | Highly variable; typically a few MB/s |
The following reasoning helps understand the advantage of distributed storage (see Figure 14.3 for an illustration):
This is a good basis to discuss some important aspects of data distribution. Note first that we assume that the maximal transfer rate is achieved for each disk. This is only true for sequential reads, and can only be obtained for operations that fully scan a data set. As a result, the seek time (time to position the head on the appropriate disk track) is negligible compared with the transfer time. Therefore the previous analysis mostly holds for batch operations that access the whole collection, and is particularly relevant for applications where most files are written once (by appending new content), then read many times. This scenario differs from the classical behavior of a centralized database.
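A quick back-of-envelope calculation, using the best-case disk transfer rate of Table 14.1, illustrates the benefit of scanning a collection in parallel; the 1 TB data size and the 100-server cluster are arbitrary choices for illustration.

```python
# Time to fully scan a 1 TB collection, assuming the best-case sequential
# transfer rate of Table 14.1 (100 MB/s per disk); the sizes are illustrative.
DISK_RATE = 100 * 10**6            # bytes per second, per disk
DATA_SIZE = 10**12                 # 1 TB

single = DATA_SIZE / DISK_RATE               # one server scans everything
parallel = DATA_SIZE / (100 * DISK_RATE)     # 100 servers, each scanning 1/100 of the data

print(f"1 server   : {single:7.0f} s (about {single / 3600:.1f} hours)")
print(f"100 servers: {parallel:7.0f} s (under two minutes)")
```

This ideal speed-up of course assumes that each server reads only its local share of the collection, which is precisely what the data locality principle discussed below advocates.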
Now consider in contrast a workload consisting of lots of operations, each one randomly accessing a small piece of data in a large collection. (Such an operation is more in the spirit of a database operation where a row in a large table is accessed.) The access may be a read or a write operation. In both cases, we have to perform a random access to a large file and seek time cannot be ignored. Distribution is here of little help to speed up a single operation. However, if we can afford to replicate the data on many servers, this is an opportunity to balance the query load by distributing evenly read and/or write requests. Architectures for such transactional scenarios can actually be classified by their read/write distribution policy: distributing writes raises concurrency issues; distributing reads raises consistency issues. We further develop this important point in the following.
Finally, look again at Figure 14.3. The distribution mechanism shows two possible data flows. The first one comes from the disk to the local CPU, the second one (with dotted arrows) represents exchanges between computers. The performance of network exchanges depends both on the latency and on the network bandwidth. As said above, the typical transfer rate is 100 MB/s and can reach 1 GB/s, one order of magnitude higher than disks, but bandwidth is a shared resource that must be exploited with care.
A general principle, known as the data locality principle, states that a data set stored on a disk should be processed by a task of the local CPU. The data locality principle is valid for data intensive applications. The architecture adopted in such cases is different from that of High Performance Computing or Grid Computing, which distribute a task across a set of CPUs that share a common file system. This works as long as the task is CPU intensive, but becomes unsuitable if large data exchanges are involved.
To summarize:
Most of the properties required from a distributed system depend on the replication of data. Without replication, the loss of a server hosting a unique copy of some data item results in unrecoverable damages. As already said, replication also brings other advantages, including the ability to distribute read/write operations for improved scalability. However, it raises the following intricate issues:
Data replication complicates the management of consistency in a distributed setting. We illustrate next four typical replication protocols that show the interplay between performance considerations and consistency issues (Figure 14.4). The scenario assumes two concurrent Client applications, A and B, that put/read a data item d which is replicated on two servers S1 and S2. The four cases depicted in Figure 14.4 correspond to the possible combinations of two technical choices: eager (synchronous) versus lazy (asynchronous) replication, and primary versus distributed versioning:
From an application point of view, such a design offers some nice properties. Because the replication is managed synchronously, a read request sent by Client 2 always accesses a consistent state of d, whether it reads from S1 or S2. And because there is a primary copy, requests sent by several clients relating to the same item can be queued, which ensures that updates are applied sequentially and not in parallel. The obvious downside is that these applications have to wait for the completion of other clients’ requests, both for writing and reading.
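The following sketch (the Server and PrimaryCopy classes and their methods are hypothetical) illustrates this eager, primary-copy design: the Client's put() returns only once the primary and its replica have both applied the write, so a subsequent read on either server sees the same state of d.

```python
# Minimal sketch of eager (synchronous), primary-copy replication.
# The Server and PrimaryCopy classes are hypothetical; the point is that
# the Client's put() returns only once every replica has applied the write.

class Server:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value
        return "ack"

class PrimaryCopy:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def put(self, key, value):
        self.primary.apply(key, value)                        # 1. write on the primary copy
        acks = [r.apply(key, value) for r in self.replicas]   # 2. synchronous propagation
        return all(a == "ack" for a in acks)                  # 3. acknowledge the Client

s1, s2 = Server("S1"), Server("S2")
store = PrimaryCopy(primary=s1, replicas=[s2])
store.put("d", "v1")
assert s1.store["d"] == s2.store["d"] == "v1"   # a read on either server is consistent
```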
Inconsistencies sometimes entailed by asynchronous protocols never occur in centralized database systems whose transactional model guarantees ACID properties. This may however be the preferred choice of distributed systems that favor efficiency and adopt a more permissive consistency model. A pending issue in that case is the management of inconsistent versions, a process often called data reconciliation. What happens when the system detects that two replicas have been independently modified? It turns out that the answer is, in most cases, quite practical: data reconciliation is seen as application-dependent. The system is brought back to a consistent state, possibly by promoting one of the versions as the “current” one, and notifying the Client applications that a conflict occurred. Readers familiar with version control systems like CVS or Subversion will recognize the optimistic, lock-free mechanism adopted by these tools.
It has been argued that a distributed system cannot simultaneously satisfy consistency and availability while being tolerant to failures (“the CAP theorem”, discussed further). Therefore a system designer has to choose which property should be (at least partly) sacrificed. This often leads to giving up strong consistency requirements, in favor of availability and fault tolerance.
In summary, data replication leads to distinguishing several consistency levels, namely:
Existing database systems are often seen as too heavy and rigid in distributed systems that give up strong consistency to achieve better performance. This idea that the strong consistency requirements imposed by RDBMSs are incompatible with distributed data management is one of the founding principles of the “NoSQL” trend.
In a centralized system, if a program fails for any reason, the simple (and, actually, standard) solution is to abort and then restart its transactions. On the other hand, the chances that a single machine fails are low. Things are quite different in the case of a distributed system with thousands of computers. Failure becomes a possibly frequent situation, due to program bugs, human errors, hardware or network problems, etc. For small tasks, it is just simpler to restart them. But for long-lasting distributed tasks, restarting them is often not an acceptable option in such settings, since errors typically occur too often. Moreover, in most cases, a failure affects a minor part of the task, which can be quickly completed provided that the system knows how to cope with faulty components.
Some common principles are shared by all distributed systems that aim to be resilient to failures. One of the most important is independence: the task handled by an individual node should be independent from the other components. This allows recovery from a failure by only considering the node's initial state, without having to take into account complex relationships or synchronization with other tasks. Independence is best achieved in shared-nothing architectures, where both the CPU and the local disk of a server run in isolation from the components of the other servers.
Thanks to the replication methods examined earlier, a failure can usually be recovered from by replacing the faulty node with a mirror. The critical question in this context is to detect that the system met a failure. Why, for instance, is a Client unable to communicate with a server? This may be because of a failure of the server itself, or because the communication network suffers from a transient problem. The Client can wait for the failed node to come back, but this runs against availability, since the application becomes idle for an unpredictable period of time.
Figure 14.5 recalls the main aspects of data recovery in a centralized data management system, and its extension to distributed settings.
Consider first a client-server application with a single server node (left part). (1) The Client issues a write(a). The server does not immediately write a in its repository: because this involves a random access, it would be very inefficient to do so. Instead, it puts a in its volatile memory. Now, if the system crashes or if the memory is corrupted in any way, the write is lost. Therefore, the server also writes to a log file (2). A log is a sequential file which supports very fast append operations. When the log manager confirms that the data is indeed on persistent storage (3), the server can send back an acknowledgment to the Client (4). Eventually, the main memory data will be flushed to the repository (5).
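A minimal sketch of this write path could look as follows; the class and file names are hypothetical, and the essential point is that the acknowledgment is sent only after the log record has reached persistent storage.

```python
# Sketch of the centralized write path of Figure 14.5; the class and file
# names are hypothetical. The acknowledgment is sent only once the log
# record has reached persistent storage.
import json, os

class Server:
    def __init__(self, log_path="server.log"):
        self.memory = {}                       # volatile buffer
        self.log = open(log_path, "a")         # sequential, append-only log file

    def write(self, key, value):
        self.memory[key] = value               # (1) keep the update in volatile memory
        record = json.dumps({"op": "write", "key": key, "value": value})
        self.log.write(record + "\n")          # (2) append the record to the log
        self.log.flush()
        os.fsync(self.log.fileno())            # (3) force the record to persistent storage
        return "ack"                           # (4) now safe to acknowledge the Client

    def checkpoint(self, repository):
        repository.update(self.memory)         # (5) eventually flush memory to the repository
        self.memory.clear()
```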
This is the standard recovery protocol, implemented in centralized DBMSs. In a distributed setting, the server must log a write operation not only to the local log file, but also to one, two or more remote logs. The issue is close to that of replication methods, the main choice being between a synchronous and an asynchronous protocol.
The multi-log recovery process, synchronous or asynchronous, has a cost, but it brings availability (and reliability). If the server dies, its volatile memory vanishes and its local log cannot be used for a while. However, the closest mirror can be chosen. It reads from its own log a state equivalent to that of the dead server, and can begin to answer Clients’ requests. This is the standard REDO protocol, described in detail in any classical textbook on centralized databases. We do not elaborate further here.
A transaction is a sequence of data update operations that is required to be an “all-or-nothing” unit of work. That is, when a commit is requested, the system has to perform all the updates in the transaction; we say the transaction has been validated. In case of problem, the system also has the option of performing none of them; we say the transaction has been aborted. On the other hand, the system is not allowed to perform some of the updates and not others, i.e., partial validation is forbidden.
In a distributed setting, the update operations may occur on distinct servers {S1,…,Sn}, called participants. A typical case is the eager replication explained earlier. The problem is to find a protocol, implemented and controlled by a distinct node called coordinator, that communicates with the participants so that the all-or-nothing semantics is ensured. The main algorithm that is used to achieve this goal is the two-phase commit (2PC) protocol:
Assume for the time being that everything proceeds gracefully, without node failure or network communication problem. In this ideal scenario, a distributed data update transaction conforms to the following workflow (Figure 14.6, with focus on S1). Initially (1) the Coordinator sends a Prepare message. Each participant then takes appropriate measures to guarantee that it will be able to fulfill its task in the second phase. Typically, updated data stored in volatile memory is written on a safe temporary persistent storage to prevent a loss due to a system crash (2). The participant can then send a confirmation message to the Coordinator (3), either confirm if the participant is ready to commit, or refuse.
The second phase begins when the Coordinator has received all the answers from the participating nodes. It then sends a Decision message (4), which can either be commit or abort. The rule is that if at least one participant refused its part of the transaction, the whole operation must be aborted. If all confirm their readiness, the Coordinator can send a commit (although it is not compelled to do so: an abort is also acceptable).
In case of commit, each participant copies the data from the temporary area to the main repository (5); otherwise it simply removes the temporary storage associated with the ongoing transaction. An acknowledgment of success is required for this second round, so that the Coordinator can close the transaction.
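Putting the two phases together, a (much simplified) Coordinator could be sketched as follows; the participant interface (prepare()/finish()) and the log object are hypothetical, and failure handling is deliberately omitted.

```python
# Much simplified two-phase commit coordinator: no failures, and the
# participant interface (prepare()/finish()) and the log are hypothetical.
def two_phase_commit(coordinator_log, participants, transaction):
    # Phase 1: ask every participant to prepare (move its updates to safe
    # temporary storage) and collect the votes ("confirm" or "refuse").
    votes = [p.prepare(transaction) for p in participants]

    # The decision is logged before being sent, so that a restarted
    # Coordinator can resume the protocol (see the discussion below).
    decision = "commit" if all(v == "confirm" for v in votes) else "abort"
    coordinator_log.append((transaction, decision))

    # Phase 2: broadcast the decision and wait for all acknowledgments
    # before closing the transaction.
    acks = [p.finish(transaction, decision) for p in participants]
    assert all(acks)
    return decision
```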
Now, the question is: what if a failure occurs somewhere? We can distinguish between network communication problems and node failures. In addition, we have to examine separately the role of the Coordinator and that of the participants. We start with the latter and examine the appropriate recovery action, depending on when the failure occurs.
Technically speaking, such a distributed protocol must preserve some vital information regardless of the failures that affect a node. For instance, a Participant that fails in the prepared state must be able to figure out, when it recovers, whether the Coordinator sent its decision. It could, for instance, contact the Coordinator to learn the current state. In the commit state, a failure may occur while the Participant is proceeding with the validation of the transaction. After restart, the validation needs to be re-executed, which implies that it is implemented as an idempotent operation (a property common to all recovery mechanisms).
We now turn our attention to the Coordinator. A first remark is that it must implement the necessary actions to preserve its ability to monitor the distributed transaction, even if it fails. For instance, before sending the Decision message, the commit or abort choice must be logged in a safe (persistent) area. Indeed, if the Coordinator fails after sending its decision, it would restart in an undefined status if this information could not be recovered.
In general, if the Coordinator fails, the distributed process may be in an intermediate state which can only be solved if the Coordinator restarts and is able to resume properly the transaction. If, for instance, failure occurs when the Coordinator is sending prepare messages, some Participants may be informed of the transaction request, while others may not. On restart, the Coordinator should look for pending transactions in its log and re-send the message.
The same approach holds for dealing with Coordinator failure in the various steps of the protocol. The main problem is that the Coordinator may fail permanently, or suffer from network communication problems that leave the process pending for an unbounded time period. Meanwhile, the Participants are blocked and maintain their locks on the resources allocated to the transaction.
Note that a Participant cannot decide independently to commit or abort. (We could imagine for instance a timeout mechanism that triggers an abort if the Participant is left in a prepared state without receiving the decision.) Indeed, it may be the case that the Coordinator sent a commit decision that reached all the participants save one. Aborting this part of the transaction would break the all-or-nothing requirement. Several techniques have been proposed to overcome the blocking nature of the 2PC protocol, including communication among the Participants themselves. We invite the reader to consult the last section of the chapter for references.
The 2PC protocol is a good illustration of the difficulty of coordinating the execution of several related processes, in particular in case of failures. Applications that need to execute distributed transactions enter a mechanism where nodes become dependent on one another, and this makes the whole data management much more intricate. Moreover, the mechanism tends to block the operations of other applications and therefore restricts the global throughput of the system. In general, solutions implemented by organizations dealing with Web-scale data tend to adopt a non-transactional approach, or at least consistency rules less strict than the standard semantics.
There is a long list of “-ities” that characterize the good properties of distributed systems: reliability, scalability, availability, etc. We briefly review those of particular interest to the scope of this book. The end of the section proposes a discussion on the ability of distributed systems to simultaneously maintain these good properties.
Reliability denotes the ability of a distributed system to deliver its services even when one or several of its software or hardware components fail. It definitely constitutes one of the main expected advantages of a distributed solution, based on the assumption that a participating machine affected by a failure can always be replaced by another one, and not prevent the completion of a requested task. For instance, a common requirement of large e-commerce Web sites is that a user transaction should never be canceled because of a failure of the particular machine that is running that transaction. An immediate and obvious consequence is that reliability relies on redundancy of both the software components and the data. In the limit, should an entire data center be destroyed by an earthquake, it should be replaced by another one that holds a replica of the users’ shopping carts. Clearly, this has a cost and, depending on the application, one may more or less fully achieve such resilience for services, by eliminating every single point of failure.
The concept of scalability refers to the ability of a system to continuously evolve in order to support a growing amount of work. In our setting, a system may have to scale because of an increase of data volume, or because of an increase of work, e.g., number of transactions. We would like to achieve this scaling without performance loss. We will favor here horizontal scalability, achieved by adding new servers, but one can also consider vertical scalability, obtained by adding more resources to a single server.
To illustrate these options, suppose we have distributed the workload of an application over 100 servers, in a somewhat perfect and abstract manner, each server holding 1/100 of the data and serving 1/100 of the queries. Now, if we get 20% more data, or 20% more queries, we can simply add 20 new servers. This is horizontal scalability, which is virtually limitless for highly parallelizable applications. We could also add extra disk/memory to the 100 servers (to handle the increase in data), and add extra memory or switch to faster processors (to handle the increase in queries). This is vertical scalability, which typically reaches the limits of the machine rather fast.
In parallel computing, one further distinguishes weak scalability from strong scalability (see Figure 14.7). The former analyzes how the time to obtain a solution varies with the processor count for a fixed data set size per processor. In the perfect case, this time remains constant (per processor), indicating the ability of the system to maintain a perfect balance. Strong scalability refers to the global throughput of a system, for a fixed total data set size. If the throughput rises linearly as new servers are added, the system does not suffer from an overhead due to the management tasks associated with a distributed job. (Note that the above discussion assumes a linear complexity of the system behavior, which holds at least for basic read/write/search operations.)
It is actually a common situation that the performance of a system, although designed (or claimed) to be scalable, declines with the system size, due to the management or environment cost. For instance network exchanges may become slower because machines tend to be far apart from one another. More generally, it may happen that some tasks are not distributed, either because of their inherent atomic nature or because of some flaw in the system design. At some point, these tasks (if any) limit the speed-up obtained by distribution (a phenomenon known as Amdahl’s law in the related context of parallel computing).
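Recall that Amdahl's law quantifies this limit: if a fraction f of the work cannot be distributed, the speed-up obtainable with n servers is bounded by

\[
S(n) \;=\; \frac{1}{f + \frac{1-f}{n}} \;\le\; \frac{1}{f},
\]

so that even a modest non-distributed part (say f = 5%) caps the speed-up at 20, regardless of the number of servers added.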
A scalable architecture avoids this situation and attempts to balance the load evenly over all the participating nodes. Consider the simple case of a server that carries out 10% more work than the others, due to some special role. This is a source of non-scalability. For small workloads, such a difference is unnoticeable, but it will eventually grow to the point where the “stressed” node becomes a bottleneck. However, a node dedicated to administrative tasks whose load is negligible, or does not increase in proportion to the global workload, is acceptable.
Many architectures presented in the rest of this chapter are of the type “one Master – many Servers”. The Master is a node that handles a few specific tasks (e.g., adding a new server to the cluster or connecting a client) but does not participate in the core functionalities of the application. The servers hold the data set, either via full replication (each item is present on each server) or, more commonly, via “sharding”: the data set is partitioned and each subset is stored on one server and replicated on a few others. This Master-Server approach is easier to manage than a cluster where all nodes play an equivalent role, and often remains valid in the long run.
A task that is partially allocated to a server may become idle if the server crashes or turns out to be unavailable for any reason. In the worst case, it is delayed until the problem is fixed or the faulty server is replaced by a replica. Availability is the capacity of a system to limit this latency as much as possible (note that this implicitly assumes that the system is already reliable: failures can be detected and repair actions initiated). This involves two different mechanisms: the failure (crash, unavailability, etc.) must be detected as soon as possible, and a quick recovery procedure must be initiated. The process of setting up a protection system to quickly detect and fix node failures is usually termed failover.
The first mechanism is handled by periodically monitoring the status of each server (“heartbeat”). It is typically assigned to the node dedicated to administrative tasks (the “master”). Implementing this mechanism in a fully distributed way is more difficult due to the absence of a well-identified manager. Structured P2P networks promote one of the nodes as “super-peer” in order to take charge of this kind of background monitoring. Note that some P2P approaches assume that a node will kindly inform its companions when it needs to leave the network, an assumption (sometimes called “fail-stop”) that facilitates the design. This may be possible for some kinds of failures, but is unrealistic in many cases, e.g., for hardware errors.
The second mechanism is achieved through replication (each piece of data is stored on several servers) and redundancy (there should be more than one connection between servers, for instance). Providing failure management at the infrastructure level is not sufficient. As seen above, a service that runs in such an environment must also adopt suitable recovery techniques to preserve the content of its volatile storage.
How do we estimate the efficiency of a distributed system? Assume an operation that runs in a distributed manner, and delivers a set of items as result. Two usual measures of its efficiency are the response time (or latency), which denotes the delay to obtain the first item, and the throughput (or bandwidth), which denotes the number of items delivered in a given time unit (e.g., a second). These measures are useful to qualify the practical behavior of a system at an analytical level, expressed as a function of the network traffic. The two measures correspond to the following unit costs:
The complexity of operations supported by distributed data structures (e.g., searching for a specific key in a distributed index) can be characterized as a function of one of these cost units.
Generally speaking, the analysis of a distributed structure in terms of number of messages is over-simplistic. It ignores the impact of many aspects, including the network topology, the network load and its variation, the possible heterogeneity of the software and hardware components involved in data processing and routing, etc. However, developing a precise cost model that would accurately take into account all these performance factors is a difficult task, and we have to live with rough but robust estimates of the system behavior.
We now come to the question of building systems that simultaneously satisfy all the properties expected from a large-scale distributed system. It should scale to an unbounded number of transactions on unlimited data repositories, always be available with high efficiency (say, a few milliseconds to serve each user’s request) and provide strong consistency guarantees.
In a keynote speech given in 2000 at the Symposium on Principles of Distributed Computing, Eric Brewer proposed the following conjecture: no distributed system can simultaneously provide all three of the following properties: Consistency (all nodes see the same data at the same time), Availability (node failures do not prevent survivors from continuing to operate), and Partition tolerance (the system continues to operate despite arbitrary message loss). This conjecture, formalized and proved two years later, is now known as the CAP theorem, and strongly influences the design of Web-scale distributed systems.
The problem can be simply explained with a figure (Fig. 14.8). Assume two applications A and B running on two distinct servers S1 and S2. A executes writes to a repository, whereas B reads from a replicated version of the repository. The synchronization is obtained by replication messages sent from S1 to S2.
When the Client application sends a put(d) to update a piece of data d, A receives the request and writes in its local repository; S1 then sends the replication message that replaces d′, the older replica, with d, and a subsequent read(d) sent by the Client retrieves from S2 the updated version. So, the system seems consistent.
Now, assume a failure in the system that entails a loss of messages. If we want the system to be fault-tolerant, it continues to run, and the replica is out of date: the Client receives an old version of its data (inconsistency). If S1 synchronizes the write operation and the replication message M as an atomic transaction, this goes against availability, because waiting for the acknowledgment of S2 may take an unpredictable amount of time.
The CAP theorem essentially says that there is a trade-off between availability and consistency (partition tolerance is something we have to deal with anyway) in large-scale distributed systems. In an “eventual consistency” model, the replication message is asynchronous, but S1 resends the message until it receives an acknowledgment; eventually, the replica on S2 is known to be consistent with S1. Meanwhile, the Client may have to deal with an inconsistent state. In concrete terms, if you remove an item from your basket, it possibly re-appears later in the transaction! Obviously, this is a better choice for the e-commerce site than a user who gives up her transaction due to high system latency.
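A small sketch of this lazy scheme (all names, and the lossy channel, are hypothetical) shows both properties at once: the write is acknowledged immediately, and the replica catches up only once the replication message finally gets through.

```python
# Sketch of lazy (asynchronous) replication with retries ("eventual
# consistency"). All names, and the lossy channel, are hypothetical.
import random

def unreliable_send(replica, key, value):
    """Channel that loses the message (or its acknowledgment) half of the time."""
    if random.random() < 0.5:
        replica[key] = value
        return True
    return False

def put(primary, pending, key, value):
    primary[key] = value            # the Client is acknowledged immediately
    pending.append((key, value))    # replication happens in the background

def replication_loop(replica, pending):
    # Re-send each pending update until it is acknowledged; meanwhile,
    # a read on the replica may return a stale value.
    while pending:
        key, value = pending[0]
        if unreliable_send(replica, key, value):
            pending.pop(0)

primary, replica, pending = {}, {}, []
put(primary, pending, "basket", ["book"])
replication_loop(replica, pending)      # eventually, the replica catches up
assert replica == primary
```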
The CAP theorem gave rise to debates regarding its exact definition and consequences. We already noted that the partition tolerance property is not symmetric to the other ones, since we do not really have the choice to give it up. This leaves two possible combinations: CP (consistent and partition tolerant) and AP (available and partition tolerant). Moreover, the concept of availability (a transaction always terminates) ignores the efficiency aspect (how long does it take?) which is an important factor. Still, the theorem points out that consistency and availability are central, and somehow incompatible, issues in the design of distributed systems, and that a clear trade-off should be made explicit.
A peer-to-peer network is a large network of nodes, called peers, that agree to cooperate in order to achieve a particular task. A P2P system is a distributed system, and as such it shares a lot of features with the settings previously presented.
What makes P2P systems particular with respect to the cluster systems examined so far is their very loose and flexible (not to say unstable) organization. Peers often consist of personal computers connected to the network (e.g., the Internet) and participating in a specific task. The rationale behind the emergence of P2P is the huge amount of CPU, memory, disk, and network resources available on the Web. One would like to use these existing resources to support heavy applications at close to zero hardware cost. Furthermore, this approach allows achieving high scalability through massive distribution and parallel computation.
A second particularity is that a peer plays simultaneously the role of a client (of other peers) and of a server (to other peers). This is in fact not such a strong specificity, if we recall that “Client” and “Server” actually denote processes hosted on possibly the same computer. Nothing in a distributed architecture prevents the same machine from running several processes, possibly acting as client and server for one another. In P2P systems, however, this situation becomes the rule. A canonical application is file sharing: a Client (node) gets a file from another (Server) node, and the file, once stored on the Client’s disk, becomes available to other peers (so the former Client indeed becomes a Server). In theory, this leads to high availability, reliability (due to large replication) and adequate load balancing.
P2P systems raise many problems, though, even if we set aside the somewhat illegal nature of their most popular applications. First, the behavior of each peer is fully autonomous: a peer owns its computing power and storage resources and can independently choose to allocate these resources to a particular task. A peer can also join or leave the system at will (as mentioned above, the fail-stop hypothesis hardly holds in practice). Second, P2P networks connect nodes via a possibly slow communication channel (usually, the Internet), which may bring a quite high communication overhead compared to a cluster of machines on a very high-speed local network (see Table 14.1, page 629). Finally, the lack of control on the infrastructure makes P2P networks ill-adapted to rapidly changing data and high quality of service, and in particular unsuited to transactional tasks.
Peers in a P2P network refer to each other by their IP addresses, forming a structure over the Internet called an overlay network (i.e., a graph laid over a physical infrastructure). A peer p in this structure is connected to a few other peers (often called its “friends”), which are its primary (and possibly unique) way to communicate with the rest of the system. P2P systems mostly differ in the topology of their overlay network, which dictates how dictionary operations (insert, search, update) can be implemented. We should nevertheless be aware that even if two peers p1 and p2 are friends in the overlay, a message sent from p1 to p2 must actually follow a physical route in the underlying Internet graph, with possibly many hops. So, things may not work as nicely as expected when only considering the overlay topology.
A general (although not very efficient) search technique is flooding: a peer p disseminates its request to all its friends, which in turn flood their own friends distinct from p, and so on until the target of the request (e.g., a peer holding the requested music file) is reached. A P2P system that only supports flooding is called an unstructured P2P network. The approach is simple and works as follows. A peer only needs to know some friends to join a network; from them, it can discover new friends. Queries are then supported by flooding, typically limited by a “time to live” (TTL) bound: the TTL limits the number of times a particular query is forwarded before it is discarded, to avoid spending too many resources on a single query. Unstructured P2P networks are not very efficient. They are in particular inherently unstable: because the peers in the community are autonomous and selfish, one often observes a very high rate of peers entering and leaving the system (one speaks of high churn). As a consequence, it is difficult to guarantee that a node stays connected to the system, or that the overall topology remains consistent.
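The following sketch (the overlay graph and its content are hypothetical toy data) illustrates TTL-bounded flooding: a query is forwarded to friends of friends, and dropped once its hop budget is exhausted.

```python
# TTL-bounded flooding in an unstructured overlay; the overlay graph and
# its content are hypothetical toy data.
def flood(overlay, items, start, target, ttl):
    frontier, visited = [(start, ttl)], {start}
    while frontier:
        peer, budget = frontier.pop(0)
        if target in items[peer]:
            return peer                        # a peer holding the item was found
        if budget == 0:
            continue                           # hop budget exhausted: drop the query
        for friend in overlay[peer]:
            if friend not in visited:
                visited.add(friend)
                frontier.append((friend, budget - 1))
    return None                                # not reachable within ttl hops

overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
items = {"A": set(), "B": set(), "C": set(), "D": {"song.mp3"}}
print(flood(overlay, items, start="A", target="song.mp3", ttl=2))   # -> D
```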
More structured ways of organizing the network (“structured P2P networks”) have been designed to avoid the blind and uncontrolled nature of the flooding mechanism, among which Distributed Hash Tables (DHTs) are probably the most popular. Joining the network becomes more involved, but performance and stability are improved. We will consider DHTs in Chapter 15.
To conclude this introductory part, we study a simple distributed service: a file system that serves very large data files (hundreds of Gigabytes or Terabytes). The architecture presented here is a slightly simplified description of the Google File System and of several of its descendants, including the HADOOP Distributed File System (HDFS), available as an open-source project. The technical environment is that of a high-speed local network connecting a cluster of servers. The file system is designed to satisfy some specific requirements: (i) we need to handle very large collections of unstructured to semi-structured documents, (ii) data collections are written once and read many times, and (iii) the infrastructure that supports these components consists of thousands of connected machines, with high failure probability. These particularities make common distributed system tools only partially appropriate.
Why would we need a specialized architecture for distributing large files (a DFS) in the first place? The answer is summarized by Figure 14.9, which shows, on the left side, a standard solution to share files among computers (a widespread implementation of this solution is NFS, the Network File System, in the Unix world). Assume that server A needs to access the files located in the directory dirC on server B. The DFS allows dirC to be “mounted” in the local file system as, say, a subdirectory of dirB. From the user’s point of view, this is transparent: s/he can navigate the files stored in /dirA/dirB/dirC just as if they were fully located on the local computer. The network calls that maintain dirC as part of Server A’s namespace are handled by the DFS.
Modern distributed file systems like NFS care about reliability and availability, and provide for instance mechanisms to replicate files and handle node failures. In the context of large-scale data-intensive applications, this solution is nevertheless not convenient because it breaks several of the principles mentioned so far, and does not satisfy some of the expected properties. The main broken principle is data locality: a process running on Server A in charge of manipulating data stored on Server B will heavily solicit the network bandwidth. Regarding the properties, one notes that the approach is hardly scalable. If we store 10% of our data set in file1 and 90% in file2, Server B will serve (assuming a uniform access pattern) 90% of the Client requests. One could carefully monitor the size and location of files to explicitly control load balancing, but this would forfeit the benefits of a transparent file system namespace.
An NFS-like system is not natively designed to meet the specific requirements of a large-scale repository. The right part of Figure 14.9 shows a different approach which explicitly addresses the challenge of very large files. Essentially, the difference lies in the fact that a file is no longer the storage unit, but is further decomposed into “chunks” of equal size, each allocated by the DFS to one of the participating nodes (of course, this works best for systems consisting of large files).
There exists a global file system namespace, shared by all the nodes in the cluster. It defines a hierarchy of directories and files which is “virtual”, as it does not affect in any way the physical location of its components. Instead, the DFS maps the files, in a distributed manner, to the cluster nodes viewed as blind data repositories. File file1 in the right part of Figure 14.9 is for instance split into three chunks. Each chunk is duplicated and the two copies are assigned to two distinct nodes.
Because the DFS splits a file into equal-size chunks and evenly distributes these chunks, a fair balancing is natively achieved. Reliability is obtained by replication of chunks, and availability can be implemented by a standard monitoring process.
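The following sketch illustrates the chunking principle: a file is cut into fixed-size chunks (64 MB, the value used by GFS) and each chunk is assigned to several distinct servers. The round-robin placement policy shown here is a hypothetical choice for illustration.

```python
# Chunking and replica placement, in the spirit of the right-hand side of
# Figure 14.9. The round-robin placement policy is a hypothetical choice.
import itertools

CHUNK_SIZE = 64 * 2**20        # 64 MB, the chunk size used by GFS

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return the (offset, length) chunks covering a file of the given size."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]

def place_chunks(chunks, servers, replication=2):
    """Assign each chunk to `replication` distinct servers, round-robin."""
    ring = itertools.cycle(servers)
    return {index: [next(ring) for _ in range(replication)]
            for index in range(len(chunks))}

chunks = split_into_chunks(200 * 2**20)           # a 200 MB file -> 4 chunks
print(place_chunks(chunks, ["S1", "S2", "S3"]))   # each chunk on two servers
```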
We now turn to the architecture of GFS, summarized on Figure 14.10. The distributed system consists of a Master node and many server nodes. The Master plays the role of a coordinator: it receives Client connections, and maintains the description of the global file system namespace and the allocation of file chunks. The Master also monitors the state of the system with “heartbeat” messages in order to detect any failure as early as possible. The role of the Servers is straightforward: they receive file chunks, and must take appropriate local measures to ensure the availability and reliability of their (local) storage.
A single-master architecture brings simplicity to the design of the system but gives rise to some concern for its scalability and reliability. The scalability concern is addressed by a Client cache, called Client image in the following. Let us examine in detail how the system handles a read() request, as illustrated on Figure 14.10 with dotted arrows:
The approach is typical of distributed structures, and will be met in several other distributed services further on. The Client cache avoids a systematic access to the Master, a feature that would make the structure non-scalable. By limiting the exchanges with the Master to messages that require metadata information, the coordination task is reduced and can be handled by a single computer.
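A sketch of the Client-side logic (the Master and chunk-server interfaces are hypothetical) makes this explicit: the Master is consulted only when a chunk location is not already in the Client image, and the data itself flows directly between the Client and the chunk server.

```python
# Client-side read path with a cache of chunk locations (the "Client image").
# The Master and chunk-server interfaces are hypothetical.
CHUNK_SIZE = 64 * 2**20      # 64 MB chunks, as in GFS

class DFSClient:
    def __init__(self, master):
        self.master = master
        self.image = {}      # (filename, chunk index) -> chunk server, cached locally

    def read(self, filename, offset):
        key = (filename, offset // CHUNK_SIZE)
        if key not in self.image:
            # Only metadata requests reach the Master; data never transits through it.
            self.image[key] = self.master.locate(*key)
        server = self.image[key]
        # The data transfer itself is a direct exchange with the chunk server.
        return server.read_chunk(filename, key[1], offset % CHUNK_SIZE)
```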
From the Client point of view1, the distributed file system appears just like a directory hierarchy equipped with the usual Unix navigation (cd, ls) and access (read, write) commands.
Observe again that the system works best for a relatively small number of very large files. GFS (and similar systems) expects typical files of several hundreds of MBs each, and accordingly sets the chunk size to 64 MB. This can be compared to the traditional block-based organization of centralized storage systems (e.g., databases), where the block size is a small multiple of the disk physical block (typically, 4 KB to 8 KB in a database).
The design of GFS is geared toward batch processing of very large collections. The architectural choices are in line with the expected application features. For instance, having large chunks limits the size of the internal structure maintained by the Master, and allows keeping it in memory. On the other hand, it appears clearly that using a GFS-like system for a lot of small files would be counter-productive. The Master would have to deal with many more references, which could no longer be held in memory, and each file would consist of a single chunk, with poor exploitation of the distribution leverage.
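A rough estimate illustrates the point; the 64 bytes of metadata per chunk reference is an arbitrary assumption made only for the sake of the computation.

```python
# Back-of-envelope: chunk references held by the Master for 1 PB of data.
# The 64 bytes of metadata per reference is an arbitrary assumption.
PB = 10**15
META_PER_CHUNK = 64    # bytes per chunk reference (assumption for illustration)

for label, unit in [("64 MB chunks", 64 * 2**20), ("8 KB blocks", 8 * 2**10)]:
    references = PB // unit
    metadata_gb = references * META_PER_CHUNK / 2**30
    print(f"{label:>13}: {references:>15,} references, ~{metadata_gb:,.0f} GB of metadata")
```

With 64 MB chunks the whole mapping easily fits in the Master's memory; with database-sized blocks it clearly would not.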
Failure is handled by standard replication and monitoring techniques. First, a chunk is not written on a single server but is replicated on at least two other servers. Having three copies of the same chunk is sufficient to cope with failures (the number of replicas can be chosen by the administrator to adapt to specific applications). The Master is aware of the existing replicas because each server that joins the cluster initially sends the list of chunks it is ready to serve.
Second, the Master is in charge of sending background heartbeat messages to each server. If a server does not answer a heartbeat message, the Master initiates a server replacement by asking one of the (at least two) remaining servers to copy to a new server the chunks that fell under their replication factor.
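The failover logic can be sketched as follows; the data structures and the ping()/copy_chunk() calls are hypothetical. A server that misses a heartbeat is declared dead, and every chunk it held is re-copied from a surviving replica to a fresh server.

```python
# Heartbeat-based failure detection and re-replication by the Master.
# The data structures and the ping()/copy_chunk() calls are hypothetical.
import time

def monitor(servers, placement, ping, period=10):
    """placement: chunk id -> list of servers currently holding a replica."""
    while True:
        for server in list(servers):
            if not ping(server):                        # missed heartbeat: declare dead
                servers.remove(server)
                for chunk, holders in placement.items():
                    if server in holders:
                        holders.remove(server)
                        source = holders[0]             # any surviving replica
                        target = next(s for s in servers if s not in holders)
                        source.copy_chunk(chunk, target)  # restore the replication factor
                        holders.append(target)
        time.sleep(period)
```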
The Master itself must be particularly protected because it holds the file namespace. A recovery mechanism is used for all the updates that affect the namespace structure, similar to that presented in Figure 14.5, page 642. We refer the reader to the original paper (see last section) for technical details on the management of aspects that fall beyond the scope of our limited presentation, in particular access rights and data consistency.
Distributed computing systems have constituted an intensive area of research for more than three decades, and the area has been boosted further by the success of the Internet and the profusion of distributed resources now available. A wide coverage of distributed systems issues and techniques is [29], with strong emphasis on the reliability of distributed applications. In a data management perspective, [131] is a general reference for distributed databases, while [78] specializes on transaction management. The notion of scalability is analyzed in [122]. Failure management is an essential topic in the context of large scale distributed systems. Beyond the general principles that can be found in the textbooks mentioned above, the reader is referred to [84] for an in-depth approach. Replication protocols are ubiquitous in distributed systems design, along with the related problems of consistency and availability: see [77], which clearly exposes the trade-offs involved in replication policies, as well as the recent survey [142].
Properties expected from a cluster-based system are analyzed in [68]. A counterpart to ACID properties, called BASE (Basically Available, Soft-state, Eventually consistent), is proposed there. The CAP theorem was conjectured during a keynote speech by Eric Brewer at PODC 2000, formalized in [72] (Figure 14.8 is a simplified illustration of the proof), and has been an active topic of discussion since then. The trade-off between consistency, availability and fault-tolerance has been investigated in several research works; see for instance [186]. Eventual consistency is the level adopted by many Web-scale systems, including eBay and Amazon [52]. See also the on-line text from Werner Vogels on consistency models2.
The necessity to adopt a trade-off between availability and consistency in large-scale distributed systems is often presented as a reason of the “NoSQL systems” emergence. Since relational DBMS are not designed to satisfy eventual consistency, storage systems specifically designed to operate at large-scale have been implemented by several Web companies. They often adopt a very simplified data model based on key-value pairs (hence the “key-value stores” term). See the http://nosql-database.org/ site for a list of these systems.
Peer-to-peer networks emerged at the beginning of the millennium and have since then been mostly targeted toward file sharing. P2P systems are accountable for a major part of the Internet traffic (about 50%, sometimes more in certain parts of the world). P2P opponents argue that P2P is mostly used to share illegal content. However, this ignores the fact that P2P also has a growing number of legitimate uses (such as SKYPE).
A P2P system is deemed resilient to failures (due to the large replication factor) and scalable. This is probably true for basic applications (e.g., file management), but building complex and efficient data management systems over a P2P architecture is quite challenging. A major problem is the lack of control on nodes that appear and vanish at will. A trend in P2P system design is therefore to distinguish several classes of participants, ranging from the basic clients whose contributions to the overall computing effort are quite unreliable, to “super-peers” that provide dependable services that help stabilize the system.
The Google File System is presented in [71]. The paper contains detailed material on its internal design and implementation. Consistency issues in particular are presented in depth, as well as replica management. Since its presentation, GFS has inspired several other systems, often distributed as open source. HADOOP3 is the most widely used free implementation of the GFS principles (it is, for instance, adopted and supported by Yahoo!). KOSMOS FILESYSTEM (now available as CLOUDSTORE4) is another derivation of GFS, implemented in C++. The technical documentation of these systems shows that they quite closely match the design of GFS. A consequence is that they also target the same application range: large, append-only data files mostly subject to batch analysis.
1We recall that “Client” here technically means a component integrated to the Client application and implementing the communication protocol with the system.
2http://www.allthingsdistributed.com/2008/12/eventually_consistent.html