Friday, May 07, 2010

A Matter of Scale: The CAP Theorem and Memory Models

In his keynote for the 2000 ACM PODC conference, Eric A. Brewer made a remarkable assertion. He said that given three desirable qualities of distributed systems, consistency, availability, and partition tolerance, you can at best have two out of three. This came to be known as Brewer's Conjecture.

You might be inclined to think Brewer knew what he was talking about. He was not only a professor of computer science at U. C. Berkeley, but also a co-founder of Inktomi, the company that produced the HotBot Internet search engine. For a time HotBot was the most popular search engine on the Internet, displacing AltaVista, until it was in turn overtaken by Google and Inktomi was acquired by Yahoo!

Even more remarkably, a couple of years later Seth Gilbert and Nancy Lynch at MIT formally proved Brewer's Conjecture for a couple of abstract (but highly applicable) models, leading to the CAP Theorem.

The CAP Theorem has been getting more press lately, thanks in part to Werner Vogels, the CTO of Amazon.com, writing about how that e-commerce giant gets around the fundamental architectural implications of the theorem for large distributed systems. So it's not a big surprise that I would have come across the CAP Theorem in my obsessively wide reading.

But although I've done my time in web services (Fun Facts to Know and Tell: Java Business Integration), high performance computing (Post Modern Deck Construction), and distributed systems (Traffic Management and System Scalability), lately I've been agonizing over the machine code produced by compilers and the bare metal on which it runs (Cache Coherency and Memory Models). So I was standing in the shower the other morning, listening to that little voice in my head, trying to figure out why I was so focused on the implications of the CAP Theorem when my current interests leaned more towards cache coherency, or the lack of it.

And it finally occurred to me: they're the same problem. It is simply a matter of scale.

Here's the gist of the CAP Theorem as I understand it.

Consistency means serialization. To have strong consistency, transactions run across distributed servers on shared data must look as if they had been serialized and run one at a time as atomic transactions, whether that's the case or not. This is what the database and transaction people are talking about when they cite the ACID properties: Atomicity, Consistency, Isolation, Durability.
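
To make that concrete at small scale, here is a minimal sketch in C using POSIX threads; the account structure and the transfer function are purely illustrative, not anything from the papers cited here. The mutex forces concurrent transfers to behave as if they had been run one at a time, which is the atomicity and isolation the ACID folks are asking for, at the cost of making every other thread wait its turn.

    #include <pthread.h>
    #include <stdbool.h>

    struct account {
        long balance;
    };

    /* One big lock serializes every transaction against the ledger. */
    static pthread_mutex_t ledger = PTHREAD_MUTEX_INITIALIZER;

    /* Move amount from one account to another as one serialized transaction. */
    bool transfer(struct account * from, struct account * to, long amount)
    {
        bool ok = false;
        pthread_mutex_lock(&ledger);
        if (from->balance >= amount) {
            from->balance -= amount; /* No other thread can observe */
            to->balance += amount;   /* the intermediate state.     */
            ok = true;
        }
        pthread_mutex_unlock(&ledger);
        return ok;
    }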

Availability means reliability. Every request gets a timely response, and no client is left hanging. It's this requirement for high availability that typically leads to distributed systems in the first place, since HA systems achieve it by eliminating single points of failure. What constitutes failure is relative in this context. Julian Browne has remarked that an extra tenth of a second in response time costs Amazon.com one percent in sales.

Partition-tolerance means the system deals with the occasional temporary network outage. This is a reality even on a LAN inside a centralized data center, but obviously more so in a geographically distributed system depending on a WAN. Failure here is relative depending on the application. Network partitioning can be seen as an increase in latency. For real-time applications, an increase in latency of milliseconds may constitute a network failure, while applications requiring only an eventual data reconciliation will be more tolerant.

Brewer remarked that there is no such thing as a 100% working system (something the high performance computing folks are grappling with as they try to build exascale systems from commodity parts), and no such thing as 100% fault tolerance. But hugely scalable cloud providers like Google, Amazon.com, and eBay get as close to 100% as they can by throwing a lot of parallel (to handle the workload) and redundant (to achieve high availability) hardware at the problem. Geographic location is also a single point of failure, not just because of natural disasters but because of infrastructure issues like bandwidth, latency, and power. So such systems tend to be geographically distributed as well. This makes them highly dependent on network connectivity to communicate with one another.

The theorem states that at most two of the three, (strong) consistency, (high) availability, and (network) partition-tolerance, can be achieved. Each distributed system designer has to make their own decision as to where their application lies in the CAP space, and what compromises they are going to make, based on their business needs. A distributed system may provide strong consistency and high availability but can only do so as long as the servers can talk to one another. Or it may tolerate network outages but become completely unavailable while waiting for the network to heal. Or it might relax its need for strong consistency while being highly available even in the face of temporary network outages.

In their paper on the CAP Theorem, Gilbert and Lynch use two different models to construct their proof. The asynchronous model has no temporal dimension: agents in it are simply state machines that care only about message order, not about when messages arrive. The partially synchronous model has a sense of time duration: agents in it can take action when an expected message doesn't arrive within a timeout interval. In the partially synchronous model, Gilbert and Lynch show that it is possible to achieve a useful compromise between consistency and availability in the face of network partitioning.
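
As a rough sketch of what an agent in the partially synchronous model might look like (I'm assuming POSIX threads and a hypothetical replica_value that some other thread updates as messages arrive, none of which comes from the paper itself), consider a read that waits a bounded time for fresh data and otherwise falls back on its last known, possibly stale, copy. That is the compromise in a nutshell: the caller always gets a timely answer, but not necessarily a consistent one.

    #include <pthread.h>
    #include <stdbool.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t fresh = PTHREAD_COND_INITIALIZER;
    static int replica_value = 0;   /* Last value heard from the peer. */
    static bool have_fresh = false; /* Set by the "network" thread on arrival. */

    /* Return the freshest value available within timeout_ms milliseconds. */
    int read_with_timeout(long timeout_ms)
    {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += timeout_ms / 1000;
        deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&lock);
        while (!have_fresh) {
            if (pthread_cond_timedwait(&fresh, &lock, &deadline) != 0) {
                break; /* Timed out: as far as we care, the network is partitioned. */
            }
        }
        int value = replica_value; /* Fresh if have_fresh, stale otherwise. */
        have_fresh = false;
        pthread_mutex_unlock(&lock);
        return value;
    }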

While both models are applicable to real world real-time systems that I have helped develop over the past few decades, it is the partially synchronous model that has been leveraged by the big e-commerce players. In particular, Vogels has written about how Amazon.com uses weak or eventual consistency as its own necessary sacrifice to the CAP Theorem.

So, finally, here's the thing: all this exists, writ small, in the world of multiple processors and cores, cache coherency protocols, and shared memory. The individual cores constitute the distributed agents. Their individual memory caches contain the copies of shared data. The interconnects between the cores form the network. The cache coherency protocols correspond to the transaction protocols. Threads in a multi-threaded application reading and writing shared data in main memory create the workload. The same issues of consistency, availability, and partition-tolerance apply. And the same trade-offs are made, whether you realize it or not.
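
Here is a minimal sketch of that correspondence using C11 atomics and POSIX threads (the variable names are mine, not anything from the sources below): the two threads play the distributed agents, their caches hold the replicas, and the memory ordering we ask for is the transaction protocol. With relaxed ordering the reader may legally see the flag without the data, that is, a stale replica; upgrading the marked operations to release and acquire buys back consistency at the cost of waiting on the interconnect.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int data = 0;
    static atomic_int flag = 0;

    static void * writer(void * arg)
    {
        (void)arg;
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        /* memory_order_release here would forbid the reordering described below. */
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        return (void *)0;
    }

    static void * reader(void * arg)
    {
        (void)arg;
        /* memory_order_acquire here would pair with a release store above. */
        if (atomic_load_explicit(&flag, memory_order_relaxed) == 1) {
            /* With relaxed ordering this may legally print 0: a stale replica. */
            printf("data=%d\n", atomic_load_explicit(&data, memory_order_relaxed));
        }
        return (void *)0;
    }

    int main(void)
    {
        pthread_t w;
        pthread_t r;
        pthread_create(&w, (pthread_attr_t *)0, writer, (void *)0);
        pthread_create(&r, (pthread_attr_t *)0, reader, (void *)0);
        pthread_join(w, (void **)0);
        pthread_join(r, (void **)0);
        return 0;
    }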

Don't think the interconnect network in your multi-core system is ever partitioned? Sure it is. Network connectivity failures are never permanent, unless you are unplugging the power cord from your server and tossing the whole box into the dumpster. It's all just a question of how much latency your system is willing to tolerate. A distributed system on the scale of Google's may tolerate seconds, minutes, or even hours, relying on redundancy to carry the load until the network can heal. The kinds of message passing systems I've helped develop over the years might have a latency threshold measured in tens or hundreds of milliseconds. A microcontroller talking to an FPGA might tolerate microseconds. A microprocessor execution unit waiting for a word from cache might measure its latency threshold in fractions of a nanosecond before the execution pipeline stalls.

It's all the same model. Just the decimal point moves.

Ulrich Drepper, of Red Hat, authored a massive but indispensable tome on how memory actually works in modern microprocessors. In it he wrote of the MESI (Modified, Exclusive, Shared, Invalid) cache coherency protocol:

If multiple threads modify the same memory location concurrently, processors do not guarantee any specific result. This is a deliberate decision made to avoid costs which are unnecessary in 99.999% of all cases. For instance, if a memory location is in the ‘S’ state and two threads concurrently have to increment its value, the execution pipeline does not have to wait for the cache line to be available in the ‘E’ state before reading the old value from the cache to perform the addition. Instead it reads the value currently in the cache and, once the cache line is available in state ‘E’, the new value is written back. The result is not as expected if the two cache reads in the two threads happen simultaneously; one addition will be lost.

This is exactly the kind of compromise the distributed systems folk are talking about when they discuss the implications of the CAP Theorem. In fact, while reading Gilbert and Lynch's paper, I was struck by how much the model they were describing in their proof corresponded to the MESI protocol, and how closely their synchronous model described the kind of real-time systems, distributed and otherwise, with which I am familiar. In their quest for performance, microprocessor designers have left behind ACID for what the distributed systems folk refer to as BASE: Basically Available, Soft-state, Eventual consistency. And they've done it for all the same reasons.
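
Drepper's lost-update scenario is easy to reproduce from a chair rather than a lab bench. Here is a minimal sketch with POSIX threads and C11 atomics (the iteration count is arbitrary): two threads increment a shared counter, the plain read-modify-write loses updates exactly as described above, while the atomic increment pays for exclusive ownership of the cache line and gets the serialized, ACID-flavored answer.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 1000000

    static long plain_counter = 0;         /* Racy: increments may be lost.   */
    static atomic_long atomic_counter = 0; /* Serialized: no increments lost. */

    static void * worker(void * arg)
    {
        (void)arg;
        for (int i = 0; i < ITERATIONS; ++i) {
            plain_counter = plain_counter + 1; /* Unsynchronized read-modify-write. */
            atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);
        }
        return (void *)0;
    }

    int main(void)
    {
        pthread_t t1;
        pthread_t t2;
        pthread_create(&t1, (pthread_attr_t *)0, worker, (void *)0);
        pthread_create(&t2, (pthread_attr_t *)0, worker, (void *)0);
        pthread_join(t1, (void **)0);
        pthread_join(t2, (void **)0);
        /* plain_counter typically comes up short; atomic_counter is exact. */
        printf("plain=%ld atomic=%ld expected=%d\n",
            plain_counter, (long)atomic_load(&atomic_counter), 2 * ITERATIONS);
        return 0;
    }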

Why is this correspondence important? Because I think the CAP Theorem is just as valuable a tool for designing and reasoning about multi-processor and multi-core systems as it is for globe-girdling distributed systems.

I was in that shower a long time. Mrs. Overclock (etc.) probably wondered what was going on.

Update (2010-05-08)

When I had my ah-ha moment in the shower, I knew it was obvious enough that someone must have thought of it before. But my own research, online and otherwise, didn't turn anything up. I had read Vogels' article in ACM Queue, but this morning I went back and read his original blog article from which that was taken. Sure enough, two years ago, in the comments section, Manish Vachharaiani made the same observation about the similarity between the CAP Theorem and cache coherency.

And in those same comments, Roy Friedman cites the first formal treatment of the CAP issue, Attiya and Welch, which I've added to my sources below.

Sources

Hagit Attiya, Jennifer L. Welch, "Sequential Consistency versus Linearizability", ACM Transactions on Computing Systems, 12(2):91-122, 1994

Eric A. Brewer, "Towards Robust Distributed Systems", ACM Symposium on Principles of Distributed Computing, July 2000

Julian Browne, "Brewer's CAP Theorem", January 2009

Ulrich Drepper, What Every Programmer Should Know About Memory, Red Hat Inc., November 2007

Armando Fox, Eric A. Brewer, "Harvest, Yield, and Scalable Tolerant Systems", Proceedings of the Seventh Workshop on Hot Topics in Operating Systems, IEEE Computer Society, 1999

Seth Gilbert, Nancy Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", ACM SIGACT News, 33(2), June 2002

J. L. Sloan, "Implications of Memory Models (or Lack of Them) for Software Developers", Digital Aggregates Corporation, April 2010

Werner Vogels, "Eventually Consistent", ACM Queue, October 2008

2 comments:

Anonymous said...

Hi.

I'm almost 10 years late, but better late than never...

I think you are the right person to ask this question.

The question is about the mathematical foundations behind distributed systems, of which there are almost none, in comparison to communication engineering, which has the whole of information theory behind it.

In communication we have separate unreliable (noisy) components and an unreliable (noisy) channel, and with the help of a whole theory based on the Shannon theorem, the essence of which is that from unreliable individual components and an unreliable channel we can build a reliable system, we get to some number of nines, 99.999(9). Now consider a computer with limited physical capabilities: one machine can process only a finite number of requests per unit of time. If we need to process many more requests than one machine can handle, we add more machines, and so on. And we also have an unreliable channel, the network. And that's all: no formalisms, no formulas, and indeed no mathematics. The analogy with receiving and transmitting signals over unreliable components is almost direct, namely that from individual (unreliable) components, with the help of certain laws, we get a working system. Yet in one case a whole theory was developed with a detailed mathematical apparatus covering coding, recovery of information, compression, and so on, and in the other, apart from CAP, there is in fact nothing: no formulas, no quantitative indicators.
Building distributed systems is somewhat of an art, where everyone does the best they can.

The question is: why are there no mathematical foundations in the theory of distributed systems? There is only CAP, which lacks any quantitative characteristics or formulas. Well, yes, there is some compromise between C, A, and P, choose 2 out of 3. I've looked at the Lynch paper above; it has some proofs, probably with some sort of set theory involved, but it is still too far from information theory.
I've asked other engineers similar questions, and some of them told me that CAP is completely useless (see above for why). Yes, it defines some abstractions, some directions to look in, and that's all; everything else is up to you.

Chip Overclock said...

Many years ago I worked for a big multinational telecommunications equipment manufacturer. Their product line is (because they're still around) broad and deep. The code base for the biggest of those products was as large as eight million lines of code, mostly C, and that was just what ran on the central processor, never mind all the firmware that ran at a lower level. One of the specific projects I worked on was firmware for an ATM interface board that allowed the multiple cabinets of an enterprise telephone system to be connected by fiber optic cable over a SONET network. This allowed the system to be broadly physically distributed, even (as we came to learn from our more innovative customers) internationally. That company had a group of engineers who did performance analysis, for the purposes of configuring a customer's hardware and software, often involving a lot of mathematics and modeling, using tools like queueing theory. This got applied a lot to ginormous installations, typically call centers, which had a lot of moving parts in the distributed system sense. That's the closest I can come to what you're looking for, and it wasn't my area, so that's about all I can say about it. But such analysis is done, or at least was when I was involved in such R&D. I suspect that such groups exist today, both in the telecom domain and in cloud providers like Amazon and Google. However, whether they're publishing much about what they do is an open question; from time to time I do see articles about best practices and design rules for such organizations. I suspect if you search, you may find some useful material.