Posted on:
May 06, 2011
by Wolfram Arnold

Chaos in the Cloud

Amazon’s EC2 Cloud experienced a large-scale outage. The analysis report speaks of cascading failures and compounding bugs on many levels. Chaos Theory has some answers as to why this happened, if not solutions for how to prevent it in the future.

I recently read the summary of the EC2 outage by Amazon’s engineering team. As they explained in more and more detail how this system is architected, I couldn’t help but be amazed at the intricacy and the complexity that Amazon has engineered into its cloud computing platform.

What struck me most was the intelligence bordering on self-awareness designed into each layer of the architecture. There are automatic failure detection mechanisms, automatic recovery procedures, transient authority systems, request throttling features, and many others.

Despite the dense writing, my growing astonishment at the fateful sequence of events that led to the collapse on April 21st kept me reading. Little by little, a thought, a recognition pushed itself to the forefront of my consciousness. Is it possible that what happened in the EC2 cloud is what physical systems experience as they descend into chaos?

Let’s review what chaos is. The most helpful operative definition I’ve found in my graduate school years in physics is this one: A system is chaotic when the trajectories of the system’s evolution for very similar initial conditions diverge exponentially quickly. In plainer language: given two starting points very close together, the outcomes after some time are vastly different. This phenomenon has been popularized, e.g. as the Butterfly Effect, where the flapping of the wings of a butterfly in China can trigger a hurricane in the Atlantic.
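To make the definition concrete, here is a minimal sketch in Python. It uses the logistic map, a textbook toy system that is my choice for illustration and has nothing to do with EC2; the function names and parameter values are likewise made up for this example. Two trajectories start a hair's breadth apart, and we record how far they drift from each other.

```python
def logistic_trajectory(x0, r=4.0, steps=100):
    """Iterate x -> r * x * (1 - x) and return the list of visited values."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_separation(x0, delta=1e-10, r=4.0, steps=100):
    """Largest distance between two trajectories that start delta apart."""
    a = logistic_trajectory(x0, r, steps)
    b = logistic_trajectory(x0 + delta, r, steps)
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    # r = 4.0 is the chaotic regime of the map, r = 2.5 a stable one.
    print("chaotic regime:", max_separation(0.2, r=4.0))
    print("stable regime: ", max_separation(0.2, r=2.5))
```

In the chaotic regime the tiny initial difference grows until the two trajectories have nothing to do with each other; in the stable regime it stays tiny.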

The opposite of chaos is stability, where the trajectories of two similar initial conditions evolve alongside each other and are still very close at any later time. An example of stability is Earth’s orbit around the Sun, or the Solar System’s trajectory inside our galaxy, the Milky Way.

Image: Milky Way. Source: NASA

To be precise, Earth’s path around the Sun is only “quasi-stable”, but before we get into that let’s discuss what the necessary ingredients are that make a system chaotic.

It’s not much, actually. You need some type of feedback mechanism that has a tendency to return the system to where it came from. This could be a so-called restoring force, as in a pendulum: when it swings out to the side, gravity pulls it back down. Earth’s orbit around the Sun operates by a similar principle. Or it could be a trigger mechanism like the one in Amazon’s EBS nodes, which causes them to try to generate a new copy when they notice a failing node, thereby restoring the system to a full complement of replicas. Or it could be a microphone held too close to the speaker of a PA system with the amplifier cranked up. The mike picks up some small ambient noise, which gets amplified, output through the speaker and fed right back into the mike, to be amplified more, output, fed back in, and so forth. Eventually the mike-speaker-amplifier system settles into a stable state: a high-pitched, loud and annoying sound.
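The PA example can be sketched in a few lines of Python. This is a deliberately crude model that tracks only the loudness level, not the pitch, and the gain, noise and clipping limit are numbers I picked for illustration. Each round, the level picked up by the mike is multiplied by the loop gain and clipped at the amplifier's maximum output.

```python
def feedback_loop(noise=1e-3, gain=1.5, limit=1.0, rounds=40):
    """Feed the amplifier's output level back into its input, round after round."""
    level = noise
    history = [level]
    for _ in range(rounds):
        # Amplify what the mike picked up; the amplifier cannot exceed `limit`.
        level = min(gain * level, limit)
        history.append(level)
    return history


if __name__ == "__main__":
    print("loop gain above 1:", feedback_loop(gain=1.5)[-1])
    print("loop gain below 1:", feedback_loop(gain=0.8)[-1])
```

With a loop gain above 1 the faint noise grows until it sits at the limit and stays there; with a loop gain below 1 it dies away. Either way the end state is steady, not chaotic.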

A feedback mechanism by itself doesn’t cause chaos yet, and the example of the PA stuck in feedback is not a chaotic state; it’s a stable oscillating state (albeit one that audio engineers don’t want). The second key ingredient for chaos is non-linear dynamics. Non-linear means a relationship that’s not proportional, for instance a restoring force that is not proportional to the amount of displacement. A spring is a linear example: when you stretch out a spring, it pulls back with a force that is proportional to how far you’ve stretched it. In nature, linear laws are rare, perhaps even non-existent, but often nature can be approximated by linear laws. Now take the example of a trigger algorithm in an EBS node. This is an on/off operation and highly non-linear. In fact, all digital thresholding, triggering and counting logic is non-linear.
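The difference between the spring and the trigger can be checked directly. In this small Python sketch, the `trigger` function is a hypothetical stand-in for an on/off rule; it is not Amazon's actual logic, and the stiffness and threshold values are arbitrary. The test is simple: if I double the input, does the output double too?

```python
def spring_force(displacement, stiffness=2.0):
    """Linear restoring force: proportional to how far the spring is stretched."""
    return -stiffness * displacement


def trigger(missing_replicas, threshold=1):
    """Hypothetical on/off rule: fire (1) once the threshold is reached, else stay idle (0)."""
    return 1 if missing_replicas >= threshold else 0


def is_proportional(f, x, factor=2.0):
    """True if scaling the input by `factor` scales the output by the same factor."""
    return abs(f(factor * x) - factor * f(x)) < 1e-12


if __name__ == "__main__":
    print("spring: ", is_proportional(spring_force, 0.3))
    print("trigger:", is_proportional(trigger, 1))
```

The spring passes for any displacement. The trigger fails as soon as the threshold is involved: one missing replica fires it, and two missing replicas fire it exactly as much, not twice as much.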

I’ve said that non-linear dynamics coupled with a feedback mechanism gives rise to chaos. That’s an over-generalization. For example, the physics governing Earth’s orbit around the Sun are linear (and thus perfectly stable) only in first approximation. In reality the moon, other planets, the galaxy, etc. also exert small gravitational pulls on the Earth which subtly distort the linear law. But that still doesn’t cause the solar system to be in a state of chaos, current political affairs notwithstanding.

Chaotic systems are funny beasts. They can have so-called quasi-stable regions, like Earth’s orbit around the Sun. But when the right conditions come together, the system can drift out of the region of stability and into a chaotic state. Then, after a while, it may find another, perhaps qualitatively very different area of stability. A lot of fascinating phenomena can occur as they do this, at the so-called Edge of Chaos. The Mandelbrot Set features its most intricate and beautiful behavior right at such an edge. In the black center region the underlying iteration stays bounded, in the blue outer region it runs off to infinity, and the edge between the two is where things are most intriguing, surprising and unpredictable.

Image: Mandelbrot Set
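For readers who want to poke at the Mandelbrot Set themselves, here is a minimal escape-time sketch in Python. The rule is the standard one: start at z = 0, repeat z -> z*z + c, and watch whether z ever leaves the circle of radius 2. The function name and the iteration cap are my own choices.

```python
def escape_time(c, max_iter=200):
    """Return the step at which z -> z*z + c first leaves |z| <= 2,
    or None if it has not left after max_iter steps."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return n
    return None


if __name__ == "__main__":
    # Walk along the real axis towards the edge of the set at c = 0.25.
    for c in (1.0, 0.3, 0.26, 0.0):
        print(c, escape_time(complex(c, 0.0)))
```

Points well outside the set escape after a handful of steps and points inside never do. Close to the edge, the number of steps becomes very sensitive to small changes in c.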

The clever feedback and trigger mechanisms in Amazon’s EC2 cloud provided all the right ingredients. Under normal circumstances they operate beautifully, stably, reliably, are robust to failures and self-correcting. Then, a small error was made during a localized network upgrade. That triggered a series of events which catapulted the system out of its stable regime and into chaos. After a short time the system found a new region of stability with EBS instances trying to replicate themselves like mad and getting stuck. The stuckness was another, very different stable state (in systems lingo), and obviously a highly undesirable one. The system was having a seizure of sorts.
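The idea of a system with a healthy regime and a stuck regime can be illustrated with a toy model. To be clear about the assumptions: nothing below is taken from Amazon's report. The request rates, the capacity and in particular the rule that throughput collapses under overload are invented for illustration. The model tracks a backlog of unserved requests that are retried every tick on top of the normal load.

```python
def simulate_backlog(spike, base_load=50.0, capacity=100.0, ticks=30):
    """Track the backlog of unserved requests after a one-time spike (toy model)."""
    backlog = float(spike)
    history = [backlog]
    for _ in range(ticks):
        # New requests plus retries of everything still stuck.
        load = base_load + backlog
        if load <= capacity:
            # Normal regime: every request gets served.
            served = load
        else:
            # Assumed overload rule: the more load, the less gets done.
            served = capacity * capacity / load
        backlog = load - served
        history.append(backlog)
    return history


if __name__ == "__main__":
    print("moderate spike:", simulate_backlog(100)[-1])
    print("large spike:   ", simulate_backlog(200)[-1])
```

With these made-up numbers, a moderate disturbance drains away within a few ticks and the system is back to normal. A larger one pushes the system past a tipping point, after which the retries themselves keep it overloaded and the backlog never recovers on its own.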

As Amazon’s writeup illustrates, the unraveling of the chain of cause and effect was a non-trivial undertaking. It also showed that the system was taxed in unforeseen ways, and hidden bugs were discovered that compounded the situation.

But the bigger lesson that I take away is this: Even without hidden bugs, complex systems with many layers of interconnected, intelligent components, each having non-linear feedback mechanisms such as automatic trigger and self-correction procedures, have systemic failure modes. Any small disturbance in such a system has the potential for unforeseen breakdowns, in the form of a system-wide regime shift. The total parameter space of such systems can be staggering and beyond reasonable test capabilities. What I’m missing from Amazon’s conclusions and proposed countermeasures is a system-theoretical discussion of how they are planning to address the inherent potential for chaotic behavior. Until this is addressed, the cloud may not have seen its last large-scale outage yet.
