Chaos Monkeys

I spent a chunk of Friday in an alpha training session on clouds. Much of it wasn’t new, but there were some insights into designing stuff for clouds as opposed to fixed infrastructure. The Netflix chaos monkeys came up, and it made me think. It isn’t the first time I’ve come across the chaos monkeys, but it put an image in my mind of a chimp in an old-fashioned data centre pulling cables out of a patch room and unplugging servers from the network. Add in a bit of the clichéd cleaner who pulls the plug out of the critical system to power the vacuum cleaner and you’re away with the mental image.

In the cloud no-one has to hear you scream. If you spaghetti-cable all the boxes together and make them all redundant by synchronising data in real time across multiple data stores, then you’re a good chunk of the way to ultra-resilient systems. There’s more you need to do though.

You need to make all the transactions stateless, and I don’t mean that they can’t get a passport. No. You need to design things so that it doesn’t matter if the chaos monkey comes along and pulls out the plug. Every time a person talks to a server, the packets need to contain everything the server needs for the transaction; none of the actions the server takes should depend on knowledge of a previous transaction with the same person. There are no sessions here, just a bunch of discrete events, each one divorced from those that may have come before it and from any that follow on. This means that you typically consume fewer resources on a server for a given set of transactions (potentially slightly more per discrete event, but for a far shorter time slice).
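To make the stateless idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `Request` shape, the `PRICES` catalogue and the `handle` function are assumptions, not anything from a real service. The point is only that the handler reads nothing but the request in front of it, so it doesn't matter which server answers.

```python
from dataclasses import dataclass

# Hypothetical catalogue, prices in pence (illustration only).
PRICES = {"tea": 250, "biscuits": 120}


@dataclass(frozen=True)
class Request:
    """One discrete event: the client carries its own basket with it."""
    user_id: str
    basket: tuple   # items added so far, sent by the client every time
    add_item: str   # the item this event adds


def handle(request: Request) -> dict:
    """Answer using only what is in the request: no session lookup,
    no memory of earlier events held on the server."""
    basket = request.basket + (request.add_item,)
    total = sum(PRICES[item] for item in basket)
    return {"basket": basket, "total": total}


# The first event lands on "server A".
first = handle(Request("alice", (), "tea"))

# The chaos monkey unplugs server A. The follow-up event goes to "server B",
# which is just another call to the same function with the basket attached.
second = handle(Request("alice", first["basket"], "biscuits"))
print(second)
```

The trade-off is the one mentioned above: each event is a little bigger because it drags the basket along, but no server has to hold anything between events.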

So on top of this you want proper cloud provision, following the NIST definition of cloud computing. This means that it should scale up and down practically instantaneously (although recognising that you’d probably have some limits on this because of a combination of cost and the practical availability of resources to your cloud service provider). If it has enough standby capacity, then surges in demand or component failures shouldn’t stop the service from working.
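As a rough sketch of what scaling with limits might look like, the Python below works out how many servers to run for a given load. The function name, the 25% standby margin and the floor and ceiling values are all assumptions picked for illustration; a real cloud provider's autoscaler will have its own rules.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      standby_fraction=0.25, min_instances=2, max_instances=20):
    """Size the fleet for current demand plus some standby capacity,
    clamped between a floor (resilience) and a ceiling (cost)."""
    needed = math.ceil(
        requests_per_second * (1 + standby_fraction) / capacity_per_instance
    )
    return max(min_instances, min(max_instances, needed))


# Hypothetical loads, assuming each server copes with 100 requests a second.
for load in (10, 1000, 5000):
    print(load, "req/s ->", desired_instances(load, 100), "instances")
```

The floor keeps more than one server running even when things are quiet, and the ceiling is where the cost limit bites during a surge.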

The other thing you need for resilience is a way to ensure that there are no single points of failure in your infrastructure. It’s all very well having a hugely scalable cloud in a couple of data centres, but if they share a flood plain, power supply, network connection or the like, then they have a single point of failure that could take them both down.
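One simple way to look for this kind of shared weakness is to write down what each site depends on and see what they all have in common. The sketch below does that in Python; the site names and dependencies are invented for the example, and a real audit would need far more detail than a few labels.

```python
def shared_dependencies(sites):
    """Return the dependencies that every site relies on.
    Anything in the result could take all the sites down at once."""
    dependency_sets = [set(deps) for deps in sites.values()]
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


# Hypothetical data centres and what each one depends on.
sites = {
    "dc-north": {"grid-A", "carrier-1", "river-valley"},
    "dc-south": {"grid-B", "carrier-1", "hilltop"},
}

print(shared_dependencies(sites))
```

Here the two sites sit on different power grids and different ground, but both hang off the same network carrier, so that carrier is the single point of failure.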

With all this in place, when the chaos monkey comes to visit, and the chaos monkey *will* pay a visit, there will be little or no discernible impact on people in the middle of transactions. Not having single points of failure will mean that some of your infrastructure is still available. Cloud scalability means that the loss of servers will be made good. Statelessness means that when people react to the result of their last event, they will still get a sensible response from whichever replacement server their next message reaches.