The Antifragile PaaS

Much has been made of Nassim Nicholas Taleb’s book Antifragile.

At a basic level, if fragile describes a system that suffers when put under stress, and robust describes a system that is impervious to stress, an antifragile system is one that benefits from stress.

Many things in nature are antifragile – a good example is the human body: you place a small amount of ‘stress’ on muscle, bone or skin and it’ll come back stronger than before the stress. In the world of IT we’re used to building robust systems: running apps on ‘traditional IT’ environments sized for peak load.

I’ve heard antifragility referenced in relation to microservices, heard recommendations on building similar systems and allowing for Darwinian selection amongst similar systems, etc. The next generation of applications will most certainly be antifragile - able to strengthen themselves in response to stressors. Will yours be amongst them?

Well, apps have to run on infrastructure of some kind. Today I’m going to talk about making antifragile PaaS infrastructure & how that can help make apps antifragile.

TL;DR

Some PaaS offerings are getting close to making apps antifragile using auto-scaling; bridge the remaining gap using team process
Make your PaaS infrastructure antifragile using team process
Fail everything all the time & re-route traffic to simulate failures and normal operations

Apps on PaaS

I’m a Cloud Foundry user, of both the open-source and vendor-provided types. The main advantage PaaS has over ‘traditional IT’ (by which I mean physical tin & VMs) for running apps is the simplicity with which environments can scale. By building and storing an executable application ‘droplet’ before starting applications, Cloud Foundry can ‘scale’ an app to multiple instances when manually requested to do so. The Pivotal Cloud Foundry offering now has a built-in feature to auto-scale applications between a lower and an upper instance limit based on CPU usage.

This kind of feature is putting application environments on the road from robustness to antifragility. The application’s infrastructure (its number of instances) increases under load. The only reason it isn’t quite a ‘true’ antifragile infrastructure yet is that the thresholds at which the app scales aren’t learnt – yet.
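To make the idea concrete, here is a minimal sketch of the kind of rule a CPU-based auto-scaler applies. It is not Pivotal’s implementation: the function name, the one-instance step and the default thresholds are all assumptions made for illustration.

```python
def desired_instances(current, cpu_percent,
                      min_instances=2, max_instances=10,
                      scale_up_at=80.0, scale_down_at=30.0):
    """Return the instance count a simple CPU-based auto-scaler would pick.

    All names and default values here are hypothetical.
    """
    if cpu_percent > scale_up_at:
        target = current + 1   # under load: add an instance
    elif cpu_percent < scale_down_at:
        target = current - 1   # idle: remove an instance
    else:
        target = current       # inside the comfortable band: do nothing
    # Never leave the configured lower/upper limits.
    return max(min_instances, min(max_instances, target))
```

Note that the thresholds and limits are fixed inputs. Nothing in the rule changes them in response to what happened last week, which is exactly the ‘not learnt yet’ gap.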

Thankfully you can bridge the gap using team process: app support teams can use APM tools to check app performance metrics other than CPU, track how the number of instances has been adjusted over time, and adjust the CPU thresholds & max/min instance counts accordingly.
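The review the support team does by hand could be sketched roughly as follows. This assumes you can export a history of instance counts from your APM or platform tooling; the function, its parameters and the 20% cut-off are hypothetical, and a real review would look at response times and error rates too.

```python
def review_limits(instance_history, min_instances, max_instances,
                  pinned_fraction=0.2):
    """Suggest new (min, max) instance limits from observed instance counts.

    instance_history: instance counts sampled at regular intervals.
    pinned_fraction:  assumed cut-off for "spent too long at the maximum".
    """
    if not instance_history:
        return min_instances, max_instances

    samples = len(instance_history)
    at_max = sum(1 for c in instance_history if c >= max_instances) / samples
    at_min = sum(1 for c in instance_history if c <= min_instances) / samples

    new_min, new_max = min_instances, max_instances
    if at_max > pinned_fraction:
        new_max = max_instances + 1   # ran out of headroom too often
    if at_min == 1.0 and min_instances > 1:
        new_min = min_instances - 1   # never needed to scale up at all
    return new_min, new_max
```

For example, a history of [4, 4, 4, 3, 2] with limits of 2 and 4 spends 60% of the time at the maximum, so the sketch suggests raising the upper limit to 5.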

So PaaS is getting close to antifragile for apps, but it’s not there yet. It is the people and process of the app support team that can make the app antifragile.

PaaS Infrastructure

Whatever PaaS you go for there’ll be a number of nodes of each type, scaled to the minimum level for the function they provide, e.g. 6 compute nodes, 1 central controller, 2 clustered message bus nodes, 1 log aggregator, etc. After running the PaaS for a while you’re going to hit a limit on one of these: you’ll run out of compute capacity, start receiving more logs than you can handle, or something similar. PaaS offerings typically allow these components to be scaled individually.

Right now I haven’t seen any PaaS offerings that detect prolonged heavy activity and scale components accordingly, so it’s your PaaS Operations team – the folks who are running your PaaS infrastructure – who are going to fill the gap.

These folks will need to take everyday warning signs from the PaaS itself – the log messages and metrics coming out of it – and scale or reconfigure the relevant component to strengthen the PaaS as a whole. So in the case of PaaS infrastructure, it is the PaaS Operations people and process that can make the infrastructure antifragile.
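As a rough illustration of ‘detecting prolonged heavy activity’, the check below flags any component whose most recent utilisation samples are all above a threshold. The component names, the 0.8 threshold and the three-sample window are assumptions; in practice the numbers would come from whatever metrics your PaaS emits.

```python
def components_to_scale(samples, threshold=0.8, sustained=3):
    """Return components whose last `sustained` samples all exceed `threshold`.

    samples: dict mapping a component name to a list of utilisation values
             (0.0 to 1.0), oldest first. Names and values are illustrative.
    """
    flagged = []
    for name, values in samples.items():
        recent = values[-sustained:]
        # Require a full window so one early spike doesn't trigger scaling.
        if len(recent) == sustained and all(v > threshold for v in recent):
            flagged.append(name)
    return sorted(flagged)


metrics = {
    "compute": [0.50, 0.85, 0.90, 0.95],
    "log_aggregator": [0.90, 0.40, 0.90],
    "message_bus": [0.90, 0.90],
}
print(components_to_scale(metrics))
```

Here only the compute nodes are flagged: the log aggregator dipped in the middle of the window, and the message bus doesn’t have enough samples yet.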

Fail everything. All the time.

I’ve blogged before about how infrastructure that is too stable is actually a bad thing. Antifragility supports this thinking – a small amount of ‘stress’ applied regularly to application infrastructure will cause the app developers to strengthen it to avoid getting called out at night. I’m aware of the following approaches for Cloud Foundry:

Randomised infrastructure death using Chaos Lemur, akin to Netflix’s Simian Army, to test PaaS infrastructure. Because these failures will include compute nodes, this covers both apps and the CF infrastructure itself.
If PaaS environments have redundancy (e.g. run live-live), randomised routing of traffic from one Cloud Foundry instance to another to iron out routing / load-related problems.
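Both approaches boil down to a random choice made on a schedule. The sketch below shows the two decisions in isolation. It does not use Chaos Lemur’s API or any Cloud Foundry router feature: the function names, node names and foundation names are hypothetical, and actually killing a VM or shifting traffic is left to your own tooling.

```python
import random


def pick_victims(nodes, probability, rng):
    """Select each node for termination independently with `probability`."""
    return [node for node in nodes if rng.random() < probability]


def pick_foundation(foundations, rng):
    """Choose which live Cloud Foundry instance receives the next request."""
    return rng.choice(foundations)


rng = random.Random()  # pass random.Random(seed) for repeatable test runs
nodes = ["compute-0", "compute-1", "controller-0", "message-bus-0"]
foundations = ["cf-site-a", "cf-site-b"]

print("terminate:", pick_victims(nodes, 0.2, rng))
print("route to:", pick_foundation(foundations, rng))
```

Keeping the probability low means most runs kill nothing or a single node, which matches the ‘small amount of stress applied regularly’ idea rather than a full outage drill.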

Experience says you need to do this all the way up to production. It’ll reassure developers and testers that the glitches they’re seeing in non-production aren’t just glitches – they’re real-world scenarios whose root cause needs to be analysed – and it will keep them honest.

Written on October 8, 2015