The big day is finally here. That project your team has been working on for the last six months? It’s launching today. A little giddy with anticipation, the team monitors the deployment process as dashboards click over from red to green. Everything is working perfectly.
These are some of the best days to work in the tech world. Through a lot of hard work and persistence, you’ve brought something into the world which wouldn’t have existed otherwise. Your project is a success; it does what it’s supposed to and people love using it.
Those days are balanced out by the days on the opposite end of the spectrum. The days when your popular service crashes for reasons you didn’t anticipate. Maybe a key component fails at an inopportune time. Perhaps someone forgot to account for the database server not being available on the network. Or maybe you assumed your AWS servers would always be available, and your code doesn’t know what to do when they switch off without warning.
For some companies, days like these are high-stress. Others take them in stride, because they’ve engineered their products to handle these situations and keep on ticking.
Fault tolerance and chaos engineering
Building systems that withstand availability hiccups is not a new concept. Designing for fault tolerance is a practice as old as networked computing itself, and it remains a best practice today, even for early-stage startups. For instance, it’s common for even the tiniest companies to back up their customer databases. While a database backup doesn’t mean your application will withstand your database server crashing, it’s a good first step.
Some companies take designing fault-tolerant systems to an entirely different level. They build redundant systems in geographically distinct locations, so that even if an entire data center disappears, their application will persist. The most advanced of these companies implement a practice known as chaos engineering, wherein they deliberately trigger failures of those redundant systems, to make sure the backups work as expected.
Breaking things on purpose
The core principle of chaos engineering is building tools that break software in unexpected ways. For instance, Netflix, one of the pioneers of chaos engineering, has tools that shut down entire server clusters to make sure its application doesn’t crash. Its engineering team builds tools that automate this process, and it happens right in production. Between the hours of 9 and 3, Monday through Friday, those automated tools are free to shut down any part of the ecosystem. There’s no warning; things just break.
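To make that concrete, here’s a minimal sketch of what a chaos-monkey-style tool could look like. This is not Netflix’s actual tooling; the `chaos-opt-in` tag, the AWS region, and the schedule check are all assumptions for illustration. The idea is simple: during business hours, pick a random instance that has opted in to experiments and terminate it.

```python
"""A minimal chaos-monkey-style sketch (not Netflix's actual tooling).

Assumes AWS credentials are configured and that instances opted in to
experiments carry a hypothetical `chaos-opt-in=true` tag.
"""
import random
from datetime import datetime

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption


def business_hours(now: datetime) -> bool:
    # Only break things between 9:00 and 15:00, Monday through Friday,
    # when engineers are around to respond.
    return now.weekday() < 5 and 9 <= now.hour < 15


def terminate_random_instance() -> None:
    if not business_hours(datetime.now()):
        return

    # Find running instances that have explicitly opted in to experiments.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i for r in reservations for i in r["Instances"]]
    if not instances:
        return

    victim = random.choice(instances)["InstanceId"]
    print(f"Terminating {victim} -- the service should survive this.")
    ec2.terminate_instances(InstanceIds=[victim])


if __name__ == "__main__":
    terminate_random_instance()
```

A real tool would run on a schedule, limit its blast radius, and log everything it kills so that any resulting failure can be traced back to the experiment rather than mistaken for a genuine outage.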
For Netflix, this kind of chaos is valuable. Engineers and managers there can be confident that even if something as drastic as an entire AWS region goes offline, Netflix will still allow people to stream Friends reruns in peace.
Why adopt chaos engineering?
While chaos engineering is an evolutionary step from designing fault-tolerant systems, it’s a pretty big one. It takes a lot of nerve to trust that the systems you’ve designed will stand up under adverse conditions. However, the benefits are significant even if your systems are poorly designed. That might seem counterintuitive: if your systems are poorly designed, won’t chaos engineering just crash your product?
Yes, you’re going to wind up crashing some systems. You’re also going to learn a lot about them in the process. Every crash is data about something your team needs to fix. If a hard drive fails on its own, your system is out of commission for however long it takes you to respond. If instead you bring that hard drive down in a controlled fashion, you can spin it right back up again: the outage lasts only as long as you need to gather information about the failure, and then the service is running again while your team designs a fix. Chaos engineering makes your systems stronger than they otherwise would have been. The next time a hard drive fails, the system keeps humming along, and you won’t be getting alerted at 3 AM because the service crashed.
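As a sketch of what that controlled approach can look like in practice, here’s a small experiment script. Every name in it is an assumption for illustration: a hypothetical `orders-db` Docker container stands in for the dependency, and a hypothetical health endpoint at `http://localhost:8080/health` stands in for your service.

```python
"""Sketch of a controlled failure experiment (hypothetical names throughout)."""
import subprocess
import time

import requests

DB_CONTAINER = "orders-db"                   # hypothetical dependency
HEALTH_URL = "http://localhost:8080/health"  # hypothetical app endpoint


def probe() -> str:
    # Record either the HTTP status code or the kind of error we got.
    try:
        return str(requests.get(HEALTH_URL, timeout=2).status_code)
    except requests.RequestException as exc:
        return type(exc).__name__


def run_experiment(duration_seconds: int = 30) -> list:
    observations = []
    subprocess.run(["docker", "stop", DB_CONTAINER], check=True)  # inject the fault
    try:
        deadline = time.time() + duration_seconds
        while time.time() < deadline:
            observations.append(probe())  # gather data while the dependency is gone
            time.sleep(2)
    finally:
        subprocess.run(["docker", "start", DB_CONTAINER], check=True)  # always restore
    return observations


if __name__ == "__main__":
    print("Health checks during the outage:", run_experiment())
```

The `finally` block is the point: the failure happens on your terms, you capture the data you came for, and the dependency comes back the moment you have it.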
Bioware’s Anthem and ignoring chaos engineering
One of my favorite examples of what can go wrong when you ignore chaos engineering comes from video gaming. In January 2019, EA and Bioware released the beta for their much-anticipated game, Anthem. Anthem is an online multiplayer game, meaning the beta would involve hundreds of thousands of players logging in to play across a long weekend. The beta was the beginning of a big marketing push in the lead-up to the game’s release in February, about six weeks later. Bioware guaranteed a big pool of players by making sure everyone who preordered the game could log into the beta, and it gave away codes on its website and across social media.
When the beta turned on, Bioware had a day like the one in our opening. Dashboards clicked to green. The software was working great. Players started logging in to explore the world Bioware had built.
Then everything fell down
Things didn’t break right away. Instead, players started reporting that they couldn’t log into the game. This wasn’t too surprising; most online games experience authentication problems when they first boot up, because far more players try to log in at once than the authentication servers are prepared to handle. A hiccup like that isn’t ideal, but it’s something an experienced team expects.
What they didn’t expect was that the failure would cascade. It turns out that Anthem’s login code contained a minor oversight: if the authentication servers weren’t available, the game client would just keep retrying the login as fast as it could. And for decades, players of online games have been trained to keep trying when a login doesn’t work the first time; temporary hiccups resolve, and they get back into the game they were trying to play.
This meant that all those players were, in effect, performing a denial-of-service attack on the very game they were trying to play. This is precisely why chaos engineering can be so valuable. If Bioware had made a practice of testing how the game behaved with services missing, they would’ve caught this oversight early. Instead, hundreds of thousands of players were locked out of the game completely, just when Bioware most wanted them to play.
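To see how small the oversight was, here’s a sketch of the difference between a tight retry loop and one that backs off. The function names are hypothetical illustrations, not Anthem’s actual client code.

```python
"""Naive retries vs. exponential backoff with jitter (hypothetical client code)."""
import random
import time


def login_naive(attempt_login) -> None:
    # The oversight: if login fails, try again immediately, forever.
    # A few hundred thousand clients doing this at once looks exactly
    # like a denial-of-service attack on the authentication servers.
    while not attempt_login():
        pass


def login_with_backoff(attempt_login, max_delay: float = 60.0) -> None:
    # Wait a random amount up to a cap that doubles on every failure.
    # The randomness (jitter) keeps clients from retrying in synchronized waves.
    delay = 1.0
    while not attempt_login():
        time.sleep(random.uniform(0, delay))
        delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    # Simulate an auth service that recovers after a few attempts.
    attempts = {"count": 0}

    def flaky_login() -> bool:
        attempts["count"] += 1
        return attempts["count"] > 5

    login_with_backoff(flaky_login)
    print(f"Logged in after {attempts['count']} attempts.")
```

With backoff in place, a struggling authentication service sees its load fall off instead of spiking, which gives it room to recover.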
One step at a time
As we noted, chaos engineering can be a big step. If your team isn’t used to dealing with outages, shutting down services at random in the middle of the workday probably isn’t the best first step. Instead, approach chaos engineering iteratively. Plan outages and have your team on hand to gather data about the ways your product fails. Take a system offline, see what happens, then bring it back up quickly. Let your team fix the software, then try again. Eventually, you’ll get to a point where your software withstands missing pieces of the service without interruption. You’ll be able to maintain a steady state despite a chaotic environment. From there, you can work to make the environment more chaotic, and as you do, you can sleep more soundly, knowing that a little hiccup won’t take the whole thing down.
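One way to keep that iteration honest is to write the plan down as an ordered list of experiments, smallest blast radius first, with a steady-state check between each one. Here’s a rough sketch; every experiment name and the `steady_state()` check are placeholders you’d replace with your own.

```python
"""Sketch of an iterative gameday plan; all names and steps are hypothetical."""
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    inject: Callable[[], None]   # cause the failure
    restore: Callable[[], None]  # undo it


def steady_state() -> bool:
    # Placeholder: in practice, check your dashboards and SLOs
    # (error rate, latency, successful logins, and so on).
    return True


def placeholder(action: str) -> Callable[[], None]:
    return lambda: print(action)


# Ordered from smallest blast radius to largest. Stop escalating the moment
# the system fails to hold steady, fix what you found, then come back.
EXPERIMENTS = [
    Experiment("kill one worker process", placeholder("killing worker"), placeholder("restarting worker")),
    Experiment("stop one cache node", placeholder("stopping cache"), placeholder("starting cache")),
    Experiment("take down a whole availability zone", placeholder("failing AZ"), placeholder("restoring AZ")),
]


def run_gameday() -> None:
    for exp in EXPERIMENTS:
        print(f"Running experiment: {exp.name}")
        exp.inject()
        healthy = steady_state()
        exp.restore()
        if not healthy:
            print(f"Steady state broke during '{exp.name}'. Fix that before escalating.")
            return
    print("All experiments passed. Time to add a little more chaos.")


if __name__ == "__main__":
    run_gameday()
```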
This post was written by Eric Boersma. Eric is a software developer and development manager who’s done everything from IT security in pharmaceuticals to writing intelligence software for the US government to building international development teams for non-profits. He loves to talk about the things he’s learned along the way, and he enjoys listening to and learning from others as well.