Chaos Engineering
eBook - ePub

Mikolaj Pawlikowski

About This Book

Auto engineers test the safety of a car by intentionally crashing it and carefully observing the results. Chaos engineering applies the same principles to software systems. In Chaos Engineering: Site reliability through controlled disruption, you'll learn to run your applications and infrastructure through a series of tests that simulate real-life failures. You'll maximize the benefits of chaos engineering by learning to think like a chaos engineer, and how to design the proper experiments to ensure the reliability of your software. With examples that cover a whole spectrum of software, you'll be ready to run an intensive testing regime on anything from a simple WordPress site to a massive distributed system running on Kubernetes.

1 Into the world of chaos engineering

This chapter covers
  • What chaos engineering is and is not
  • Motivations for doing chaos engineering
  • Anatomy of a chaos experiment
  • A simple example of chaos engineering in practice
What would you do to make absolutely sure the car you’re designing is safe? A typical vehicle today is a real wonder of engineering. A plethora of subsystems, operating everything from rain-detecting wipers to life-saving airbags, all come together to not only go from A to B, but to protect passengers during an accident. Isn’t it moving when your loyal car gives up the ghost to save yours through the strategic use of crumple zones, from which it will never recover?
Because passenger safety is the highest priority, all these parts go through rigorous testing. But even assuming they all work as designed, does that guarantee you’ll survive in a real-world accident? If your business card reads, “New Car Assessment Program,” you demonstrably don’t think so. Presumably, that’s why every new car making it to the market goes through crash tests.
Picture this: a production car, heading at a controlled speed, closely observed with high-speed cameras, in a lifelike scenario: crashing into an obstacle to test the system as a whole. In many ways, chaos engineering is to software systems what crash tests are to the car industry: a deliberate practice of experimentation designed to uncover systemic problems. In this book, you’ll look at the why, when, and how of applying chaos engineering to improve your computer systems. And perhaps, who knows, save some lives in the process. What’s a better place to start than a nuclear power plant?

1.1 What is chaos engineering?

Imagine you’re responsible for designing the software operating a nuclear power plant. Your job description, among other things, is to prevent radioactive fallout. The stakes are high: a failure of your code can produce a disaster leaving people dead and rendering vast lands uninhabitable. You need to be ready for anything, from earthquakes, power cuts, floods, and hardware failures to terrorist attacks. What do you do?
You hire the best programmers, set in place a rigorous review process, test coverage targets, and walk around the hall reminding everyone that we’re doing serious business here. But “Yes, we have 100% test coverage, Mr. President!” will not fly at the next meeting. You need contingency plans; you need to be able to demonstrate that when bad things happen, the system as a whole can withstand them, and the name of your power plant stays out of the news headlines. You need to go looking for problems before they find you. That’s what this book is about.
Chaos engineering is defined as “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” (Principles of Chaos Engineering, http://principlesofchaos.org/). In other words, it’s a software testing method focusing on finding evidence of problems before they are experienced by users.
You want your systems to be reliable (we’ll look into that), and that’s why you work hard to produce good-quality code and good test coverage. Yet, even if your code works as intended, in the real world plenty of things can (and will) go wrong. The list of things that can break is longer than a list of the possible side effects of painkillers: starting with sinister-sounding events like floods and earthquakes, which can take down entire datacenters, through power supply cuts, hardware failures, networking problems, resource starvation, race conditions, unexpected peaks of traffic, complex and unaccounted-for interactions between elements in your system, all the way to the evergreen operator (human) error. And the more sophisticated and complex your system, the more opportunities for problems to appear.
It’s tempting to discard these as rare events, but they just keep happening. In 2019, for example, two crash landings occurred on the surface of the Moon: the Indian Chandrayaan-2 mission (http://mng.bz/Xd7v) and the Israeli Beresheet (http://mng.bz/yYgB), both lost on lunar descent. And remember that even if you do everything right, more often than not, you still depend on other systems, and these systems can fail. For example, Google Cloud,1 Cloudflare, Facebook (WhatsApp), and Apple all had major outages within about a month in the summer of 2019 (http://mng.bz/d42X). If your software ran on Google Cloud or relied on Cloudflare for routing, you were potentially affected. That’s just reality.
It’s a common misconception that chaos engineering is only about randomly breaking things in production. It’s not. Although running experiments in production is a unique part of chaos engineering (more on that later), it’s about much more than that—anything that helps us be confident the system can withstand turbulence. It interfaces with site reliability engineering (SRE), application and systems performance analysis, and other forms of testing. Practicing chaos engineering can help you prepare for failure, and by doing that, learn to build better systems, improve existing ones, and make the world a safer place.

1.2 Motivations for chaos engineering

At the risk of sounding like an infomercial, there are at least three good reasons to implement chaos engineering:
  • Determining risk and cost and setting service-level indicators, objectives, and agreements
  • Testing a system (often complex and distributed) as a whole
  • Finding emergent properties you were unaware of
Let’s take a closer look at these motivations.

1.2.1 Estimating risk and cost, and setting SLIs, SLOs, and SLAs

You want your computer systems to run well, and the subjective definition of what well means depends on the nature of the system and your goals regarding it. Most of the time, the primary motivation for companies is to create profit for the owners and shareholders. The definition of running well will therefore be a derivative of the business model objectives.
Let’s say you’re working on a planet-scale website, called Bookface, for sharing photos of cats and toddlers and checking on your high-school ex. Your business model might be to serve your users targeted ads, in which case you will want to balance the total cost of running the system with the amount of money you can earn from selling these ads. From an engineering perspective, one of the main risks is that the entire site could go down, and you wouldn’t be able to present ads and bring home the revenue. Conversely, not being able to display a particular cat picture in the rare event of a problem with the cat picture server is probably not a deal breaker, and will affect your bottom line in only a small way.
For both of these risks (users can’t use the website, and users can’t access a cat photo momentarily), you can estimate the associated cost, expressed in dollars per unit of time. That cost includes the direct loss of business as well as various other, less tangible things like public image damage, that might be equally important. As a real-life example, Forbes estimated that Amazon lost $66,240 per minute of its website being down in 2013.2
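To make "dollars per unit of time" concrete, here is a minimal sketch of the arithmetic. The function name and the 30-minute outage are hypothetical, chosen only for illustration; the per-minute figure is the Forbes estimate quoted above, and the sketch covers direct loss only, not the less tangible costs such as public image damage.

```python
def downtime_cost(cost_per_minute, minutes_down):
    """Estimate the direct cost of an outage, in dollars.

    Hypothetical helper: multiplies a cost rate by the outage duration.
    """
    if cost_per_minute < 0 or minutes_down < 0:
        raise ValueError("cost_per_minute and minutes_down must be non-negative")
    return cost_per_minute * minutes_down


# Per-minute figure from the Forbes estimate cited in the text.
FORBES_AMAZON_2013_PER_MINUTE = 66_240

# A hypothetical 30-minute outage at that rate.
estimated_loss = downtime_cost(FORBES_AMAZON_2013_PER_MINUTE, 30)
print(f"Estimated direct loss: ${estimated_loss:,}")
```

At that rate, half an hour of downtime comes to roughly two million dollars, which is why it pays to put even a rough number on each risk.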
Now, to quantify these risks, the industry uses service-level indicators (SLIs). In our example, the percentage of time that your users can access the website could be an SLI. And so could the ratio of requests that are successfully served by the cat photos service within a certain time window. The SLIs are there to put a number to an event, and picking the right SLI is important.
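As a rough illustration, the second SLI could be computed along these lines. The `Request` record, its field names, and the 500 ms threshold are all assumptions made for this sketch, as is reading "within a certain time window" as a per-request latency limit; a real system would pull these values from its own logs or metrics.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request to the cat photos service."""
    succeeded: bool      # True if the server returned a success response
    latency_ms: float    # how long the request took to serve


def success_ratio_sli(requests, max_latency_ms):
    """Fraction of requests served successfully within max_latency_ms.

    Returns None when there were no requests, because the ratio is
    undefined without traffic.
    """
    if not requests:
        return None
    good = sum(
        1 for r in requests
        if r.succeeded and r.latency_ms <= max_latency_ms
    )
    return good / len(requests)


# Made-up sample data: two of the four requests count as "good".
sample = [
    Request(succeeded=True, latency_ms=100),
    Request(succeeded=True, latency_ms=900),   # successful but too slow
    Request(succeeded=False, latency_ms=50),   # fast but failed
    Request(succeeded=True, latency_ms=200),
]
print(success_ratio_sli(sample, max_latency_ms=500))
```

Note that a slow success and a fast failure both count against the SLI here; whether that is the right choice depends on what you want the number to tell you, which is part of why picking the SLI matters.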
Two parties agreeing on a certain range of an SLI can form a service-level objective (SLO), a tangible target that the engineering team can work toward. SLOs, in turn, can be legally enforced as a service-level agreement (SLA), in which one party agrees to guarantee a certain SLO or otherwise pay some form of penalty if they fail to do so.
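One way to picture the relationship between an SLI and an SLO is as a simple comparison against an agreed target. The sketch below assumes an SLO expressed as a minimum availability ratio; the function names, the 99.9% target, and the 30-day period are hypothetical values picked for illustration, not figures from the text.

```python
def meets_slo(sli_value, slo_target):
    """Return True if the measured SLI is at or above the agreed target."""
    return sli_value >= slo_target


def allowed_downtime_minutes(slo_target, period_minutes):
    """Downtime the SLO tolerates over a period, for an availability SLI."""
    if not 0 <= slo_target <= 1:
        raise ValueError("slo_target must be a ratio between 0 and 1")
    return (1 - slo_target) * period_minutes


# Hypothetical numbers: a 99.9% availability target over 30 days.
SLO_TARGET = 0.999
PERIOD_MINUTES = 30 * 24 * 60

measured_availability = 0.9995  # made-up measurement
print(meets_slo(measured_availability, SLO_TARGET))
print(round(allowed_downtime_minutes(SLO_TARGET, PERIOD_MINUTES), 1))
```

Turning the target into minutes of tolerated downtime gives the engineering team something tangible to plan around, and an SLA would attach a penalty to exceeding it.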
Going back to our cat- and toddler-photo-sharing website, one possible way to work out the risk, SLI, and SLO could look like this:
  • The main risk is “People can’t access the website,” or simply the downtime
  • A corresponding SLI could be “the ratio of success responses to errors from our servers”
  • An SLO for ...
