eBook - ePub

Chaos Engineering

Name: Chaos Engineering
Author: Mikolaj Pawlikowski

Mikolaj Pawlikowski

Buch teilen

English
ePUB (handyfreundlich)
Über iOS und Android verfügbar

eBook - ePub

Chaos Engineering

Mikolaj Pawlikowski

Angaben zum Buch

Buchvorschau

Inhaltsverzeichnis

Quellenangaben

Über dieses Buch

Auto engineers test the safety of a car by intentionally crashing it and carefully observing the results. Chaos engineering applies the same principles to software systems. In Chaos Engineering: Site reliability through controlled disruption, you'll learn to run your applications and infrastructure through a series of tests that simulate real-life failures. You'll maximize the benefits of chaos engineering by learning to think like a chaos engineer, and how to design the proper experiments to ensure the reliability of your software. With examples that cover a whole spectrum of software, you'll be ready to run an intensive testing regime on anything from a simple WordPress site to a massive distributed system running on Kubernetes.

Häufig gestellte Fragen

Wie kann ich mein Abo kündigen?

Gehe einfach zum Kontobereich in den Einstellungen und klicke auf „Abo kündigen“ – ganz einfach. Nachdem du gekündigt hast, bleibt deine Mitgliedschaft für den verbleibenden Abozeitraum, den du bereits bezahlt hast, aktiv. Mehr Informationen hier.

(Wie) Kann ich Bücher herunterladen?

Derzeit stehen all unsere auf Mobilgeräte reagierenden ePub-Bücher zum Download über die App zur Verfügung. Die meisten unserer PDFs stehen ebenfalls zum Download bereit; wir arbeiten daran, auch die übrigen PDFs zum Download anzubieten, bei denen dies aktuell noch nicht möglich ist. Weitere Informationen hier.

Welcher Unterschied besteht bei den Preisen zwischen den Aboplänen?

Mit beiden Aboplänen erhältst du vollen Zugang zur Bibliothek und allen Funktionen von Perlego. Die einzigen Unterschiede bestehen im Preis und dem Abozeitraum: Mit dem Jahresabo sparst du auf 12 Monate gerechnet im Vergleich zum Monatsabo rund 30 %.

Was ist Perlego?

Wir sind ein Online-Abodienst für Lehrbücher, bei dem du für weniger als den Preis eines einzelnen Buches pro Monat Zugang zu einer ganzen Online-Bibliothek erhältst. Mit über 1 Million Büchern zu über 1.000 verschiedenen Themen haben wir bestimmt alles, was du brauchst! Weitere Informationen hier.

Unterstützt Perlego Text-zu-Sprache?

Achte auf das Symbol zum Vorlesen in deinem nächsten Buch, um zu sehen, ob du es dir auch anhören kannst. Bei diesem Tool wird dir Text laut vorgelesen, wobei der Text beim Vorlesen auch grafisch hervorgehoben wird. Du kannst das Vorlesen jederzeit anhalten, beschleunigen und verlangsamen. Weitere Informationen hier.

Ist Chaos Engineering als Online-PDF/ePub verfügbar?

Ja, du hast Zugang zu Chaos Engineering von Mikolaj Pawlikowski im PDF- und/oder ePub-Format sowie zu anderen beliebten Büchern aus Informatique & Assurance qualité et tests. Aus unserem Katalog stehen dir über 1 Million Bücher zur Verfügung.

Information

Verlag

Manning Publications

Jahr

2021

ISBN

9781617297755

Thema

Informatique

Thema

Assurance qualité et tests

1 Into the world of chaos engineering

This chapter covers

What chaos engineering is and is not
Motivations for doing chaos engineering
Anatomy of a chaos experiment
A simple example of chaos engineering in practice

What would you do to make absolutely sure the car you’re designing is safe? A typical vehicle today is a real wonder of engineering. A plethora of subsystems, operating everything from rain-detecting wipers to life-saving airbags, all come together to not only go from A to B, but to protect passengers during an accident. Isn’t it moving when your loyal car gives up the ghost to save yours through the strategic use of crumple zones, from which it will never recover?

Because passenger safety is the highest priority, all these parts go through rigorous testing. But even assuming they all work as designed, does that guarantee you’ll survive in a real-world accident? If your business card reads, “New Car Assessment Program,” you demonstrably don’t think so. Presumably, that’s why every new car making it to the market goes through crash tests.

Picture this: a production car, heading at a controlled speed, closely observed with high-speed cameras, in a lifelike scenario: crashing into an obstacle to test the system as a whole. In many ways, chaos engineering is to software systems what crash tests are to the car industry: a deliberate practice of experimentation designed to uncover systemic problems. In this book, you’ll look at the why, when, and how of applying chaos engineering to improve your computer systems. And perhaps, who knows, save some lives in the process. What’s a better place to start than a nuclear power plant?

1.1 What is chaos engineering?

Imagine you’re responsible for designing the software operating a nuclear power plant. Your job description, among other things, is to prevent radioactive fallout. The stakes are high: a failure of your code can produce a disaster leaving people dead and rendering vast lands uninhabitable. You need to be ready for anything, from earthquakes, power cuts, floods, or hardware failures, to terrorist attacks. What do you do?

You hire the best programmers, set in place a rigorous review process, test coverage targets, and walk around the hall reminding everyone that we’re doing serious business here. But “Yes, we have 100% test coverage, Mr. President!” will not fly at the next meeting. You need contingency plans; you need to be able to demonstrate that when bad things happen, the system as a whole can withstand them, and the name of your power plant stays out of the news headlines. You need to go looking for problems before they find you. That’s what this book is about.

Chaos engineering is defined as “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” (Principles of Chaos Engineering, http://principlesofchaos.org/). In other words, it’s a software testing method focusing on finding evidence of problems before they are experienced by users.

You want your systems to be reliable (we’ll look into that), and that’s why you work hard to produce good-quality code and good test coverage. Yet, even if your code works as intended, in the real world plenty of things can (and will) go wrong. The list of things that can break is longer than a list of the possible side effects of painkillers: starting with sinister-sounding events like floods and earthquakes, which can take down entire datacenters, through power supply cuts, hardware failures, networking problems, resource starvation, race conditions, unexpected peaks of traffic, complex and unaccounted-for interactions between elements in your system, all the way to the evergreen operator (human) error. And the more sophisticated and complex your system, the more opportunities for problems to appear.

It’s tempting to discard these as rare events, but they just keep happening. In 2019, for example, two crash landings occurred on the surface of the Moon: the Indian Chandrayaan-2 mission (http://mng.bz/Xd7v) and the Israeli Beresheet (http://mng .bz/yYgB), both lost on lunar descent. And remember that even if you do everything right, more often than not, you still depend on other systems, and these systems can fail. For example, Google Cloud,1 Cloudflare, Facebook (WhatsApp), and Apple all had major outages within about a month in the summer of 2019 (http://mng.bz/ d42X). If your software ran on Google Cloud or relied on Cloudflare for routing, you were potentially affected. That’s just reality.

It’s a common misconception that chaos engineering is only about randomly breaking things in production. It’s not. Although running experiments in production is a unique part of chaos engineering (more on that later), it’s about much more than that—anything that helps us be confident the system can withstand turbulence. It interfaces with site reliability engineering (SRE), application and systems performance analysis, and other forms of testing. Practicing chaos engineering can help you prepare for failure, and by doing that, learn to build better systems, improve existing ones, and make the world a safer place.

1.2 Motivations for chaos engineering

At the risk of sounding like an infomercial, there are at least three good reasons to implement chaos engineering:

Determining risk and cost and setting service-level indicators, objectives, and agreements
Testing a system (often complex and distributed) as a whole
Finding emergent properties you were unaware of

Let’s take a closer look at these motivations.

1.2.1 Estimating risk and cost, and setting SLIs, SLOs, and SLAs

You want your computer systems to run well, and the subjective definition of what well means depends on the nature of the system and your goals regarding it. Most of the time, the primary motivation for companies is to create profit for the owners and shareholders. The definition of running well will therefore be a derivative of the business model objectives.

Let’s say you’re working on a planet-scale website, called Bookface, for sharing photos of cats and toddlers and checking on your high-school ex. Your business model might be to serve your users targeted ads, in which case you will want to balance the total cost of running the system with the amount of money you can earn from selling these ads. From an engineering perspective, one of the main risks is that the entire site could go down, and you wouldn’t be able to present ads and bring home the revenue. Conversely, not being able to display a particular cat picture in the rare event of a problem with the cat picture server is probably not a deal breaker, and will affect your bottom line in only a small way.

For both of these risks (users can’t use the website, and users can’t access a cat photo momentarily), you can estimate the associated cost, expressed in dollars per unit of time. That cost includes the direct loss of business as well as various other, less tangible things like public image damage, that might be equally important. As a real-life example, Forbes estimated that Amazon lost $66,240 per minute of its website being down in 2013.2

Now, to quantify these risks, the industry uses service-level indicators (SLIs). In our example, the percentage of time that your users can access the website could be an SLI. And so could the ratio of requests that are successfully served by the cat photos service within a certain time window. The SLIs are there to put a number to an event, and picking the right SLI is important.

Two parties agreeing on a certain range of an SLI can form a service-level objective (SLO), a tangible target that the engineering team can work toward. SLOs, in turn, can be legally enforced as a service-level agreement (SLA), in which one party agrees to guarantee a certain SLO or otherwise pay some form of penalty if they fail to do so.

Going back to our cat- and toddler-photo-sharing website, one possible way to work out the risk, SLI, and SLO could look like this:

The main risk is “People can’t access the website,” or simply the downtime
A corresponding SLI could be “the ratio of success responses to errors from our servers”
An SLO for ...