A comprehensive guide with basic to advanced SRE practices and hands-on examples.
Key Features
Demonstrates how to execute site reliability engineering along with fundamental concepts.
Illustrates real-world examples and successful techniques to put SRE into production.
Introduces you to DevOps, advanced techniques of SRE, and popular tools in use.
Description
Hands-on Site Reliability Engineering (SRE) brings you a tailor-made guide to learn and practice the essential activities for the smooth functioning of enterprise systems, from the design and deployment of enterprise software to its scalable use with efficiency and reliability.
The book explores the fundamentals of SRE and the related terms, concepts, and techniques that are used by SRE teams and experts. It discusses the essential elements of an IT system, including microservices, application architectures, types of software deployment, and concepts like load balancing. It explains the best techniques for delivering timely software releases using containerization and a CI/CD pipeline. It covers how to track and monitor application performance using Grafana, Prometheus, and Kibana, along with how to extend monitoring more effectively by building full-stack observability into the system.
The book also talks about chaos engineering, types of system failures, design for high availability, DevSecOps, and AIOps.
What you will learn
Learn the best techniques and practices for building and running reliable software.
Explore observability and popular methods for effective monitoring of applications.
Work with SLIs, SLOs, Error Budgets, and Error Budget Policies to manage failures.
Learn to practice continuous software delivery using blue/green and canary deployments.
Who this book is for
This book caters to experienced IT professionals, application developers, software engineers, and all those who are looking to develop SRE capabilities at the individual or team level.
Table of Contents
1. Understand the World of IT
2. Introduction to DevOps
3. Introduction to SRE
4. Identify and Eliminate Toil
5. Release Engineering
6. Incident Management
7. IT Monitoring
8. Observability
9. Key SRE KPIs: SLAs, SLOs, SLIs, and Error Budgets
10. Chaos Engineering
11. DevSecOps and AIOps
12. Culture of Site Reliability Engineering
About the Authors
Shamayel M. Farooqui is a technology leader who specializes in driving digital transformation for organizations and is the author of 'Enterprise DevOps Framework - Transforming IT Operations'. He has expertise in implementing IT security, cloud migrations, and IT automation, and a proven track record of building teams of skilled site reliability engineers focused on delivering solutions for optimizing and running hybrid, multi-cloud environments. He thrives on building creative solutions to solve complex IT problems and has mastered the art of building reusable automation for driving efficiency in IT/business processes and cloud management.
Blog links: http://www.shamayelfarooqui.com, https://www.xfgeek.com/home
LinkedIn Profile: https://www.linkedin.com/in/shamayel/
Vishnu Vardhan Chikoti has diverse experience in the areas of Application and Database design and development, Micro-services & Micro-frontends, DevOps, Site Reliability Engineering, and Machine Learning. With the ability to conduct deep analysis, strong execution skills, and an innovative mindset, he has successfully led R&D teams to build engineering solutions to improve the reliability of applications. He is also an expert in building high-volume transaction processing applications for middle and back-office functions for Investment Banks using a variety of architectures.
LinkedIn Profile: https://www.linkedin.com/in/vishnu-vardhan-chikoti-3763262/
In today’s world, software-powered systems and service digitization have reached highly evolved states. Almost every business is a software business, or at least has a major segment of its revenue driven through software and digitization. Have you ever thought about what it takes to build these digital systems, and who the people are that make them available to us?
Writing code is not the only requirement for providing a software service that can be consumed by its intended users. It is equally important that this code can be packaged and hosted on a stable, efficient, and secure platform which is available almost 100% of the time. By the way, there is a reason that we say “almost” 100%, and during the course of this book, you will learn why. The people who are responsible for bringing software to its end users are known as IT professionals. In this chapter, we will focus on this aspect of IT and understand the roles and responsibilities of IT teams. We will also talk about how security is relevant in these practices.
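To get a feel for why the word “almost” matters, the following minimal Python sketch converts an availability target into the downtime it permits per year. The targets used (99%, 99.9%, 99.99%) are assumed illustrations rather than figures from this chapter, and the function name is hypothetical.

```python
# Illustrative only: the availability targets below are assumed examples,
# not figures taken from the book.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime (in minutes) that a target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 100.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% available -> {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.9% target leaves roughly 525.6 minutes (about 8.76 hours) of downtime per year, while a 100% target leaves none at all.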
Structure
In this chapter, we will discuss the following topics:
What is the role of IT in an organization?
Understanding the IT organization structure
Role of infrastructure teams
Role of application teams
IT Security
Change management team
The TCP/IP protocol suite
Domain Name System (DNS)
Objective
This chapter will help you in developing a sense of the critical role that the IT function of an organization performs. You will learn about the diversity of the roles within IT and what the focus of each function is.
Apart from ensuring that the software reaches its users, there is another critical role that IT teams perform: providing end user services, which means ensuring that all the employees of an organization have what they need to function properly. These services generally cover the client systems (desktops or laptops), phone services, networking services, and a few others. This is not an area of focus in this book and will not be dealt with in depth beyond being mentioned here.
Please note that if you are a working professional who already has an understanding of the IT world, feel free to skip to Chapter 2.
What is the role of IT in an organization?
It takes quite a lot of work to get a software application from the point where it is developed to the point where it reaches its audience. To provide a software service to its users, an IT team has to execute a series of tasks performed by multiple humans and systems and governed by many processes and standards. These tasks fall into the following areas.
Hardware availability
This refers to the end-to-end lifecycle management of all the hardware needed to host the applications, workloads, and services that are required by the various teams in the organization. This hardware includes the computer systems or servers, storage, networking components, appliances, and communication systems.
In IT, there are typically infrastructure teams who are responsible for the hardware management of the organization. These teams mainly consist of system and network admins/engineers. A few members of these teams have specialized skills in certain areas like storage, virtualization, firewalls, routing, and so on. These teams are also responsible for ensuring that the configuration and design of the hardware architecture are set up to support the DR (Disaster Recovery) and HA (High Availability) requirements of the organization.
Core software services
Software is needed in order to utilize the hardware efficiently. It is also needed to execute a number of business and operations-related processes like security, virtualization, end user services, HR, finance, and so on. Software life cycle management is the IT team’s responsibility, and it includes licensing, testing, procurement, deployment, patching, and, in some cases, troubleshooting any issues that may arise. The software services are managed in partnership between the infrastructure teams, application teams, and the vendor management teams.
Compliance and security
This refers to ensuring that the organization complies with the industry standards that apply to it and with the standards that it has adopted based on its operating area (for example, banking, healthcare, or the auto industry). Ensuring this compliance is a critical responsibility of the IT team.
Also, the IT team is accountable for securing the assets, data, and services of the organization. When it comes to security, all teams in the organization are accountable in some way or the other while the ownership usually lies with a central cyber security function within IT. You will learn more about this in a later segment of this chapter.
Application development and hosting
For an organization to function properly, many different applications are needed. Some of these critical applications are the ERP systems, CRM applications, communication systems, and HR applications. In some cases, these applications are developed internally by the IT teams, while in most cases external software is procured. The IT teams are responsible for the development, procurement, implementation, and maintenance of these internal-facing applications that are required to support the business of the organization.
As an example, let us consider a scenario where an organization is required to maintain an inventory of all the assets that it owns. This is a common requirement in many organizations and the need for such a service could have arisen due to varied reasons which could be compliance-related or operations-related.
In order to deliver a solution for this requirement, the application team within IT may decide either to procure a third-party solution or to build an application internally. In either scenario, multiple hosting environments (production and non-production) are needed to host the end solution. If the organization decides to build the application internally, an application development platform that enables all the steps of the SDLC (software development life cycle) is also needed. All these tasks are the responsibility of the teams within IT.
Another example of an application is a business application like a trade booking system. Brokerage houses have applications that are used by their customers to book buy/sell trades. This type of application is usually designed and developed with a user interface, backend services, and databases. While the user interface can be a web application or a mobile application running on a user’s device, the required web servers and other services/databases are hosted and maintained by the IT team. Different options to host these servers, services, and databases are provided further in this chapter.
Enterprise Architecture (EA)
Different organizations have different views on the role and responsibility of the EA function. In most cases, the EA is an advisory function that is focused on ensuring that the architecture of any new application introduced in the environment meets the required standards. The core job of this function is to drive the adoption of new technology by conducting POCs (proofs of concept) and software evaluations, and to provide reference architectures in the form of templates to the various teams.
As a part of its IT strategy, an organization may make certain choices with respect to its preferred technology vendors like cloud providers, database engines, software programming languages, and so on. The EA team plays a critical role in this decision by providing guidance on these selections, and is also responsible for ensuring that the adoption of these technologies and frameworks happens across the teams in a proper fashion.
The Enterprise Architecture team assesses an application design on a number of areas before it can be considered production-ready. Some of these areas are security, scalability, elasticity, resiliency, availability, performance, latency, failover, architecture patterns used, databases, and so on. The architecture patterns considered during the review of a software design can be around microservices design, master-slave, client-server, cloud-ready, and loose coupling. The EA review is a gate that is established during a software development lifecycle and can be critical to the durability and effectiveness of an application.
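The idea of the EA review acting as a gate can be sketched in a few lines of Python. This is only an illustration: the class and method names are hypothetical, the list uses a subset of the areas named above, and treating every listed area as mandatory is an assumption of the sketch rather than a rule stated in this chapter.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A subset of the review areas named in the text; which areas are mandatory
# is an assumption made for this sketch.
REVIEW_AREAS = ["security", "scalability", "resiliency", "availability", "performance"]


@dataclass
class DesignReview:
    application: str
    # Maps a review area to True (passed) or False (failed).
    results: Dict[str, bool] = field(default_factory=dict)

    def open_items(self) -> List[str]:
        """Areas that failed or have not been assessed yet."""
        return [area for area in REVIEW_AREAS if not self.results.get(area, False)]

    def production_ready(self) -> bool:
        """The gate opens only when every area has passed."""
        return not self.open_items()


if __name__ == "__main__":
    review = DesignReview("asset-inventory", {"security": True, "scalability": True})
    print(review.open_items())
    print(review.production_ready())
```

In this sketch an area that has not been assessed counts the same as one that failed, so a design cannot pass the gate simply because a review step was skipped.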
Software delivery
The application delivery process usually consists of multiple steps that include building, packaging, testing, deploying, and monitoring an application. From the moment that the software code is written and pushed to the code repository, it becomes the responsibility of the release/operations team, which has to deploy it all the way to the production environment by following a series of steps along the way. More details on these steps will be shared during the course of this book.
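The ordering of these steps can be illustrated with a minimal Python sketch. The stage names follow the steps listed above, but the stage functions are hypothetical placeholders, and stopping at the first failed stage is an assumption about how such a process is commonly run, not a description of any particular tool.

```python
from typing import Callable, List, Tuple

# Stage names follow the steps listed in the text; the stage functions are
# hypothetical placeholders that only report success or failure.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for name, action in stages:
        if not action():
            print(f"stage '{name}' failed; stopping the release")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    stages: List[Stage] = [
        ("build", lambda: True),
        ("package", lambda: True),
        ("test", lambda: False),  # simulate a failing test run
        ("deploy", lambda: True),
        ("monitor", lambda: True),
    ]
    print(run_pipeline(stages))
```

Because the simulated test stage fails, the deploy and monitor stages never run, which mirrors the idea that code reaches production only after passing every earlier step.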
Understanding the IT organization structure
Understanding the IT organization structure helps in getting a better idea of the division of responsibility between the different teams in IT. As mentioned previously in this chapter, IT has many different responsibilities, which need the support of dedicated teams for execution. In an organization, the IT teams are spearheaded by the CIO. The CIO is typically the decision-maker when it comes to the structure of the teams in the IT organization.
Application management, software and hardware management, and implementing security are typically the three core areas around which an IT organization structure is formed. One such example is as follows:
Figure 1.1
Figure 1.1 provides a generic view of how an IT team is usually structured. The hierarchy of the structure may differ between organizations, depending on what works best for them. Also, in addition to the technical teams mentioned, there are usually a few other teams, such as the PMO (project management office), VMO (vendor management office), and risk management, which complete this structure.
Role of infrastructure teams
Infrastructure teams in an organization are responsible for setting up the relevant infrastructure for running various software applications in the organization. They also procure and maintain any vendor software that will be required by the software applications. These software applications can be the business applications that support the business or internal applications for teams like finance and HR. For the purpose of this book, we will focus on the business applications that are used by clients of the organizations and internal business operations.
In the modern world, there are a number of options from which to choose the right infrastructure for the organization. Applications can be run on on-premise virtual machines, Platform as a Service (PaaS) platforms or on the infrastructure provided by the cloud providers. It is common for large organizations to have a hybrid model where a few of the applications run on one type of infrastructure and some others on a different type of infrastructure.
To understand the different types of infrastructure, it is important to first understand the three main concepts. These are as follows:
Data centers
Data centers are physical locations/premises where the physical hardware/servers are located. When organizations decide to use their own physical servers to host applications, they set up their own data centers in their premises. This is what the word “on-premise” refers to. These organizations require additional resources to maintain the data center in terms of security, server maintenance, and so on. There are also other challenges like space constraints in case there is a need for more physical servers as the business grows.
To avoid the need to maintain their own data centers, organizations these days are opting to use the infrastructure from cloud providers like Amazon (AWS), Microsoft (Azure), or Google (GCP). In this case, the data centers reside on the cloud provider's premises. The responsibility for the maintenance and security of the servers resides with the cloud provider.
Virtualization
Virtualization refers to the technology that is used to create virtual machines on top of physical servers. The virtualizati...