Practical Guide to IT Problem Management

Name: Practical Guide to IT Problem Management
ISBN: 9781000586626

Andrew Dixon

88 pages
English

About this book

Some IT organisations seem to expend all their energy firefighting – dealing with incidents as they arise and fixing, or patching over, the breakage. In organisations like this, restarting computers is seen as a standard method to resolve many issues. Perhaps the best way to identify whether an organisation understands problem management is to ask what they do after they have restarted the computer. If restarting the computer fixes the issue, it is very tempting to say that the incident is over and the job is done. Problem management recognises that things do not improve if such an approach is taken. Such organisations are essentially spending their time running to stay in the same place.

Written to help IT organisations move forward, Practical Guide to IT Problem Management presents a combination of methodologies including understanding timelines and failure modes, drill down, 5 whys and divide and conquer. The book also presents an exploration of complexity theory and how automation can assist in the desire to shift left both the complexity of the problem and who can resolve it. The book emphasises that establishing the root cause of a problem is not the end of the process as the resolution options need to be evaluated and then prioritised alongside other improvements. It also explores the role of problem boards and checklists as well as the relationship between problem management and Lean thinking. This practical guide provides both a framework for tackling problems and a toolbox from which to select the right methodology once the type of problem being faced has been identified. In addition to reactive methods, it presents proactive activities designed to reduce the incidence of problems or to reduce their impact and complexity should they arise.

Solving problems is often a combination of common sense and methodologies which may either be learnt the hard way or may be taught. This practical guide shows how to use problem solving tools and to understand how and when to apply them while upskilling IT staff and improving IT problem solving processes.

Information

Publisher

Auerbach Publications

Chapter 1 Getting Your Priorities Right

DOI: 10.1201/9781003119975-2

This book is not designed to help you pass an exam in problem management, although it may help you set up a problem management process within your organisation (Chapter 10 looks at formal processes). Above and beyond that, this book looks at the bigger picture of how problem management adds value to an organisation and why it is important.

Take, for example, the Apollo 13 moon mission. This is an oft-quoted example because of the famous expression

Okay, Houston, we’ve had a problem here.¹

We could, at this point, concern ourselves with the nature of a problem. Rather, I want to initially focus on what was important at that moment as the three astronauts and Mission Control tried to understand what was happening. They needed to know the impact of what had happened. They did not need to know what had happened, or why it had happened. For some time, they did not need to know what they would do about it – that came later. The first question in problem management (and indeed in the related discipline of Major Incident Management) is:

What Is the Impact?

Let us consider the event from the view of the astronauts (Table 1.1):

Table 1.1 Status in the initial minutes²
Incident:	‘pretty large bang’
Readings:	Main B Bus undervolt
	Oxygen tank 2 was empty and tank 1’s pressure slowly falling
	The computer on the spacecraft had reset
	The high-gain antenna was not working
Observations:	‘a gas of some sort’ venting into space
	The volume surrounding the spacecraft was filled with myriad small bits of debris from the accident

In Chapter 5 we will explore how to ask the right questions at this point. The impact was assessed as:

Oxygen is required for power, heating and breathing. Oxygen has been lost and is still being lost. Without sufficient oxygen, the astronauts will die.

Normally, in such a situation, they would have used the Service Module’s main engine to return to Earth, but they determined that there was a significant risk that it had been damaged in the explosion.

Workarounds

In problem management, a solution which addresses the immediate impact without addressing the underlying causes is known as a workaround. In the case of Apollo 13, they realised that they had a spare source of oxygen (the Lunar Module) and a spare source of propulsion (the gravity of the moon). The astronauts lived in the Lunar Module for the next four days whilst the spacecraft travelled to the moon and back, using the Lunar Module’s propulsion system to guide the whole craft. This preserved the resources in the Command Module, so that it could be used for re-entry. The astronauts survived and were hailed as heroes, as were the staff of Mission Control who had assessed the impact and provided the workaround.

ITIL 4 defines an incident to be

An unplanned interruption to a service or reduction in the quality of service.³

The explosion and subsequent readings and observations amounted to a serious incident.

ITIL 4 defines a problem to be

A cause, or potential cause, of one or more incidents.⁴

The incident was over once the mission was over and the astronauts were safe. The tank which had exploded was somewhere in space – so it couldn’t be repaired.

The problem remained. Before another Apollo mission could take place, they needed to understand what had happened and how they could remove or reduce the risk of it happening again. This is called root cause analysis.
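The distinction between an incident and a problem can be sketched as a small data model. This is only an illustration under stated assumptions, not part of ITIL or of any tool: the class names, the fields and the rule that a problem cannot be closed until a root cause has been recorded are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    CLOSED = "closed"


@dataclass
class Problem:
    """A cause, or potential cause, of one or more incidents."""
    description: str
    root_cause: Optional[str] = None  # unknown until root cause analysis is done
    status: Status = Status.OPEN

    def record_root_cause(self, cause: str) -> None:
        self.root_cause = cause

    def close(self) -> None:
        # Assumed rule for this sketch: no closure without a root cause.
        if self.root_cause is None:
            raise ValueError("root cause not yet established")
        self.status = Status.CLOSED


@dataclass
class Incident:
    """An unplanned interruption to a service or reduction in its quality."""
    description: str
    impact: str
    problem: Problem
    workaround: Optional[str] = None
    status: Status = Status.OPEN

    def apply_workaround(self, workaround: str) -> None:
        self.workaround = workaround

    def close(self) -> None:
        # Closing the incident deliberately leaves the linked problem untouched.
        self.status = Status.CLOSED


# Usage: the incident ends with the mission, but the problem stays open.
problem = Problem("Cause of the oxygen tank 2 explosion")
incident = Incident(
    description="'pretty large bang'",
    impact="Oxygen has been lost and is still being lost",
    problem=problem,
)
incident.apply_workaround("Live in the Lunar Module and use its propulsion system")
incident.close()
print(incident.status.value, problem.status.value)
```

Keeping the two records separate is the point of the sketch: the workaround closes the incident, while the problem stays open until the root cause analysis has been completed.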

Note that, although evidence was gathered, the analysis itself did not need to be done until after the workaround had brought the astronauts home. In any problem, mitigating the impact is the first priority. Sometimes, this can only be done by identifying the root cause and addressing it. However, that is not always the case, and deciding which applies is a judgement call.

The review board determined that Oxygen Tank 2 was faulty before the mission and that activating a fan within the tank produced an electric arc, which led to the fire and explosion.⁵ There were a number of contributing factors. The tank was later redesigned to remove the risk from all of the contributing factors. Performing the review was critical to the success of later Apollo missions, any one of which could have ended in disaster if the root cause analysis had not been done correctly.

The root cause analysis identified both a sequence of events which led to the accident and a design fault:

Tank 2 was originally in Apollo 10, but was removed to fix a fault. It was dropped when it was removed.
There were thermostats which were designed to operate at 28 volts, but were powered with 65 volts – they failed to operate correctly.
The temperature gauge was only rated up to 29° Celsius (84° Fahrenheit), so failed to detect the failed thermostats.
During testing, tank 2 needed to be emptied and the drain system didn’t work, so they boiled off the oxygen. Without the functioning thermostats, temperatures may have reached 540° Celsius (1004° Fahrenheit).
The high temperatures appear to have damaged the Teflon insulation.

Tests on similarly configured tanks produced telemetry readings which were in accord with the telemetry readings captured during Apollo 13’s flight, which gave the investigators confidence that this is what had happened.

Preventing Problems

Problem management does not occur in a vacuum. When I trained to do First Aid at Work, one of the things I was taught was that it was better to avoid an accident than to pick up the pieces afterwards. If I saw a trip hazard, I could remove it or wait until someone tripped and then administer first aid. If I saw a drawing pin on the floor, then I could pick it up and put it back on the noticeboard, or I could treat someone with a drawing pin in the foot.

The cost of the Apollo series of missions is estimated at $25.4 billion, so it can be argued that this mission cost in excess of $1 billion and failed to achieve its primary objective of reaching the moon. The mistakes which led up to this were, therefore, very expensive mistakes.

The thermostatic switches used in Oxygen Tank 2 should have been replaced when the operating specifications were changed.

When the tank was dropped, it should have been fully tested in an end-to-end lifecycle test.

Oxygen Tank 2 was filled during a countdown demonstration test. When it could not be emptied using the correct procedure, a workaround of boiling off the oxygen (which would normally be stored in liquid form) was applied.

At each point, if a different decision had been taken, this disaster might not have happened and a $1 billion mission might not have failed.

Problem management exists in the context of providing an end-to-end service and needs to operate alongside enterprise architecture, continual improvement and risk management.

Workarounds should not be used to pass the problem further down the line. If the drain pipe did not work, this should have indicated that there was a more serious underlying issue. Just removing the oxygen ignored the issue.

A No Blame Culture

There is no suggestion in this case that people covered up a story, but it is good practice in problem management, and in its sister discipline of major incident management, to operate a no-blame culture. People make mistakes. This is part of human nature. We all make mistakes. If someone is doing their job and makes a mistake, there should be no blame attributed to them. Clearly, if they wilfully avoid safety rules or if they persistently fail to follow process, then that is a different situation. However, it is not helpful to blame someone for a genuine mistake. The first reason is that people will conceal information if they believe that they will get the blame: valuable time will be lost trying to gather data which people could provide but which would incriminate them. The second reason is that tomorrow we all have to work together.

I once accidentally deleted an entire web site. I had double-checked that I was only deleting a backup copy of it, but still managed to delete the live site. If I had pretended that it wasn’t me, it would have taken ages to conduct fault diagnosis in order to understand what had happened. Because I immediately owned up to it, we recovered 90% of the site in under an hour and the complete site by the close of the day.

The Apollo 13 mission is a useful case study because it was a complex problem with a number of causes which could have been avoided. Once the incident had occurred, the immediate need was for a workaround, which was successfully applied. Afterwards a full analysis identified the root causes, which could then be addressed. Above all, it is an example of a team which worked well together under pressure and were clear as to what their priorities were.

Summ...

Cover Page
Half Title Page
Series Page
Title Page
Copyright Page
Contents
Biography
Introduction
Chapter 1 Getting Your Priorities Right
Chapter 2 Timelines
Chapter 3 Failure Modes
Chapter 4 Complexity Theory
Chapter 5 Automation and Artificial Intelligence
Chapter 6 Drill Down
Chapter 7 Divide and Conquer
Chapter 8 Cause and Effect
Chapter 9 Resolution Evaluation Methods
Chapter 10 ITIL Problem Management
Chapter 11 Problem Boards and Problem Records
Chapter 12 The Drive for Efficiency
Chapter 13 Applying the Principles to the World Outside of IT
Chapter 14 Using Checklists
Conclusion
Appendix A Glossary
Appendix B Sample Checklists
Index
