This book is not designed to help you pass an exam in problem management, although it may help you set up a problem management process within your organisation (Chapter 10 looks at formal processes). Above and beyond that, this book looks at the bigger picture of how problem management adds value to an organisation and why it is important.
Take, for example, the Apollo 13 moon mission. This is an oft-quoted example because of the famous expression:
Okay, Houston, we’ve had a problem here.1
We could, at this point, concern ourselves with the nature of a problem. Rather, I want to initially focus on what was important at that moment as the three astronauts and Mission Control tried to understand what was happening. They needed to know the impact of what had happened. They did not need to know what had happened, or why it had happened. For some time, they did not need to know what they would do about it – that came later. The first question in problem management (and indeed in the related discipline of Major Incident Management) is:
What Is the Impact?
Let us consider the event from the view of the astronauts (Table 1.1):
Table 1.1 Status in the initial minutes2

| Incident: | ‘pretty large bang’ |
| Readings: | Main B Bus undervolt; Oxygen tank 2 was empty and tank 1’s pressure slowly falling; the computer on the spacecraft had reset; the high-gain antenna was not working |
| Observations: | ‘a gas of some sort’ venting into space; the volume surrounding the spacecraft was filled with myriad small bits of debris from the accident |
In Chapter 5 we will explore how to ask the right questions at this point. The impact was assessed as:
Oxygen is required for power, heating and breathing. Oxygen has been, and is still being, lost. Without sufficient oxygen, the astronauts will die.
Normally, in such a situation, they would have used the Service Module’s main engine to return to Earth, but they determined that there was a significant risk that it had been damaged in the explosion.
Workarounds
In problem management, a solution which addresses the immediate impact without addressing the underlying causes is known as a workaround. In the case of Apollo 13, they realised that they had a spare source of oxygen (the Lunar Module) and a spare source of propulsion (the gravity of the moon). The astronauts lived in the Lunar Module for the next four days whilst the spacecraft travelled to the moon and back, using the Lunar Module’s propulsion system to guide the whole craft. This preserved the resources in the Command Module, so that it could be used for re-entry. The astronauts survived and were hailed as heroes, as were the staff of Mission Control who had assessed the impact and provided the workaround.
ITIL 4 defines an incident as:
An unplanned interruption to a service or reduction in the quality of service.3
The explosion and subsequent readings and observations amounted to a serious incident.
ITIL 4 defines a problem as:
A cause, or potential cause, of one or more incidents.4
The incident was over once the mission was over and the astronauts were safe. The tank which had exploded was somewhere in space – so it couldn’t be repaired.
The problem remained. Before another Apollo mission could take place, they needed to understand what had happened and how they could remove or reduce the risk of it happening again. This is called root cause analysis.
Note that although evidence was gathered at the time, this analysis did not need to be carried out until after the workaround had brought the astronauts home. In any problem, mitigating the impact is the first priority. Sometimes, this can only be done by identifying the root cause and addressing it. However, that is not always the case, and deciding which applies is a judgement call.
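The relationship between incident, workaround, problem and root cause can be sketched as a small data model. This is an illustrative sketch only and not an ITIL artefact: the class names, the field names and the `next_action` helper are assumptions made for this example. The rule it encodes is simply the priority described above, namely to mitigate the impact first and analyse the cause afterwards.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """An unplanned interruption to a service or reduction in its quality."""
    description: str
    impact: str
    resolved: bool = False


@dataclass
class Problem:
    """A cause, or potential cause, of one or more incidents."""
    summary: str
    incidents: List[Incident] = field(default_factory=list)
    workaround: Optional[str] = None  # addresses the impact, not the cause
    root_cause: Optional[str] = None  # filled in by root cause analysis


def next_action(problem: Problem) -> str:
    """Hypothetical helper: suggest the next step, impact first."""
    open_incidents = [i for i in problem.incidents if not i.resolved]
    if open_incidents and problem.workaround is None:
        return "assess impact and find a workaround"
    if open_incidents:
        return "apply workaround"
    if problem.root_cause is None:
        return "perform root cause analysis"
    return "remove or reduce the risk of recurrence"


# Apollo 13, expressed in this (assumed) model
explosion = Incident(
    description="pretty large bang",
    impact="oxygen is being lost",
)
tank = Problem(summary="Oxygen Tank 2 failure", incidents=[explosion])
print(next_action(tank))

tank.workaround = "live in the Lunar Module and swing around the moon"
print(next_action(tank))

explosion.resolved = True  # the astronauts are safely home
print(next_action(tank))
```

The helper deliberately ignores the case where the impact can only be mitigated by finding the root cause. That remains a judgement call for the people involved, and a simple rule like this one cannot make it for them.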
The review board determined that Oxygen Tank 2 was faulty before the mission and that activating a fan within the tank caused an electric arc which caused the fire and explosion.5 There were a number of contributing factors. The tank was later redesigned to remove the risk from all of the contributing factors. Performing the review was critical to the success of later Apollo missions – any one of which could have ended in disaster if the root cause analysis had not been done correctly.
The root cause analysis identified both a sequence of events which led to the accident and a design fault:
- Tank 2 was originally in Apollo 10, but was removed to fix a fault. It was dropped when it was removed.
- There were thermostats which were designed to operate at 28 volts, but were powered with 65 volts – they failed to operate correctly.
- The temperature gauge was only rated up to 29° Celsius (84° Fahrenheit), so it could not show the overheating that the failed thermostats allowed.
- During testing, tank 2 needed to be emptied and the drain system didn’t work, so they boiled off the oxygen. Without the functioning thermostats, temperatures may have reached 540° Celsius (1004° Fahrenheit).
- The high temperatures appear to have damaged the Teflon insulation on the wiring inside the tank, leaving it liable to arc when the fan was activated.
Tests on similarly configured tanks produced telemetry readings which were in accord with the telemetry readings captured during Apollo 13’s flight, which gave the investigators confidence that this is what had happened.
Preventing Problems
Problem management does not occur in a vacuum. When I trained to do First Aid at Work, one of the things I was taught was that it was better to avoid an accident than to pick up the pieces afterwards. If I saw a trip hazard, I could remove it or wait until someone tripped and then administer first aid. If I saw a drawing pin on the floor, then I could pick it up and put it back on the noticeboard, or I could treat someone with a drawing pin in the foot.
The cost of the Apollo series of missions is estimated at $25.4 billion. Averaged across the missions in the series, it can be argued that this mission cost in excess of $1 billion, and it failed to achieve its primary objective of reaching the moon. The mistakes which led up to this were, therefore, very expensive mistakes.
The thermostatic switches used in Oxygen Tank 2 should have been replaced when the operating specifications were changed.
When the tank was dropped, it should have been fully tested in an end-to-end lifecycle test.
Oxygen Tank 2 was filled during a countdown demonstration test. When it could not be emptied using the correct procedure, a workaround was applied: boiling off the oxygen (which would normally be stored in liquid form).
At each point, if a different decision had been taken, this disaster might not have happened and a $1 billion mission might not have failed.
Problem management exists in the context of providing an end-to-end service and needs to operate alongside enterprise architecture, continual improvement and risk management.
Workarounds should not be used to pass the problem further down the line. If the drain pipe did not work, this should have indicated that there was a more serious underlying issue. Just removing the oxygen ignored that issue.
A No Blame Culture
There is no suggestion in this case that people covered up a story, but it is good practice in problem management, and in its sister discipline of major incident management, to operate a no blame culture. People make mistakes; it is part of human nature. If someone is doing their job and makes a mistake, there should be no blame attributed to them. Clearly, if they wilfully avoid safety rules or if they persistently fail to follow process, then that is a different situation. However, it is not helpful to blame someone for a genuine mistake. The first reason is that people will conceal information if they believe that they will get the blame – valuable time will be lost trying to gather data which people could provide but which would incriminate them. The second reason is that tomorrow we all have to work together. I once accidentally deleted an entire web site. I had double-checked that I was only deleting a backup copy of it, but still managed to delete the live site. If I had pretended that it wasn’t me, it would have taken ages to conduct fault diagnosis in order to understand what had happened. Because I immediately owned up to it, we recovered 90% of the site in under an hour and the complete site by the close of the day.
The Apollo 13 mission is a useful case study because it was a complex problem with a number of causes which could have been avoided. Once the incident had occurred, the immediate need was for a workaround, which was successfully applied. Afterwards, a full analysis identified the root causes, which could then be addressed. Above all, it is an example of a team which worked well together under pressure and was clear about its priorities.