CHAPTER 1
Introduction to Fault Tolerance
Like any subject of study, fault tolerance has its own specialized language. This chapter introduces these terms.
The focus of this book is on “Fault Tolerance” in general, and in particular on things that can be done during the design of software to support fault tolerant operation. A system, whether software alone or a combination of hardware and software, is fault tolerant if it is able to operate even though some part is no longer performing correctly. Thus the focus of this book is on the software structures and mechanisms that can be designed into a system to enable its continued operation, even though some other part isn’t working correctly. This book describes practices to improve the reliability and availability of software systems. These practices are currently in use in a variety of software application domains.
The next few sections define the vocabulary needed to discuss fault tolerance.
Fault -> Error -> Failure
The terms fault, error and failure have very specific meanings.
A system failure occurs when the delivered service no longer complies with the specification, the latter being an agreed description of the system’s expected function and/or service. An error is that part of the system state that is liable to lead to subsequent failure; an error affecting the service is an indication that a failure occurs or has occurred. The adjudged or hypothesized cause of an error is a fault. [Lap91, p. 4]
Every fault tolerant system composed of software and hardware must have a specification that describes what it means for that system to operate without failure. The system’s specification defines its expected behavior, such as being available 99.999% of the time. When the system doesn’t behave in the manner specified in its requirements, it has failed. The term failure refers to system behavior that does not conform to the system’s specification.
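To make an availability figure such as 99.999% concrete, it can help to convert it into a downtime budget. The following is a minimal sketch, not part of any particular system: the function names are hypothetical, and it assumes a 365-day year and that availability is measured over a full year.

```python
# Sketch under stated assumptions: a 365-day year, availability measured per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


def has_failed(observed_downtime_minutes: float, availability_percent: float) -> bool:
    """The system fails its specification when downtime exceeds the budget."""
    return observed_downtime_minutes > allowed_downtime_minutes(availability_percent)


# 99.999% availability leaves a little over five minutes of downtime per year.
print(round(allowed_downtime_minutes(99.999), 2))  # 5.26
```

Under these assumptions, a system specified at 99.999% availability that is out of service for ten minutes in a year has failed, while the same outage would not be a failure against a 99.99% specification (a budget of roughly 52.6 minutes).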
These are examples of failures: the system crashes to a stop when it shouldn’t, the system computes an incorrect result, the system is not available for service, the system is unable to respond to user interaction. Whenever the system does the wrong thing it has failed.
Failures are detected by the observers and users of the system.
Failures are dependent upon the requirements and the definition of agreed-upon correct operation of the system. If there is not a specification of what the system should do, there cannot be a failure.
Failures are caused by errors.
An error is the incorrect system behavior from which a failure may occur. Errors can be categorized into two types: timing and value. Errors that manifest as value errors might be incorrect discrete values or incorrect system state. Timing errors can include total non-performance (the response time is effectively infinite).
Some common examples of errors include:
Timing or Race conditions: communicating processes get out of synchronization and a race for resources occurs.
Infinite Loops: continuous execution of a tight loop without pausing and without acknowledging the requests of others for shared resources.
Protocol Error: errors in the messaging stream because of non-conformance with the protocol in use. Unexpected messages sent to other parts of the system, messages sent at inappropriate times, or out of sequence.
Data inconsistency: Data may be different between two locations, for example memory and disk, or between different elements in a network.
Failure to Handle Overload conditions: the system is unable to handle the workload.
Wild Transfer or Wild Write: because of a fault in the system, data is written to an incorrect location in memory, or control is transferred to an incorrect location.
Any of these example errors could be failures if they deviate from the system’s specification.
Errors are important when talking about fault tolerant systems because errors can be detected before they become failures. Errors are the manifestation of faults, and errors are the way that we can look into the system to discover if faults are present.
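The idea that an error can be detected before it becomes a failure can be sketched with a simple check on both error types, value and timing. Everything here is an illustrative assumption: the `Reading` type, the valid range, and the deadline are hypothetical stand-ins for limits that would come from a real system's specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    value: float      # the computed result
    elapsed_s: float  # how long the computation took, in seconds


# Hypothetical limits; in practice these come from the system's specification.
VALID_RANGE = (0.0, 100.0)
DEADLINE_S = 0.5


def detect_errors(reading: Reading) -> List[str]:
    """Return the error types found in a reading; an empty list means none."""
    errors = []
    low, high = VALID_RANGE
    if not (low <= reading.value <= high):
        errors.append("value")   # incorrect value or state
    if reading.elapsed_s > DEADLINE_S:
        errors.append("timing")  # result arrived too late
    return errors


print(detect_errors(Reading(value=150.0, elapsed_s=0.1)))  # ['value']
```

A caller that runs such a check before delivering the result has the chance to act on the error, for example by discarding the reading, so that the user never observes incorrect service.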
A fault is the defect that is present in the system that can cause an error. It is the actual deviation from correctness. In a computer program it is the misplaced comma or period, or the missing break statement in a C++ switch statement. Colloquially the fault is often called a “bug”, but that word will not appear elsewhere in this book.
The fault might be a latent software defect, or it might be a garbled message received on a communications channel, or a variety of other things. In general, neither the software nor the observers are aware of the presence of a fault until an error occurs.
A number of causes lead to the introduction of a fault into software. These include:
Incorrect Requirement Specification: Sometimes the software designers and coders were told to build the wrong thing.
Incorrect Designs: Translating system requirements into a working software design is a complicated process that sometimes results in incorrect designs. The design might not be workable from a pure software standpoint, or it might not be an accurate translation of the requirements. In either case it is faulty.
Coding Errors: Translating the design into working code can also introduce faults into the system. A compiler, interpreter, or code examination tool can catch some faults, but others produce syntactically correct code that simply does not perform the specified task.
Faults are present in every system. When a fault is lying dormant and not causing any mischief it is said to be latent. When circumstances arise in which the latent fault causes something incorrect to happen, it is said to become active. A fault’s activation results in an error.
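The difference between a latent and an active fault can be shown with a small, deliberately faulty function. This is only an illustrative sketch: the function names are hypothetical, and the fault (a leap-year rule with the century exception left out) was chosen because it stays dormant for most inputs.

```python
import calendar


def days_in_february(year: int) -> int:
    # FAULT (deliberate, for illustration): the century rule is missing.
    # Years divisible by 100 but not by 400 are not leap years.
    return 29 if year % 4 == 0 else 28


def fault_is_active(year: int) -> bool:
    """True when the fault produces an incorrect value (an error) for this year."""
    expected = 29 if calendar.isleap(year) else 28
    return days_in_february(year) != expected


# The fault is latent for these inputs: the results happen to be correct.
print(fault_is_active(2023), fault_is_active(2024), fault_is_active(2000))
# The fault becomes active for 2100, which is not a leap year.
print(fault_is_active(2100))
```

The defect is present in the code the whole time. It only produces an error when a year such as 1900 or 2100 is supplied, which is why neither the software nor its observers may be aware of it for a long time.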
Examples of Fault -> Error -> Failure
To help make these very important definitions clear, here are a few examples.
A misrouted telephone call is an example of a failure. Telephone system requirements specify that calls should be delivered to the correct recipient. When a faulty system prevents them from being delivered correctly, the system has failed. In this case the fault might have been incorrect call routing data stored in the system. The error occurs when the incorrect data is accessed and an incorrect network path is computed with that incorrect data.
A robotic arm used to drill a part in a manufacturing environment provides another example. Consider the fault of a misplaced decimal point in a data constant that is used in the computation of the rotation of the robot’s arm. The data constant might be the number of steps required to rotate the robotic arm one degree. The error might be that it rotates in the wrong direction because of the erroneous computation made with the faulty decimal point. The arm fails by lowering its drill at the wrong location.
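The robotic arm example can be traced through the fault, error, and failure chain in a short sketch. All names and numbers are hypothetical: the intended constant is assumed to be 200 steps per degree, the misplaced decimal point turns it into 20, and the specification is assumed to allow half a degree of positioning tolerance. In this simplified model the error shows up as a wrong step count rather than a wrong direction.

```python
# Hypothetical constants, chosen only for illustration.
INTENDED_STEPS_PER_DEGREE = 200.0  # what the designer meant to write
FAULTY_STEPS_PER_DEGREE = 20.00    # FAULT: the decimal point is misplaced


def steps_for_rotation(degrees: float, steps_per_degree: float) -> int:
    """Number of motor steps commanded for a requested rotation."""
    return round(degrees * steps_per_degree)


def angle_reached(degrees: float, steps_per_degree: float) -> float:
    """Angle the arm actually reaches; the motor needs 200 steps per degree."""
    steps = steps_for_rotation(degrees, steps_per_degree)
    return steps / INTENDED_STEPS_PER_DEGREE


def drill_misplaced(requested: float, reached: float, tolerance: float = 0.5) -> bool:
    """FAILURE: the hole is drilled outside the assumed tolerance."""
    return abs(requested - reached) > tolerance


# ERROR: the faulty constant yields 1800 steps instead of 18000.
reached = angle_reached(90.0, FAULTY_STEPS_PER_DEGREE)
print(reached, drill_misplaced(90.0, reached))  # 9.0 True
```

The fault sits in the constant, the error is the incorrect step count computed from it, and the failure is the drill coming down at 9 degrees when the specification called for 90.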
The preparation of an incorrect bill for service is another example of a failure. The system requirements specify that the customer will be accurately charged for service received. A faulty identification received in a message by a billing system can result in the charges being erroneously applied to the wrong account. The fault in this case might have been in the communications channel (a garbled message), or in the system component that prepares the message for transmission. The error was applying the charges to the wrong account. The fact that the customer receives an incorrect charge is the failure, since they agreed with the carrier to pay for the service that they used and not for unused service.
Consider a spacecraft that is given an updated set of program instructions by the Earth station controlling it. An error occurs because someone designing the update incorrectly computed the memory range to be updated. The new program was written to this incorrect range, which corrupted another part of the programming. The corrupted instructions caused the spacecraft’s antenna to point away from Earth, breaking off communications between Earth and the spacecraft, which led to the mission being considered a failure. The initial fault was the computation of the incorrect memory range.
Banking systems fail when they do not safeguard funds. An example of failure is when a bank’s automatic teller machine (ATM) dispenses too much cash to a customer. Several errors might lead to this failure. One error is that the machine counted out more bills than it should have. In this case the fault might...