Part One
The Socio-technical Context of System Health Management
Charles D. Mott
Complete Data Management, USA
Part One provides an overview of system health management (SHM), its basic theory and concepts, and its relationship to individual and social factors that both enable and constrain its development, usage, and effectiveness.
The goal of SHM is to improve system dependability, which is the characteristic of a system that causes it to operate as intended for a defined period of time. As such, SHM is a branch of engineering, which is the process used to create useful objects or processes within a set of given requirements and constraints. Engineers design, analyze, build, and operate systems using science and mathematics to reach an acceptable (preferably, the optimal) solution. To build any but the simplest of objects, engineers work in one or more groups, in which they must communicate and cooperate with each other and with non-engineers to create the system. The system in turn is often operated by non-engineers, whose needs must be taken into account by the engineers to design a system that serves the requirements of its users. The skills and knowledge of the people, the structure of the organization, and the larger society in which they operate all have considerable effects on the system's final form. This part discusses and highlights how these non-technical processes affect system dependability.
This part starts with the assumptions, concepts, and terminology of SHM theory. This theory makes clear how communication and knowledge sharing are embedded in technology, identifying the primary source of faults as cognitive and communication failures. It also shows that SHM extends systems theory and control theory into the realm of faults and failures.
The importance of communication and its role in introducing faults into systems is the subject of Chapter 2. Without communication among users, designers, builders, and operators, the system cannot be built. Communication is essential to elucidating system requirements and constraints.
Chapter 3 describes high-reliability organizations. Organizations provide resources, training, and education, and an environment in which systems are created. Organizations can enhance or hinder the communication process.
Within SHM design and dependable system operation, organizations and individuals communicate and develop knowledge, thus making knowledge management a key aspect of dependable system design. Chapter 4 describes the relationship between knowledge management and SHM, most significantly how knowledge management systems are essentially communication management systems.
Chapter 5 concludes this part by reviewing the business and economic realities that enable or hinder SHM design. Without an understanding of the costs and benefits of health management systems, those systems may not be fully utilized, and the dependability of the overall system may suffer.
Chapter 1
The Theory of System Health Management
Stephen B. Johnson
NASA Marshall Space Flight Center and University of Colorado at Colorado Springs, USA
Overview
This chapter provides an overview of system health management (SHM), and a theoretical framework for SHM that is used throughout the book. SHM includes design and manufacturing techniques as well as operational and managerial methods, and it also involves organizational, communicative, and cognitive features of humans as social beings and as individuals. The chapter will discuss why all of these elements, from the technical to the cognitive and social, are necessary to build dependable human–machine systems. The chapter defines key terms and concepts for SHM, outlines a functional framework and architecture for SHM operations, describes the processes needed to implement SHM in the system lifecycle, and provides a theoretical framework to understand the relationship between the different aspects of the discipline. It then derives from these and the social and cognitive bases some design and operational principles for SHM.
1.1 Introduction
System health management (SHM) is defined as the capabilities of a system that preserve the system's ability to function as intended.1 An equivalent, but much wordier, description is "the capability of the system to contain, prevent, detect, diagnose, respond to, and recover from conditions that may interfere with nominal system operations." SHM includes the actions to design, analyze, verify, validate, and operate these system capabilities. It brings together a number of previously separate activities and techniques, all of which separately addressed specific, narrower problems associated with assuring successful system operation. These historically have included analytical methods, technologies, design and manufacturing processes, verification and validation issues, and operational methods. However, SHM is not a purely technical endeavor, because failures largely originate in the organizational, communicative, and cognitive features of humans as social beings and as individuals.
SHM is intimately linked to the concept of dependability, which refers to the ability of a system to function as intended, and thus SHM refers to the capabilities that provide dependability.2 Dependability subsumes or overlaps with other "ilities" such as reliability, maintainability, safety, integrity, and other related terms. Dependability includes quantitative and qualitative features, design as well as operations, prevention as well as mitigation of failures. Psychologically, human trust in a system requires a system to consistently perform according to human intentions. Only then is it perceived as "dependable." The engineering discipline that provides dependability we shall call "dependability engineering." When applied to an application, dependability engineering then creates SHM system capabilities. This text could easily have been called Dependability Engineering: With Aerospace Applications. The relationship of dependability engineering to SHM is much like that of aerospace engineering to its application domain, in that there is no "aerospace subsystem," but rather a set of system capabilities designed by aerospace engineers, such as aerodynamic capabilities of lift and drag, mission plans and profiles, and then the coordination of many other subsystems to control the aircraft's dynamics, temperatures, electrical power, avionics, etc. SHM is the name of all the "dependability capabilities" which are embedded in a host of other subsystems.
Within the National Aeronautics and Space Administration (NASA), a recent alternative term to SHM is "fault management" (FM), which is defined as "the operational capability of a system to contain, prevent, detect, diagnose, respond to, and recover from conditions that may interfere with nominal mission operations." FM addresses what to do when a system becomes "unhealthy." To use a medical analogy, FM is equivalent to a patient going to the doctor once the patient is sick, whereas SHM also includes methods to prevent sickness, such as exercise and improved diet, which boost the immune system (improve design margins against failure). For the purposes of this book, FM will be considered the operational aspect of SHM. SHM includes non-operational mechanisms to preserve intended function, such as design margins and quality assurance, as well as operational mechanisms such as fault tolerance and prognostics.
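The relationship between SHM and FM described above can be sketched as a simple subset relation. The following is a minimal illustration only, not a model taken from the book: the names `Kind`, `Mechanism`, `SHM_MECHANISMS`, `FM_FUNCTIONS`, and `fault_management` are hypothetical, and the only content drawn from the text is the four example mechanisms and the six operational functions in the FM definition.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    """Whether a mechanism acts before or during system operation (illustrative labels)."""
    NON_OPERATIONAL = "non-operational"
    OPERATIONAL = "operational"


@dataclass(frozen=True)
class Mechanism:
    """A hypothetical record for one SHM mechanism."""
    name: str
    kind: Kind


# The example mechanisms named in the text, grouped as the text groups them.
SHM_MECHANISMS = (
    Mechanism("design margins", Kind.NON_OPERATIONAL),
    Mechanism("quality assurance", Kind.NON_OPERATIONAL),
    Mechanism("fault tolerance", Kind.OPERATIONAL),
    Mechanism("prognostics", Kind.OPERATIONAL),
)

# The operational functions listed in the quoted FM definition.
FM_FUNCTIONS = ("contain", "prevent", "detect", "diagnose", "respond to", "recover from")


def fault_management(mechanisms):
    """Return the operational subset of SHM, which this book treats as FM."""
    return tuple(m for m in mechanisms if m.kind is Kind.OPERATIONAL)


if __name__ == "__main__":
    for m in fault_management(SHM_MECHANISMS):
        print(m.name)
```

Under these assumptions, every FM mechanism is also an SHM mechanism, but not the reverse: design margins and quality assurance belong to SHM without being part of FM.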
Major events in the evolution of SHM are given in Table 1.1.
Table 1.1 Major events in the development of SHM
| Decade | Events |
| --- | --- |
| 1950s | Quality control; Reliability analysis, failure modes and effects analysis (FMEA); Environmental testing; Systems engineering |
| 1960s | Fault tree analysis, hazards analysis; Integrated system test and "search for weaknesses" test; Hardware redundancy |
| 1970s | Reliability-centered maintenance; Software FMEA and software reliability analysis; Redundancy management, on-board fault protection; Early built-in test (primarily push-to-test or go/no-go testing) |
| 1980s | Byzantine fault theory (1982); Software fault tree analysis, directed graphs; DoD integrated diagnostics; Boeing 757/767 maintenance control and display panel (mid-1980s); NASA and DoD subsystem and vehicle health monitoring (late 1980s); Aerospace Corporation Dependability Working Group (late 1980s); ARINC-604 Guidance for Design and Use of Built-In Test Equipment (1988); Principles of Diagnostics Workshop (1988); Total quality management; Boeing 747-400 central maintenance computer |
| 1990s | Condition-based maintenance; System Health Management Design Methodology (1992); Dependability: Basic Concepts and Terminology (1992); ARINC-624 Design Guidance for Onboard Maintenance System (1993); Boeing 777 onboard maintenance system (1995); (Integrated) system health management; Directed graphs applied to International Space Station; Operational SHM control loop concept (1995); SHM diagnostics technologies, sensor technologies, prognostics; Bi-directional safety analysis, probabilistic risk assessment |
| 2000s | Columbia Accident Investigation Board Report (2003); Air Force Research Laboratory ISHM Conference established (2004); Integrated System Health Engineering and Management Forum (2005); American Institute of Aeronautics and Astronautics Infotech Conference (2005); NASA Constellation SHM–Fault Management (FM) Methodology; NASA Science Mission Directorate FM Workshop (2008); Prognostics and Health Management Conference (2008); Control System and Function Preservation Framework (2008); International Journal of Prognostics and Health Management (2009); Prognostics and Health Management Society established (2009); Constellation FM team established (2009); NASA Fault Management Handbook writing begins (2010); NASA FM Community of Practice (2010); SHM: With Aerospace Applications published (2011) |
The recognition that the many different techniques and technologies shown in Table 1.1 are intimately related and should be integrated has been growing over time. Statistical and quality control methods evolved in World War II to handle the logistics of the massive deployment of technological systems. The extreme environmental and operational conditions of aviation and space drove the creation of systems engineering, reliability, failure modes analysis, and testing methods in the 1950s and 1960s. As aerospace system complexity increased, the opportunity for failures to occur through a variety of causal factors also increased: inadequate design, manufacturing faults, operational mistakes, and unplanned events. This led in the 1970s to the creation of new methods to monitor and respond to system failures, such as the on-board mechanisms for deep-space fault protection on the Voyager project and the Space Shuttle's redundancy management capabilities. By the 1970s and 1980s these technologies and growing system complexity led to the development of formal theory for fault-tolerant computing (Byzantine fault theory), software failure modes and fault tree analyses, diagnostic methods, including directed graphs, and eventually to methods to predict future failures (prognostics). Total quality management, which was in vogue in the late 1980s and early 1990s, was a process-based approach to improve reliability, while software engineers created more sophisticated techniques to detect and test for software design flaws. By the early 2000s, and in particular in response to the Columbia accident of 2003, NASA and the DoD recognized that failures often resulted from a variety of cultural problems within the organizations responsible for operating complex systems, and hence that failure was not a purely technical problem.
The term "system health management" evolved from the phrase "vehicle health monitoring (VHM)," which within the NASA research community in the early 1990s referred to proper selection and use of sensors and software to monitor the health of space vehicles. Engineers soon found the VHM concept deficient in two ways. First, merely monitoring was insufficient, as the point of monitoring was to take action. The word "management" soon substituted for "monitoring" to refer to this more active practice. Second, given that vehicles are merely one aspect of the complex human–machine systems, the term "system" soon replaced "vehicle," such that by the mid-1990s, "system health management" became the most common phrase used to deal with the subject. By the mid-1990s, SHM became "integrated SHM" (ISHM) within some parts of NASA, which highlighted the relatively unexplored system implementation issues, instead of classical subsystem concerns.
In the 1980s, the DoD had created a set of processes dealing with operational maintenance issues under the title "Integrated Diagnostics." The DoD's term referred to the operational issues in detecting failures, determining the location of the underlying faults, and repairing or replacing the failed components. Because failure symptoms frequently manifested themselves in components that were not the source of the original fault, determining the failure source required "integrated" diagnostics that looked at symptoms across the entire vehicle. By the mid-1990s the DoD was promoting a more general concept of condition-based maintenance (as opposed to schedule-based maintenance), leading to th...