Advances in science and technology, and their interdisciplinary integration, are making modern engineering and computing systems increasingly complex. For modern systems (especially those in, e.g. wireless sensor networks, the Internet of Things (IoT), smart power systems, space exploration, and cloud computing), dynamic behavior and dependence are typical characteristics. System load, operating conditions, stress levels, redundancy levels, and other operating environment parameters vary with time, causing dynamic failure behavior of the system components as well as dynamic system reliability requirements. In addition, components of these systems often have significant interactions or dependencies in time or function. The effects of these dynamic and dependent behaviors must be addressed for accurate system reliability modeling and analysis, which is crucial for verifying whether a system satisfies its reliability requirements and for determining optimal design and operation policies that balance system parameters such as cost and reliability. As a result, reliability modeling and analysis of modern dynamic systems is more challenging than ever.
Traditional reliability modeling methods, such as the reliability block diagram [1] and fault tree analysis [2], can define the static logical structure of a system, but they cannot describe its dynamic state transitions or the dependencies and propagation of component faults. It is difficult or impossible to accurately reflect the actual behavior of modern complex fault-tolerant systems using the traditional reliability models. In other words, failing to address the dynamic behavior and dependencies of modern systems leaves the results obtained from traditional reliability models far from the actual system reliability performance, misleading system design, operation, and maintenance efforts.
Different from the traditional static reliability modeling, the dynamic reliability theory considers that a system failure depends not only on the static logical combination of basic component failure events, but also on the timing of the occurrence of the events, correlations or interrelationship of the events, and impacts of operating environments. Therefore, the dynamic system reliability theory can provide a more accurate representation of actual complex system behavior, more effectively guiding the reliable design of real-world critical systems. The dynamic system reliability theory is the evolution and improvement of the traditional reliability modeling theory, and its research will promote the development and application of complex systems engineering.
This book focuses on dynamic reliability modeling of fault-tolerant systems with imperfect fault coverage, functional dependence, deterministic or probabilistic common-cause failures, deterministic or probabilistic competing failures, as well as standby sparing.
Specifically, imperfect fault coverage is an inherent behavior of fault-tolerant systems designed with redundancies and automatic system recovery or reconfiguration mechanisms [3–5]. Just like any system component, the system recovery mechanisms involving fault detection, fault location, fault isolation, and fault recovery will likely not be perfect; they can fail such that the system cannot adequately detect, locate, isolate, or recover from a fault occurring in the system. The uncovered component fault may propagate through the system, causing extensive damage and sometimes the failure of the entire system. Further, it has been observed that in a hierarchical system the extent of the damage from an uncovered component fault may exhibit multiple levels due to the layered recovery [6]. The traditional imperfect fault coverage concept has been extended to the modular imperfect fault coverage to model multiple levels of uncovered failure modes for components in hierarchical systems [7].
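To make the effect of imperfect coverage concrete, the following is a minimal sketch for a hypothetical two-component parallel (1-out-of-2) system. It assumes, for illustration only, that each component fails independently with probability `q`, that each failure is covered with probability `c`, and that any single uncovered failure brings down the whole system. The function name and parameters are illustrative and do not come from the cited works.

```python
def parallel_pair_unreliability(q: float, c: float) -> float:
    """Unreliability of a 1-out-of-2 system under imperfect fault coverage.

    Assumptions (illustrative only):
      - the two components fail independently, each with probability q;
      - a failure is covered (safely handled) with probability c;
      - any uncovered failure causes the entire system to fail.
    """
    both_working = (1.0 - q) ** 2
    # Exactly one component fails, and that failure is covered,
    # so the surviving component keeps the system running.
    one_covered_failure = 2.0 * q * c * (1.0 - q)
    return 1.0 - (both_working + one_covered_failure)


if __name__ == "__main__":
    for coverage in (1.0, 0.99, 0.9):
        print(coverage, parallel_pair_unreliability(q=0.1, c=coverage))
```

With `q = 0.1`, perfect coverage (`c = 1`) gives an unreliability of about 0.01, while `c = 0.9` gives about 0.028. Under these assumptions, even a modest coverage deficiency dominates the benefit of the redundancy.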
Functional dependence occurs in systems where the failure of one component (or, in general, the occurrence of a certain trigger event) causes other components (referred to as dependent components) within the same system to become unusable or inaccessible. A classic example is a computer network where computers can access the Internet through routers [8]. If the router fails, all computers connected to the router become inaccessible. It is said that these computers have functional dependence on the router.
In the case of systems with perfect fault coverage, the functional dependence behavior can be addressed as a logic OR relationship [9]. For systems with imperfect fault coverage, however, the logic OR replacement method can overestimate the system unreliability, because it allows dependent components that have been disconnected by the trigger event to contribute to the system uncovered failure probability if they can fail uncovered. In reality, since these dependent components have been disconnected or isolated, they cannot generate a propagation effect or bring the system down [10]. New algorithms are required for addressing the coupled functional dependence and imperfect fault coverage behavior.
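For the perfect-coverage case, the logic OR treatment of functional dependence can be sketched on the router example. The sketch below is an illustration only: it assumes one router and two computers that fail independently with given probabilities, and it assumes a hypothetical system failure criterion (the network is down when neither computer is reachable). A computer is treated as unavailable when it has failed OR the router has failed.

```python
from itertools import product


def network_unreliability(q_router: float, q_pc1: float, q_pc2: float) -> float:
    """Probability that neither computer is reachable (perfect coverage).

    Functional dependence is modeled by a logic OR: a computer is
    unavailable if it has failed itself or if the router (trigger) has failed.
    The failure criterion and all parameters are illustrative assumptions.
    """
    total = 0.0
    # Enumerate all working/failed combinations of the three components.
    for router_ok, pc1_ok, pc2_ok in product((True, False), repeat=3):
        prob = (
            ((1.0 - q_router) if router_ok else q_router)
            * ((1.0 - q_pc1) if pc1_ok else q_pc1)
            * ((1.0 - q_pc2) if pc2_ok else q_pc2)
        )
        pc1_unavailable = (not router_ok) or (not pc1_ok)  # logic OR
        pc2_unavailable = (not router_ok) or (not pc2_ok)  # logic OR
        if pc1_unavailable and pc2_unavailable:
            total += prob
    return total


if __name__ == "__main__":
    print(network_unreliability(q_router=0.1, q_pc1=0.2, q_pc2=0.2))
```

The enumeration agrees with the closed form `q_router + (1 - q_router) * q_pc1 * q_pc2`. It deliberately ignores uncovered failures, which is exactly the situation in which the OR replacement is no longer accurate.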
In addition to the imperfect fault coverage, common-cause failures are another class of behavior that can contribute significantly to the overall system unreliability [11–13]. Common-cause failures are defined as “A subset of dependent events in which two or more component fault states exist at the same time, or in a short time interval, and are direct results of a shared cause” [11]. Most of the traditional common-cause failure models assumed the deterministic failure of the multiple components affected by the shared root cause. Recent studies extended the concept to model probabilistic common-cause failures, where the occurrence of a root cause results in failures of multiple system components with different probabilities [14–16].
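The difference between deterministic and probabilistic common-cause failures can be illustrated with a small total-probability calculation. The sketch below considers a hypothetical two-component parallel system with a single root cause. It assumes that the components fail independently of each other when the cause is absent, and that, given the cause, each component is failed by it with its own probability, independently of the other. Setting both conditional probabilities to 1 recovers the deterministic case. All names and numbers are illustrative and are not taken from the cited studies.

```python
def parallel_unreliability_with_ccf(
    q1: float, q2: float, p_cause: float, p1: float = 1.0, p2: float = 1.0
) -> float:
    """Unreliability of a two-component parallel system with one root cause.

    q1, q2  : independent failure probabilities of the components
    p_cause : probability that the shared root cause occurs
    p1, p2  : probability that the cause, once it occurs, fails each component
              (p1 = p2 = 1 is the deterministic common-cause model)
    Conditional independence given the cause is an illustrative assumption.
    """
    # Case 1: the root cause does not occur; only independent failures matter.
    no_cause = (1.0 - p_cause) * q1 * q2
    # Case 2: the root cause occurs; a component survives only if it neither
    # fails on its own nor is failed by the cause.
    fail1 = 1.0 - (1.0 - q1) * (1.0 - p1)
    fail2 = 1.0 - (1.0 - q2) * (1.0 - p2)
    with_cause = p_cause * fail1 * fail2
    return no_cause + with_cause


if __name__ == "__main__":
    print(parallel_unreliability_with_ccf(0.1, 0.1, 0.05))            # deterministic
    print(parallel_unreliability_with_ccf(0.1, 0.1, 0.05, 0.5, 0.8))  # probabilistic
```

With these example numbers the deterministic model gives about 0.0595 and the probabilistic model about 0.032, which shows how a deterministic assumption can be pessimistic when the root cause does not always fail every affected component.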
As one type of common-cause failure, a propagated failure with global effect (PFGE) originating from a system component can cause the failure of the entire system [17]. Such a failure can occur due to imperfect fault coverage or to the destructive effect of a component failure on other system components (e.g. overheating or explosion). However, a PFGE may not always cause the overall system failure in systems with functional dependence behavior. Specifically, if the trigger event occurs before the PFGEs of all the dependent components, these PFGEs are isolated deterministically and thus cannot affect other parts of the system. On the other hand, if the PFGE of any dependent component occurs before the trigger event, the failure propagation effect takes place, crashing the entire system. Therefore, there exist competitions in the time domain between the failure isolation and failure propagation effects, causing distinct system statuses [18,19].
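The time-domain competition described above can be sketched for the simplest case of one trigger event and one dependent component. The example assumes, purely for illustration, that the time to the trigger event and the time to the dependent component's PFGE are independent and exponentially distributed with constant rates; the rates, the mission time, and the function name are hypothetical. Under these assumptions, the probabilities of the three mutually exclusive outcomes within a mission time `t` have a simple closed form.

```python
import math


def competing_outcomes(rate_trigger: float, rate_pfge: float, t: float) -> dict:
    """Outcome probabilities for one trigger competing with one PFGE.

    Assumes independent exponential times (an illustrative assumption):
      - 'propagation': the PFGE occurs before the trigger and before time t,
                       so the failure propagates and the system fails;
      - 'isolation'  : the trigger occurs first (before time t), so any later
                       PFGE of the dependent component is isolated;
      - 'neither'    : neither event occurs by time t.
    """
    total_rate = rate_trigger + rate_pfge
    # Probability that at least one of the two events occurs by time t.
    first_event_by_t = 1.0 - math.exp(-total_rate * t)
    return {
        "propagation": rate_pfge / total_rate * first_event_by_t,
        "isolation": rate_trigger / total_rate * first_event_by_t,
        "neither": 1.0 - first_event_by_t,
    }


if __name__ == "__main__":
    result = competing_outcomes(rate_trigger=2e-4, rate_pfge=1e-4, t=1000.0)
    for name, prob in result.items():
        print(name, prob)
```

The split between propagation and isolation depends only on the ratio of the two rates, whereas the chance that the competition is resolved at all within the mission depends on their sum. Systems with several dependent components or non-exponential distributions need the more general methods the text refers to.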
The pioneering works on such competing failures in systems with functional dependence have focused on deterministic competing failures, where the occurrence of the trigger event, as long as it happens first, causes a deterministic (certain) isolation effect on any failures originating from the corresponding dependent components. Recent studies [20,21] have revealed that in some real-world systems, e.g. systems involving relayed wireless communications, the failure isolation effect can be probabilistic or uncertain. Consider a specific example of a relay-assisted wireless sensor network where some sensors pr...