A systematic classification of generic methods for reducing technical risk is crucial to risk management, safe operation, engineering design, and software. However, this very important topic has not been covered in sufficient depth in the reliability and risk literature. For many decades, reliability research has focused primarily on identifying risks, risk assessment, and reliability prediction rather than on methods for reliability improvement and risk reduction. The as low as reasonably practicable (ALARP) approach to risk management (Cullen 1990; HSE 1992; Melchers 2001), for example, advocates that risks should be reduced to a level that is as low as reasonably practicable. This is commonly interpreted to mean that risks have to be reduced to a level at which the cost of further risk reduction outweighs the benefits arising from it (HSE 1992; Melchers 2001). While a decision about implementing risk-reducing measures can be taken by means of a cost-benefit analysis, the ALARP approach focuses on whether risk-reducing measures should be implemented or not. There is little clarity on the risk-reducing methods that can be used to achieve the risk reduction.
Reliability improvement and risk reduction also relied for a long time on the feedback provided from reliability testing or on feedback from customers. Once the feedback about a particular failure mode is available, the component is redesigned to strengthen it against that failure mode. The problem with this approach is that the feedback always comes late, after the product has been manufactured. Therefore, all changes consisting of redesign to avoid the discovered failure modes will be costly or impossible. In addition, conducting a reliability testing programme to precipitate failure modes is expensive and adds significant extra cost to the product.
General guidelines on risk management do exist. According to a recent review (Aven 2016), risk management can be summarised in the following steps: (i) establish the purpose of the risk management activity, (ii) identify adverse events, (iii) conduct cause and consequence analysis, (iv) make a judgement about the likelihood of the adverse events and their impact and establish a risk description and characterisation, and (v) carry out risk treatment.
While there is a great deal of agreement about the necessary common steps of risk assessment, there is a profound lack of understanding and insight about the general methods that can be used for reducing risk. The common approach to risk reduction is the domain-specific approach, which relies heavily on root cause analysis and detailed knowledge from the specific domain. Measures specific to a particular domain are selected for reducing the likelihood of failure or the consequences of failure, and the risk reduction is conducted exclusively by experts in the specific domain. Risk reduction is effectively fragmented into risk reduction in numerous specific domains: nuclear industry, aviation, construction industry, food storage and food processing, banking, oil and gas industry, road transportation, railway transportation, marine transportation, financial industry, cyber security, environmental sciences, etc.
As a result, the domain-specific approach to risk reduction has created an illusion: that efficient risk reduction can be delivered successfully solely by using methods offered by the specific domain, without resorting to general methods for risk reduction.
The direct consequence of this illusion is that many industries have been deprived of effective risk-reducing strategies and reliability improvement solutions. The same mistakes are made again and again, resulting in numerous accidents and in inferior products and processes associated with a high risk of failure. Examples of such repeated mistakes are:
- insufficient reliability built in products with very high cost of failure;
- designing components with homogeneous properties where the stresses are clearly not uniform;
- creating systems with vulnerabilities where a single failure causes the collapse of the system;
- redundancy compromised by a common cause.
At the same time, excellent opportunities to improve reliability and reduce risk are constantly missed. Examples of such missed opportunities are:
- failure to increase reliability of systems and components at no extra cost (e.g. by a simple permutation of the same type of components in the system);
- failure to increase the reliability of components and systems by a separation of properties and functions;
- failure to reduce by orders of magnitude the probability of erroneous conclusion from imperfect tests;
- failure to increase by orders of magnitude the fault tolerance of components;
- failure to reduce risk by including deliberate weaknesses.
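The first missed opportunity in the list above (a simple permutation of the same type of components) can be illustrated with a small calculation. The sketch below is not taken from the source: it assumes a hypothetical system of two sections connected in series, each section consisting of two components in parallel, with statistically independent failures. The component reliabilities (0.9 for a new component, 0.6 for an old one) and the helper names `series` and `parallel` are illustrative assumptions only.

```python
import math


def series(*reliabilities):
    """Reliability of independent components in series: all must work."""
    return math.prod(reliabilities)


def parallel(*reliabilities):
    """Reliability of independent components in parallel: at least one must work."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Hypothetical reliabilities of two new and two old components of the same type.
NEW, OLD = 0.9, 0.6

# Arrangement A: both new components in one section, both old ones in the other.
arrangement_a = series(parallel(NEW, NEW), parallel(OLD, OLD))

# Arrangement B: the same four components permuted so that each section
# contains one new and one old component.
arrangement_b = series(parallel(NEW, OLD), parallel(NEW, OLD))

if __name__ == "__main__":
    print(f"Arrangement A: {arrangement_a:.4f}")  # 0.99 * 0.84 = 0.8316
    print(f"Arrangement B: {arrangement_b:.4f}")  # 0.96 * 0.96 = 0.9216
```

Under these assumptions, the same four components give a higher system reliability in arrangement B than in arrangement A, although no component has been replaced or added.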
The weaknesses of the risk management in many specific domains were exposed by a string of costly failures and disasters (e.g. catastrophic oil spills, financial crises, serious industrial accidents, transport accidents, power blackouts, etc.).
In some cases, correct solutions were indeed found by 'reinventing the wheel', after a series of costly and time-consuming trials and errors.
An important contributing reason for this highly undesirable situation is the absence of a framework of domain-independent methods for reliability improvement and risk reduction that could provide vital methodological knowledge to many unrelated domains.
With the exception of a few simple and well-known domain-independent methods such as implementing redundancy, strengthening weak links, upgrading with more reliable components, simplification of components, systems and operations, and condition monitoring, the framework of domain-independent methods for reliability improvement and risk reduction is missing.
Thompson (1999) stressed the importance of effective integration of maintainability and reliability considerations in the design process and the importance of failure mode analysis in design. Thompson (1999) correctly identified that knowledge of the principles of risk is an important aid to achieving good reliability; however, no domain-independent principles for improving reliability and reducing risk have been formulated.
Samuel and Weir (1999) covered problem solving strategies in engineering design and stressed the importance of satisfying design inequalities in defining the domain of acceptable designs. However, no domain-independent methods for improving reliability have been discussed.
French (1999) formulated a number of general principles to be followed in conceptual design, but they were not oriented towards improving reliability and reducing technical risk. General principles to be followed in engineering design have also been discussed in Pahl et al. (2007). Most of the discussed principles, however, are either not related to reducing the risk of failure or are too specific (e.g. the principle of thermal design), with no general validity. Collins (2003) discussed engineering design from a failure prevention perspective. However, no risk-reducing methods and principles with general validity were formulated.
Taguchi's experimental method for robust design through testing (Phadke 1989) achieves designs where the performance characteristics are insensitive to variations of control (design) variables. This method can be considered to be a step towards formulating the domain-independent risk reduction principle of robust design, for which the performance characteristics are insensitive to variations of design parameters.
1.2 The Statistical, Data-Driven Approach
A common approach to reliability improvement is the statistics-based, data-driven approach. This approach relies on critical pieces of data (failure frequencies, load distribution, strength distribution, etc.) in order to make predictions about the reliability of components and systems.
To describe the reliability on demand, which is essentially the probability that strength will exceed load, data covering the variation range of the load and the variation range of the strength are needed. These data are necessary to fit an appropriate model for the strength, an appropriate model for the load, and to estimate the parameters of the models fitting the load distribution and strength distribution. Next, a direct integration of the load-strength interference integral or a Monte Carlo simulation can be used to estimate the probability that, on demand, strength will be greater than the load (Todinov 2016a).
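The Monte Carlo route described above can be sketched in a few lines. This is a minimal illustration rather than the procedure from the cited reference: it assumes that load and strength are independent and normally distributed, and the numerical parameters and function names are hypothetical. For this particular normal-normal case the margin (strength minus load) is itself normal, so a closed-form value is available to check the simulation against.

```python
import math
import random
from statistics import NormalDist


def reliability_on_demand_mc(load_mean, load_sd, strength_mean, strength_sd,
                             trials=200_000, seed=1):
    """Monte Carlo estimate of P(strength > load).

    Assumes independent, normally distributed load and strength.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        load = rng.gauss(load_mean, load_sd)
        strength = rng.gauss(strength_mean, strength_sd)
        if strength > load:
            successes += 1
    return successes / trials


def reliability_on_demand_exact(load_mean, load_sd, strength_mean, strength_sd):
    """Closed form for the normal-normal case.

    The margin (strength - load) is normal with mean equal to the difference
    of the means and variance equal to the sum of the variances.
    """
    margin_mean = strength_mean - load_mean
    margin_sd = math.hypot(load_sd, strength_sd)
    return NormalDist().cdf(margin_mean / margin_sd)


if __name__ == "__main__":
    # Hypothetical parameters (e.g. stresses in MPa), chosen for illustration.
    params = dict(load_mean=300.0, load_sd=30.0,
                  strength_mean=400.0, strength_sd=40.0)
    print("Monte Carlo estimate:", reliability_on_demand_mc(**params))
    print("Closed-form value:   ", reliability_on_demand_exact(**params))
```

With other fitted distributions (for example a Weibull strength model) no closed form is generally available, and only the sampling lines inside the loop would need to change.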
To calculate the time to failure of a system, the time-to-failure models of the components are needed. For each component, from the past times to failure, an appropriate time-to-failure model must be fitted and subsequently used to evaluate the reliability of the system built with the components (Todinov 2016a). However, the time-to-failure models of the components depend strongly on the environmental stresses. For example, increasing temperature accelerates material degradation and shortens the time to failure. Because of this, the time to failure of a seal working at elevated temperatures is significantly shorter than the time to failure of a seal working at room temperature. Reducing temperature also gives rise to dangerous failure modes (e.g. brittle fracture) which reduce the time to failure. The time to failure in the presence of a corrosive environment, high humidity and vibrations is shorter than the time...