1.1 Introduction
Ripple-Down Rules (RDR) are intended for problems where there is insufficient data for machine learning and suitable data would be too costly to obtain. At the same time, RDR avoids the major problems that arise in building systems by acquiring knowledge from domain experts in the conventional way. There are various types of RDR, and this book presents three of them. Although RDR is a knowledge acquisition method, taking its knowledge from people, it is perhaps closer to machine learning than to conventional knowledge engineering.
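To give a concrete flavour of the idea before it is developed properly in later chapters, a single-classification RDR knowledge base can be pictured as a binary tree of rules, where each rule added to correct an error becomes an exception to the rule that gave the wrong conclusion. The sketch below is illustrative only; the attribute names and thresholds are invented for this example and are not taken from any system described in this book.

```python
# Minimal sketch of single-classification Ripple-Down Rules (illustrative only).
# Each rule has an "if_true" branch (an exception, tried when the rule fires)
# and an "if_false" branch (an alternative, tried when it does not); the
# conclusion of the last rule that fired is the one given.

class Rule:
    def __init__(self, condition, conclusion, if_true=None, if_false=None):
        self.condition = condition    # predicate over a case (a dict)
        self.conclusion = conclusion  # conclusion if this rule fires
        self.if_true = if_true        # exception rule, tried when fired
        self.if_false = if_false      # alternative rule, tried when not fired

    def classify(self, case, default=None):
        if self.condition(case):
            # Rule fires: its conclusion stands unless an exception also fires.
            refined = self.if_true.classify(case, None) if self.if_true else None
            return refined if refined is not None else self.conclusion
        # Rule does not fire: try the alternative branch.
        return self.if_false.classify(case, default) if self.if_false else default


# A toy knowledge base: a general rule with one exception added later
# to correct its conclusion for patients already on treatment.
kb = Rule(lambda c: c["tsh"] > 4.0, "hypothyroid",
          if_true=Rule(lambda c: c["on_thyroxine"], "monitor treatment"))

print(kb.classify({"tsh": 6.1, "on_thyroxine": False}))  # -> hypothyroid
print(kb.classify({"tsh": 6.1, "on_thyroxine": True}))   # -> monitor treatment
print(kb.classify({"tsh": 1.2, "on_thyroxine": False}))  # -> None (no rule fired)
```

The key property this structure gives is that a correction is local: adding an exception changes the conclusion only for cases reaching that exact point in the tree, which is what makes incremental maintenance by an expert practical.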
In the 1970s and early 80s there were huge expectations about what could be achieved with expert or knowledge-based systems based on acquiring knowledge from domain experts. Despite the considerable achievements with expert systems, they turned out to be much more difficult to build than expected, resulting in disillusionment and a major downturn in the new AI industry. We are now in a new phase of huge expectations about machine learning, particularly deep learning. A 2018 Deloitte survey found that 63% of the companies surveyed were using machine learning in their businesses, with 50% using deep learning (Loucks, Davenport, and Schatsky 2018). The same survey in 2019 shows a small increase in the use of machine learning over 2018, but also that 97% of respondents plan to use machine learning, and 95% deep learning, in the next year (Ammanath, Hupfer, and Jarvis 2019). Machine learning can appear to be a magical new technology, whereas in fact its history goes back to the early 1950s, when the first neural network machine was developed based on ideas from the 1940s. The first convolutional neural networks, a major form of deep learning, were developed in the late 1970s.

Although this is the first book on Ripple-Down Rules, RDR also has some history. An RDR approach to address the maintenance challenges with GARVAN-ES1, a medical expert system, was first proposed in 1988 (Compton and Jansen 1988), only three years after GARVAN-ES1 was first reported (Horn et al. 1985) and two years after GARVAN-ES1 was reported as one of the first four medical expert systems to go into clinical use (Buchanan 1986). The reason for an RDR book now is to present a fall-back technology as industry becomes increasingly aware of the challenges in providing data good enough for machine learning to produce the systems it wants. We will first look briefly at the limitations and problems of machine learning and knowledge acquisition.
1.2 Machine Learning
Despite the extraordinary results that machine learning has produced, a key issue is whether there is sufficient reliably labelled data to learn the concepts required. Even in this era of big data, providing adequate, appropriate data is not straightforward. Take medicine as an example: a 2019 investigation into machine learning methods for medical diagnosis identified 17 benchmark datasets (Jha et al. 2019). Each of these has at most a few hundred cases and a few classes, with the largest dataset having 24 classes. This sort of data does not represent the precision of human clinical decision making. We will later discuss knowledge bases in Chemical Pathology which are used to provide expert pathologist advice to clinicians on interpreting patient results. Some of these knowledge bases provide hundreds, and some even thousands, of different conclusions. Perhaps Jha et al.'s investigation did not uncover all the datasets available, but machine learning would fall far short of being able to provide hundreds of different classifications from the datasets they did identify.
Hospitals receive funding largely based on the discharge codes assigned to patients. A major review of previous studies of discharge coding accuracy found the median accuracy to be 83.2% (Burns et al. 2011). More recent studies in more specialised, and probably more difficult, domains show even lower accuracy (Ewings, Konofaos, and Wallace 2017; Korb et al. 2016). Chavis provides an informal discussion of the problems with accurate coding (Chavis 2010). No doubt discharge coding has its difficulties, but given that hospital funding relies on it, and hospitals are therefore motivated to get it right, it appears unlikely that databases accurate enough for machine learning on more challenging problems will be available any time soon.
At the other end of the scale we have had all sorts of extraordinary claims about how IBM's Watson was going to transform medicine by being able to learn from all published medical findings. The ultimate claim was that, given the massive amount of information in medical journal articles and implicit in other data, machine learning should be able to extract the knowledge implicit in these data resources. Despite Watson's success playing Jeopardy!, this has not really translated to medicine (Strickland 2019). For example, in a study of colon cancer treatment advice in Korea, Watson's recommended treatment agreed with the multi-disciplinary team's primary treatment recommendation only 41.5% of the time, although it agreed on treatments that could be considered 87.7% of the time (Choi et al. 2019). It was suggested that the discordance in the recommended treatments was because of different medical circumstances between the Korean Gachon Gil Medical Centre and the Sloan Kettering Cancer Centre. This further highlights a central challenge for machine learning: that what is learned is totally dependent on the quality and relevance of the data available. There is also the question of how much human effort goes into developing a machine learning system. In IBM's collaboration with WellPoint Inc., 14,700 hours of nurse-clinician training were used as well as massive amounts of data (Doyle-Lindrud 2015). The details of what this training involved are not available, but 6–7 person-years of effort is a very large investment on top of the machine learning involved. This collaboration led to the lung cancer program at the Sloan Kettering Cancer Centre using Watson; however, recent reports of this application indicate that for the system used at Sloan Kettering, Watson was in fact trained on only hundreds of synthetic cases developed by one or two doctors, and its recommendations were biased because of this training (Bennett 2018).
Data on the time taken to develop these synthetic cases does not seem to be available. If one scans the Watson medical literature, the majority of the publications are about the potential of the approach rather than about results. There is no doubt that the Watson approach has huge potential and will eventually achieve great things, but it is also clear that the results so far have depended on a lot more than just applying learning to data, and have a long way to go to match expert human judgement.
This central issue of data quality was identified in IBM's 2012 Global Technology Outlook Report (IBM Research 2012), which named "Managing Uncertain Data at Scale" as a key challenge for analytics and learning. A particular issue is the accuracy of the label or classification applied to the data, as the discharge coding example above shows. If a label attached to a case is produced automatically, it is likely to be produced consistently, and the dataset is likely to be highly useful. For example, if data on the actual outcome of an industrial process is available as well as data from sensors used in the process, then the data should be excellent for learning. In fact, one of the earliest successful applications of machine learning was for a Westinghouse fuel sintering process where a decision tree algorithm discovered the parameter settings to produce better pellets, boosting Westinghouse's 1984 income by over $10M per year (Langley and Simon 1995). Apparently, the system outperformed engineers in predicting problems. The ideal application for machine learning is not only when there is a large number of cases, but where the label or classification attached to each case is independent of human judgement; e.g. the borrower did actually default on their loan repayment, regardless of any human assessment.
Human biases in making judgements are well known (Kahneman, Slovic, and Tversky 1982), but we are also inconsistent in applying labels to data. In a project where expert doctors had to assess the likelihood of kickback payments from pathology companies to GPs, the experts tended to be somewhat inconsistent in their estimates of the likelihood of a kickback. However, when they were constrained to identify differentiating features to justify a different judgement about a case, they became more consistent (Wang et al. 1996). It so happened that RDR were used to ensure they selected differentiating features, but the point for the discussion here is simply that it is difficult for people to be consistent in subtle judgements. As will be discussed, human judgement is always made in a context and may vary with the context.
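The constraint described above can be sketched as a simple check: before giving a new case a judgement different from that of an earlier, similar case, the expert must cite at least one feature whose value actually differs between the two. This is an illustrative sketch only, not the system from Wang et al. (1996); the field names are invented for the example.

```python
# Illustrative sketch of the "differentiating features" constraint
# (invented field names; not the actual system from Wang et al. 1996).

def differentiating_features(new_case, stored_case):
    """Features on which the new case differs from a previously judged case."""
    return {k for k in new_case if new_case.get(k) != stored_case.get(k)}

def justification_ok(new_case, stored_case, cited_features):
    """The expert's cited features must be a non-empty subset of the real differences."""
    diffs = differentiating_features(new_case, stored_case)
    return bool(cited_features) and set(cited_features) <= diffs


stored_case = {"referrals_per_week": 40, "single_provider": True}
new_case = {"referrals_per_week": 40, "single_provider": False}

print(justification_ok(new_case, stored_case, ["single_provider"]))     # True
print(justification_ok(new_case, stored_case, ["referrals_per_week"]))  # False
```

Forcing the justification to rest on a genuine difference is what pushes experts towards consistency: two cases that cannot be differentiated cannot be given different judgements.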
One approach to human labelling of large datasets is crowdsourcing, but careful attention has to be paid to quality. Workers can have different levels of expertise, or may be careless or even malicious, so there have to be ways of aggregating their answers to minimise errors (Li et al. 2016). But clearly crowdsourcing has great value in areas such as labelling and annotating images, and when deep learning techniques are applied to such datasets extremely good results are achieved, which could not be achieved any other way.
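The simplest such aggregation scheme is majority voting over the workers' labels for each item; this sketch is only the baseline idea, and Li et al. (2016) survey far more sophisticated methods that weight workers by estimated reliability.

```python
# Baseline crowd-label aggregation: majority vote per item (illustrative only).
from collections import Counter

def majority_label(labels):
    """Return the most common label among workers' answers for one item."""
    (label, _count), = Counter(labels).most_common(1)
    return label


votes = {"img_001": ["cat", "cat", "dog"],
         "img_002": ["dog", "dog", "dog", "cat"]}
print({item: majority_label(answers) for item, answers in votes.items()})
# -> {'img_001': 'cat', 'img_002': 'dog'}
```

Majority voting works well when errors are independent and workers are mostly competent; the aggregation methods in the literature exist precisely because those assumptions often fail in practice.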
An underlying question is: how much data, and what sort of data, is needed for a machine learning algorithm to do as well as human judgement? An IBM project on data cleansing for Indian street address data provides an example of this issue (Dani et al. 2010). Apparently, it is very difficult to get a clean dataset for Indian street addresses. The methods used in this study were RDR, a decision tree learner, a conditional random field learner and a commercial system. The version of RDR used, with different knowledge bases for each of the address fields and the use of dictionaries, was more sophisticated (or rather, more specialised) than the RDR systems described in this book. The systems developed were trained on Mumbai data and then tested on other Mumbai data and on data from all of India.
As seen in Table 1.1 all the methods, except for the commercial system, performed comparably when tested on data similar to the data on which they were trained, with a precision of 75–80%. However, when tested on data from all of India, although all methods degrade, the RDR method degrades much less than the statistical methods. The issue is not so much the use of RDR, but that a method using human knowledge, based o...