PART I. THEORY, DATA, AND MEASURES
The three chapters of Part I describe: (a) the origins of signal detection theory and ROC analysis in statistics and engineering and the relation of these concepts to historical concepts in psychophysics and psychology; (b) experimental data in the form of empirical ROCs that support signal detection theory and ROC analysis in psychology and diagnostics; and (c) the implications of those ROC data both for psychological theory and for the several measures of discrimination performance that have been used in psychology and in diagnostic fields.
Chapter 1 describes the relevant psychophysical theory beginning with Gustav Fechner in 1860. It acknowledges Louis Leon Thurstone’s 1920s conception of the two stimulus categories to be distinguished as leading to two overlapping (bell-shaped) distributions on an observation variable. In Thurstone’s theory, the two stimuli are symmetrical as far as distinguishing between them is concerned, and so a criterion is set on the observation variable where their distributions cross one another. This chapter goes on to show how H. Richard Blackwell in the 1950s extended the conception of the overlapping distributions from Thurstone’s consideration of the “paired-comparison,” or recognition, task (which Blackwell termed “two-alternative forced-choice”) to the “yes-no,” or detection, task. This extension was made in the interests of threshold theory, which detection theory replaces, but it was a step along the way, inasmuch as the yes-no task lies at the heart of signal detection theory and is the basis for the ROC. As the last piece of relevant history, this chapter shows how statistical theory developed by Jerzy Neyman and Egon Pearson in 1933, and extended by Abraham Wald in 1950, formed the basis for signal detection theory. In statistical theory, the two overlapping distributions are statistical hypotheses: a null hypothesis and an alternative. In classical hypothesis testing, a decision criterion is selected to yield some small probability, usually .05 or .01, of rejecting the null hypothesis when it is true (that is, of making a false-positive decision or Type I error). Similarly, Blackwell assumed a fixed sensory threshold that would lead to a negligible proportion of positive responses when only noise is present. In going from hypothesis testing to a broader class of statistical decisions, Wald made it clear that a decision criterion could be set anywhere along a decision variable. This was the same variable, the likelihood ratio, for any task and for any definition of the optimal criterion. (A detailed treatment of the statistical heritage of detection theory is given by Gigerenzer, G., and Murray, D. J., Cognition as intuitive statistics. Hillsdale, NJ: Lawrence Erlbaum Associates, 1987.)
Chapter 1 points up that the signal detection theory of interest here was developed in the early 1950s by Wesley Peterson and Theodore Birdsall, who were then graduate students in electrical engineering at the University of Michigan. Wilson Tanner and I, graduate students there in psychology, joined them in research and made the first application of the theory to human observers in a study of visual discrimination. Though unaware of Wald’s work a few years earlier, Peterson and Birdsall also conceived of a decision criterion that could vary across the range of a decision variable that is the likelihood ratio. To show the consequences in performance of a variable criterion, they devised the ROC. The ROC shows, for a given discrimination acuity (and a given signal strength), how the true-positive proportion (TPP) varies with the false-positive proportion (FPP) as the criterion, or the observer’s willingness to make the positive response, is varied. On ordinary arithmetic scales, the ROC extends from 0 to 1.0 on each scale, concave downward, that is, with decreasing slope. An ROC lying on the positive diagonal, with TPP = FPP, shows zero discrimination; an ROC following the left and upper borders, with TPP = 1 for all FPP, shows perfect discrimination. Statisticians sometimes remark that the ROC is simply the “power function” of statistical theory, but the two functions differ fundamentally. In fact, the power function, which shows how TPP increases with increasing signal strength for some selected, small, fixed FPP, is the century-old “psychometric function” of psychological theory.
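The way a variable criterion traces out an ROC can be sketched in a few lines of code. The sketch below assumes the simplest textbook case, two normal distributions of equal variance separated by a distance `d_prime`; the function names and the particular criterion values are illustrative choices, not taken from the chapters themselves.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed noise distribution: mean 0, sd 1


def roc_point(criterion, d_prime):
    """Return (FPP, TPP) for an observer who responds "yes" above `criterion`.

    Assumes the signal distribution is normal with mean `d_prime` and sd 1.
    """
    fpp = 1.0 - STD_NORMAL.cdf(criterion)            # noise area above the criterion
    tpp = 1.0 - STD_NORMAL.cdf(criterion - d_prime)  # signal area above the criterion
    return fpp, tpp


def roc_curve(d_prime, criteria):
    """ROC points for a list of criteria, ordered as given."""
    return [roc_point(c, d_prime) for c in criteria]


# Criteria ordered from strict (few "yes" responses) to lenient (many).
criteria = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0]

for fpp, tpp in roc_curve(1.0, criteria):
    print(f"FPP = {fpp:.3f}   TPP = {tpp:.3f}")
```

Setting `d_prime` to 0 puts every point on the positive diagonal (TPP = FPP), and increasing it pushes the points toward the upper left corner, which matches the two limiting cases described above.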
Chapter 1 proceeds to describe computational procedures for the index of discrimination acuity called d′, as popularized in the early applications of signal detection theory in psychology. This chapter anticipates the diminished value of d′ suggested by the accumulating data. Specifically, it shows the theoretical ROC on a bivariate-normal graph, that is, on normal-deviate scales that provide a linear ROC. The measure d′ is appropriate for ROCs of slope = 1, but empirical data show ROCs of other slopes, varying primarily between 0.5 and 1.0. This chapter mentions the area under the ROC as a “non-parametric” discrimination index appropriate to varying ROC slopes; it does not anticipate the later prominence of an area measure based on bivariate-normal distributions. Chapter 2 shows the robustness of the linear ROC on the binormal graph, by displaying dozens of empirical ROCs that are fitted well by a linear function, with varying slope. The appropriate index is termed Az: the A for “area” and the “z” to connote the normal-deviate scales of the ROC plot. This index varies from .50 at chance performance to 1.0 at perfect performance. It is now the index of general choice in diagnostic applications of ROC theory and also should be, I suggest, in psychology.
The index of the decision criterion called β (beta) is also described in Chapter 1. It is defined as the criterion value on the likelihood-ratio decision variable and also as the slope of the tangent to the ROC (on ordinary arithmetic scales) at the point that is generated by the given criterion. In contrast to d′, the index β has held up well in my opinion (but see Macmillan, N. A., and Creelman, C. D. Response bias: Characteristics of detection theory, threshold theory, and “non-parametric” indexes. Psychological Bulletin, 1990, 107(3), 401–413). A strong point is that optimal decision criteria can be specified by β.
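The two definitions of β can be checked against each other numerically. The sketch below again assumes equal-variance normal distributions, so that the likelihood ratio at a criterion is the ratio of the two densities there. The expression used for the optimal β, [P(noise) / P(signal)] × [(V_CR + C_FA) / (V_H + C_M)], is the usual expected-value rule and is brought in here as an assumption; the function and parameter names are illustrative.

```python
from math import log
from statistics import NormalDist

NOISE = NormalDist(0.0, 1.0)  # assumed noise distribution


def beta_at(criterion, d_prime):
    """Likelihood ratio (signal density / noise density) at the criterion."""
    signal = NormalDist(d_prime, 1.0)
    return signal.pdf(criterion) / NOISE.pdf(criterion)


def roc_slope(criterion, d_prime, h=1e-5):
    """Numerical slope dTPP/dFPP of the ROC at the point the criterion generates."""
    signal = NormalDist(d_prime, 1.0)
    d_tpp = signal.cdf(criterion + h) - signal.cdf(criterion - h)
    d_fpp = NOISE.cdf(criterion + h) - NOISE.cdf(criterion - h)
    return d_tpp / d_fpp


def optimal_beta(p_signal, value_hit=1.0, cost_miss=1.0, value_cr=1.0, cost_fa=1.0):
    """Expected-value optimal beta from prior probability, values, and costs."""
    prior_odds_noise = (1.0 - p_signal) / p_signal
    return prior_odds_noise * (value_cr + cost_fa) / (value_hit + cost_miss)


def criterion_for_beta(beta, d_prime):
    """Criterion on the observation axis yielding `beta` (equal-variance case)."""
    return log(beta) / d_prime + d_prime / 2.0


print(beta_at(1.2, 1.0), roc_slope(1.2, 1.0))  # the two values should agree closely
print(criterion_for_beta(optimal_beta(p_signal=0.2), d_prime=1.0))
```

Under these assumptions, a rare signal (here a prior probability of .2 with symmetric values and costs) calls for β greater than 1, that is, a stricter criterion than the point where the two distributions cross.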
Chapter 1 concludes with a review of conclusions drawn from applications of the ROC in psychology, highlighting areas in which the ability to separate discrimination and decision processes led to revised psychological conceptions. An example is sensory vigilance, in which performance effects long thought to represent declines in discrimination acuity were found in most instances to represent a change in the placement of the decision criterion. Similarly, many established findings thought to represent effects of memory and forgetting in recognition tasks were shown to be effects of differences in the decision criterion.
The empirical ROCs of Chapter 2 sample the psychological topics of human visual detection, recognition memory for odors and for words, conceptual judgment, and animal learning. The chapter’s ROCs from diagnostic applications include some from medical imaging, information retrieval, weather forecasting, aptitude testing, and polygraph lie detection. They demonstrate the use of the “rating” task to calculate ROCs based on the adoption of several decision criteria simultaneously, as opposed to successive adoption of single criteria in successive conditions of a yes-no task. The conclusion to be drawn from the survey of empirical ROCs is not that deviations from the linear binormal form never appear, but that the few deviant ROCs do not show any apparent pattern and hence do not support any other particular form. For practical purposes, the linear binormal ROC is apparently adequate and satisfactory and the discrimination index Az is simple and generally useful. For conceptual calibration, it may help to know that Az is theoretically equal to the percentage of correct responses in a paired-comparison, or two-alternative forced-choice, task. That is, an observer represented in a yes-no or rating task by an Az = .80 will state correctly on 80% of the trials which of a pair of stimuli is signal (vs. noise) or Signal A (vs. Signal B).
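Both ideas in the preceding paragraph lend themselves to a short sketch. The first function turns rating-category counts into ROC points by cumulating from the most confident "signal" category downward, which treats each category boundary as a separate criterion. The second simulates a two-alternative forced-choice task under assumed normal distributions and compares the proportion correct with the binormal area. The counts, parameter values, and function names are made up for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def rating_roc(signal_counts, noise_counts):
    """ROC points (FPP, TPP) from rating counts.

    Counts are ordered from "surely signal" to "surely noise". The final
    category is skipped because it always yields the point (1, 1).
    """
    n_signal, n_noise = sum(signal_counts), sum(noise_counts)
    points, cum_signal, cum_noise = [], 0, 0
    for s, n in zip(signal_counts[:-1], noise_counts[:-1]):
        cum_signal += s
        cum_noise += n
        points.append((cum_noise / n_noise, cum_signal / n_signal))
    return points


def simulate_2afc(mu, sigma, trials=100_000, seed=0):
    """Proportion of trials on which the signal observation exceeds the noise one.

    Assumes noise ~ N(0, 1) and signal ~ N(mu, sigma).
    """
    rng = random.Random(seed)
    correct = sum(rng.gauss(mu, sigma) > rng.gauss(0.0, 1.0) for _ in range(trials))
    return correct / trials


def binormal_area(mu, sigma):
    """Az for the same pair of distributions: Phi(mu / sqrt(1 + sigma^2))."""
    return NormalDist().cdf(mu / sqrt(1.0 + sigma ** 2))


# Hypothetical five-category rating data, 100 signal and 100 noise trials.
print(rating_roc([50, 25, 12, 8, 5], [5, 10, 15, 25, 45]))
print(simulate_2afc(1.19, 1.0), binormal_area(1.19, 1.0))
```

In the simulation, the forced-choice proportion correct and the area should agree to within sampling error, which is the sense in which Az can be read as a percentage of correct paired comparisons.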
Chapter 3 is fundamental to measurement of discrimination acuity. It shows that non-ROC indices of discrimination acuity drawn from a 2-by-2 table of stimulus and response are invalid. Included are the percentage of correct responses (that is, the overall percentage of correct positive and negative responses); the true-positive (or “hit”) probability corrected for chance success (corrected in either of two ways); the measure of association called the Kappa statistic, used also as a measure of observer agreement; the correlation coefficient derived from 2-by-2 tables, called phi; and an index representing those developed in the field of weather forecasting, called the skill test. In addition to invalidity, these indices suffer from the inconvenience of not accounting for a variable decision criterion; their use assumes that the criterion placement on which they are based is fixed.
The percentage of correct responses is probably the index most difficult to give up. It seems close to the data, unencumbered by theoretical considerations. Yet, it is the easiest to dismiss, even on arithmetic grounds. And it can be shown to make strong theoretical assumptions, as strong as those made by d′ and Az. If empirical ROCs for given observers and tasks look anything like the ROCs shown in Chapter 2 to be representative, the percentage of correct responses will be a highly variable and undependable index of discrimination acuity for those observers and tasks.
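The arithmetic objection can be made concrete with a small sketch. It takes the overall proportion correct to be the base-rate-weighted sum of the two kinds of correct response, P(S+) × TPP + P(S−) × (1 − FPP), and evaluates it at several base rates for one fixed ROC point. The particular TPP, FPP, and base-rate values are hypothetical.

```python
def proportion_correct(p_signal, tpp, fpp):
    """Overall proportion of correct responses for a given base rate.

    p_signal is the prior probability of a positive stimulus; tpp and fpp
    are the true-positive and false-positive proportions.
    """
    p_noise = 1.0 - p_signal
    return p_signal * tpp + p_noise * (1.0 - fpp)


# One fixed ROC point (same acuity, same criterion), three base rates.
for p_signal in (0.1, 0.5, 0.9):
    pc = proportion_correct(p_signal, tpp=0.80, fpp=0.10)
    print(f"P(S+) = {p_signal:.1f}   P(C) = {pc:.2f}")

# An observer who never responds "yes" has no discrimination at all,
# yet scores well when positive stimuli are rare.
print(proportion_correct(0.05, tpp=0.0, fpp=0.0))
```

In this sketch the observer's acuity and criterion never change, yet the proportion correct moves with the base rate alone, and an observer with no discrimination whatever scores .95 when only 5% of the stimuli are positive.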
To illustrate, the percentage of correct responses, P(C), varies substantially with the prior probabilities (or base rates) of the stimuli. Indeed, it may be defined as the prior probability of a positive stimulus, P(S+), times the conditional probability of a positive response given a positive stimulus, P(R+|S+), hence P(S+) P(R+|S+), added to the p...