The Significance Test Controversy

A Reader
About this book

Tests of significance have been a key tool in the research kit of behavioral scientists for nearly fifty years, but their widespread and uncritical use has recently led to a rising volume of controversy about their usefulness. This book gathers the central papers in this continuing debate, brings the issues into clear focus, points out practical problems and philosophical pitfalls involved in using the tests, and provides a benchmark from which further analysis can proceed.

The papers deal with some of the basic philosophy of science, with the mathematical and statistical assumptions connected with significance tests, and with the problems of interpreting test results, but the work is essentially non-technical in its emphasis. The collection succeeds in raising a variety of questions about the value of the tests; taken together, the questions present a strong case for vital reform in test use, if not for their total abandonment in research.

The book is designed for practicing researchers, those not extensively trained in mathematics and statistics who must nevertheless regularly decide if and how tests of significance are to be used, and for those training for research. While the controversy has been centered in sociology and psychology, and the book will be especially useful to researchers and students in those fields, its importance is great across the spectrum of the scientific disciplines in which statistical procedures are essential, notably political science, economics, and the other social sciences, education, and many biological fields as well.

Denton E. Morrison is professor, Department of Sociology, Michigan State University. Ramon E. Henkel is associate professor emeritus, Department of Sociology, University of Maryland. He teaches as part of the graduate faculty.


PART ONE
Critical Historical Context

INTRODUCTION

HOGBEN’S BOOK, Statistical Theory (1957), is a systematic and damaging attack on various probability practices in research. In the three chapters reprinted here Hogben presents a critical and historical discussion of some major issues concerning tests of significance. Hogben is thus a logical starting point for this book, but because of both the scope of erudition he assumes the reader possesses and his somewhat elusive style of writing, his chapters will be a practical starting point only for those readers with considerable familiarity with the issues and their history. Others will probably want to defer reading Hogben until they become more familiar with the issues by reading certain of the subsequent papers, particularly Bakan’s [Chapter 25], but also those by Rozeboom [Chapter 24] and Camilleri [Chapter 16].
In “The Contemporary Crisis or the Uncertainties of Uncertain Inference,” the first of the chapters that follow, Hogben delineates four separate uses of probability in theoretical statistics. His “Calculus of Judgments,” the term he uses for statistical inference, is most relevant here.1 He elaborates his views on this topic in the two subsequent chapters we have reprinted. Most of the initial chapter is given to demonstrating that there are long-standing and deep-seated differences among statisticians on probability in general and on the mathematical bases for and interpretation of significance tests in particular. He points out that these differences have not prevented statisticians from making somewhat pompous claims that have been uncritically accepted by researchers. The contemporary crisis is, then, that despite widespread use in research, many uncertainties surround the mathematical assumptions, meaning, interpretation, and relevance of statistical theory.
In “Statistical Prudence and Statistical Inference” [Chapter 2], Hogben deals with the issue of the meaning of significance tests from a historical perspective and relates his discussion to the issues of the process and purpose of scientific inference. Hogben attempts to demonstrate that the term “statistical inference” is a misnomer for what is in fact simply a method of making prudent judgments in research situations where such judgments are necessary and possible. To understand this idea requires the recognition that Hogben wishes to reserve the term “inference” (also “interpretation”) exclusively for what is involved in arriving at a valid assessment of whether a hypothesis should or should not take its place in a body of scientific knowledge, that is, scientific inference. He argues that the school of R. A. Fisher, the earlier and more influential among scientists, mistakenly employs significance tests as a mode of scientific inference. In contrast, he maintains that J. Neyman, E. S. Pearson, and A. Wald correctly use the tests as decision tests.
Throughout Statistical Theory Hogben tends to identify Fisher’s approach as one example of what he calls the “Backward Look” (the other key example is the Bayesian approach), while the Neyman-Pearson-Wald approach is labeled the “Forward Look.” The terms “Forward” and “Backward” imply as much (or more) about Hogben’s attitudes toward the two practices as they do about the differences actually involved. Briefly, the Backward Look, as applied to Fisher’s notion, involves retrospective interpretation of the information produced by a significance test to infer the general validity of a hypothesis in probabilistic terms. On the other hand, the Forward Look involves the application of probability theory to a given empirical outcome to determine the ratio of correct decisions about future outcomes of tests of a hypothesis.
The difference can be better understood by considering the meaning of a rare event—for example, a difference between two groups that a significance test indicates would occur less than five times in 100 if the null hypothesis of no difference between the groups were true. According to the Fisherian approach we would attempt to infer from a single occurrence like the one above whether the difference is “significant,” i.e. whether the difference is such that we may infer the null hypothesis to be invalid (“rejected”). Although Fisher preferred the five percent level as a “convenient” criterion for making this inference, he did not dictate that it be a firm criterion, nor did he think it necessary to state the criterion level in advance of the test. The problem of judging whether a given finding is rare enough to warrant rejection of the null hypothesis is a matter of inference and interpretation for the researcher after he has performed the test.
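To make the Fisherian reading concrete, the following sketch (ours, not the book's; the data and the .05 benchmark are illustrative assumptions) computes a p-value for a difference between two groups by a permutation test, i.e., the probability of a difference at least as large as the one observed if the null hypothesis of no difference were true:

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often a mean difference at least as large as the one
    observed would arise if group labels were assigned at random, i.e.,
    if the null hypothesis of no difference were true."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical measurements for a treated and an untreated group.
treated = [5.1, 6.0, 5.8, 6.4, 5.9]
untreated = [4.8, 5.2, 5.0, 5.5, 4.9]
p = permutation_p_value(treated, untreated)
# Fisher's reading: the researcher inspects p after the test; .05 is a
# convenient benchmark, not a binding rule stated in advance.
print(f"p = {p:.3f}")
```

On this reading, whether the resulting p is "rare enough" to discredit the null hypothesis is a judgment the researcher makes after inspecting the result, not a decision dictated by a rule fixed beforehand.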
The Neyman-Pearson-Wald approach, in contrast, does not, on the basis of any particular occurrence, attempt to judge the validity of the hypothesis under consideration. Rather, it is a procedure for judgment involving a firm rule (level) stated in advance that yields a specified proportion of valid judgments in the long run. In the short run, that is, in any particular instance, the procedure yields a decision. Such a procedure is appropriate because the research context in which such a decision test (in contrast with a significance test) is used requires a decision as a guide to actions and provides the required fixed framework for repetition of the test. Thus, making decisions in this way is prudent, but such decisions do not commit the researcher to an inference about the validity of the hypothesis in each instance; the hypothesis is simply rejected or accepted on each test, not “interpreted” with regard to its credentials as scientific knowledge.
Hogben favors the Neyman-Pearson-Wald approach for the following reason: Test procedure tells the researcher that a particular event would be one of a class of events that is collectively rare (for instance p < .05) if the null hypothesis were true, but the occurrence of such rare events is, by definition, perfectly compatible with the truth of the null hypothesis. Therefore there is no more basis for interpreting any particular event as outside the rare class (and the null hypothesis as false) than there is for interpreting the event as part of the rare class (and the null hypothesis as true). The probabilities involved in the tests refer only to classes of events, not to individual events, and the best one can hope for in using the tests is a certain proportion of correct decisions about hypotheses; meeting a given probability level does not allow an inference about the “significance” of the particular finding for the validity of the hypothesis.
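The point that the probabilities attach to classes of events rather than to individual events can be illustrated by a small simulation (an illustrative sketch, not an example from Hogben): when the null hypothesis is true and the rejection rule is fixed at the .05 level in advance, about five percent of repeated tests reject, but nothing follows about any single rejection:

```python
import random
import statistics

def z_rejects(sample_a, sample_b, sigma=1.0, critical=1.96):
    """Two-sample z-test with known sigma; reject when |z| exceeds the
    critical value fixed in advance (1.96 corresponds to alpha = .05)."""
    n = len(sample_a)
    se = sigma * (2 / n) ** 0.5
    z = (statistics.mean(sample_a) - statistics.mean(sample_b)) / se
    return abs(z) > critical

rng = random.Random(1)
trials = 10_000
rejections = 0
for _ in range(trials):
    # Both groups drawn from the same population, so the null is true.
    a = [rng.gauss(0.0, 1.0) for _ in range(30)]
    b = [rng.gauss(0.0, 1.0) for _ in range(30)]
    rejections += z_rejects(a, b)

# Close to .05: a property of the class of decisions in the long run,
# not of any one test.
print(f"long-run rejection rate under a true null: {rejections / trials:.3f}")
```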
To treat the probabilities of the Calculus of Judgments as indicative of significance in the Fisher tradition is, according to Hogben, to locate probability “in the mind”—the notion that a high degree of belief is warranted about the invalidity of null hypotheses rejected by “significant” findings. Hogben strenuously objects to this practice, preferring instead what he calls a “behaviorist” approach, which treats probability completely in terms of relative frequencies exhibited by observable events. Further, Hogben objects to the negative aspects of Fisher’s stress on the rejection of elementary (no-difference) null hypotheses (rather than emphasizing positive knowledge through the development of specific and informative alternative hypotheses). Hogben is not, however, against scientists individually or collectively having beliefs about the validity of hypotheses, and he fully recognizes that making positive inferences must be a part of scientific activity; his criticism is directed generally at attempts to calculate and express such beliefs as probabilities, specifically with significance tests involving elementary null hypotheses.2
Hogben thinks the Calculus of Judgments belongs mainly in practical settings, as in quality control in manufacturing. In this sense, then, he clearly is not endorsing the Neyman-Pearson-Wald approach as a prescription for scientific inference. He does allow the possibility that the tests may play an occasional minor role in the scientific enterprise as a screen against rash decisions or as advice to the researcher concerning which hunches are worth following. However, he quickly qualifies even this possible contribution of the tests:
An experienced investigator, with no illusions about the practicality of formulating risks relevant to further effort in numerically intelligible terms consistent with the professional ethic of scientific research, may accordingly prefer to rely on common sense, if statistical theory has nothing better to confer [Chapter 2: 25].
Moreover, Hogben later points out that such a screening convention is worthless unless one considers the power of a test, and that the question of power cannot be resolved outside the context of practical situations requiring prudent research decisions.
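Hogben's point about power can likewise be made concrete by simulation (again a sketch under assumed numbers, not material from the book): the same fixed .05 rule detects a real difference with very different frequencies depending on sample size, so a screening convention that ignores power guarantees little:

```python
import random
import statistics

def power_estimate(true_diff, n, trials=5_000, sigma=1.0, seed=2):
    """Proportion of two-sample z-tests rejecting at the .05 level when
    the group means really differ by true_diff."""
    rng = random.Random(seed)
    se = sigma * (2 / n) ** 0.5
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(true_diff, sigma) for _ in range(n)]
        b = [rng.gauss(0.0, sigma) for _ in range(n)]
        hits += abs((statistics.mean(a) - statistics.mean(b)) / se) > 1.96
    return hits / trials

# The same fixed rule, very different chances of detecting a real
# difference of half a standard deviation.
for n in (10, 30, 100):
    print(f"n = {n:3d}: power ~ {power_estimate(0.5, n):.2f}")
```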
In addition to the above issues, “Statistical Prudence and Statistical Inference” touches on an interrelated package of issues dealing with sampling, population, form, and scope, mainly in the context of Hogben’s attempt to demonstrate the general inadequacy of the Fisherian approach to significance tests. In “Significance as Interpreted by the School of R. A. Fisher” [Chapter 3], Hogben deals with this group of issues in more detail and with increasing invective. He also documents important inconsistencies in Fisher’s position, although some scholars will doubtless claim that both his selection and interpretation of Fisher’s statements are biased. Both Fisher’s position and Hogben’s analysis of these issues are complex but can be briefly summarized as follows.
According to Fisher, the purpose of a significance test is to determine whether two groups, treated and untreated, can be considered samples from the same infinite hypothetical population. Thus, the null hypothesis takes the familiar form that the groups do not differ on some characteristic of interest. If the null hypothesis is rejected on the basis of a significance test, the appropriate inference is that the treatment made a significant difference and the groups did not come from the same infinite hypothetical population. Thus, the population involved in the Fisherian view of significance is the conceptual resultant of the test procedure.
Hogben points out that the “no-difference” form of the Fisher null hypothesis (sometimes also called the “point null” hypothesis) is insufficiently informative for scientific work, that a random process must be assumed in the formation of the groups, that such a process must be possible in a conceptual framework of infinite repetition on a particular population, and that in no sense can an infinite or any other population be the resultant of the test. His critical points all have direct relevance to the use of significance tests in behavioral research, since no-difference nulls and nonrandom samples from what researchers casually claim are infinite hypothetical populations are deeply embedded practices in these disciplines. Hogben does not point out, however, that in the behavioral science disciplines the inference features of the Fisher approach have to some extent been amalgamated in practice with the features of the decision test approach that involve stating a level in advance and sticking to it unequivocally. It is doubtful that Hogben would look with favor on either this marriage or its offspring.
1. In a less direct sense Hogben’s “Calculus of Aggregates” is relevant to Gold’s [Chapter 20] and Winch and Campbell’s [Chapter 22] use of tests of significance. These writers’ random process models are essentially what Hogben would refer to as a Calculus of Aggregates—the attempt to see whether random aggregate behavior could account for the phenomenon observed.
2. The reader is urged to compare Hogben’s views with the notions of Rozeboom [Chapter 24] and Camilleri [Chapter 16]. While the latter writers both favor attempts to assess the probable validity of hypotheses, they are in complete agreement with Hogben in his negative view of using tests of significance for doing so.

1 The Contemporary Crisis or the Uncertainties of Uncertain Inference

Lancelot Hogben F.R.S.
IT IS NOT WITHOUT REASON that the professional philosopher and the plain man can now make common cause in a suspicious attitude towards statistics, a term which has at least five radically different meanings in common usage, and at least four in the context of statistical theory alone. We witness on every side a feverish concern of biologists, sociologists and civil servants to exploit the newest and most sophisticated statistical devices with little concern for their mathematical credentials or for the formal assumptions inherent therein. We are some of us all too tired of hearing from the pundits of popular science that natural knowledge has repudiated any aspirations to absolute truth and now recognises no universal logic other than the principles of statistics. The assertion is manifestly false unless we deprive all purely taxonomic enquiry of the title to rank as science. It is also misleading because statistics, as men of science use the term, may mean disciplines with little connexion other than reliance, for very different ostensible reasons, on the same algebraic tricks.
This state of affairs would be more alarming as indicative of the capitulation of the scientific spirit to the authoritarian temper of our time, if it were easy to assemble in one room three theoretical statisticians who agree about the fundamentals of their speciality at the most elementary level. After a generation of prodigious proliferation of statistical techniques whose derivation is a closed book to an ever-expanding company of avid consumers without access to any sufficiently simple exposition of their implications to the producer-mathematician, the challenge of J. Neyman, E. S. Pearson, and Abraham Wald is provoking, in Nietzsche’s phrase, a transvaluation of all values. Indeed, it is not too much to say that it threatens to undermine the entire superstructure of statistical estimation and test procedure erected by R. A. Fisher and his disciples on the foundations laid by Karl Pearson, Edgeworth, and Udny Yule. An immediate and hopeful consequence of the fact that the disputants disagree about the factual credentials of even the mathematical theory of probability itself is that there is now a market for textbooks on probability as such, an overdue awareness of its belated intrusion in the domain of scientific research and a willingness to re-examine the preoccupations of the Founding Fathers when the topic had as yet no practical interest other than the gains and losses of a dissolute nobility at the gaming table.
Since unduly pretentious claims put forward for statistics as a discipline derive a spurious cogency from the protean implications of the word itself, let us here take a look at the several meanings it enjoys in current usage. First, we may speak of statistics in a sense which tallies most closely with its original connotation, i.e. figures pertaining to affairs of state. Such are factual statistics, i.e. any body of data collected with a view to reaching conclusions referable to recorded numbers or measurements. We sometimes use the term vital statistics in this sense, but for a more restricted class of data, e.g. births, deaths, marriages, sickness, accidents and other happenings common to individual human beings and more or less relevant to medicine, in contradistinction to information about trade, employment, education and other topics allocated to the social sciences. In a more restricted sense, we also use expressions such as vital statistics or economic statistics for the exposition of summarising procedures (life expectation, age standardisation, gross or net reproduction rates, cohort analysis, cost of living or price indices) especially relevant to the analysis of data so described. By analysis in this context, we then mean sifting by recourse to common sense and simple arithmetical procedures what facts are more or less relevant to conclusions we seek to draw, and what circumstances might distort the true picture of the situation. Anscombe (1951) refers to analysis of this sort as statistics in the sense in which “some continental demographers” use the term.
If we emphatically repudiate the unprovoked scorn in the remark last cited, we must agree with Anscombe in one particular. When we speak of analysis in the context of demography, we do not mean what we now commonly call theoretical statistics. What we do subsume under the latter presupposes that our analysis invokes the calculus of probabilities. When we speak of a calculus of probabilities we also presuppose a single formal system of algebra; but a little reflection upon the history of the subject suffices to remind us that: (a) there has been much disagreement about the relevance of such a calculus to everyday life; (b) scientific workers invoke it in domains of discourse which have no very obvious connexion. When I say this I wish to make it clear that I do not exclude the possibility that we may be able to clarify a connexion if such exists, but only if we can reach some agreement about the relevance of the common calculus to the world of experience. On that understanding, we may provisionally distinguish between four domains to which we may refer when we speak of the Theory of Statistics:
(i) A Calculus of Errors, as first propounded by Legendre, Laplace, and Gauss, undertakes to prescribe a way of combining observations to derive a preferred and so-called best approximation to an assumed true value of a dimension or constant embodied in an independently established law of nature. The algebraic theory of probability intrudes at two levels: (a) the attempt to interpret empirical laws of error distribution referable to a long sequence of observations in terms consistent with the properties of models suggested by games of chance; (b) the less debatable proposition that unavoidable observed net error associated with an isolated observation is itself a sample of elementary components selected randomwise in accordance with the assumed properties of such models.
Few current treatises on theoretical statistics have much to say about the Gaussian Theory of Error; and the reader in search of an authoritative exposition must needs consult standard texts on The Combination of Observations addressed in the main to students of astronomy and geodesy. In view of assertions mentioned in the opening paragraph of this chapter, it is pertinent to remark that a calculus for combining observations as propounded by Laplace and by Gauss, and as interpreted by all their successors, presupposes a putative true value of any measurement or constant under discussion as a secure foothold for the concept of error. When expositors of the contemporary reorientation of physical theory equate the assertion that the canonical form of the s...

Table of contents

  1. Cover
  2. Title Page
  3. Copyright Page
  4. Contents
  5. Preface
  6. Introduction A Preview of the Issues and an Overview of the Readings
  7. PART ONE. CRITICAL HISTORICAL CONTEXT
  8. PART TWO. THE CONTROVERSY IN SOCIOLOGY
  9. PART THREE. CRITICISM BY PSYCHOLOGISTS
  10. PART FOUR. CRITICISM FROM OTHER QUARTERS
  11. PART FIVE. EPILOGUE
  12. References
  13. Index