Statistical analysis is probably the most pervasive methodological tool in psychological research, often providing a common set of assumptions that span various sub-disciplines, such as personality, social, perception, and cognition. Just as technologies influence sciences, so do the conceptual tools inherent in the use of particular statistical techniques. To illustrate this point, it is easier to look at the issue historically, before considering the implications of the more recent innovations.
The Analysis of Variance
One common type of statistical analysis is the Analysis of Variance—the ANOVA. ANOVAs are so much a part of psychological research that it is difficult to find any general statistics book written for graduate psychology that omits the ANOVA. In spite of the current pervasiveness of the ANOVA, its rise and institutionalization in psychology occurred only in the last thirty years, as described by Rucci and Tweney (1980). The basic formulation of the ANOVA was presented by Fisher in the late 1920’s. From the mid-1930’s until the mid-1940’s, a considerable number of expository articles appeared in the psychological literature describing the ANOVA as applied to psychology. In addition, there was a steady increase in the number of articles that used ANOVA as an analysis technique (except for a brief interruption caused by World War II). Also appearing in the 1940’s were the first statistics textbooks designed especially for psychological research, which further disseminated the ANOVA. By the early 1950’s, statistical methods became an intrinsic part of experimental psychology when graduate courses in statistics for psychology were established. By the 1980’s, one can safely assume that a psychologist from any research area at least understands the terms F distribution, error term, and degrees of freedom. So even though ANOVA sometimes seems like an inherent part of psychological research, it has not always been with us. The intriguing question, of course, is what its role and influence have been on the scientific development of psychology.
It might be too strong a statement to say that the institutionalization of the ANOVA caused a shift in thinking, but a shift seems to have occurred at about the same time ANOVA started to be used extensively. Rucci and Tweney note that the rise of the ANOVA was correlated with the demise of the single-variable law in psychology. Coupled with the demise was a shift to consideration of multiple, conjoint determinants of behavior. In part, this shift may have occurred because the ANOVA facilitates thinking about the effects of several variables, that is, cross classification and the interaction among variables. One might also speculate about other, more recent influences of the ANOVA on the field. For instance, the availability of the ANOVA among experimental psychologists probably facilitated the dissemination and impact of Sternberg’s additive factors methodology (Sternberg, 1969).
To push this methodological Whorfianism a little farther, we can consider how the “language” of the ANOVA shapes our conceptions of psychological issues. My hypothesis is that ANOVA permits and encourages issues to be framed and conceptualized in terms of nominal, or essentially qualitative, variables. Nominal variables, as introductory statistics books explain, are those without an underlying metric. Thus, it is possible to classify words as abstract or concrete, frequent or infrequent, pronounceable or unpronounceable, and so on. These are examples of continuous quantitative variables that have been dichotomized or trichotomized, with each level treated as distinct from the others. The implicit assumption may be that there is no underlying metric. The consequence of having a method for showing that different levels of a factor affect behavior differently is that researchers are “encouraged” to search for differences, rather than for the mechanisms that account for the differences.
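The cost of dichotomizing can be made concrete with a small simulation. All of the numbers below are invented for illustration (an assumed 30 ms decrease in latency per log unit of word frequency), not results from any study:

```python
import numpy as np

rng = np.random.default_rng(4)
# Invented data: word frequency counts, and a latency that declines
# by an assumed 30 ms per log unit of frequency, plus noise.
frequency = rng.lognormal(mean=3, sigma=1, size=1000)
rt = 600 - 30 * np.log(frequency) + rng.normal(0, 40, 1000)

# Median-split the continuous variable into two nominal levels,
# as an ANOVA framing encourages.
is_frequent = frequency > np.median(frequency)
print(f"infrequent mean RT: {rt[~is_frequent].mean():.0f} ms")
print(f"frequent mean RT:   {rt[is_frequent].mean():.0f} ms")
# The two cell means show THAT the levels differ, but the continuous
# relation between log frequency and latency is no longer visible.
```

The analysis can establish a difference between the two cells, yet the underlying metric, and the functional relation defined on it, has been discarded.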
Multiple Regression
The ANOVA is of course a special case of multiple regression. However, the conceptual difference between the two statistical tools may be larger than their formal mathematical relation would suggest. Regression techniques are convenient for exploring interval or ratio variables, those that have a continuous, or quantitative, underlying dimension. One of the important products of the regression analysis is the regression weight, which expresses how much the dependent variable (i.e. some aspect of the behavior) changes as a function of a unit of change in the independent variable (i.e. some aspect of the stimulus). Rather than simply determining whether long words take significantly longer to pronounce than short words, the regression model quantifies the amount of increase as a function of length. Such a mathematical result appears closer to the enterprise of model building than does the confirmation or disconfirmation of a difference among conditions. Obviously, not all interesting scientific issues involve quantitative stimulus dimensions. Even those that do might not involve a linear or continuous function to relate the independent and dependent variables. Nevertheless, it is interesting to speculate how current experimental psychology might be different if multiple regression had been the technique that was institutionalized in the discipline, with the ANOVA as a special case, rather than the reverse.
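The word-length example can be sketched as a few lines of simulation. The data are invented (an assumed true slope of 25 ms per letter); the point is only that the regression weight recovers a quantitative rate of increase rather than a mere difference:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented data: pronunciation latency (ms) as a linear function of
# word length (letters), with an assumed true slope of 25 ms/letter.
length = rng.integers(2, 15, size=200)
latency = 400 + 25 * length + rng.normal(0, 30, size=200)

# Ordinary least squares: latency = b0 + b1 * length.  The weight b1
# quantifies the increase per letter; an ANOVA on a short/long split
# would only confirm that the two cells differ.
b1, b0 = np.polyfit(length, latency, 1)
print(f"intercept = {b0:.1f} ms, slope = {b1:.1f} ms per letter")
```

The estimated slope is a parameter of a model of the process, which is the sense in which regression output sits closer to model building than a significance test does.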
The Effect-Size Question. The few examples of regression analysis that already exist in the psycholinguistics literature suggest one way in which the tool has a subtle influence on its users. The R-squared measure in the regression analysis seems to lead researchers to ask about the importance of each variable in the regression equation. In the extreme, this becomes a step-wise regression in which variables are entered one at a time, and the computation of interest is the residual variance accounted for by each variable after the preceding variables have been entered. Of course, the analogous questions could be asked in ANOVA designs, but they seldom are, in spite of the suggestions of some authors (cf. Dwyer, 1974; Hays, 1963).
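The incremental logic of step-wise regression can be sketched as follows, on invented data: predictors are entered one at a time, and each is credited with its increment in R-squared over the preceding step (the true weights of 2.0 and 1.0 are assumptions of the simulation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)   # e.g. standardized word length (invented)
x2 = rng.normal(size=n)   # e.g. standardized log frequency (invented)
y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # assumed true weights

def r_squared(X, y):
    """R-squared from an OLS fit of y on the columns of X plus an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Enter the predictors one at a time, crediting each with the
# increment in variance accounted for over the preceding step.
r2_step1 = r_squared(x1[:, None], y)
r2_step2 = r_squared(np.column_stack([x1, x2]), y)
print(f"x1 alone: R2 = {r2_step1:.3f}")
print(f"increment for x2: {r2_step2 - r2_step1:.3f}")
```

Note that with correlated predictors the increments would depend on the order of entry, which is one reason the step-wise answer to the importance question is less decisive than it looks.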
To ask about the importance of a particular effect, or more precisely, its underlying process, is a worthwhile concern. However, it is incorrect to assume that the R-squared measure provides the answer. The R-squared measure indicates the size of an effect in a given study, but not the importance of the process. The R-squared measure gives the variance accounted for by a given factor, but this estimate is not an inherent property of the factor. The estimate depends on the variation of that factor relative to the variation of other factors in the task. The implications of this become clearer if one considers a common example such as the variable of word length. In most reading experiments with natural text, word length accounts for a large proportion of the variation in gaze duration on a word. The unwary might be tempted to conclude that the process affected by word length is a particularly important one. But such a conclusion fails to consider that word length has such a large effect because it varies so widely, from words like a, of, and by to words like thermoluminescence and microbiological. Different variances and different R-squares would be obtained if the same readers were given texts in which the word lengths varied only a little (a restriction that occurs in many word perception experiments). This is just one reason why it is fallacious to equate the R-squared measure with the importance of an underlying process.
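The range-restriction point can be demonstrated directly: the same assumed process (identical slope, identical noise) yields very different R-squared values depending only on how widely word length varies. All the numbers in this simulation are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def r2_for_lengths(lengths):
    # Identical assumed process in both cases: gaze duration rises
    # 20 ms per letter, with 50 ms of residual noise (values invented).
    gaze = 200 + 20 * lengths + rng.normal(0, 50, lengths.size)
    slope, intercept = np.polyfit(lengths, gaze, 1)
    resid = gaze - (intercept + slope * lengths)
    return 1 - resid.var() / gaze.var()

wide = rng.integers(1, 20, size=1000)    # natural text: lengths 1-19
narrow = rng.integers(4, 7, size=1000)   # restricted range: lengths 4-6

r2_wide = r2_for_lengths(wide)
r2_narrow = r2_for_lengths(narrow)
print(f"wide range of lengths:       R2 = {r2_wide:.2f}")
print(f"restricted range of lengths: R2 = {r2_narrow:.2f}")
```

With the wide range the factor appears dominant; with the restricted range the identical process accounts for only a sliver of the variance, even though nothing about the process has changed.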
In order to assess the importance of a given subprocess in a task, the term importance must be specified more precisely than just the layman’s sense. One possible specification is to consider practical importance, in the sense of determining performance in a “real-world” situation. In that case, the process should be evaluated in a context that approximates the situation to be explained. Both the size of the effect of the process, and the necessity or frequency of the process contribute to its practical importance. An alternative specification of importance can be in terms of theory development. Even a process that produces only a small effect or operates infrequently may be necessary to explain how a system works in particular contexts, and hence, may be theoretically important. Thus, the “importance” of a factor is a context-sensitive issue. The usefulness of evaluating importance is that it encourages researchers to explicitly consider the context of the process and model that is being evaluated.
It is difficult to assess more general effects that regression models might have in psychological research, since their use is too limited and recent. However, one way to indulge in such hypothetical reasoning, while minimizing its science-fiction character, is to examine a field in which this tool has been widely used. Then we can see whether there are lessons to be learned by example. Regression models lend themselves to this exercise quite easily, since they have been the workhorse of the sub-discipline of econometrics. We can see whether some of the successes and failures of this field hold any moral for psychology.
The Lessons of Econometrics. Part of economics cannot be an experimental science, since its domain includes large-scale economies. Economics therefore requires tools for analyzing existing data on factors such as supply and demand, consumption, investment, and so on. The area that embraced and developed regression techniques as a major analytic tool is called econometrics. This field has been so permeated by quantitative techniques, particularly regression models, that many textbooks on econometrics resemble advanced statistics books, with the content of specific economic theories relegated to appendices. While psychology and psycholinguistics differ from economics in that they are primarily experimental sciences, they too have a need for correlational techniques. Not every variable of interest can be controlled or systematically manipulated. Regression is useful when the uncontrolled variable is just a nuisance variable, so that the effects of the remaining variables can be examined after the effect of the nuisance variable has been partialled out. As in economics, regression also can be used to examine complex relations that cannot be manipulated experimentally. A good example is individual differences, which generally are not manipulated since we cannot experimentally create good and poor readers. Other variables, such as the effect of schooling or long-term training on the acquisition of language skills, require such large amounts of time or resources that they are not practical to manipulate. As psycholinguistic research explores more complex situations, its methodological tools may increasingly resemble those that have been of use in other social sciences.
In spite of the differences between fields, we can examine econometrics for potential lessons concerning the role and limitations of regression analysis. One lesson that appears in this literature is that regression models are useful for identifying the variables that enter into a particular process. A weakness of the tool, however, is that the specific weights and exact parameters may easily be overinterpreted. Regression techniques do not discriminate very well between models with the same variables but with different functions, such as models that include vs. exclude interactions. Regression techniques may be most useful for establishing the variables and general aspects of their quantitative function, but then experimental tools are needed to explore processing details. A second lesson is that econometrics has sometimes degenerated into the rote application of statistical analyses without commensurate gains in theory development. But this seems to be a danger with every new paradigm and methodology, and we can find instances within our own field without taking lessons from others. Beyond these general lessons, there are some technical issues that have been explored in considerable detail in the econometrics literature, such as multicollinearity and inferring causality, that may be of use to psychologists who are just beginning to use this tool.
The Purpose of Model Fitting
Now, I would like to go to a more abstract level and discuss the rationale for model fitting — assessing the fit between a model and some data. The argument I am going to expand upon is one that I encountered in a 1962 article by David Grant in which he talked about fitting models that were developed from mathematical learning theory. The models were probabilistic, but like some current modelling efforts, mathematical learning models made predictions about a number of aspects of the data — some with more success than others. Those researchers also faced the issue of fitting the predicted characteristics to the observed ones. One type of answer that was common then and is still used now is to compute an overall measure of fit. The model-fitting exercise was considered to be successful if the deviations were minor and non-significant. The ironic aspect of this approach, as Grant pointed out, was that the sloppier the data, the more likely one was to find non-significant differences between the predictions and the observations. The more careful the study, the more likely the researcher would have to contend with the sticky issue of significant discrepancies between the model’s behavior and Mother Nature’s. Grant suggested a particular statistical solution to the paradox, a solution that included two statistical tests: one for the goodness of fit and one for the non-randomness of the deviations. Of more interest than his particular statistical procedure, however, is the general rationale he gave for model fitting.
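The paradox can be illustrated with a simple simulation. This is a sketch of the general point, not Grant’s actual procedure, and every number is invented: a model predicts the wrong value in all ten conditions, yet escapes detection when the data are noisy and is glaringly rejected when they are precise:

```python
import numpy as np

rng = np.random.default_rng(3)
# A model that predicts 0.50 in each of 10 conditions when the true
# value is 0.55 in every condition (all numbers invented).
predicted = np.full(10, 0.50)
true_vals = np.full(10, 0.55)

def misfit_z(noise_sd, n_per_cell=50):
    """z statistic for the mean deviation between the observed cell
    means and the model's predictions."""
    observed = true_vals + rng.normal(0, noise_sd / np.sqrt(n_per_cell), 10)
    dev = observed - predicted
    return dev.mean() / (dev.std(ddof=1) / np.sqrt(dev.size))

z_sloppy = misfit_z(noise_sd=0.8)    # sloppy data: misfit looks small
z_careful = misfit_z(noise_sd=0.05)  # careful data: same misfit is glaring
print(f"sloppy data:  z = {z_sloppy:.1f}")
print(f"careful data: z = {z_careful:.1f}")
```

The misfit is identical in the two cases; only the precision of the data differs, which is exactly why a non-significant goodness-of-fit test by itself rewards sloppiness.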
Grant argued that the statistical approach of testing the null hypothesis against the alternative hypothesis provides a misleading characterization of the scientific enterprise. Researchers are not in the position of accepting or rejecting hypotheses like a quality control supervisor on an assembly line. Our actual task is the long-range process of explaining phenomena. Hence, we should not test a model to determine if it is true (after all, no model can be proven true) or even to determine if it is false (in real life, we seldom deal with models that are truly useless). We should test a model to determine how it should be improved. This rationale has implications for how we evaluate the usefulness of a statistical method. An overall test of goodness of fit provides a ballpark estimat...