1 AN INTRODUCTION TO GENERAL LINEAR MODELS: REGRESSION, ANALYSIS OF VARIANCE AND ANALYSIS OF COVARIANCE
1.1 Regression, analysis of variance and analysis of covariance
Regression and analysis of variance are probably the most frequently applied of all statistical analyses. They are used extensively in many areas of research, such as psychology, biology, medicine, education, sociology, anthropology, economics and political science, as well as in industry and commerce.
One reason for the frequency of regression and analysis of variance (ANOVA) applications is their suitability for many different types of study design. Although the analysis of data obtained from experiments is the focus of this text, both regression and ANOVA procedures are applicable to experimental, quasi-experimental and non-experimental data. Regression allows examination of the relationships between an unlimited number of predictor variables and a response or dependent variable, and enables values on one variable to be predicted from the values recorded on one or more other variables. Similarly, ANOVA places no restriction on the number of groups or conditions that may be compared, while factorial ANOVA allows examination of the influence of two or more independent variables or factors on a dependent variable. Another reason for the popularity of ANOVA is that it suits most effect conceptions by testing for differences between means.
Although the label analysis of covariance (ANCOVA) has been applied to a number of different statistical operations (Cox & McCullagh, 1982), it is most frequently used to refer to the statistical technique that combines regression and ANOVA. As the combination of these two techniques, ANCOVA calculations are more involved and time-consuming than either technique alone. Therefore, it is unsurprising that greater availability of computers and statistical software is associated with an increase in ANCOVA applications. Although Fisher (1932; 1935) originally developed ANCOVA to increase the precision of experimental analysis, to date it is applied most frequently in quasi-experimental research. Unlike experimental research, the topics investigated with quasi-experimental methods are most likely to involve variables that, for practical or ethical reasons, cannot be controlled directly. In these situations, the statistical control provided by ANCOVA has particular value. Nevertheless, in line with Fisher's original conception, many experiments can benefit from the application of ANCOVA.
1.2 A pocket history of regression, ANOVA and ANCOVA
Historically, regression and ANOVA developed in different research areas and addressed different questions. Regression emerged in biology and psychology towards the end of the 19th century, as scientists studied the correlation between people's attributes and characteristics. While studying the height of parents and their adult children, Galton (1886; 1888) noticed that although the children of short parents were usually shorter than average, they nevertheless tended to be taller than their parents. Galton described this phenomenon as 'regression to the mean'. As well as identifying a basis for predicting the values on one variable from values recorded on another, Galton appreciated that some relationships between variables would be closer than others. However, it was three other scientists, Edgeworth (e.g. 1886), Pearson (e.g. 1896) and Yule (e.g. 1907), applying work carried out about a century earlier by Gauss (or Legendre, see Plackett, 1972), who provided the account of regression in precise mathematical terms. (Also see Stigler, 1986, for a detailed account.)
Publishing under the pseudonym 'Student', W.S. Gosset (1908) described the t-test to compare the means of two experimental conditions. However, as soon as there are more than two conditions in an experiment, more than one t-test is needed to compare all of the conditions and when more than one t-test is applied there is an increase in Type 1 error. (A Type 1 error occurs when a true null hypothesis is rejected.) In contrast, ANOVA, conceived and described by Ronald A. Fisher (1924, 1932, 1935) to assist in the analysis of data obtained from agricultural experiments, is able to compare the means of any number of experimental conditions without any increase in Type 1 error. Fisher (1932) also described a form of ANCOVA that provided an approximate adjusted treatment sum of squares, before he described the exact adjusted treatment sum of squares (Fisher, 1935, and see Cox & McCullagh, 1982, for a brief history). In early recognition of his work, the F-distribution was named after him by G.W. Snedecor (1934).
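The growth in Type 1 error with multiple t-tests can be illustrated with a rough calculation. The sketch below is not taken from the sources cited above: the function name is hypothetical, and it assumes the tests are independent, which pairwise comparisons that share conditions are not, so the figures should be read only as an approximate guide.

```python
from math import comb


def familywise_error_rate(n_conditions: int, alpha: float = 0.05) -> float:
    """Approximate chance of at least one Type 1 error over all pairwise t-tests.

    Assumes independent tests (an assumption made for illustration only).
    """
    n_tests = comb(n_conditions, 2)  # number of distinct pairs of conditions
    return 1.0 - (1.0 - alpha) ** n_tests


# Conditions, number of pairwise tests, approximate familywise error rate.
for k in (2, 3, 4, 5):
    print(k, comb(k, 2), round(familywise_error_rate(k), 3))
```

With two conditions a single test is needed and the rate equals the nominal level. Each additional condition adds further pairwise tests, and the approximate rate rises accordingly.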
In the subsequent years, the techniques of regression and ANOVA were developed and applied in parallel by different groups of researchers investigating different research topics, using different research methodologies. Regression was applied most often to data obtained from correlational or non-experimental research and only regression analysis was regarded as trying to describe and predict dependent variable scores on the basis of a model constructed from the relations between predictor and dependent variables. In contrast, ANOVA was applied to experimental data beyond that obtained from agricultural experiments (Lovie, 1991), but still it was considered as just a way of determining whether the average scores of groups differed significantly. For many areas of psychology, where the interest (and so tradition) is to assess the average effect of different experimental conditions on groups of subjects in terms of a particular dependent variable, ANOVA was the ideal statistical technique. Consequently, separate analysis traditions evolved and encouraged the mistaken belief that regression and ANOVA constituted fundamentally different types of statistical analysis. Although ANCOVA illustrates the compatibility of regression and ANOVA, as a combination of two apparently discrete techniques employed by different researchers working on different topics, unsurprisingly, it remains a much less popular method that is frequently misunderstood (Huitema, 1980).
1.3 An outline of general linear models (GLMs)
Computers, initially mainframe but increasingly PCs, have had considerable consequence for statistical analysis, both in terms of conception and implementation. From the 1980s, some of these changes began to filter through to affect the way data is analysed in the behavioural sciences. Indeed currently, descriptions of regression, ANOVA and ANCOVA found in psychology texts are in a state of flux, as alternative characterizations based on the general linear model are presented by more and more authors (e.g. Cohen & Cohen, 1983; Hays, 1994; Judd & McClelland, 1989; Keppel & Zedeck, 1989; Kirk, 1982, 1995; Maxwell & Delaney, 1990; Pedhazur, 1997; Winer, Brown & Michels, 1991).
One advantage afforded by computer based analyses is the easy use of matrix algebra. Matrix algebra offers an elegant and succinct statistical notation. Unfortunately however, human matrix algebra calculations, particularly those involving larger matrices, are not only very hard work, but also tend to be error prone. In contrast, computer implementations of matrix algebra are not only error free, but also computationally efficient. Therefore, most computer based statistical analyses employ matrix algebra calculations, but the program output usually is designed to accord with the expectations set by traditional (scalar algebra-variance partitioning) calculations.
When regression, ANOVA and ANCOVA are expressed in matrix algebra terms, a commonality is evident. Indeed, the same matrix algebra equation is able to summarize all three of these analyses. As regression, ANOVA and ANCOVA can be described in an identical manner, clearly they follow a common pattern. This common pattern is the GLM conception. Unfortunately, the ability of the same matrix algebra equation to describe regression, ANOVA and ANCOVA has resulted in the inaccurate identification of the matrix algebra equation as the GLM. However, just as a particular language provides a means of expressing an idea, so matrix algebra provides only one notation for expressing the GLM.
The GLM conception is that data may be accommodated in terms of a model plus some error, as illustrated below:
data = model + error
The model in this equation is a representation of our understanding or hypotheses about the data. The error component is an explicit recognition that there are other influences on the data. These influences are presumed to be unique for each subject in each experimental condition and include anything and everything not controlled in the experiment, such as chance fluctuations in behaviour. Moreover, the relative size of the model and error components is used to judge how well the model accommodates the data.
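The decomposition of data into model and error can be sketched numerically. The example below is a minimal illustration rather than a procedure from the text: the scores are invented, the model is assumed to be a straight line fitted by least squares, and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical scores: one quantitative predictor x and a dependent variable y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Design matrix: a column of ones (intercept) plus the predictor.
X = np.column_stack([np.ones_like(x), x])

# Least squares estimates of the model parameters.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

model = X @ beta   # the part of the data accommodated by the model
error = y - model  # the part of the data the model does not accommodate

# Sums of squares express the relative size of the model and error components.
ss_total = np.sum((y - y.mean()) ** 2)
ss_model = np.sum((model - y.mean()) ** 2)
ss_error = np.sum(error ** 2)

print("data == model + error:", np.allclose(y, model + error))
print("proportion of variation accommodated:", ss_model / ss_total)
```

The larger the model sum of squares relative to the error sum of squares, the better the model accommodates these data.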
The model part of the GLM equation constitutes our understanding or hypotheses about the data and is expressed in terms of a set of variables recorded, like the data, as part of the study. As will be described, the tradition in data analysis is to use regression, ANOVA and ANCOVA GLMs to express different types of ideas about how data arises.
1.3.1 Regression analysis
Regression analysis attempts to explain data (the dependent variable scores) in terms of a set of independent variables or predictors (the model) and a residual component (error). Typically, a researcher who applies regression is interested in predicting a quantitative dependent variable from one or more quantitative independent variables, and in determining the relative contribution of each independent variable to the prediction: there is interest in what proportion of the variation in the dependent variable can be attributed to variation in the independent variable(s). Regression also may employ categorical (also known as nominal or qualitative) predictors: the use of independent variables such as sex, marital status and type of teaching method is common. Moreover, as regression is the elementary form of GLM, it is possible to construct regression GLMs equivalent to any ANOVA and ANCOVA GLMs by selecting and organizing quantitative variables to act as categorical variables (see Chapter 2). Nevertheless, the convention of referring to these particular quantitative variables as categorical variables will be maintained.
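One common way of organizing quantitative variables to act as a categorical variable is dummy (0/1) coding. The sketch below assumes this coding scheme, which is only one of several possibilities, and uses invented scores for three hypothetical teaching methods. The helper `dummy_code` is written for illustration and is not a library function.

```python
import numpy as np


def dummy_code(labels):
    """Return (levels, matrix) with one 0/1 column per level except the first."""
    levels = sorted(set(labels))
    columns = [[1.0 if lab == lev else 0.0 for lab in labels] for lev in levels[1:]]
    return levels, np.array(columns).T.reshape(len(labels), len(levels) - 1)


# Hypothetical data: three teaching methods, three subjects in each.
method = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
score = np.array([4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 12.0, 14.0])

levels, D = dummy_code(method)
X = np.column_stack([np.ones(len(method)), D])  # intercept plus dummy variables
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

# With this coding the intercept estimates the mean of the first level and
# each remaining coefficient estimates a difference from that mean.
print(levels, beta)
print("fitted values:", X @ beta)
```

Under this coding the fitted value for every subject is the mean of that subject's group, so the regression carries the same information as the set of condition means.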
1.3.2 Analysis of variance
ANOVA also can be thought of in terms of a model plus error. Here, the dependent variable scores constitute the data, the experimental conditions constitute the model and the component of the data not accommodated by the model, again, is represented by the error term. Typically, the researcher applying ANOVA is interested in whether the mean dependent variable scores obtained in the experimental conditions differ significantly. This is achieved by determining how much variation in the dependent variable scores is attributable to differences between the scores obtained in the experimental conditions, and comparing this with the error term, which is attributable to variation in the dependent variable scores within each of the experimental conditions: there is interest in what proportion of variation in the dependent variable can be attributed to the manipulation of the experimental variable(s). Although the dependent variable in ANOVA is most likely to be measured on a quantitative scale, the statistical comparison is drawn between the groups of subjects receiving different experimental conditions and is categorical in nature, even when the experimental conditions differ along a quantitative scale. Therefore, ANOVA is a particular type of regression analysis that employs quantitative predictors to act as categorical predictors.
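The equivalence between the traditional ANOVA calculation and a regression with categorical predictors can be checked on a small example. The following sketch uses invented scores for three conditions and assumes dummy coding for the regression version. It illustrates the idea and is not a reproduction of any calculation in the text.

```python
import numpy as np

# Hypothetical scores from three experimental conditions.
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([10.0, 12.0, 14.0]),
]
y = np.concatenate(groups)
n, k = len(y), len(groups)
grand_mean = y.mean()

# Traditional variance partitioning: between-condition and within-condition SS.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# The same test as a regression GLM with dummy-coded condition membership.
condition = np.repeat(np.arange(k), [len(g) for g in groups])
dummies = (condition[:, None] == np.arange(1, k)).astype(float)
X = np.column_stack([np.ones(n), dummies])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_error = np.sum((y - X @ beta) ** 2)
ss_model = np.sum((y - grand_mean) ** 2) - ss_error
f_glm = (ss_model / (k - 1)) / (ss_error / (n - k))

print("F from variance partitioning:", f_anova)
print("F from regression GLM:", f_glm)
```

The two F-ratios agree because the between-condition sum of squares is the variation accommodated by the model, and the within-condition sum of squares is the error.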
1.3.3 Analysis of covariance
As ANCOVA is the statistical technique that combines regression and ANOVA, it too can be described in terms of a model plus error. As in regression and ANOVA, the dependent variable scores constitute the data, but the model includes not only experimental conditions, but also one or more quantitative predictor variables. These quantitative predictors, known as covariates (also concomitant or control variables), represent sources of variance that are thought to influence the dependent variable, but have not been controlled by the experimental procedures. ANCOVA determines the covariation (correlation) between the covariate(s) and the dependent variable and then removes that variance associated with the covariate(s) from the dependent variable scores, prior to determining whether the differences between the experimental condition (dependent variable score) means are significant. As mentioned, this technique, in which the influence of the experimental conditions remains the major concern, but one or more quantitative variables that predict the dependent variable also are included in the GLM, is labelled ANCOVA most frequently, and in psychology is labelled ANCOVA exclusively (e.g. Cohen & Cohen, 1983; Pedhazur, 1997, cf. Cox & McCullagh, 1982). A very important, but seldom emphasized, aspect of the ANCOVA method is that the relationship between the covariate(s) and the dependent variable, upon which the adjustments depend, is determined empirically from the data.
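The adjustment for a covariate can be sketched as a comparison of two GLMs: one containing only the covariate and one containing the covariate plus the experimental conditions. The example below uses invented data and hypothetical names, and it assumes a single regression slope common to both conditions. It is a minimal illustration rather than a full ANCOVA procedure.

```python
import numpy as np


def sse(X, y):
    """Residual sum of squares after a least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))


# Hypothetical data: two conditions, a covariate z and a dependent variable y.
condition = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
z = np.array([1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 3.9, 5.2, 5.8, 6.0, 7.1, 7.9, 9.2])

ones = np.ones_like(y)
reduced = np.column_stack([ones, z])          # covariate only
full = np.column_stack([ones, z, condition])  # covariate plus conditions

sse_reduced, sse_full = sse(reduced, y), sse(full, y)
df_effect = 1                      # one condition column added to the model
df_error = len(y) - full.shape[1]  # subjects minus parameters in the full GLM

# Variation attributable to conditions once the covariate is accounted for.
ss_adjusted = sse_reduced - sse_full
f_ratio = (ss_adjusted / df_effect) / (sse_full / df_error)

print("adjusted treatment SS:", ss_adjusted)
print("F for conditions, adjusted for the covariate:", f_ratio)
```

The slope relating the covariate to the dependent variable is estimated from these data by the least squares fit, which is the sense in which the adjustment is determined empirically.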
1.4 The 'general' in GLM
The term 'general' in GLM simply refers to the ability to accommodate both variables that represent quantitative distinctions on continuous measures, as in regression analysis, and variables that represent categorical distinctions between experimental conditions, as in ANOVA. This feature is emphasized in ANCOVA, where variables representing both quantitative and categorical distinctions are employed in the same GLM.
Traditionally, the label linear modelling was applied exclusively to regression analyses. However, as regression, ANOVA and ANCOVA are ...