1
Analysis of Variance
Between-Groups Designs
Robert A. Cribbie and Alan J. Klockars
Between-groups analysis of variance (ANOVA) is one of the most commonly used techniques to analyze data when the intent of the research is to determine whether one or more categorical independent variables (IVs) relate to a continuous dependent variable (DV). Causal statements regarding this relation are often clearest with random assignment of subjects to the various treatment groups (although see Chapter 30 for dealing with quasi-experimental designs). The broad area of ANOVA consists of significance tests to determine if a non-random relation exists between IV(s) and the DV, follow-up tests to investigate more thoroughly the nature of such a relation, and measures of the strength of the relation.
The following is an overview of some of the types of designs and experiments that can be analyzed by between-groups ANOVA. The number of IVs determines the type of design. For example, a design with a single IV is called a one-way design. ANOVA designs may also include more than one IV; if, for example, three IVs were completely crossed with one another, the design would be called a three-way factorial design. Crossing IVs in a factorial experiment typically enriches the major theory being investigated, providing for additional main effects as well as interactions between the factors.
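To make the one-way case concrete, here is a minimal sketch (hypothetical data and group names; assuming Python with scipy available) of the omnibus F test comparing three independent groups:

```python
import numpy as np
from scipy import stats

# Hypothetical outcome scores for three independent treatment groups
group_a = np.array([23, 25, 28, 31, 27])
group_b = np.array([30, 33, 29, 35, 32])
group_c = np.array([26, 24, 27, 25, 29])

# One-way between-groups ANOVA: omnibus test of equal population means
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```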
A random factor is one whose levels are randomly chosen from some universe of possible levels (e.g., dosages of a medication, hours of instruction), as distinguished from a fixed factor, for which all levels of interest are included (e.g., treatment and control). Random factors may be crossed with fixed factors, yielding a mixed model. If one IV is nested within another, the design can be labeled hierarchical. Individual difference variables might also be incorporated into ANOVA to reduce nuisance variability, as in a randomized block design, or as a continuous control variable within an analysis of covariance (ANCOVA). There is a rich literature on ANOVA, including experimental design texts by Kirk (1995), Maxwell and Delaney (2003), and Keppel and Wickens (2004).
1. Dependent Variables
Researchers must clearly outline the dependent variable under study, most importantly describing in detail how that outcome is being measured (for designs with multiple outcomes, see Chapter 24, this volume). The outcome variable is the set of numbers collected to serve as a proxy for the theoretical dependent variable. The authors must defend the outcome variable being used by, for example, establishing sufficient construct validity within the population being studied. The reader should be able to see the obvious fit between the measure and the construct.
Table 1.1 Desiderata for Analysis of Variance, Between-Groups Designs.
Desideratum | Manuscript Section(s)* |
1. The dependent variable(s) under study are outlined with a discussion of their importance within the field of study. | I |
2. Each discrete-level independent variable is defined and its hypothesized relation with the dependent variable is explained. | I |
3. A rationale is provided for the simultaneous inclusion of two or more independent variables and any interaction effects are discussed in terms of their relation with the dependent variable. | I |
4. Appropriate analyses are adopted when the research hypothesis relates to the equivalence of means. | I |
5. The inclusion of any covariate is justified in terms of its purpose within the analysis. | I |
6. The research design is explained in detail, including the nature/measurement of all independent and dependent variables. | M |
7. In randomized block designs, the number, nature, and method of creation of the blocks is discussed. | M |
8. The use of a random factor is justified given the hypotheses. | M |
9. In hierarchical designs, the rationale for nesting is explained and the analysis acknowledges the dependence in the data. | M |
10. In incomplete designs, or complex variants of other designs, sufficient information and references are provided. | M |
11. A rationale is given for the number of participants, the source of the participants, and any inclusion/exclusion criteria used. | M |
12. Missing data and statistical assumptions of the model are investigated and robust methods are adopted when issues arise. | M, R |
13. The final model is discussed, including defining and justifying the chosen error term and significance level. | M, R |
14. Follow-up strategies for significant main effects or interactions are discussed. | M, R |
15. Effect size and confidence interval information are provided to supplement the results of the statistical significance tests. | R |
16. Appropriate language, relative to the meaning and generalizability of the findings, is used. | D |
* I = Introduction, M = Methods, R = Results, D = Discussion.
For ANOVA, the outcome measure should be measured on a continuous scale. All analyses and decisions made in tests of significance concern the characteristics of the outcome measure, not the theoretical dependent variable directly. If the group means differ to a statistically significant degree on the outcome variable, this strongly suggests a relation between the grouping variable and the outcome. The researcher, however, does not typically want to discuss differences on the particular outcome measure used but rather differences on the theoretical dependent variable that the measure is supposed to tap. Only to the extent that the outcome measure provides a valid measurement of the theoretical construct will the conclusions drawn be relevant to the underlying theory.
The overall mean and variance of the outcome measure must be such that potential differences among the groups can be detected. Outcomes where subjects have scores clustered near the minimum or maximum possible on the instrument (i.e., floor or ceiling effects) provide little opportunity to observe a difference. This will reduce power and obscure differences that might really be present. Floor or ceiling effects also typically result in distributions that are not normal in shape (see Desideratum 12).
Authors often employ measures used in previously published literature. Of concern is the appropriateness of the measure in the new setting, such as with different age or ability levels. In factorial experiments, particularly randomized block designs where one of the factors is an individual difference measure on which subjects have been stratified, the inappropriateness of the measure at some levels of the blocking variable might incorrectly appear as an interaction of the blocks with the treatments. For example, consider a learning experiment where the impact of the treatment was measured by an achievement test and the subjects were blocked on ability. If one treatment were generally superior to the others across all levels of ability, the appropriate finding would be a main effect for treatment. However, if the outcome measure were such a difficult test that subjects in all but the highest ability group obtained essentially chance scores (which would not show treatment effects), while those in the highest ability group showed the true treatment effect, the result would be misreported as an aptitude-treatment interaction.
2. Independent Variables
A between-groups ANOVA requires unique, mutually exclusive groups. Typically, the groups reflect: (1) fixed treatments, (2) levels of an individual difference variable, or (3) levels of a random variable (i.e., a random factor). Fixed treatment groups are created such that they differ in some aspect of the way the participants are treated. The differences among treatments capture specific differences of interest and thus are usually treated as a fixed factor, with either qualitatively different treatments or treatments having different levels of intensity of some ordered variable. Categorical individual difference variables (e.g., race) may be included as the primary IV of interest or as blocking variables in a randomized block design. Random factors, where the levels of the IV are a subset of all potential levels available, must be treated differently from fixed factors given that there is variability associated with the selection of levels.
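As a hedged illustration of that difference, the sketch below (hypothetical variable and site names; assuming Python with pandas and statsmodels) treats the treatment as a fixed factor and a randomly sampled grouping variable, here called site, as a random factor via a random intercept:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: participants nested in randomly sampled sites,
# each site containing both treatment and control participants.
rng = np.random.default_rng(1)
n_sites, n_per_cell = 8, 10
rows = []
for s in range(n_sites):
    site_effect = rng.normal(0, 2)  # site-to-site variability
    for tx in ("ctrl", "tx"):
        scores = 50 + (3 if tx == "tx" else 0) + site_effect + rng.normal(0, 5, n_per_cell)
        rows += [{"score": sc, "treatment": tx, "site": f"s{s}"} for sc in scores]
df = pd.DataFrame(rows)

# Treatment is a fixed factor; site is modeled as a random factor
# (random intercepts), reflecting that the sites are a sample from a
# larger universe of possible sites.
fit = smf.mixedlm("score ~ C(treatment)", data=df, groups=df["site"]).fit()
print(fit.summary())
```

The random intercept for site acknowledges that the particular sites are a sample from a larger universe, so site-to-site variability enters the error structure rather than being treated as an exhaustive set of fixed levels.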
Regardless of the nature of the IV, it is important that researchers explain the hypothesized relation between the IV and DV. Unless the researcher clearly outlines the study as exploratory, the nature of the relation between the IV and DV should be explicit. For example, in most cases it is not sufficient for researchers simply to state that they are comparing the levels of the IV on the DV; instead, they should justify why the IV was included in the study and specify how they expect the levels of the IV to differ on the DV.
When treatment groups are created, the groups are the operational definitions of the theoretical independent variable of interest. In the report of the research, the relations described are generally in terms of the theoretical construct the groups were created to capture (e.g., stress), not in terms of the operations involved (e.g., a group was told the assignment counted for 50% of their grade). The way the treatments are defined must clearly capture the essence of the theoretical variable. Authors must defend the unique characteristics of the treatment groups, clearly indicating the crucial differences desired to address the theoretical variable of interest. The treatments must have face validity in that the reader sees the obvious linkage between the theoretical variable of interest and the operations used to create that variable.
There are a number of common shortcomings in the construction of treatment groups that can result in low power or confounded interpretations. For example, differences in the wording of reading prompts might have a subtle effect too small to be detected by an experiment unless an extremely large sample size is used. Low power can be the result of treatments that were implemented for too short a period or with insufficient intensity to be detected. Confounded effects can happen when groups differ in multiple ways, any one of which might produce observed differences. This can happen inadvertently, such as if one type of prompt required the study period to be longer than any of the other types of prompts. Any difference found might be due to either the differences in prompts or the differences in study time. In other experiments, the intent is to compare treatment groups that differ on complex combinations of many differences resulting in uncertainty regarding the ‘active ingredient’ that actually produced any group differences found.
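When a subtle manipulation is expected to produce only a small effect, an a priori power analysis clarifies how large the sample would need to be. A rough sketch (assuming Python with statsmodels; the effect size of Cohen's f = 0.10 is purely illustrative):

```python
from statsmodels.stats.power import FTestAnovaPower

# Total sample size needed to detect a small effect (Cohen's f = 0.10)
# across 3 groups with alpha = .05 and power = .80.
n_total = FTestAnovaPower().solve_power(effect_size=0.10, alpha=0.05,
                                        power=0.80, k_groups=3)
print(f"Approximate total N required: {n_total:.0f}")
```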
With regard to individual difference variables, it is important that the researcher explains whether the variable is being used as a primary IV or as a nuisance factor to reduce random variability, and in either case provide a theoretical rationale for including the variable. With random factors the author must defend the choices made with particular attention to the extreme levels (see Desideratum 8).
3. Inclusion of Two or More Independent Variables
Many settings use factorial experiments where multiple IVs are crossed to facilitate the assessment of both main effects and interactions. Sometimes one IV is central to an investigator’s research program, with the other IVs included for control and/or to expand upon the theory being explored. Authors should defend the inclusion of all IVs relative to their relation to the central theory they are testing. The explanation of an IV’s role should include both the main and interaction effects anticipated. It is not necessary for the researcher to include the interaction in the model if no hypothesis concerning the interaction exists.
Fixed IVs generally enrich theory. The inclusion of an IV that is a random factor is often meant to show the generalizability of the central findings across levels of the random factor. The choice of the random factor should be justified relative to the need for greater generalizability concerning the main effects (see Desideratum 8).
Factorial designs provide more information than a single factor design, but the authors should recognize the costs involved in terms of complexity of interpretation. The more complex the design, the more difficult it is to understand what, if anything, really differs. Interactions alter the interpretation of the main effects and thus, if an interaction exists, the main effects should not be discussed. For example, imagine that the effect of test type (short answer versus multiple choice) on grades differs across males and females (i.e., a test type by sex interaction). In this situation, it is not appropriate to discuss the overall effect of test type because the nature of the relation depends on the sex of the student. If an interaction is not statistically significant, authors sometimes over-generalize main effects, ignoring the possibility that the statistically non-significant interaction may be due to a lack of power (i.e., a Type II error). Tests of a lack of interaction can be used if the primary hypothesis relates to negligible interaction or if there is a desire to remove inconsequential interaction terms from a model. Such lack-of-interaction tests fall under the category of equivalence tests (see Desideratum 4).
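To make the test type by sex example concrete, the following sketch (hypothetical data; assuming Python with pandas and statsmodels) fits the two-way model including the interaction; a significant interaction term would signal that simple effects of test type within each sex, rather than the overall main effects, should be examined:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical grades by test type (short answer vs. multiple choice) and sex
df = pd.DataFrame({
    "grade":     [78, 82, 75, 88, 91, 85, 70, 72, 68, 90, 87, 93],
    "test_type": ["short", "short", "short", "mc", "mc", "mc"] * 2,
    "sex":       ["m"] * 6 + ["f"] * 6,
})

# Two-way between-groups ANOVA with the test_type x sex interaction
model = smf.ols("grade ~ C(test_type) * C(sex)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```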
Factorial designs with unequal numbers of observations across the cells require special attention as the main effects and interactions are not orthogonal (see Desideratum 12).
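A short sketch of how unequal cell sizes are typically handled in software (hypothetical data; assuming Python with statsmodels, where the sums-of-squares type is requested explicitly and Type III is only meaningful with sum-to-zero contrast coding):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical unbalanced design: unequal numbers of cases per cell
df = pd.DataFrame({
    "grade":     [78, 82, 75, 88, 91, 70, 72, 90, 87, 93, 85],
    "test_type": ["short"] * 3 + ["mc"] * 2 + ["short"] * 2 + ["mc"] * 4,
    "sex":       ["m"] * 5 + ["f"] * 6,
})

# Type II sums of squares (a common choice for unbalanced factorials)
fit2 = smf.ols("grade ~ C(test_type) * C(sex)", data=df).fit()
print(sm.stats.anova_lm(fit2, typ=2))

# Type III sums of squares require sum-to-zero (effects) coding of the factors
fit3 = smf.ols("grade ~ C(test_type, Sum) * C(sex, Sum)", data=df).fit()
print(sm.stats.anova_lm(fit3, typ=3))
```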
4. Difference-Based or Equivalence-Based Hypotheses
Often researchers propose that a relation exists between the IV(s) and the DV, but in other instances researchers propose that there is no relation between the IV(s) and the DV. Imagine that a researcher was interested in demonstrating that two treatments were equally effective, or that the effect of a treatment was similar across genders (i.e., a lack of interaction). In these instances, typical ANOVA procedures are not appropriate. Alternatives to the one-way and factorial between-subjects ANOVA for testing a lack of relation are available (from the field of equivalence testing) and should be adopted in these situations. Further, follow-up tests for nonsignificant omnibus tests of equivalence require specific procedures that differ in important ways from those discussed in Desideratum 14.
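As one common approach from the equivalence-testing literature, the following sketch implements the two one-sided tests (TOST) procedure for two independent means (hypothetical data and equivalence margin; assuming Python with numpy and scipy); the margin defines how small a difference must be to count as negligible:

```python
import numpy as np
from scipy import stats

def tost_two_means(x, y, margin, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of two independent means.

    Equivalence is declared if the mean difference can be shown to lie
    within (-margin, +margin); Welch (unequal-variance) t statistics are used.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff = x.mean() - y.mean()
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    se = np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    p_lower = stats.t.sf((diff + margin) / se, df)   # H0: diff <= -margin
    p_upper = stats.t.cdf((diff - margin) / se, df)  # H0: diff >= +margin
    p_tost = max(p_lower, p_upper)
    return p_tost, p_tost < alpha

# Hypothetical scores for two treatments believed to be equally effective,
# with an illustrative equivalence margin of 3 points
t1 = [52, 55, 51, 53, 54, 56, 52, 53]
t2 = [53, 54, 52, 55, 51, 54, 53, 52]
p, equivalent = tost_two_means(t1, t2, margin=3)
print(f"TOST p = {p:.3f}, equivalence declared: {equivalent}")
```

The equivalence margin should be justified on substantive grounds before the data are examined, rather than chosen to fit the observed results.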
5. Covariates
Authors should justify the inclusion of any covariate(s) relative to the theory being tested. Covariates can increase the power of the statistical tests within the ANOVA, but can also increase the Type I error rate if their selection allows for capitalization on chance. A posteriori or haphazardly selected covariates can capitalize on random variability and should not be used. Covariate scores must not be a function of the treatment group and, in randomized designs, should be available before random assignment takes place. Designs in which the covariate is obtained at the same time as the outcome raise the possibility that the covariate scores have been altered by the treatment. A primary function of the covariate is to provide a way to statistically equate groups to which the treatments are applied (e.g., subjects are randomly assigned to treatments although the bal...