Chapter One
What is generalizability theory?
Generalizability theory: Origin and developments
Few measuring procedures, if any, can be perfectly accurate. In the social and health sciences in particular, but in the natural sciences as well, we can rarely assume our measurements to be absolutely precise. Whether we are attempting to evaluate attitudes to mathematics, managerial aptitude, perception of pain, or blood pressure, our scores and ratings will be subject to measurement error. This is because the traits or conditions that we are trying to estimate are often difficult to define in any absolute sense, and usually cannot be directly observed. So we create instruments that we assume will elicit evidence of the traits or conditions in question. But numerous influences bear on this process of measurement and produce variability that ultimately introduces errors into the results. We need to study this phenomenon if we are to quantify and control it, and in this way assure maximum measurement precision.
Generalizability theory, or G theory, is essentially an approach to the estimation of measurement precision in situations where measurements are subject to multiple sources of error. It is an approach that not only provides a means of estimating the dependability of measurements already made, but that also enables information about error contributions to be used to improve measurement procedures in future applications. Lee Cronbach is at the origin of G theory, with seminal co-authored texts that remain to this day essential references for researchers wishing to study and use the methodology (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Cronbach, Rajaratnam, & Gleser, 1963).
The originality of the G theory approach lies in the fact that it introduced a radical change in perspective in measurement theory and practice. In essence, the classical correlational paradigm gave way to a new conceptual framework, deriving from the analysis of variance (ANOVA), whose fundamental aim is to partition the total variance in a data set into a number of potentially explanatory sources. Despite this profound change in perspective, G theory does not in any way contradict the results and contributions of classical test theory. It rather embraces them as special cases within a more general formulation, bringing together in a unified conceptual framework concepts and techniques that classical theory presented in a disparate, almost disconnected, way (stability, equivalence, internal consistency, validity, inter-rater agreement, etc.). The impact of the change in perspective goes beyond straightforward theoretical reformulation. The fact that several identifiable sources of measurement error (markers, items, gender, etc.) can simultaneously be incorporated into the measurement model and separately quantified means that alternative sampling plans can be explored with a view to controlling the effects of these variables in future applications. G theory thus plays a unique and indispensable role in the evaluation and design of measurement procedures.
That is why the Standards for educational and psychological testing (AERA, 1999), developed jointly by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (and hence familiarly known as the "Joint Standards"), stress the need to refer to G theory when establishing the validity and reliability of observation or testing procedures. The first two chapters immediately embrace this inferential perspective, in which generalization to a well-defined population is made on the basis of a representative random sample. The Standards explicitly refer to G theory at several points. For instance, the commentary for standard 2.10 states, with respect to reliability estimates based on repeated or parallel measures:
Where feasible, the error variances arising from each source should be estimated. Generalizability studies and variance component analyses are especially helpful in this regard. These analyses can provide separate error variance estimates for tasks within examinees, for judges and for occasions within the time period of trait stability. (AERA, 1999, p. 34)
We return later to some of the essential characteristics of the theory. For the moment we simply draw attention to two important stages in its evolution, of which the second can be considered as an extension of the first, since it has led to the expansion and considerable diversification of its fields of application.
As originally conceived, G theory was implicitly located within the familiar framework of classical test theory, a framework in which individuals (students, psychiatric patients, etc.) are considered as the objects of measurement, and the aim is to differentiate among them as reliably as possible. The principal requirement is to check that the instrument to be used, the test or questionnaire, can produce reliable measurements of the relative standing of the individuals on some given measurement scale, despite the inevitably disturbing influence on the measures of the random selection of the elements of the measurement instrument itself (the test or questionnaire items).
During the 1970s and 1980s, a potentially broader application of the model was identified by Jean Cardinet, Yvan Tourneur, and Linda Allal, who observed that the inherent symmetry in the ANOVA model that underpinned G theory was not being fully exploited at that time. They noted that in Cronbach's development of G theory the factor Persons was treated differently from all other factors, in that persons, typically students, were consistently the only objects of measurement. Recognizing and exploiting model symmetry (i.e., the fact that any factor in a factorial design has the potential to become an object of measurement) allows research procedures as well as individual measurement instruments to be evaluated. Thus, procedures for comparing subgroups (as in comparative effectiveness studies of various kinds) can also be evaluated for technical quality, and improved if necessary (Cardinet & Allal, 1983; Cardinet & Tourneur, 1985; Cardinet, Tourneur, & Allal, 1976, 1981, 1982). As these authors were expounding the principle of model symmetry, practitioners on both sides of the Atlantic were independently putting it into operation (e.g., Cohen & Johnson, 1982; Gillmore, Kane, & Naccarato, 1978; Johnson & Bell, 1985; Kane & Brennan, 1977).
Relative item difficulty, the mastery levels characterizing different degrees of competence, the measurement error associated with estimates of population attainment, the progress recorded between one stage and another within an educational program, the relative effectiveness of teaching methods, are all examples of G theory applications that focus on something other than the differentiation of individuals. To facilitate an extension to the theory, calculation algorithms had to be modified or even newly developed. Jean Cardinet and Yvan Tourneur (1985), whose book on G theory remains an essential reference in the French-speaking world, undertook this task. We explicitly place ourselves in the perspective adopted by these researchers.
An example to illustrate the methodology
The example
It will be useful at this point to introduce an example application to illustrate how G theory extends classical test theory, and in particular how the principle of symmetry enriches its scope. Let us suppose that a research study is planned to compare the levels of subject interest among students taught mathematics by one or the other of two different teaching methods, Method A and Method B. Five classes have been following Method A and five others Method B. A 10-item questionnaire is used to gather the research data. This presents students with statements of the following type about mathematics learning:
⢠Mathematics is a dry and boring subject
⢠During mathematics lessons I like doing the exercises given to us in class
and invites them to express their degree of agreement with each statement, using a 4-point Likert scale. Students' responses are coded numerically from 1 (strongly agree) to 4 (strongly disagree), and where necessary score scales are transposed, so that in every case low scores indicate low levels of mathematics interest and high scores indicate high levels of mathematics interest. There are then two possibilities for summarizing students' responses to the 10-item questionnaire: we can sum students' scores across the 10 items to produce total scores on a 10–40 scale (10 items, each with a score between 1 and 4), or we can average students' scores over the 10 items to produce average scores on a 1–4 scale, the original scale used for each individual item. If we adopt the second of these alternatives, then student scores higher than 2.5, the middle of the scale, indicate positive levels of interest, while scores below 2.5 indicate negative levels of interest; the closer the score is to 4, the higher the student's general mathematics interest level, and the closer the score is to 1 the lower the student's general mathematics interest level.
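As a concrete illustration of this scoring scheme, the short Python sketch below transposes the items whose raw coding runs opposite to the interest scale and then produces both a total and an average score. The particular responses and the positions of the transposed items are invented purely for the illustration; they are not data from the study.

# One student's raw responses to the 10 items, each coded 1-4
raw_responses = [2, 4, 3, 3, 1, 4, 2, 3, 3, 4]

# Hypothetical positions (0-based) of items whose raw coding runs opposite
# to the interest scale; these are transposed so that a high score always
# indicates high mathematics interest.
reverse_coded = {0, 4, 6}

scored = [5 - r if i in reverse_coded else r for i, r in enumerate(raw_responses)]

total_score = sum(scored)                    # on the 10-40 scale
average_score = total_score / len(scored)    # on the original 1-4 scale

print(total_score, round(average_score, 2))  # here 34 and 3.4; above 2.5 = positive interest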
Which reliability for what type of measurement?
As we have already mentioned, the aim of the research study is to compare two mathematics teaching methods, in terms of students' subject interest. But before we attempt the comparison we would probably be interested in exploring how "fit for purpose" the questionnaire was in providing measures of the mathematics interest of individual students. Of all the numerous indicators of score reliability developed by different individuals prior to 1960, Cronbach's α coefficient (Cronbach, 1951) remains the best known and most used (Hogan, Benjamin, & Brezinski, 2000). The α coefficient was conceived to indicate the ability of a test to differentiate among individuals on the basis of their responses to a set of test items, or of their behavior within a set of situations. It tells us the extent to which an individual's position within a score distribution remains stable across items. α coefficients take values between 0 and 1; the higher the value, the more reliable the scores. The α value in this case is 0.84. Since α values of at least 0.80 are conventionally considered to be acceptable, we could conclude that the questionnaire was of sufficient technical quality for placing students relative to one another on the scale of measurement. This is correct. But in terms of what we are trying to do here, namely to obtain a reliable measure of average mathematics interest levels for each of the two teaching methods, does a measure of internal consistency, which is what the α coefficient is, really give us the information we need about score reliability (or score precision)?
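For readers who wish to reproduce this kind of internal consistency check on their own data, a minimal Python sketch of the usual α formula is given below. The small data matrix is invented solely for illustration; the study's actual response data are not reproduced here.

import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a persons-by-items matrix of scores."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Invented toy data: 6 students by 4 items, each item scored 1-4
toy = np.array([[4, 3, 4, 3],
                [2, 2, 1, 2],
                [3, 3, 3, 4],
                [1, 2, 2, 1],
                [4, 4, 3, 4],
                [2, 1, 2, 2]])

print(round(cronbach_alpha(toy), 2))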
We refer to the Joint Standards again:
⦠when an instrument is used to make group judgments, reliability data must bear directly on the interpretations specific to groups. Standard errors appropriate to individual scores are not appropriate measures of the precision of group averages. A more appropriate statistic is the standard error of the observed score means. Generalizability theory can provide more refined indices when the sources of measurement are numerous and complex. (AERA, 1999, p. 30)
In fact, the precision, or rather the imprecision, of the measure used in this example depends in great part on the degree of heterogeneity among the students following each teaching method: the more heterogeneity there is, the greater is the contribution of the "students effect" to measurement error. This is in contrast with the classical test theory situation, where the greater the variance among students the higher is the "true score" variance and consequently the higher is the α value. Within-method student variability is a source of measurement error that should not be ignored. Moreover, other factors should equally be taken into consideration: in particular, variability among the classes (within methods), variability among the items in terms of their overall mean scores, as well as any interactions that might exist between teaching methods and items, between students (within classes) and items, and between classes and items.
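To see these potential contributions more concretely, the observed score of student s, in class c following method m, on item i, can be written as a sum of effects, one for each of the sources just listed. The notation below is generic ANOVA notation chosen for illustration; it is not the symbolism used in later chapters:

X_{mcsi} = \mu + \alpha_{m} + \beta_{c:m} + \gamma_{s:cm} + \delta_{i} + (\alpha\delta)_{mi} + (\beta\delta)_{ci:m} + (\gamma\delta)_{si:cm},

where \mu is the overall mean, \alpha_{m} the teaching method effect, \beta_{c:m} the class effect (classes nested within methods), \gamma_{s:cm} the student effect (students nested within classes), \delta_{i} the item effect, and the remaining terms the method-by-item, class-by-item, and student-by-item interactions, the last of which is confounded with residual error in this design. G theory proceeds by estimating a variance component for each of these effects.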
How does G theory help us?
As we will show, G theory is exactly the right approach to use for this type of application. It is sufficient to consider the two teaching methods as the objects of measurement and the other elements that enter into the study (items, students, and classes) as components in the measurement procedure, "conditions of measurement," potentially contributing to measurement error. In place of the α coefficient we calculate an alternative reliability indicator, a generalizability coefficient (G coefficient). Like the α coefficient, G coefficients are variance ratios. They indicate the proportion of total score variance that can be attributed to "true" (or "universe") score variance, which in this case is inter-method variation; the complement is the proportion of variance attributable to measurement error. Also like α, G coefficients take values between 0 (completely unreliable measurement) and 1 (perfectly reliable measurement), with 0.80 conventionally accepted as a minimum value for scores to be considered acceptably reliable. The essential difference between measurement error as conceived in the α coefficient and measurement error as conceived in a more complex G coefficient is that in the former case measurement error is attributable to one single source of variance, the student-by-item interaction (inconsistent performances of individual students over the items in the test), whereas in the latter case multiple sources of error variance are acknowledged and accommodated.
A G coefficient of relative measurement indicates how well a measurement procedure has differentiated among objects of study, in effect how well the procedure has ranked objects on a measuring scale, where the objects concerned might be students, patients, teaching methods, training programs, or whatever. This is also what the α coefficient does, but in a narrower sense. A G coefficient of absolute measurement indicates how well a measurement procedure has located objects of study on a scale, irrespective of where fellow objects are placed. Typically, "absolute" coefficients have lower values than "relative" coefficients, because in absolute measurement there are more potential sources of error variance at play. In this example, with 15 students representing each of the five classes, the relative and absolute G coefficients are 0.78 and 0.70, respectively (see Chapter 3 for details). This indicates that, despite the high α value for individual student measurement, the comparative study was not capable of providing an acceptably precise measure of the difference in effectiveness of the two teaching methods in terms of students' mathematics interest.
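In the notation commonly used in the G theory literature (given here only as an illustrative sketch; the book's own symbols are introduced in later chapters), the two coefficients are the variance ratios

E\rho^{2} = \frac{\sigma^{2}(\tau)}{\sigma^{2}(\tau) + \sigma^{2}(\delta)} \qquad \text{and} \qquad \Phi = \frac{\sigma^{2}(\tau)}{\sigma^{2}(\tau) + \sigma^{2}(\Delta)},

where \sigma^{2}(\tau) is the universe score variance of the objects of measurement (here the teaching methods), \sigma^{2}(\delta) is the relative error variance, and \sigma^{2}(\Delta) is the absolute error variance, which brings in additional variance components. Since \sigma^{2}(\Delta) can never be smaller than \sigma^{2}(\delta), the absolute coefficient \Phi can never exceed the relative coefficient E\rho^{2}, which is consistent with the values 0.70 and 0.78 reported above.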
In this type of situation, a plausible explanation for low reliability can sometimes be that the observed difference between the measured means is particularly small. This is not the case here, though: the means for the two teaching methods (A and B) were, respectively, 2.74 and 2.38 (a difference of 0.36) on the 1–4 scale. The inadequate values of the G coefficients result, rather, from the extremely influential effect of measurement error, attributable to the random selection of small numbers of attitude items, students, and classes, along with a relatively high interaction effect between teaching methods and items (again, Chapter 3 provides the details).
Standard errors of measurement, for relative and for absolute measurement, can be calculated and used in the usual way to produce confidence intervals (but note that adjustments are sometimes necessary, as explained in Chapters 2 and 3). In this example, the adjusted standard errors are equal to 0.10 and 0.11, respectively, when the mean results of the two teaching methods are compared. Thus a band of approximately two standard errors (more specifically 1.96 standard errors, under Normal distribution assumptions) would extend approximately ±0.20 around each mean for relative measurement and ±0.22 for absolute measurement. As a result, the confidence intervals around the method means would overlap, confirming that the measurement errors tend to blur the true method effects.
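The arithmetic behind this overlap check is easy to verify directly. The short Python sketch below simply re-applies the figures quoted in the text (method means of 2.74 and 2.38, adjusted standard errors of 0.10 and 0.11) with the 1.96 multiplier implied by the Normality assumption.

# Confidence bands around the two method means, using the adjusted standard
# errors quoted in the text for relative and absolute measurement.
mean_a, mean_b = 2.74, 2.38
z = 1.96  # multiplier for a 95% band under Normal distribution assumptions

for label, se in [("relative", 0.10), ("absolute", 0.11)]:
    half_width = z * se
    ci_a = (mean_a - half_width, mean_a + half_width)
    ci_b = (mean_b - half_width, mean_b + half_width)
    overlap = ci_a[0] <= ci_b[1]  # A's lower bound falls below B's upper bound
    print(f"{label}: A in [{ci_a[0]:.2f}, {ci_a[1]:.2f}], "
          f"B in [{ci_b[0]:.2f}, {ci_b[1]:.2f}], overlap: {overlap}")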
Optimizing measurement precision
Arguably the most important contribution of the G theory methodology, and the most useful for practitioners wanting to understand how well their measuring procedures work, is the way that it quantifies the relative contributions of different factors and their interactions to the error affecting measurement precision. G coefficients are calculated using exactly this information. But the same information can also be used to explore ways of improving measurement precision in a future application. In the example presented here, the principal sources of measurement error were found to be inter-item variation, inter-class (within method) variation, and inter-student (within class) variation. The interaction effect between methods and items also played an important role. Clearly, the quality of measurement would be improved if these major contributions to measurement error could be reduced in some way. A very general, but often efficient, strategy to achieve this is to use larger samples of component elements in a future application, th...