Statistical Approaches to Measurement Invariance

About this book

This book reviews the statistical procedures used to detect measurement bias. Measurement bias is examined from a general latent variable perspective so as to accommodate different forms of testing in a variety of contexts including cognitive or clinical variables, attitudes, personality dimensions, or emotional states. Measurement models that underlie psychometric practice are described, including their strengths and limitations. Practical strategies and examples for dealing with bias detection are provided throughout.

The book begins with an introduction to the general topic, followed by a review of the measurement models used in psychometric theory. Emphasis is placed on latent variable models, with introductions to classical test theory, factor analysis, and item response theory, along with the controversies associated with each. Chapter 3 defines measurement invariance and bias in the context of multiple populations. Chapter 4 describes the common factor model for continuous measures in multiple populations and its use in the investigation of factorial invariance; identification problems in confirmatory factor analysis are examined along with estimation and fit evaluation, and an example using WAIS-R data is provided. Chapter 5 addresses the factor analysis model for discrete measures in multiple populations, with an emphasis on specification, identification, estimation, and fit evaluation, illustrated with MMPI item data. Chapter 6 reviews both dichotomous and polytomous item response scales, emphasizing estimation methods and model fit evaluation. Chapter 7 describes the use of item response theory models in evaluating invariance across multiple populations, including an example that uses data from a large-scale achievement test. Chapter 8 examines item bias evaluation methods that use observed scores to match individuals and provides an example that applies item response theory to data introduced earlier in the book. The book concludes with the implications of measurement bias for the use of tests in prediction in educational or employment settings.

A valuable supplement for advanced courses on psychometrics, testing, measurement, assessment, latent variable modeling, and quantitative methods taught in departments of psychology and education, this book will also be valued by researchers who must consider bias in measurement.

Statistical Approaches to Measurement Invariance, by Roger E. Millsap, is available in PDF and ePUB formats.


1
Introduction
This book is about measurement in psychology. It is addressed to researchers who want to study strategies for detecting bias in measurement, and who want to understand how these strategies fit in with the general theory of measurement in psychology. The methods to be described are statistical in nature, in line with the statistical nature of the models used in psychometrics to relate scores on tests to underlying psychological attributes. On the assumption that the reader may be unfamiliar with the details of these models, we will devote considerable space to describing them. The decision to include this material is motivated partly by the goal of making the book reasonably self-contained. An understanding of bias detection methods within item response theory (IRT), for example, is difficult to achieve by simply focusing on the detection methods themselves without first understanding the models and practices in IRT. A further motivation for including the material on measurement models is the conviction that bias detection methodology, while statistical in nature, is best understood through the psychometric perspective. In other words, while one can approach the topic as simply another application of statistics, in doing so one can easily forget what motivates the entire topic: inaccuracy in psychological measurement.
What is Measurement Invariance?
The idea of measurement invariance is best introduced by analogy with physical measurement, as physical analogies have long played a role in the development of measurement theories generally (Lord & Novick, 1968). Measurement invariance is built on the notion that a measuring device should function in the same way across varied conditions, so long as those varied conditions are irrelevant to the attribute being measured. A weight scale, for example, should register varied weights across objects whose weights actually differ. We would be concerned, however, if the scale produced different weight readings for two objects whose known weights are identical, but whose shapes are different. In this case, shape is a condition that is varying, but this shape condition is known to be irrelevant to actual weight for this pair of objects. After more study, if we consistently find that the scale gives different weight readings for objects with different shapes but identical known weights, we would conclude that something is wrong with our scale. The scale is producing biased weight readings as a function of shape. We would also say that the scale violates measurement invariance in relation to shape: across objects whose weights are identical but whose shapes differ, the scale produces weight readings that vary systematically with shape.
In psychological measurement, a test or questionnaire that is designed to measure a given attribute should reveal differences among individuals if those individuals actually differ on the attribute. A scale designed to measure depression should show differences among individuals whose depression levels vary, for example. On the other hand, we should not find that the test produces different results for people who are identical on the attribute, but who might differ on other, less relevant, variables. For example, among a group of males and females who are identical in their depression levels, we should not find that the test gives consistently different results for males and females. If we do find that the test functions in this way, we would conclude that the test violates measurement invariance in relation to gender and that the test shows measurement bias in relation to gender.
Chapter 3 will address the definition of measurement bias in depth, but it is useful to consider some general aspects of this topic now. As defined here, bias in measurement is equivalent to systematic inaccuracy in measurement. This inaccuracy is replicable, in contrast to the essentially random errors of measurement that determine the reliability of a test. A test that is reliable may, or may not, be biased. An unreliable test may still yield unbiased measurement in an average sense across repeated measurement. In contrast, a biased test may be highly reliable, yet may consistently provide inaccurate measurement even when the average of many repeated measurements is considered.
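The contrast between reliability and bias can be sketched numerically. In the following illustration (the true weight, error magnitudes, and the 2-unit offset are invented for the example), an unreliable but unbiased scale is accurate on average across many readings, while a reliable but biased scale is consistently off even on average:

```python
import random

random.seed(0)
true_weight = 70.0

# Unbiased but unreliable: large random error, centered on the true value.
unreliable = [true_weight + random.gauss(0.0, 2.0) for _ in range(10_000)]

# Biased but reliable: tiny random error, but systematically 2 units too high.
biased = [true_weight + 2.0 + random.gauss(0.0, 0.1) for _ in range(10_000)]

mean_unreliable = sum(unreliable) / len(unreliable)
mean_biased = sum(biased) / len(biased)

print(mean_unreliable)  # close to the true value of 70.0 on average
print(mean_biased)      # close to 72.0: inaccurate even on average
```

Individual readings from the unreliable scale scatter widely, yet their average recovers the true weight; the biased scale gives nearly identical readings every time, all of them wrong.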
Accepting this preliminary point of view, how can we know whether a measuring device is biased? In physical measurement, one might use the device repeatedly on the same object and then compare these results to the results produced by another device that is known to be more accurate (though possibly more expensive and time-consuming to use). In the weight example, one could find a group of objects whose weights are effectively known because they have been measured using more sensitive and accurate devices than the device being studied. In psychological measurement, however, this strategy is often impractical. Repeated measurement of the same individual using the same test will not often be possible without either creating boredom and fatigue or inducing real change in the individual through learning and memory. Furthermore, “gold standard” measures that are known to be free of bias and to be highly reliable do not exist in most areas of psychological measurement. The detection of bias in psychological measurement is a difficult problem for these reasons.
This book will focus on a particular aspect of the bias problem, which is bias that is related to a person’s membership in a group. For example, a cognitive test item may be solved correctly at a higher rate among males than females. If this finding is replicated across large samples of males and females, we may conclude that the higher rate of correct answers for males is a feature of the population of males and females and is not a result of sampling error. If males score systematically higher on the item than females, is this fact evidence of bias in the item? One answer to this question is “yes,” because the item systematically produces higher scores for males than females. In other words, systematic score differences between groups are evidence of bias, according to this viewpoint.
The above viewpoint confuses two different concepts, however: systematic group differences in scores on the item, and systematic inaccuracy in scores on the item. In the example, the higher rate of correct answers for males constitutes a group difference in scores but does not necessarily constitute systematic inaccuracy. Inaccuracy exists if the score on the item does not reflect the examinee’s actual status on the attribute being measured. If two examinees truly differ on this attribute, an accurate test should produce different scores for the two examinees. By the same argument, if males and females differ on average on the attribute, an accurate test should yield different scores on average for males and females. Hence, the gender difference in average item scores has multiple interpretations. The gender difference could reflect a real difference on the attribute, or it could fail to reflect any real difference on the attribute. If you believe that no real gender differences are possible, you must then believe that the test item is biased: the item is systematically inaccurate in relation to gender.
In most areas of psychological measurement, we cannot know with certainty whether the groups being compared are actually different on a given psychological attribute. The only data available on such questions are the data at hand, or previous research with the same or similar measures. Once the question of bias is raised, trying to answer the question by simply documenting group differences in item scores is futile. For example, it may be true that (a) the groups do differ systematically on the attribute being measured, (b) the test item yields scores that systematically differ across groups, yet (c) the test is biased. This situation would arise if the test item produced score differences across groups that were systematically too large or too small. If the possibility of real group differences on psychological attributes is accepted, it becomes impossible to detect bias by simply looking at group differences in item scores. Something more is required.
How can we make any progress in deciding whether the test item is biased? The first step is to work with a definition of measurement bias that goes beyond group differences in item scores. The definition of bias that is accepted widely now relies on the matching principle (Angoff, 1993). This principle leads to the definition of an unbiased test or item. Consider any two individuals from different groups (e.g., one male and one female) who are identical on the attribute(s) being measured by the test item. We say the item is unbiased in relation to these groups if the probability of attaining any particular score on the item is the same for the two individuals (Lord, 1980; Mellenbergh, 1989). The matching idea enters into this definition in the requirement that the two individuals be identical or matched on the attribute(s) being measured. It is essential that we compare individuals from different groups who are matched, rather than randomly chosen pairs of individuals. A biased test item is then defined as one in which the probability of attaining any particular score differs for the two individuals in spite of the matching. In other words, the score on a biased item will depend not only on the attribute being measured but also on the group membership of the individual under consideration.
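The matching definition can be stated formally. Using illustrative notation (not necessarily the book's own, which is introduced later), let $X$ be the item score, $W$ the attribute(s) the item is intended to measure, and $V$ group membership. The item is measurement invariant when, for all scores $x$, attribute values $w$, and groups $v$,

```latex
P(X = x \mid W = w,\; V = v) \;=\; P(X = x \mid W = w).
```

The item is biased when this equality fails for some value of $v$: individuals from different groups who are matched on $W$ then have different probabilities of attaining the same score.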
Several points should be noted about the definition of bias just described. First, the two individuals who have been matched on the attribute being measured may still achieve different scores on the test item on a particular occasion of measurement, even when the item is unbiased. Psychological measures are seldom perfectly reliable, and the two individuals may receive different scores due to this unreliability. Second, if we consider group differences in item scores in the general population without matching on the attributes, we may again encounter systematic group differences in item scores even though the item is unbiased. Group differences in item scores for the unbiased item reflect the actual magnitudes of differences on the attribute being measured. For this reason, the finding that a test produces systematic group differences in scores is ambiguous as long as no attempt is made to match individuals across groups.
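The second point can be illustrated with a small simulation. This is a sketch under assumed values: a one-parameter logistic item whose response curve is identical in both groups (so the item is unbiased by construction), and group attribute means of 0.5 and 0.0. An unmatched comparison shows a clear group difference in item scores; matching on the attribute makes it vanish:

```python
import math
import random

random.seed(1)

def p_correct(theta, b=0.0):
    # One-parameter logistic item response function, identical for both
    # groups: the item is unbiased by construction.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate(mu, n=50_000):
    """Draw attribute levels for one group and score the item once per person."""
    thetas = [random.gauss(mu, 1.0) for _ in range(n)]
    scores = [1 if random.random() < p_correct(t) else 0 for t in thetas]
    return thetas, scores

# The groups differ on the attribute itself (means 0.5 vs 0.0).
ref_thetas, ref_scores = simulate(0.5)
foc_thetas, foc_scores = simulate(0.0)

# Unmatched comparison: a systematic group difference in item scores ...
print(sum(ref_scores) / len(ref_scores))  # noticeably higher
print(sum(foc_scores) / len(foc_scores))

# ... yet among individuals matched on the attribute (theta near 0),
# the two groups succeed at essentially the same rate.
ref_matched = [s for t, s in zip(ref_thetas, ref_scores) if abs(t) < 0.1]
foc_matched = [s for t, s in zip(foc_thetas, foc_scores) if abs(t) < 0.1]
print(sum(ref_matched) / len(ref_matched))  # both near 0.5
print(sum(foc_matched) / len(foc_matched))
```

The unmatched difference here reflects only the real difference in the attribute, not any inaccuracy in the item.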
At this point, anyone who has closely read the above definition of measurement bias may ask the next big question: If we must match individuals on the attribute being measured before trying to detect bias, how can this be done? After all, we do not actually know the individual’s status on the attribute. If we knew the individual’s status, we would not need the test item and we would know whether the item is biased.
This book is about the various answers to the matching question, and the statistical methods that have been built on these answers. In the process of explaining the answers, it will be necessary to review some portion of the theory of psychological measurement. Without this theory as background, the reader will have difficulty understanding the various strategies adopted in pursuit of bias detection. In this review, we will emphasize the unity underlying the various latent variable models used in psychometrics, but we will also not minimize the perplexing problems that have arisen in psychometric theory. Psychometrics as a separate discipline is around 100 years old as of this writing, but it is a work in progress. Basic questions still loom, such as the problem of rigorously establishing a unit of measurement for psychological scales (Michell, 1999), or how test validation should be understood (Borsboom, 2005). The problem of bias detection itself is another example. It is hoped that by the end of this book, the reader will have gained some knowledge about strategies for bias detection, but also that the reader will appreciate the difficulties involved in bias detection and in psychological measurement generally.
Is Measurement Bias an Important Problem?
It is fair to ask why we should be concerned about bias in psychological measurement. For many researchers involved in applied settings, concerns about bias in measurement have already been effectively settled: There is little or no bias in measurement in tests used to make important decisions about people (Hunter & Schmidt, 2000; Jensen, 1980; Neisser et al., 1996; Sackett, Borneman, & Connelly, 2008; Sackett, Schmitt, Ellington, & Kabin, 2001). This sense that the question of bias is settled rests on several lines of evidence. Methods for detecting measurement bias have been around for decades (for an early review, see Berk, 1982), and so we might expect the bias question to have been thoroughly investigated by now. Major educational testing companies routinely use statistical methods to screen for item bias in relation to ethnicity or gender (Holland & Wainer, 1993). Inexpensive and statistically well-grounded methods for detecting item bias are available (Dorans & Holland, 1993). Furthermore, it has been argued that whatever measurement bias may be present has little impact on the use of tests in prediction or selection (Neisser et al.; Sackett et al.). This conclusion is partly based on the many empirical studies that have examined group differences in regressions or correlations between tests and criteria and have not found major differences (Schmidt & Hunter, 1998).
Although it is true that methods for detecting measurement bias have been around for decades, the methods that were in use prior to the mid-1980s had flaws that were not recognized until relatively recently. Early factor analytic approaches relied on exploratory methods that were inefficient and only considered limited aspects of factor structure (see Millsap & Meredith, 2007, for a review). The shift to confirmatory factor analytic methods in the 1980s initially failed to consider mean structures as part of the model, now recognized as an essential part of any test of factorial invariance (Gregorich, 2006). Whereas methods for bias detection based on IRT have been known for some time, software for actually conducting these analyses, apart from those based on the Rasch model, has been slow to develop. For example, the MULTILOG program (Thissen, 1991) has been the only program for evaluating likelihood-ratio (LR) tests of item bias under a variety of IRT models, but this program requires many computer runs for a single test. The IRTLRDIF program (Thissen, 2001) makes this process much more efficient but is a recent development. An alternative approach for item bias analyses is to use confirmatory factor analytic software that will handle discrete measures, as found in item data. The extension of such software to handle discrete measures is another recent development, however, and its dissemination for general use has been hindered by a general lack of knowledge about how these models should be specified and identified (Millsap & Yun-Tein, 2004).
Apart from latent variable methods such as factor analysis or IRT, a variety of methods that condition on observed scores to achieve matching across groups have been in use for some time. Early chi-square and ANOVA-based methods are now known to be flawed (Camilli & Shepard, 1994). The Mantel–Haenszel (MH) method is a great improvement and was available by the end of the 1980s (Holland & Thayer, 1988). This method is used by major testing organizations to screen for item bias on a large scale. The MH method has known weaknesses, however (see Chapter 8). Logistic regression methods can address some problems in the MH approach, but other problems remain. The development of observed-score methods for items with polytomous response formats has been slow. Finally, some hybrid methods that integrate latent variable and observed-score approaches are now available, but these are also relatively recent developments (Jiang & Stout, 1998; Raju, Van der Linden, & Fleer, 1995). We can conclude that although methods for bias detection have been available for decades, for much of this history, the methods in use have had significant weaknesses that are now known. The stronger methods have had limited use due to both the scarcity of efficient, comprehensive software and the general lack of awareness about the methods among researchers.
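A minimal sketch of the MH approach for a single dichotomous item follows. The stratum counts are hypothetical, invented for illustration; examinees are matched on total test score, a 2×2 table of correct/incorrect counts by group is formed within each score stratum, and the stratum tables are pooled into a common odds ratio (the delta transform shown, −2.35 ln α, is the standard ETS rescaling):

```python
import math

# Hypothetical counts for one item, one tuple per total-score stratum:
# (ref_correct, ref_wrong, focal_correct, focal_wrong)
strata = [
    (30, 70, 25, 75),   # low scorers
    (60, 40, 55, 45),   # middle scorers
    (85, 15, 80, 20),   # high scorers
]

num = den = 0.0
for a, b, c, d in strata:
    t = a + b + c + d                  # stratum total
    num += a * d / t                   # reference-correct x focal-wrong
    den += b * c / t                   # reference-wrong x focal-correct

alpha_mh = num / den                   # MH common odds ratio; 1.0 = no DIF
delta_mh = -2.35 * math.log(alpha_mh)  # ETS delta scale; negative favors reference

print(alpha_mh)
print(delta_mh)
```

With these made-up counts the odds of a correct answer favor the reference group at every matched score level, so the pooled odds ratio exceeds 1 and the delta value is negative, flagging the item for further scrutiny.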
Although statistical methods for studying measurement bias are difficult and may have had flaws historically, regression and correlation methods for predicting various criteria from test scores are well understood and widely used. As noted above, the lack of group differences in regressions or correlations between tests and criteria has been cited as evidence that measurement bias, if it exists, does not have any practical impact. This argument was taken seriously 30 years ago (see Jensen, 1980), but it should not be taken seriously now. The truth is that a test may show identical regressions when it is used to predict a criterion measure across multiple populations, even though substantial measurement bias exists in the test (Borsboom, Romeijn, & Wicherts, 2008; Millsap, 1997, 1998, 2007). The distinction between measurement bias and predictive bias is obscured by claims that “test bias” can be detected through scrutiny of group differences in the regression of a criterion measure on the test. While “test bias” has at times been equated with group differences in these regressions (Cleary, 1968), the presence or absence of such differences is not diagnostic for measurement bias.
The issue of bias in measurement has traditionally been viewed as primarily one of fairness and equitable treatment. This outlook arose in the context of high-stakes educational and employment testing, where decisions based on biased test results could unfairly penalize or stigmatize an entire group of people. Without question, the ethical and moral dimensions of bias in measurement are important. The issue of measurement bias has another side that receives less attention: the scientific use of bias detection methods as tools for understanding psychological measurement. When a test item is found to function differently depending on the examinee’s group membership, questions are raised about what the item is actually measuring. One option in such cases is to get rid of the item, a choice that is often taken in applied settings. A different option is to try to understand why the bias exists. Ordinarily, item bias is unanticipated, and the post hoc search for an explanation of the bias can be difficult. A different strategy is to generate hypotheses about the sources of the bias and then test these hypotheses by strategically selecting items to manifest hypothesized biases. Bias detection methods are then used to evaluate whether the expected bias is found. This type of research has been attempted but is uncommon (Roussos & Stout, 1996a; Scheuneman, 1987). Another strategy would be to experimentally manipulate the attribute measured by the test in a way that could be detected using bias detection methods. The idea here is that a treatment might alter the relationship between scores on the test and the underlying attribute being measured. This effect would then be detected using bias detection methods, with the groups being randomized treatment and control groups. The literature on alpha–beta–gamma change in organizational research (Millsap & Hartog, 1988; Riordan, Richardson, Schaffer, & Vandenberg, 2001) is an example of this type of research. 
Although uncommon at present, this strategy seems to be a potentially interesting approach whenever a treatment is intended to induce deeper levels of change.
About this Book
Terminology and Notation
A...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright
  5. Dedication
  6. Contents
  7. Preface
  8. Acknowledgments
  9. 1. Introduction
  10. 2. Latent Variable Models
  11. 3. Measurement Bias
  12. 4. The Factor Model and Factorial Invariance
  13. 5. Factor Analysis in Discrete Data
  14. 6. Item Response Theory: Models, Estimation, Fit Evaluation
  15. 7. Item Response Theory: Tests of Invariance
  16. 8. Observed Variable Methods
  17. 9. Bias in Measurement and Prediction
  18. References
  19. Author Index
  20. Subject Index