Handbook of Item Response Theory Modeling

Applications to Typical Performance Assessment

Edited by Steven P. Reise and Dennis A. Revicki
About This Book

Item response theory (IRT) has moved beyond the confines of educational measurement into assessment domains such as personality, psychopathology, and patient-reported outcomes. This new volume reviews classic and emerging IRT methods and applications that are revolutionizing psychological measurement, particularly for the health assessments used to demonstrate treatment effectiveness. World-renowned contributors present the latest research and methodologies about these models along with their applications and related challenges. Examples using real data, some from NIH-PROMIS, show how to apply these models in actual research situations. Chapters review fundamental issues of IRT, modern estimation methods, testing assumptions, evaluating fit, item banking, scoring in multidimensional models, and advanced IRT methods. New multidimensional models are presented along with suggestions for deciding among the family of IRT models available. Each chapter provides an introduction, describes state-of-the-art research methods, demonstrates an application, and provides a summary. The book addresses the most critical IRT conceptual and statistical issues confronting researchers and advanced students in psychology, education, and medicine today. Although the chapters highlight health outcomes data, the issues addressed are relevant to any content domain.

The book addresses:

IRT models applied to noneducational data, especially patient-reported outcomes

Differences between cognitive and noncognitive constructs and the challenges these bring to modeling

The application of multidimensional IRT models designed to capture typical performance data

Cutting-edge methods for deriving a single latent dimension from multidimensional data

A new model designed for the measurement of constructs that are defined at one end of a continuum, such as substance abuse

Scoring individuals under different multidimensional IRT models and item banking for patient-reported health outcomes

How to evaluate measurement invariance, diagnose problems with response categories, and assess growth and change

Part 1 reviews fundamental topics such as assumption testing, parameter estimation, and the assessment of model and person fit. Part 2 examines classic, new, and emerging IRT models, including models for multidimensional data and the use of new IRT models in typical performance measurement contexts. Part 3 reviews the major applications of IRT models, such as scoring, item banking for patient-reported health outcomes, evaluating measurement invariance, linking scales to a common metric, and measuring growth and change. The book concludes with a look at future IRT applications in health outcomes measurement. The book summarizes the latest advances and critiques foundational topics such as multidimensionality, assessment of fit, and the handling of non-normality, as well as applied topics such as differential item functioning and multidimensional linking.

Intended for researchers, advanced students, and practitioners in psychology, education, and medicine interested in applying IRT methods, this book also serves as a text in advanced graduate courses on IRT or measurement. Familiarity with factor analysis, latent variables, IRT, and basic measurement theory is assumed.


Information

Publisher: Routledge
Year: 2014
ISBN: 9781317565697
Edition: 1
Length: 466 pages
Language: English
Part I
Fundamental Issues in Item Response Theory

1
Introduction

Age-Old Problems and Modern Solutions
Steven P. Reise and Dennis A. Revicki
The statistical foundation of item response theory (IRT) is often traced back to the seminal work of Lord and Novick (1968), including Birnbaum's contributed chapters. The subsequent development, research, and application of IRT models and related methods link directly to the need of large-scale testing companies, such as the Educational Testing Service, to solve statistical as well as practical problems in educational assessment (i.e., the measurement of aptitude, achievement, and ability constructs). Daunting problems in this domain include the challenge of administering different test items to demographically diverse individuals across multiple years while maintaining scores that are comparable on the same scale. This test score comparability problem traditionally has been addressed with "test-score equating" methods, but IRT-based "linking" strategies are now used more routinely (see Chapter 19).
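To make the linking idea concrete, the sketch below illustrates mean/sigma linking, one of the simplest strategies for placing item parameters from two separate calibrations on a common metric; the anchor-item difficulty values and variable names here are hypothetical, and Chapter 19 treats the topic in depth.

```python
import numpy as np

def mean_sigma_link(b_ref, b_new):
    """Mean/sigma linking: estimate the linear transformation
    theta* = A * theta + B that places a new calibration on the
    reference scale, using difficulty (b) estimates for items
    common to both calibrations."""
    b_ref, b_new = np.asarray(b_ref), np.asarray(b_new)
    A = b_ref.std(ddof=1) / b_new.std(ddof=1)  # scale adjustment
    B = b_ref.mean() - A * b_new.mean()        # location adjustment
    return A, B

# Hypothetical difficulty estimates for five anchor items that were
# administered in both calibrations.
b_reference = [-1.2, -0.4, 0.1, 0.8, 1.5]
b_new_form = [-1.0, -0.2, 0.3, 1.0, 1.7]

A, B = mean_sigma_link(b_reference, b_new_form)
theta = 0.5           # a score expressed on the new form's metric
print(A * theta + B)  # the comparable score on the reference metric
```

The same transformation rescales item parameters (b* = A·b + B, a* = a/A), which is what makes scores from the two calibrations directly comparable.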
The application of IRT models and methods in educational assessment is now commonplace (e.g., see almost any recent issue of the Journal of Educational Measurement), especially for large-scale testing firms that employ dozens of world-class psychometricians, content experts, and item writers on their research staffs. The application of IRT models and related statistical methods in the fields of personality, psychopathology, patient-reported outcomes (PRO), and health-related quality-of-life (HRQOL) measurement, in contrast, has only recently begun to proliferate in research journals. In these noneducational or "typical performance" domains, the application of IRT has gained popularity for much the same reasons as in large-scale educational assessment; that is, to solve practical and technical problems in measurement.
The National Institutes of Health (NIH) Patient Reported Outcome Measurement Information System (PROMIS®), for example, has developed multiple item banks for measuring various physical, mental, and social health domains (Cella et al., 2007; Cella et al., 2010). Similarly, the Quality of Life in Neurological Disorders (www.neuroqol.org) and NIH Toolbox (www.nihtoolbox.org) have also employed IRT methods of scale development and item analysis. One of the chief motivations underlying the application of IRT methods in these projects was to solve a long-standing and well-recognized problem in health outcomes research; namely, that for any important construct, there are typically half a dozen or so competing measures of unknown quality and questionable validity. This chaotic measurement situation, with dozens of researchers studying the same phenomena using different measurement tools, fails to promote good research and inhibits the cumulative aggregation of research results.
Large-scale IRT application projects, such as PROMIS®, have not only raised awareness of the technical and practical challenges of applying IRT models to psychological or PRO data in general, but also have uncovered the many and varied special problems and concerns that arise in applying IRT outside of educational assessment (see also Reise & Waller, 2009). We will highlight several of these critical challenges later in this chapter to set a context for the present volume. Before doing so, however, we note that thus far, standard IRT models and methods have been imported into noneducational measurement contexts essentially without modification. In other words, there has been little in the way of "new models" or "new statistical methods" uniquely appropriate for PRO or any other type of noneducational data (but see Chapter 13).
This egalitarian stance (that the same IRT models and methods should be used for all constructs, educational or PRO) was perhaps critical in the early stages of IRT exploration and application in new domains. Inevitably, we believe, further progress will require new IRT-based psychometric approaches tailored particularly to the measurement challenges of noneducational assessment. We will expand on this in the final chapter. For now, prior to previewing the chapters in this edited volume, we briefly discuss in the following section some critical differences between educational and noneducational constructs, data, and assessment contexts, as these relate to the application of IRT models. We argue that although there are fundamental technical issues in applying IRT to any domain (e.g., dimensionality issues, assessing model-to-data fit), unique challenges arise when applying IRT to noneducational data due to the nature of the constructs (e.g., limited conceptual breadth, questionable applicability across the entire population) and of the item response data (e.g., non-normal latent trait distributions).

Educational Versus Noneducational Measurement

It is well recognized that psychological constructs, both cognitive and noncognitive, can be conceptualized as hierarchically arranged, ranging from very general, to middle-level, to conceptually narrow, to specific behaviors (Clark & Watson, 1995). Since Loevinger (1957), it has also been well recognized (although not necessarily realized in practice by scale developers) that the position of a construct in this hierarchy has profound implications for all aspects of scale development, psychometric analyses, and, ultimately, validation of test score inferences.
Almost by definition, measures of broad bandwidth constructs (intelligence, verbal ability, negative affectivity, general distress, overall life satisfaction, or QOL) must have heterogeneous item content to capture the diversity of trait manifestations. In turn, item intercorrelations, item-test correlations, and factor loadings/IRT slopes are expected to be modest in magnitude, and item communalities low. Moreover, the resulting factor structures may (must?) be multidimensional to some degree, perhaps with a strong general factor and several so-called group or specific factors corresponding to more content-homogeneous domains (see Chapter 2).
On the other hand, just the opposite psychometric properties would be expected for measures of conceptually narrow constructs (mathematics self-efficacy, primary narcissism, fatigue, pain interference, germ phobia). In this latter context, the content diversity of trait manifestation is very limited (by definition of the construct); as a consequence, item content is homogeneous, and the conceptual distance between the item content and the latent trait is small. In turn, this can result in very high item intercorrelations, item-test correlations, and factor loadings/IRT slopes. In factor analyses, essential unidimensionality would be the expectation, as would high item communalities. Finally, in contrast to broadband measures, where local independence violations are typically caused by clusters of content-similar items, in narrowband measures local independence violations are typically caused by having the same item content repeated over and over with slight variation (e.g., "I have problems concentrating," "I find it hard to concentrate," "I lose my concentration while driving," "It is sometimes hard for me to concentrate at work").
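To illustrate how such repeated item content shows up in data, here is a minimal simulation sketch (all loadings, sample sizes, and item counts are invented): two of six items are treated as near-paraphrases sharing an extra error component, and the residual correlations left over after removing a single common factor expose the local dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
theta = rng.standard_normal(n)                 # latent trait scores

# Six hypothetical items driven by one factor; items 0 and 1 are
# near-paraphrases, so they also share a "doublet" error component.
loadings = np.array([0.8, 0.8, 0.7, 0.7, 0.6, 0.6])
items = loadings * theta[:, None] + 0.5 * rng.standard_normal((n, 6))
doublet = rng.standard_normal(n)
items[:, 0] += 0.5 * doublet
items[:, 1] += 0.5 * doublet

R = np.corrcoef(items, rowvar=False)

# One-factor approximation via the first principal component:
vals, vecs = np.linalg.eigh(R)                 # eigenvalues ascend
pc = vecs[:, -1] * np.sqrt(vals[-1])           # approximate loadings
residuals = R - np.outer(pc, pc)               # observed minus implied
np.fill_diagonal(residuals, 0.0)

print(residuals.round(2))  # the (0, 1) entry stands out from the rest
```

In a real analysis the residuals would come from a fitted factor or IRT model rather than a principal component; the point is only that paraphrased items share variance a single common factor cannot absorb.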
In our judgment, applications of IRT in educational measurement have tended toward the more broadband constructs, such as verbal and quantitative aptitude, or comprehensive licensure testing contexts (which also involve competencies across a heterogeneous skill domain). In contrast, we argue that with few exceptions, applications of IRT in noneducational measurement have primarily been with constructs that are relatively conceptually narrow. As a consequence, IRT applications in noneducational measurement contexts present some unique challenges, and the results of such applications can be markedly different from a typical IRT application in education.
For illustration, Embretson and Reise (in preparation) report on an analysis of the PROMIS® anger item set (see Pilkonis et al., 2010), a set of 29 items rated on a 1-to-5 response scale. Anger is arguably conceptually narrow because there simply are not that many ways of being angry (especially when rated within the past seven days); that is, the potential pool of item content is very limited, unlike a construct such as spelling or reading comprehension, where the pool of items is virtually inexhaustible. Accordingly, coefficient alpha was 0.96, and the ratio of the first to the second eigenvalue was around 15 to 1, suggesting unidimensionality, or at least a strong common factor. Fitting a unidimensional confirmatory factor analysis resulted in an "acceptable" fit by conventional standards. However, univariate and multivariate Lagrange tests indicated that 407 and 157 correlated residuals, respectively, needed to be estimated (set free). This unambiguous evidence against the data meeting the unidimensionality/local independence assumption was not due to the anger data being in any real sense of the term "multidimensional," with substantively interpretable distinct factors, but rather to the data having many sizeable correlated residuals (violations...
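The preliminary summaries reported above (coefficient alpha and the first-to-second eigenvalue ratio) are straightforward to compute on any person-by-item score matrix. A minimal sketch follows, using simulated 1-to-5 ratings rather than the actual PROMIS® anger data; note that, as the example shows, both summaries can look excellent even when Lagrange multiplier tests reveal many correlated residuals.

```python
import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha for an n-persons-by-k-items score matrix."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_var_sum = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def first_to_second_eigenvalue_ratio(X):
    """Ratio of the first to second eigenvalue of the item correlation
    matrix; a large ratio suggests one dominant common factor."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return eigvals[-1] / eigvals[-2]        # eigvalsh sorts ascending

# Simulated stand-in for a 29-item, 1-5 rated scale (values invented).
rng = np.random.default_rng(1)
theta = rng.standard_normal(1000)
raw = 0.9 * theta[:, None] + 0.5 * rng.standard_normal((1000, 29))
X = np.clip(np.round(3 + raw), 1, 5)        # coarsen to 1-5 categories

print(cronbach_alpha(X), first_to_second_eigenvalue_ratio(X))
```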
