1
Managing Ongoing Changes to the Test
Agile Strategies for Continuous Innovation
Cynthia G. Parshall and Robin A. Guille
Introduction
When an exam program is delivered via computer, a number of new measurement approaches are possible. These changes include a wide range of novel item types, such as the hot spot item or an item with an audio clip. Beyond these relatively modest innovations lie extensive possibilities, up to and including computerized simulations. The term innovative item type is frequently used as an overarching designation for any item format featuring these types of changes (Parshall, Spray, Kalohn, & Davey, 2002). The primary benefit that these new item types offer is the potential to improve measurement. When they are thoughtfully designed and developed, novel assessment formats can increase coverage of the test construct, and they can increase measurement of important cognitive processes (Parshall & Harmes, 2009). A further advantage provided by some innovations is the opportunity to expand the response space and collect a wider range of candidate behaviors (e.g., DiCerbo, 2004; Rupp et al., 2012).
However, the potential benefits offered by innovative item types are not guaranteed. Furthermore, to successfully add even a single new item type to an exam may require substantial effort. When an exam program elects to add several innovations, the costs, complexities, and risks may be even higher. Part of the challenge in adding innovative item types is that so much about them is new to testing organization staff and stakeholders. And while the standard approaches for processes and procedures serve an exam program well in the development of traditional item types, they fail to meet the needs that arise when designing new item types. A more flexible approach is needed in these cases, ideally one that provides for “experimental innovation” (Sims, 2011), in which solutions are built up over time, as learning occurs. Looking to the future, a likely additional challenge with test innovations is that the measurement field and all aspects of technology are going to continue to advance. Testing organizations may need to begin thinking of innovation and change as an ongoing, continuous element that needs to be addressed.
The research and development team at the American Board of Internal Medicine (ABIM) sought a strategic approach that would help them manage the task of continuous change in their exam programs. The methods presented in this chapter support their goal of a strategic and sustainable process. The heart of the process is an Agile implementation philosophy (Beck et al., 2001) coupled with a semistructured rollout plan.
These approaches, individually and in combination, are presented in this chapter as useful strategies for managing ongoing assessment innovation. They are also illustrated through a case study based on one of ABIM’s recent innovations. It is hoped that these methods will also be useful to other organizations that anticipate the need to strategically manage continuous innovations.
Background on Innovative Item Types
The primary reason for including innovative item types on an assessment is to improve the quality of measurement (Parshall & Harmes, 2009). The ideal innovative item type would increase construct representation while avoiding construct-irrelevant variance (Huff & Sireci, 2001; Sireci & Zenisky, 2006). Potential benefits of innovative item types include greater fidelity to professional practice (Lipner, 2013), the opportunity to increase measurement of higher-level cognitive skills (Wendt, Kenny, & Marks, 2007), the ability to measure broader aspects of the content domain (Strain-Seymour, Way, & Dolan, 2009), and the possibility of scoring examinees’ processes as part of the response as well as their products (Behrens & DiCerbo, 2012).
The term “innovative item types” has been used most often to describe these alternative assessment methods, though in the field of educational testing, the term “technology-enhanced items” has also become common (e.g., Zenisky & Sireci, 2013). Both phrases are broadly inclusive terms that have been used to encompass a very wide range of potential item types and other assessment structures. In general, any item format beyond the traditional, text-based, multiple-choice item type may be considered to be an innovative item type, though the most complex computerized assessment structures are more typically referred to as case-based simulations (Lipner et al., 2010). Item formats that are possible but rarely used in paper-based testing are often included in the category of innovative item types, because the computer platform may mean they are easier to deliver (e.g., an item with a full-color image or an audio clip) or to score (e.g., a short-answer item, a drag-and-drop matching item).
The range of innovative item types that could be created is so great that various compendia and taxonomies have been produced in an effort to help define the field. For example, Sireci and Zenisky (2006) present a large number of item formats, including extended multiple choice, multiple selection, specifying relationships, ordering information, select and classify, inserting text, corrections and substitutions, completion, graphical modeling, formulating hypotheses, computer-based essays, and problem-solving vignettes. Multiple categorization schemas for innovative item types have also been proposed (e.g., Scalise & Gifford, 2006; Strain-Seymour, Way, & Dolan, 2009; and Zenisky & Sireci, 2002). For example, in Parshall, Harmes, Davey, and Pashley’s (2010) taxonomy, seven dimensions are used to classify innovative item types. These dimensions are assessment structure, response action, media inclusion, interactivity, complexity, fidelity, and scoring method.
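As a rough illustration of how such a taxonomy might be used in practice, the sketch below records an item type as a profile over the seven dimensions named above and reports which dimensions differ between two item types. The class name, the function name, and every dimension value (e.g., "low", "video") are hypothetical placeholders chosen for this example; they are not categories defined by the taxonomy's authors.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ItemTypeProfile:
    """One item type described on the seven taxonomy dimensions.

    The dimension names come from the text; the string values used
    below are illustrative assumptions only.
    """
    name: str
    assessment_structure: str
    response_action: str
    media_inclusion: str
    interactivity: str
    complexity: str
    fidelity: str
    scoring_method: str


# The seven classification dimensions (every field except the label).
DIMENSIONS = tuple(f.name for f in fields(ItemTypeProfile) if f.name != "name")


def differing_dimensions(a: ItemTypeProfile, b: ItemTypeProfile) -> list[str]:
    """Return the dimensions on which two item types are classified differently."""
    return [d for d in DIMENSIONS if getattr(a, d) != getattr(b, d)]


# Hypothetical profiles: a text-only multiple-choice item and the same
# item format with a video clip added to the stem.
traditional_mc = ItemTypeProfile(
    name="text multiple choice",
    assessment_structure="selected response",
    response_action="click",
    media_inclusion="none",
    interactivity="low",
    complexity="low",
    fidelity="low",
    scoring_method="dichotomous key",
)
video_mc = ItemTypeProfile(
    name="multiple choice with video clip",
    assessment_structure="selected response",
    response_action="click",
    media_inclusion="video",
    interactivity="low",
    complexity="low",
    fidelity="moderate",
    scoring_method="dichotomous key",
)

if __name__ == "__main__":
    print(differing_dimensions(traditional_mc, video_mc))
```

A comparison of this kind can make explicit which parts of the development process (media production, scoring, delivery) a proposed item type is likely to affect, though the judgments behind each value would still need to come from content and measurement staff.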
The extensive lists of innovative item types provided in compendia and taxonomies typically include a fair number that have never been used operationally. In some cases, an item type was developed as part of the preliminary research a testing organization devoted to new item types. As such, even the incomplete development of an alternative item type might have been a valuable learning experience for the organization. In other cases, intractable problems (e.g., the lack of a workable scoring solution) were uncovered late in the development process, and the novel item type had to be abandoned.
For the first decade or more of operational computer-based tests (CBTs), if an exam program wanted to implement any nontraditional item types, custom software development was required. In fact, all the early CBTs required custom software development, even to deliver the traditional multiple-choice item type, since there were no existing CBT applications. Nevertheless, expanding beyond multiple-choice items required further effort, and most exam programs continued to deliver tests using that sole item type. Only a handful of exam programs pursued customized item type development (e.g., Bejar, 1991; Clauser, Margolis, Clyman, & Ross, 1997; O’Neill & Folk, 1996; Sands, Waters, & McBride, 1997). It was an expensive and time-consuming process, as extensive work was needed to support the underlying psychometrics, as well as the software development, and the effort did not always result in an item type that could be successfully used.
Over time, wide-scale changes in the CBT field occurred. These changes included the development of commercial CBT software, such as item banks and test-delivery applications. Testing organizations are now able to contract with a commercial firm for applications such as these rather than undertaking proprietary software development. In a related development, measurement-oriented interoperability specifications such as the Question and Test Interoperability standard (QTI; IMS, 2012) were established. The QTI specification represents test forms, sections, and items in a standardized XML syntax. This syntax can be used to exchange test content between software products that are otherwise unaware of each other’s internal data structures. As a result of these technological developments, all testing organizations have become much less isolated. There is much greater integration and communication across software systems, as well as more standardization of the elements included in different software applications.
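To make the idea of a standardized XML item representation more concrete, the following sketch parses a simplified, QTI-style multiple-choice item using only the Python standard library. The element and attribute names are intended to follow QTI 2.1 conventions, but the fragment is an assumption-laden illustration: it has not been validated against the official schema, and the item content, the identifier, and the `read_item` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by QTI 2.1 item documents (assumed here for illustration).
QTI_NS = "http://www.imsglobal.org/xsd/imsqti_v2p1"

# A simplified, hypothetical single-response item. A real item exported
# from an item bank would normally carry more metadata and response
# processing rules than are shown here.
ITEM_XML = f"""<assessmentItem xmlns="{QTI_NS}" identifier="demo-item-1"
    title="Demo item" adaptive="false" timeDependent="false">
  <responseDeclaration identifier="RESPONSE" cardinality="single"
      baseType="identifier">
    <correctResponse><value>B</value></correctResponse>
  </responseDeclaration>
  <itemBody>
    <choiceInteraction responseIdentifier="RESPONSE" shuffle="false"
        maxChoices="1">
      <prompt>Hypothetical stem text.</prompt>
      <simpleChoice identifier="A">First option</simpleChoice>
      <simpleChoice identifier="B">Second option</simpleChoice>
      <simpleChoice identifier="C">Third option</simpleChoice>
    </choiceInteraction>
  </itemBody>
</assessmentItem>"""


def read_item(xml_text: str) -> dict:
    """Extract the identifier, answer key, and options from an item document."""
    ns = {"qti": QTI_NS}
    root = ET.fromstring(xml_text)
    # The keyed answer sits inside the response declaration.
    key = root.findtext(
        "qti:responseDeclaration/qti:correctResponse/qti:value",
        namespaces=ns,
    )
    # Collect each option's identifier and display text, in document order.
    choices = {
        choice.get("identifier"): (choice.text or "").strip()
        for choice in root.iterfind(".//qti:simpleChoice", ns)
    }
    return {"identifier": root.get("identifier"), "key": key, "choices": choices}


if __name__ == "__main__":
    print(read_item(ITEM_XML))
```

Because the receiving program reads only the shared XML vocabulary, it needs no knowledge of how the authoring system stores items internally, which is the interoperability benefit described above.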
Under these newer, more integrated software conditions, the development of customized assessment innovations is relatively streamlined in comparison to the past. In some cases, the IT department at a testing organization may develop a plug-in for an item type feature that will then work within the larger set of CBT software for delivery and scoring. In other cases, a testing organization may work with a third-party vendor that specializes in CBT item/test software development to have a more elaborate innovation custom developed (e.g., Cadle, Parshall, & Baker, 2013). These technological changes have undoubtedly made the development of customized item types more achievable, though substantial challenges, including potentially high costs, remain.
One area of interest requiring customization is the development of multistep, integrated tasks or scenarios. Behrens and DiCerbo (2012) refer to this approach as the shift from an item paradigm to an activity paradigm. One goal often present when these task-based assessments are considered is the opportunity to focus on the examinee’s process as well as the end product (e.g., Carr, 2013; DiCerbo, 2004; Mislevy et al., this volume; Rupp et al., 2012). In some cases, though the task may be designed to be process oriented, the outcome is still product oriented (Zenisky & Sireci, 2013). The response formats in these cases often include traditional approaches such as the multiple-choice and essay item types (e.g., Steinhauer & Van Groos, 2013). Other response formats use more complex approaches (e.g., Cadle, Parshall, & Baker, 2013; Carr, 2013; Steinhauer & Van Groos, 2013). When researchers and developers are interested in the examinee’s process, they may also seek ways to score attributes of the examinee’s response set, either in addition to or instead of the response’s correctness (Behrens & DiCerbo, 2012). Examples of assessments that can score attributes of the examinee’s response include interactive tasks in the National Assessment of Educational Progress (NAEP) Science Assessment; student responses to these tasks can be evaluated to determine if they were efficient and systematic (Carr, 2013). In addition, children’s task persistence has been investigated in a game-based assessment (DiCerbo, 2004), while users’ effectiveness and efficiency in responding to computer application tasks have also been considered (Rupp et al., 2012).
As some of these examples suggest, a “digital ocean of data” (DiCerbo, 2004) may be available for analysis. Potential data sources can include computer log files (Rupp et al., 2012), the user’s clickstream, resource use pattern, timing, and chat dialogue (Scalise, 2013). Determining which elements to attend to in these cases can be a challenging problem (Rupp et al., 2012). Luecht and Clauser (2002) describe this as the need to identify the “universe of important actions.”
Scoring these types of assessments is often a challenging problem to resolve. Use of a much larger examinee response space and evaluation of multiple attributes naturally suggest a need for new analysis methods (Behrens et al., 2012; Gorin & Mislevy, 2013; Olsen, Smith, & Goodwin, 2009; Way, 2013; Williamson, Xi, & Breyer, 2012). As new analysis methods are developed for novel types of assessments, investigations are also needed into the types of response analysis and feedback that item writers find most useful in their task of item review and revision (Becker & Soni, 2013).
At the same time as interest in this new wave of customized innovations has been growing, several modestly innovative item types have been incorporated into many popular CBT applications. Depending on the specific CBT applications, these built-in innovative item types can include the multiple-response (also referred to as the multiple-answer–multiple-choice); items with graphics, audio, or video clips; the hot spot; the short-answer item type; and the drag-and-drop item. Several of these item types have a potential utility across a fairly large number of content areas, and have in fact been used on a considerable number of operational exams.
In some cases, the availability of these built-in, or off-the-shelf, item types within an application can mean that their inclusion on an exam is fairly easy. However, it is still not unusual for software support of these built-in item types to be incomplete across the full set of applications needed to deliver an exam (from item banking through test delivery and on to scoring and reporting). And because exam programs are so dependent on standardized measurement software and delivery vendors, whether the exam includes off-the-shelf or customized innovative item types, it is essential that all these elements interface seamlessly with each other.
The future of measurement is likely to include more novel item types and customized tasks. Testing organizations are increasingly likely to need strategies to help them manage the process of continuous innovation.
Strategies for Continuous Innovation
The recommended process for an exam program to follow when initially considering innovative item types is to begin with the test construct and to identify any current measurement limitations in the exam that innovations could help address (e.g., Parshall & Harmes, 2008, 2009; Strain-Seymour, Way, & Dolan, 2009). Through this analysis, a list of desirable new item types is often developed; this list might include both item types that are provided within a CBT vendor’s software and one or more that require custom development. At the same time, other exam innovations may also be on the table (e.g., some form of adaptive testing). In a short while, these possible improvements to the exam may be in competition with each other, and staff may be overwhelmed by the decisions needed and the work required.
In addition to potential software development challenges, every exam program has multiple stakeholders, and these stakeholder groups may have very different or even conflicting opinions regarding the value of a potential innovation. New materials for communicating with these stakeholder groups will be needed, just as new materials will be needed to support the work of the item writers and staff. Furthermore, new procedures for a host of test-development activities are often important in the development and delivery of an innovative item type.
At ABIM, this set of challenges led the research and development team to seek out a flexible yet consistent approach for the overall development of a broad set of potential innovations. The goal was to utilize the flexible and iterative nature of Agile software development methods, while at the same time including a standardized framework to ensure that the full assessment context would always be considered. ABIM anticipates that these methods, use of Agile principles and the innovation rollout plan, will be useful for many years into the future. These methods can support the current set of planned innovations and should also be robust enough to be helpful in years to come, even given the ongoing changes in medicine, technology, and measurement that will occur.
The strategies we propose for managing ongoing change are illustrated throughout this chapter via one specific innovation ABIM recently undertook. This case study involves the inclusion of patient–physician interaction video clips within standard multiple-choice items.
Case Study: Introduction
ABIM certifies physicians in the specialty of internal medicine and, additionally, in the 18 subspecialty areas within internal medicine, such as cardiovascular disease and medical oncology. Its multiple-choice examinations largely measure medical knowledge, which is but one of six competencies assessed by the certification process. In order to best manage the research and development of innovations, ABIM formed a cross-departmental innovations team, with content, psychometric, and computing backgrounds.
Many of the innovations considered by this cross-departmental research team seek to improve the multiple-choice examinations by enhancing fidelity to practice, both in enhanced look and feel of case presentation and in improved alignment of the thinking required to a...