Section III: Standard Setting Methods

Chapter 10
Variations on a Theme: The Modified Angoff, Extended Angoff, and Yes/No Standard Setting Methods
BARBARA S. PLAKE AND GREGORY J. CIZEK
Perhaps the most familiar of the methods for setting performance standards bears the name of the person who first suggested the outlines of an innovative criterion-referenced approach to establishing cut scores: William Angoff. Scholars involved in standard setting research, facilitators of standard setting studies, and even many panelists themselves are likely to at least recognize the phrase Angoff method.
Although the frequency with which the Angoff method is used in education contexts has waned since the introduction of the Bookmark standard setting approach (see Lewis, Mitzel, Mercado, & Schulz, Chapter 12 of this volume), it likely remains, overall, the most popular standard setting method in use today. In 1988, Mills and Melican reported that "the Angoff method appears to be the most widely used. The method is not difficult to explain and data collection and analysis are simpler than for other methods in this category" (p. 272). In a 1986 review, Berk concluded that "the Angoff method appears to offer the best balance between technical adequacy and practicability" (p. 147). More recent appraisals suggest that it is the most often used method in licensure and certification testing and that it is still commonly used in educational testing contexts (Meara, Hambleton, & Sireci, 2001; Plake, 1998; Sireci & Biskin, 1992).
Perhaps one reason for its enduring popularity is that the procedure first suggested by Angoff (1971) has been successfully refined and adapted in numerous ways. The original Angoff method and variations such as the Modified Angoff method, Extended Angoff method, and Yes/No Method are the focus of this chapter.
Origins
It is perhaps an interesting historical note that the standard setting method that came to be known as the Angoff method was not the primary focus of the original work in which it was first described. The method first appeared in a chapter Angoff (1971) wrote on scaling, norming, and equating for the measurement reference book Educational Measurement (Thorndike, 1971), in which he detailed "the devices that aid in giving test scores the kind of meaning they need in order to be useful as instruments of measurement" (p. 508). Clearly, what Angoff had in mind were score scales, equated scores, transformed scores, and the like; his chapter made essentially no reference to setting performance standards, a topic that today has itself received book-length treatment in volumes such as this one and others (see Cizek, 2001; Cizek & Bunch, 2007). In fact, the nearly 100-page chapter devotes only two paragraphs to standard setting. Angoff proposed the method in a single paragraph:
A systematic procedure for deciding on the minimum raw scores for passing and honors might be developed as follows: keeping the hypothetical "minimally acceptable person" in mind, one could go through the test item by item and decide whether such a person could answer correctly each item under consideration. If a score of one is given for each item answered correctly by the hypothetical person and a score of zero is given for each item answered incorrectly by that person, the sum of the item scores will equal the raw score earned by the "minimally acceptable person." A similar procedure could be followed for the hypothetical "lowest honors person." (1971, pp. 514-515)
Three aspects of this description are noteworthy and, in retrospect, can be seen as influencing the practice of standard setting in profound ways for years to come. First, it is perhaps obvious, but should be noted explicitly, that Angoff's description of a "minimally acceptable person" was not a reference to the acceptability of an examinee as a person, but to the qualifications of the examinee with respect to the characteristic measured by the test and the level of that characteristic deemed acceptable for some purpose. In the years since Angoff described his method, the terms borderline examinee, minimally competent examinee, and minimally qualified candidate have been substituted when the Angoff procedure is used. Those constructions notwithstanding, this fundamental idea put forth by Angoff, the conceptualization of a minimally competent or borderline examinee, remains a key referent for the Angoff and similar standard setting methods. Indeed, in the conduct of an actual standard setting procedure, it is common for a considerable portion of the training time to be devoted to helping participants acquire and refine this essential conceptualization.
A second noteworthy aspect is that the Angoff method was rooted in the notion that participants could be asked to make judgments about individual test items for purposes of determining a performance standard. The term test-centered model was used by Jaeger (1989) to describe the Angoff and other approaches that rely primarily on judgments about test content, as opposed to direct judgments about examinees (called examinee-centered models by Jaeger). With few exceptions, all modern criterion-referenced standard setting approaches are primarily test-centered.
The third noteworthy aspect of Angoff's original formulation is that it could be adapted to contexts in which more than one cut score was needed. That is, it could be applied to situations in which only dichotomous (i.e., pass/fail) classifications were needed, but it could also be applied to situations in which more than two categories were required. This can be seen in the context of Angoff's original description, where two cut scores were derived to create three categories: Failing, Acceptable/Passing, and Honors. Further, although the method was originally conceived to be applied to tests in which the multiple-choice question (MCQ) format was used exclusively, the method has also been successfully applied to tests composed of constructed-response (CR) items, and to tests with a mixture of both MCQ and CR formats.
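To make the multiple-category case concrete, the brief sketch below shows how two cut scores partition raw scores into Angoff's three categories. The cut values and the 60-item test length are hypothetical, chosen only for illustration.

```python
# Hypothetical illustration: two cut scores on an imagined 60-item test
# create three performance categories, as in Angoff's original example.
PASSING_CUT = 42  # hypothetical cut score for Acceptable/Passing
HONORS_CUT = 54   # hypothetical cut score for Honors

def classify(raw_score: int) -> str:
    """Map a raw score to one of the three performance categories."""
    if raw_score >= HONORS_CUT:
        return "Honors"
    if raw_score >= PASSING_CUT:
        return "Acceptable/Passing"
    return "Failing"

print(classify(38), classify(45), classify(57))
# -> Failing Acceptable/Passing Honors
```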
Other features that have become fairly commonplace in modern standard setting were included in the second of the two paragraphs in which Angoff's method was described. For one, Angoff's proposal permitted the calculation of criterion-referenced cut scores by summarizing the independent judgments of a group of standard setting panelists prior to the administration of a test. Additionally, he proposed a potential, albeit rudimentary, strategy for validation of the resulting cut scores:
With a number of judges independently making these judgments it would be possible to decide by consensus on the nature of the scaled score conversions without actually administering the test. If desired, the results of this consensus could later be compared with the number and percentage of examinees who actually earned passing and honors grades. (1971, p. 515)
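The comparison Angoff sketches can be made concrete with a few lines of code. In this minimal sketch, the scores and cut values are invented solely to show the check he describes: computing the percentage of actual examinees who would earn passing and honors classifications under the recommended cuts.

```python
# Minimal sketch of Angoff's rudimentary validation check: compare the
# consensus cut scores against the percentages of actual examinees who
# would pass or earn honors. All values here are invented.
observed_scores = [35, 41, 44, 47, 50, 52, 55, 58, 40, 46]  # hypothetical
passing_cut, honors_cut = 42, 54                            # hypothetical

n = len(observed_scores)
pct_passing = 100 * sum(s >= passing_cut for s in observed_scores) / n
pct_honors = 100 * sum(s >= honors_cut for s in observed_scores) / n
print(f"{pct_passing:.0f}% at or above passing; {pct_honors:.0f}% at honors")
# -> 70% at or above passing; 20% at honors
```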
As described by Angoff, the task presented to participants is to make dichotomous judgments regarding whether the minimally competent examinee could answer each item correctly (thereby assigning a value of 1 to each such item) or not (resulting in a value of zero being assigned to those items). This would most appropriately be called the Basic or Unmodified Angoff method, and it is the foundation for what has subsequently been developed into the Yes/No Method (Impara & Plake, 1997), described in greater detail later in this chapter. In practice, however, the phrase original or unmodified Angoff method refers to an alternative to the basic approach that Angoff described in a footnote to one of the two paragraphs. The alternative involved asking participants to make a finer judgment than simply assigning zeros and ones to each item in a test form. According to Angoff's footnote:
A slight variation of this procedure is to ask each judge to state the probability that the "minimally acceptable person" would answer each item correctly. In effect, judges would think of a number of minimally acceptable persons, instead of only one such person, and would estimate the proportion of minimally acceptable persons who would answer each item correctly. The sum of these probabilities would then represent the minimally acceptable score. (1971, p. 515, emphasis added)
That refinement, asking participants to provide probability judgments with respect to borderline examinees' chances of answering items correctly, has been highly influential in that it incorporated and highlighted the probabilistic nature of standard setting judgments; at the same time, as will be described, it has also been a source of modest controversy.
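In notation (ours, not Angoff's), let $p_{ij}$ denote panelist $j$'s estimate of the probability that a minimally acceptable examinee answers item $i$ correctly. For a test of $n$ items and a panel of $J$ panelists,

$$
\hat{c}_j = \sum_{i=1}^{n} p_{ij}, \qquad
\hat{c} = \frac{1}{J} \sum_{j=1}^{J} \hat{c}_j = \frac{1}{J} \sum_{j=1}^{J} \sum_{i=1}^{n} p_{ij},
$$

where $\hat{c}_j$ is panelist $j$'s implied cut score and $\hat{c}$ is the panel-level value (averaging across panelists is a common convention, though other aggregation rules exist). Under the basic, dichotomous formulation, $p_{ij} \in \{0, 1\}$; under the probability refinement, $p_{ij} \in [0, 1]$.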
At present, the most popular manifestation of the Angoff method is likely what has come to be called the traditional or modified Angoff approach. In actuality, there are numerous ways in which the basic Angoff method has been modified. By far the most common modification involves requiring participants to make more than one set of judgments about each item, with those multiple judgments occurring in "rounds" between which the participants are provided with one or more pieces of additional information to aid them in making more accurate, consistent estimates of borderline examinee performance.
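As an illustration of what happens between rounds, the sketch below computes two common kinds of feedback from Round 1 ratings: the panel's mean estimate for each item and each panelist's implied cut score. The panelist labels, ratings, and five-item test are invented, and the feedback shown is one common choice rather than a prescribed procedure.

```python
# Illustrative between-round feedback for a Modified Angoff study.
# All ratings are invented; a real study would involve many more items.
ratings = {  # panelist -> Round 1 probability estimates for five items
    "P1": [0.60, 0.75, 0.40, 0.85, 0.55],
    "P2": [0.70, 0.70, 0.50, 0.80, 0.60],
    "P3": [0.45, 0.80, 0.35, 0.90, 0.50],
}

n_items = len(next(iter(ratings.values())))

# Panel mean estimate per item (normative feedback on each item).
item_means = [round(sum(r[i] for r in ratings.values()) / len(ratings), 2)
              for i in range(n_items)]

# Each panelist's implied cut score: the sum of his or her estimates.
panelist_cuts = {p: round(sum(r), 2) for p, r in ratings.items()}

print("Item means:", item_means)       # [0.58, 0.75, 0.42, 0.85, 0.55]
print("Implied cuts:", panelist_cuts)  # {'P1': 3.15, 'P2': 3.3, 'P3': 3.0}
```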
The balance of this chapter describes each of the contemporary adaptations of the original Angoff approach, including the traditional Angoff method applied to MCQs, the Yes/No variation with MCQs, and variations of the Angoff method for use with polytomously scored (i.e., CR) items or tasks. The chapter concludes with common limitations of the Angoff method and recommendations for the future.
Traditional Angoff Method with MCQs
The purpose of this section is to describe a traditional Angoff (1971) standard setting procedure with items of a multiple-choice format, involving a panel of subject matter experts (SMEs) as judges, and using multiple rounds of ratings by the panelists with some information (i.e., feedback) provided to the panelists between rounds. This is also often called a Modified Angoff standard setting method because having multiple rounds with feedback in between is a modification of the original Angoff method, which involved only a single round of ratings and no provision of feedback. In this section, several elements common to a Modified Angoff process will be presented, including: information about the composition of the panel; generating probability estimates; the role of Performance Level Descriptors (PLDs); the steps in undertaking the method; the rounds of ratings; the types of feedback provided to the panelists between rounds; and the methods typically used to compute the cut score(s).
Composition of the Panel
As in any standard setting study, the composition of a panel using the Angoff method varies based on the purpose of the test and the uses of the results. In some cases, such as in licensure and certification testing programs, the panel is composed exclusively of subject matter experts (SMEs). In other instances, a mix of SMEs and other stakeholders is included in the panel, as is the case with the National Assessment of Educational Progress (NAEP) standard setting studies (see Loomis & Bourque, 2001; Loomis, Chapter 6 of this volume, for more information about issues to consider when deciding on the composition of the panel for a standard setting study). Because it is the panelists who provide the data (ratings) for making cut score recommendations using the Angoff standard setting method (and most other judgmental standard setting methods), the representativeness of the panel is a crucial element bearing on the validity of the cut scores generated from the standard setting study.
Generating Probability Estimates
The Modified Angoff method involves having panelists make item-level estimates of how certain target examinees will perform on multiple-choice questions. In particular, panelists are instructed to estimate, for each item in the test, the probability that a randomly selected, hypothetical, minimally competent candidate (MCC) will answer the item correctly. Because these estimates are probability values, they can range from a low of 0 to a high of 1.
These probability judgments can be difficult for participants to make, however. To aid in completing these estimates, panelists in an Angoff standard setting study are often instructed to conceptualize 100 MCCs and then estimate the proportion (or number) of them that would get the item right. In essence, the probability estimation is shifted to an estimate of the proportion (or p-value) that would result from administering the item to a sample of 100 MCCs. Notice that this estimate concerns the probability or proportion of MCCs who would answer the item correctly, not the proportion who should answer it correctly. The focus on would instead of should takes into account the many factors that might influence how such candidates perform on the test questions, including their ability and the difficulty of the item, but also other factors such as anxiety over test performance in a high-stakes environment, administrative conditions, and simple errors in responding.
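The arithmetic behind this reframing is simple; the sketch below (with an invented judgment of 65 out of 100) just makes explicit that a count of imagined MCCs answering correctly is, in effect, a probability estimate.

```python
# Illustrative only: a panelist who imagines 100 minimally competent
# candidates (MCCs) and judges that 65 of them would answer the item
# correctly has, in effect, supplied a probability estimate of 0.65.
mcc_correct_out_of_100 = 65  # hypothetical panelist judgment
probability_estimate = mcc_correct_out_of_100 / 100

assert 0.0 <= probability_estimate <= 1.0  # probabilities lie in [0, 1]
print(probability_estimate)  # 0.65
```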
Because panelists are asked to make estimates of item performance for a specific subgroup of the examinee population, it is critical that the panelists have a conceptual understanding of the knowledge, skills, and abilities (KSAs) of the MCCs. Often the SMEs who form the panel have first-hand knowledge of members of this subgroup of the examinee population, as when panel members have direct interactions with the examinee population through their educational or work experience. For setting cut scores on tests in K-12 educational contexts, the panel is typically composed of grade level and subject matter teachers or educational leaders; in licensure programs the panel is often composed of SMEs who teach or supervise entry-level professionals in their field. In some instances, policy and business leaders or representatives of the public are also members of the panel; special attention is needed in these instances to ensure that these panelists have a...