Introduction
In an era of high-stakes testing, maintaining the integrity of test scores has become an important issue and another aspect of validity. A Web search for the words “test fraud” and “cheating” reveals an increasing number of news stories in local and national media outlets, potentially eroding public confidence in the use of test scores for high-stakes decisions. To address the increasing concerns about the integrity of test scores, the scholarly community is beginning to develop a variety of best practices for preventing, detecting, and investigating testing irregularities. For instance, the National Council on Measurement in Education (2012) released a handbook on test and data integrity, the Council of Chief State School Officers published a test security guidebook (Olson & Fremer, 2013), and the Association of Test Publishers and the National College Testing Association have recently developed a best practices document related to test proctoring (ATP & NCTA, 2015). A symposium on test integrity was hosted with a number of experts from universities, testing companies, state educational agencies, law firms, and nonprofit organizations (U.S. Department of Education, IES, & NCES, 2013). Similarly, an annual scholarly conference on statistical detection of test fraud has been held since 2012, attracting growing national and international attention. All these efforts have provided an environment for people to discuss the best practices and policies to prevent, detect, and investigate testing irregularities and to ensure the integrity of test scores.
Unusual response similarity among test takers and aberrant response patterns are types of irregularities that may occur in testing data and may indicate potential test fraud (e.g., examinees copying responses from other examinees, or passing correct responses among themselves through text messages or prearranged signals). Although a number of survey studies already indicate that copying/sharing responses among students is very common at different levels of education (e.g., Bopp, Gleason, & Misicka, 2001; Brimble & Clarke, 2005; Hughes & McCabe, 2006; Jensen, Arnett, Feldman, & Cauffman, 2002; Lin & Wen, 2007; McCabe, 2001; Rakovski & Levy, 2007; Vandehey, Diekhoff, & LaBeff, 2007; Whitley, 1998), one striking statistic comes from a biennial survey administered by the Josephson Institute of Ethics in 2006, 2008, 2010, and 2012 to more than 20,000 middle and high school students. One question in these surveys asked how many times students had cheated on a test in the past year. In each of these years, more than 50% of the students reported that they had cheated at least once, and about 30% to 35% reported that they had cheated two or more times. The latest cheating scandals in schools and the research literature on the frequency of answer copying behavior at different levels of education reinforce the fact that comprehensive data forensics analysis is not a choice but a necessity for state and local educational agencies.
Although data forensics analysis has recently been a hot topic in the field of educational measurement, scholars took an interest in detecting potential fraud on tests as early as the 1920s (Bird, 1927, 1929), just after multiple-choice tests started being used in academic settings (Gregory, 2004). Since the 1920s, the literature on statistical methods to identify unusual response similarity or aberrant response patterns has expanded immensely and evolved from very simple ideas to more sophisticated modeling of item response data. The rest of the chapter first provides a historical and technical overview of the methods proposed to detect unusual response similarity and aberrant response patterns, then describes a simulation study investigating the performance of some of these methods under both nominal and dichotomous response outcomes, and finally demonstrates the potential use of these methods on the real common datasets provided for the current book.
A Review of the Status Quo
As shown in Table 2.1, the literature on statistical methods of detecting answer copying/sharing can be examined in two main categories: response similarity indices and person-fit indices. Whereas response similarity indices analyze the degree of agreement between two response vectors, person-fit indices examine whether or not a single response vector is aligned with a certain response model. Response similarity indices can be further classified based on two attributes: (a) the reference statistical distribution they rely on and (b) the evidence of answer copying they use when computing the likelihood of agreement between two response vectors. The current section briefly describes some of these indices.
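To make the "evidence of answer copying" attribute concrete, the following is a minimal Python sketch that tallies the raw quantities named in Table 2.1 for a pair of nominal response vectors: identical responses overall, identical correct responses, and identical incorrect responses. It is not an implementation of any specific index in the table. The function name `match_counts`, the toy answer key, and the two response vectors are hypothetical and chosen only for illustration, and omitted or missing responses are not handled.

```python
from typing import Dict, Sequence


def match_counts(
    examinee_a: Sequence[str],
    examinee_b: Sequence[str],
    key: Sequence[str],
) -> Dict[str, int]:
    """Tally raw agreement between two nominal response vectors.

    Illustrative helper (hypothetical name): it returns only counts,
    not a test statistic or a probability.
    """
    if not (len(examinee_a) == len(examinee_b) == len(key)):
        raise ValueError("all vectors must cover the same number of items")

    identical_correct = 0
    identical_incorrect = 0
    for a, b, k in zip(examinee_a, examinee_b, key):
        if a != b:
            continue  # the two examinees chose different options
        if a == k:
            identical_correct += 1    # both chose the keyed option
        else:
            identical_incorrect += 1  # both chose the same wrong option

    return {
        "identical": identical_correct + identical_incorrect,
        "identical_correct": identical_correct,
        "identical_incorrect": identical_incorrect,
    }


# Toy 10-item multiple-choice test (made-up data).
key = list("ABCDABCDAB")
examinee_1 = list("ABCDACCDBB")
examinee_2 = list("ABCAACCDBB")

counts = match_counts(examinee_1, examinee_2, key)
print(counts)
# {'identical': 9, 'identical_correct': 7, 'identical_incorrect': 2}
```

A response similarity index would go one step further and compare such a count against a reference distribution (normal, binomial, or Poisson in Table 2.1) to judge how unlikely the observed agreement is under independent responding. That step is deliberately left out of this sketch.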
Person-Fit Indices
The idea of using person-fit indices in detecting answer copying has been present for quite a long time (e.g., Levine & Rubin, 1979); however, their use and effectiveness for this purpose remain relatively underresearched compared to the response similarity indices. This is likely because existing studies found person-fit indices underpowered specifically in detecting answer copying, a finding that may have discouraged further research. One likely reason for the low power is that, although most copiers have aberrant response patterns, not all examinees with aberrant response patterns are copiers. Aberrant response patterns may occur for many different reasons, and therefore it is very difficult to trigger a fraud claim without demonstrating an
Table 2.1 Overview of Statistical Methods Proposed for Detecting Answer Copying
| Statistical Distribution | Response Similarity Indices: Number of Identical Incorrect Responses | Response Similarity Indices: Number of Identical Correct and Incorrect Responses | Response Similarity Indices: All Items | Person-Fit Indices |
|---|---|---|---|---|
| Normal Distribution | | Wesolowsky (2000) | g2 (Frary, Tideman, & Watts, 1977) | HT (Sijtsma & Meijer, 1992) |
| | | | | D (Trabin & Weiss, 1983) |
| | | | ω (Wollack, 1997) | C (Sato, 1975) |
| Binomial Distribution | IC (Anikeef, 1954) | | | MCI (Harnisch & Linn, 1981) |
| | K (Kling, 1979, cited in Saretsky, 1984) | | | U3 (van der Flier, 1980) |
| | ESA (Bellezza & Bellezza, 1989) | | | lz (Drasgow et al., 1985) |
| | K1 and K2 (Sotaridona & Meijer, 2002) | | | *See Karabatsos (2003) |
| Poisson Distribution | S1 (Sotaridona & Meijer, 2003) | S2 (Sotaridona & Meijer, 2003) | | for more in this categor... |