1
Introduction
The Test Security Threat
James A. Wollack and John J. Fremer
Introduction
As a result of the public's seemingly insatiable hunger for scandal, it is difficult to pick up a national newspaper or watch a national news program without learning about athletes taking performance enhancing drugs, teams spying on other teams' practices or trying to steal their signals, individuals falsifying tax documents to avoid paying taxes, stock brokers engaged in insider trading, investment firms conducting Ponzi schemes, seemingly happily married folk cheating on their partners, or, as is the focus in the Handbook, examinees, educators, or entrepreneurs cheating or helping others to cheat on standardized examinations. In light of the popular media's penchant for sensationalizing stories, it is easy to become desensitized to test fraud. However, cheating on tests is very real, and the impact it is having on test score interpretations, the public's confidence in the testing industry, and our economy ought not to be underestimated. In this chapter, we discuss the magnitude of the security problem in all areas of testing, and attempt to set the stage for the chapters that follow.
How Prevalent is Cheating?
As long as there have been tests for which important, high-stakes decisions are made, there have been people endeavoring to find a means for artificially inflating their scores. Indeed, cheating on tests was detected in the Keju Chinese civil service exams, which began in AD 606. These exams were used to find the "best" individuals to serve in the administration of the country, were extremely challenging, and very few individuals passed. At the height of the Chinese civil service testing program at the end of the 19th century, it is estimated that, at most, one in 250,000 people sitting for the exams would ever achieve the marks necessary to become eligible for an official government appointment (Suen & Yu, 2006). Because the tests were so selective and so critical to their mission, the Chinese government went to great lengths to protect the integrity of the exams, including instituting a number of preventive measures, such as restricting clothing and resources allowed into the testing rooms, subjecting examinees to body searches, and sequestering examinees in heavily guarded, prison-like exam compounds. In addition, the government established severe sanctions for any individual caught cheating, including stripping examinees of all previously earned credentials, caning, or even execution (Suen & Yu, 2006; Taylor & Taylor, 1995). However, the rewards bestowed upon those who passed (and their families) were tremendous: power, fortune, fame. As a result, in spite of the government's best efforts, cheating was still rampant.
Many of the methods used to cheat were the same methods used today: bringing crib notes and cheat sheets, writing notes on clothing, body parts, or other "materials," and collaborating with co-conspirators (either inside or outside the testing room). Perhaps the most frequent form of cheating was impersonation, or as it has come to be known, proxy testing. Because authentication of examinee identity consisted of verbal descriptions of candidates, an estimated 30–40 percent of candidates sitting for the first phase of exams were believed to be paid body-doubles (Suen & Yu, 2006).
Today, though impersonation is by no means extinct, advances with respect to video surveillance equipment and biometric technologies, such as fingerprinting, retinal scans, and keystroke analytics, provide the tools to successfully combat proxy test taking, even if such measures are utilized primarily in large international testing programs. Unfortunately, the same cannot be said for other types of cheating. In fact, it is almost certainly the case that cheating on tests is more prevalent now than ever before.
Making matters more challenging is the fact that cheating approaches continue to evolve and expand. Cohen & Wollack (2006) emphasized that an unintended consequence of improving technology is that there are many new and more sophisticated ways to cheat on tests. Pagers, cell phones, personal digital assistants, voice recorders, iPads, MP3 players, laptop and tablet computers, advanced calculators, two-way radios, and tiny wireless microphones packaged with earpieces and transmitters all make communicating with individuals outside (and inside) the testing room routine. Many of these devices may also be used to access the Internet and provide access to a huge supply of information, which can include extremely elaborate notes and other handy resources that an examinee may have posted to his/her website prior to the exam. Video cameras disguised as jewelry, pens, or shirt buttons can be used to reproduce exact copies of a test, which can then be transmitted instantly to locations all around the world. Written assessments, particularly in the case of high school or college students who are asked to complete their papers outside of class, are now easier than ever to plagiarize, because the Internet exposes students to an unlimited number of untraceable references, not to mention countless sites where completed papers of any length, format, and quality may be ordered up, in much the same way that one goes on-line to order a computer that is customized to meet certain specifications.
Furthermore, because of the perceived administrative and psychometric advantages, an increasing number of tests are now being delivered on computer, be it computer-based linear testing (CBT), computer adaptive testing (CAT), or web-based Internet testing. In the third edition of Educational Measurement, it was argued that computer-based delivery systems, which were then in their infancy, offered improved test security relative to traditional paper-and-pencil tests (Bunderson, Inouye, & Olsen, 1989). After all, they said, computer tests result in no paper copies of exams or keys, and may be stored electronically in encrypted or password-protected files which grant access to only authorized parties, and can automatically shuffle items and corresponding keys to make it more difficult for a student to follow the screen of a neighboring examinee. In the case of CAT, they added, because each examinee receives a different equated test, copying is near impossible and "[i]t would be difficult to steal and memorize each of the hundred or so items for each of several such tests" (p. 386). As it turns out, in the quarter century since the third edition of Educational Measurement was published, we have learned the hard way that cheating on computer-based tests is not as "difficult" as was originally believed. In fact, although many of the specifics of what Bunderson et al. (1989) noted about the security of CBT and CAT were true, the validity issues surrounding item and test compromise have proven so monumental that it would be difficult in light of the experience of computer-based programs to conclude that computerized tests are more secure than paper-and-pencil exams.
So where, exactly, does that leave us? High-stakes tests are not just for evaluating students in the classroom any more. Testing is a vitally important part of our culture. Tests are routinely used to evaluate people for graduation, admission to universities or graduate/professional programs, scholarships, employment, promotion, and licensure/certification. They are used to evaluate how well individuals perform their jobs, such as evaluating teachers, administrators, schools, districts, and states as part of the No Child Left Behind Act (2001) accountability criteria. They are used to award college credit or to exempt students from graduation requirements. And they are used to diagnose educational and psychological disabilities or relative strengths and weaknesses so that education may be tailored to suit individuals' needs. The world-wide emphasis on testing, and particularly in the United States, has increased at such a staggering pace that psychometrics is regarded as one of the hottest growing fields in America (Herszenhorn, 2006).
But all this testing can be undermined if test users and developers cannot vouch for the validity of the scores for their designated purposes. In recent years, many major testing programs have had to deal with extensive organized cheating scandals that have caused the validity of the scores for large sets of examinees to be questioned.
- In 2002, Educational Testing Service (ETS) discovered Chinese- and Korean-language braindump websites, in which students posted questions and answers they had memorized from the computerized version of the Graduate Record Examination (GRE), causing average test scores to increase by as much as 100 points (Steinberg, 2002).
- In 2008, Advanced Placement (AP) scores for nearly 400 students in Orange County, CA, were voided because the administrative oversight and proctoring of the exams was particularly poor. Although not all examinees were known to have benefitted from the lax proctoring, all students were required to retake their exams a couple of weeks later. According to reports, students were allowed to use cell phones during the original administration of the exam to send text messages, were seated too close together, and were tested in configurations that were inappropriate (such as facing one another). It was also reported that the number of proctors was inadequate for the number of students testing, and that many proctors were inattentive, including a few who allegedly were reading, fell asleep, or left the room (Mehta, 2008).
- In 2010, the U.S. Justice Department found that 22 Federal Bureau of Investigation (FBI) agents cheated on an exam assessing knowledge of counterterrorism procedures. Although examinees were allowed to use their book and notes during the exam, examinees were also found to have consulted with supervisors and a legal advisor (Stein, 2010).
- In 2011, 15 high school students in New York were arrested (although up to 50 were believed to be involved) for hiring proxy examinees, for between $500 and $3,600 each, to take the SAT and ACT assessments for them (Anderson, 2012). This scandal prompted the New York State legislature to hold a hearing on ways these testing programs could improve their security (Phillips, 2012), and ultimately led to ACT and SAT requiring students to provide photographs of themselves as part of the registration process.
- In 2010, the American Board of Internal Medicine (ABIM) accused 140 doctors of having acquired or assisted in the acquisition of preknowledge of live questions for the Board's certification exams (Hobson, 2010). Examinees were accused of having either shared test content following an exam, or actively seeking out such content (including some being paid to acquire actual test questions). At both the point of exam registration and again immediately before taking the exam, ABIM requires exam candidates to agree to adhere to a Board policy strictly forbidding the sharing of test content. A year and a half later, in early 2012, a CNN investigation revealed a widespread practice within the radiology community of residents preparing for their ABIM Board exams using "recalls" or large banks of memorized test questions (Zamost, Griffin, & Ansari, 2012). According to the report, these recall banks were maintained and provided by the residency programs, which encouraged their soon-to-test residents to memorize items to contribute to the banks. Approximately half the items on the radiology exam had appeared on previous forms. Within some programs, CNN found over 15 years' worth of questions and answers, neatly prepared by the training program as PowerPoint presentations.
Similar scandals have been seen with increasing frequency on State Accountability Testing programs. In 2003, two-thirds of the elementary school teachers in a troubled Dallas school district were found to have improperly helped their students on the Texas Assessment of Knowledge and Skills (TAKS), resulting in nearly perfect scores for many students (Benton, 2006). Two years later, nearly 20,000 TAKS booklets went missing following the test (Benton, 2005). This snafu prompted an investigation which conservatively estimated that there were statistical anomalies consistent with cheating in 8.6 percent of Texas' schools (Benton, 2006). The cheating on the TAKS was not the first reported incident of educators cheating on behalf of their students, but it was the first highly publicized case in the No Child Left Behind era, and was clearly a sign of things to come.
In 2011 and 2012, there were media reports of educator cheating in many large cities across the United States. In March 2011, six charter schools in Los Angeles were closed after teachers and principals opened sealed copies of the state's exam so that they could prepare students with the actual test questions (Blume, 2011). Later that month, USA Today published a story revealing staggeringly high numbers of erasures, as well as math and reading gain scores that seemed too good to be true, throughout the Washington, DC, school system from 2006 to 2010 (Gillum & Bello, 2011). Then in early July 2011, a team of special investigators appointed by the Governor of Georgia to probe allegations of test misconduct throughout the Atlanta Public School System published its findings. In their report, the investigators identified 178 educators in at least 44 schools who engaged in cheating in 2008–9 (Vogell, 2011).1 It is unquestionably the most thoroughly investigated and quite possibly the most widespread instance of school-based cheating uncovered to date. Immediately on the heels of Atlanta, 89 schools in Pennsylvania, including 28 in Philadelphia, were identified as suspicious in July 2011, also based on patterns of erasures and gain scores (Winerip, 2011). Using open records laws to obtain state test data for all states, in March 2012, the Atlanta Journal Constitution (AJC) published the results of an analysis looking into anomalous score gains (and drops) across 69,000 public schools across the country (Vogell, Perry, Judd, & Pell, 2012). The AJC report concluded that approximately 200 school districts had test score patterns that very much resembled those found in Atlanta, and suggested that test scores for some tens of thousands of students in 2010 alone may have been invalid.
A follow-up analysis published in the AJC a month later revealed what was described as an unusual tendency for schools that had received the prestigious Blue Ribbon Award (given annually to the schools that achieve at the highest levels or that demonstrate the largest growth despite serving largely disadvantaged students) to be dramatically over-represented on the list of anomalous schools (Judd, Vogell, & Perry, 2012b). In many cases, the unusually large gains were followed by equally unusual score drops in the year immediately following the bestowing of the Blue Ribbon Award (Judd, Vogell, & Perry, 2012a).
And such problems are not limited to testing programs in the United States. Cheating is a huge problem on college entrance exams in China and Vietnam. These national exams identify the students who may enroll in a four-year university course. Because job prospects are much improved by having a college degree, many students take these tests. However, because space in the universities is limited, pass rates are relatively low, often around 25 percent. The Chinese Ministry of Education estimated that 3,000 students cheated on the 2006 college entrance exam. Common types of cheating included exchanging information with fellow students and carrying mobile phones (People's Daily Online, 2006). In Vietnam, immediately following the administration of the college entrance exams, the halls and floors are u...