A History of Test Speededness
Tracing the Evolution of Theory and Practice
Daniel P. Jurich
There are many practical reasons for administering tests with time limits, most of which relate to the logistics and efficiency of test administration (Bandalos, 2018, p. 59; Morrison, 1960; Rindler, 1979). For example, time limits help to control costs for test developers who often must pay expenses associated with the testing space as well as staff costs for necessary personnel (e.g., test proctors). However, time limits can also serve essential measurement-related functions. Perhaps most importantly, they help to standardize the testing conditions and improve the ability to compare performance across examinees. Concrete evidence of timed standardized testing dates back at least to the Chinese Civil Service examinations administered in the 15th century. At that time, candidates were given one night and one day to complete poems and essays that were used to evaluate their style and penmanship (Martin, 1870). In the United States, the Army led early applications of timed structured cognitive and noncognitive testing through exams such as the Army Alpha and Beta. Beginning in 1917, these tests were used to evaluate World War I recruits on a variety of cognitive skills such as arithmetic reasoning and verbal aptitude (Gregory, 2004; Schnipke & Scrams, 2002). Since these beginnings, standardized examinations with time limits have become ubiquitous within modern society.
Although the implementation of time limits in standardized testing usually occurs for reasons unrelated to measurement, time constraints can have a substantial impact on the validity of scores. Accurate measurement is predicated on the assumption that test scores represent an examinee's true proficiency with respect to the intended constructs. When the speed with which an examinee completes a test is not of interest, a restrictive time limit that does not allow examinees to exhibit their true proficiency can have negative consequences by introducing construct-irrelevant variation into examinee performance. Even when purposefully measuring speed, an inadequately timed assessment can yield questionable or even invalid results if the degree to which speed affects scores differs from what is expected based on the construct definition. The potential for speed to threaten the validity of scores has been referred to in the literature as test speededness.
This chapter presents a historical overview of the testing literature that exemplifies the theoretical and operational evolution of test speededness. As will be shown, the definition of speededness has evolved throughout the history of measurement and to this day remains a debated topic. The current Standards for Educational and Psychological Testing provide a framework for conceptualizing test speededness as the "extent to which test takers' scores depend on the rate at which work is performed as well as on the correctness of the responses" (AERA, APA, NCME, p. 223). In other words, speededness occurs when the allotted testing time influences examinee performance such that both speed and the construct of interest contribute to score variation. Several comprehensive literature reviews have summarized different aspects of the relationship between timing and testing (e.g., Lu & Sireci, 2007; Morrison, 1960; Schnipke & Scrams, 2002). This chapter focuses on how the concept of speededness evolved and how this evolution in conceptualization has influenced the methods that practitioners have used, and are now using, to evaluate speededness. By describing how the field arrived at current philosophies and exploring the issues that remain unaddressed, this brief historical review intends to serve as a foundation for the subsequent chapters within this book.
The Early Years: Speed and Ability as Interchangeable Measures
As the scientific study of testing burgeoned after World War I, initial theories posited that speed would not influence response quality independent of the intended construct (Spearman, 1927). Though practitioners recognized that speed and proficiency were conceptually distinct, the prevailing theory presumed that the high correlation between the two traits made them indistinguishable from a measurement perspective (Davidson & Carroll, 1945). In other words, timing could not introduce construct-irrelevant variance because speed was interchangeable with the construct of interest. Some context of the testing era is helpful to understand the logic in this theory. It is axiomatic that numeric scores, such as number correct, will decrease when examinees lack sufficient time to consider all items. However, test scores in this era were predominantly used to rank-order examinees. Although total scores can differ substantially under different time limits, rank order would stay comparable if speed and proficiency correlated near perfectly (see Ruch & Koerth, 1923).
There was also an empirical basis for treating speed and proficiency as interchangeable. To elaborate on this work, we must distinguish between speed tests and power tests, concepts formalized by Gulliksen in 1950 but used colloquially prior to Gulliksen's work. A pure speed test is one that is intended to evaluate how quickly an examinee can complete a set of test items within a fixed period of time. As such, speed tests are designed to have strict time limits and to include items of such ease that examinees can respond to all items correctly. Scores on speed tests then reflect the number of items responded to within the time limit and provide an indication of the speed and accuracy with which an examinee processes information. In contrast, pure power tests have no time limits and contain items of varying difficulty to capture the range of proficiency on the construct(s) of interest; scores on these tests reflect the number of items examinees answer correctly out of all items and are used to evaluate ability apart from the speed with which questions are answered. The distinction between pure speed and power tests is primarily theoretical. Many educational examinations function as a mixture of both power and speed tests, intending to primarily measure the construct of interest (i.e., power), but also containing a speed component resulting from time limits that are imposed to address practical constraints (Lu & Sireci, 2007; Chapter 3, this volume). Although theoretical in nature, the concepts of speed and power tests served as a foundation for the methodological developments throughout the evolution of speededness.
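The contrast between the two idealized scoring rules can be sketched in code. The function names and parameters below are hypothetical, chosen for exposition rather than drawn from the chapter; they simply encode Gulliksen's definitions: a pure speed test counts items reached within the limit (every attempted item is assumed correct), while a pure power test counts correct answers with no time limit at all.

```python
# Illustrative sketch of Gulliksen's (1950) idealized test types.
# All names and parameter values here are hypothetical.

def score_speed_test(n_items, time_limit, time_per_item):
    """Pure speed test: items are so easy that every attempted item is
    correct, so the score is simply the number of items reached in time."""
    items_reached = int(time_limit // time_per_item)
    return min(items_reached, n_items)

def score_power_test(responses, key):
    """Pure power test: no time limit, so the score is the number of items
    answered correctly out of all items."""
    return sum(r == k for r, k in zip(responses, key))

# A hypothetical candidate who needs 2 seconds per item and has 60 seconds
# reaches 30 of 50 items on a speed test:
print(score_speed_test(n_items=50, time_limit=60, time_per_item=2))  # 30
```

Real examinations, as noted above, blend the two rules: a power-style score (number correct) is collected under a speed-style constraint (a fixed time limit).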
Restating Spearman's theory in these terms, rank order should be consistent whether an examination is administered as a speed or a power test. The belief that speed served as a proxy for cognitive ability partially stemmed from research in the 1920s and 1930s that investigated the relationship between scores from tests taken under both speed and power conditions. This research generally involved having examinees take a timed examination with a pencil; when the time limit was reached, they then finished taking the test using a different colored pencil or pen so that scores under both speed and power conditions could be distinguished (e.g., Paterson & Tinker, 1930; Peak & Boring, 1926; Ruch & Koerth, 1923). The empirical evidence indicated that scores under the two conditions were highly correlated. For example, Ruch and Koerth (1923) administered the aforementioned Army Alpha examination to 122 examinees under two timed conditions and a power condition, and multicolored pencils were used to capture response markings under the different conditions. Examinees were first given the standard amount of time suggested by the testing manual to respond to questions using a black pencil (single time). After the first time limit expired, examinees were provided the same amount of time to continue or revise answers using a blue pencil (double time), and after that time limit expired they switched to a red pencil to complete or change responses under an untimed period (untimed). Results indicated that rank ordering remained consistent (single to double time total scores correlated at 0.966, and single to untimed total scores correlated at 0.945) and therefore seemed to support the comparability between speed and accuracy.
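A toy simulation can reproduce the logic of this multicolored-pencil design; all numbers below are invented for illustration, not taken from Ruch and Koerth. When speed and proficiency are generated to correlate highly, as Spearman's theory assumed, the rank order of examinees under the timed and untimed scores stays nearly identical:

```python
import math
import random

random.seed(7)
N_EXAMINEES, N_ITEMS = 500, 60

def spearman(xs, ys):
    """Rank correlation via Pearson on ranks (ties broken arbitrarily)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    m = (len(rx) - 1) / 2                # mean rank, identical for both
    cov = sum((a - m) * (b - m) for a, b in zip(rx, ry))
    var = sum((a - m) ** 2 for a in rx)  # rank variance, identical for both
    return cov / var

single_time, untimed = [], []
for _ in range(N_EXAMINEES):
    ability = random.gauss(0, 1)
    # Speed correlates ~.9 with ability, mimicking Spearman's assumption.
    speed = 0.9 * ability + math.sqrt(1 - 0.9 ** 2) * random.gauss(0, 1)
    reached = max(5, min(N_ITEMS, int(40 + 10 * speed)))
    p_correct = 1 / (1 + math.exp(-ability))
    correct = [random.random() < p_correct for _ in range(N_ITEMS)]
    single_time.append(sum(correct[:reached]))   # score at the time limit
    untimed.append(sum(correct))                 # score with unlimited time

print(round(spearman(single_time, untimed), 3))  # close to 1
```

Note that, as in the original design, the timed score here is literally a subset of the untimed score, a feature that Davidson and Carroll would later identify as a serious methodological flaw.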
Distinctions between Speed and Power
Taken at face value, Spearman's philosophy implies that time limits could be applied capriciously without consequence to validity (Morrison, 1960). As the study of mental testing matured, and likely motivated by the implication of Spearman's theory for practice, empirical research began to contradict the interchangeability of time and proficiency (Baxter, 1941; Davidson & Carroll, 1945). Davidson and Carroll provided a strong theoretical and empirical critique of this accepted practice. The authors expressed strong beliefs that scores from tests administered under time limits, particularly restrictive limits, reflected a mixture of examinees' knowledge and rate. This led the authors to claim, "the indiscriminate use of time-limit scores is one of the more unfortunate characteristics of current psychological testing ..." (p. 411). Davidson and Carroll first criticized the established method of correlating scores from timed and untimed administrations of the same examination because the untimed score reflects a combination of the timed component, responses to the unreached items, and any answer changes made by the examinee. As the timed scores represent a part of the total untimed score, this method spuriously inflates correlations. The problems with this approach were exacerbated when the timed condition allowed examinees to reach the vast majority of the items. In this situation, the timed scores would almost fully reflect the final untimed scores (and the two necessarily would be highly correlated).
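The part-whole artifact at the heart of this critique can be demonstrated with a small simulation (all numbers are invented for illustration). Even when the points gained after the time limit are pure noise, completely unrelated to the timed score, the timed and untimed totals still correlate strongly, simply because the timed score is a component of the untimed total:

```python
import math
import random

random.seed(1)

def pearson(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Timed scores for 2,000 hypothetical examinees.
timed = [random.gauss(50, 10) for _ in range(2000)]
# Points gained after the limit: pure noise, unrelated to the timed score.
gain = [abs(random.gauss(0, 3)) for _ in range(2000)]
untimed = [t + g for t, g in zip(timed, gain)]

# The part-whole overlap alone produces a high correlation, roughly
# 1 / sqrt(1 + var(gain) / var(timed)).
print(round(pearson(timed, untimed), 3))
```

The correlation approaches 1 as the post-limit gains shrink relative to the spread of the timed scores, which is exactly the situation Davidson and Carroll described: when most examinees reach most items in time, high timed-untimed correlations are a near-tautology rather than evidence that speed and proficiency are interchangeable.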
The authors followed up their methodological critique with an empirical study focusing on establishing a distinction between speed and knowledge. Utilizing various sections from a revised Army Alpha and several other examinations measuring a number of different constructs, the authors captured responses from examinees under timed and untimed conditions. They also collected data on the time it took each examinee to finish the exam after the time limit expired. A factor analysis found that scores from the untimed administration and completion speed loaded on separate orthogonal factors representing power and speed, respectively. Moreover, scores from the timed administration loaded on both the power factor and the speed factor, indicating that timed...