I
A Foundation for Developing and Validating Test Items
Part I covers four important, interrelated concerns in item development and validation.
This first chapter provides definitions of basic terms and distinctions useful in identifying what is to be measured. It also discusses validity and the validation process as they apply to item development. The second chapter presents the essential steps in item development and validation. The third chapter presents information on the role of content and cognitive demand in item development and validation. The fourth chapter presents a taxonomy of selected-response (SR) and constructed-response (CR) test item formats for certain types of content and cognitive demands.
1
The Role of Validity in Item Development
Overview
This chapter provides a conceptual basis for understanding the important role of validity in item development. First, basic terms are defined. Then the content of tests is differentiated. An argument-based approach to validity is presented that is consistent with current validity theory. The item development process and item validation are two related steps that are integral to item validity. The concept of item validity is applied throughout all chapters of this book.
Defining the Test Item
A test item is a device for obtaining information about a test taker's domain of knowledge and skills or a domain of tasks that define a construct. Familiar constructs in education are reading, writing, speaking, and listening. Constructs also apply to professions: medicine, teaching, accountancy, nursing, and the like. Every test item has the same three components:
1. Instructions to the test taker,
2. Conditions for performance, and
3. A scoring rule.
A test item is the basic unit of observation in any test. The most fundamental distinction for the test item is whether the test taker chooses an answer (selected-response: SR) or creates an answer (constructed-response: CR). The SR format is often known as multiple-choice. The CR format has many other names, including open-ended, performance, authentic, and completion. This SR-CR distinction is the basis for the organization of chapters in this book. The response to any SR or CR item is scorable. Items can be scored dichotomously (one for right and zero for wrong) or polytomously, using a rating scale or some graded series of responses. More refined distinctions among item formats are presented in chapter 4.
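The two scoring rules just described can be sketched in a few lines of code. This is a minimal illustration only, assuming invented function names and an invented 0-4 rubric; it is not drawn from the book itself.

```python
def score_dichotomous(response: str, key: str) -> int:
    """Dichotomous rule: 1 for the keyed (right) answer, 0 for anything else."""
    return 1 if response == key else 0

def score_polytomous(rating: int, max_points: int = 4) -> int:
    """Polytomous rule: a graded score on a rating scale, clamped to 0..max_points."""
    return max(0, min(rating, max_points))

# An SR item scored right/wrong, and a CR essay rated on a hypothetical 0-4 rubric.
sr_score = score_dichotomous("B", key="B")  # keyed answer chosen -> 1
cr_score = score_polytomous(3)              # rater awards 3 of 4 points -> 3
```

The point of the sketch is simply that a dichotomous rule yields two possible score points, while a polytomous rule yields an ordered series of score points, which is why CR items typically require scoring guides and trained raters.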
Thorndike (1967) advised item and test developers that the more effort we put into building better test items, the better the test is likely to be. Phrased in terms of validity: the greater the effort expended to improve the quality of test items in the item bank, the greater the degree of validity we are likely to attain. Because item development is a major step in test development, validity can be greatly affected by a sound, comprehensive effort to develop and validate test items.
Toward that end, we should develop each test item to represent a single type of content and a single type of cognitive behavior as accurately as is humanly possible. An item that measures multiple types of content and cognitive behavior goes well beyond our ability to understand the meaning of a test taker's response to it.
Defining the Test
A test is a measuring device intended to describe numerically the degree or amount of a construct under uniform, standardized conditions. Standardization is a very important idea when considering a test, and the most important feature of a test is the validity of its test score interpretation and use. "Measurement procedures tend to control irrelevant sources of variability by standardizing the tasks to be performed, the conditions under which they are performed, and the criteria used to interpret the results" (Kane, 2006b, p. 17).
In educational achievement testing, most tests contain a single test item or a set of test items intended to measure a domain of knowledge or skills or a domain of tasks representing an ability. The single test item might be a writing prompt or a complex mathematics problem. Responses to a single test item or a collection of test items are scorable using complex scoring guides and highly trained raters. The use of scoring rules helps to create a test score that is based on the test taker's responses to these test items. In this book, we are less concerned with tests and more concerned with developing highly effective items and then assembling validity evidence for the valid interpretation and use of each item response. Readers are directed to the Handbook of Test Development (Downing & Haladyna, 2006) for comprehensive discussions of issues and steps in the test development process. The fourth edition of Educational Measurement (Brennan, 2006) also provides current treatments of many important issues in test development and validation.
What Do Tests and Test Items Measure?
In this section, two issues we face in the measurement of any cognitive ability are presented and discussed. The first is the dilemma posed when we fail to operationally define the construct we want to measure. The second is the distinction between achievement and intelligence.
A construct is something definable that we want to measure. Constructs have characteristics that help define them. Another good way to make a construct clear is to list examples and non-examples. In educational and psychological testing, the most important constructs we measure include reading, writing, speaking, listening, mathematical problem-solving, scientific problem-solving, and critical thinking as applied in literature analysis and in social studies. Some constructs are subject-matter-based, for example language arts, mathematics, science, social studies, physical education, and English language proficiency. Professional competence is another type of construct that we often test for certification and licensure. Medicine, nursing, dentistry, accountancy, architecture, pharmacy, and teaching are all constructs of professional competence.
Operational Definitions and Constructs
Operational definitions are commonly agreed on by those responsible for, and most highly qualified in, measuring the construct. In other words, we have a consensus of highly qualified subject-matter experts (SMEs). In The Conduct of Inquiry, Kaplan stated:
To each construct there corresponds a set of operations involved in its scientific use. To know these operations is to understand the construct as fully as science requires; without knowing them, we do not know what the scientific meaning of the construct is, not even whether it has scientific meaning. (Kaplan, 1963, p. 40)
With an operational definition, we have no surplus meaning or confusion about the construct. We can be very precise in the measurement of an operationally defined construct. We can eliminate or reduce random or systematic error when measuring any operationally defined construct. Instances of operationally defined constructs include time, volume, distance, height, speed, and weight. Each can be measured with great precision because the definition of each of these constructs is specific enough. Test development for any construct that is operationally defined is usually very easy.
However, many constructs in education and psychology are not amenable to operational definition. Validity theorists advise that the alternative strategy is one of defining and validating constructs. By doing so, we recognize that the construct is too complex to define operationally (Cronbach & Meehl, 1955; Kane, 2006b; Kaplan, 1963; Messick, 1989). As previously noted, constructs include reading and writing. Also, each profession or specialty in life is a construct. For example, baseball ability, financial analysis, quilt-making, and dentistry are constructs that have usefulness in society. Each construct is very complex. Each construct requires the use of knowledge and skills in complex ways. Often we can conceive of each construct as a domain of tasks to be performed.
For every construct, we can identify some aspects that can be operationally defined. For instance, in writing, we have spelling, punctuation, and grammatical usage, each of which is operationally defined and easily measured. In mathematics, computation can be operationally defined. In most professions, we can identify sets of tasks that are either performed or not performed. Each of these tasks is operationally defined. However, these examples of operational definition within a construct represent the minority of tasks that comprise the construct. We are still limited to construct measurement and the problems it brings due to the construct's complexity and the need for expert judgment to evaluate performance.
Because constructs are complex and abstractly defined, we employ a strategy known as construct validation. This investigative process is discussed later in this chapter and used throughout this book. The investigation involves many important steps, and it leads to a conclusion about validity.
Achievement and Intelligence
The context for this book is the measurement of achievement that is the goal of instruction or training. Most testing programs are designed for elementary, secondary, college, and graduate school education. Another large area of testing involves certification in professions, such as medicine, dentistry, accountancy, and the like.
Achievement is usually thought of as planned changes in the cognitive behavior of students that result from instruction or training, although certainly achievement is possible due to factors outside instruction or training. All achievement can be defined in terms of content. This content can be represented in two ways: the first is as a domain of knowledge and skills; the second is as a cognitive ability for which there is a domain of tasks to be performed. Chapter 3 refines the distinctions between these two types of content. However, introducing these distinctions in the realm of achievement is important as we consider item development and validation, because they bear on validity.
Knowledge is a fundamental type of learning that includes facts, concepts, principles, and procedures that can be memorized or understood. Most student learning consists of knowledge. Knowledge is often organized as a domain consisting of an organized set of instructional objectives/content standards.
A skill is a learned, observable, performed act. A skill is easily recognize...