1 Dimensionality of Test Data and Aberrant Response Patterns
1.1 General Overview of Cognitively Diagnostic Methodologies
The value of a diagnostic profile that enumerates strengths and weaknesses in individual performance has long been recognized in education, and competent teachers have been using their diagnostic skills in their classrooms to teach students better. It had been common sense that only humans could do such detective work to determine what was going on inside a human brain; however, the rapid development of computer technologies in the 1970s enabled technology to accomplish what previously had been impossible for humans. As computer technologies developed, computational power increased dramatically. Linguists worked on natural language processing, psychologists were interested in modeling human information and retrieval, mathematicians were more interested in automating the theorem-proving process, and statisticians advanced various statistical methodologies and models that were impossible to compute without the help of computers. Computer scientists Brown and Burton (1978) developed a computer program called BUGGY using a new, powerful programming language suitable for processing a list of logical statements. BUGGY was able to diagnose various "bugs," or equivalently, erroneous rules of operations, committed by students in whole-number subtraction problems.
The successful diagnostic capability of the computer program BUGGY affected American education to a great extent; consequently, similar computer programs, called "expert systems" (Anderson, 1984), that are capable of diagnosing erroneous rules of operations or of teaching simple algebra and geometry were developed in the 1980s and 1990s. FBUG was a similar buggy system that followed the idea of the original BUGGY and was capable of diagnosing fraction addition and subtraction problems (Tatsuoka & Baillie, 1982). These computer programs required a prepared list of erroneous rules that were originally discovered by humans. Each erroneous rule was decomposed into a sequence of logical statements, and a computer language like LISP was used to develop the diagnostic systems. If a new erroneous rule was discovered, then the program would be modified to include it. The system could not discover new erroneous rules that were missing from the initial list, nor could it diagnose common erroneous rules when a student used a different strategy or a new method to solve a problem.
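The dependence on a prepared list of erroneous rules can be illustrated with a minimal sketch. It is not the BUGGY or FBUG implementation (those were written in LISP-like languages and are not reproduced here); the function names, the `RULES` dictionary, and the test problems are hypothetical. The one erroneous rule encoded is the well-known "smaller-from-larger" subtraction bug, in which the smaller digit in each column is subtracted from the larger one regardless of position.

```python
def correct_subtract(a: int, b: int) -> int:
    """The correct whole-number subtraction rule."""
    return a - b


def smaller_from_larger(a: int, b: int) -> int:
    """Hypothetical encoding of the 'smaller-from-larger' bug: in each column
    the smaller digit is subtracted from the larger, so borrowing never occurs."""
    width = max(len(str(a)), len(str(b)))
    top, bottom = str(a).zfill(width), str(b).zfill(width)
    digits = [str(abs(int(t) - int(u))) for t, u in zip(top, bottom)]
    return int("".join(digits))


# The prepared list of rules; anything not listed here cannot be diagnosed.
RULES = {
    "correct": correct_subtract,
    "smaller_from_larger": smaller_from_larger,
}


def diagnose(problems, answers):
    """Return the names of all listed rules that reproduce every observed answer."""
    return [
        name
        for name, rule in RULES.items()
        if all(rule(a, b) == ans for (a, b), ans in zip(problems, answers))
    ]


problems = [(52, 17), (83, 46)]
print(diagnose(problems, [35, 37]))  # answers produced by the correct rule
print(diagnose(problems, [45, 43]))  # answers produced by the listed bug
print(diagnose(problems, [40, 40]))  # answers produced by an unlisted rule
```

In this sketch the last call returns an empty list, which mirrors the limitation described above: responses generated by a rule outside the prepared list cannot be matched to any diagnosis.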
Stability and Gradient of Diagnostic Information
Sleeman, Kelly, Martinak, Ward, and Moore (1989) developed a buggy system for algebra and discovered that many students changed their erroneous rules of operations so often that the buggy system was practically unable to diagnose such students. VanLehn (1983) developed "repair theory" to explain why bugs are unstable. Shaw (1984, 1986) interviewed 40 to 50 students on fraction addition and subtraction problems; Standiford, Tatsuoka, and Klein (1982) also interviewed many students on mixed-number operations; and so did Birenbaum (1981) on signed-number operations. They discovered that 95% of erroneous rules of operations in these domains were extremely unstable, and students kept changing their rules to something else. The students also could not answer interview questions as to why they changed their old rules to new ones. Moreover, many students could not even recall what rules they had used, even when they had used them only 10 seconds before. Tatsuoka (1984a) concluded that it would not be wise to diagnose performance at a micro level, such as bugs or erroneous rules, on a test. Consequently, an important question arose regarding the level of performance that would be stable enough to measure and diagnose.
Total scores of most large-scale assessments have high reliabilities, but the level of information is too coarse, and because there are too many different ways to get 50% of the items correct, total scores are not very useful for cognitive diagnosis. If a math test has 5 geometry items and 5 algebra items, then there are 252 ways to achieve a score of 5. Some students may get only the geometry items correct and all algebra items incorrect, whereas others get all geometry items incorrect and all algebra items correct. Their sources of misconceptions could be very different, and they would then need very different remediation treatments. The item score of a test is still at the macro level, and it is difficult to obtain useful diagnostic information from a single item. The question then becomes which levels of diagnostic information would be most valuable and helpful in promoting learning activities among students and whether a subscore level would be useful. The following problems, labeled Examples 1.1 and 1.2, were excerpted from the technical report to the National Science Foundation (Tatsuoka, Kelly, C. Tatsuoka, Varadi, & Dean, 2007) and coded by three types of attributes: content-related knowledge and skills (C2–C5 and Exponential and Probabilities), mathematical thinking skills (P1–P10), and special skills unique to item types (S1–S9).
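The count of 252 is the number of ways to choose which 5 of the 10 items are answered correctly. The short sketch below checks this figure and breaks it down by how many of the correct answers fall on geometry items; the variable names are illustrative only.

```python
from itertools import product
from math import comb

N_GEOMETRY = 5
N_ALGEBRA = 5
TARGET_SCORE = 5

# Number of distinct right/wrong patterns with exactly 5 correct out of 10 items.
total_ways = comb(N_GEOMETRY + N_ALGEBRA, TARGET_SCORE)

# Breakdown by the number g of geometry items answered correctly.
by_split = {
    g: comb(N_GEOMETRY, g) * comb(N_ALGEBRA, TARGET_SCORE - g)
    for g in range(TARGET_SCORE + 1)
}

# Brute-force check over all 2**10 response patterns.
brute_force = sum(
    1
    for pattern in product((0, 1), repeat=N_GEOMETRY + N_ALGEBRA)
    if sum(pattern) == TARGET_SCORE
)

print(total_ways)
print(by_split)
```

Only one of these patterns corresponds to "all geometry correct, all algebra incorrect" and one to the reverse, so the same total score hides many qualitatively different response patterns.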
Example 1.1
A water ski tow handle makes an isosceles triangle. If one of the congruent angles is 65 degrees, what is the measure of angle x?
Geometric figure is given: S3.
This is a geometry problem: C4.
Have to apply knowledge about the relationships among angles to get the solution: P3.
Because the sum of the angles of a triangle is 180°, the third angle is (180° − 130°) = 50°. Therefore, x can be obtained by subtracting half of this angle, 25°, from 180°: P5.
That is, 180° − {180° − (65° + 65°)}/2 = 155°: P2.
Example 1.2
An electrician has a plastic pipe, used for underground wiring, which is 15 feet long. He needs plastic pieces that are 6.5 inches long to complete his job. When he cuts the pipe, how many pieces will he be able to use for his job?
Two thirds of the students answered this question correctly. We counted the number of words used in the stem and found 52 words; solving the item requires translating the word problem into an arithmetic procedure: P1.
Because two different units, feet and inches, are used, we have to convert feet to inches (1 foot = 12 inches), so 15 feet is 180 inches: S1.
Each piece must be 6.5 inches long, and 180/6.5 ≈ 27.7, so he can cut 27 usable pieces: P2.
Dividing 180 by a decimal number, 6.5, belongs to content domain C2: C2.
There are two steps: the first is to convert the lengths to a common unit, and the second is to carry out the computation: P9.
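The two steps of Example 1.2 can be written out as a brief sketch. The variable names are illustrative, and the final rounding down reflects the assumption that only whole 6.5-inch pieces are usable.

```python
INCHES_PER_FOOT = 12

pipe_feet = 15
piece_inches = 6.5

# Step 1 (S1): convert the pipe length to the common unit, inches.
pipe_inches = pipe_feet * INCHES_PER_FOOT

# Step 2 (P2, C2): divide by the decimal piece length and keep whole pieces only.
usable_pieces = int(pipe_inches // piece_inches)
leftover_inches = pipe_inches - usable_pieces * piece_inches

print(pipe_inches, usable_pieces, leftover_inches)
```

The quotient is not a whole number, so the answer of 27 comes from discarding the leftover 4.5 inches of pipe.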
These simple problems suggest that several different attributes listed in Table 1.1 must be applied correctly in order to get the right answer. P2 is involved in both items, but the remaining attributes coded in the two problems do not overlap. There are 27 attributes listed in Table 1.1 and only 45 items per test. The attributes are involved in different ways in each of the 45 items, and no two items involve an identical set of attributes. The attribute involvement is intertwined and complex. The problem is to determine how one can possibly separate the items into subsets based on attribute involvement, and take the subscores from each subset as a measure of performance on the attributes.
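The attribute coding of the two examples can be laid out as a small item-by-attribute incidence table, sketched below. The dictionary and variable names are hypothetical, and the 0/1 row layout is only one possible representation; the codes themselves are those given in Examples 1.1 and 1.2.

```python
# Attribute codes taken from the worked examples above.
example_attrs = {
    "Example 1.1": {"S3", "C4", "P3", "P5", "P2"},
    "Example 1.2": {"P1", "S1", "P2", "C2", "P9"},
}

# Columns of the incidence table: every attribute used by at least one item.
attributes = sorted(set().union(*example_attrs.values()))

# One 0/1 row per item: 1 if the item involves the attribute, 0 otherwise.
incidence = {
    item: [int(attr in attrs) for attr in attributes]
    for item, attrs in example_attrs.items()
}

# Attributes shared by both items.
shared = set.intersection(*example_attrs.values())

print(attributes)
for item, row in incidence.items():
    print(item, row)
print(shared)
```

Even with only two items, the rows differ in all but one column, which illustrates why grouping items into clean subsets by attribute, and scoring each subset separately, is not straightforward.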
Table 1.1 A Modified List of Knowledge, Skill, and Process Attributes Derived to Explain Performance on Mathematics Items From the TIMSS-R (1999) for Population 2 (Eighth Graders) for Some State Assessment
| Content Attributes |
| C1 | Basic concepts and operations in whole numbers and integers |
| C2 | Basic concepts and operations in fractions and decimals |
| EXP | Powers, roots, and scientific expression of numbers are separated from C2 |
| C3 | Basic concepts and operations in elementary algebra |
| C4 | Basic concepts and operations in two-dimensional geometry |
| C5 | Data and basic statistics |
| PROB | Basic concepts, properties, and computational skills in probability |
| Process Attributes |
| P1 | Translate, formulate, and understand (only for seventh graders) equations and expressions to solve a problem |
| P2 | Computational applications of knowledge in arithmetic and geometry |
| P3 | Judgmental applications of knowledge in arithmetic and geometry |
| P4 | Applying rules in algebra and solving equations (plugging in included for seventh graders) |
| P5 | Logical reasoning: includes case reasoning, deductive thinking skills, if-then, necessary and sufficient conditions, and generalization skills |
| P6 | Problem search; analytic thinking and problem restructuring; and inductive thinking |
| P7 | Generating, visualizing, and reading figures and graphs |
| P9 | Management of data and procedures, complex, and can set multigoals |
| P10 | Quantitative and logical reading (less than, must, need to be, at least, best, etc.) |
| Skill (Item Type) Attributes |
| S1 | Unit conversion |
| S2 | Apply number properties and relationships; number sense and number line |
| S3 | Using figures, tables, charts, and graphs |
| S3g | Using geometric figures |
| S4 | Approximation and estimation |
| S5 | Evaluate, verify, and check options |
| S6 | Patterns and relationships (inductive thinking skills) |
| S7 | Using proportional reasoning |
| S8 | Solving novel or unfamiliar problems |
| S9 | Comparison of two or more entities |
The search for acceptable levels of helpful and reliable diagnostic information continued in the 1980s and early 1990s. Tatsuoka (1984a) investigated this question by grouping erroneous rules in fraction problems according to their sources of errors, and examined their stability across two parallel tests. She determined, for example, that 16 erroneous rules of operations originated from the action of making two equivalent fractions. The sources of errors, or the sources of erroneous rules of operations, were acceptably stable. She further investigated the changes of error types over four parallel tests of signed-number computations (Tatsuoka, 1983a; Tatsuoka, Birenbaum, & Arnold, 1990; Tatsuoka, Birenbaum, Lewis, & Sheehan, 1993), and Birenbaum and her associates (Birenbaum, Kelly, & Tatsuoka, 1993; Birenbaum & Tatsuoka, 1980) examined the stability of computational skills in algebra and exponential items by examining the agreement of diagnostic classifications from parallel subtests (Birenbaum, Tatsuoka, & Nasser, 1997). Tatsuoka and Tatsuoka (2005) tested the stability of classification results obtained by the rule space method and found that test–retest correlations at the attribute level are higher than those at the item level. This series of studies confirmed that the erroneous rules are extremely un...