Exercise 1
Test-Retest Reliability
Behavior Rating Profile
Guideline
Test-retest reliability measures the stability of test scores over time. To estimate this type of reliability, a test is administered twice to a group of examinees — generally with a week or two between the two administrations. The degree of reliability is usually expressed with a correlation coefficient. Note that when a correlation coefficient is used to describe reliability, it is called a "reliability coefficient," or, in this case, a "test-retest reliability coefficient." See Appendix A to review the correlation coefficient before attempting this exercise.
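As a quick illustration (not part of the manual excerpt below), the sketch that follows computes a test-retest reliability coefficient as the Pearson correlation between two administrations of the same test. The scores are hypothetical, invented for this example, and the computation assumes NumPy is available.

```python
import numpy as np

# Hypothetical scores for five examinees on two administrations
# of the same test, roughly two weeks apart.
x = np.array([42, 37, 55, 48, 61])  # first administration
y = np.array([44, 35, 57, 46, 60])  # second administration

# Pearson r from its definition: the covariance of the two sets
# of scores divided by the product of their standard deviations.
r = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
print(f"Test-retest reliability coefficient: r = {r:.2f}")
```

A coefficient near 1.00 indicates that examinees kept nearly the same relative standing across the two administrations; lower values indicate less stable scores.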
Background Notes
The Behavior Rating Profile, 2nd edition (BRP-2) is designed for rating students who display disturbed behavior. On one scale, parents rate their children on items such as "Is verbally aggressive to parents" and "Is shy; clings to parents." On another scale, teachers rate the children on items such as "Is an academic underachiever" and "Doesn't follow class rules." On three other scales, children rate themselves in relation to their home lives (e.g., "I often break rules set by my parents"), school lives (e.g., "My teachers give me work that I cannot do"), and peers (e.g., "Other kids don't seem to like me very much").
Excerpt from the Manual
In the test-retest method, a test is administered to the same group of students on two occasions. A specified period of time is permitted to elapse between administrations, and the results are analyzed to test for mean differences or to determine the correlation of the two sets of data. Kaufman (1980) used this procedure to investigate the stability reliability [i.e., test-retest reliability] of the BRP-2 scales with 36 Indiana high school students, 27 of their parents, and 36 of their teachers, permitting two weeks to intervene between administrations... The resulting coefficients, reported in Table 4.3, range from .78 to .91 with only one coefficient falling below the .80 demarcation. These data provide evidence of the stability of the BRP-2 scales when they are used with adolescents. [See Table 4.3 on the next page.]
Table 4.3 Delayed Test-Retest Reliability of the BRP-2 Scales with Adolescents (decimals omitted)
| BRP-2 Scale | r |
| --- | --- |
| Parent Rating Scale | 84 |
| Teacher Rating Scale | 91 |
| Student Rating Scales: Home | 78 |
| Student Rating Scales: School | 83 |
| Student Rating Scales: Peer | 86 |
Questions:
- Which one of the scales is the most reliable? Explain.
- Which one of the scales is the least reliable? Explain.
- In your opinion, are all the scales adequately reliable? Explain.
- The excerpt presents the results of only one of a number of reliability studies described in the manual for the BRP-2. In your opinion, is this one study sufficient or are others needed? Explain.
- In Table 4.3, decimals have been omitted. If they were not omitted, what would the reliability coefficient be for the Parent Rating Scale?
- The test-retest reliability coefficients are based on a two-week interval. Do you think the coefficients would be higher or lower if a two-month interval had been used? Explain.
- Speculate on why test makers usually allow an interval of a week or two between the two administrations of the test instead of giving the same test twice in a row at one sitting.
- If you were considering using this instrument, what other types of reliability coefficients, if any, would you like to see in the manual? Explain.
- In general, how important is test-retest reliability information for selecting a scale or test? Would you consider it a serious flaw if a manual did not contain information on this topic? Explain.
- If you have a measurement textbook, do the authors suggest a minimum acceptable value for a test-retest reliability coefficient? If so, what is it? Do all of the coefficients in the excerpt exceed that minimum?
Exercise 2
Interscorer Reliability
Wechsler Preschool and Primary Scale of Intelligence
Guideline
Scoring some tests involves making subjective judgments. For example, some subjectivity often enters into scoring essays, and, as a result, one English teacher might give an essay a grade of A while another might give it a grade of B. Such a lack of agreement indicates a weakness in interscorer reliability (i.e., the consistency of scores from one scorer to another).
Interscorer reliability is usually judged by having two or more scorers independently score the same set of examinees' responses and then computing a correlation coefficient between the resulting sets of scores. Note that when a correlation coefficient is used for this purpose, it is called an "interscorer reliability coefficient." See Appendix A to review the correlation coefficient before attempting this exercise.
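To make the computation concrete, the hypothetical sketch below correlates two scorers' independent scores on the same set of essays. The scores are invented for the example, and NumPy's corrcoef is assumed to be available for the Pearson correlation.

```python
import numpy as np

# Hypothetical example: two scorers independently assign 0-10
# scores to the same six essays.
scorer_a = np.array([8, 5, 9, 6, 7, 4])
scorer_b = np.array([7, 5, 9, 5, 8, 4])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the correlation between the two scorers' scores.
r = np.corrcoef(scorer_a, scorer_b)[0, 1]
print(f"Interscorer reliability coefficient: r = {r:.2f}")
```

The closer the coefficient is to 1.00, the more nearly the two scorers rank the essays in the same order, which is the consistency the scoring rules are supposed to guarantee.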
Background Notes
The Wechsler Preschool and Primary Scale of Intelligence-Revised (WPPSI-R) is an individually administered intelligence test for young children. The test administrator observes an individual examinee's responses and scores them.
Excerpt from the Manual
Most WPPSI-R subtests involve straightforward and quite objective scoring; however, some subtests are subjectively scored, and are therefore more vulnerable to scoring error. For these subtests, which include Comprehension, Vocabulary, Similarities, and Mazes, it was necessary to evaluate interscorer reliability. In addition, previous research with the WPPSI indicated a low rate of scoring agreement on the Geometric Design subtest (Sattler, 1976). A more objective set of scoring rules and procedures was created for this subtest, and its effect on scorer agreement also was evaluated.
To assess the interscorer reliability of the Comprehension, Vocabulary, Similarities, and Mazes subtests, a sample of 151 cases (83 males and 68 females) stratified by age was randomly selected from all cases collected for the standardization. For the Geometric Design subtest, a sample of 188 cases (105 males and 83 females) was randomly selected. A group of research scorers was trained and given practice in scoring the subtests. The cases were subdivided by age to control for age effects, and two scorers were selected at random to score all the cases in each age group.
To ensure that scorings were independent, any previous scoring notations on standardization Record Forms were masked, leaving only the verbatim responses on the Verbal subtests, the performance times and tracing on Mazes, and the actual drawings on Geometric Design. Scorers in the study recorded their scores on separate forms so that they never saw each other's scores. . . .
Interscorer reliability coefficients were as follows: .96 on Comprehension, .94 on Vocabulary, .96 on Similarities, .94 on Mazes, and .88 on Geometric Design. These results indicate that the scoring rules for these subtests are objective enough for different scorers to produce similar results.
Questions:
- Why was scorer agreement examined for only some of the WPPSI-R subtests?
- Cases were selected at random. What is random selection?
- Cases were selected from all cases collected for the standardization. What do you think the "standardization" is?
- Is it important to know that the research scorers were trained and given practice in scoring the subtests? Explain.
- How many scorers scored the cases in each age group? In your opinion, is this an adequate number?
- The responses had been previously scored. Is it important to know that the research scorers were not allowed to see the previous scoring notations? Why? Why not?
- Is it important to know that the research scorers did not see each other's scores? Why? Why not?
- On which subtest was the interscorer reliability the lowest? Explain.
- Overall, do you think that the interscorer reliability is adequate? Explain.