
What We Know About Grading

What Works, What Doesn't, and What's Next

Thomas R. Guskey, Susan M. Brookhart

236 pages · English
About This Book

Grading is one of the most hotly debated topics in education, and grading practices themselves are largely based on tradition, instinct, or personal history or philosophy. But to be effective, grading policies and practices must be based on trustworthy research evidence.

Enter this book: a review of 100-plus years of grading research that presents the broadest and most comprehensive summary of research on grading and reporting available to date, with clear takeaways for learning and teaching. Edited by Thomas R. Guskey and Susan M. Brookhart, this indispensable guide features thoughtful, thorough dives into the research from a distinguished team of scholars, geared to a broad range of stakeholders, including teachers, school leaders, policymakers, and researchers. Each chapter addresses a different area of grading research and describes how the major findings in that area might be leveraged to improve grading policy and practice. Ultimately, Guskey and Brookhart identify four themes emerging from the research that can guide these efforts:
- Start with clear learning goals,
- Focus on the feedback function of grades,
- Limit the number of grade categories, and
- Provide multiple grades that reflect product, process, and progress criteria.

By distilling the vast body of research evidence into meaningful, actionable findings and strategies, this book is the jump-start all stakeholders need to build a better understanding of what works—and where to go from here.


Information

Publisher
ASCD
Year
2019
ISBN
9781416627654

Chapter 1

Reliability in Grading and Grading Scales

Few people today would question the premise that students' grades should reflect the quality of their work and not depend on whether their teachers are "hard" or "easy" graders. But how much subjectivity on the part of teachers is involved in the grading process, and what do we know about its influence? The earliest research on grading dates to the 1800s and was concerned with this very issue. These early studies questioned the reliability of teachers' grading.

Why Is This Area of Research Important?

Reading the research on grading gives present-day educators cause for consternation. On the one hand, early studies of grading reliability clearly were motivated by researchers' dissatisfaction with, and sometimes disdain for, teachers' unreliable practices. Our reaction to this, of course, is indignation: That's not right! On the other hand, the extent of the unreliability in grading identified in these early studies was huge. Grades for the same work varied dramatically from teacher to teacher, resulting in highly divergent conclusions about students, their learning, and their future studies. That's not right, either.
In this chapter, we describe these early studies of grade reliability as well as one contemporary study that replicated an early study. We gently critique some of the underlying bias in these studies, and then offer some practical suggestions for applying the studies' results to grading practices today. Despite their biases and flaws, these early studies do offer several clear implications for practice.

What Significant Studies Have Been Conducted in This Area?

In our review of the research, we found 16 individual studies of grading reliability from the early 20th century, plus two early reviews of grading studies by Kelly (1914) and Rugg (1918). These are described in Figure 1.1. We reference these early reviews because they include dozens of early studies in addition to the published studies we were able to locate. Some of the studies Kelly and Rugg reviewed were unpublished reports from school districts or universities that are unavailable to us a century later. In addition, we found an early statistical treatise on the subject by Edgeworth (1888) that we describe first because it set the stage for the research that followed.

Figure 1.1. Early Studies of the Reliability of Grades
Studies: Ashbaugh (1924)
Participants: University education students
Main Findings
  • When graders scored the same 7th grade arithmetic paper on three occasions, the mean remained constant but the scores clustered more closely together.
  • Inconsistencies among graders increased over time.
  • After discussion, graders devised a point scheme for each problem and grading variability decreased.
* * *
Studies: Bolton (1927)
Participants: 6th grade arithmetic teachers
Main Findings
  • Average deviation was 5 points out of 100 on 24 papers.
  • Lowest-quality work presented the greatest level of variation.
* * *
Studies: Brimi (2011)
Participants: English teachers
Main Findings
  • Range of scores was 46 points out of 100 and covered all five letter-grade levels.
* * *
Studies: Eells (1930)
Participants: Teachers in a college measurement course
Main Findings
  • Elementary teachers displayed grading inconsistency over time grading three geography and two history questions.
  • Estimated reliability was low.
  • Most agreement was found on one very poor paper.
* * *
Studies: Healy (1935)
Participants: 6th grade written compositions from 50 different teachers
Main Findings
  • Format and usage errors were weighed more heavily in grades than the quality of ideas.
* * *
Studies: Hulten (1925)
Participants: English teachers
Main Findings
  • Teacher inconsistency was revealed over time grading five compositions.
  • 20 percent changed from pass to fail or vice versa on the second marking.
* * *
Studies: Jacoby (1910)
Participants: College astronomy professors
Main Findings
  • There was little disagreement on grades for five high-quality exams.
* * *
Studies: Lauterbach (1928)
Participants: Teachers grading handwritten and typed papers
Main Findings
  • Student work quality was a source of grade variability.
  • In absolute terms, there was much variation by teacher for each paper.
  • In relative terms, teachers' marks reliably ranked students.
* * *
Studies: Shriner (1930)
Participants: High school English and algebra teachers
Main Findings
  • Teachers' grading was reliable.
  • There was greater teacher disagreement in grades for the poorer papers.
* * *
Studies: Silberstein (1922)
Participants: Teachers grading one English paper that originally passed in high school but was failed by the New York Regents
Main Findings
  • When teachers regraded the same paper, they changed their grade.
  • Scores on individual questions on the exam varied greatly, explaining the overall grading disagreement (except on one question about syntax, where grades were more uniform).
* * *
Studies: Sims (1933)
Participants: Reanalysis of four studies of grading arithmetic, algebra, high school English, and psychology exams
Main Findings
  • There were two kinds of variability in teachers' grades: (1) differences in students' work quality, and (2) "differences in the standards of grading found among school systems and among teachers within a system" (p. 637).
  • Teachers disagreed significantly on grades.
  • Changing from a 100-point scale to grades reduced disagreements.
* * *
Studies: Starch (1913)
Participants: College freshman English instructors
Main Findings
  • Teacher disagreement was significant, especially for the two poorest papers.
  • Four sources of variation were found and probable error reported for each: (1) differences among the standards of different schools (no influence), (2) differences among the standards of different teachers (some influence), (3) differences in the relative values placed by different teachers upon various elements in a paper, including content and form (larger influence), and (4) differences due to the pure inability to distinguish between closely allied degrees of merit (larger influence).
* * *
Studies: Starch (1915)
Participants: 6th and 7th grade teachers
Main Findings
  • Average teacher variability of 4.2 (out of 100) was reduced to 2.8 by forcing a normal distribution using a five-category scale (poor, inferior, medium, superior, and excellent).
* * *
Studies: Starch & Elliott (1912)
Participants: High school English teachers
Main Findings
  • Teacher disagreement in assigning grades was large (a range of 30–40 out of 100 points).
  • Teachers disagreed on rank order of papers.
* * *
Studies: Starch & Elliott (1913a)
Participants: High school mathematics teachers
Main Findings
  • Teacher disagreement on a mathematics exam was larger than it was on the English papers in Starch and Elliott (1912).
  • Teachers disagreed on the grade for one item's answer about as much as they did on the composite grade for the whole exam.
* * *
Studies: Starch & Elliott (1913b)
Participants: High school history teachers
Main Findings
  • Teacher disagreement on one history exam was larger than for the English or math exams in prior Starch and Elliott studies (1912, 1913a).
  • Study concluded that variability isn't due to subject, but "the examiner and method of examination" (p. 680).
Source: From "A Century of Grading Research: Meaning and Value in the Most Common Educational Measure," by S. M. Brookhart, T. R. Guskey, A. J. Bowers, J. H. McMillan, J. K. Smith, L. F. Smith, et al., 2016, Review of Educational Research, 86(4), pp. 803–848. Copyright 2016 by American Educational Research Association. Adapted with permission.

The earliest investigation we could find is a statistical study published in the Journal of the Royal Statistical Society in the United Kingdom and rarely cited in the U.S. grading literature. And it's a doozy—the study begins with a table of contents outlining 26 separate points the author wants to make! Professor F. Y. Edgeworth (1888), author of the study, made an important contribution to both statistics and grading research by applying normal curve theory—he called it the "Theory of Errors" (p. 600)—to the case of grading examinations. Normal curve theory was fairly new at the time. Mathematician Carl Friedrich Gauss introduced the theory in the early 1800s and pointed out its usefulness for estimating the size of error in any measure. Edgeworth deserves a lot of credit for realizing this advance in statistics could help solve practical problems in education.
Unlike some of the researchers who followed him, Edgeworth was motivated not to criticize teachers and professors but to make things fairer for students. He explained that when students' performance is poorly measured, bad decisions result, including mistakes in identifying students for "honours" upon graduation (by which Edgeworth meant "'successful candidates' in an open competition for the Army or the India or Home Civil Service" [p. 603]). Thus, unreliable grades had real consequences for students.
Edgeworth described the plight of those whose achievement was good enough for these important future opportunities but whose grades did not confirm it: "There are some of the pass men as good as some of the honour men; but, like the unsung brave 'who lived before Agamemnon,' they are huddled unknown amongst the ignominious throng, for want, not of talent, or learning, or industry, or judgment, but luck" (p. 616). Edgeworth considered this part of an argument for improving grading reliability. We love Edgeworth's poetic and righteous indignation.
Normal curve theory allowed Edgeworth to measure the amount of error in grades due to chance, which in itself was a contribution to research. But Edgeworth went beyond that to tease out different sources of grading error: (1) chance, (2) personal differences among graders regarding the whole exam and individual items on the exam, and (3) "taking [the examinee's] answers as representative of his proficiency" (p. 614). He did this by using both hypothetical and real data to calculate the probable amount of error in examination grades, under different conditions and for different exams.
The idea that multiple factors led to unreliable grades was a huge step forward. It gave educators a window into things they could do about the problem. We can't really do much about the fact that measures tend to vary by chance. We can, however, take steps to help graders develop a shared view of what knowledge and skills the items and tasks on an exam are supposed to measure. We also can take steps to make sure the items and tasks on an exam really are representative of what we would now call desired learning outcomes.
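The "probable error" Edgeworth calculated can be illustrated with a minimal sketch. The marks and the helper function below are hypothetical, not from the chapter; the sketch only assumes the classical definition that, for a normal distribution, the probable error is 0.6745 times the standard deviation—the bound within which half of all chance errors are expected to fall.

```python
import statistics

def probable_error(grades):
    """Probable error of a set of marks: 0.6745 * sample standard deviation.

    For normally distributed chance error, half of all marks are expected
    to fall within this distance of the true score.
    """
    return 0.6745 * statistics.stdev(grades)

# Hypothetical marks assigned by ten graders to the same paper (0-100 scale)
marks = [68, 72, 75, 70, 80, 66, 74, 78, 71, 69]
pe = probable_error(marks)  # roughly 3 points on the 100-point scale
```

On this invented data, the probable error is about 3 points, meaning a given grader's mark would as likely as not land within 3 points of the group's consensus—exactly the kind of quantity Edgeworth used to compare different exams and grading conditions.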

What Questions Have Been Addressed in This Research? What Have the Results of Those Studies Revealed?

As we have noted, the most valuable early studies of grading reliability investigated sources of variation in grading. The least valuable of these studies simply investigated whether variability in grading existed at all, found that it did (of course it did), and simply proclaimed it a bad thing. More valuable studies investigated whether grading variability was affected by the quality of the student work or its format (e.g., by asking whether teachers find it easier to agree on grades for good papers or poor ones). Other studies investigated whether changing the grading scale would make grading more reliable. In the early 20th century, the most prevalent grading scale was the 0 to 100 percentage scale, which proved to be exceedingly unreliable. Teachers were much more consistent when using grading scales with fewer categories, especially those with five categories or fewer.
The main finding from these early studies was that great variation existed in the grades teachers assign to students' work (Ashbaugh, 1924; Brimi, 2011; Eells, 1930; Healy, 1935; Hulten, 1925; Lauterbach, 1928; Silberstein, 1922; Sims, 1933; Starch, 1913, 1915; Starch & Elliott, 1912, 1913a, 1913b). This finding agrees with the two early reviews of grading studies by Kelly (1914) and Rugg (1918). Not every early study, however, was quite so pessimistic. Studies by Jacoby (1910), Bolton (1927), and Shriner (1930) argued that grading was not as unreliable as commonly believed at the time.
Early researchers attributed the inconsistency in teachers' grades to one or more of the following sources:
  • The criteria for evaluating the work...
