Scoring engines created to grade student essays are not only valid tools but in some cases match or exceed their human counterparts in accuracy and reliability, a study by Ben Hamner and Mark Shermis concluded.
The study, underwritten by the Hewlett Foundation’s Automated Student Assessment Prize (ASAP), compared automated scoring engines from eight commercial vendors, representing 97 percent of providers, and one university laboratory’s open-source grading engine.
Through a blind evaluation, the study authors compared graders on two dimensions: the distribution of scores each grader produced, and how consistently each grader scored essays of similar quality.
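The ASAP competition's headline metric for this kind of human–machine agreement was quadratic weighted kappa, which rewards close agreement and penalizes large score gaps more heavily than small ones. A minimal sketch of the metric (the example score lists are illustrative, not data from the study):

```python
from collections import Counter

def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Agreement between two raters on an ordinal score scale.
    1.0 = perfect agreement, 0.0 = chance-level agreement;
    disagreements are penalized by the squared score distance."""
    n = max_score - min_score + 1
    # Observed co-occurrence matrix of (rater a score, rater b score)
    observed = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    total = len(a)
    # Expected counts from each rater's marginal score distribution
    ca, cb = Counter(a), Counter(b)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)
            expected = ca[i + min_score] * cb[j + min_score] / total
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den

# Hypothetical scores on a 1-4 rubric: one mild disagreement
human   = [2, 3, 4, 4, 1, 3, 2, 4]
machine = [2, 3, 4, 3, 1, 3, 2, 4]
print(round(quadratic_weighted_kappa(human, machine, 1, 4), 3))  # 0.939
```

A near-1.0 kappa is what the vendors' "very high levels of agreement with expert graders" refers to in practice.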
“On both sets, the machines did as well as or matched human performance,” said Shermis.
Researchers obtained eight sets of essays from four different states. Some essays required the writer to read and then respond to a prompt; others were persuasive, descriptive, or narrative.
The vendors then used essays and scores from at least two human graders to develop training models for their engines.
“They had about four weeks to work with that and get those models in shape,” Shermis said. “Then, in a 58-hour period, we gave them a test set of essays and we asked them to predict what the scores would be.”
Overall, the engines did well, as expected, Shermis said, but the researchers were surprised to see them perform better on content-based essays.
“Some [electronic graders] did better on some kind of prompts and some on others,” said Tom Vander Ark, CEO of Open Education Solutions. “In general, seven or eight of them scored very, very high levels of agreement with expert graders and that was really the question we were studying.”
Possibilities for More Student Writing
State tests currently require a massive investment of manpower and time to grade written responses. The study encouraged developers to create grading engines for both formative assessments (routine teacher checks that students understand the material) and high-stakes assessments (standardized state tests), both of which could benefit from more written-response questions.
“Everybody that’s worked on this has one simple objective: we want American students to write a lot more,” Vander Ark said. “If high school students are only writing three papers in a semester, primarily because it’s so much work for teachers to grade 150 of those, that’s a big problem.”
Concerns about costs, accuracy, and reliability otherwise limit states to fewer written answers and essays.
“Writing assessments would be very difficult to implement if we solely relied on human graders,” Shermis said.
Man vs. Machine?
Ed-tech writer and advocate Audrey Watters worries that eliminating humans from grading could negatively affect what students learn. In college writing classes she taught, many of her students were “terrible writers” who had “been taught a very mechanistic way of writing, the five-paragraph special,” she said.
“Anytime you write something, there’s a human audience to it and I just don’t know if we’ve developed the technology yet where that software is grading things at that level,” Watters said. “Once we see writing as formulaic, that can be assessed and broken apart systematically by a computer algorithm, in turn the computer dictates the grade and in part, the instruction.”
Substantive feedback is really time-consuming, she noted, but methods such as peer grading could both help deal with paper overload and continue to include a human audience.
Vander Ark pointed out human graders are not flawless, either.
“Particularly when [people] are scoring a lot of papers you can have challenges like score drift; over time, for an individual grader, the scores may drift up or down,” Vander Ark said.
The Future of Assessment
Shermis suggested automated scoring engines could work with teachers to improve student writing.
In a high-stakes assessment, the automated scoring could augment rather than replace an expert grader. States could use automated scoring to assess the accuracy of an expert’s grade, and a wide disagreement might trigger a second read.
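The disagreement-triggered second read described above can be sketched as a simple routing step. This is an illustrative sketch, not the study's procedure; the function name and the one-point threshold are assumptions:

```python
def flag_for_second_read(human_scores, machine_scores, threshold=1):
    """Return indices of essays where the automated score and the
    human score disagree by more than `threshold` points, so they
    can be routed to a second expert grader. Names and threshold
    are hypothetical, for illustration only."""
    return [i for i, (h, m) in enumerate(zip(human_scores, machine_scores))
            if abs(h - m) > threshold]

# Hypothetical scores on a 1-5 rubric
human   = [4, 2, 5, 3, 1]
machine = [4, 4, 5, 3, 3]
print(flag_for_second_read(human, machine))  # [1, 4]: gaps of 2 points
```

Close agreement passes through untouched; only wide gaps cost an extra human read, which is what makes the hybrid approach cheaper than double-scoring every essay.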
“The infusion of technology” could transform year-round testing, Shermis said. “Instead of having one high-stakes test in March and getting the results in June, you could actually incorporate automated essay writing as a series of writing assessments.”
Automated tests, for instance, could assess students early in the year and monitor their performance continuously. “If a teacher can spend less time on punctuation, word choice, grammar, sentence structure… you can increase the amount of time a teacher has to spend on a student’s quality of writing and on finding their voice, distinctive aspects of writing that only an English teacher can teach,” Vander Ark said.
An Evolving Field
ASAP, which kick-started the study, is an open competition for new automated scoring techniques. The competition is set to close at the end of April and could bring better scoring ideas to light.
“I’d be really shocked if something new didn’t come out of this,” said study coauthor Ben Hamner. “This is the first time [automatic grading] has really been widely looked at by people from such a variety of backgrounds. Amateurs…bring their expertise from other domains and find something unique about it.”
Some states, like Kansas, already give all of their high-stakes assessments on computers.
“A lot of [grading programs] are already being used in both formative and summative applications,” Vander Ark said. “Some are currently used in essay scoring as one of the backup readers. They’re clearly ready for prime time.”
“Contrasting State-of-the-Art Automated Scoring of Essays: Analysis,” Ben Hamner and Mark D. Shermis, April 2012: http://bit.ly/HJWwdP