On standardized tests, computers have begun to grade not just multiple-choice questions but even student writing such as essays. Grading software evaluates essays about as well as a human does on the kinds most state tests feature, said Mark Shermis, a professor of education at the University of Akron. The software cannot determine whether an essay is convincing or poetically written, only whether its grammar and structure match those of a well-constructed essay.
Research is “mixed” on whether computers accurately grade student writing for standardized tests, said Les Perelman of the Massachusetts Institute of Technology, who calls machine grading “very inaccurate” and says it gives high marks for “essay length and pretentious diction.”
Upcoming national Common Core tests that will replace most state English and math tests will include more open-response questions that require written explanations and essays. For now, each state will choose whether to hire people to grade those answers or to contract with companies to have computers do the grading.
Artificial Intelligence
After checking mechanical details such as spelling, punctuation, and grammar, the systems work by learning how human graders scored a sample set of essays, Shermis said. In 2012, Shermis and colleagues conducted a study analyzing almost all of the automated-scoring software available, entering essays from students in six states into nine computer programs. The researchers then compared the grades the software assigned with those humans assigned and found the two largely comparable.
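The details of commercial scoring engines are proprietary, but as a rough sketch of the “train to match humans” idea Shermis describes, an automated scorer might extract a few surface features from each essay and fit a simple regression so its predictions track the scores human raters gave a sample set. The features, essays, scores, and model below are all hypothetical, chosen only to illustrate the approach:

```python
# Illustrative sketch only -- not an actual commercial scoring engine.
# Fit a model so its scores track human raters' scores on a sample set,
# then apply it to a new essay.
import re
from sklearn.linear_model import LinearRegression

def surface_features(essay):
    """Crude surface features: word count, average word length, sentence count."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return [len(words), avg_word_len, len(sentences)]

# Hypothetical training sample: essays already scored by human graders.
sample_essays = [
    "Last summer I traveled with my family to visit my grandparents. We spent "
    "two weeks helping on their farm, and I learned how to repair a fence.",
    "Summer was fun. I played games.",
    "During the summer I volunteered at the local library, where I organized "
    "a reading program for younger children and wrote weekly book reviews.",
]
human_scores = [4.0, 1.0, 5.0]  # hypothetical scores assigned by human raters

model = LinearRegression()
model.fit([surface_features(e) for e in sample_essays], human_scores)

# Score a new essay by predicting what a human rater would likely give it.
new_essay = "This summer I built a birdhouse with my uncle and kept a journal about it."
print(round(model.predict([surface_features(new_essay)])[0], 1))
```

Because such a model only learns to imitate the human scores it is trained on, it inherits whatever patterns, and biases, those scores contain, which is the point Shermis makes next.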
Because grading software is “trained” to match human grading, any “biases” in human ratings also show up in the software, Shermis said. He cited Michigan tests as an example. The state’s rubric tells graders to ignore nonstandard English, such as ebonics.
“But no matter how well they train the raters, they still discount or undervalue essays with those kinds of expressions,” Shermis said. “And it would be possible, if you knew what variables were impacted, to actually instruct the machine to do something that human raters were not capable of doing.” But so far, the software programmers have not done that.
Artificial Writing
Computer grades match those from humans in large part because the kinds of essays tests assign are very basic, said David Williamson of the Educational Testing Service, which develops scoring software.
“The kind of essays that they’re asking for is focused on evaluating someone’s writing fluency and how well they write in English, and they’re writing essays like ‘Tell me what you did last summer,’ and things like that,” he said.
Essays for most tests are “generated spontaneously under timed conditions, without prior warning about what the particular essay is,” he noted. “So that is a different kind of thing than what a lot of educators value developing.”
The National Council of Teachers of English has issued a position statement against automated essay grading.
“[H]igh-stakes writing tests alter the normal conditions of writing by denying students the opportunity to think, read, talk with others, address real audiences, develop ideas, and revise their emerging texts over time,” the statement says. “Often, the results of such tests can affect the livelihoods of teachers, the fate of schools, or the educational opportunities for students.”
In most states, the grades a computer may assign to a student’s test essay now factor into teacher evaluations and school ratings.
“These high-stakes assessments are not what English teachers would do left up to their own devices as a way to develop writing skills,” said Shermis. “They like projects where they ask the students to plan out what they’re going to write, and draft what they’re going to write, and revise it several times until they sculpt a piece of prose that communicates effectively whatever the topic happens to be.”
Better Than Nothing
As essay-grading software continues to improve, it can offer teachers a quick, inexpensive way to give students initial feedback on their writing, and that is where much of the recent research has focused, Williamson said.
“The technology is getting better. It’s not perfect,” Shermis said, “but it’s a lot better than getting very limited feedback in a teaching situation.”