High-Stakes Tests Vindicated

Published April 1, 2003

Can standardized tests provide a reliable gauge of student achievement and school quality?
_X_ Yes __ No __ Don’t Know

Can standardized tests still provide a reliable gauge of student achievement and school quality when test results are used to reward or sanction schools (“high-stakes” tests)?
_X_ Yes __ No __ Don’t Know

Do high-stakes tests encourage cheating by schools, teachers, and students, thus exaggerating student achievement?
__ Yes _X_ No __ Don’t Know


In a new study of school systems enrolling 9 percent of all U.S. public school students, scholars with the Manhattan Institute for Policy Research found that “accountability systems that use high-stakes tests can … be designed to produce credible results that are not distorted by teaching to the test, cheating, or other manipulations of the testing system.”

The study, “Testing High Stakes Tests: Can We Believe the Results of Accountability Tests?” by Jay P. Greene, Ph.D., Marcus A. Winters, and Greg Forster, Ph.D., examined the accuracy of high-stakes tests (tests whose results are used to reward or sanction schools) by comparing students’ scores on high-stakes tests with their scores on low-stakes tests. Greene and his colleagues found that high- and low-stakes tests produce very similar score levels, suggesting high-stakes tests are a credible tool for gauging student and school performance.

The researchers examined 5,587 schools in two states, Florida and Virginia, and seven school districts in seven different states. They found a high correlation between average score levels on the two types of tests and a moderate correlation between high- and low-stakes tests on the gain in scores from year to year. Of the jurisdictions studied, Florida showed the highest correlations, suggesting the Sunshine State’s high-stakes tests are highly reliable.

“Teaching to the Test”?

The adoption of high-stakes tests by states and districts for accountability purposes has generated criticism from some educators and researchers. They contend high-stakes tests are an inaccurate gauge of student ability and school quality because the stakes act as an incentive for cheating, teaching to the test, and manipulation of test design to exaggerate student achievement. Teaching to the test, according to these critics, means teaching only the specific knowledge needed to pass the test while failing to teach the broader concepts.

The Manhattan Institute scholars found little evidence to support the critics’ position.

“Most of these criticisms fail to withstand scrutiny,” they conclude. “Much of the research done in this area has been largely theoretical, anecdotal, or limited to one or another particular state test.”

For example, Audrey L. Amrein and David C. Berliner in 2002 criticized high-stakes tests because they found a weak correlation with other tests, such as the Scholastic Aptitude Test (SAT), the ACT, and Advanced Placement (AP) tests. But Greene and his colleagues argue that misleading results can be obtained when comparing exams taken by a select group of students (such as college entrance and AP exams) to high-stakes exams taken by the general student body. Similarly, comparing grades to test scores produces inaccurate results because of teacher subjectivity and grade inflation.

Educational Nihilism

Some of the criticism of high-stakes testing, Greene and his colleagues point out, stems from a general anti-testing bias and a belief that achievement cannot be measured. The Manhattan Institute authors reject that notion as “educational nihilism.” Instead, they operate on the premise that student achievement is measurable through testing, and they ask whether the stakes in high-stakes testing distort the results of the tests.

To gauge the reliability of a high-stakes test, the authors compare scores on high-stakes tests with scores on tests that carry no stakes and therefore offer no incentive to manipulate the results. The nine jurisdictions selected for study are places where students take both high- and low-stakes tests. The low-stakes tests were all nationally recognized standardized tests, while the high-stakes tests were state-developed tests.

The study’s analysis uses both average scores and measures of year-to-year score gains. Average scores show whether students are meeting the standard, whereas score gains show how much students are learning in a year. Measures of score gains are valuable because they can isolate the impact a school is having on lifting student achievement, regardless of whether students are hitting the standard.
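
To see what the two measures involve, consider the following short sketch. The school-level figures and the code below are invented for illustration only; they are not data or calculations from the study, which reports correlations across thousands of schools.

    import numpy as np

    # Hypothetical school-level averages (invented for illustration):
    # each array holds one average score per school, for two years.
    high_y1 = np.array([310.0, 295.0, 330.0, 288.0, 305.0])  # state (high-stakes) test, year 1
    high_y2 = np.array([318.0, 301.0, 334.0, 290.0, 312.0])  # state test, year 2
    low_y1 = np.array([620.0, 598.0, 655.0, 580.0, 610.0])   # national (low-stakes) test, year 1
    low_y2 = np.array([633.0, 606.0, 660.0, 585.0, 621.0])   # national test, year 2

    def pearson(a, b):
        """Pearson correlation between two vectors of school scores."""
        return float(np.corrcoef(a, b)[0, 1])

    # Measure 1: do average score LEVELS line up across the two tests?
    print("level correlation:", round(pearson(high_y2, low_y2), 2))

    # Measure 2: do year-to-year GAINS line up across the two tests?
    print("gain correlation:", round(pearson(high_y2 - high_y1, low_y2 - low_y1), 2))

In the study’s logic, high correlations on both measures indicate the stakes are not distorting the results.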

In the nine jurisdictions studied, average scores on high- and low-stakes tests correlated more strongly than did score gains. The correlations varied considerably among the jurisdictions, with Florida showing the highest correlation between its high- and low-stakes tests. Florida is considered to have one of the most aggressive high-stakes programs for both schools and students:

  • Schools that do not have sufficient numbers of students meeting the standard for two out of four years can lose their students to other schools, including private schools, through vouchers.
  • Students are also held accountable: Passage of the 3rd grade test is required for promotion to 4th grade, and passage of the 10th grade test is required for graduation.

Such a tough accountability program could provide an incentive to manipulate scores on the high-stakes test. However, scores on Florida’s high-stakes test nearly match scores on its low-stakes test. Florida’s high correlation shows the stakes in the high-stakes tests do not distort the test results. Teachers are not simply teaching the specific answers to pass the test; rather, they are teaching the broad set of skills and knowledge needed to pass both the FCAT, Florida’s high-stakes test, and the Stanford 9, a nationally recognized test.

Although there was some correlation between high- and low-stakes testing in all jurisdictions, not every jurisdiction did as well as Florida. The authors suggest a low correlation may reflect poor design or poor implementation of the high-stakes test. Differences in the material covered by the high- and low-stakes tests could also have reduced the correlation. On the score gains measure, lack of data or measurement error could have distorted the correlation.
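
A brief simulation illustrates why measurement error weighs more heavily on score gains than on score levels. Everything below is hypothetical, not the study’s data or method: assume each observed score is a school’s true quality plus independent random error.

    import numpy as np

    rng = np.random.default_rng(0)
    n_schools = 500

    # Hypothetical true school quality and true one-year improvement.
    quality = rng.normal(0.0, 1.0, n_schools)
    true_gain = rng.normal(0.2, 0.1, n_schools)

    def observe(truth, noise_sd=0.4):
        """Observed score = the truth plus independent measurement error."""
        return truth + rng.normal(0.0, noise_sd, truth.shape)

    # Two tests of the same schools in two consecutive years.
    high_y1, low_y1 = observe(quality), observe(quality)
    high_y2, low_y2 = observe(quality + true_gain), observe(quality + true_gain)

    corr = lambda a, b: float(np.corrcoef(a, b)[0, 1])
    # High: error is small relative to the differences among schools.
    print("level correlation:", round(corr(high_y2, low_y2), 2))
    # Much lower: errors compound when two noisy scores are subtracted.
    print("gain correlation: ", round(corr(high_y2 - high_y1, low_y2 - low_y1), 2))

A gain subtracts two noisy scores, so the error variance doubles while the true signal (the gain itself) is small; the gain correlation therefore falls well below the level correlation even when nothing about the tests is being manipulated.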


Krista Kafer is senior policy analyst for education at The Heritage Foundation. Her email address is [email protected].


For more information …

“Testing High Stakes Tests: Can We Believe the Results of Accountability Tests?” by Jay P. Greene, Ph.D., Marcus A. Winters, and Greg Forster, Ph.D., is available on the Manhattan Institute’s Web site at http://www.manhattan-institute.org/html/cr_33.htm.

Or use PolicyBot at http://www.heartland.org to search for document #11745.