Special Report

May/June 2012

A Test Worth Teaching To

The race to fix America’s broken system of standardized exams.

By Susan Headden

It is a given that the new assessments will be administered on computers. This assumes two things: that students are comfortable working digitally, and that school districts have the necessary technological capacity. The first is probably a safe assumption; the second less so. Ask any state assessment director what he worries about most, and the answer is almost always some variation on “bandwidth.” In an informal survey taken by the common core R&D teams, more than half the states are already reporting significant concerns about capacity, including the number of computers available, their configurations, and their power and speed. This poses a dilemma: requiring too much technology may present insurmountable challenges for states, while requiring too little may limit innovation. Right now, the test makers are forced to essentially guess what the state of technology will be in 2014. An assessment director in Virginia, a state that already uses computer testing (but has not signed on to the common core), told attendees at a recent conference that when a rural school in his state charged all of its laptops one night, it overloaded the building’s circuits and shut off the facility’s heat.

Technological capacity can also narrow or enlarge what educators call the “testing window,” the amount of time they need to schedule for administering exams. The new tests will already require more time than existing assessments, but if districts don’t have enough computers for everybody to take the tests in the same week, they will have to enlarge the window even more, spreading testing over many weeks. In that case, students at the back end will enjoy an advantage because they will have had more time to learn the material being tested.

While the new assessments will undoubtedly be harder to score than the current fill-in-the-bubble ones, that doesn’t necessarily mean that the essays will be scored by humans. People, as you may have heard from your robot friends, need to be recruited and trained; they are subjective; and, worst of all, they are slow. PARCC, for one, says it will bypass these fallible creatures as often as possible: it wants items scored very quickly by computers to maximize the opportunity for the results to be put to good instructional use.

Because of recent advances in artificial intelligence, according to a 2010 report by the ETS, Pearson, and the College Board, machines can score writing as reliably as real people. That is, studies have found high levels of agreement with actual humans when those humans are in agreement with each other. (Given how often humans disagree, even the ETS concedes this is at best a qualified accomplishment.) Machines can score aspects of grammar, usage, spelling, and the like, meaning that they are decent judges of what academics call the rules of “text production.” Some programs, according to the ETS, can even evaluate semantics and aspects of organization and flow. But machines are still lousy at assessing some pretty big stuff: the logic of an argument, for instance, and the extent to which concepts are accurately or reasonably described.
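The qualification in that claim is easier to see with numbers. The sketch below, in Python, uses invented scores for eight essays, each rated by two human readers and one machine. It is only an illustration of "agreement when the humans agree," not the method used in the 2010 report, and the function names are hypothetical.

```python
def agreement_rate(scores_a, scores_b):
    """Share of essays on which two raters gave exactly the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)


def machine_agreement_when_humans_agree(human_1, human_2, machine):
    """Machine-vs-human agreement, counted only on essays where the humans matched."""
    if not (len(human_1) == len(human_2) == len(machine)):
        raise ValueError("score lists must be the same length")
    consensus = [(h1, m) for h1, h2, m in zip(human_1, human_2, machine) if h1 == h2]
    if not consensus:
        return None  # the humans never agreed, so there is nothing to compare
    return sum(h == m for h, m in consensus) / len(consensus)


# Invented scores on a 1-5 scale for eight essays.
human_1 = [4, 3, 5, 2, 4, 3, 1, 5]
human_2 = [4, 3, 4, 2, 4, 2, 1, 5]
machine = [4, 3, 5, 2, 3, 2, 1, 5]

human_rate = agreement_rate(human_1, human_2)
machine_rate = machine_agreement_when_humans_agree(human_1, human_2, machine)
print(f"humans agree with each other: {human_rate:.0%}")
print(f"machine agrees with human consensus: {machine_rate:.0%}")
```

In this invented sample the two humans match on six of eight essays, and the machine matches them on five of those six. The two essays on which the humans split never enter the machine's tally, which is why the headline agreement figure says less than it seems to.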

By way of making assurances, the ETS says that machines can identify “unique” and “more creative” writing and then refer those essays to humans. Still, the new tests will be assessing writing in the context of science, history, and other substantive subjects, so machines must somehow figure out how to score them for both writing and content. Likewise, machines struggle to score items that call for short constructed responses—for instance, an item that asks the student to identify the contrasting goals of the antagonist and the protagonist in a reading passage. A machine can handle this challenge, but only when the answer is fairly circumscribed. The more ways a concept can be described, the harder it is for the machine to judge whether the answer is right. (For now, both consortia are calling for computer scoring to the greatest extent possible, with a sampling of responses scored by humans for quality control.)

The risk of all this, of course, is that in pursuit of a cheaper, more efficient means of scoring, the test makers will assign essays that are inherently easier to score, thus undermining one of the common core’s central goals, which is to encourage the sort of synthesizing, analyzing, and conceptualizing that only the human brain can assess. Flawed and inconsistent though they may be, humans can at least render an accurate judgment on a piece of writing that rises above the rules of “text production.” Maybe this is why all those high-achieving countries that use essay-type tests to measure higher-order skills use real people to score those tests. “Machine-scored tests are cheap, constitute a very efficient and accurate way to measure the acquisition of most basic skills, and can produce almost instant results,” says Marc Tucker. “But they have a way to go before they will give either e. e. cummings or James Joyce a good grade.”

There might be one other non-robotic way to bring down the cost of scoring: assign the task to local teachers instead of test-company employees. According to the Stanford Center for Opportunity Policy in Education, the very act of scoring a high-quality assessment provides teachers with rich opportunities for learning about their students’ abilities and about how to adjust instruction. So teachers could score assessments as part of their professional development, in which case their services would come “free.” Teachers, however, might find fault with this accounting method.

There’s no doubt that the joint common core effort provides opportunities for significant economies of scale: individual states can now have far better assessments than any one of them could afford to create on its own. But the fact remains that quality costs. The federal stimulus funding covers the creation of the initial assessments, but the overall cost of administering the tests dwarfs the cost of creating them. In addition, the stimulus money runs out in 2014, which is only the first assessment year. The Pioneer Institute, a right-leaning Boston-based think tank that has been critical of the common core standards, has put the total cost of assessment over the next seven years at $7 billion.

Whether that number proves accurate or not, it’s clear that the new testing regime represents a huge investment that most states haven’t yet figured out how to pay for. The current average cost per student of a standardized state test is about $19.93, with densely populated states paying far less and sparsely populated states paying far more. The SBAC estimates a per-student cost of $19.81 for the new summative tests and $7.50 for its optional benchmark assessments. But the Pioneer Institute says in a recent report that those numbers are unrealistically low given the consortium’s ambitious goals. PARCC, which has scaled back its original plans, projects combined costs for the two summative tests of $22 per pupil.

Susan Headden, a Pulitzer Prize-winning journalist, is a senior writer/editor at Education Sector, a Washington, D.C., think tank.


  • Caroline Grannan on May 09, 2012 3:08 PM:

    Education Sector is a partisan organization that promotes the currently popular package of policies known as "education reform," not an impartial source. This article needs a disclaimer cautioning that it is intended to promote the organization's viewpoint.

  • David on May 10, 2012 4:20 PM:

    Very interesting article. Thanks

  • Janet on May 18, 2012 2:51 AM:

    Standardized assessment is not a bad thing -- but in itself, it does not address two of the largest problems in the American education system.

    First, that impoverished students who (on average) are least prepared to do well in school, will find themselves in schools with the fewest resources for teaching them.

    Second, that teachers who might be willing to take on the huge challenge of teaching and inspiring students with learning disabilities or those whose homes and families haven't given them a solid foundation for school, risk low evaluations because in a school year they may help students make enormous progress and build a basis for future success, but they're not likely to have many students who score above grade level; they start too far behind.

    This article doesn't seriously address either of these problems.

  • Ritsumei on May 30, 2012 11:16 AM:

    I took one of those AP tests (US History) and did well. I remember next to nothing. US history and the philosophy of the Founders has, in the past several years, become a topic of particular interest to me. It's very clear to me that the sort of cramming for the test that year's history course was made of was useful for nothing. I didn't retain ANY knowledge to speak of, and we simply didn't cover much of what made the Founding Era great: it wasn't on the test. I'm NOT impressed with the AP tests. They are useful only as coupons for reduced-cost college credits. The teaching to the test, in my experience, guaranteed that the retention simply wasn't there.

    It is also worthy of note that all powers not delegated to the federal government are reserved to the states or to the people; thus all federal involvement is unconstitutional, as it is in no place in the Constitution delegated to the national government. Federal involvement is usurpation of rights that belong with parents, plain and simple. I find the "common core" movement deeply disturbing, as it relates to our freedom to educate our children, and to freedom in general. This sort of top-down, government-centric educational model is incompatible with our system in which sovereignty rests with "We The People," rather than the ruler. These so-called "common core" initiatives fill me with dread for the implications to our freedoms, and I say kudos to Virginia and any other state that refrains from participation.

    Frankly, putting government in charge of education - arguably the single most important leash on government excess over generations - is no different from putting the fox in charge of the hen house.

  • v98max on May 31, 2012 8:00 AM:

    When my dad's school district first flirted with competency testing, every member of the school board was given the citizenship test for legal immigrants. They all failed. Needless to say, it was quickly determined that the test must be too hard.

  • Liz Wisniewski on June 02, 2012 12:10 PM:

    And we continue to focus on weighing the pig........As a fourth grade teacher, I am encouraged to know that the tests will be improving, and yet as I read this I started smiling. The truth is that using the tests for teachers' information is not really necessary as any halfway decent teacher already knows what their students can do. Spend every day with 21 kids teaching them for month after month and you know them as learners, you know what they can do and what they can't. If a teacher needs standardized test results to know if a student cannot do multi-digit multiplication I would suggest that someone check what she is smoking in the outside smoking area.

    If only all this time and money was spent on helping children be "present for learning" and on making sure we hire intellectually energized and well trained people as teachers. Yet, we continue to think that weighing the pig is going to make it fatter.....sigh......

  • Bob Ellingsen on June 04, 2012 2:00 AM:

    I taught AP US History for twenty years, and I think the AP program represented the paradigm for what education ought to be. It kept my feet to the fire; I had to cover a rigorous curriculum and couldn't waste a minute. If I wanted my students to do well on the three essays on the AP test, they had to practice writing all year, and I had to read what they were writing and offer feedback. Moreover, even the multiple-choice questions on an AP exam usually require the student to do more than just recall facts. Finally, the presence of a high-stakes test that has meaning for the student changes the dynamic in the classroom. In a very real sense, the student and I were "on the same side." If he or she did well on the test, both of us would be very happy. "Teaching to the test" is as good or as bad as the test itself.