A teacher is returning a stack of essays. Two of them, by chance, are nearly identical in surface quality. Both are well-organized, clearly written, free of grammatical error. Both will receive similar grades and the grades will be defensible to anyone who reads the essays without having met the students.

The teacher knows that one of the essays was produced by a student who understands the material and the other by a student who does not. The teacher knows this with high confidence. She has been watching both students for months. She has heard each of them try to explain the central concept of the unit, in real time, in class, in response to questions she chose. The first student fumbled at first and then arrived at something genuinely insightful, and the route from fumbling to insight was the kind of thing that cannot be faked. The second student spoke smoothly and produced sentences in which all the right words appeared in roughly the right order, and the teacher could feel, by the end of the second sentence, that there was no comprehension under the surface.

Neither of these observations will appear on the essays. The grades will be similar. The student whom the teacher knows to be the better thinker will receive feedback indistinguishable from the student whom the teacher knows to be performing. The feedback she actually wants to give cannot be put in the margin of an essay because what she wants to give feedback on is not in the essay.

This is one of the open scandals of formal assessment, and almost no one talks about it. Everyone who teaches knows about it. Almost no one in the institutional layer above teachers does anything about it. The reason is partly historical, partly economic, and partly that nobody has been willing to take seriously what it would mean to assess reasoning instead of writing.

Writing as the dominant medium of formal evaluation is recent. The Cambridge Mathematical Tripos in 1747 is usually credited as the first major university examination conducted in writing rather than in oral disputation. Before the Tripos, university evaluation was overwhelmingly oral, in the disputational tradition that had been the central pedagogy of European universities since the twelfth century. The shift from oral to written examination at Cambridge was not driven by any theory that writing measured cognition more accurately. It was driven by scale. There were too many students. Disputation could not keep up. Writing was a compromise that allowed the examiners to evaluate more candidates per week.

The compromise was understood as a loss at the time. The educators who instituted it wrote about the trade-off explicitly. They thought oral examination was a better measure of what they cared about. They thought written examination was a necessary substitute, justified by the fact that the alternative was no examination at all for the marginal student. Many of them thought the substitution was temporary — that future generations would find a way to scale oral examination back up. None of them, as far as I have been able to find, argued that writing was actually a better measure of cognition than disputation. The shift was a concession, not a discovery.

Then, in the twentieth century, a second shift, larger and farther from the original. The SAT in 1926. The GRE in 1936. The LSAT in 1948. Each of these moved formal cognitive assessment further from oral disputation than the Cambridge written examination had. Multiple-choice testing does not measure reasoning at all, in the sense in which the word is usually meant. It measures the moment of selecting an answer from a list, against a key prepared in advance. It tells you whether the candidate arrived at the correct answer. It tells you nothing about the path by which the candidate arrived, or whether the candidate could explain the path back, or whether the candidate would have arrived at the correct answer in a slightly different formulation of the question, or whether the candidate could distinguish the right answer from a wrong one whose surface looked similar. These distinctions are exactly what oral examination is designed to capture and exactly what multiple-choice examination is structurally incapable of capturing.

The defense of the multiple-choice test has always been that it correlates well enough with the things we actually care about. This is true. It correlates well enough that, for many practical purposes, the substitution is acceptable. It is also a substitution. When we say that someone has a high SAT score, we are reporting a number that correlates with their reasoning ability. We are not reporting their reasoning ability.

What would it actually take to grade reasoning instead?

It would require, first, eliciting reasoning under conditions that prevent unlimited revision. Writing as currently practiced permits hours, days, drafts, peer feedback, and editorial polish. The cognitive process that produced the polished essay happens out of view, distributed across an indefinite number of revisions, and what arrives on the page has been processed beyond the point where the original reasoning is recoverable. To see reasoning, you need a medium where the speaker has less time than they need to perform it perfectly — speech, or timed writing, or some form of interactive questioning. The condition in which reasoning is produced is not separable from the reasoning. They are one thing.

It would require, second, a rubric that distinguishes the structure of reasoning from its surface. Most rubrics that exist today do not. They reward clarity, organization, and grammatical precision — features of the surface — and they treat these features as proxies for reasoning quality without any evidence that the proxy is reliable. A good rubric for reasoning would have to score things like: how does the speaker’s argument move from one claim to the next; how does the speaker mark the boundary between what they know and what they are inferring; how does the speaker handle a place in their own argument where the reasoning becomes thin; how does the speaker integrate a new piece of information into the structure they have already built. These are not surface features. They are structural features of cognition under load. They can be observed. They can be scored. They are not measured by any current standardized assessment.
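To make the separation concrete, here is a minimal sketch of what such a rubric might look like as a data structure. Everything in it is an assumption made for illustration: the dimension identifiers are my shorthand for the four questions above, the 0 to 4 scale is arbitrary, and the two sets of scores are invented. The one design choice that matters is that fluency is recorded and then kept out of the reasoning score.

```python
from dataclasses import dataclass

# Shorthand names for the four structural questions in the text (illustrative).
DIMENSIONS = (
    "claim_to_claim_movement",
    "known_vs_inferred_boundary",
    "handling_thin_reasoning",
    "integrating_new_information",
)
MAX_POINTS = 4  # assumed scale: 0 (absent) to 4 (strong)


@dataclass(frozen=True)
class RubricResult:
    structure: dict  # dimension name -> points on the assumed 0..4 scale
    fluency: int     # surface smoothness: recorded, never scored

    def __post_init__(self):
        # Every structural dimension must be scored exactly once.
        if set(self.structure) != set(DIMENSIONS):
            raise ValueError("every rubric dimension must be scored exactly once")
        for name, points in self.structure.items():
            if not 0 <= points <= MAX_POINTS:
                raise ValueError(f"{name} is outside 0..{MAX_POINTS}")
        if not 0 <= self.fluency <= MAX_POINTS:
            raise ValueError(f"fluency is outside 0..{MAX_POINTS}")

    def reasoning_score(self) -> float:
        """Mean of the structural dimensions only; fluency never enters."""
        return sum(self.structure.values()) / len(DIMENSIONS)


# Invented scores for the two students from the opening example.
fumbling_student = RubricResult(
    structure={
        "claim_to_claim_movement": 3,
        "known_vs_inferred_boundary": 4,
        "handling_thin_reasoning": 3,
        "integrating_new_information": 4,
    },
    fluency=1,
)
smooth_student = RubricResult(
    structure={
        "claim_to_claim_movement": 1,
        "known_vs_inferred_boundary": 0,
        "handling_thin_reasoning": 1,
        "integrating_new_information": 1,
    },
    fluency=4,
)
```

Under these invented numbers the hesitant student scores 3.5 and the smooth one 0.75, which is the ordering the teacher already knew and the essay grade could not express.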

It would require, third, consistent application of the rubric, which is the constraint that historically has made oral assessment infeasible at scale. Trained human raters are expensive, slow, and inconsistent across raters and across time. Inter-rater reliability has been the binding constraint on oral assessment since at least the early twentieth century. The reason multiple-choice testing dominates formal assessment is not that anyone believed it measured cognition well. It is that it produces a number that two graders cannot disagree about. Reliability defeated validity, in the technical sense of those terms. The instrument that was easier to standardize won.
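One common way to put a number on rater disagreement is Cohen's kappa, which compares how often two raters agree with how often they would agree by chance given their own label frequencies. The sketch below is a plain-Python version for two raters and categorical labels; the function name and the sample ratings are made up for illustration, and a real study would use an established statistics package and far more than six responses.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater assigned labels independently
    # at their own observed rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially total.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented ratings of six oral responses by two human examiners.
examiner_1 = ["high", "high", "low", "low", "mid", "mid"]
examiner_2 = ["high", "mid", "low", "mid", "mid", "low"]
oral_kappa = cohens_kappa(examiner_1, examiner_2)

# Two graders applying the same answer key cannot disagree.
key_scored = ["right", "wrong", "right", "right", "wrong", "right"]
keyed_kappa = cohens_kappa(key_scored, key_scored)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. The key-scored case reaches 1 by construction, which is the whole of its institutional appeal.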

Until very recently, the third constraint was binding. It is no longer binding. Large language models can apply a structured rubric to a transcript with a consistency that approaches, and sometimes exceeds, that of trained human raters, at a cost that is several orders of magnitude lower, and at a speed that allows real-time feedback. The constraint that locked formal assessment into the multiple-choice format for most of a century is now soft. It is technically possible, today, to administer a verbal reasoning task to thousands of people, score the responses against a structured behavioral rubric, and produce dimensional feedback that distinguishes the structure of reasoning from the smoothness of delivery. The instruments are early, the rubrics are imperfect, the validity evidence is still being assembled. But the binding constraint is gone.

This is the moment when the historical compromise — writing as a stand-in for reasoning because we couldn’t grade reasoning at scale — is starting to come undone. The reason it is coming undone is technical. The reason it will not come undone quickly is institutional.

There are real concerns about doing this carelessly. They deserve to be taken seriously, even by people who think the underlying move is correct.

A rubric applied by a language model can encode the same fluency biases that human evaluators encode. If the rubric is built without attention to the distinction between fluency and reasoning, the model will score smooth speakers high and hesitant speakers low, and the evaluation will reproduce in machine form exactly the bias the move was supposed to correct. The rubric has to be built to look past the surface, and built carefully, and validated against the populations that a fluency heuristic systematically misjudges. This is hard. It is doable, but it is hard, and it has not been done well in most current commercial implementations.

A rubric that produces a number is harder to defend in court than a multiple-choice test that produces a number. The multiple-choice test is defensible because it is mechanical. A reasoning rubric produces a judgment, and judgments are open to appeal in ways that mechanical scoring is not. Institutions that use formal assessment for high-stakes decisions — admissions, certification, hiring — will be cautious about adopting a format that exposes them to litigation, even if the format measures what they care about more accurately.

Many jobs depend on the current system being labor-intensive. Admissions readers, examiners, graders, test prep companies, certification administrators: large professional ecosystems exist to administer the formats we currently use. These ecosystems will not welcome a technology that makes their function unnecessary, and they will frame their resistance in terms of fairness, bias, and student welfare. Some of those concerns will be sincere and some will not. Distinguishing the sincere from the protective is part of what the next decade of assessment policy will involve.

And there are legitimate welfare concerns. Oral and timed assessment is harder on students than take-home work. Students experience it as more stressful. The stress is real and it affects performance, particularly for students whose cognitive style is reflective rather than rapid. A reasoning-focused assessment regime that simply replaced essays with timed oral tests would harm a population of capable students whose reasoning is genuinely better than the format would let them show. A serious version of this assessment would have to be designed with that constraint in mind, which means slower formats, lower stakes, multiple modalities, and explicit accommodation for reflective students. None of that is technically difficult; all of it requires institutional will.

These are not reasons not to do this. They are reasons to do it carefully, with humility, with calibration against existing measures, with transparency about what the instrument can and cannot see, and with the understanding that the move from grading writing to grading reasoning is not a software upgrade. It is a change in what we believe assessment is for. That change will be slow because it should be slow.

But the technical constraint that justified the old compromise is no longer binding, and at some point the field has to acknowledge that. The reason most institutions still grade writing instead of reasoning is no longer that they cannot grade reasoning. It is that they have not yet decided to.

The teacher with the two essays will continue, for now, to give similar grades to the student who understands and the student who does not. She will continue to know the difference. The institution above her will continue to ask her for the grades and not for the knowledge. And the students will continue to graduate carrying credentials that reflect, in an averaged-out way, the quality of their writing — which is correlated with the quality of their thinking, often, and not always, and not in the cases where the difference matters most.

Why We Grade Writing Instead of Reasoning