On the gap between processing language and generating it
You have probably been in this situation. A conversation that mattered — a defense, an interview, a presentation to a skeptical room — where you knew what you thought but couldn’t quite get it out the way you’d intended. Or the opposite: you spoke fluently, covered everything, and walked away unsure whether you’d actually said anything. Both experiences point at the same gap. The way you reason and the way you perform that reasoning under pressure are not the same thing. And almost no assessment tool measures the difference.
There is an assumption so embedded in the design of cognitive assessment that it rarely gets named. It is this: that thinking happens first, and language carries it afterward. You form a thought; you find words for it; the words go out into the world. On this view, language is a vehicle. What matters is the cargo. The vehicle is incidental.
This assumption is wrong. Or rather — it is wrong often enough, and in important enough cases, that building an entire measurement infrastructure on top of it produces systematic blind spots. The blind spots are not subtle. They show up in every high-stakes verbal assessment currently in use.¹
The linguist Michael Reddy noticed something in the 1970s that has taken decades to fully absorb. He was analyzing the metaphors English speakers use when they talk about communication — and he found that nearly all of them share a hidden structure. We put ideas into words. We get something out of what someone says. Meaning is packed into an utterance and unpacked by the listener. The speaker sends a message; the receiver gets it. Reddy called this the conduit metaphor, and he argued that it wasn’t merely a figure of speech but a theory of mind — one that most people hold implicitly and almost nobody examines.²
The conduit metaphor tells you that meaning pre-exists expression. The thought is formed in some private interior space; language is then recruited to transmit it. Communication succeeds when the message arrives intact. Communication fails when something is lost in transmission.
But watch what happens in actual spontaneous speech. A person is asked a question they haven’t anticipated. They begin answering. Midway through the first sentence, the answer changes — not because they remembered something they’d forgotten, but because the act of speaking generated a thought that sitting in silence had not. The language didn’t carry the idea. The language made the idea. The sentence arrived somewhere its speaker hadn’t planned to go.
This is not an edge case. It is a common condition of unscripted verbal reasoning — though not a universal one. Some people think more fluidly in writing than in speech; some perform differently under observation; some produce their best reasoning in conditions that spontaneous speech doesn’t replicate. The claim is not that speech always reveals more than writing. It is that speech under conditions that don’t permit revision reveals something that writing, with its capacity for iteration and concealment, does not.
Vygotsky saw part of this clearly: thought is not expressed in the word but completed in it.³ The spoken sentence is not a readout of a finished mental process. It is, often, part of the process.
The implications for measurement are significant and largely unacknowledged.
If language is a conduit, then what you want to measure is the thought — and language is merely the medium through which you access it. Multiple-choice questions do this reasonably well. Reading-comprehension passages do it reasonably well. You present a stabilized linguistic object and ask the test-taker to process it correctly. The score tells you something real about how well they process written language under controlled conditions.
What it cannot tell you is how they generate language under uncontrolled ones.
The GRE Verbal Reasoning section measures vocabulary, reading comprehension, and the ability to identify logical relationships in written passages. These are genuine skills. They correlate with academic performance, particularly in programs that demand heavy reading and written analysis. The instrument does what it was designed to do.⁴
But consider what happens after the GRE. The doctoral student defends a dissertation. The lawyer argues before a judge who interrupts her mid-sentence. The consultant presents findings to a room that is already skeptical. The executive is asked, without preparation, to explain a decision that has just gone wrong. In every one of these situations, the person must reason verbally and spontaneously — must organize thought in real time, under social pressure, without the ability to revise. The GRE score predicts almost nothing about how well they will do this. It wasn’t designed to.
What would an instrument designed to measure spontaneous verbal reasoning actually look at?
Not fluency — however acquired, fluency is an unreliable indicator of reasoning quality. A speaker who has worked for years to achieve smooth, confident English production is not thereby a sophisticated reasoner. A speaker who stumbles slightly may be reasoning with considerable precision. The surface and the structure are different things, and confusing them has costs.⁵
Not vocabulary, exactly — though lexical precision matters, a large vocabulary deployed without structural reasoning produces impressive-sounding emptiness. This is a recognizable phenomenon in certain professional environments, and most existing assessments would not penalize it. A reasoning-based assessment would.
What matters is the architecture of the reasoning as it forms. Whether claims are supported or merely asserted. Whether the speaker can locate the vulnerable point in their own argument and address it rather than talking past it. Whether abstraction and example are held in productive tension — the speaker moving from the particular to the general and back again, rather than floating in abstraction or drowning in anecdote. Whether the response builds toward something or accumulates without arriving anywhere.
These qualities only become visible when a person speaks — when the reasoning has to happen in public, in sequence, without a net.
Fluency is not reasoning. Treating it as a proxy produces confident scores about the wrong thing.
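If those qualities were written down as a scoring rubric, they might look something like the sketch below. The dimension names and descriptors are invented here for illustration; they are not drawn from any existing instrument.

```python
# Hypothetical rubric for spontaneous verbal reasoning. The names and
# descriptors are illustrative, not an established or validated instrument.
REASONING_RUBRIC = {
    "grounding": "Are claims supported with reasons and evidence, or merely asserted?",
    "self_critique": "Does the speaker locate the weak point in their own argument and address it?",
    "abstraction_balance": "Does the response move between general claims and concrete examples?",
    "development": "Does the response build toward a conclusion, or accumulate without arriving?",
}

# Contrast with the surface features most automated speech scoring measures today.
FLUENCY_METRICS = {
    "speech_rate": "words per minute",
    "pause_frequency": "silent pauses per minute",
    "filler_rate": "fillers (um, uh) per hundred words",
}
```

Nothing in the second dictionary tells you anything about the first. That is the gap.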
Several commercial tools have emerged in recent years that use recorded speech as an input for evaluation — primarily in hiring contexts, scoring candidates on presentation quality, communication style, and interview performance. These are real products addressing real needs. They are not what is being described here.⁶
The distinction matters. Scoring how someone performs in an interview — whether they project confidence, whether their delivery is engaging, whether they make a favorable impression — is a different task from scoring the structure of their reasoning. The former is legible to human observers without any framework at all; we have always evaluated people on how they come across. The latter requires attending to things that are not visible at the surface: whether claims are grounded or merely asserted, whether the argument holds together, whether the speaker demonstrates any awareness of where their reasoning is incomplete.
The instruments that attempt reasoning assessment specifically are rare, expensive, and mostly confined to high-stakes institutional contexts. The Oral Proficiency Interview, developed by the American Council on the Teaching of Foreign Languages, sends trained raters into extended conversation with test-takers and scores the results against a detailed rubric. The results are genuinely informative. The process costs hundreds of dollars per administration and requires a certified human rater. It is not available to anyone who simply wants to know how their spoken reasoning holds up.
Until recently, the cost and scalability problem was insurmountable. Scoring spontaneous speech with the rigor the task demands requires a trained human listener — and trained human listeners are slow and expensive. What automated speech scoring existed was mostly limited to pronunciation and fluency metrics, the surface features that are easiest to quantify and least informative about reasoning.⁷
The situation has changed. Automated transcription is now accurate enough that the text of spontaneous speech can be reliably recovered and analyzed. Large language models can be prompted to score linguistic features of spoken responses against rubrics that capture reasoning structure — not just what was said, but how it was organized, how claims were warranted, whether the response demonstrated the capacity to hold complexity rather than resolve it prematurely. This does not produce perfect scores. But it produces informative ones, at a cost and scale that were previously impossible.
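A minimal sketch of such a pipeline, assuming the openai Python client for both transcription and scoring. The model names, the prompt, the 1-to-5 scale, and the JSON contract are all illustrative choices, not a description of any shipping product; it reuses the hypothetical rubric sketched earlier.

```python
import json

from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()

def transcribe(audio_path: str) -> str:
    """Recover the text of a spontaneous spoken response."""
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def score_reasoning(transcript: str, rubric: dict[str, str]) -> dict:
    """Score a transcript against each rubric dimension with an LLM.

    The scale and prompt are illustrative; real use would require
    calibration against trained human raters.
    """
    dimensions = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    prompt = (
        "Score the spoken response below on each dimension from 1 to 5, "
        "with a one-sentence rationale per score. Ignore disfluencies, "
        "accent, and delivery; score only the structure of the reasoning.\n\n"
        f"Dimensions:\n{dimensions}\n\nTranscript:\n{transcript}\n\n"
        'Reply as a JSON object mapping each dimension to {"score": int, "rationale": str}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable instruction-following model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# scores = score_reasoning(transcribe("response.wav"), REASONING_RUBRIC)
```

The consequential line is the instruction to ignore disfluencies and delivery: the scorer is told to disregard exactly the surface features that conventional automated speech scoring rewards.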
What does an informative score actually tell you?
Not a single number. A single number applied to something as multidimensional as verbal reasoning is like a single number applied to a decathlete — ten events collapsed into one tells you nothing about which events the athlete excels in and which they struggle with. The composite is a summary. The profile is the information.
Verbal reasoning disaggregates into distinct capacities that correlate imperfectly with each other. The ability to compress — to say in one sentence what most people take a paragraph to say — is not the same as the ability to sustain a developing line of thought. Epistemic calibration, the capacity to spontaneously distinguish what you know from what you’re inferring, is not the same as abstraction. A person can score high on conceptual continuity and low on originality. These divergences are not noise. They are the signal.
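To make the composite-versus-profile point concrete, here is a toy example; the dimensions echo those named above, and the scores are invented:

```python
from statistics import mean

# Two hypothetical profiles on a 1-5 scale, invented for illustration.
speaker_a = {"compression": 5, "sustained_development": 2,
             "epistemic_calibration": 4, "abstraction": 2, "originality": 5}
speaker_b = {"compression": 3, "sustained_development": 4,
             "epistemic_calibration": 3, "abstraction": 4, "originality": 4}

for name, profile in [("A", speaker_a), ("B", speaker_b)]:
    print(name, round(mean(profile.values()), 1), profile)
# Both composites come out to 3.6. The single number is identical;
# everything that distinguishes the two speakers is in the profile.
```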
This is standard psychometric thinking, routinely ignored in the public understanding of test scores. But it matters especially for spoken reasoning because the dimensions diverge more visibly in speech than in writing. Writing allows revision, which smooths over the gaps. Speech does not.
The measurement problem, properly understood, is not technical. The technical obstacles — transcription accuracy, scoring reliability, construct validity — are real but tractable. The deeper problem is conceptual: we have been measuring the wrong thing for long enough that we have stopped noticing.
The conduit model of language produces conduit-model assessments. If thought precedes language, then you measure the thought by presenting a linguistic stimulus and scoring the response. You don’t need to watch anyone speak. You need to see whether they can identify the correct answer among several options, or produce a written argument that demonstrates they have understood the question.
The alternative — treating language as constitutive of thought rather than merely expressive of it, and therefore measuring the reasoning that emerges in spontaneous speech — requires watching the process rather than scoring the product. It is harder. It is more informative. And it has been, until very recently, practically unavailable outside of research contexts and expensive institutional testing.
Whether an instrument built on this premise can measure what it intends to is an empirical question. The framework is theoretically grounded. The validation work is ongoing. The assumption is not that the problem is solved — it is that it is the right problem to work on.
¹ The blind spot is not unique to verbal assessment. The entire psychometric tradition has struggled with what Messick called “construct underrepresentation” — the gap between the construct you intend to measure and the construct your instrument actually captures. Verbal reasoning as it functions in real professional and academic life is considerably broader than verbal reasoning as measured by any existing standardized test.
² Reddy’s original essay, “The Conduit Metaphor,” appeared in Metaphor and Thought, edited by Andrew Ortony, in 1979. The following year, Lakoff and Johnson’s Metaphors We Live By (1980) extended the argument considerably. Where Reddy identified one pervasive metaphor governing how we talk about communication, Lakoff and Johnson argued that abstract thought is constitutively metaphorical across the board — that we don’t merely use metaphors to describe thinking, we think in them. The two books together make a case that has still not fully penetrated the psychometric tradition.
³ The quotation is from Thought and Language (1934/1986). Vygotsky’s argument is not simply that language and thought are connected but that the relationship between them is dynamic and developmental — that the structure of language shapes the structure of thought over time, not just in the moment of utterance. The implications for education are substantial and largely unimplemented.
⁴ The GRE’s predictive validity for graduate school performance is real but frequently overstated, and its validity varies significantly by field. For programs that require heavy reading and written analysis, the correlation is stronger. For programs that require oral performance, lab work, or applied reasoning, the correlation weakens considerably.
⁵ The point about fluency is worth stating carefully, because it runs against intuition in both directions. Native speakers sometimes assume their fluency reflects the quality of their reasoning; it doesn’t necessarily. Non-native speakers sometimes assume their accent or non-native phrasing undermines the credibility of their reasoning; it doesn’t necessarily. Fluency and reasoning are correlated but distinct. An assessment that conflates them will systematically misclassify people in both directions.
⁶ The distinction between interview performance scoring and reasoning structure scoring is not a criticism of the former. Knowing how a candidate presents themselves under pressure is legitimate and useful information. The problem arises when presentation is treated as a proxy for reasoning — when the fluent, confident speaker is assumed to be the rigorous thinker, and the hesitant or accented speaker is assumed not to be.
⁷ The dominant commercial automated speech scoring systems — used primarily for TOEFL and similar assessments — score pronunciation, fluency, vocabulary range, and grammatical accuracy. These features are measurable and relevant to language proficiency. They are not the same as reasoning quality.