On multilingual reasoning and the instruments that fail to measure it
In 1979, the linguist Jim Cummins proposed a distinction that has quietly shaped English language education ever since. He called it BICS and CALP — Basic Interpersonal Communication Skills and Cognitive Academic Language Proficiency. The first refers to the conversational fluency that most people acquire naturally through immersion: the ability to follow and participate in everyday social interaction, to understand context-dependent language, to communicate in the immediate present. The second refers to something harder to acquire and harder to measure: the ability to use language for abstract, decontextualized reasoning — the kind of language that school requires, that academic writing demands, that high-stakes professional communication depends on.
The distinction was useful. It explained something educators kept observing: students who sounded fluent in the hallway struggled in the classroom. They could negotiate the cafeteria and navigate social dynamics and hold their own in casual conversation, but when asked to analyze a text or construct a written argument, the fluency that seemed so promising didn’t transfer. Cummins gave this gap a name.
The problem is what happened next.¹
The BICS/CALP framework was designed as a diagnostic tool — a way of explaining why surface fluency was an unreliable indicator of academic readiness, and why instruction for English learners needed to attend to something deeper than conversational ability. In practice, it became something else: a sorting mechanism, and eventually a ceiling. Once a student was identified as having BICS but not CALP, the task became remediation — building up the academic language that was presumed to be missing. The student’s actual thinking, their capacity to reason, their existing intellectual resources in the first language, largely dropped out of the picture.
What got lost is something worth recovering. A student who reasons with sophistication in their first language does not lack cognitive academic language proficiency. They lack the English-language vehicle for expressing a capacity they already have. The difference matters — for how the student is placed, what instruction they receive, and crucially, what they believe about themselves.²
The advanced English learner — the person who has moved well past BICS, who reads English comfortably, who can write a serviceable academic paragraph, who has been living and working in English for years — occupies a strange position in the assessment landscape. They are too proficient for the instruments designed to measure language development, which are calibrated for the earlier stages of acquisition. And they are invisible to the instruments designed for native speakers, which were not built with their particular profile in mind.
What this population needs — and does not have — is an assessment that measures the quality of their verbal reasoning in English independently of whether their English is natively produced. An instrument that can distinguish a sophisticated thinker whose English is slightly accented from a native speaker who reasons poorly. An instrument calibrated for the thing that actually matters at advanced proficiency levels: the structure of the argument, not the grammar or pronunciation or even vocabulary.
Nothing like this currently exists in any form that is widely available or affordable.³
Consider what the research on advanced multilingual speakers actually shows, as opposed to what the intuitions of monolingual institutions tend to assume.
Multilinguals don’t simply have a native language and a second language operating in parallel, each in its own compartment. The languages interact. Concepts available in one language influence how ideas are framed in the other. The multilingual speaker has access to multiple ways of categorizing experience, multiple rhetorical traditions, multiple sets of conceptual vocabulary — and the capacity to move between them is itself a cognitive resource, not a liability. Code-switching, the alternation between languages that is still treated as an error in many educational settings, is in many contexts evidence of sophisticated pragmatic control: the speaker is selecting, from multiple available systems, the one that best fits the communicative situation.⁴
None of this shows up on a standard language assessment. A test of English proficiency measures English proficiency. It does not measure cognitive flexibility, epistemic sophistication, or the reasoning capacity that a multilingual speaker may be carrying from their first language into their use of the second. The speaker who says something slightly ungrammatical but structurally precise — whose argument is well-organized, epistemically calibrated, and responsive to the question — will score lower than the native speaker who says something grammatically impeccable that means very little.
The scoring is wrong. Not procedurally — the instrument does what it was designed to do. Wrong in the sense that it measures the wrong thing for the purpose at hand.
There is a particular population for whom this matters with unusual acuity: the advanced English learner preparing for a high-stakes verbal performance in an English-speaking institutional context. The graduate school applicant whose video essay will be evaluated partly on spoken English quality. The professional seeking a role in an English-dominant organization. The researcher presenting findings to an international audience for the first time.
These people already know their English is not native. What they do not know — what no available instrument will tell them — is whether their verbal reasoning in English is as strong as their verbal reasoning in their first language, or whether something has been lost in the translation. Whether the argument that forms so clearly in the first language survives the passage into the second. Whether the precision that characterizes their thinking at home degrades under the added cognitive load of performing in a language that is not the one in which they learned to think.
A real question, with practical consequences, and currently no reliable way to answer it.⁵
What would a useful assessment for this population look like?
It would score the reasoning separately from the English. Not entirely — the two cannot be fully disentangled, and at advanced proficiency levels, the quality of the English is itself informative. But the scoring would need to weight the structure of the argument, the precision of the claims, the coherence of the reasoning, more heavily than grammatical accuracy or native-like pronunciation. A slightly accented, grammatically imperfect response that demonstrates sophisticated abstract reasoning should score higher than a fluent, well-pronounced response that says very little.
It would need to be normed appropriately — developed with the awareness that the reference population is not monolingual native speakers, and that the performance characteristics of advanced multilinguals differ in ways that are systematic and intelligible rather than simply deficient. Assessing this population against a native-speaker norm, without adjustment, produces results that are technically accurate and practically misleading.
And it would need to be honest about what it is measuring. Not English proficiency — instruments exist for that. Not cognitive ability in some general sense — too broad and too fraught. Something more specific: the capacity to reason verbally in English, at the level of argument structure and epistemic precision, under conditions that do not permit revision. A defined and measurable construct — and the construct that matters most for the situations this population is preparing for.
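The weighting principle described above can be made concrete with a small sketch. Everything in it is hypothetical: the dimension names, the numeric weights, and the 0-to-1 rating scale are illustrative assumptions, not a proposal for an actual instrument or a description of any existing one.

```python
# Illustrative sketch only: every dimension name and weight here is a
# hypothetical assumption, not a feature of any existing instrument.

# Reasoning dimensions are deliberately weighted more heavily than
# surface features of English production.
REASONING_WEIGHTS = {
    "argument_structure": 0.30,
    "claim_precision": 0.25,
    "epistemic_calibration": 0.20,
}
SURFACE_WEIGHTS = {
    "grammatical_accuracy": 0.10,
    "pronunciation": 0.08,
    "vocabulary_range": 0.07,
}

def composite_score(ratings: dict) -> float:
    """Weighted average of 0-to-1 ratings across all dimensions."""
    weights = {**REASONING_WEIGHTS, **SURFACE_WEIGHTS}
    return sum(w * ratings[dim] for dim, w in weights.items()) / sum(weights.values())

# A structurally strong but accented, grammatically imperfect response...
accented_but_precise = {
    "argument_structure": 0.90, "claim_precision": 0.90, "epistemic_calibration": 0.85,
    "grammatical_accuracy": 0.60, "pronunciation": 0.50, "vocabulary_range": 0.70,
}
# ...compared with a fluent, well-pronounced response that says very little.
fluent_but_shallow = {
    "argument_structure": 0.40, "claim_precision": 0.40, "epistemic_calibration": 0.35,
    "grammatical_accuracy": 0.95, "pronunciation": 0.95, "vocabulary_range": 0.90,
}
```

Under these assumed weights, the first response outscores the second — the inversion that, as argued above, current instruments fail to produce.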
Cummins was right that surface fluency is an unreliable guide to the deeper thing. He was right that the deeper thing is what education and professional performance actually require. What the framework he proposed could not fully accommodate is the possibility that the deeper thing might already be present — that the problem is sometimes a measurement gap to be closed rather than a cognitive deficit to be remediated.
The advanced English learner who reasons well and sounds foreign has already arrived. Their instruments of expression are still catching up to their instruments of thought. The gap is real. It is also, in principle, closable. And closing it begins with having an honest way to measure where the reasoning actually is.⁶
¹ Cummins’ original formulation appeared in “Cognitive/Academic Language Proficiency, Linguistic Interdependence, the Optimum Age Question and Some Other Matters” (1979). The BICS/CALP distinction became enormously influential in ESL and bilingual education policy, particularly in the United States, where it was incorporated into the theoretical foundations of the WIDA framework and similar state-level English language development standards. Whether it was incorporated correctly is a separate and contested question.
² The critique of BICS/CALP as a sorting mechanism has been made most forcefully by researchers in the translanguaging tradition — particularly Ofelia García and Li Wei, whose work argues that the separation of languages into discrete proficiency levels misrepresents how multilingual cognition actually works. Their argument is not that the distinction between conversational and academic language is unreal, but that the framework tends to pathologize the learner rather than interrogate the institutional expectations the learner is being asked to meet.
³ IELTS Academic and TOEFL iBT both include speaking components that assess English oral production. Both score primarily on fluency, pronunciation, grammatical range, and vocabulary — the surface features of English production rather than the structure of the reasoning. The highest scores on these instruments are achievable by a speaker who produces impeccably structured, grammatically complex English that reasons poorly. A consequence of what they were designed to measure, not a design flaw.
⁴ The research on code-switching as a mark of sophisticated pragmatic control rather than linguistic confusion spans several decades, from Poplack’s foundational work in the 1980s to more recent neurolinguistic studies showing that multilingual speakers deploy both languages simultaneously rather than switching between separate systems. The persistent institutional treatment of code-switching as error — still common in many school settings — reflects an ideological commitment to linguistic purity that the research does not support.
⁵ The cognitive load question is real and underexamined. Performing in a second language imposes processing demands that reduce the resources available for other cognitive operations — including the construction of complex arguments. The performance gap between a speaker’s first and second language reasoning is not simply a proficiency gap; it is partly a resource allocation gap. Instruction and assessment that ignore this will systematically underestimate the reasoning capacity of speakers who are allocating significant resources to the language itself.
⁶ The word “articulate” is worth dwelling on here. In English, to be articulate is to express ideas clearly and effectively — a compliment most often paid to native speakers. The non-native speaker who is articulate in this sense — whose reasoning is clear and precise even when their English is imperfect — is routinely assessed as less capable than a native speaker who is articulate only in the surface sense. The conflation of these two meanings encodes an assumption about where clarity comes from that the assessment evidence does not support.