
Normative Verbal Reasoning Profiles From 99 Podcast Guests: A Three-Model Scoring Study Using the Expressive Cognition Rubric


A research paper from expressivecognition.org

Conflict of Interest Statement

This research was conducted by the developer of the Expressive Cognition assessment tool (expressivecognition.org), a freely accessible assessment with an optional paid report tier. This relationship is disclosed in the interest of full transparency. No external funding was received for this research.

AI Usage Statement

This study employs Claude Sonnet 4 (Anthropic), GPT-5 mini (OpenAI), and Mistral Large (Mistral AI) as automated scoring agents within a predefined behavioral rubric and a blinded multi-pass evaluation protocol. All three models function as measurement instruments: they apply the rubric to speech transcripts and generate dimension-level scores under controlled prompting conditions. Claude was also used for editorial assistance in the preparation of this manuscript. The theoretical framework, research design, analyses, and all interpretive conclusions are the work of the Expressive Cognition research program. Full responsibility for the accuracy, integrity, and originality of the manuscript rests with the project.

Abstract

This study establishes normative Verbal Reasoning Index (VRI) scores for 99 guests from Conversations with Tyler, a long-form intellectual interview podcast, scored across six core reasoning dimensions and two moderators by three independent large language models from three different vendors (Claude Sonnet 4, GPT-5 mini, and Mistral Large) in three blinded passes each. The resulting corpus — balanced across nine disciplinary cells with 10–12 speakers each — is, to our knowledge, the largest labeled spontaneous verbal reasoning dataset in healthy adults currently available. Pairwise cross-model VRI agreement averages r = .668 across the three vendors (Sonnet↔GPT-5m r = .758, Sonnet↔Mistral r = .657, GPT-5m↔Mistral r = .589), with systematic calibration offsets that form a strict generosity gradient: Mistral scores highest on average, Sonnet intermediate, GPT-5 mini lowest. Discipline rank ordering is preserved across all three models: philosophy and hard science at the top, literary arts at the bottom. Inter-pass reliability for Sonnet is excellent (mean VRI spread across three passes = 0.19). Confirmatory factor analyses estimated independently on each scorer's data reject a unidimensional model and recover an identical two-factor structure under all three scorers — Abstraction, Compression, and Originality clustering as Generative Range, and Epistemic Calibration and Generative Self-Monitoring clustering as Calibrative Control — with factor composition stable across models and factor separation (φ = .38 Sonnet → .77 GPT-5 mini → .89 Mistral) varying systematically with each scorer's within-factor correlation pattern. Each of the three models selects a different best-fitting home for Conceptual Continuity, identifying it empirically as a boundary dimension whose factorial placement is scorer-convention-determined. The finding that three independently prompted frontier LLMs reproduce the same disciplinary hierarchy and the same factor composition — despite having no access to speaker identity, discipline, or each other's scores — provides convergent validity evidence for the EC rubric's capacity to detect real differences in spontaneous verbal reasoning across intellectual domains.

Keywords: verbal reasoning, normative data, LLM scoring, cross-model agreement, spontaneous speech, podcast discourse, Conversations with Tyler

Introduction

The Expressive Cognition (EC) rubric scores spontaneous speech across six core dimensions of verbal reasoning — Abstraction, Compression, Originality, Conceptual Continuity, Epistemic Calibration, and Generative Self-Monitoring — plus two moderator dimensions (Vocabulary and Syntactic Control) that are reported but excluded from the composite Verbal Reasoning Index (VRI). Prior work has established construct validity for the rubric in known-groups designs using Supreme Court oral arguments and academic seminar speech (companion paper), and ecological validity in a 30-guest subset of Conversations with Tyler (CWT), where VRI was correlated against external intellectual reputation (companion paper).

The present study extends this work in two directions. First, it expands the CWT sample from 30 to 99 guests balanced across nine disciplinary cells, producing normative data that allows VRI scores to be interpreted relative to a reference population of high-ability conversational speakers. Second, it introduces cross-model scoring — the same 99 transcripts scored independently by three frontier LLMs from three different vendors (Anthropic, OpenAI, and Mistral AI) — providing, to our knowledge, the first published inter-model reliability data for an LLM-applied psychometric rubric at this scale, and the first three-way factor-invariance test of the underlying construct.

Method

Sample

Ninety-nine guests from Conversations with Tyler were selected using a purposive sampling design stratified across nine disciplinary cells. A sampling script enumerated all CWT guests from the public episode index, assigned discipline tags, and selected approximately 11 guests per cell, force-including 30 guests from the prior ecological validity study. One joint-guest episode (Noel Johnson and Mark Koyama) was excluded because the rubric assumes a single speaker. The final sample comprised 99 unique speakers.

Table 1. Sample composition by discipline cell.

Cell n Example guests
Philosophy 12 Agnes Callard, Slavoj Žižek, Noam Chomsky, Peter Singer
Economics 10 Daron Acemoglu, Paul Krugman, Larry Summers
Hard Science 11 Alison Gopnik, David Deutsch, Steven Pinker, Ed Boyden
Social Science 11 Daniel Kahneman, Jonathan Haidt, Philip Tetlock
History 11 Niall Ferguson, Jill Lepore, Ada Palmer
Law/Policy 11 Cass Sunstein, Samantha Power, Jamal Greene
Lit/Arts 11 Margaret Atwood, Camille Paglia, Dana Gioia
Tech/Entrepreneurship 11 Vitalik Buterin, Sam Altman, Marc Andreessen
Journalism/Public 11 Malcolm Gladwell, Ezra Klein, Nate Silver

Transcript Extraction and Screening

For each guest, the full CWT transcript was fetched from the public CWT website and processed through a three-stage screening protocol designed to isolate spontaneous reasoning from rehearsed or recited material.

Stage 1 — Pre-filtering. Host speech was stripped; only guest turns were retained. Turns below a minimum word threshold were excluded.

Stage 2 — Spontaneity screening. Each remaining turn was evaluated by Claude Sonnet 4 for spontaneity. Turns classified as rehearsed set-pieces, recitations, memorized factual lists, or pre-drafted statements were excluded. The screener operated blind to the EC scoring rubric.

Stage 3 — Inclusion threshold. Guests were included only if their screened transcript contained ≥1,500 words across ≥8 retained turns. All 99 candidates passed this threshold. Median screened transcript length was 7,500 words (range: 1,515–11,267).
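
As a concrete illustration, the three-stage screen can be sketched as follows. This is a minimal TypeScript reconstruction, not the project's actual code: the helper names are hypothetical, and the per-turn word minimum is an assumed placeholder (only the Stage 3 thresholds are stated above).

  interface Turn { speaker: "host" | "guest"; words: string[] }

  const MIN_TURN_WORDS = 25;     // assumption: the exact per-turn minimum is not reported
  const MIN_TOTAL_WORDS = 1500;  // Stage 3 threshold (stated above)
  const MIN_TURNS = 8;           // Stage 3 threshold (stated above)

  async function screenTranscript(
    turns: Turn[],
    isSpontaneous: (t: Turn) => Promise<boolean>, // Stage 2: LLM spontaneity screener
  ): Promise<Turn[] | null> {
    // Stage 1: strip host speech; keep guest turns above the per-turn word minimum
    const candidates = turns.filter(
      (t) => t.speaker === "guest" && t.words.length >= MIN_TURN_WORDS,
    );
    // Stage 2: drop rehearsed set-pieces, recitations, and pre-drafted statements
    const retained: Turn[] = [];
    for (const t of candidates) {
      if (await isSpontaneous(t)) retained.push(t);
    }
    // Stage 3: include the guest only if enough spontaneous material survives
    const totalWords = retained.reduce((n, t) => n + t.words.length, 0);
    return totalWords >= MIN_TOTAL_WORDS && retained.length >= MIN_TURNS
      ? retained
      : null; // guest excluded
  }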

Scoring

Three scoring runs were conducted independently on the same 99 transcripts. All three used identical prompts, rubric text, dimension definitions, band descriptors, JSON output schema, temperature (0.3), batch structure, and three-pass shuffled-blinding protocol. The only thing that varied was the scoring model itself.

Sonnet 4 scoring. Claude Sonnet 4 (claude-sonnet-4-20250514, Anthropic) scored each transcript blinded — identified only as "Speaker A," "Speaker B," etc. — using the full v3 EC behavioral descriptor rubric. Speaker labels were randomized independently on each of three scoring passes. Total scoring passes: 294 (98 guests × 3 passes; one guest was lost to a scoring pipeline error, leaving n = 98 for this scorer — see Appendix A.1). Guests were scored in shuffled batches of 6 to enable cross-guest blinding within each batch.

GPT-5 mini scoring. GPT-5 mini (OpenAI) scored the same transcripts using the same rubric and protocol. Total scoring passes: 297 (99 guests × 3 passes); all 99 guests produced valid scores (Appendix A.2).

Mistral Large scoring. Mistral Large (mistral-large-latest, Mistral AI) scored the same transcripts via the La Plateforme REST API using the same rubric, protocol, and schema. Total scoring passes: 297 (99 guests × 3 passes). All 99 guests produced valid scores.

The three scoring runs were completely independent: no model had access to any other's scores, and no post-hoc calibration was applied. Cross-model analyses that require a common sample use the 98 guests valid under all three scorers.
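
A minimal sketch of the shuffled-blinding step, in TypeScript with hypothetical names: on each of the three passes, guests are re-shuffled, grouped into batches of 6, and relabeled Speaker A through Speaker F, so the scorer never sees a stable identity across passes or batches.

  // Fisher–Yates shuffle; a fresh randomization is drawn on every pass.
  function shuffle<T>(xs: T[]): T[] {
    const a = [...xs];
    for (let i = a.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [a[i], a[j]] = [a[j], a[i]];
    }
    return a;
  }

  function makeBlindedBatches(guestIds: string[], batchSize = 6) {
    const order = shuffle(guestIds);
    const batches: { label: string; guestId: string }[][] = [];
    for (let i = 0; i < order.length; i += batchSize) {
      batches.push(
        order.slice(i, i + batchSize).map((guestId, k) => ({
          label: `Speaker ${String.fromCharCode(65 + k)}`, // "Speaker A", "Speaker B", ...
          guestId, // kept only on the pipeline side; never sent to the scoring model
        })),
      );
    }
    return batches;
  }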

Measures

Verbal Reasoning Index (VRI). A weighted composite of six core dimensions: Abstraction (.18), Compression (.16), Originality (.16), Conceptual Continuity (.16), Epistemic Calibration (.18), and Generative Self-Monitoring (.16). Weights reflect the theoretical priority of Abstraction and Epistemic Calibration as the dimensions most closely linked to the Gf-dominant construct EC targets.
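
The composite itself is a straightforward weighted sum. A sketch (hypothetical names; weights from the text):

  // VRI composite: weighted sum of the six core dimension scores.
  const VRI_WEIGHTS = {
    abstraction: 0.18,
    compression: 0.16,
    originality: 0.16,
    conceptualContinuity: 0.16,
    epistemicCalibration: 0.18,
    generativeSelfMonitoring: 0.16,
  } as const; // weights sum to 1.00; Vocabulary and Syntactic Control are excluded

  type CoreScores = Record<keyof typeof VRI_WEIGHTS, number>;

  function vri(scores: CoreScores): number {
    return (Object.keys(VRI_WEIGHTS) as (keyof typeof VRI_WEIGHTS)[])
      .reduce((sum, dim) => sum + VRI_WEIGHTS[dim] * scores[dim], 0);
  }

Applied to the Sonnet three-pass means for Russ Roberts in Appendix A.1 (8, 7, 7, 8, 8, 7), this gives .18·8 + .16·7 + .16·7 + .16·8 + .18·8 + .16·7 = 7.52, matching the tabulated VRI.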

Moderator dimensions. Vocabulary and Syntactic Control are scored but excluded from the VRI composite. They capture Gc-linked linguistic competence that correlates with education and language background rather than with the reasoning construct.

Factor-structure models. To test whether the six core dimensions reflect a single general factor or a more differentiated structure, we ran confirmatory factor analyses independently on each scoring model's data. Four nested models were estimated by maximum likelihood on the 6×6 dimension-level correlation matrix (three-pass mean per dimension): a one-factor baseline (M1) and three two-factor specifications that differ in where Conceptual Continuity is assigned (M2a: on Generative Range only; M2b: on Calibrative Control only; M2c: cross-loading both). Fit was evaluated by χ², RMSEA, CFI, and SRMR against conventional cutoffs, with AIC used for model selection. The same nested-model comparison was run on Sonnet, GPT-5 mini, and Mistral Large data separately, yielding three independent CFA results that can be compared for structural replication. Fit computations were implemented directly (see scripts/cwt-norms/cfa.mjs) rather than via an external SEM package to keep the full analysis pipeline reproducible from a single repository.
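
For reference, the fit-index arithmetic is standard and can be sketched as follows; these formulas, with AIC computed as χ² − 2df, reproduce the values reported in Table 7 below. The maximum-likelihood loading estimation itself is omitted here.

  // Fit indices from a fitted model's chi-square statistic.
  // chiSq/df: fitted model; chiSqBase/dfBase: independence baseline; n: sample size.
  function fitIndices(chiSq: number, df: number, chiSqBase: number, dfBase: number, n: number) {
    const rmsea = Math.sqrt(Math.max(chiSq - df, 0) / (df * (n - 1)));
    const cfi = 1 - Math.max(chiSq - df, 0) / Math.max(chiSqBase - dfBase, 0);
    const aic = chiSq - 2 * df; // e.g. Sonnet M2c: 21.1 - 2 * 13 = -4.9, as in Table 7
    return { rmsea, cfi, aic };
  }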

Results

Overall Descriptive Statistics

Table 2. Overall VRI descriptive statistics by model.

Statistic Sonnet 4 GPT-5 mini Mistral Large
n 98 98 98
Mean VRI 7.01 6.67 7.74
SD 0.45 0.34 0.44
Min 5.25 5.25 6.63
Max 7.68 7.35 8.84
Median 7.02 6.70 7.73

The three scoring models produce a strict generosity gradient: Mistral scores highest on average (mean VRI 7.74), Sonnet is intermediate (7.01), and GPT-5 mini scores lowest (6.67). The difference between the most-generous and least-generous model is more than one full scale point on VRI (1.06; Table 5b). Mistral and Sonnet have similar distributional spread (SD = 0.44 and 0.45 respectively), while GPT-5 mini compresses the range somewhat (SD = 0.34). Despite the differences in absolute calibration, the models broadly agree on who sits at the extremes: Ana Vidovic is lowest under both Sonnet and GPT-5 mini (and near the bottom under Mistral), and philosophy and hard-science guests occupy the top under all three models.

Discipline Cell Means

Table 3. Mean VRI by discipline cell, all three models (common sample, n = 98).

Cell n Sonnet 4 GPT-5 mini Mistral Large
Philosophy 12 7.44 6.88 8.17
Hard Science 10 7.42 6.89 8.16
Social Science 11 7.08 6.80 7.77
Tech/Entrepreneurship 11 6.95 6.72 7.71
History 11 7.11 6.76 7.62
Journalism/Public 11 6.70 6.60 7.61
Economics 10 6.97 6.65 7.56
Law/Policy 11 6.78 6.49 7.52
Lit/Arts 11 6.61 6.27 7.50

All three models preserve the top-two and bottom-one discipline ranking: philosophy and hard science are the highest-scoring cells on every scorer, and literary arts is the lowest on every scorer. The ordering of intermediate cells shifts somewhat between scorers — Mistral places social science and tech higher than history, while Sonnet places history above both — but the differences within the middle band are small and generally within the standard error of the cell means. The discipline rank ordering is thus substantially stable across all three scoring models even though their absolute-level calibrations differ by more than a full scale point.

Dimension-Level Cell Profiles

Table 4. Sonnet 4 mean dimension scores by discipline cell.

Cell Abs Cmp Ori CC EC GSM Voc Syn
Philosophy 8.00 7.00 7.39 7.75 7.75 6.67 8.03 6.97
Hard Science 7.70 6.70 7.53 7.87 7.90 6.73 7.97 6.97
Social Science 7.52 6.52 7.03 7.64 7.52 6.18 7.79 6.79
History 7.42 6.42 7.24 7.94 7.39 6.18 7.85 6.91
Tech/Entrepreneurship 7.24 6.27 7.36 7.48 7.12 6.15 7.39 6.39
Economics 7.27 6.33 6.80 7.43 7.57 6.30 7.53 6.50
Law/Policy 7.21 6.09 6.52 7.39 7.27 6.06 7.42 6.48
Journalism/Public 6.82 5.76 6.73 7.52 7.15 6.15 7.21 6.27
Lit/Arts 6.94 5.85 7.00 7.12 6.82 5.88 7.45 6.48

Notable discipline-specific patterns:

  • Philosophy achieves the maximum mean Abstraction (8.00) — every philosopher in the sample operates at the "Principled" band. This is a ceiling effect consistent with the prior ecological validity study.
  • Hard Science leads on Epistemic Calibration (7.90) and Originality (7.53), reflecting the epistemic marking and novel-framing demands of scientific discourse.
  • History leads on Conceptual Continuity (7.94), consistent with the narrative coherence demands of historical analysis.
  • Lit/Arts scores lowest on three of the six core dimensions (Conceptual Continuity, Epistemic Calibration, Generative Self-Monitoring) and lowest overall on VRI. This is interpreted as a construct-appropriate finding: the EC rubric measures analytical verbal reasoning, not narrative or associative reasoning. Literary discourse deploys different cognitive operations than the analytical register the rubric targets.

Cross-Model Agreement

Table 5. Pairwise cross-model agreement statistics (Pearson r), n = 98 common sample.

Dimension Sonnet↔GPT-5m Sonnet↔Mistral GPT-5m↔Mistral
Abstraction .643 .642 .606
Compression .397 .611 .272
Originality .823 .718 .726
Conceptual Continuity .556 .460 .501
Epistemic Calibration .666 .497 .368
Generative Self-Monitoring .518 .355 .391
Vocabulary .620 .640 .656
Syntax .513 .514 .496
VRI .758 .657 .589
VRI Spearman ρ .721 .595 .536

Table 5b. Mean calibration offsets on VRI (higher-scoring model − lower-scoring model).

Pair Offset (points) Direction
Sonnet − GPT-5 mini +0.33 Sonnet more generous
Mistral − Sonnet +0.73 Mistral more generous
Mistral − GPT-5 mini +1.06 Mistral more generous

The three models form a strict generosity gradient on VRI: Mistral > Sonnet > GPT-5 mini, with a 1.06-point total spread from the most-generous to the least-generous scorer. Pairwise agreement is highest between Sonnet and GPT-5 mini (r = .758) and lowest between GPT-5 mini and Mistral (r = .589), with Sonnet↔Mistral intermediate (r = .657). The mean pairwise r across the three vendors is .668. All three pairwise correlations are substantial and well above chance, indicating that despite the absolute-level calibration differences, the three models substantially agree on the rank ordering of speakers.
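
The two statistics behind Tables 5 and 5b are the ordinary Pearson correlation and the mean per-guest difference. A sketch in TypeScript:

  // Pairwise agreement between two scorers' VRI vectors over the common sample.
  function pearson(x: number[], y: number[]): number {
    const n = x.length;
    const mx = x.reduce((a, b) => a + b, 0) / n;
    const my = y.reduce((a, b) => a + b, 0) / n;
    let sxy = 0, sxx = 0, syy = 0;
    for (let i = 0; i < n; i++) {
      sxy += (x[i] - mx) * (y[i] - my);
      sxx += (x[i] - mx) ** 2;
      syy += (y[i] - my) ** 2;
    }
    return sxy / Math.sqrt(sxx * syy);
  }

  // Mean calibration offset (Table 5b): the mean of per-guest differences,
  // which equals the difference of the two scorers' mean VRIs.
  const meanOffset = (x: number[], y: number[]) =>
    x.reduce((a, b, i) => a + (b - y[i]), 0) / x.length;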

At the dimension level, Originality shows the highest pairwise agreement for the Sonnet↔GPT-5m pair (r = .823), and also high agreement for the Sonnet↔Mistral and GPT-5m↔Mistral pairs (r = .718 and .726 respectively). Compression is the dimension with the most scorer-specific disagreement: Sonnet and Mistral agree moderately on Compression (r = .611), but GPT-5 mini diverges from both (r = .397 with Sonnet and r = .272 with Mistral). This suggests that GPT-5 mini operationalizes propositional density somewhat differently from the other two scorers, which may reflect prompt-interpretation sensitivity on that dimension. Epistemic Calibration and Generative Self-Monitoring also show lower pairwise agreement with Mistral than with the Sonnet↔GPT-5m pair, consistent with Mistral's broader interpretation of what counts as evidence of real-time epistemic marking and self-revision.

The Mistral–Sonnet–GPT-5 mini generosity gradient is observable on every dimension except Syntactic Control. Mistral scores substantially higher than Sonnet on Abstraction (+0.65), Compression (+0.69), Originality (+0.81), Conceptual Continuity (+0.72), and Vocabulary (+0.56), and higher than GPT-5 mini on the same dimensions by even larger margins. The largest single inter-model dimension offset is the Mistral–GPT-5 mini gap on Originality (+1.80 points), reflecting the fact that Mistral credits far more reframings as genuinely novel than GPT-5 mini does.

Inter-Pass Reliability

Table 6. Inter-pass reliability (Sonnet 4): mean spread across three passes per dimension.

Dimension Mean Spread SD
Abstraction 0.09 0.29
Compression 0.15 0.44
Originality 0.18 0.39
Conceptual Continuity 0.29 0.56
Epistemic Calibration 0.29 0.50
Gen. Self-Monitoring 0.38 0.49
Vocabulary 0.15 0.36
Syntax 0.16 0.37
VRI 0.19 0.23

Inter-pass reliability is excellent. Mean VRI spread across three passes is 0.19 points — the same speaker scored three times by the same model under different blinding labels produces VRI scores that differ by less than two-tenths of a scale point on average. Abstraction is the most stable dimension (mean spread 0.09), and Generative Self-Monitoring is the least stable (0.38), consistent with the finding that GSM is more context-sensitive than other dimensions.
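
The spread statistic in Table 6 is computed per guest and dimension, then averaged; a sketch, assuming spread is defined as the max-minus-min of the three pass scores:

  // Inter-pass spread for one guest on one dimension, assuming
  // spread = max - min across the three blinded pass scores.
  function passSpread(passScores: [number, number, number]): number {
    return Math.max(...passScores) - Math.min(...passScores);
  }

  // Table 6 reports this averaged over all guests for each dimension.
  function meanSpread(allGuests: [number, number, number][]): number {
    return allGuests.reduce((s, g) => s + passSpread(g), 0) / allGuests.length;
  }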

Factor Structure: Confirmatory Factor Analysis

To test whether the six core EC dimensions reflect a single general verbal-reasoning factor or a more differentiated structure, we ran a confirmatory factor analysis on each model's scoring data independently. Four nested models were estimated by maximum likelihood on the dimension-level correlation matrix (n = 98 guests for Sonnet, 99 for GPT-5 mini and Mistral Large, each guest contributing the three-pass mean for each dimension):

  • M1 (1-factor): All six core dimensions load on a single general Verbal Reasoning Capacity factor.
  • M2a (2-factor, Cont → GR): Generative Range (GR) = Abstraction, Compression, Originality, Conceptual Continuity; Calibrative Control (CC) = Epistemic Calibration, Generative Self-Monitoring.
  • M2b (2-factor, Cont → CC): GR = Abstraction, Compression, Originality; CC = Conceptual Continuity, Epistemic Calibration, Generative Self-Monitoring.
  • M2c (2-factor, Cont cross-loads): As M2a/M2b, but Conceptual Continuity freely loads on both factors.

Fit indices reported follow standard cutoffs: RMSEA < .08 acceptable, < .06 excellent; CFI > .90 acceptable, > .95 excellent; SRMR < .08 acceptable.
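
The four specifications above differ only in their loading patterns, which can be written compactly as data. A sketch with hypothetical names (true marks a freely estimated loading, false one fixed to zero):

  const DIMS = ["Abs", "Cmp", "Ori", "CC", "EC", "GSM"] as const;

  // One { gr, cc } entry per dimension, in DIMS order.
  const LOADING_PATTERNS: Record<string, { gr: boolean; cc: boolean }[]> = {
    // M1: a single general factor (the GR column stands in for it)
    M1: DIMS.map(() => ({ gr: true, cc: false })),
    // M2a: Conceptual Continuity loads on Generative Range
    M2a: [
      { gr: true, cc: false }, { gr: true, cc: false }, { gr: true, cc: false },
      { gr: true, cc: false }, { gr: false, cc: true }, { gr: false, cc: true },
    ],
    // M2b: Conceptual Continuity loads on Calibrative Control
    M2b: [
      { gr: true, cc: false }, { gr: true, cc: false }, { gr: true, cc: false },
      { gr: false, cc: true }, { gr: false, cc: true }, { gr: false, cc: true },
    ],
    // M2c: Conceptual Continuity cross-loads on both factors
    M2c: [
      { gr: true, cc: false }, { gr: true, cc: false }, { gr: true, cc: false },
      { gr: true, cc: true }, { gr: false, cc: true }, { gr: false, cc: true },
    ],
  };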

Table 7. Confirmatory factor analysis fit indices by scoring model.

Model Sonnet 4 GPT-5 mini Mistral Large
χ² (df) RMSEA CFI χ² (df) RMSEA CFI χ² (df) RMSEA CFI
M1 (1-factor) 135.6 (15) .286 .712 30.7 (15) .103 .918 41.3 (15) .134 .959
M2a (Cont → GR) 35.8 (14) .126 .948 18.8 (14) .059 .975 32.2 (14) .115 .971
M2b (Cont → CC) 34.7 (14) .123 .951 30.1 (14) .108 .916 27.9 (14) .101 .978
M2c (Cont cross-loads) 21.1 (13) .080 .981 18.8 (13) .067 .970 26.1 (13) .101 .979

Winning models by AIC (the best-fitting specification per scorer): Sonnet → M2c (AIC = -4.9); GPT-5 mini → M2a (AIC = -9.2); Mistral → M2b (AIC = -0.1). Each of the three scorers selects a different best-fitting specification — and the difference between them is entirely about where Conceptual Continuity belongs in the two-factor structure. Every scorer rejects the one-factor model in favor of a two-factor solution, but the three scorers disagree on Continuity's factorial home.

Table 8. Standardized factor loadings under each model's best-fitting CFA solution.

Dimension Sonnet (M2c) GR / CC GPT-5m (M2a) GR / CC Mistral (M2b) GR / CC
Abstraction .921 / — .707 / — .988 / —
Compression 1.000 / — .679 / — .993 / —
Originality .680 / — .562 / — .833 / —
Conceptual Continuity .342 / .357 .765 / — — / .798
Epistemic Calibration — / .824 — / .858 — / .730
Gen. Self-Monitoring — / 1.000 — / .700 — / .742
Factor correlation (φ) .384 .766 .886

Note. Dashes mark loadings fixed to zero under that scorer's best-fitting specification; φ is the estimated correlation between the two factors, one value per scorer.

Three scorers produce three different best-fitting placements for Conceptual Continuity — Sonnet cross-loads it on both factors, GPT-5 mini assigns it to Generative Range only, and Mistral assigns it to Calibrative Control only. The pattern is striking: Continuity is the only dimension whose factorial home varies across scorers, and each scorer picks a different home for it. Abstraction, Compression, and Originality cluster as Generative Range on every scorer; Epistemic Calibration and Generative Self-Monitoring cluster as Calibrative Control on every scorer. No scorer assigns any of these five dimensions differently.

Four findings are substantively important across the three scoring models:

1. A one-factor model is rejected by every scorer. No model is consistent with a unidimensional general-verbal-reasoning factor. Sonnet rejects M1 decisively (RMSEA = .286, CFI = .712). GPT-5 mini rejects it more mildly but still clearly (RMSEA = .103, CFI = .918; Δχ²(1) M1 vs. M2a = 11.91, p = .0006). Mistral also rejects it in the direction of a two-factor solution (RMSEA = .134, CFI = .959; Δχ²(1) M1 vs. M2b = 13.32, p = .0003). Under all three scorers, the two-factor structure fits significantly better than the unidimensional alternative.

2. Factor composition is invariant across scorers. All three models independently cluster Abstraction, Compression, and Originality as one factor (Generative Range) and Epistemic Calibration and Generative Self-Monitoring as the other (Calibrative Control). No scorer assigns any of these five dimensions differently. The empirical clustering of the five core dimensions is model-invariant across three independent LLMs from three different vendors.

3. Factor separation varies systematically across scorers. The factor correlation φ — how distinct GR and CC are from one another — rises from Sonnet (φ = .384, well-separated factors) through GPT-5 mini (φ = .766) to Mistral (φ = .886, nearly-merging factors). The raw correlation matrices point to the source: the three scorers differ dramatically in their within-factor correlations. Sonnet's Abs↔Comp correlation is .921; Mistral's is .981 (nearly rank-degenerate); GPT-5 mini's is .471. When the dimensions inside a factor are scored near-identically, they carry little independent signal, and cross-factor covariance comes to dominate the residual, pulling the factor estimates together. This is a calibration phenomenon, not a structural disagreement about what the factors are.

4. Conceptual Continuity is empirically a boundary dimension with three different best-fitting homes. Sonnet's best-fitting solution has Continuity cross-loading both factors (.342 on GR, .357 on CC). GPT-5 mini's best-fitting solution (M2a) places Continuity entirely on GR (.765), with any CC cross-loading collapsing to essentially zero (−0.014) when estimated. Mistral's best-fitting solution (M2b) places Continuity entirely on CC (.798), with no GR loading. Three frontier LLMs, three distinct homes for the same dimension. This is the strongest empirical evidence to date that Conceptual Continuity is a boundary dimension whose factorial placement is scorer-convention-determined rather than construct-determined. It is therefore excluded from both the reported Generative Range and Calibrative Control subscores in production, retaining only the five unambiguously-loading dimensions (Abstraction, Compression, Originality for GR; Epistemic Calibration, Generative Self-Monitoring for CC). The production decision is not a theoretical choice; it is what the three-way CFA comparison forces.

Individual Speaker Results

Table 9. Top 10 highest VRI guests under Sonnet 4, with matched scores from the other two models.

Rank Guest Cell Sonnet GPT-5m Mistral
1 Rebecca Kukla Philosophy 7.68 6.97 8.22
1 Ed Boyden Hard Science 7.68 7.35 8.05
1 Michelle Dawson Hard Science 7.68 7.13 8.73
4 Agnes Callard Philosophy 7.63 7.03 7.62
4 Henry Farrell Social Science 7.63 7.09 7.73
4 Cass Sunstein Law/Policy 7.63 6.74 7.83
7 Alison Gopnik Hard Science 7.57 7.09 8.56
7 David Deutsch Hard Science 7.57 7.03 8.68
7 Vitalik Buterin Tech 7.57 7.20 7.90
10 Russ Roberts Economics 7.52 6.63 7.79

Table 10. Top 10 largest Sonnet↔GPT-5m VRI disagreements (with Mistral comparison).

Guest Cell Sonnet GPT-5m |Δ S−G| Mistral
Camille Paglia Lit/Arts 6.67 5.57 1.11 7.94
Dana Gioia Lit/Arts 7.34 6.42 0.92 7.67
Cass Sunstein Law/Policy 7.63 6.74 0.89 7.83
Russ Roberts Economics 7.52 6.63 0.89 7.79
Peter Singer Philosophy 7.36 6.49 0.87 7.68
Diarmaid MacCulloch History 7.47 6.69 0.78 7.84
Abhijit Banerjee Economics 7.29 6.58 0.71 7.69
Rebecca Kukla Philosophy 7.68 6.97 0.71 8.22
Jess Wade Hard Science 7.07 6.37 0.70 7.84
Marc Andreessen Tech 7.45 6.75 0.69 8.00

The largest Sonnet↔GPT-5 mini disagreements are asymmetric: in all 10 cases Sonnet scores higher than GPT-5 mini, and in all 10 cases Mistral scores higher than both. This pattern is consistent across the full sample — 87 of 98 guests receive higher VRI from Sonnet than from GPT-5 mini, and 96 of 98 receive higher VRI from Mistral than from GPT-5 mini. The disagreements are concentrated among speakers whose discourse style is rhetorically confident and compressed (Paglia, Sunstein, Andreessen). The consistent direction of disagreement across all three model pairs — Mistral highest, Sonnet intermediate, GPT-5 mini lowest — suggests GPT-5 mini evaluates such speakers more strictly against the rubric's behavioral anchors, while the more generous scorers credit rhetorical confidence as evidence of underlying reasoning capacity even when the behavioral markers are ambiguous.

Discussion

The Normative Contribution

This study provides the first large-scale normative dataset for the EC Verbal Reasoning Index. Ninety-eight speakers scored by three independent frontier language models across nine balanced disciplinary cells produce VRI distributions that can serve as a reference population for interpreting individual scores. Median VRI in this population is approximately 7.0 on Sonnet, 6.7 on GPT-5 mini, and 7.7 on Mistral — a population pre-selected by the same interviewer for intellectual distinction. The scorer-specific median matters because, as documented above, the three scoring models differ by more than a full scale point in absolute-level calibration.

The compressed VRI ranges (5.25–7.68 on Sonnet, 5.25–7.35 on GPT-5 mini, 6.63–8.84 on Mistral) reflect this pre-selection. The normative data should be interpreted as norms for the upper end of the ability distribution, not as population norms. A broader normative study — currently planned using Prolific recruitment with concurrent ICAR fluid reasoning assessment — will provide norms across the full ability range.
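
Pending those broader norms, an individual score can still be located against this reference population by standardizing against the matching scorer's distribution in Table 2. A sketch, assuming approximate normality; the scorer must match the norms, since calibration differs by more than a scale point:

  // Reference-population norms from Table 2 (common sample, n = 98).
  const NORMS = {
    sonnet4:      { mean: 7.01, sd: 0.45 },
    gpt5mini:     { mean: 6.67, sd: 0.34 },
    mistralLarge: { mean: 7.74, sd: 0.44 },
  } as const;

  // z-score of a new VRI relative to the matching scorer's distribution.
  function vriZ(score: number, scorer: keyof typeof NORMS): number {
    const { mean, sd } = NORMS[scorer];
    return (score - mean) / sd;
  }
  // e.g. vriZ(7.52, "sonnet4") ≈ +1.13: about one SD above this reference population.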

Cross-Model Agreement as Validity Evidence

The pairwise inter-model agreement reported here (mean r = .668 on VRI across three pairs, range .589 to .758) is, to our knowledge, the first published three-way cross-vendor LLM-as-judge reliability statistic for a psychometric rubric. Three models from three different vendors — Anthropic (Sonnet 4), OpenAI (GPT-5 mini), and Mistral AI (Mistral Large) — trained on different data, with different architectures, applied the same rubric to the same transcripts without any coordination, and produced scores whose rank-orderings substantially agree.

This agreement is stronger than most published inter-rater reliability estimates for human-scored performance assessments in educational measurement, even when averaged across all three independent pairs. It does not mean the models are "correct" — all three could share systematic biases inherited from overlapping web-scale pretraining data. But it establishes that the construct measured by the EC rubric is scorer-recoverable across vendor boundaries: the same rubric, applied by three frontier LLMs with independently developed architectures, training pipelines, and vendor lineages, produces broadly the same rank ordering of speakers. This is the fundamental requirement for measurement reliability, and it holds across the three-way comparison even though the absolute-level calibration of the three scorers differs by more than a full scale point.

A Two-Factor Reasoning Structure, Stable in Composition and Variable in Separation

The confirmatory factor analyses converge on a substantively important claim: the six core EC dimensions are not a single lump. A one-factor model is rejected under all three scoring models, and all three independently recover the same two-factor composition — Abstraction, Compression, and Originality clustering as a Generative Range factor, and Epistemic Calibration and Generative Self-Monitoring clustering as a Calibrative Control factor. That three independently prompted frontier LLMs, with no access to each other's scoring and no shared intermediate representations, partition the same six dimensions into the same two empirical clusters is the strongest single piece of convergent-validity evidence in this study — and the three-way replication is meaningfully stronger than a two-way one, because it substantially weakens the alternative explanation that the models simply inherited the same partition from shared pretraining.

The theoretical interpretation is straightforward. Generative Range captures the productive side of verbal reasoning — the ability to operate at high levels of abstraction, to pack propositions densely, and to generate non-obvious reframings. Calibrative Control captures the monitoring side — the ability to mark epistemic status explicitly and to revise one's own formulations upward in real time. Both factors contribute to what the VRI composite measures, and they are moderately-to-highly correlated across all three scorers (φ = .384 Sonnet, .766 GPT-5 mini, .886 Mistral) — distinguishable but not independent, as expected for two aspects of a broader verbal-reasoning capacity.

Two findings vary systematically across scorers rather than replicating cleanly. First, the magnitude of factor separation differs substantially: Sonnet produces well-separated factors (φ = .384), GPT-5 mini produces moderately separated factors (φ = .766), and Mistral produces nearly-merging factors (φ = .886). The source is visible in the raw correlation matrices: each scorer shows a characteristic within-factor correlation pattern. Sonnet has Abs↔Comp r = .921; Mistral has .981 (nearly rank-degenerate); GPT-5 mini has .471 (the most dimension-independent of the three). Near-redundant within-factor dimensions leave only cross-factor residual signal, so a scorer's within-factor correlation pattern shapes how far apart the estimated factors can sit. This is a scoring-calibration phenomenon, not a structural disagreement about what the factors are — all three scorers still prefer the two-factor structure to the one-factor alternative by conventional fit criteria.

Second, Conceptual Continuity — and Continuity alone — receives three different best-fitting factorial homes across the three scorers. Sonnet's preferred model cross-loads Continuity on both factors. GPT-5 mini's preferred model places Continuity on Generative Range only. Mistral's preferred model places Continuity on Calibrative Control only. No other dimension varies in its factor assignment across scorers. We interpret this as evidence that Continuity is empirically a boundary dimension — its correlations with the other five are ambiguous enough that the "right" factor for it depends on idiosyncratic scoring calibration rather than underlying construct structure. This finding is not a defect; it is the clearest possible empirical justification for the production decision to exclude Continuity from both the reported Generative Range and Calibrative Control subscores. The reported subscores use only the five dimensions whose factor assignment is invariant across all three independent scorers. Users of the scale should specify the scoring model just as they would specify the norming sample, because absolute-level calibration and factor-separation magnitude both vary across scorers; the rank-ordering of speakers does not.

Disciplinary Reasoning Signatures

The finding that all three models independently reproduce the same top-of-hierarchy and bottom-of-hierarchy disciplinary ranking — philosophy and hard science at the top, literary arts at the bottom — provides convergent validity evidence that the rubric detects real differences in discourse register rather than random variation. Intermediate-cell orderings shift slightly across scorers, but generally within the standard error of the cell means. The dimension-level profiles are consistent with what is known about how disciplinary training shapes discourse: philosophers reason at the level of principles (Abstraction at or near the scale ceiling), scientists mark epistemic boundaries explicitly (the highest Epistemic Calibration means), historians build cumulative narrative arguments (the highest Conceptual Continuity means), and literary speakers use associative and narrative reasoning that the analytical rubric does not fully capture.

Limitations

  1. The sample is pre-selected. All speakers are guests on a single interview podcast, selected by the same host for intellectual distinction. This compresses the VRI range and limits generalizability.

  2. No external criterion. Unlike the prior ecological validity study, this study does not correlate VRI with an external measure. The disciplinary patterns are interpretable but not validated against independent criteria.

  3. The cross-model calibration offsets are systematic and large. The three scorers span a strict 1.06-point generosity gradient on VRI: Mistral > Sonnet > GPT-5 mini. Norms based on one model's scores are not interchangeable with norms based on another's. Any production use of the EC rubric must specify the scoring model, and any longitudinal comparison of a single speaker over time must use the same scorer throughout.

  4. Compression shows scorer-specific disagreement. GPT-5 mini operationalizes propositional density notably differently from Sonnet (r = .397) and Mistral (r = .272), while Sonnet and Mistral agree more substantially with each other (r = .611). This suggests the GPT-5 mini prompt interpretation on Compression may be an outlier, and the dimension may benefit from additional behavioral anchoring to align scorers.

  5. Literary arts scores reflect a construct boundary. The rubric measures analytical verbal reasoning. Speakers whose primary mode is narrative, associative, or performative will score lower not because they are less intelligent but because the rubric is not designed to measure their kind of reasoning. This is a scope limitation, not a measurement failure.

  6. Single-pass written-vs-spoken comparison is not included. The prior ecological validity study included a written-vs-spoken pilot for three guests. The present study does not extend this comparison.

  7. Conceptual Continuity's factor placement is scorer-dependent. The three-scorer CFA comparison reveals that Conceptual Continuity has no single empirical home in the two-factor structure — each scorer places it differently. This is a limitation of the current rubric insofar as it suggests the Continuity behavioral descriptors are ambiguous with respect to the Generative Range / Calibrative Control partition. It is also the clearest empirical justification for excluding Continuity from the reported GR and CC subscores. Future rubric revisions could either sharpen the Continuity anchors to force a consistent factorial home or formally recognize it as a cross-cutting dimension that contributes to VRI without loading on either subscore.

Conclusion

Ninety-eight speakers from Conversations with Tyler, scored by three independent frontier large language models from three different vendors across six core dimensions of verbal reasoning, produce normative VRI data that is internally consistent (mean inter-pass VRI spread = 0.19 points), cross-model reliable (mean pairwise r = .668), discipline-differentiated in theoretically predicted directions, and factorially coherent under every scorer.

Two complementary sources of convergent validity support the EC rubric's construct claim. First, all three scoring models independently produce the same disciplinary hierarchy — philosophy and hard science at the top, literary arts at the bottom, with theoretically interpretable dimension-level profiles for each discipline — despite having no access to speaker identity, discipline, or each other's scores. Second, confirmatory factor analyses estimated independently on each of the three scorers' data reject a unidimensional model and recover an identical two-factor composition: Abstraction, Compression, and Originality cluster as Generative Range; Epistemic Calibration and Generative Self-Monitoring cluster as Calibrative Control. The factor composition is invariant across scorers. Only factor separation magnitude and the placement of one boundary dimension (Conceptual Continuity) vary, and both variations are attributable to scoring-calibration differences rather than to disagreement about what the factors are.

Together, these findings suggest that what the rubric measures is real, even if what it measures is narrower than intelligence writ large. The EC rubric measures analytical verbal reasoning as performed in spontaneous speech, and the construct is internally structured as two moderately-to-highly correlated but empirically distinguishable components — one productive, one calibrative. Within that scope, the rubric measures reliably, it differentiates meaningfully, and its underlying factor structure is recoverable by three independent frontier LLMs from three different vendors. The three-way replication of the Generative Range / Calibrative Control partition, with a three-way disagreement confined to a single boundary dimension, constitutes the strongest evidence currently available that the two-factor structure of verbal reasoning recovered here is a property of the construct rather than an artifact of any individual scorer.

Appendix A: Complete Per-Guest Dimension Scores

The full per-guest dimension-level scores are presented in three tables, one per scoring model, to keep each table within the page width. Scores are 3-pass blinded averages on the 1–9 scale. Abs = Abstraction; Cmp = Compression; Ori = Originality; CC = Conceptual Continuity; EC = Epistemic Calibration; GSM = Generative Self-Monitoring; Voc = Vocabulary (moderator); Syn = Syntactic Control (moderator). VRI = Verbal Reasoning Index composite weighted from the six core dimensions. Rows sorted by discipline cell, then by that scorer's VRI descending within cell.

Appendix A.1: Claude Sonnet 4 scores (n = 98)

Guest Cell Abs Cmp Ori CC EC GSM Voc Syn VRI
Russ Roberts economics 8 7 7 8 8 7 7.7 6.7 7.52
Daron Acemoglu economics 8 7 8 8 7 6 8 7 7.34
Abhijit Banerjee economics 7.3 6.3 7 8 8 7 8 7 7.29
Paul Krugman economics 7.3 6.3 6.3 7.3 8 7 7.7 6.7 7.08
Raj Chetty economics 7 6 6.7 7.7 8 6.7 7 6.3 7.02
Alain Bertaud economics 7 6 7 8 7 6 7 6 6.84
Larry Summers economics 7 6 6 7 8 6.3 7.7 7 6.75
Simon Johnson economics 7 6 6 7 7.7 6 7.3 6.3 6.64
Nassim Nicholas Taleb economics 7 6.7 8 6.3 6.3 5.3 8 6 6.61
Alan Taylor economics 7 6 6 7 7.7 5.7 7 6 6.59
Ed Boyden hard_science 8 7 8 8 8 7 8 7 7.68
Michelle Dawson hard_science 8 7 8 8 8 7 8 7 7.68
Alison Gopnik hard_science 8 7 8 8 7.7 6.7 8 7 7.57
David Deutsch hard_science 8 7 8 8 7.7 6.7 8 7 7.57
Steven Pinker hard_science 8 7 7 8 8 7 8 7 7.52
Michael Nielsen hard_science 8 7 8 7 8 6 8 7 7.36
Paul Bloom hard_science 7.7 6.7 7.3 7.7 8 6.7 7.7 6.7 7.35
Philip Ball hard_science 7.3 6.3 7 8 8 6.7 8 7 7.24
Atul Gawande hard_science 7 6 7 8 8 7 8 7 7.18
Jess Wade hard_science 7 6 7 8 7.7 6.7 8 7 7.07
Diarmaid MacCulloch history 8 7 7.3 8 8 6.3 8 7 7.47
Ada Palmer history 8 7 8 8 7 6 8 7 7.34
Adam Tooze history 8 7 8 8 7 6 8 7 7.34
Jill Lepore history 8 7 8 8 7 6 8 7 7.34
Roy Foster history 7.7 6.7 7 8 8 6.3 8 7 7.30
Helen Castor history 7 6 7 8 8 7 8 7 7.18
Jennifer Burns history 7 6 7 8 8 6.3 7 6.3 7.07
Paul Gillingham history 7 6 7 8 7 6 8 7 6.84
Niall Ferguson history 7 6 7 8 7 6 8 7 6.84
Patricia Fara history 7 6 6.7 7.7 7.3 6 7.3 6.7 6.79
Reza Aslan history 7 6 6.7 7.7 7 6 8 7 6.73
Ezra Klein journalism_public 7 6 7 8 8 7 7 6.7 7.18
Nate Silver journalism_public 7 6 6.3 7.3 8 7 7 6 6.97
David Brooks journalism_public 7 6 7 8 7 6 7.3 6.3 6.84
Malcolm Gladwell journalism_public 7 6 7 8 7 6 7 6 6.84
Larissa MacFarquhar journalism_public 7 6 7 8 7 6 8 7 6.84
Andrew Sullivan journalism_public 7 6 7 8 7 6 8 7 6.84
Ben Thompson journalism_public 7 6 6.3 7.3 7 6 7 6 6.63
Barkha Dutt journalism_public 7 6 6 7 7 6 7 6 6.52
Ben Westhoff journalism_public 6 6 7 7 7 6 7 6 6.50
Annie Jacobsen journalism_public 7 4.7 6.3 7 7.7 6 7 6 6.48
Andrew Ross Sorkin journalism_public 6 4.7 7 7 6 5.7 7 6 6.05
Cass Sunstein law_policy 8 7 7.7 8 8 7 8 7 7.63
Jamal Greene law_policy 8 7 7 8 8 6.7 8 7 7.47
Rachel Harmon law_policy 7 6 6.3 7.3 8 6.3 7.3 6.3 6.86
Ben Sasse law_policy 7 6 7 8 7 6 7.7 7 6.84
Bruno Macaes law_policy 7.3 6.3 7.3 7.7 6.7 5.7 7.7 6.3 6.84
Jennifer Pahlka law_policy 7 6 7 8 7 6 7 6 6.84
Samantha Power law_policy 7 6 6 7 8 6.3 8 7 6.75
Stanley McChrystal law_policy 7 6 6 7 7 6 7 6 6.52
Tom Tugendhat law_policy 7 6 6 7 6.7 5.7 7 6.7 6.41
John O. Brennan law_policy 7 4.7 5.3 6.3 7.7 6 7 6 6.21
Leopoldo Lopez law_policy 7 6 6 7 6 5 7 6 6.18
Dana Gioia lit_arts 8 7 8 8 7 6 8 7 7.34
Margaret Atwood lit_arts 7 6 8 7 8 7 8 7 7.18
Brian Koppelman lit_arts 7 6 7 8 7.7 6.7 7.3 6.3 7.07
Fuchsia Dunlop lit_arts 7 6 7 8 7 6 8 7 6.84
Alex Ross lit_arts 7 6 7 8 7 6 8 7 6.84
Camille Paglia lit_arts 8 7 8 6.7 5.7 4.7 8 7 6.67
Benjamin Moser lit_arts 7 6 7 7.3 6.7 5.7 8 7 6.62
Andy Weir lit_arts 6.3 5.3 7 7.3 6.3 6 6.7 6 6.39
Cynthia Haven lit_arts 7 6 6 6 7 6 7 6 6.36
Emily St John Mandel lit_arts 6 5 7 6 7 6 7 6 6.18
Ana Vidovic lit_arts 6 4 5 6 5.7 4.7 6 5 5.25
Rebecca Kukla philosophy 8 7 8 8 8 7 8 7 7.68
Agnes Callard philosophy 8 7 7.7 8 8 7 8 7 7.63
David Bentley Hart philosophy 8 7 7 8 8 7 8.3 7.7 7.52
Elijah Millgram philosophy 8 7 8 7 8 7 8 7 7.52
William MacAskill philosophy 8 7 7 8 8 7 8 7 7.52
Amia Srinivasan philosophy 8 7 7 8 8 6.7 8 7 7.47
Rabbi David Wolpe philosophy 8 7 7.3 7.7 8 6.7 8 7 7.47
John Gray philosophy 8 7 8 8 7.3 6.3 8 7 7.45
Kwame Anthony Appiah philosophy 8 7 7 8 8 6.3 8 7 7.41
Peter Singer philosophy 8 7 6 8 8 7 8 7 7.36
Noam Chomsky philosophy 8 7 7.7 8 7 6 8 7 7.29
Slavoj Zizek philosophy 8 7 8 6.3 6.7 6 8 6 7.01
Henry Farrell social_science 8 7 7.7 8 8 7 8 7 7.63
Daniel Kahneman social_science 8 7 8 7.3 8 6.3 8 7 7.47
Jonathan Haidt social_science 8 7 7 8 8 6 8 7 7.36
Joseph Henrich social_science 8 7 8 8 7 6 8 7 7.34
Arthur Brooks social_science 7.7 6.7 7 8 7.7 6.7 8 7 7.29
Philip E Tetlock social_science 7 6 7 7.7 8 6.3 8 7 7.02
Chris Blattman social_science 7 6 6.7 7.7 8 6.7 7.7 6.7 7.02
Harvey Mansfield social_science 8 7 7 8 6 5.3 8 7 6.89
Ashley Mears social_science 7 6 7 7.3 7 6 7 6 6.73
Daniel Carpenter social_science 7 6 6 7 8 6 8 7 6.70
Eric Kaufmann social_science 7 6 6 7 7 5.7 7 6 6.47
Vitalik Buterin tech_entrepreneurship 8 7 8 8 7.7 6.7 8 7 7.57
Marc Andreessen tech_entrepreneurship 8 7 8 8 7 6.7 8 7 7.45
Audrey Tang tech_entrepreneurship 8 7 8 7 7.7 6 8 7 7.30
Balaji Srinivasan tech_entrepreneurship 8 7 8 7 7 6.3 8 7 7.23
Daniel Gross tech_entrepreneurship 7 6 7 8 7.7 6.7 7 6 7.07
Sam Altman tech_entrepreneurship 7 6 7 7.7 7.7 6 7 6 6.91
Brian Armstrong tech_entrepreneurship 7 6 7 7.7 7.7 6 7 6 6.91
Chris Dixon tech_entrepreneurship 7 6 7 8 7 6 7.3 6.3 6.84
Blake Scholl tech_entrepreneurship 7 6 8 7 6 6 7 6 6.66
Patrick Collison tech_entrepreneurship 6.7 6 7 7 6 5.3 7 6 6.33
David Rubenstein tech_entrepreneurship 6 5 6 7 7 6 7 6 6.18

Appendix A.2: GPT-5 mini scores (n = 99)

Guest Cell Abs Cmp Ori CC EC GSM Voc Syn VRI
Daron Acemoglu economics 7.7 6.3 7 7.7 7.7 6.7 6.7 6.7 7.19
Larry Summers economics 7.7 6 5.7 6.3 8 7.3 7 7 6.87
Raj Chetty economics 7 5.7 6 7 7.7 7 6 6 6.75
Nassim Nicholas Taleb economics 7.7 6 7 6 6.7 7 6.7 6 6.74
Russ Roberts economics 7.3 5.7 6 6.7 7 7 6 6.7 6.63
Alan Taylor economics 7.3 6 5 6.7 7.7 6.7 6.3 6.7 6.59
Abhijit Banerjee economics 7 5.7 6 6.3 7.3 7 6 6 6.58
Paul Krugman economics 7.3 6 6 6.3 7 6.7 6.3 6.7 6.58
Alain Bertaud economics 7.7 5.3 6 7 6.7 6 6.7 6.7 6.47
Simon Johnson economics 7.3 5 5.3 6.3 6.3 6 6.3 6.3 6.09
Ed Boyden hard_science 7.7 6 7.7 7.7 8 7 7 6.7 7.35
Michelle Dawson hard_science 7.7 6 7 7.3 7.7 7 7 6 7.13
Alison Gopnik hard_science 8 5.7 7 7.3 7.7 6.7 7 7 7.09
David Deutsch hard_science 8 5.3 6.7 7.3 7.7 7 7.3 6.7 7.03
Steven Pinker hard_science 7.3 6 6 7.3 7.7 7 7.3 7.3 6.91
Paul Bloom hard_science 7.3 6 6 7 7.7 7 6 6.3 6.86
Philip Ball hard_science 7.3 6 6 6.7 7.7 7 6.7 6.7 6.81
Michael Nielsen hard_science 7.7 5.7 6 6.3 7.3 7 6.3 6.3 6.70
Atul Gawande hard_science 7.7 5.7 6 6.7 7 7 6.7 6.7 6.69
Ezekiel Emanuel hard_science 7 5.7 6 6.7 7 6.7 6.3 6.3 6.52
Jess Wade hard_science 7 4.3 6 6.3 7.3 7 6.7 6.7 6.37
Helen Castor history 7.7 6 6 7.7 8 7 7 7 7.09
Adam Tooze history 8 6.3 6.3 7 7.7 7 8 7.3 7.09
Roy Foster history 7.7 6 6 7.3 7.7 6.7 7.3 7.3 6.92
Paul Gillingham history 8 6.3 6 7 7 7 7.3 7 6.91
Ada Palmer history 7.3 6 6.7 7 7.3 6.7 7.3 7.3 6.85
Jill Lepore history 7.3 5.3 6.3 7 7.7 7 6.3 7 6.81
Diarmaid MacCulloch history 7.3 6 6.3 7 7 6.3 7.7 7 6.69
Jennifer Burns history 7 5.3 6 7 7.3 6.3 6.3 6.3 6.53
Niall Ferguson history 7 6 6 6.7 7 6.3 7 6.7 6.52
Patricia Fara history 7 5.7 6 7 7 6.3 6.3 6.7 6.52
Reza Aslan history 7.3 5.7 6 6.7 7 6 6.7 6 6.47
Nate Silver journalism_public 8 6 6 7 8 7 6 6 7.04
Barkha Dutt journalism_public 7.3 6 6 7 7.7 7 6.7 6.7 6.86
Malcolm Gladwell journalism_public 7.3 6 6 7 7 7.3 6.7 7 6.79
Ben Thompson journalism_public 7 6.3 6 7 7.3 6 6.3 6 6.63
Ezra Klein journalism_public 7.3 5.7 6 6.7 7.3 6.3 6 6 6.59
Andrew Sullivan journalism_public 7.3 6 6 6.3 6.7 7 6.7 6.7 6.57
Andrew Ross Sorkin journalism_public 6.7 5.7 6 6.7 7 6.7 6 6 6.46
David Brooks journalism_public 7.3 4.7 6 6.3 7 7 6.3 6 6.42
Larissa MacFarquhar journalism_public 7 4.7 6 6.7 7 7 7 7 6.41
Ben Westhoff journalism_public 6.7 5.7 6 6.7 7 6.3 6 6 6.41
Annie Jacobsen journalism_public 7 4.7 6 6.3 7.3 6.7 6 6.3 6.37
Jamal Greene law_policy 7.7 6 6 7 8 7 6.7 6.3 6.98
Rachel Harmon law_policy 7.7 6 5.7 7 7.7 6.7 6.7 6.3 6.81
Cass Sunstein law_policy 7 6 6 7 7.3 7 6.7 6.3 6.74
Bruno Macaes law_policy 7.3 6 6 7 7 6.3 6.3 6 6.63
Ben Sasse law_policy 7 5.7 6 6.7 7 6.7 6 6.3 6.52
Samantha Power law_policy 7.7 5 5.3 6.7 7 7 6.7 6.7 6.48
Jennifer Pahlka law_policy 7 6 6 6.7 6.7 6.3 6 6 6.46
Stanley McChrystal law_policy 7.3 5.7 6 6.3 6.7 6 6 6 6.36
Tom Tugendhat law_policy 7 5.3 5.3 6.7 7 6.3 6.7 6.3 6.31
Leopoldo Lopez law_policy 7 6 5 6.7 6 6 6 6 6.13
John O. Brennan law_policy 7 5 4.3 6 7.3 6 6.3 6 5.99
Alex Ross lit_arts 7.7 6 6 7 7 6.7 7.7 7 6.75
Andy Weir lit_arts 7 6 6 7 7 7 6.3 6.3 6.68
Margaret Atwood lit_arts 7 5.7 6.7 6.3 7.3 6.7 7 7 6.63
Fuchsia Dunlop lit_arts 7 6 6.3 7 7 6 7.3 6.7 6.57
Cynthia Haven lit_arts 7 5.7 5.7 6.7 7.3 6.3 6.7 6 6.47
Brian Koppelman lit_arts 7 5 6 6.7 7.3 6.7 7.3 7 6.47
Dana Gioia lit_arts 7.7 4.7 6 7 6.7 6.3 7.3 7 6.42
Benjamin Moser lit_arts 7 5.3 6 6 6.7 6.3 6.3 6 6.25
Emily St John Mandel lit_arts 6 4.3 5.7 6 7 6.3 6 6.3 5.91
Camille Paglia lit_arts 7 3.3 6.7 6 4.7 5.7 7.3 6.7 5.57
Ana Vidovic lit_arts 6 3.3 4 6 6 6 6 6 5.25
Agnes Callard philosophy 7.7 6 6.7 7 7.7 7 6.7 6 7.03
Rabbi David Wolpe philosophy 8 5.3 6.3 7 8 7 7.7 7 6.99
John Gray philosophy 8 6 6 7 7.7 7 7.7 6.7 6.98
Amia Srinivasan philosophy 8 6 6.3 7 7.7 6.7 7 6.7 6.98
Rebecca Kukla philosophy 7.7 5.7 6.7 7 7.7 7 6.3 6.7 6.97
Elijah Millgram philosophy 7.7 6 6.7 6.7 7.3 7.3 7 6.3 6.97
Noam Chomsky philosophy 8 6.3 6.3 7.3 7 6.7 7 7 6.97
William MacAskill philosophy 7.7 6 5.7 7 8 7 6.7 6.3 6.93
David Bentley Hart philosophy 7.7 6 6 7 7.7 7 7.7 7.3 6.92
Kwame Anthony Appiah philosophy 7.7 6 6 7 7.3 6.7 6.3 7 6.81
Slavoj Zizek philosophy 7 5.3 6.7 6.7 7 6.3 6.7 6 6.52
Peter Singer philosophy 7.7 5.7 5.3 6.7 7.3 6 6.7 6.3 6.49
Daniel Kahneman social_science 8 6 7 7 8 7.3 6.3 6.3 7.25
Henry Farrell social_science 8 6 6.3 7 8 7 7.3 7 7.09
Joseph Henrich social_science 7.3 6 6.7 7 8 7 6.7 6 7.03
Daniel Carpenter social_science 7.7 6 6 7 8 7 6.7 6.7 6.98
Philip E Tetlock social_science 7.7 6 6.3 7 7.3 7.3 6.7 6 6.97
Jonathan Haidt social_science 7.3 6 6 7 7.3 7 6.3 6.7 6.80
Arthur Brooks social_science 7.7 5 6.3 7 7 7 7 6.3 6.69
Chris Blattman social_science 7.3 5.3 6 7 7.3 6.3 6 6 6.59
Harvey Mansfield social_science 7.7 6 6 6.3 7 6.3 7 6 6.59
Eric Kaufmann social_science 7.3 6 5.7 6.7 7 6 6 6 6.47
Ashley Mears social_science 7.3 5 5.7 7 7 6 6 6 6.37
Vitalik Buterin tech_entrepreneurship 8 6 7 7 8 7 7 6.3 7.20
Audrey Tang tech_entrepreneurship 7.7 5.7 7.3 7.3 7.3 6.3 7 7 6.97
Balaji Srinivasan tech_entrepreneurship 7.7 6 7 6.7 7 7 7 6.3 6.91
Daniel Gross tech_entrepreneurship 7 6 6 7 8 7 6 6.3 6.86
Chris Dixon tech_entrepreneurship 7.3 6 6 7 7.7 6.7 7.3 6.3 6.81
Blake Scholl tech_entrepreneurship 7.3 6 7 7 6.7 6.7 6 6 6.79
Marc Andreessen tech_entrepreneurship 7.3 6 6 6.7 7.7 6.7 6.3 6 6.75
Sam Altman tech_entrepreneurship 7.3 6 6 6.7 7.3 7 6 6.3 6.75
Brian Armstrong tech_entrepreneurship 7 5.7 6 6.7 7.7 6.7 6 6.3 6.64
Patrick Collison tech_entrepreneurship 7 6 6 6.3 6.7 6 6 6 6.35
David Rubenstein tech_entrepreneurship 7 4.7 4.3 6 7 6 6 6 5.88

Appendix A.3: Mistral Large scores (n = 99)

Guest Cell Abs Cmp Ori CC EC GSM Voc Syn VRI
Daron Acemoglu economics 8 7 8 9 8 8 8 7 8.00
Russ Roberts economics 8 7 8 8 8 7.7 8 7 7.79
Nassim Nicholas Taleb economics 8 7 8.3 8 7.7 7.3 9 7 7.73
Alain Bertaud economics 8 7.3 8 8.3 7.3 7.3 8 7 7.72
Abhijit Banerjee economics 8 7 7.3 8 8.3 7.3 8 7 7.69
Paul Krugman economics 8 7 7.7 8 8 7 8 7 7.63
Larry Summers economics 8 7.3 7 8 8 7 8 7 7.57
Raj Chetty economics 8 7 7 8 8 7 8 7 7.52
Simon Johnson economics 7.7 6.7 7 8 7.7 7 8 7 7.35
Alan Taylor economics 7 6 6 7.3 7 6.3 7.3 6.7 6.63
Michelle Dawson hard_science 9 8 9 9 9 8.3 9 8 8.73
David Deutsch hard_science 9 8 9 9 9 8 9 8 8.68
Alison Gopnik hard_science 9 8 9 9 8.3 8 8.7 7.7 8.56
Steven Pinker hard_science 9 8 8 9 8.7 7.7 9 8 8.41
Ed Boyden hard_science 8 7 9 8.3 8 8 8 7 8.05
Philip Ball hard_science 8 7 8 8.3 8.7 8 8 7 8.01
Michael Nielsen hard_science 8 7 8 8 8.3 8 8 7 7.90
Paul Bloom hard_science 8 7 8 8 8 8 8 7 7.84
Jess Wade hard_science 8 7 8 8.7 8 7.3 8.3 7.3 7.84
Ezekiel Emanuel hard_science 8 7 8 8 7.3 7 8 7 7.56
Atul Gawande hard_science 8 7 8 8 7 7.3 8 7 7.55
Helen Castor history 8 7 8 9 8 8 9 8 8.00
Ada Palmer history 8 7 8.3 8.7 8 7.7 9 8 7.95
Roy Foster history 8 7 8 9 8 7.7 9 8 7.95
Reza Aslan history 8 7 8 9 8 7.3 9 8 7.89
Diarmaid MacCulloch history 8 7 8 9 8 7 9 8 7.84
Adam Tooze history 8 7 8 8 8 7 9 8 7.68
Paul Gillingham history 8 7 8 8.3 7.7 7 8.3 7.3 7.67
Jill Lepore history 8 7 8 8 7 7 8 7 7.50
Niall Ferguson history 8 7 8 8 7 7 8.7 7.7 7.50
Jennifer Burns history 7 6 7 8 7 7 7 7 7.00
Patricia Fara history 7 6 7 8 7 6 8 7 6.84
Nate Silver journalism_public 8 7 8 8 8 8 8 7 7.84
Larissa MacFarquhar journalism_public 8 7 8 8.3 8 7.7 8 7 7.84
Ezra Klein journalism_public 8 7 8 8 8.3 7.3 8 7 7.79
Andrew Sullivan journalism_public 8 7 8 8 7.7 8 8.3 7.3 7.78
Annie Jacobsen journalism_public 8 7 8 8.7 7.7 7.3 8.3 7 7.78
Barkha Dutt journalism_public 8 7 7.3 8.3 8 7.3 8.3 7.3 7.68
Malcolm Gladwell journalism_public 8 7 8 8 7 8 8 7 7.66
Ben Thompson journalism_public 8 7 7.7 8.3 7.7 7 8 7 7.62
David Brooks journalism_public 8 7 8 8 7.3 7.3 8 7 7.61
Andrew Ross Sorkin journalism_public 7 6 7 8 7 7.3 7.7 7 7.05
Ben Westhoff journalism_public 7 6 7 8 7 7 7 7 7.00
Cass Sunstein law_policy 8 7 8 8.3 7.7 8 8 7 7.83
Jennifer Pahlka law_policy 8 7 8 8.7 7.3 8 8 7 7.83
Leopoldo Lopez law_policy 8 7 8 8.3 8 7.3 8 7 7.79
Samantha Power law_policy 8 7 8 8 8 7.3 8 7 7.73
Ben Sasse law_policy 8 7 8 8.3 7.7 7.3 8 7 7.73
Bruno Macaes law_policy 8 7 8 8 7 7.3 8 7 7.55
Jamal Greene law_policy 8 7 7 8 8 7 8 7 7.52
Rachel Harmon law_policy 8 7 7 8 8 7 7.7 7 7.52
Tom Tugendhat law_policy 7.3 6.7 7 8 7.7 7 8 7.3 7.29
Stanley McChrystal law_policy 7.3 6.3 7 8 7.3 7 7 6.7 7.17
John O. Brennan law_policy 7 6 6 7 7.3 7 8 7 6.74
Brian Koppelman lit_arts 8 7 8 8.7 8 8 8 7 7.95
Camille Paglia lit_arts 8.3 7.3 9 8.3 7.3 7.3 9 8 7.94
Andy Weir lit_arts 8 7 8 8.7 7.7 7.3 8 7 7.78
Margaret Atwood lit_arts 8 7 8 8 8 7 8.3 7.3 7.68
Dana Gioia lit_arts 8 7 8 8.3 7.7 7 8.7 7.7 7.67
Alex Ross lit_arts 8 7 8 8 7.7 7 8.7 7.3 7.62
Benjamin Moser lit_arts 8 7 8 8 7 7 8.3 7.3 7.50
Cynthia Haven lit_arts 8 7 8 8 7 7 8 7 7.50
Fuchsia Dunlop lit_arts 7.3 6.3 7.3 8.3 7.3 7 8.3 7 7.28
Emily St John Mandel lit_arts 7 6 7 7.7 7 7 7 7 6.95
Ana Vidovic lit_arts 7 6 6 7 7 7 7 7 6.68
Elijah Millgram philosophy 9 8 9 9 9 9 9 8 8.84
John Gray philosophy 9 8 9 9 9 8 9 8 8.68
David Bentley Hart philosophy 9 8 8.7 9 9 8 9 8 8.63
Amia Srinivasan philosophy 9 8 9 9 8.3 8 9 8 8.56
Noam Chomsky philosophy 9 8 9 9 8 8 9 8 8.50
Rebecca Kukla philosophy 8.7 7.7 8.7 8.7 7.7 8 8.7 7.7 8.22
Kwame Anthony Appiah philosophy 8 7 8 8.3 9 8 8.3 7.3 8.07
William MacAskill philosophy 8 7 8 8.3 8 8 8 7 7.89
Rabbi David Wolpe philosophy 8 7 8 8 8 7 8 7 7.68
Peter Singer philosophy 8 7 7.3 8.3 8 7.3 8 7 7.68
Slavoj Zizek philosophy 8 7 8.3 8 7 7.7 8.7 7 7.66
Agnes Callard philosophy 8 7 8 8 7.7 7 8 7 7.62
Daniel Kahneman social_science 8.3 7.3 8.3 8.3 9 8 8.3 7.3 8.24
Arthur Brooks social_science 8.7 7.7 8 9 8 7.7 9 8 8.17
Harvey Mansfield social_science 8.3 7.3 8 9 7.3 8 8.3 7.3 7.99
Philip E Tetlock social_science 8 7 8 8.7 8.3 7.7 8 7 7.95
Daniel Carpenter social_science 8 7 7.7 8 8.3 7.7 8.3 7 7.79
Jonathan Haidt social_science 8 7 8 8.3 8 7 8 7 7.73
Henry Farrell social_science 8 7 8 8 7.7 7.7 8.3 7.3 7.73
Chris Blattman social_science 8 7 7.3 8 8.3 7.3 8 7 7.69
Joseph Henrich social_science 8 7 8 8.3 7.3 7 8 7 7.61
Ashley Mears social_science 7.7 6.7 7.7 8 7 7 8 7 7.33
Eric Kaufmann social_science 7.3 6.3 7 8 7.3 7 7.7 6.7 7.17
Audrey Tang tech_entrepreneurship 8.3 7.7 8.7 8.7 8 8.3 9 8 8.27
Blake Scholl tech_entrepreneurship 8 7.7 9 9 8 8 8 7 8.27
Marc Andreessen tech_entrepreneurship 8 7 8 9 8 8 8 7 8.00
Vitalik Buterin tech_entrepreneurship 8.3 7.3 8 8.3 8 7.3 8.3 7.3 7.90
Sam Altman tech_entrepreneurship 8 7 8 8.3 7.7 7.3 8 7 7.73
Patrick Collison tech_entrepreneurship 8 7 8 8.7 7.7 7 8 7 7.73
Chris Dixon tech_entrepreneurship 8 7 8 8.3 7.3 7 8 7 7.61
Daniel Gross tech_entrepreneurship 8 7 8 8 7 7.3 8 7 7.55
Balaji Srinivasan tech_entrepreneurship 8 7 8 8 7 7 8 7 7.50
Brian Armstrong tech_entrepreneurship 8 7 8 8 7 7 7.3 7 7.50
David Rubenstein tech_entrepreneurship 7 6 6.3 7.3 7 7 7.7 7 6.79