A research paper from expressivecognition.org
Conflict of Interest Statement
This research was conducted by the developer of the Expressive Cognition assessment tool (expressivecognition.org), a freely accessible assessment with an optional paid report tier. This relationship is disclosed in the interest of full transparency. No external funding was received for this research.
AI Usage Statement
This study employs Claude Sonnet 4 (Anthropic), GPT-5 mini (OpenAI), and Mistral Large (Mistral AI) as automated scoring agents within a predefined behavioral rubric and a blinded multi-pass evaluation protocol. All three models function as measurement instruments: they apply the rubric to speech transcripts and generate dimension-level scores under controlled prompting conditions. Claude was also used for editorial assistance in the preparation of this manuscript. The theoretical framework, research design, analyses, and all interpretive conclusions are the work of the Expressive Cognition research program. Full responsibility for the accuracy, integrity, and originality of the manuscript rests with the project.
Abstract
This study establishes normative Verbal Reasoning Index (VRI) scores for 99 guests from Conversations with Tyler, a long-form intellectual interview podcast, scored across six core reasoning dimensions and two moderators by three independent large language models from three different vendors (Claude Sonnet 4, GPT-5 mini, and Mistral Large) in three blinded passes each. The resulting corpus — balanced across nine disciplinary cells with 10–12 speakers each — constitutes the largest labeled spontaneous verbal reasoning dataset in healthy adults currently available. Pairwise cross-model VRI agreement averages r = .668 across the three vendors (Sonnet↔GPT-5m r = .758, Sonnet↔Mistral r = .657, GPT-5m↔Mistral r = .589), with systematic calibration offsets that form a strict generosity gradient: Mistral scores highest on average, Sonnet intermediate, GPT-5 mini lowest. Discipline rank ordering is preserved across all three models: philosophy and hard science at the top, literary arts at the bottom. Inter-pass reliability for Sonnet is excellent (mean VRI spread across three passes = 0.19). Confirmatory factor analyses estimated independently on each scorer's data reject a unidimensional model and recover an identical two-factor structure under all three scorers — Abstraction, Compression, and Originality clustering as Generative Range, and Epistemic Calibration and Generative Self-Monitoring clustering as Calibrative Control — with factor composition stable across models and factor separation (φ = .38 Sonnet → .77 GPT-5 mini → .89 Mistral) varying systematically with each scorer's within-factor correlation pattern. Each of the three models selects a different best-fitting home for Conceptual Continuity, identifying it empirically as a boundary dimension whose factorial placement is scorer-convention-determined.
The finding that three independently prompted frontier LLMs reproduce the same disciplinary hierarchy and the same factor composition — despite having no access to speaker identity, discipline, or each other's scores — provides convergent validity evidence for the EC rubric's capacity to detect real differences in spontaneous verbal reasoning across intellectual domains.
Keywords: verbal reasoning, normative data, LLM scoring, cross-model agreement, spontaneous speech, podcast discourse, Conversations with Tyler
Introduction
The Expressive Cognition (EC) rubric scores spontaneous speech across six core dimensions of verbal reasoning — Abstraction, Compression, Originality, Conceptual Continuity, Epistemic Calibration, and Generative Self-Monitoring — plus two moderator dimensions (Vocabulary and Syntactic Control) that are reported but excluded from the composite Verbal Reasoning Index (VRI). Prior work has established construct validity for the rubric in known-groups designs using Supreme Court oral arguments and academic seminar speech (companion paper), and ecological validity in a 30-guest subset of Conversations with Tyler (CWT) guests correlated against external intellectual reputation (companion paper).
The present study extends this work in two directions. First, it expands the CWT sample from 30 to 99 guests balanced across nine disciplinary cells, producing normative data that allows VRI scores to be interpreted relative to a reference population of high-ability conversational speakers. Second, it introduces cross-model scoring — the same 99 transcripts scored independently by three frontier LLMs from three different vendors (Anthropic, OpenAI, and Mistral AI) — providing the first published inter-model reliability data for an LLM-applied psychometric rubric at this scale, and the first three-way factor-invariance test of the underlying construct.
Method
Sample
Ninety-nine guests from Conversations with Tyler were selected using a purposive sampling design stratified across nine disciplinary cells. A sampling script enumerated all CWT guests from the public episode index, assigned discipline tags, and selected approximately 11 guests per cell, force-including 30 guests from the prior ecological validity study. One joint-guest episode (Noel Johnson and Mark Koyama) was excluded because the rubric assumes a single speaker. The final sample comprised 99 unique speakers.
Table 1. Sample composition by discipline cell.
| Cell | n | Example guests |
|---|---|---|
| Philosophy | 12 | Agnes Callard, Slavoj Žižek, Noam Chomsky, Peter Singer |
| Economics | 10 | Daron Acemoglu, Larry Summers |
| Hard Science | 10 | Alison Gopnik, David Deutsch, Steven Pinker, Ed Boyden |
| Social Science | 11 | Daniel Kahneman, Jonathan Haidt, Philip Tetlock |
| History | 11 | Niall Ferguson, Jill Lepore, Ada Palmer |
| Law/Policy | 11 | Cass Sunstein, Samantha Power, Jamal Greene |
| Lit/Arts | 11 | Margaret Atwood, Camille Paglia, Dana Gioia |
| Tech/Entrepreneurship | 11 | Vitalik Buterin, Sam Altman, Marc Andreessen |
| Journalism/Public | 11 | Malcolm Gladwell, Ezra Klein, Nate Silver |
Transcript Extraction and Screening
For each guest, the full CWT transcript was fetched from the public CWT website and processed through a three-stage screening protocol designed to isolate spontaneous reasoning from rehearsed or recited material.
Stage 1 — Pre-filtering. Host speech was stripped; only guest turns were retained. Turns below a minimum word threshold were excluded.
Stage 2 — Spontaneity screening. Each remaining turn was evaluated by Claude Sonnet 4 for spontaneity. Turns classified as rehearsed set-pieces, recitations, memorized factual lists, or pre-drafted statements were excluded. The screener operated blind to the EC scoring rubric.
Stage 3 — Inclusion threshold. Guests were included only if their screened transcript contained ≥1,500 words across ≥8 retained turns. All 99 candidates passed this threshold. Median screened transcript length was 7,500 words (range: 1,515–11,267).
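The Stage 3 inclusion check reduces to a simple filter; a minimal sketch follows (function and variable names are illustrative, not the production pipeline; only the thresholds come from the protocol above):

```python
# Stage 3 inclusion check: >= 1,500 words across >= 8 retained turns.
MIN_WORDS = 1500
MIN_TURNS = 8

def passes_inclusion(turns):
    """turns: guest-turn strings that survived Stage 1-2 screening."""
    retained = [t for t in turns if t.strip()]
    word_count = sum(len(t.split()) for t in retained)
    return len(retained) >= MIN_TURNS and word_count >= MIN_WORDS
```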
Scoring
Three scoring runs were conducted independently on the same 99 transcripts. All three used identical prompts, rubric text, dimension definitions, band descriptors, JSON output schema, temperature (0.3), batch structure, and three-pass shuffled-blinding protocol. The only thing that varied was the scoring model itself.
Sonnet 4 scoring. Claude Sonnet 4 (claude-sonnet-4-20250514, Anthropic) scored each transcript blinded — identified only as "Speaker A," "Speaker B," etc. — using the full v3 EC behavioral descriptor rubric. Speaker labels were randomized independently on each of three scoring passes. Total scoring passes: 297 (99 guests × 3 passes). Guests were scored in shuffled batches of 6 to enable cross-guest blinding within each batch.
GPT-5 mini scoring. GPT-5 mini (OpenAI) scored the same transcripts using the same rubric and protocol. Total scoring passes: 294 (98 guests × 3 passes; one guest was excluded due to a scoring pipeline error).
Mistral Large scoring. Mistral Large (mistral-large-latest, Mistral AI) scored the same transcripts via the La Plateforme REST API using the same rubric, protocol, and schema. Total scoring passes: 297 (99 guests × 3 passes). All 99 guests produced valid scores.
The three scoring runs were completely independent: no model had access to any other's scores, and no post-hoc calibration was applied. Cross-model analyses that require a common sample use the 98 guests valid under all three scorers.
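The three-pass shuffled-blinding protocol described above can be sketched as follows (a simplified illustration; identifiers and the seeding scheme are assumptions, not the production code):

```python
import random

def make_pass_labels(guest_ids, batch_size=6, n_passes=3, seed=0):
    """For each pass, shuffle guests into batches of `batch_size` and
    relabel them 'Speaker A', 'Speaker B', ... independently, so labels
    carry no identity information within or across passes."""
    rng = random.Random(seed)
    passes = []
    for _ in range(n_passes):
        order = list(guest_ids)
        rng.shuffle(order)
        labels = {}
        for start in range(0, len(order), batch_size):
            for i, gid in enumerate(order[start:start + batch_size]):
                labels[gid] = f"Speaker {chr(ord('A') + i)}"
        passes.append(labels)
    return passes
```

Because each pass reshuffles independently, the same guest receives a different blinded label (and different batch-mates) on each pass.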
Measures
Verbal Reasoning Index (VRI). A weighted composite of six core dimensions: Abstraction (.18), Compression (.16), Originality (.16), Conceptual Continuity (.16), Epistemic Calibration (.18), and Generative Self-Monitoring (.16). Weights reflect the theoretical priority of Abstraction and Epistemic Calibration as the dimensions most closely linked to the Gf-dominant construct EC targets.
Moderator dimensions. Vocabulary and Syntactic Control are scored but excluded from the VRI composite. They capture Gc-linked linguistic competence that correlates with education and language background rather than with the reasoning construct.
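The VRI composite defined above reduces to a weighted sum; a minimal sketch, with illustrative dictionary keys:

```python
# Core-dimension weights as specified in the Measures section (sum to 1).
VRI_WEIGHTS = {
    "abstraction": 0.18,
    "compression": 0.16,
    "originality": 0.16,
    "conceptual_continuity": 0.16,
    "epistemic_calibration": 0.18,
    "generative_self_monitoring": 0.16,
}

def vri(scores):
    """Weighted composite of the six core dimension scores.
    Moderator dimensions (Vocabulary, Syntactic Control) are excluded."""
    assert abs(sum(VRI_WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 1
    return sum(w * scores[d] for d, w in VRI_WEIGHTS.items())
```

Since the weights sum to 1, a speaker scoring 7.0 on every core dimension receives a VRI of exactly 7.0.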
Factor-structure models. To test whether the six core dimensions reflect a single general factor or a more differentiated structure, we ran confirmatory factor analyses independently on each scoring model's data. Four nested models were estimated by maximum likelihood on the 6×6 dimension-level correlation matrix (three-pass mean per dimension): a one-factor baseline (M1) and three two-factor specifications that differ in where Conceptual Continuity is assigned (M2a: on Generative Range only; M2b: on Calibrative Control only; M2c: cross-loading both). Fit was evaluated by χ², RMSEA, CFI, and SRMR against conventional cutoffs, with AIC used for model selection. The same nested-model comparison was run on Sonnet, GPT-5 mini, and Mistral Large data separately, yielding three independent CFA results that can be compared for structural replication. Fit computations were implemented directly (see scripts/cwt-norms/cfa.mjs) rather than via an external SEM package to keep the full analysis pipeline reproducible from a single repository.
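The fit indices follow standard maximum-likelihood formulas; a sketch is below. The AIC convention shown (χ² − 2·df) is an assumption, chosen because it reproduces the AIC values reported in the Results:

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation from the ML chi-square."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2, df, chi2_base, df_base):
    """Comparative fit index against the independence (baseline) model."""
    d_model = max(chi2 - df, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 - d_model / d_base if d_base > 0 else 1.0

def aic_chisq(chi2, df):
    """AIC on the chi-square deviance scale (chi2 - 2*df); lower is better."""
    return chi2 - 2 * df
```

For example, Sonnet's M2a fit (χ² = 35.8, df = 14, n = 99) gives RMSEA ≈ .126, and its M2c fit (χ² = 21.1, df = 13) gives AIC = −4.9 under this convention, matching the reported values.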
Results
Overall Descriptive Statistics
Table 2. Overall VRI descriptive statistics by model.
| Statistic | Sonnet 4 | GPT-5 mini | Mistral Large |
|---|---|---|---|
| n | 98 | 98 | 98 |
| Mean VRI | 7.01 | 6.67 | 7.74 |
| SD | 0.45 | 0.34 | 0.44 |
| Min | 5.25 | 5.25 | 6.63 |
| Max | 7.68 | 7.35 | 8.84 |
| Median | 7.02 | 6.70 | 7.73 |
The three scoring models produce a strict generosity gradient: Mistral scores highest on average (mean VRI 7.74), Sonnet is intermediate (7.01), and GPT-5 mini scores lowest (6.67). The difference between the most-generous and least-generous model is more than one full scale point on VRI (1.07). Mistral and Sonnet have similar distributional spread (SD = 0.44 and 0.45, respectively), while GPT-5 mini compresses the range somewhat (SD = 0.34). Despite these absolute-level calibration differences, the models agree on who occupies the extremes: Ana Vidovic sits at the bottom under both Sonnet and GPT-5 mini, and philosophy and hard-science guests occupy the top under all three models.
Discipline Cell Means
Table 3. Mean VRI by discipline cell, all three models.
| Cell | n | Sonnet 4 | GPT-5 mini | Mistral Large |
|---|---|---|---|---|
| Philosophy | 12 | 7.44 | 6.88 | 8.17 |
| Hard Science | 10 | 7.42 | 6.89 | 8.16 |
| Social Science | 11 | 7.08 | 6.80 | 7.77 |
| Tech/Entrepreneurship | 11 | 6.95 | 6.72 | 7.71 |
| History | 11 | 7.11 | 6.76 | 7.62 |
| Journalism/Public | 11 | 6.70 | 6.60 | 7.61 |
| Economics | 10 | 6.97 | 6.65 | 7.56 |
| Law/Policy | 11 | 6.78 | 6.49 | 7.52 |
| Lit/Arts | 11 | 6.61 | 6.27 | 7.50 |
All three models preserve the top-two and bottom-one discipline ranking: philosophy and hard science are the highest-scoring cells on every scorer, and literary arts is the lowest on every scorer. The ordering of intermediate cells shifts somewhat between scorers — Mistral places social science and tech higher than history, while Sonnet places history above both — but the differences within the middle band are small and generally within the standard error of the cell means. The discipline rank ordering of cells is substantially stable across all three scoring models even though their absolute level calibrations differ by more than a full scale point.
Dimension-Level Cell Profiles
Table 4. Sonnet 4 mean dimension scores by discipline cell.
| Cell | Abs | Cmp | Ori | CC | EC | GSM | Voc | Syn |
|---|---|---|---|---|---|---|---|---|
| Philosophy | 8.00 | 7.00 | 7.39 | 7.75 | 7.75 | 6.67 | 8.03 | 6.97 |
| Hard Science | 7.70 | 6.70 | 7.53 | 7.87 | 7.90 | 6.73 | 7.97 | 6.97 |
| Social Science | 7.52 | 6.52 | 7.03 | 7.64 | 7.52 | 6.18 | 7.79 | 6.79 |
| History | 7.42 | 6.42 | 7.24 | 7.94 | 7.39 | 6.18 | 7.85 | 6.91 |
| Tech/Entrepreneurship | 7.24 | 6.27 | 7.36 | 7.48 | 7.12 | 6.15 | 7.39 | 6.39 |
| Economics | 7.27 | 6.33 | 6.80 | 7.43 | 7.57 | 6.30 | 7.53 | 6.50 |
| Law/Policy | 7.21 | 6.09 | 6.52 | 7.39 | 7.27 | 6.06 | 7.42 | 6.48 |
| Journalism/Public | 6.82 | 5.76 | 6.73 | 7.52 | 7.15 | 6.15 | 7.21 | 6.27 |
| Lit/Arts | 6.94 | 5.85 | 7.00 | 7.12 | 6.82 | 5.88 | 7.45 | 6.48 |
Notable discipline-specific patterns:
- Philosophy achieves the maximum mean Abstraction (8.00) — every philosopher in the sample operates at the "Principled" band. This is a ceiling effect consistent with the prior ecological validity study.
- Hard Science leads on Epistemic Calibration (7.90) and Originality (7.53), reflecting the epistemic marking and novel-framing demands of scientific discourse.
- History leads on Conceptual Continuity (7.94), consistent with the narrative coherence demands of historical analysis.
- Lit/Arts scores lowest on five of six core dimensions and lowest on VRI. This is interpreted as a construct-appropriate finding: the EC rubric measures analytical verbal reasoning, not narrative or associative reasoning. Literary discourse deploys different cognitive operations than the analytical register the rubric targets.
Cross-Model Agreement
Table 5. Pairwise cross-model agreement statistics (Pearson r), n = 98 common sample.
| Dimension | Sonnet↔GPT-5m | Sonnet↔Mistral | GPT-5m↔Mistral |
|---|---|---|---|
| Abstraction | .643 | .642 | .606 |
| Compression | .397 | .611 | .272 |
| Originality | .823 | .718 | .726 |
| Conceptual Continuity | .556 | .460 | .501 |
| Epistemic Calibration | .666 | .497 | .368 |
| Generative Self-Monitoring | .518 | .355 | .391 |
| Vocabulary | .620 | .640 | .656 |
| Syntax | .513 | .514 | .496 |
| VRI | .758 | .657 | .589 |
| VRI Spearman ρ | .721 | .595 | .536 |
Table 5b. Mean calibration offsets on VRI (higher-scoring model − lower-scoring model).
| Pair | Offset (points) | Direction |
|---|---|---|
| Sonnet − GPT-5 mini | +0.33 | Sonnet more generous |
| Mistral − Sonnet | +0.73 | Mistral more generous |
| Mistral − GPT-5 mini | +1.06 | Mistral more generous |
The three models form a strict generosity gradient on VRI: Mistral > Sonnet > GPT-5 mini, with a 1.06-point total spread from the most-generous to the least-generous scorer. Pairwise agreement is highest between Sonnet and GPT-5 mini (r = .758) and lowest between GPT-5 mini and Mistral (r = .589), with Sonnet↔Mistral intermediate (r = .657). The mean pairwise r across the three vendors is .668. All three pairwise correlations are substantial and well above chance, indicating that despite the absolute-level calibration differences, the three models substantially agree on the rank ordering of speakers.
At the dimension level, Originality shows the highest pairwise agreement for the Sonnet↔GPT-5m pair (r = .823), and also high agreement for the Sonnet↔Mistral and GPT-5m↔Mistral pairs (r = .718 and .726 respectively). Compression is the dimension with the most scorer-specific disagreement: Sonnet and Mistral agree moderately on Compression (r = .611), but GPT-5 mini diverges from both (r = .397 with Sonnet and r = .272 with Mistral). This suggests that GPT-5 mini operationalizes propositional density somewhat differently from the other two scorers, which may reflect prompt-interpretation sensitivity on that dimension. Epistemic Calibration and Generative Self-Monitoring also show lower pairwise agreement with Mistral than with the Sonnet↔GPT-5m pair, consistent with Mistral's broader interpretation of what counts as evidence of real-time epistemic marking and self-revision.
The Mistral–Sonnet–GPT-5 mini generosity gradient is observable on every dimension except Syntactic Control. Mistral scores substantially higher than Sonnet on Abstraction (+0.65), Compression (+0.69), Originality (+0.81), Conceptual Continuity (+0.72), and Vocabulary (+0.56), and higher than GPT-5 mini on the same dimensions by even larger margins. The largest single inter-model dimension offset is the Mistral–GPT-5 mini gap on Originality (+1.80 points), reflecting the fact that Mistral credits far more reframings as genuinely novel than GPT-5 mini does.
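The pairwise statistics in Tables 5 and 5b reduce to Pearson correlations and mean score differences over the 98-guest common sample; a minimal sketch:

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation between two paired score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_offset(x, y):
    """Mean calibration offset (x minus y), as in Table 5b."""
    return statistics.fmean(a - b for a, b in zip(x, y))
```

Applied per dimension and per model pair over the common sample, these two functions generate every cell in Tables 5 and 5b.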
Inter-Pass Reliability
Table 6. Inter-pass reliability (Sonnet 4): mean spread across three passes per dimension.
| Dimension | Mean Spread | SD |
|---|---|---|
| Abstraction | 0.09 | 0.29 |
| Compression | 0.15 | 0.44 |
| Originality | 0.18 | 0.39 |
| Conceptual Continuity | 0.29 | 0.56 |
| Epistemic Calibration | 0.29 | 0.50 |
| Gen. Self-Monitoring | 0.38 | 0.49 |
| Vocabulary | 0.15 | 0.36 |
| Syntax | 0.16 | 0.37 |
| VRI | 0.19 | 0.23 |
Inter-pass reliability is excellent. Mean VRI spread across three passes is 0.19 points — the same speaker scored three times by the same model under different blinding labels produces VRI scores that differ by less than two-tenths of a scale point on average. Abstraction is the most stable dimension (mean spread 0.09), and Generative Self-Monitoring is the least stable (0.38), consistent with the finding that GSM is more context-sensitive than other dimensions.
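A sketch of the inter-pass statistic, under the assumption (not stated explicitly above) that "spread" is the max − min range across a guest's three passes, averaged over guests:

```python
def mean_spread(per_guest_passes):
    """per_guest_passes: list of per-guest score triples, one entry per pass.
    Returns the mean of each guest's (max - min) range across passes."""
    spreads = [max(p) - min(p) for p in per_guest_passes]
    return sum(spreads) / len(spreads)
```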
Factor Structure: Confirmatory Factor Analysis
To test whether the six core EC dimensions reflect a single general verbal-reasoning factor or a more differentiated structure, we ran a confirmatory factor analysis on each model's scoring data independently. Four nested models were estimated by maximum likelihood on the dimension-level correlation matrix (n = 99 guests for Sonnet and Mistral, n = 98 for GPT-5 mini; each guest entered as the three-pass mean on each dimension):
- M1 (1-factor): All six core dimensions load on a single general Verbal Reasoning Capacity factor.
- M2a (2-factor, Cont → GR): Generative Range (GR) = Abstraction, Compression, Originality, Conceptual Continuity; Calibrative Control (CC) = Epistemic Calibration, Generative Self-Monitoring.
- M2b (2-factor, Cont → CC): GR = Abstraction, Compression, Originality; CC = Conceptual Continuity, Epistemic Calibration, Generative Self-Monitoring.
- M2c (2-factor, Cont cross-loads): As M2a/M2b, but Conceptual Continuity freely loads on both factors.
Fit indices reported follow standard cutoffs: RMSEA < .08 acceptable, < .06 excellent; CFI > .90 acceptable, > .95 excellent; SRMR < .08 acceptable.
Table 7. Confirmatory factor analysis fit indices by scoring model.
| Model | Sonnet χ² (df) | Sonnet RMSEA | Sonnet CFI | GPT-5m χ² (df) | GPT-5m RMSEA | GPT-5m CFI | Mistral χ² (df) | Mistral RMSEA | Mistral CFI |
|---|---|---|---|---|---|---|---|---|---|
| M1 (1-factor) | 135.6 (15) | .286 | .712 | 30.7 (15) | .103 | .918 | 41.3 (15) | .134 | .959 |
| M2a (Cont → GR) | 35.8 (14) | .126 | .948 | **18.8 (14)** | **.059** | **.975** | 32.2 (14) | .115 | .971 |
| M2b (Cont → CC) | 34.7 (14) | .123 | .951 | 30.1 (14) | .108 | .916 | **27.9 (14)** | **.101** | **.978** |
| M2c (Cont cross-loads) | **21.1 (13)** | **.080** | **.981** | 18.8 (13) | .067 | .970 | 26.1 (13) | .101 | .979 |
Best-fitting model per scorer shown in bold. Winning models by AIC: Sonnet → M2c (AIC = -4.9); GPT-5 mini → M2a (AIC = -9.2); Mistral → M2b (AIC = -0.1). Each of the three scorers selects a different best-fitting specification, and the difference among them lies entirely in where Conceptual Continuity belongs within the two-factor structure. Every scorer rejects the one-factor model in favor of a two-factor solution, but the three scorers disagree on Continuity's factorial home.
Table 8. Standardized factor loadings under each model's best-fitting CFA solution.
| Dimension | Sonnet (M2c) GR | Sonnet (M2c) CC | GPT-5m (M2a) GR | GPT-5m (M2a) CC | Mistral (M2b) GR | Mistral (M2b) CC |
|---|---|---|---|---|---|---|
| Abstraction | .921 | — | .707 | — | .988 | — |
| Compression | 1.000 | — | .679 | — | .993 | — |
| Originality | .680 | — | .562 | — | .833 | — |
| Conceptual Continuity | .342 | .357 | .765 | — | — | .798 |
| Epistemic Calibration | — | .824 | — | .858 | — | .730 |
| Gen. Self-Monitoring | — | 1.000 | — | .700 | — | .742 |
| Factor correlation (φ) | .384 | | .766 | | .886 | |
Note: φ is the GR↔CC factor correlation under each scorer's best-fitting solution (one value per scorer).
Three scorers produce three different best-fitting placements for Conceptual Continuity: Sonnet cross-loads it on both factors, GPT-5 mini assigns it to Generative Range only, and Mistral assigns it to Calibrative Control only. The pattern is as consistent as it is striking: Continuity is the only dimension whose factorial home varies across scorers, and each scorer selects a different home for it. Abstraction, Compression, and Originality cluster as Generative Range under every scorer; Epistemic Calibration and Generative Self-Monitoring cluster as Calibrative Control under every scorer. No scorer assigns any of these five dimensions differently.
Four findings are substantively important across the three scoring models:
1. A one-factor model is rejected by every scorer. No model is consistent with a unidimensional general-verbal-reasoning factor. Sonnet rejects M1 decisively (RMSEA = .286, CFI = .712). GPT-5 mini rejects it more mildly but still clearly (RMSEA = .103, CFI = .918; Δχ²(1) M1 vs. M2a = 11.91, p = .0006). Mistral also rejects it in the direction of a two-factor solution (RMSEA = .134, CFI = .959; Δχ²(1) M1 vs. M2b = 13.32, p = .0003). Under all three scorers, the two-factor structure fits significantly better than the unidimensional alternative.
2. Factor composition is invariant across scorers. All three models independently cluster Abstraction, Compression, and Originality as one factor (Generative Range) and Epistemic Calibration and Generative Self-Monitoring as the other (Calibrative Control). No scorer assigns any of these five dimensions differently. The empirical clustering of the five core dimensions is model-invariant across three independent LLMs from three different vendors.
3. Factor separation varies monotonically with scorer calibration. The factor correlation φ — how distinct GR and CC are from one another — rises monotonically from Sonnet (φ = .384, well-separated factors) through GPT-5 mini (φ = .766) to Mistral (φ = .886, nearly-merging factors). The raw correlation matrices clarify the source: the three scorers differ dramatically in their within-factor correlations. Sonnet's Abs↔Comp correlation is .921; Mistral's is .981 (nearly rank-degenerate); GPT-5 mini's is .471. The scorers with tighter within-factor correlations produce factors that are also more correlated with one another — because when every dimension inside a factor is near-identical, any cross-factor signal dominates the residual. This is a calibration phenomenon, not a structural disagreement about what the factors are.
4. Conceptual Continuity is empirically a boundary dimension with three different best-fitting homes. Sonnet's best-fitting solution has Continuity cross-loading both factors (.342 on GR, .357 on CC). GPT-5 mini's best-fitting solution (M2a) places Continuity entirely on GR (.765), with any CC cross-loading collapsing to essentially zero (−0.014) when estimated. Mistral's best-fitting solution (M2b) places Continuity entirely on CC (.798), with no GR loading. Three frontier LLMs, three distinct homes for the same dimension. This is the strongest empirical evidence to date that Conceptual Continuity is a boundary dimension whose factorial placement is scorer-convention-determined rather than construct-determined. It is therefore excluded from both the reported Generative Range and Calibrative Control subscores in production, retaining only the five unambiguously loading dimensions (Abstraction, Compression, Originality for GR; Epistemic Calibration, Generative Self-Monitoring for CC). The production decision is not a theoretical choice; it is what the three-way CFA comparison forces.
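The Δχ² tests cited in point 1 are likelihood-ratio tests for nested models with Δdf = 1; a sketch using the closed form of the chi-square survival function at one degree of freedom (p = erfc(√(Δχ²/2))):

```python
import math

def delta_chi2_p(chi2_restricted, chi2_full):
    """p-value for a nested-model likelihood-ratio test with delta-df = 1.
    For df = 1 the chi-square survival function is erfc(sqrt(x / 2))."""
    delta = chi2_restricted - chi2_full
    return math.erfc(math.sqrt(delta / 2.0))
```

Applied to GPT-5 mini's M1 vs. M2a comparison (χ² = 30.7 vs. 18.8 from Table 7), this yields p ≈ .0006, consistent with the reported value.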
Individual Speaker Results
Table 9. Top 10 highest VRI guests under Sonnet 4, with matched scores from the other two models.
| Rank | Guest | Cell | Sonnet | GPT-5m | Mistral |
|---|---|---|---|---|---|
| 1 | Rebecca Kukla | Philosophy | 7.68 | 6.97 | 8.22 |
| 1 | Ed Boyden | Hard Science | 7.68 | 7.35 | 8.05 |
| 1 | Michelle Dawson | Hard Science | 7.68 | 7.13 | 8.73 |
| 4 | Agnes Callard | Philosophy | 7.63 | 7.03 | 7.62 |
| 4 | Henry Farrell | Social Science | 7.63 | 7.09 | 7.73 |
| 4 | Cass Sunstein | Law/Policy | 7.63 | 6.74 | 7.83 |
| 7 | Alison Gopnik | Hard Science | 7.57 | 7.09 | 8.56 |
| 7 | David Deutsch | Hard Science | 7.57 | 7.03 | 8.68 |
| 7 | Vitalik Buterin | Tech | 7.57 | 7.20 | 7.90 |
| 10 | Russ Roberts | Economics | 7.52 | 6.63 | 7.79 |
Table 10. Top 10 largest Sonnet↔GPT-5m VRI disagreements (with Mistral comparison).
| Guest | Cell | Sonnet | GPT-5m | \|Δ S−G\| | Mistral |
|---|---|---|---|---|---|
| Camille Paglia | Lit/Arts | 6.67 | 5.57 | 1.11 | 7.94 |
| Dana Gioia | Lit/Arts | 7.34 | 6.42 | 0.92 | 7.67 |
| Cass Sunstein | Law/Policy | 7.63 | 6.74 | 0.89 | 7.83 |
| Russ Roberts | Economics | 7.52 | 6.63 | 0.89 | 7.79 |
| Peter Singer | Philosophy | 7.36 | 6.49 | 0.87 | 7.68 |
| Diarmaid MacCulloch | History | 7.47 | 6.69 | 0.78 | 7.84 |
| Abhijit Banerjee | Economics | 7.29 | 6.58 | 0.71 | 7.69 |
| Rebecca Kukla | Philosophy | 7.68 | 6.97 | 0.71 | 8.22 |
| Jess Wade | Hard Science | 7.07 | 6.37 | 0.70 | 7.84 |
| Marc Andreessen | Tech | 7.45 | 6.75 | 0.69 | 8.00 |
The largest Sonnet↔GPT-5 mini disagreements are asymmetric: in all 10 cases Sonnet scores higher than GPT-5 mini, and in all 10 cases Mistral scores higher than both. This pattern is consistent across the full sample — 87 of 98 guests receive higher VRI from Sonnet than from GPT-5 mini, and 96 of 98 receive higher VRI from Mistral than from GPT-5 mini. The disagreements are concentrated among speakers whose discourse style is rhetorically confident and compressed (Paglia, Sunstein, Andreessen). The consistent direction of disagreement across all three model pairs — Mistral highest, Sonnet intermediate, GPT-5 mini lowest — suggests GPT-5 mini evaluates such speakers more strictly against the rubric's behavioral anchors, while the more generous scorers credit rhetorical confidence as evidence of underlying reasoning capacity even when the behavioral markers are ambiguous.
Discussion
The Normative Contribution
This study provides the first large-scale normative dataset for the EC Verbal Reasoning Index. Ninety-eight speakers scored by three independent frontier language models across nine balanced disciplinary cells produce VRI distributions that can serve as a reference population for interpreting individual scores. Median VRI in this population is approximately 7.0 on Sonnet, 6.7 on GPT-5 mini, and 7.7 on Mistral — a population pre-selected by the same interviewer for intellectual distinction. The scorer-specific median matters because, as documented above, the three scoring models differ by more than a full scale point in absolute-level calibration.
The compressed VRI ranges (5.25–7.68 on Sonnet, 5.25–7.35 on GPT-5 mini, 6.63–8.84 on Mistral) reflect this pre-selection. The normative data should be interpreted as norms for the upper end of the ability distribution, not as population norms. A broader normative study — currently planned using Prolific recruitment with concurrent ICAR fluid reasoning assessment — will provide norms across the full ability range.
Cross-Model Agreement as Validity Evidence
The pairwise inter-model agreement reported here (mean r = .668 on VRI across three pairs, range .589 to .758) is, to our knowledge, the first published three-way cross-vendor LLM-as-judge reliability statistic for a psychometric rubric. Three models from three different vendors — Anthropic (Sonnet 4), OpenAI (GPT-5 mini), and Mistral AI (Mistral Large) — trained on different data, with different architectures, applied the same rubric to the same transcripts without any coordination, and produced scores whose rank-orderings substantially agree.
This agreement is stronger than most published inter-rater reliability estimates for human-scored performance assessments in educational measurement, even when averaged across all three independent pairs. It does not mean the models are "correct" — all three could share systematic biases inherited from shared web-scale pretraining data. But it establishes that the construct measured by the EC rubric is scorer-recoverable across vendor boundaries: the same rubric, applied by three frontier LLMs that share neither architecture nor vendor lineage, and whose training corpora overlap at most partially, produces broadly the same rank ordering of speakers. This is the fundamental requirement for measurement reliability, and it holds across the three-way comparison even though the absolute-level calibration of the three scorers differs by more than a full scale point.
A Two-Factor Reasoning Structure, Stable in Composition and Variable in Separation
The confirmatory factor analyses converge on a substantively important claim: the six core EC dimensions are not a single lump. A one-factor model is rejected under all three scoring models, and all three independently recover the same two-factor composition — Abstraction, Compression, and Originality clustering as a Generative Range factor, and Epistemic Calibration and Generative Self-Monitoring clustering as a Calibrative Control factor. That three independently prompted frontier LLMs, with no access to each other's scoring and no shared intermediate representations, partition the same six dimensions into the same two empirical clusters is the strongest single piece of convergent-validity evidence in this study — and the three-way replication is meaningfully stronger than a two-way one, because it makes it far less likely that the shared structure was simply inherited from common pretraining data.
The theoretical interpretation is straightforward. Generative Range captures the productive side of verbal reasoning — the ability to operate at high levels of abstraction, to pack propositions densely, and to generate non-obvious reframings. Calibrative Control captures the monitoring side — the ability to mark epistemic status explicitly and to revise one's own formulations upward in real time. Both factors contribute to what the VRI composite measures, and they are moderately-to-highly correlated across all three scorers (φ = .384 Sonnet, .766 GPT-5 mini, .886 Mistral) — distinguishable but not independent, as expected for two aspects of a broader verbal-reasoning capacity.
Two findings vary systematically across scorers rather than replicating cleanly. First, the magnitude of factor separation differs substantially: Sonnet produces well-separated factors (φ = .384), GPT-5 mini produces moderately separated factors (φ = .766), and Mistral produces nearly-merging factors (φ = .886). The source is visible in the raw correlation matrices: each scorer shows a characteristic within-factor correlation pattern. Sonnet has Abs↔Comp r = .921; Mistral has .981 (nearly rank-degenerate); GPT-5 mini has .471 (the most dimension-independent of the three). The stricter the within-factor correlations, the more correlated the resulting factors also become, because near-redundant within-factor dimensions leave only cross-factor residual signal. This is a scoring-calibration phenomenon, not a structural disagreement about what the factors are — all three scorers still prefer the two-factor structure to the one-factor alternative by conventional fit criteria.
Second, Conceptual Continuity — and Continuity alone — receives three different best-fitting factorial homes across the three scorers. Sonnet's preferred model cross-loads Continuity on both factors. GPT-5 mini's preferred model places Continuity on Generative Range only. Mistral's preferred model places Continuity on Calibrative Control only. No other dimension varies in its factor assignment across scorers. We interpret this as evidence that Continuity is empirically a boundary dimension — its correlations with the other five are ambiguous enough that the "right" factor for it depends on idiosyncratic scoring calibration rather than underlying construct structure. This finding is not a defect; it is the clearest possible empirical justification for the production decision to exclude Continuity from both the reported Generative Range and Calibrative Control subscores. The reported subscores use only the five dimensions whose factor assignment is invariant across all three independent scorers. Users of the scale should specify the scoring model just as they would specify the norming sample, because absolute-level calibration and factor-separation magnitude both vary across scorers; the rank-ordering of speakers does not.
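The relation between composite scores and factor separation can be illustrated with a crude proxy: form unit-weight composites of each factor's invariant dimensions and correlate them across speakers. This is not the CFA estimate of φ reported above (a latent-variable model corrects for measurement error that this proxy ignores), and the scores below are hypothetical, but the mechanics are the same in miniature.

```python
import numpy as np

# Hypothetical per-speaker scores on the five invariant dimensions
# (columns: Abs, Cmp, Ori, EC, GSM), on the study's 1-9 scale.
scores = np.array([
    [8, 7, 7, 8, 7],
    [7, 6, 7, 7, 6],
    [8, 7, 8, 7, 6],
    [6, 5, 7, 7, 6],
    [8, 7, 8, 8, 7],
    [7, 6, 6, 8, 6],
], dtype=float)

gr = scores[:, :3].mean(axis=1)  # Generative Range proxy (Abs, Cmp, Ori)
cc = scores[:, 3:].mean(axis=1)  # Calibrative Control proxy (EC, GSM)

# Crude analogue of the inter-factor correlation phi.
phi_proxy = np.corrcoef(gr, cc)[0, 1]
```

Under this proxy, a scorer whose dimension scores are nearly redundant within each cluster will push the two composites toward a single ranking, which is the mechanism the text describes.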
Disciplinary Reasoning Signatures
The finding that all three models independently reproduce the same top-of-hierarchy and bottom-of-hierarchy disciplinary ranking — philosophy and hard science at the top, literary arts at the bottom — provides convergent validity evidence that the rubric detects real differences in discourse register rather than random variation. Intermediate-cell orderings shift slightly across scorers, but always within the standard error of the cell means. The dimension-level profiles are consistent with what is known about how disciplinary training shapes discourse: philosophers reason at the level of principles (Abstraction at or near the scale ceiling), scientists mark epistemic boundaries explicitly (the highest Epistemic Calibration means), historians build cumulative narrative arguments (the highest Conceptual Continuity means), and literary speakers use associative and narrative reasoning that the analytical rubric does not fully capture.
Limitations
The sample is pre-selected. All speakers are guests on a single interview podcast, selected by the same host for intellectual distinction. This compresses the VRI range and limits generalizability.
No external criterion. Unlike the prior ecological validity study, this study does not correlate VRI with an external measure. The disciplinary patterns are interpretable but not validated against independent criteria.
The cross-model calibration offsets are systematic and large. The three scorers span a strict 1.06-point generosity gradient on VRI: Mistral > Sonnet > GPT-5 mini. Norms based on one model's scores are not interchangeable with norms based on another's. Any production use of the EC rubric must specify the scoring model, and any longitudinal comparison of a single speaker over time must use the same scorer throughout.
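The practical consequence can be sketched in a few lines: means must be computed and compared within a single scoring model, never pooled across scorers. The VRI values below are hypothetical illustrations, not the study's norms.

```python
# Hypothetical VRI samples keyed by scoring model (illustrative only).
vri = {
    "mistral":  [7.9, 7.7, 8.0, 7.5],
    "sonnet":   [7.3, 7.0, 7.5, 6.8],
    "gpt5mini": [6.9, 6.6, 7.1, 6.4],
}

# Per-model means: the only level at which absolute comparison is valid.
means = {model: sum(xs) / len(xs) for model, xs in vri.items()}

# The generosity gradient described in the text: Mistral > Sonnet > GPT-5 mini.
gradient_holds = means["mistral"] > means["sonnet"] > means["gpt5mini"]
```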
Compression shows scorer-specific disagreement. GPT-5 mini operationalizes propositional density notably differently from Sonnet (r = .397) and Mistral (r = .272), while Sonnet and Mistral agree more substantially with each other (r = .611). This suggests the GPT-5 mini prompt interpretation on Compression may be an outlier, and the dimension may benefit from additional behavioral anchoring to align scorers.
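The agreement figures quoted throughout are Pearson correlations computed over speakers. A self-contained version of the statistic, applied here to hypothetical Compression scores rather than the study's data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical Compression scores for five speakers from two scorers.
sonnet_cmp = [7.0, 6.0, 6.7, 4.7, 7.0]
mistral_cmp = [7.0, 6.3, 7.0, 6.0, 7.3]

r = pearson_r(sonnet_cmp, mistral_cmp)
```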
Literary arts scores reflect a construct boundary. The rubric measures analytical verbal reasoning. Speakers whose primary mode is narrative, associative, or performative will score lower not because they are less intelligent but because the rubric is not designed to measure their kind of reasoning. This is a scope limitation, not a measurement failure.
No written-vs-spoken comparison. The prior ecological validity study included a written-vs-spoken pilot for three guests; the present study does not extend that comparison.
Conceptual Continuity's factor placement is scorer-dependent. The three-scorer CFA comparison reveals that Conceptual Continuity has no single empirical home in the two-factor structure — each scorer places it differently. This is a limitation of the current rubric insofar as it suggests the Continuity behavioral descriptors are ambiguous with respect to the Generative Range / Calibrative Control partition. It is also the clearest empirical justification for excluding Continuity from the reported GR and CC subscores. Future rubric revisions could either sharpen the Continuity anchors to force a consistent factorial home or formally recognize it as a cross-cutting dimension that contributes to VRI without loading on either subscore.
Conclusion
Ninety-nine speakers from Conversations with Tyler, scored by three independent frontier large language models from three different vendors across six core dimensions of verbal reasoning, produce normative VRI data that are internally consistent (mean inter-pass VRI spread = 0.19), cross-model reliable (mean pairwise r = .668), discipline-differentiated in theoretically predicted directions, and factorially coherent under every scorer.
Two complementary sources of convergent validity support the EC rubric's construct claim. First, all three scoring models independently produce the same disciplinary hierarchy — philosophy and hard science at the top, literary arts at the bottom, with theoretically interpretable dimension-level profiles for each discipline — despite having no access to speaker identity, discipline, or each other's scores. Second, confirmatory factor analyses estimated independently on each of the three scorers' data reject a unidimensional model and recover an identical two-factor composition: Abstraction, Compression, and Originality cluster as Generative Range; Epistemic Calibration and Generative Self-Monitoring cluster as Calibrative Control. The factor composition is invariant across scorers. Only factor separation magnitude and the placement of one boundary dimension (Conceptual Continuity) vary, and both variations are attributable to scoring-calibration differences rather than to disagreement about what the factors are.
Together, these findings suggest that what the rubric measures is real, even if what it measures is narrower than intelligence writ large. The EC rubric measures analytical verbal reasoning as performed in spontaneous speech, and the construct is internally structured as two moderately to highly correlated but empirically distinguishable components, one productive and one calibrative. Within that scope, the rubric measures reliably, it differentiates meaningfully, and its underlying factor structure is recoverable by three independent frontier LLMs from three different vendors. The three-way replication of the Generative Range / Calibrative Control partition, with the three-way disagreement confined to a single boundary dimension, constitutes the strongest evidence currently available that the two-factor structure of verbal reasoning recovered here is a property of the construct rather than an artifact of any individual scorer.
Appendix A: Complete Per-Guest Dimension Scores
The full per-guest dimension-level scores are presented in three tables, one per scoring model, to keep each table within the page width. Scores are 3-pass blinded averages on the 1–9 scale. Abs = Abstraction; Cmp = Compression; Ori = Originality; CC = Conceptual Continuity; EC = Epistemic Calibration; GSM = Generative Self-Monitoring; Voc = Vocabulary (moderator); Syn = Syntactic Control (moderator). VRI = Verbal Reasoning Index composite weighted from the six core dimensions. Rows sorted by discipline cell, then by that scorer's VRI descending within cell.
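For readers working with these tables programmatically, an equal-weight mean of the six core dimensions closely approximates the tabled VRI, though the published composite uses weights not reproduced in this appendix, so treat this as an approximation only.

```python
# Approximate VRI as the unweighted mean of the six core dimensions.
# The actual composite is weighted; equal weighting is an assumption here.
CORE = ("Abs", "Cmp", "Ori", "CC", "EC", "GSM")

def approx_vri(scores: dict) -> float:
    """Equal-weight approximation of the VRI composite."""
    return sum(scores[d] for d in CORE) / len(CORE)

# Example row from Appendix A.1 (Russ Roberts, Sonnet scores):
roberts = {"Abs": 8, "Cmp": 7, "Ori": 7, "CC": 8, "EC": 8, "GSM": 7}
approx = approx_vri(roberts)  # 7.5; the tabled VRI is 7.52 under the actual weights
```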
Appendix A.1: Claude Sonnet 4 scores (n = 98)
| Guest | Cell | Abs | Cmp | Ori | CC | EC | GSM | Voc | Syn | VRI |
|---|---|---|---|---|---|---|---|---|---|---|
| Russ Roberts | economics | 8 | 7 | 7 | 8 | 8 | 7 | 7.7 | 6.7 | 7.52 |
| Daron Acemoglu | economics | 8 | 7 | 8 | 8 | 7 | 6 | 8 | 7 | 7.34 |
| Abhijit Banerjee | economics | 7.3 | 6.3 | 7 | 8 | 8 | 7 | 8 | 7 | 7.29 |
| Paul Krugman | economics | 7.3 | 6.3 | 6.3 | 7.3 | 8 | 7 | 7.7 | 6.7 | 7.08 |
| Raj Chetty | economics | 7 | 6 | 6.7 | 7.7 | 8 | 6.7 | 7 | 6.3 | 7.02 |
| Alain Bertaud | economics | 7 | 6 | 7 | 8 | 7 | 6 | 7 | 6 | 6.84 |
| Larry Summers | economics | 7 | 6 | 6 | 7 | 8 | 6.3 | 7.7 | 7 | 6.75 |
| Simon Johnson | economics | 7 | 6 | 6 | 7 | 7.7 | 6 | 7.3 | 6.3 | 6.64 |
| Nassim Nicholas Taleb | economics | 7 | 6.7 | 8 | 6.3 | 6.3 | 5.3 | 8 | 6 | 6.61 |
| Alan Taylor | economics | 7 | 6 | 6 | 7 | 7.7 | 5.7 | 7 | 6 | 6.59 |
| Ed Boyden | hard_science | 8 | 7 | 8 | 8 | 8 | 7 | 8 | 7 | 7.68 |
| Michelle Dawson | hard_science | 8 | 7 | 8 | 8 | 8 | 7 | 8 | 7 | 7.68 |
| Alison Gopnik | hard_science | 8 | 7 | 8 | 8 | 7.7 | 6.7 | 8 | 7 | 7.57 |
| David Deutsch | hard_science | 8 | 7 | 8 | 8 | 7.7 | 6.7 | 8 | 7 | 7.57 |
| Steven Pinker | hard_science | 8 | 7 | 7 | 8 | 8 | 7 | 8 | 7 | 7.52 |
| Michael Nielsen | hard_science | 8 | 7 | 8 | 7 | 8 | 6 | 8 | 7 | 7.36 |
| Paul Bloom | hard_science | 7.7 | 6.7 | 7.3 | 7.7 | 8 | 6.7 | 7.7 | 6.7 | 7.35 |
| Philip Ball | hard_science | 7.3 | 6.3 | 7 | 8 | 8 | 6.7 | 8 | 7 | 7.24 |
| Atul Gawande | hard_science | 7 | 6 | 7 | 8 | 8 | 7 | 8 | 7 | 7.18 |
| Jess Wade | hard_science | 7 | 6 | 7 | 8 | 7.7 | 6.7 | 8 | 7 | 7.07 |
| Diarmaid MacCulloch | history | 8 | 7 | 7.3 | 8 | 8 | 6.3 | 8 | 7 | 7.47 |
| Ada Palmer | history | 8 | 7 | 8 | 8 | 7 | 6 | 8 | 7 | 7.34 |
| Adam Tooze | history | 8 | 7 | 8 | 8 | 7 | 6 | 8 | 7 | 7.34 |
| Jill Lepore | history | 8 | 7 | 8 | 8 | 7 | 6 | 8 | 7 | 7.34 |
| Roy Foster | history | 7.7 | 6.7 | 7 | 8 | 8 | 6.3 | 8 | 7 | 7.30 |
| Helen Castor | history | 7 | 6 | 7 | 8 | 8 | 7 | 8 | 7 | 7.18 |
| Jennifer Burns | history | 7 | 6 | 7 | 8 | 8 | 6.3 | 7 | 6.3 | 7.07 |
| Paul Gillingham | history | 7 | 6 | 7 | 8 | 7 | 6 | 8 | 7 | 6.84 |
| Niall Ferguson | history | 7 | 6 | 7 | 8 | 7 | 6 | 8 | 7 | 6.84 |
| Patricia Fara | history | 7 | 6 | 6.7 | 7.7 | 7.3 | 6 | 7.3 | 6.7 | 6.79 |
| Reza Aslan | history | 7 | 6 | 6.7 | 7.7 | 7 | 6 | 8 | 7 | 6.73 |
| Ezra Klein | journalism_public | 7 | 6 | 7 | 8 | 8 | 7 | 7 | 6.7 | 7.18 |
| Nate Silver | journalism_public | 7 | 6 | 6.3 | 7.3 | 8 | 7 | 7 | 6 | 6.97 |
| David Brooks | journalism_public | 7 | 6 | 7 | 8 | 7 | 6 | 7.3 | 6.3 | 6.84 |
| Malcolm Gladwell | journalism_public | 7 | 6 | 7 | 8 | 7 | 6 | 7 | 6 | 6.84 |
| Larissa MacFarquhar | journalism_public | 7 | 6 | 7 | 8 | 7 | 6 | 8 | 7 | 6.84 |
| Andrew Sullivan | journalism_public | 7 | 6 | 7 | 8 | 7 | 6 | 8 | 7 | 6.84 |
| Ben Thompson | journalism_public | 7 | 6 | 6.3 | 7.3 | 7 | 6 | 7 | 6 | 6.63 |
| Barkha Dutt | journalism_public | 7 | 6 | 6 | 7 | 7 | 6 | 7 | 6 | 6.52 |
| Ben Westhoff | journalism_public | 6 | 6 | 7 | 7 | 7 | 6 | 7 | 6 | 6.50 |
| Annie Jacobsen | journalism_public | 7 | 4.7 | 6.3 | 7 | 7.7 | 6 | 7 | 6 | 6.48 |
| Andrew Ross Sorkin | journalism_public | 6 | 4.7 | 7 | 7 | 6 | 5.7 | 7 | 6 | 6.05 |
| Cass Sunstein | law_policy | 8 | 7 | 7.7 | 8 | 8 | 7 | 8 | 7 | 7.63 |
| Jamal Greene | law_policy | 8 | 7 | 7 | 8 | 8 | 6.7 | 8 | 7 | 7.47 |
| Rachel Harmon | law_policy | 7 | 6 | 6.3 | 7.3 | 8 | 6.3 | 7.3 | 6.3 | 6.86 |
| Ben Sasse | law_policy | 7 | 6 | 7 | 8 | 7 | 6 | 7.7 | 7 | 6.84 |
| Bruno Macaes | law_policy | 7.3 | 6.3 | 7.3 | 7.7 | 6.7 | 5.7 | 7.7 | 6.3 | 6.84 |
| Jennifer Pahlka | law_policy | 7 | 6 | 7 | 8 | 7 | 6 | 7 | 6 | 6.84 |
| Samantha Power | law_policy | 7 | 6 | 6 | 7 | 8 | 6.3 | 8 | 7 | 6.75 |
| Stanley McChrystal | law_policy | 7 | 6 | 6 | 7 | 7 | 6 | 7 | 6 | 6.52 |
| Tom Tugendhat | law_policy | 7 | 6 | 6 | 7 | 6.7 | 5.7 | 7 | 6.7 | 6.41 |
| John O. Brennan | law_policy | 7 | 4.7 | 5.3 | 6.3 | 7.7 | 6 | 7 | 6 | 6.21 |
| Leopoldo Lopez | law_policy | 7 | 6 | 6 | 7 | 6 | 5 | 7 | 6 | 6.18 |
| Dana Gioia | lit_arts | 8 | 7 | 8 | 8 | 7 | 6 | 8 | 7 | 7.34 |
| Margaret Atwood | lit_arts | 7 | 6 | 8 | 7 | 8 | 7 | 8 | 7 | 7.18 |
| Brian Koppelman | lit_arts | 7 | 6 | 7 | 8 | 7.7 | 6.7 | 7.3 | 6.3 | 7.07 |
| Fuchsia Dunlop | lit_arts | 7 | 6 | 7 | 8 | 7 | 6 | 8 | 7 | 6.84 |
| Alex Ross | lit_arts | 7 | 6 | 7 | 8 | 7 | 6 | 8 | 7 | 6.84 |
| Camille Paglia | lit_arts | 8 | 7 | 8 | 6.7 | 5.7 | 4.7 | 8 | 7 | 6.67 |
| Benjamin Moser | lit_arts | 7 | 6 | 7 | 7.3 | 6.7 | 5.7 | 8 | 7 | 6.62 |
| Andy Weir | lit_arts | 6.3 | 5.3 | 7 | 7.3 | 6.3 | 6 | 6.7 | 6 | 6.39 |
| Cynthia Haven | lit_arts | 7 | 6 | 6 | 6 | 7 | 6 | 7 | 6 | 6.36 |
| Emily St. John Mandel | lit_arts | 6 | 5 | 7 | 6 | 7 | 6 | 7 | 6 | 6.18 |
| Ana Vidovic | lit_arts | 6 | 4 | 5 | 6 | 5.7 | 4.7 | 6 | 5 | 5.25 |
| Rebecca Kukla | philosophy | 8 | 7 | 8 | 8 | 8 | 7 | 8 | 7 | 7.68 |
| Agnes Callard | philosophy | 8 | 7 | 7.7 | 8 | 8 | 7 | 8 | 7 | 7.63 |
| David Bentley Hart | philosophy | 8 | 7 | 7 | 8 | 8 | 7 | 8.3 | 7.7 | 7.52 |
| Elijah Millgram | philosophy | 8 | 7 | 8 | 7 | 8 | 7 | 8 | 7 | 7.52 |
| William MacAskill | philosophy | 8 | 7 | 7 | 8 | 8 | 7 | 8 | 7 | 7.52 |
| Amia Srinivasan | philosophy | 8 | 7 | 7 | 8 | 8 | 6.7 | 8 | 7 | 7.47 |
| Rabbi David Wolpe | philosophy | 8 | 7 | 7.3 | 7.7 | 8 | 6.7 | 8 | 7 | 7.47 |
| John Gray | philosophy | 8 | 7 | 8 | 8 | 7.3 | 6.3 | 8 | 7 | 7.45 |
| Kwame Anthony Appiah | philosophy | 8 | 7 | 7 | 8 | 8 | 6.3 | 8 | 7 | 7.41 |
| Peter Singer | philosophy | 8 | 7 | 6 | 8 | 8 | 7 | 8 | 7 | 7.36 |
| Noam Chomsky | philosophy | 8 | 7 | 7.7 | 8 | 7 | 6 | 8 | 7 | 7.29 |
| Slavoj Zizek | philosophy | 8 | 7 | 8 | 6.3 | 6.7 | 6 | 8 | 6 | 7.01 |
| Henry Farrell | social_science | 8 | 7 | 7.7 | 8 | 8 | 7 | 8 | 7 | 7.63 |
| Daniel Kahneman | social_science | 8 | 7 | 8 | 7.3 | 8 | 6.3 | 8 | 7 | 7.47 |
| Jonathan Haidt | social_science | 8 | 7 | 7 | 8 | 8 | 6 | 8 | 7 | 7.36 |
| Joseph Henrich | social_science | 8 | 7 | 8 | 8 | 7 | 6 | 8 | 7 | 7.34 |
| Arthur Brooks | social_science | 7.7 | 6.7 | 7 | 8 | 7.7 | 6.7 | 8 | 7 | 7.29 |
| Philip E Tetlock | social_science | 7 | 6 | 7 | 7.7 | 8 | 6.3 | 8 | 7 | 7.02 |
| Chris Blattman | social_science | 7 | 6 | 6.7 | 7.7 | 8 | 6.7 | 7.7 | 6.7 | 7.02 |
| Harvey Mansfield | social_science | 8 | 7 | 7 | 8 | 6 | 5.3 | 8 | 7 | 6.89 |
| Ashley Mears | social_science | 7 | 6 | 7 | 7.3 | 7 | 6 | 7 | 6 | 6.73 |
| Daniel Carpenter | social_science | 7 | 6 | 6 | 7 | 8 | 6 | 8 | 7 | 6.70 |
| Eric Kaufmann | social_science | 7 | 6 | 6 | 7 | 7 | 5.7 | 7 | 6 | 6.47 |
| Vitalik Buterin | tech_entrepreneurship | 8 | 7 | 8 | 8 | 7.7 | 6.7 | 8 | 7 | 7.57 |
| Marc Andreessen | tech_entrepreneurship | 8 | 7 | 8 | 8 | 7 | 6.7 | 8 | 7 | 7.45 |
| Audrey Tang | tech_entrepreneurship | 8 | 7 | 8 | 7 | 7.7 | 6 | 8 | 7 | 7.30 |
| Balaji Srinivasan | tech_entrepreneurship | 8 | 7 | 8 | 7 | 7 | 6.3 | 8 | 7 | 7.23 |
| Daniel Gross | tech_entrepreneurship | 7 | 6 | 7 | 8 | 7.7 | 6.7 | 7 | 6 | 7.07 |
| Sam Altman | tech_entrepreneurship | 7 | 6 | 7 | 7.7 | 7.7 | 6 | 7 | 6 | 6.91 |
| Brian Armstrong | tech_entrepreneurship | 7 | 6 | 7 | 7.7 | 7.7 | 6 | 7 | 6 | 6.91 |
| Chris Dixon | tech_entrepreneurship | 7 | 6 | 7 | 8 | 7 | 6 | 7.3 | 6.3 | 6.84 |
| Blake Scholl | tech_entrepreneurship | 7 | 6 | 8 | 7 | 6 | 6 | 7 | 6 | 6.66 |
| Patrick Collison | tech_entrepreneurship | 6.7 | 6 | 7 | 7 | 6 | 5.3 | 7 | 6 | 6.33 |
| David Rubenstein | tech_entrepreneurship | 6 | 5 | 6 | 7 | 7 | 6 | 7 | 6 | 6.18 |
Appendix A.2: GPT-5 mini scores (n = 99)
| Guest | Cell | Abs | Cmp | Ori | CC | EC | GSM | Voc | Syn | VRI |
|---|---|---|---|---|---|---|---|---|---|---|
| Daron Acemoglu | economics | 7.7 | 6.3 | 7 | 7.7 | 7.7 | 6.7 | 6.7 | 6.7 | 7.19 |
| Larry Summers | economics | 7.7 | 6 | 5.7 | 6.3 | 8 | 7.3 | 7 | 7 | 6.87 |
| Raj Chetty | economics | 7 | 5.7 | 6 | 7 | 7.7 | 7 | 6 | 6 | 6.75 |
| Nassim Nicholas Taleb | economics | 7.7 | 6 | 7 | 6 | 6.7 | 7 | 6.7 | 6 | 6.74 |
| Russ Roberts | economics | 7.3 | 5.7 | 6 | 6.7 | 7 | 7 | 6 | 6.7 | 6.63 |
| Alan Taylor | economics | 7.3 | 6 | 5 | 6.7 | 7.7 | 6.7 | 6.3 | 6.7 | 6.59 |
| Abhijit Banerjee | economics | 7 | 5.7 | 6 | 6.3 | 7.3 | 7 | 6 | 6 | 6.58 |
| Paul Krugman | economics | 7.3 | 6 | 6 | 6.3 | 7 | 6.7 | 6.3 | 6.7 | 6.58 |
| Alain Bertaud | economics | 7.7 | 5.3 | 6 | 7 | 6.7 | 6 | 6.7 | 6.7 | 6.47 |
| Simon Johnson | economics | 7.3 | 5 | 5.3 | 6.3 | 6.3 | 6 | 6.3 | 6.3 | 6.09 |
| Ed Boyden | hard_science | 7.7 | 6 | 7.7 | 7.7 | 8 | 7 | 7 | 6.7 | 7.35 |
| Michelle Dawson | hard_science | 7.7 | 6 | 7 | 7.3 | 7.7 | 7 | 7 | 6 | 7.13 |
| Alison Gopnik | hard_science | 8 | 5.7 | 7 | 7.3 | 7.7 | 6.7 | 7 | 7 | 7.09 |
| David Deutsch | hard_science | 8 | 5.3 | 6.7 | 7.3 | 7.7 | 7 | 7.3 | 6.7 | 7.03 |
| Steven Pinker | hard_science | 7.3 | 6 | 6 | 7.3 | 7.7 | 7 | 7.3 | 7.3 | 6.91 |
| Paul Bloom | hard_science | 7.3 | 6 | 6 | 7 | 7.7 | 7 | 6 | 6.3 | 6.86 |
| Philip Ball | hard_science | 7.3 | 6 | 6 | 6.7 | 7.7 | 7 | 6.7 | 6.7 | 6.81 |
| Michael Nielsen | hard_science | 7.7 | 5.7 | 6 | 6.3 | 7.3 | 7 | 6.3 | 6.3 | 6.70 |
| Atul Gawande | hard_science | 7.7 | 5.7 | 6 | 6.7 | 7 | 7 | 6.7 | 6.7 | 6.69 |
| Ezekiel Emanuel | hard_science | 7 | 5.7 | 6 | 6.7 | 7 | 6.7 | 6.3 | 6.3 | 6.52 |
| Jess Wade | hard_science | 7 | 4.3 | 6 | 6.3 | 7.3 | 7 | 6.7 | 6.7 | 6.37 |
| Helen Castor | history | 7.7 | 6 | 6 | 7.7 | 8 | 7 | 7 | 7 | 7.09 |
| Adam Tooze | history | 8 | 6.3 | 6.3 | 7 | 7.7 | 7 | 8 | 7.3 | 7.09 |
| Roy Foster | history | 7.7 | 6 | 6 | 7.3 | 7.7 | 6.7 | 7.3 | 7.3 | 6.92 |
| Paul Gillingham | history | 8 | 6.3 | 6 | 7 | 7 | 7 | 7.3 | 7 | 6.91 |
| Ada Palmer | history | 7.3 | 6 | 6.7 | 7 | 7.3 | 6.7 | 7.3 | 7.3 | 6.85 |
| Jill Lepore | history | 7.3 | 5.3 | 6.3 | 7 | 7.7 | 7 | 6.3 | 7 | 6.81 |
| Diarmaid MacCulloch | history | 7.3 | 6 | 6.3 | 7 | 7 | 6.3 | 7.7 | 7 | 6.69 |
| Jennifer Burns | history | 7 | 5.3 | 6 | 7 | 7.3 | 6.3 | 6.3 | 6.3 | 6.53 |
| Niall Ferguson | history | 7 | 6 | 6 | 6.7 | 7 | 6.3 | 7 | 6.7 | 6.52 |
| Patricia Fara | history | 7 | 5.7 | 6 | 7 | 7 | 6.3 | 6.3 | 6.7 | 6.52 |
| Reza Aslan | history | 7.3 | 5.7 | 6 | 6.7 | 7 | 6 | 6.7 | 6 | 6.47 |
| Nate Silver | journalism_public | 8 | 6 | 6 | 7 | 8 | 7 | 6 | 6 | 7.04 |
| Barkha Dutt | journalism_public | 7.3 | 6 | 6 | 7 | 7.7 | 7 | 6.7 | 6.7 | 6.86 |
| Malcolm Gladwell | journalism_public | 7.3 | 6 | 6 | 7 | 7 | 7.3 | 6.7 | 7 | 6.79 |
| Ben Thompson | journalism_public | 7 | 6.3 | 6 | 7 | 7.3 | 6 | 6.3 | 6 | 6.63 |
| Ezra Klein | journalism_public | 7.3 | 5.7 | 6 | 6.7 | 7.3 | 6.3 | 6 | 6 | 6.59 |
| Andrew Sullivan | journalism_public | 7.3 | 6 | 6 | 6.3 | 6.7 | 7 | 6.7 | 6.7 | 6.57 |
| Andrew Ross Sorkin | journalism_public | 6.7 | 5.7 | 6 | 6.7 | 7 | 6.7 | 6 | 6 | 6.46 |
| David Brooks | journalism_public | 7.3 | 4.7 | 6 | 6.3 | 7 | 7 | 6.3 | 6 | 6.42 |
| Larissa MacFarquhar | journalism_public | 7 | 4.7 | 6 | 6.7 | 7 | 7 | 7 | 7 | 6.41 |
| Ben Westhoff | journalism_public | 6.7 | 5.7 | 6 | 6.7 | 7 | 6.3 | 6 | 6 | 6.41 |
| Annie Jacobsen | journalism_public | 7 | 4.7 | 6 | 6.3 | 7.3 | 6.7 | 6 | 6.3 | 6.37 |
| Jamal Greene | law_policy | 7.7 | 6 | 6 | 7 | 8 | 7 | 6.7 | 6.3 | 6.98 |
| Rachel Harmon | law_policy | 7.7 | 6 | 5.7 | 7 | 7.7 | 6.7 | 6.7 | 6.3 | 6.81 |
| Cass Sunstein | law_policy | 7 | 6 | 6 | 7 | 7.3 | 7 | 6.7 | 6.3 | 6.74 |
| Bruno Macaes | law_policy | 7.3 | 6 | 6 | 7 | 7 | 6.3 | 6.3 | 6 | 6.63 |
| Ben Sasse | law_policy | 7 | 5.7 | 6 | 6.7 | 7 | 6.7 | 6 | 6.3 | 6.52 |
| Samantha Power | law_policy | 7.7 | 5 | 5.3 | 6.7 | 7 | 7 | 6.7 | 6.7 | 6.48 |
| Jennifer Pahlka | law_policy | 7 | 6 | 6 | 6.7 | 6.7 | 6.3 | 6 | 6 | 6.46 |
| Stanley McChrystal | law_policy | 7.3 | 5.7 | 6 | 6.3 | 6.7 | 6 | 6 | 6 | 6.36 |
| Tom Tugendhat | law_policy | 7 | 5.3 | 5.3 | 6.7 | 7 | 6.3 | 6.7 | 6.3 | 6.31 |
| Leopoldo Lopez | law_policy | 7 | 6 | 5 | 6.7 | 6 | 6 | 6 | 6 | 6.13 |
| John O. Brennan | law_policy | 7 | 5 | 4.3 | 6 | 7.3 | 6 | 6.3 | 6 | 5.99 |
| Alex Ross | lit_arts | 7.7 | 6 | 6 | 7 | 7 | 6.7 | 7.7 | 7 | 6.75 |
| Andy Weir | lit_arts | 7 | 6 | 6 | 7 | 7 | 7 | 6.3 | 6.3 | 6.68 |
| Margaret Atwood | lit_arts | 7 | 5.7 | 6.7 | 6.3 | 7.3 | 6.7 | 7 | 7 | 6.63 |
| Fuchsia Dunlop | lit_arts | 7 | 6 | 6.3 | 7 | 7 | 6 | 7.3 | 6.7 | 6.57 |
| Cynthia Haven | lit_arts | 7 | 5.7 | 5.7 | 6.7 | 7.3 | 6.3 | 6.7 | 6 | 6.47 |
| Brian Koppelman | lit_arts | 7 | 5 | 6 | 6.7 | 7.3 | 6.7 | 7.3 | 7 | 6.47 |
| Dana Gioia | lit_arts | 7.7 | 4.7 | 6 | 7 | 6.7 | 6.3 | 7.3 | 7 | 6.42 |
| Benjamin Moser | lit_arts | 7 | 5.3 | 6 | 6 | 6.7 | 6.3 | 6.3 | 6 | 6.25 |
| Emily St. John Mandel | lit_arts | 6 | 4.3 | 5.7 | 6 | 7 | 6.3 | 6 | 6.3 | 5.91 |
| Camille Paglia | lit_arts | 7 | 3.3 | 6.7 | 6 | 4.7 | 5.7 | 7.3 | 6.7 | 5.57 |
| Ana Vidovic | lit_arts | 6 | 3.3 | 4 | 6 | 6 | 6 | 6 | 6 | 5.25 |
| Agnes Callard | philosophy | 7.7 | 6 | 6.7 | 7 | 7.7 | 7 | 6.7 | 6 | 7.03 |
| Rabbi David Wolpe | philosophy | 8 | 5.3 | 6.3 | 7 | 8 | 7 | 7.7 | 7 | 6.99 |
| John Gray | philosophy | 8 | 6 | 6 | 7 | 7.7 | 7 | 7.7 | 6.7 | 6.98 |
| Amia Srinivasan | philosophy | 8 | 6 | 6.3 | 7 | 7.7 | 6.7 | 7 | 6.7 | 6.98 |
| Rebecca Kukla | philosophy | 7.7 | 5.7 | 6.7 | 7 | 7.7 | 7 | 6.3 | 6.7 | 6.97 |
| Elijah Millgram | philosophy | 7.7 | 6 | 6.7 | 6.7 | 7.3 | 7.3 | 7 | 6.3 | 6.97 |
| Noam Chomsky | philosophy | 8 | 6.3 | 6.3 | 7.3 | 7 | 6.7 | 7 | 7 | 6.97 |
| William MacAskill | philosophy | 7.7 | 6 | 5.7 | 7 | 8 | 7 | 6.7 | 6.3 | 6.93 |
| David Bentley Hart | philosophy | 7.7 | 6 | 6 | 7 | 7.7 | 7 | 7.7 | 7.3 | 6.92 |
| Kwame Anthony Appiah | philosophy | 7.7 | 6 | 6 | 7 | 7.3 | 6.7 | 6.3 | 7 | 6.81 |
| Slavoj Zizek | philosophy | 7 | 5.3 | 6.7 | 6.7 | 7 | 6.3 | 6.7 | 6 | 6.52 |
| Peter Singer | philosophy | 7.7 | 5.7 | 5.3 | 6.7 | 7.3 | 6 | 6.7 | 6.3 | 6.49 |
| Daniel Kahneman | social_science | 8 | 6 | 7 | 7 | 8 | 7.3 | 6.3 | 6.3 | 7.25 |
| Henry Farrell | social_science | 8 | 6 | 6.3 | 7 | 8 | 7 | 7.3 | 7 | 7.09 |
| Joseph Henrich | social_science | 7.3 | 6 | 6.7 | 7 | 8 | 7 | 6.7 | 6 | 7.03 |
| Daniel Carpenter | social_science | 7.7 | 6 | 6 | 7 | 8 | 7 | 6.7 | 6.7 | 6.98 |
| Philip E Tetlock | social_science | 7.7 | 6 | 6.3 | 7 | 7.3 | 7.3 | 6.7 | 6 | 6.97 |
| Jonathan Haidt | social_science | 7.3 | 6 | 6 | 7 | 7.3 | 7 | 6.3 | 6.7 | 6.80 |
| Arthur Brooks | social_science | 7.7 | 5 | 6.3 | 7 | 7 | 7 | 7 | 6.3 | 6.69 |
| Chris Blattman | social_science | 7.3 | 5.3 | 6 | 7 | 7.3 | 6.3 | 6 | 6 | 6.59 |
| Harvey Mansfield | social_science | 7.7 | 6 | 6 | 6.3 | 7 | 6.3 | 7 | 6 | 6.59 |
| Eric Kaufmann | social_science | 7.3 | 6 | 5.7 | 6.7 | 7 | 6 | 6 | 6 | 6.47 |
| Ashley Mears | social_science | 7.3 | 5 | 5.7 | 7 | 7 | 6 | 6 | 6 | 6.37 |
| Vitalik Buterin | tech_entrepreneurship | 8 | 6 | 7 | 7 | 8 | 7 | 7 | 6.3 | 7.20 |
| Audrey Tang | tech_entrepreneurship | 7.7 | 5.7 | 7.3 | 7.3 | 7.3 | 6.3 | 7 | 7 | 6.97 |
| Balaji Srinivasan | tech_entrepreneurship | 7.7 | 6 | 7 | 6.7 | 7 | 7 | 7 | 6.3 | 6.91 |
| Daniel Gross | tech_entrepreneurship | 7 | 6 | 6 | 7 | 8 | 7 | 6 | 6.3 | 6.86 |
| Chris Dixon | tech_entrepreneurship | 7.3 | 6 | 6 | 7 | 7.7 | 6.7 | 7.3 | 6.3 | 6.81 |
| Blake Scholl | tech_entrepreneurship | 7.3 | 6 | 7 | 7 | 6.7 | 6.7 | 6 | 6 | 6.79 |
| Marc Andreessen | tech_entrepreneurship | 7.3 | 6 | 6 | 6.7 | 7.7 | 6.7 | 6.3 | 6 | 6.75 |
| Sam Altman | tech_entrepreneurship | 7.3 | 6 | 6 | 6.7 | 7.3 | 7 | 6 | 6.3 | 6.75 |
| Brian Armstrong | tech_entrepreneurship | 7 | 5.7 | 6 | 6.7 | 7.7 | 6.7 | 6 | 6.3 | 6.64 |
| Patrick Collison | tech_entrepreneurship | 7 | 6 | 6 | 6.3 | 6.7 | 6 | 6 | 6 | 6.35 |
| David Rubenstein | tech_entrepreneurship | 7 | 4.7 | 4.3 | 6 | 7 | 6 | 6 | 6 | 5.88 |
Appendix A.3: Mistral Large scores (n = 99)
| Guest | Cell | Abs | Cmp | Ori | CC | EC | GSM | Voc | Syn | VRI |
|---|---|---|---|---|---|---|---|---|---|---|
| Daron Acemoglu | economics | 8 | 7 | 8 | 9 | 8 | 8 | 8 | 7 | 8.00 |
| Russ Roberts | economics | 8 | 7 | 8 | 8 | 8 | 7.7 | 8 | 7 | 7.79 |
| Nassim Nicholas Taleb | economics | 8 | 7 | 8.3 | 8 | 7.7 | 7.3 | 9 | 7 | 7.73 |
| Alain Bertaud | economics | 8 | 7.3 | 8 | 8.3 | 7.3 | 7.3 | 8 | 7 | 7.72 |
| Abhijit Banerjee | economics | 8 | 7 | 7.3 | 8 | 8.3 | 7.3 | 8 | 7 | 7.69 |
| Paul Krugman | economics | 8 | 7 | 7.7 | 8 | 8 | 7 | 8 | 7 | 7.63 |
| Larry Summers | economics | 8 | 7.3 | 7 | 8 | 8 | 7 | 8 | 7 | 7.57 |
| Raj Chetty | economics | 8 | 7 | 7 | 8 | 8 | 7 | 8 | 7 | 7.52 |
| Simon Johnson | economics | 7.7 | 6.7 | 7 | 8 | 7.7 | 7 | 8 | 7 | 7.35 |
| Alan Taylor | economics | 7 | 6 | 6 | 7.3 | 7 | 6.3 | 7.3 | 6.7 | 6.63 |
| Michelle Dawson | hard_science | 9 | 8 | 9 | 9 | 9 | 8.3 | 9 | 8 | 8.73 |
| David Deutsch | hard_science | 9 | 8 | 9 | 9 | 9 | 8 | 9 | 8 | 8.68 |
| Alison Gopnik | hard_science | 9 | 8 | 9 | 9 | 8.3 | 8 | 8.7 | 7.7 | 8.56 |
| Steven Pinker | hard_science | 9 | 8 | 8 | 9 | 8.7 | 7.7 | 9 | 8 | 8.41 |
| Ed Boyden | hard_science | 8 | 7 | 9 | 8.3 | 8 | 8 | 8 | 7 | 8.05 |
| Philip Ball | hard_science | 8 | 7 | 8 | 8.3 | 8.7 | 8 | 8 | 7 | 8.01 |
| Michael Nielsen | hard_science | 8 | 7 | 8 | 8 | 8.3 | 8 | 8 | 7 | 7.90 |
| Paul Bloom | hard_science | 8 | 7 | 8 | 8 | 8 | 8 | 8 | 7 | 7.84 |
| Jess Wade | hard_science | 8 | 7 | 8 | 8.7 | 8 | 7.3 | 8.3 | 7.3 | 7.84 |
| Ezekiel Emanuel | hard_science | 8 | 7 | 8 | 8 | 7.3 | 7 | 8 | 7 | 7.56 |
| Atul Gawande | hard_science | 8 | 7 | 8 | 8 | 7 | 7.3 | 8 | 7 | 7.55 |
| Helen Castor | history | 8 | 7 | 8 | 9 | 8 | 8 | 9 | 8 | 8.00 |
| Ada Palmer | history | 8 | 7 | 8.3 | 8.7 | 8 | 7.7 | 9 | 8 | 7.95 |
| Roy Foster | history | 8 | 7 | 8 | 9 | 8 | 7.7 | 9 | 8 | 7.95 |
| Reza Aslan | history | 8 | 7 | 8 | 9 | 8 | 7.3 | 9 | 8 | 7.89 |
| Diarmaid MacCulloch | history | 8 | 7 | 8 | 9 | 8 | 7 | 9 | 8 | 7.84 |
| Adam Tooze | history | 8 | 7 | 8 | 8 | 8 | 7 | 9 | 8 | 7.68 |
| Paul Gillingham | history | 8 | 7 | 8 | 8.3 | 7.7 | 7 | 8.3 | 7.3 | 7.67 |
| Jill Lepore | history | 8 | 7 | 8 | 8 | 7 | 7 | 8 | 7 | 7.50 |
| Niall Ferguson | history | 8 | 7 | 8 | 8 | 7 | 7 | 8.7 | 7.7 | 7.50 |
| Jennifer Burns | history | 7 | 6 | 7 | 8 | 7 | 7 | 7 | 7 | 7.00 |
| Patricia Fara | history | 7 | 6 | 7 | 8 | 7 | 6 | 8 | 7 | 6.84 |
| Nate Silver | journalism_public | 8 | 7 | 8 | 8 | 8 | 8 | 8 | 7 | 7.84 |
| Larissa MacFarquhar | journalism_public | 8 | 7 | 8 | 8.3 | 8 | 7.7 | 8 | 7 | 7.84 |
| Ezra Klein | journalism_public | 8 | 7 | 8 | 8 | 8.3 | 7.3 | 8 | 7 | 7.79 |
| Andrew Sullivan | journalism_public | 8 | 7 | 8 | 8 | 7.7 | 8 | 8.3 | 7.3 | 7.78 |
| Annie Jacobsen | journalism_public | 8 | 7 | 8 | 8.7 | 7.7 | 7.3 | 8.3 | 7 | 7.78 |
| Barkha Dutt | journalism_public | 8 | 7 | 7.3 | 8.3 | 8 | 7.3 | 8.3 | 7.3 | 7.68 |
| Malcolm Gladwell | journalism_public | 8 | 7 | 8 | 8 | 7 | 8 | 8 | 7 | 7.66 |
| Ben Thompson | journalism_public | 8 | 7 | 7.7 | 8.3 | 7.7 | 7 | 8 | 7 | 7.62 |
| David Brooks | journalism_public | 8 | 7 | 8 | 8 | 7.3 | 7.3 | 8 | 7 | 7.61 |
| Andrew Ross Sorkin | journalism_public | 7 | 6 | 7 | 8 | 7 | 7.3 | 7.7 | 7 | 7.05 |
| Ben Westhoff | journalism_public | 7 | 6 | 7 | 8 | 7 | 7 | 7 | 7 | 7.00 |
| Cass Sunstein | law_policy | 8 | 7 | 8 | 8.3 | 7.7 | 8 | 8 | 7 | 7.83 |
| Jennifer Pahlka | law_policy | 8 | 7 | 8 | 8.7 | 7.3 | 8 | 8 | 7 | 7.83 |
| Leopoldo Lopez | law_policy | 8 | 7 | 8 | 8.3 | 8 | 7.3 | 8 | 7 | 7.79 |
| Samantha Power | law_policy | 8 | 7 | 8 | 8 | 8 | 7.3 | 8 | 7 | 7.73 |
| Ben Sasse | law_policy | 8 | 7 | 8 | 8.3 | 7.7 | 7.3 | 8 | 7 | 7.73 |
| Bruno Macaes | law_policy | 8 | 7 | 8 | 8 | 7 | 7.3 | 8 | 7 | 7.55 |
| Jamal Greene | law_policy | 8 | 7 | 7 | 8 | 8 | 7 | 8 | 7 | 7.52 |
| Rachel Harmon | law_policy | 8 | 7 | 7 | 8 | 8 | 7 | 7.7 | 7 | 7.52 |
| Tom Tugendhat | law_policy | 7.3 | 6.7 | 7 | 8 | 7.7 | 7 | 8 | 7.3 | 7.29 |
| Stanley McChrystal | law_policy | 7.3 | 6.3 | 7 | 8 | 7.3 | 7 | 7 | 6.7 | 7.17 |
| John O. Brennan | law_policy | 7 | 6 | 6 | 7 | 7.3 | 7 | 8 | 7 | 6.74 |
| Brian Koppelman | lit_arts | 8 | 7 | 8 | 8.7 | 8 | 8 | 8 | 7 | 7.95 |
| Camille Paglia | lit_arts | 8.3 | 7.3 | 9 | 8.3 | 7.3 | 7.3 | 9 | 8 | 7.94 |
| Andy Weir | lit_arts | 8 | 7 | 8 | 8.7 | 7.7 | 7.3 | 8 | 7 | 7.78 |
| Margaret Atwood | lit_arts | 8 | 7 | 8 | 8 | 8 | 7 | 8.3 | 7.3 | 7.68 |
| Dana Gioia | lit_arts | 8 | 7 | 8 | 8.3 | 7.7 | 7 | 8.7 | 7.7 | 7.67 |
| Alex Ross | lit_arts | 8 | 7 | 8 | 8 | 7.7 | 7 | 8.7 | 7.3 | 7.62 |
| Benjamin Moser | lit_arts | 8 | 7 | 8 | 8 | 7 | 7 | 8.3 | 7.3 | 7.50 |
| Cynthia Haven | lit_arts | 8 | 7 | 8 | 8 | 7 | 7 | 8 | 7 | 7.50 |
| Fuchsia Dunlop | lit_arts | 7.3 | 6.3 | 7.3 | 8.3 | 7.3 | 7 | 8.3 | 7 | 7.28 |
| Emily St. John Mandel | lit_arts | 7 | 6 | 7 | 7.7 | 7 | 7 | 7 | 7 | 6.95 |
| Ana Vidovic | lit_arts | 7 | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 6.68 |
| Elijah Millgram | philosophy | 9 | 8 | 9 | 9 | 9 | 9 | 9 | 8 | 8.84 |
| John Gray | philosophy | 9 | 8 | 9 | 9 | 9 | 8 | 9 | 8 | 8.68 |
| David Bentley Hart | philosophy | 9 | 8 | 8.7 | 9 | 9 | 8 | 9 | 8 | 8.63 |
| Amia Srinivasan | philosophy | 9 | 8 | 9 | 9 | 8.3 | 8 | 9 | 8 | 8.56 |
| Noam Chomsky | philosophy | 9 | 8 | 9 | 9 | 8 | 8 | 9 | 8 | 8.50 |
| Rebecca Kukla | philosophy | 8.7 | 7.7 | 8.7 | 8.7 | 7.7 | 8 | 8.7 | 7.7 | 8.22 |
| Kwame Anthony Appiah | philosophy | 8 | 7 | 8 | 8.3 | 9 | 8 | 8.3 | 7.3 | 8.07 |
| William MacAskill | philosophy | 8 | 7 | 8 | 8.3 | 8 | 8 | 8 | 7 | 7.89 |
| Rabbi David Wolpe | philosophy | 8 | 7 | 8 | 8 | 8 | 7 | 8 | 7 | 7.68 |
| Peter Singer | philosophy | 8 | 7 | 7.3 | 8.3 | 8 | 7.3 | 8 | 7 | 7.68 |
| Slavoj Zizek | philosophy | 8 | 7 | 8.3 | 8 | 7 | 7.7 | 8.7 | 7 | 7.66 |
| Agnes Callard | philosophy | 8 | 7 | 8 | 8 | 7.7 | 7 | 8 | 7 | 7.62 |
| Daniel Kahneman | social_science | 8.3 | 7.3 | 8.3 | 8.3 | 9 | 8 | 8.3 | 7.3 | 8.24 |
| Arthur Brooks | social_science | 8.7 | 7.7 | 8 | 9 | 8 | 7.7 | 9 | 8 | 8.17 |
| Harvey Mansfield | social_science | 8.3 | 7.3 | 8 | 9 | 7.3 | 8 | 8.3 | 7.3 | 7.99 |
| Philip E Tetlock | social_science | 8 | 7 | 8 | 8.7 | 8.3 | 7.7 | 8 | 7 | 7.95 |
| Daniel Carpenter | social_science | 8 | 7 | 7.7 | 8 | 8.3 | 7.7 | 8.3 | 7 | 7.79 |
| Jonathan Haidt | social_science | 8 | 7 | 8 | 8.3 | 8 | 7 | 8 | 7 | 7.73 |
| Henry Farrell | social_science | 8 | 7 | 8 | 8 | 7.7 | 7.7 | 8.3 | 7.3 | 7.73 |
| Chris Blattman | social_science | 8 | 7 | 7.3 | 8 | 8.3 | 7.3 | 8 | 7 | 7.69 |
| Joseph Henrich | social_science | 8 | 7 | 8 | 8.3 | 7.3 | 7 | 8 | 7 | 7.61 |
| Ashley Mears | social_science | 7.7 | 6.7 | 7.7 | 8 | 7 | 7 | 8 | 7 | 7.33 |
| Eric Kaufmann | social_science | 7.3 | 6.3 | 7 | 8 | 7.3 | 7 | 7.7 | 6.7 | 7.17 |
| Audrey Tang | tech_entrepreneurship | 8.3 | 7.7 | 8.7 | 8.7 | 8 | 8.3 | 9 | 8 | 8.27 |
| Blake Scholl | tech_entrepreneurship | 8 | 7.7 | 9 | 9 | 8 | 8 | 8 | 7 | 8.27 |
| Marc Andreessen | tech_entrepreneurship | 8 | 7 | 8 | 9 | 8 | 8 | 8 | 7 | 8.00 |
| Vitalik Buterin | tech_entrepreneurship | 8.3 | 7.3 | 8 | 8.3 | 8 | 7.3 | 8.3 | 7.3 | 7.90 |
| Sam Altman | tech_entrepreneurship | 8 | 7 | 8 | 8.3 | 7.7 | 7.3 | 8 | 7 | 7.73 |
| Patrick Collison | tech_entrepreneurship | 8 | 7 | 8 | 8.7 | 7.7 | 7 | 8 | 7 | 7.73 |
| Chris Dixon | tech_entrepreneurship | 8 | 7 | 8 | 8.3 | 7.3 | 7 | 8 | 7 | 7.61 |
| Daniel Gross | tech_entrepreneurship | 8 | 7 | 8 | 8 | 7 | 7.3 | 8 | 7 | 7.55 |
| Balaji Srinivasan | tech_entrepreneurship | 8 | 7 | 8 | 8 | 7 | 7 | 8 | 7 | 7.50 |
| Brian Armstrong | tech_entrepreneurship | 8 | 7 | 8 | 8 | 7 | 7 | 7.3 | 7 | 7.50 |
| David Rubenstein | tech_entrepreneurship | 7 | 6 | 6.3 | 7.3 | 7 | 7 | 7.7 | 7 | 6.79 |