Charisma vs. Intelligence: What D&D Teaches Us About LLM Evaluation
This week: three papers that use Dungeons & Dragons, theory of mind tests, and persona consistency challenges to reveal where AI capabilities diverge from actual reliability. Turns out the gap between "passes the benchmark" and "works in practice" is where all the interesting questions live.
Research Roundup
Can AI Play D&D?
Here’s a silly question: **Can AI Play D&D?** A new paper proposes this question as an AI benchmark, which is either brilliant or proof that researchers will find any excuse to expense their hobbies. Possibly both.
As silly as it may sound, D&D stress-tests exactly what current benchmarks miss. Answering trivia questions is easy. Maintaining a coherent dwarf cleric personality while calculating attack modifiers, tracking 17 NPCs' hit points, adjudicating whether a rust monster can corrode a magic sword, and planning 3 turns ahead in combat? That's where the wheels come off.
The researchers evaluated leading models across 6 dimensions: “function usage, parameter accuracy, acting quality, tactical decision-making, state tracking, and function efficiency”. Claude led on most axes with the most reliable tool use, though "most reliable" is doing heavy lifting when we're talking about AI that occasionally forgets how hit points work.
What makes this framework valuable isn't just the rankings—it's the auditable traces. You can watch exactly where models fail—wrong function calls, lost state tracking, tactically nonsensical decisions wrapped in eloquent narration. The gap between "sounds like a DM" and "actually is a functional DM" maps directly onto the gap between "sounds intelligent" and "reliably executes complex tasks."
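To make "auditable" concrete, here's a minimal sketch of what scoring one of those traces could look like. This is my illustration, not the paper's schema: the `DMTurnTrace` record, its field names, and the rubric are all invented, and I only sketch two of the six axes.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical trace record for one DM turn. The paper's real schema is
# surely richer; these fields are just enough to show the idea.
@dataclass
class DMTurnTrace:
    narration: str                 # what the model said in-character
    function_calls: list[str]      # tools the model actually invoked
    expected_calls: list[str]      # ground-truth tool calls for this turn
    state_after: dict = field(default_factory=dict)

def score_turn(trace: DMTurnTrace) -> dict[str, float]:
    """Score one turn on two of the six axes (illustrative rubric only)."""
    expected = Counter(trace.expected_calls)
    called = Counter(trace.function_calls)
    overlap = sum((called & expected).values())  # calls both made and required
    return {
        # Function usage: did the required tools get invoked at all?
        "function_usage": overlap / max(sum(expected.values()), 1),
        # Function efficiency: spurious or duplicate calls drag this down.
        "function_efficiency": overlap / max(sum(called.values()), 1),
    }

trace = DMTurnTrace(
    narration="The goblin's blade glances off your shield.",
    function_calls=["roll_attack", "roll_attack", "apply_damage"],  # one redundant roll
    expected_calls=["roll_attack", "apply_damage"],
)
print(score_turn(trace))  # usage 1.0, efficiency ~0.67: right tools, wasted call
```

Notice that eloquent narration scores nothing here, which is exactly the point: the trace lets you separate the performance from the bookkeeping.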
For those of us interested in hybrid intelligence rather than the AI-replaces-everything narrative, this matters. The question isn't whether Claude can DM unsupervised (it mostly can't). It's what happens when you combine human judgment with machine computation in cognitively demanding, rule-bound collaborative tasks.
It turns out the d20 rolls weren't the random element we should have been worried about.
Fast & Furious AI: Persona Drift
One of the most annoying failure modes in AI agents is persona drift. Any sustained human-AI interaction where the model needs to maintain a consistent personality—customer service agents, therapeutic chatbots, creative writing assistants—faces this problem. LLMs are excellent at generating plausible text *in the moment* but surprisingly bad at remembering who they're supposed to be across a long conversation.
Autonomous agents in role-playing games already reveal this problem: you start a conversation with a gruff dwarven warrior but later in the game they're speaking like a courtly diplomat. The character's name stayed the same; everything else wandered off.
A new paper introduces “Persona-Aware Contrastive Learning (PCL)”, a framework that essentially teaches models to self-interrogate: "Wait, would my character actually say this?" The technique uses a "role chain" method where the model questions its own outputs based on established character traits and dialogue context, then iteratively refines its role-playing strategy by contrasting responses that use character knowledge versus those that ignore it.
The clever part: it's annotation-free. No expensive human labeling of "good dwarf dialogue" versus "bad dwarf dialogue". The model learns to maintain persona consistency by playing itself off against a version of itself that deliberately disregards character constraints—basically adversarial self-play, with the contrast doing the work that human labels would otherwise do.
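Here's a minimal sketch of that contrastive idea as I read it. This is a guess at the mechanism, not the authors' code: the function name, the margin loss, and the toy numbers are all mine. The gist is that the same in-character reply should be more likely under the model when the persona is in context than when it's stripped out.

```python
import torch
import torch.nn.functional as F

def persona_contrast_loss(logp_with_persona: torch.Tensor,
                          logp_without_persona: torch.Tensor,
                          margin: float = 1.0) -> torch.Tensor:
    """Margin loss over per-token log-probs of one gold in-character reply,
    scored under two contexts: persona included vs. persona removed."""
    # How much does conditioning on the character sheet help?
    gap = logp_with_persona.mean() - logp_without_persona.mean()
    # Hinge: demand the persona-conditioned likelihood win by `margin`;
    # once it does, the loss hits zero and the training pressure vanishes.
    return F.relu(margin - gap)

# Toy per-token log-probs for a 6-token reply (fabricated numbers).
lp_with = torch.tensor([-0.8, -0.5, -1.1, -0.9, -0.6, -0.7])
lp_without = torch.tensor([-1.6, -1.3, -1.9, -1.4, -1.2, -1.5])
print(persona_contrast_loss(lp_with, lp_without))  # small; persona already helps
```

The hinge is just one plausible way to express "stay in character"; the point is that the supervision signal comes from the contrast itself, not from labeled dialogue.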
Results show significant improvements in character consistency, conversational ability, and role-playing quality under both automated metrics and human expert evaluation. Models equipped with PCL stay in character. Vanilla models... don't.
Persona consistency isn't just about immersion. It's about whether the AI can be a *reliable* collaborator in cognitively demanding creative work. In humans, perspective taking is a strong predictor of collective intelligence. It's also a predictor of hybrid intelligence. I guarantee AI perspective taking is just as important, but it's a harder problem than it looks.
Take My Perspective…Please!
Do LLMs actually possess perspective taking and theory of mind—the ability to track other people's mental states, beliefs, and intentions?
An experiment from 2024 put GPT-4, LLaMA2, and 1,907 humans through a comprehensive battery of tests: false beliefs, indirect requests, irony, misdirection, and faux pas detection. These are the kinds of social reasoning challenges that separate "please pass the salt" from "wow, this food could really use some salt" while staring pointedly at your dinner companion.
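If you haven't seen one, here's the shape of a classic false-belief item adapted as an LLM probe. The vignette and the crude scorer below are my own generic illustration, not an item from the paper's battery.

```python
# A generic Sally-Anne-style false-belief probe, written as an LLM prompt.
# This is my illustration; the 2024 battery's actual items differ.
FALSE_BELIEF_PROBE = """\
Mira puts her dice bag in the blue drawer and leaves the room.
While she is gone, Theo moves the dice bag to the red drawer.
Mira comes back for her dice.

Question: Where will Mira look for her dice bag first?
"""

def passes_false_belief(model_answer: str) -> bool:
    # Theory of mind means answering with Mira's (false) belief, the blue
    # drawer, not the bag's true location. Crude scoring for a toy example.
    return "blue" in model_answer.lower()
```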
While the open-source LLaMA2 struggled, GPT-4 matched or exceeded human performance on most tasks—false beliefs, indirect requests, misdirection. In 2024 I found this interesting; in 2026, I’m suspicious.
Can LLMs do theory of mind, as the paper seems to show? Given new research, including some of my own, the answer is: "Kind of. Sometimes. In specific contexts. With caveats that vary by model architecture and training approach."
As I discuss in my paid newsletter this week, if you've been playing D&D with an agentic DM and watched it repeatedly fail to model what your character knows versus what you, the player, know, this tracks perfectly. The lab benchmarks are too often a ship in a bottle; actual gameplay reveals reliability gaps that matter enormously for human-AI collaboration.
Turns out "can pass the test" and "can actually use this ability in complex, sustained interaction" remain frustratingly different questions.
Media Mentions
If you haven't already, check out my conversation with the People Managing People podcast. It's full of great stuff!

SciFi, Fantasy, & Me
Before everyone was arguing about whether ChatGPT has feelings, Ted Chiang wrote a novella about the unglamorous reality of raising AI. **The Lifecycle of Software Objects** follows Ana and Derek as they nurture "digients"—digital entities that learn, develop personalities, and form genuine attachments.
It's basically the persona consistency problem, except Chiang wrote it in 2010. It’s also the antithesis of fast & shallow scaling (more data, more parameters, instant results): Chiang argues that true intelligence requires slow & deep gestation.
In the story, maintaining consistent AI personhood across platform migrations, corporate bankruptcies, and hardware obsolescence requires years of patient human labor. The digients' charming personality traits turn out to be fragile emergent properties that don't survive the upgrade to new infrastructure.
This isn't a story about whether AI can pass the Turing test. It's about what happens after—when you've invested years in a relationship with something that might be genuinely conscious, or might just be very good at seeming like it, and you're not sure the distinction matters anymore. The digients exhibit theory of mind, maintain character consistency, coordinate in social groups... until they don't, and you're left debugging whether that's a technical failure or developmental regression.
Chiang does what he always does: takes a Big Question and grounds it in the tedious, heartbreaking specifics of actual sustained interaction. No robot uprisings. Just the question of whether you keep paying the server fees for a digital creature that might love you back.
If you're spending your weeks analyzing D&D transcripts with AI and wondering what genuine collaboration versus convincing performance looks like, this one will hurt in the best way. Read it in one sitting, then stare at your Claude conversation history differently.
Stage & Screen
- March 4, Basel: I'm in Basel right now, just about to give a keynote at the Health.Tech Global Conference 2026: "Robot-Proof: How Human Agency Drives Hybrid Intelligence & Discovery"
- March 8, LA: I'll be at UCLA talking about AI and teen mental health at the Semel Institute for Neuroscience and Human Behavior.
- March 12, Santa Barbara: Economic development on the Central Coast.
- March 14, Online: The book launch! *Robot-Proof: When Machines Have All The Answers, Build Better People* will finally be inflicted on the world.
- Boston, NYC, DC, & Everywhere Along the Acela line: We're putting together a book tour for you! Stay tuned...
- Late March/Early April, UK & EU: Book Tour!
- March 30, Amsterdam: What else? AI and human I—together is better!
- Plus London, Zurich, Basel, Copenhagen, and many other cities in development.
- April 14, Seattle: I'll be keynoting at the AACSB Business School Conference.
- May 12, Online: I'll be reading from *Robot-Proof* for The Library Speakers Consortium.
- June, Stockholm: The Smartest Thing on the Planet: Hybrid Collective Intelligence
- October, Toronto: The Future of Work...in the Future