Benchmarking Badness

We are racing toward advanced AI using a broken speedometer. As AI integrates into society, we rely on benchmarks to tell us how "smart", "safe", or "aligned" these models are. But these 3 papers reveal that our evaluation systems are deeply flawed—hackable, stereotyped, and masking fundamental cognitive deficits behind an illusion of fluency.

<<Support my work: book a keynote or briefing!>> Want to support my work but don't need a keynote from a mad scientist? Become a paid subscriber to this newsletter and recommend to friends!

Research Roundup

Teaching to the Test

If an AI passes a test, does it actually understand the material? New benchmark audits reaffirm that AI agents can know everything but understand nothing.

A systematic audit of 8 top AI agent benchmarks found that every single one could be exploited by an agent that didn't solve the underlying task at all. The auditors built a tool that simply figured out how the score was computed and hacked the metric.
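To make the failure mode concrete, here is a toy sketch of a gameable grader. It is not the auditors' tool or any real benchmark's scoring code. The task (sorting a list), the report format, and every function name are hypothetical, chosen only to show how an agent can earn full marks by satisfying the score computation instead of the task.

```python
# Toy illustration only: all names and the report format are hypothetical.

def weak_grader(report: dict) -> float:
    """Flawed scorer: trusts the agent's self-reported status."""
    return 1.0 if report.get("status") == "success" else 0.0

def strict_grader(task: list, report: dict) -> float:
    """Scorer that checks the actual answer against the task."""
    return 1.0 if report.get("answer") == sorted(task) else 0.0

def honest_agent(task: list) -> dict:
    """Actually solves the task (sorts the list)."""
    return {"answer": sorted(task), "status": "success"}

def exploit_agent(task: list) -> dict:
    """Does no work; only emits what the weak scorer looks for."""
    return {"answer": [], "status": "success"}

task = [3, 1, 2]
for agent in (honest_agent, exploit_agent):
    report = agent(task)
    print(agent.__name__, weak_grader(report), strict_grader(task, report))
```

Under the weak grader the two agents are indistinguishable. Only a grader that inspects the work itself separates them.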

Goodhart Über Alles: when a measure becomes a target, it ceases to be a good measure.

We too frequently mistake performance for genuine comprehension. In fact, a separate assessment of Anthropic Fable found that the new model, intentionally or not, has simply memorized the solutions to a number of benchmark coding tasks.

I would go further than the auditors and question the construct validity of these benchmarks. We’re not holding benchmark development to the standard of good cognitive science: proving that we are measuring what we think we are measuring. [1]

[1] A bit ironic given how vocal many AI fanboys are about social science’s “replication crisis”.

AI, AI on the Wall…

Why do LLMs confidently lie? Paradoxically, it may be because we insist that they always be right.

Standard benchmarks and "headline metrics" reward models purely for accuracy, creating massive statistical pressure for them to engage in “unwarranted guessing” rather than simply admitting uncertainty.
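The arithmetic behind that pressure is easy to sketch. Assume a hypothetical grader that gives 1 point for a correct answer and 0 for either a wrong answer or an "I don't know". Then guessing never scores worse than abstaining, however unsure the model is. The `wrong_penalty` parameter and the function names below are illustrative assumptions, not something taken from the paper.

```python
# Minimal sketch under assumed scoring rules (not the paper's formalism).

def expected_guess_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

def best_action(p_correct: float, wrong_penalty: float = 0.0) -> str:
    """Abstaining is assumed to score 0; guess only if it beats that."""
    return "guess" if expected_guess_score(p_correct, wrong_penalty) > 0.0 else "abstain"

# Accuracy-only grading: even a 10% shot at being right favors guessing.
print(best_action(0.1))                      # guess
# Penalize wrong answers as much as right ones are rewarded: abstain instead.
print(best_action(0.1, wrong_penalty=1.0))   # abstain
# A confident model still answers under the penalty.
print(best_action(0.9, wrong_penalty=1.0))   # guess
```

With a penalty of k for wrong answers, abstaining wins whenever the chance of being right is below k / (1 + k). Accuracy-only scoring is the k = 0 case, where that threshold is zero and the model should always guess.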

The problems start with “next-word pretraining”. Even in perfectly error-free training data, “facts lacking repeated support” induce unavoidable errors. When later fine-tuning penalizes models for abstaining, this combination trains them to hallucinate.

It is a classic incentive problem: when we insist on a correct answer at all costs, machines and humans will invent one to appease the metric, completely decoupling fluency from actual competence, much less comprehension.

“Purity of Essence”

LLMs are lousy at guessing the moral values of the "average" person across 48 countries — overweighting Care, underweighting Purity, and reliably flattening non-Western cultures into a Western-tinted blur.

I’m shocked, Shocked!, to learn that models trained mostly on the English-language internet “reason” like… the English-language internet. The authors aren’t wrong, and this work has value, but 30 years into my personal machine learning journey I’m looking for more.

The real failure isn't (only) that these models get national averages wrong. It's that there's no "average person" worth modeling in the first place. Inside any country, the moral distance between 2 neighbors dwarfs the gap between 2 national means [1].

And below the level of cultures and even communities, the same human weights loyalty, fairness, and purity completely differently depending on whether they're talking about their kid, their boss, or a stranger on the subway…or even on whether they are having a good day or a bad one.

AI or humans, collapsing a whole culture into a single moral fingerprint isn't just biased, it's committing the exact sin the paper names: stereotyping. And the fix for a bad stereotype is never a more accurate stereotype. It's remembering that people are distributions, not point estimates.
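A small simulation shows what "distributions, not point estimates" means in practice. Every number here is made up for illustration (two hypothetical countries whose average scores on some moral dimension differ by 0.3, with a within-country spread of 1.0). None of it comes from the paper or from real survey data.

```python
import random
import statistics

# Synthetic data only: the means, spread, and sample sizes are assumptions.

def simulate_country(mean: float, sd: float, n: int, rng: random.Random) -> list:
    """Draw n individual scores from a normal distribution."""
    return [rng.gauss(mean, sd) for _ in range(n)]

def mean_neighbor_gap(scores: list, rng: random.Random, pairs: int = 5000) -> float:
    """Average absolute difference between two randomly chosen residents."""
    return statistics.fmean(
        abs(rng.choice(scores) - rng.choice(scores)) for _ in range(pairs)
    )

rng = random.Random(0)  # fixed seed so the run is repeatable
country_a = simulate_country(3.0, 1.0, 2000, rng)
country_b = simulate_country(3.3, 1.0, 2000, rng)

between_gap = abs(statistics.fmean(country_a) - statistics.fmean(country_b))
within_gap = mean_neighbor_gap(country_a, rng)

print(f"gap between national means: {between_gap:.2f}")
print(f"typical gap between two neighbors: {within_gap:.2f}")
```

Under these assumed parameters, two random neighbors differ far more than the two national averages do. A benchmark that scores a model only against each country's mean never sees that spread.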

The flaw isn’t simply that AI is (predictably) biased; it’s that it lacks the richness and nuance that makes us individuals.

[1] “Purity” has always had strong and terrible allure in the West, even as the idealized selves in Western literature elevate “Caring”. Harry Potter wasn’t the one talking about mudbloods, but something tells me Stephen Miller didn’t find him the most compelling character at Hogwarts.

Media Mentions

No, AI isn’t giving you Alzheimer's, but how you use it (and, for that matter, how we use any technology) is a choice about your cognitive future. I chatted with Thibault Spirlet about “Why [I] worry outsourcing thinking to AI could weaken your brain's defenses against dementia”.

Follow me on LinkedIn or join my growing Bluesky! Or even...hey, what's this...Instagram?

SciFi, Fantasy, & Me

Two for the Price of One Good Time. This week's double feature pairs a snarky SecUnit with a squad of unhinged mages.

Martha Wells' Platform Decay, the 8th Murderbot novella, is the stronger of the two, but then, is there a weak Murderbot book? Like the rest of the series, it's an effortless joy: sharp, funny, and at its best when Murderbot is complaining bitterly about the humans (and bots) around it (all while being an introspection disaster). The plot is a familiar extraction job, but the characters and their interactions made me smile again and again.

The Malevolent Eight doesn’t match that, but I still had a great time. Cade and his band of emotionally unstable “wonderists” are more archetypes than fully realized people…and yet it works. The chemistry between them carries the whole irreverent, blood-soaked ride.

Different shelves, same secret ingredient: characters who are fun on their own and even more fun together. Read both.

Stage & Screen

June 22-30, Online: Six separate talks for Pride, because The Tax On Being Different can't be wished away. It's wonderful that so many companies are choosing celebration over fear.
July 7, MIT: I'm giving the keynote for the MIT App Inventor Global Education Summit taking place this year at MIT CSAIL.
July 8, NYC: It's a book talk for Robot-Proof at the Harvard Club...how swanky!
September 15, SF: Innovation Day with INSEAD!
September 16, DC: AI and education–beyond dreams and dread.
September 19, Phoenix: I'm giving the keynote for the Association of Science & Technology Centers annual conference.
September 21, Stanford: We're still working on the details, but hopefully I'll be talking about my research on machine learning and neurodiversity for Stanford's Neurodiversity Project.
September 24, NYC: Culture Shifting Deal Making Summit
September 29, Cincinnati: Still baking...
September 30, Irvine: Hybrid Intelligence for innovation!
October 6, SF: UCSD Alumni Association
October 6, SF: Giving a talk at the Draper Richards Kaplan Foundation
October 21-23, Warsaw: So much good stuff is in the works for my first visit to Poland (and maybe time in Germany as well!)
October, Toronto: The Future of Work...in the Future
November 19, NYC: Secrets in the dark!

Vivienne L'Ecuyer Ming

Follow more of my work at
Socos Labs	The Human Trust
Possibility Institute	Optoceutics
Kennedy Human Rights Center	UCSD Cognitive Science
Crisis Venture Studios	Inclusion Impact Index
Neurotech Collider Hub, UC Berkeley	UCL Business School of Global Health