Replicants Are Replicating

In this week's Research Roundup, let's simulate humanity.
Means, Medians, & Modes...Oh My!
What happens when you use an LLM to model 216 different humans? As it turns out, you get a surprisingly good model of the collective behavior of a market.
Endowed with hundreds of "distinct, real-world investor personas", LLM agents end up disagreeing about "S&P500 news headlines" along predictable lines: political divides spiked around elections, income divides around tax cuts, and race-based divides around the George Floyd protests.
This simulated disagreement wasn't just a party trick. It correlated "strongly with abnormal trading volume" and predicted market returns, outperforming existing measures.
While LLMs may not capture the messy, dynamic generator function of a single human mind, they do seem to capture the aggregate distribution of our collective (textual) behavior. By giving the model personas, the researchers are sampling from different parts of that distribution: a far better approximation of a complex system than just asking for the "average" opinion.
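To make that concrete, here is a minimal sketch of the persona-conditioning idea, assuming an OpenAI-style chat API. The persona texts, the model name, and the disagreement measure (the spread of bullish/bearish ratings) are illustrative stand-ins, not the paper's actual pipeline.

```python
# Minimal sketch: persona-conditioned sampling of an LLM's "opinion distribution".
# Assumes an OpenAI-style chat API; personas, model name, and the disagreement
# metric (std. dev. of ratings) are illustrative, not the paper's pipeline.
from statistics import pstdev
from openai import OpenAI

client = OpenAI()

PERSONAS = [
    "a 62-year-old retired teacher with a conservative pension portfolio",
    "a 28-year-old tech worker who trades options on weekends",
    "a small-business owner worried about upcoming tax changes",
]

def rate_headline(persona: str, headline: str) -> int:
    """Ask the model, in character, for a -5 (very bearish) to +5 (very bullish) rating."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system", "content": f"You are {persona}. Answer with a single integer from -5 to 5."},
            {"role": "user", "content": f"How will this headline affect the S&P 500? {headline}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

def disagreement(headline: str) -> float:
    """Dispersion of persona ratings: a crude proxy for simulated investor disagreement."""
    ratings = [rate_headline(p, headline) for p in PERSONAS]
    return pstdev(ratings)

print(disagreement("Congress passes surprise cut to capital gains taxes"))
```

The paper's version aggregates across hundreds of personas per headline; the point of the sketch is just that disagreement falls out of the spread of persona-conditioned answers rather than a single "average" query.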
Perhaps we can just simulate the next 10 years of market returns on a server farm overnight and wake up in the future.
Quality Is That Which Is Missing
Humans are irrational and angry and tribal. Let's build a digital world of AIs and let them show us how to get along. Yes, I'm talking about the big dream: let's simulate Twitter!
Some researchers did just that, creating a "generative social simulation" that "embeds [LLMs] within Agent-Based Models to create socially rich synthetic platforms...where agents can post, repost, and follow others."
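For a sense of the plumbing, here is a minimal sketch of the LLM-inside-an-agent-based-model pattern. The class names, the engagement-ranked feed, and the stubbed-out decision call are my own illustrative choices, not the paper's implementation; in the real system the decide() step would be an LLM prompted in character.

```python
# Minimal sketch of the "LLM inside an agent-based model" pattern: each agent sees a
# ranked feed and decides whether to post, repost, or pass. The decide() call is a
# placeholder for an in-character LLM prompt; everything here is illustrative.
import random
from dataclasses import dataclass, field

@dataclass
class Post:
    author: int
    text: str
    reposts: int = 0

@dataclass
class Agent:
    uid: int
    persona: str
    following: set[int] = field(default_factory=set)

    def decide(self, feed: list[Post]) -> tuple[str, Post | None]:
        # Placeholder for an LLM call along the lines of:
        #   "You are {persona}. Here is your feed: ... Do you POST, REPOST, or PASS?"
        # We pick randomly so the sketch runs without an API key.
        action = random.choice(["post", "repost", "pass"])
        target = random.choice(feed) if feed and action == "repost" else None
        return action, target

def run(agents: list[Agent], steps: int = 10) -> list[Post]:
    posts: list[Post] = []
    for _ in range(steps):
        for agent in agents:
            feed = sorted(posts, key=lambda p: p.reposts, reverse=True)[:5]  # engagement ranking
            action, target = agent.decide(feed)
            if action == "post":
                posts.append(Post(agent.uid, f"{agent.persona} shares a take"))
            elif action == "repost" and target is not None:
                target.reposts += 1
                agent.following.add(target.author)  # crude homophily: follow whoever you repost
    return posts

world = [Agent(i, p) for i, p in enumerate(["progressive activist", "conservative retiree", "apolitical lurker"])]
timeline = run(world)
print(f"{len(timeline)} posts; most reposted has {max((p.reposts for p in timeline), default=0)} reposts")
```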
The results were...not encouraging. Without any special prompting, the AI agents organically recreated the digital misery Musk knows and loves:
- "partisan echo chambers",
- the concentration of "influence among a tiny elite", and
- the "amplification of the polarized voices".
These aren't bugs; they are emergent features of social networks' core architecture.
Simulating the failures of social media isn't the end of the story. If the bots are screaming at each other just like us, the simulation can test interventions to fix the problems. In this case, researchers tested six commonly proposed interventions. Here's what they found:
- Chronological ordering: "removing engagement-based ranking" reduced inequality (fewer social media celebrities), but it actually made partisanship worse while reducing engagement.
- Downplaying dominant voices: promoting "posts with fewer reposts also reduced inequality" but it "had no measurable effect on partisan amplification or homophily."
- Boosting out-partisan content: "had little impact across any outcome dimension." Disappointing.
- Hiding social statistics and hiding biographies: people (and bots) do rely on social signals to pick what to share (lazy, cowardly, and terrible), and so hiding these does increase diversity of sharing. But unfortunately neither intervention reduces "homophily, inequality, and partisan amplification".
- Bridging attributes: this intervention promotes "high-quality, constructive content" (a rough sketch of the idea follows this list). Only bridging attributes broke the "link between partisanship and engagement". It also "modestly increased cross-partisan connections". But you can't have it all: the intervention also increased inequality as "visibility became concentrated among a narrow set of high-scoring posts".
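As promised, a rough sketch of what a bridging-style re-ranker could look like, contrasted with raw engagement ranking. It assumes each post's engagement is broken out by the leaning of whoever engaged, and the scoring rule (reward cross-partisan approval rather than raw volume) is illustrative, not the formula from the paper.

```python
# Minimal sketch of a bridging-style re-ranker vs. pure engagement ranking.
# Assumes per-post engagement is split by the leaning of who engaged; the scoring
# rule is an illustrative stand-in, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Post:
    text: str
    left_likes: int
    right_likes: int

    @property
    def engagement(self) -> int:
        return self.left_likes + self.right_likes

    @property
    def bridging_score(self) -> float:
        # High only when BOTH sides engage: the smaller side caps the score.
        return min(self.left_likes, self.right_likes) / (self.engagement + 1)

posts = [
    Post("Outrage bait that one side loves", left_likes=900, right_likes=10),
    Post("Constructive post both sides like", left_likes=120, right_likes=110),
]

by_engagement = sorted(posts, key=lambda p: p.engagement, reverse=True)
by_bridging = sorted(posts, key=lambda p: p.bridging_score, reverse=True)

print("Engagement feed:", [p.text for p in by_engagement])
print("Bridging feed:  ", [p.text for p in by_bridging])
```

The outrage-bait post wins on engagement but loses on bridging, which also hints at the inequality trade-off the researchers found: only a narrow set of posts scores well with both sides, so visibility concentrates on them.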
I choose "bridging attributes". Inequality of attention isn't great, but it's a different class of problem than a society that can no longer agree on a shared reality. I'm far more concerned about the erosion of pluralism and the rise of toxic polarization than I am about some users getting more retweets than others.
Replicants Are Mean
AI is a perfect student of psychology, and that's a problem.
Using LLMs to simulate research participants, a new study attempted to replicate "156 psychological experiments from top social science journals". The simulations replicated the main effects of most studies with ease (73-81% success).
From there things get messy because the AIs were too good. They consistently "produced larger effect sizes than humans" and reproduced more complex interaction effects only 46-63% of the time. And when original studies found no significant effect, the LLM simulations produced one a remarkable 68-83% of the time.
And then there's the tell: the AI failed spectacularly on the very topics that are most human, the nuanced, socially sensitive ones involving race, gender, and ethics. So what's going on here?
This pattern of results suggests an LLM isn't actually simulating a person so much as a mean. It has learned the clean, statistically average signal from the mountain of text it was trained on, but it has none of the noisy, messy, lived experience that defines an individual.
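A back-of-the-envelope illustration of why that inflates effect sizes (the numbers below are invented, not from the study): Cohen's d divides the mean difference by the pooled standard deviation, so a low-variance "mean" participant mechanically produces a bigger d even when the underlying difference is identical.

```python
# Invented numbers, not from the study: averaging away individual noise inflates
# Cohen's d because d = mean difference / pooled standard deviation.
def cohens_d(mean_diff: float, pooled_sd: float) -> float:
    return mean_diff / pooled_sd

human_d = cohens_d(mean_diff=0.5, pooled_sd=1.0)   # messy individuals: d = 0.5
llm_d   = cohens_d(mean_diff=0.5, pooled_sd=0.25)  # low-variance "mean" replicants: d = 2.0

print(human_d, llm_d)
```

A participant this free of noise may also help explain why the simulations so often conjured significant effects where the original studies found none.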
It represents the static, aggregate distribution implied by a function of our collective behavior, not the dynamic reality implied by a collection of our individual functions.
This makes LLMs an incredible new tool for rapidly pilot-testing the "average" human response, a massive value-add for human research of all types. But for understanding the complex, sensitive, and beautifully idiosyncratic reality of individual human lives? For that, we still need humans.