Fifty Percent More of the Same

The argument for AI agents usually starts with output. The same team can produce more work, move more drafts through the system, test more variants, and get more done without adding headcount. It’s a useful measure, which is why it shows up so often in budget conversations and executive decks.

A new field experiment gives that metric some support, but it also shows why output alone can be a misleading way to judge the work.

(A note: this matters more now than it would have a year ago. As tokens get scarcer and more expensive, it matters more where pushing work through AI actually pays off, and the study below is about what that volume can cost in variety. I’ll have a longer position paper on the economics later this week.)

Harang Ju at Johns Hopkins and Sinan Aral at MIT built a platform, called Pairit, for assigning people to teams and having them produce real ad creative under experimental conditions. They used it to compare human-human teams with human-AI teams across 2,234 participants and 11,024 ads for a think tank.

The study didn’t stop with internal scoring or lab ratings. The researchers ran the ads on X, then measured their performance across roughly 5 million real impressions.

The productivity result was clear. Human-AI teams produced 50 percent more ads per worker, and their copy was rated higher. If the study stopped there, the finding would be easy to summarize: more output from the same team, and better text.

But that’s where the cleaner story breaks down.

The human-AI teams produced more ads, but the ads were also more alike. The paper calls this “diversity collapse.” For ad creative that’s a real cost; more output only helps if the additional work gives you more to learn from, more angles to test, or more ways to find something the market responds to. If the extra volume clusters around the same ideas, the count goes up without covering new ground.

The split was also visible in the work itself. Human-AI teams produced better text, while human-human teams produced better images. The authors call this a “jagged frontier,” and that seems like the right framing. AI improved part of the team’s output, but which part depended on what you measured.

When the ads went into the field, the same split carried over. The stronger text from the human-AI teams helped click-through rates, while the stronger images from the human-human teams helped cost-per-click. Those effects roughly offset each other, which left overall ad performance about the same across team types.

The human-AI teams did produce 50 percent more, so that part of the productivity story held. It just didn’t translate into better performance in the market.

The timing of this study is important because the economics around AI are starting to change. When tokens feel cheap and unlimited, the temptation is to use AI anywhere it can increase output. As that assumption weakens, the better question is where AI actually belongs in the process. Some work may benefit from speed and scale. Some work may depend more on judgment, variation, and human shaping. Cost and quality are the usual terms of that trade-off. This study adds another: the range of work a team produces.

Volume is the easy part. Every tool reports it: more drafts, more variants, more tickets, more responses. Variety is harder to see, and almost no one has a dashboard that warns them when the team’s work is becoming more self-similar. Productivity gains show up immediately, while the cost of lost diversity tends to appear later, when a competitor’s feed looks like yours, or when the work stops standing out and no one can say exactly when that happened.
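
For a sense of what such a warning could look like, here is a minimal sketch of a self-similarity check. It is not the measure the paper uses. It assumes that word overlap between drafts (Jaccard similarity) is a good-enough stand-in for "sameness," and the function names and example ad lines are invented for illustration.

```python
from itertools import combinations


def word_set(text: str) -> set[str]:
    """Lowercase a draft and reduce it to its set of distinct words."""
    return {w.strip(".,!?;:\"'()") for w in text.lower().split()} - {""}


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words two drafts have in common (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_similarity(drafts: list[str]) -> float:
    """Average Jaccard similarity over every pair of drafts in a batch."""
    sets = [word_set(d) for d in drafts]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0  # fewer than two drafts: nothing to compare
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented example batches, not data from the study.
varied = [
    "Read the new report on housing costs",
    "Why are rents rising faster than wages?",
    "Three charts that explain the budget gap",
]
clustered = [
    "Read the new report on housing costs",
    "Read our new report on housing costs",
    "Read the latest report on housing costs",
]

print(f"varied:    {mean_pairwise_similarity(varied):.2f}")
print(f"clustered: {mean_pairwise_similarity(clustered):.2f}")
```

Tracked batch over batch next to the output count, a number like this would at least show when volume is rising while the work converges. Word overlap is crude, and a real version would probably need something closer to meaning than vocabulary.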

I’d be careful about stretching one ad-creative study too far. Contract review, financial analysis, software development, and customer response all have different failure modes. The pattern is still common enough to take seriously, though, because the mechanism is not specific to advertising. In the study, people on the AI teams delegated more and edited less. That’s an understandable trade when the goal is speed, but it also means less human shaping happens before the work leaves the system.

That same trade can show up anywhere AI becomes the first draft machine. A team may produce more summaries, more tickets, more code, and more policy language, while the reporting layer mostly sees activity. The numbers will look good because the metrics are built to count movement through the system.

What the metrics may miss is the shape of the work itself. A team can produce more while the range of ideas gets narrower. It can move faster while exploring fewer possibilities. Expert judgment can still be present, but mostly as a light edit on the same machine-shaped first draft.

The number on the slide may be true. The question is whether anyone is measuring what it cost to get it.

📄 Full paper: https://arxiv.org/abs/2503.18238

Algorithm & Blues publishes Sundays.
