SimEdw's Blog


Coding Agents are Good First-Time User Testers

TL;DR: Coding agents make surprisingly good first-time user testers, especially for CLIs. When they get confused, it’s usually because the interface is asking too much of a new user.

Traditional user testing is expensive and slow. You need fresh testers to avoid bias, you need to be present while they work, and you still have to interpret half-formed feedback after the fact. As a result, many engineers either skip user testing entirely or defer it until it’s already too late.

I wasn’t thinking about user testing at all. I just wanted Claude Code to solve a problem using my tool, and it struggled in places I hadn’t anticipated.

When I looked at why it struggled, it was clear that my tool wasn’t as obvious as I thought.

LLMs have been trained on vast amounts of software: command-line tools, websites, error messages, help text. As a result, they develop strong expectations about how software usually behaves.

In practice, reducing confusion in an interface looks remarkably similar to reducing perplexity. If the next step is obvious, both humans and agents benefit.¹

Coding agents are cheap, resettable first-time user proxies that can explain exactly where their expectations broke.

Below are two areas where this has already worked well for me.

CLI

Earlier in the week I was building a CLI for interacting with complex file formats that LLMs normally struggle to reason about. It had just enough flags, modes, and constraints to make the interface non-trivial.

I wrote a simple README describing the available flags, added a few examples, and asked Claude Code to complete a task using it.

It was somewhat humbling to see how much it struggled with the initial version: it misinterpreted several subtle flags, ran commands in a convoluted order, gave inputs slightly outside the expected range (which the CLI promptly rejected), and occasionally gave up and tried to write Python instead.

At this point there are two obvious options: keep patching the README and prompt to cover every edge case, or change the interface to be more intuitive.

In addition to reviewing each step in the chain myself, I asked Claude to explain what it found unexpected along the way, roughly analogous to a tester narrating their thought process during a usability session.

After I simplified the CLI, the same task went from ~20 steps to ~11: same prompt, same final output, only the interface changed.

Web

This works for websites too, with one important caveat: the model should operate primarily on screenshots. Otherwise, you risk optimising for a clean DOM rather than the actual user experience.

The setup is similar. I used Claude Code with its Chrome plugin, though something like Playwright would work just as well.
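
A minimal sketch of that setup, assuming Playwright (the URL and file names here are placeholders): drive the page, capture a screenshot at each step, and let the agent choose its next action from the image rather than the DOM.

import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://app.example.com/projects/new"); // placeholder URL
await page.screenshot({ path: "step-01.png" });          // the agent reasons over this image
// ...the agent picks the next click or keystroke from the screenshot, then repeat
await browser.close();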

After logging the agent in and providing minimal context, I asked it to complete a task: create a project with a complex configuration.

It struggled to understand how certain properties were configured, and it navigated to pre-existing projects to infer how things worked before returning. It was also confused by two similar-looking buttons with overlapping functionality.

After the session, I asked the agent to list moments where the interface broke its expectations and to suggest possible improvements.

One piece of feedback was to remove one of the overlapping buttons and add a text label to an ambiguous icon.

I wouldn’t trust agents with visual taste or delightful experiences, but they are useful for detecting ambiguous actions and unclear states: the kind of thing that confuses new users.

A Different Kind of Regression Test

A crude but surprisingly useful proxy for how intuitive an interface is: how many steps an agent needs to complete a task. Fewer steps generally indicate clearer affordances and better defaults.²

In principle, this could run as part of CI. Just as we guard against regressions in performance or logic, we could also guard against intuition regressions:

// agent() is illustrative: it drives the task and reports { succeeded, steps }
const result = await agent("Sign up to the homepage and create a project");
assert(result.succeeded);
assert(result.steps <= 15);
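
As a sketch of how this might plug into CI, assuming a hypothetical agent() helper that drives the browser and reports succeeded and steps, the check could live in an ordinary Node test file:

import test from "node:test";
import assert from "node:assert/strict";
import { agent } from "./agent.js"; // hypothetical helper wrapping the browser-driving agent

test("intuition regression: sign up and create a project", async () => {
  const result = await agent("Sign up to the homepage and create a project");
  assert.ok(result.succeeded, "agent completed the task");
  assert.ok(result.steps <= 15, "task stayed within the step budget");
});

The step budget is a tunable threshold rather than a hard rule; the useful signal is the trend, and a failing check is a prompt to read the agent's transcript.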

This doesn't replace human user testing. Beyond their limited sense of taste, agents can fail for reasons unrelated to UX: JavaScript quirks, timeouts, or hallucinations.

But as a first pass, it provides an affordable, resettable, always-available proxy for identifying where intuition breaks, long before a real user ever has to.

Coding agents are compressed representations of software conventions and expectations. When they get confused, it's often because your interface is asking too much of a first-time user.


  1. The intuition-to-perplexity analogy maps better to text-based interfaces than to visual UIs. 

  2. Bounded, deterministic tasks work best. 
