
Can AI Agents Actually Use Your CLI?

Why help text, error messages, and output format matter more than you think — and how to measure it.

Something fundamental changed in how developers interact with CLIs. A year ago, humans were the primary audience. Today, AI agents are.

Claude Code, Cursor, GitHub Copilot, Windsurf — these tools don't use GUIs. They read help text, execute commands, parse output, and recover from errors. Your CLI's help page is now an API contract with an LLM.

The question isn't whether agents will use your CLI. They already are. The question is: how well does your CLI work when the user isn't human?

The agent-readiness gap

We benchmarked dozens of CLIs by giving real LLM agents real tasks — “create a deployment,” “list running services,” “configure a webhook” — and measured whether they could complete them. The results were surprising.

Some CLIs that developers love scored below 50%. Not because the tools are bad, but because they were designed for humans who can read between the lines, scan a man page, or recognize a pattern from muscle memory. Agents can't do any of that. They depend entirely on what your CLI tells them.

Three patterns showed up again and again in CLIs that agents struggle with.

1. Ambiguous help text

When an agent runs your-cli --help, the output it gets is the single most important piece of context it has. If the help text is vague, incomplete, or uses inconsistent naming, the agent guesses. And it guesses wrong.

What agents struggle with
$ deploy --help
Usage: deploy [options] [target]

Deploy your application.

Options:
  -e, --env    Environment
  -f, --force  Force deploy

What's a “target”? A URL? A filename? A service name? What values does --env accept? The human knows from experience. The agent has to guess.

What agents can work with
$ deploy --help
Usage: deploy [options] <service-name>

Deploy a service to the specified environment.

Arguments:
  service-name    Name of the service (e.g. "api", "web", "worker")

Options:
  -e, --env <name>   Target environment: staging | production (default: staging)
  -f, --force        Skip confirmation prompt and deploy immediately

Examples:
  deploy api --env production
  deploy web --env staging --force

The fix is almost always the same: explicit argument names, enumerated option values, and examples. This isn't just good for agents — it's good for humans too.
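As a sketch of what this looks like in code, here is a minimal Python argparse setup (the command and option names are illustrative, borrowed from the example above) whose generated --help output has all three properties: a named argument, enumerated option values, and examples.

```python
import argparse

# Illustrative sketch: a parser whose --help output names its positional
# argument, enumerates valid --env values, and includes usage examples.
parser = argparse.ArgumentParser(
    prog="deploy",
    description="Deploy a service to the specified environment.",
    epilog=(
        "Examples:\n"
        "  deploy api --env production\n"
        "  deploy web --env staging --force"
    ),
    formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument(
    "service_name",
    metavar="service-name",
    help='Name of the service (e.g. "api", "web", "worker")',
)
parser.add_argument(
    "-e", "--env",
    choices=["staging", "production"],
    default="staging",
    help="Target environment (default: staging)",
)
parser.add_argument(
    "-f", "--force",
    action="store_true",
    help="Skip confirmation prompt and deploy immediately",
)
```

With `choices` set, argparse also rejects invalid values with a message that lists the valid ones, so the help text and the error behavior stay in sync for free.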

2. Error messages that don't explain the fix

When a command fails, the error message is the agent's only signal for what to do next. A message like Error: invalid configuration is a dead end. The agent will retry the same command, try random variations, or give up.

Dead-end error
$ deploy api --env prod
Error: invalid environment

Actionable error
$ deploy api --env prod
Error: Unknown environment "prod". Valid environments: staging, production.
Hint: Did you mean --env production?

The second version gives the agent everything it needs to self-correct in one turn. The first version burns 3-5 turns of retries before the agent either figures it out or fails. In benchmarks, this single pattern accounts for the majority of “agent needed too many turns” failures.
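The actionable version is cheap to implement. Here is one possible sketch in Python (the function and environment names are hypothetical, matching the example above): validate against the known set, and use fuzzy matching from the standard library to generate the hint.

```python
import difflib

VALID_ENVS = ["staging", "production"]


def env_error(value):
    """Return an actionable error message for an invalid --env value,
    or None if the value is valid."""
    if value in VALID_ENVS:
        return None
    msg = (
        f'Error: Unknown environment "{value}". '
        f'Valid environments: {", ".join(VALID_ENVS)}.'
    )
    # Suggest the closest valid value so the agent (or human) can
    # self-correct in a single turn.
    close = difflib.get_close_matches(value, VALID_ENVS, n=1, cutoff=0.5)
    if close:
        msg += f"\nHint: Did you mean --env {close[0]}?"
    return msg
```

A real CLI would print this to stderr and exit nonzero; the point is that the message carries the full set of valid values plus a best-guess correction, not just a verdict.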

3. Output that's hard to parse

Agents need to use the output of one command as input to the next. Pretty tables with box-drawing characters, colored output, and alignment spaces look great in a terminal but are painful to extract data from.

Hard to parse
$ services list
╔══════════╦══════════╦═══════════╗
║ Service  ║ Status   ║ Replicas  ║
╠══════════╬══════════╬═══════════╣
║ api      ║ running  ║ 3         ║
║ web      ║ running  ║ 2         ║
╚══════════╩══════════╩═══════════╝

Easy to pipe
$ services list --json
[
  {"name": "api", "status": "running", "replicas": 3},
  {"name": "web", "status": "running", "replicas": 2}
]

The best CLIs support both: pretty output for humans by default, and a --json or --output json flag for machines. GitHub's gh CLI does this well — nearly every command supports --json with field selection.
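Supporting both modes usually comes down to one render function with a switch. A minimal sketch (the data and function names are illustrative, not any particular CLI's internals):

```python
import json

SERVICES = [
    {"name": "api", "status": "running", "replicas": 3},
    {"name": "web", "status": "running", "replicas": 2},
]


def render(services, as_json=False):
    """Aligned columns for humans by default; stable JSON when the
    caller passes a --json flag."""
    if as_json:
        return json.dumps(services, indent=2)
    # Plain aligned columns: readable in a terminal, and still far
    # easier to parse than box-drawing characters.
    header = f'{"SERVICE":<10}{"STATUS":<10}{"REPLICAS":<10}'
    rows = [
        f'{s["name"]:<10}{s["status"]:<10}{s["replicas"]:<10}'
        for s in services
    ]
    return "\n".join([header, *rows])
```

The key property is that the JSON path is the same data, not a second code path that can drift: both views render from one list of records.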

Measuring agent-readiness

Fixing these patterns is straightforward once you know where they are. The harder question is: how do you know your CLI's current state? Manual testing against one model doesn't scale. Models behave differently — Claude might handle ambiguous help text by asking clarifying questions, while GPT-4o might just guess.

This is why we built CLIWatch. It runs your CLI against real LLM agents with real tasks and measures three things:

  • Pass/fail — did the agent complete the task? This is your headline metric. A task that fails across every model is a CLI problem, not a model problem.
  • Number of turns — how many back-and-forth exchanges did the agent need? A task that passes in 2 turns is well-designed. A task that passes in 12 turns means the agent is fighting your CLI's UX — retrying commands, reading help text multiple times, recovering from bad error messages.
  • Token usage — how much context did the agent consume? High token usage on a simple task usually means verbose or unstructured output is flooding the agent's context window. This is the --json flag problem: a pretty table costs 10x more tokens than the equivalent JSON.

Get started in 60 seconds
$ npm install -g @cliwatch/cli
$ cliwatch init
$ cliwatch bench --dry-run

You define tasks (what should the agent accomplish?), CLIWatch runs them against multiple models, and uploads the results to your dashboard. Run it in CI and you catch regressions before they ship — just like you catch bugs with tests.

Beyond pass/fail: turns and tokens tell the real story

Pass rate gets you started, but turns and tokens are where you find the optimizations. Consider two tasks that both pass:

Both pass — but one is 6x more expensive
Task: "List running services and get the API service's URL"

  Claude Sonnet   ✓  2 turns   1,200 tokens
  GPT-4o          ✓  8 turns   7,400 tokens

Same task, same CLI, both pass. But GPT-4o needed 8 turns and burned 6x more tokens because it couldn't parse the output format on the first try. The CLIWatch dashboard surfaces exactly this — per-task, per-model breakdowns of turns and token usage — so you know which commands to prioritize.

Over time, you track these across releases. Did that help text rewrite actually reduce turns? Did adding --json cut token usage? The trend charts answer these questions without guesswork.

Show your score

Once you're benchmarking, you can add a badge to your README that shows your pass rate. It updates automatically on every run.

One line of Markdown
[![AI Bench](https://app.cliwatch.com/api/v1/public/badges/YOUR_TOKEN)](https://app.cliwatch.com)

Think of it like the “CI passing” badge, but for agent-readiness. It signals to your users (and their agents) that your CLI is designed to work with AI tools.

Why this matters now

The shift is happening fast. A year ago, “agent-readiness” wasn't a concept. Today, developers evaluate CLIs partly on how well they work with their AI coding assistant. A CLI that requires ten turns to do what a competitor does in two will lose users — not because it's worse, but because it's harder for agents to operate.

The good news: the fixes are usually small. Better help text, clearer error messages, a --json flag. The hard part was knowing where to look. Now you can measure it.

Evaluate your CLI today

Free for individuals and open-source projects. See how your CLI scores against real LLM agents.
