Can AI Agents Actually Use Your CLI?
Why help text, error messages, and output format matter more than you think — and how to measure it.
Something fundamental changed in how developers interact with CLIs. A year ago, humans were the primary audience. Today, AI agents are.
Claude Code, Cursor, GitHub Copilot, Windsurf — these tools don't use GUIs. They read help text, execute commands, parse output, and recover from errors. Your CLI's help page is now an API contract with an LLM.
The question isn't whether agents will use your CLI. They already are. The question is: how well does your CLI work when the user isn't human?
The agent-readiness gap
We benchmarked dozens of CLIs by giving real LLM agents real tasks — “create a deployment,” “list running services,” “configure a webhook” — and measured whether they could complete them. The results were surprising.
Some CLIs that developers love scored below 50%. Not because the tools are bad, but because they were designed for humans who can read between the lines, scan a man page, or recognize a pattern from muscle memory. Agents can't do any of that. They depend entirely on what your CLI tells them.
Three patterns showed up again and again in CLIs that agents struggle with.
1. Ambiguous help text
When an agent runs your-cli --help, the output it gets is the single most important piece of context it has. If the help text is vague, incomplete, or uses inconsistent naming, the agent guesses. And it guesses wrong.
```
$ deploy --help
Usage: deploy [options] [target]

Deploy your application.

Options:
  -e, --env    Environment
  -f, --force  Force deploy
```
What's a “target”? A URL? A filename? A service name? What values does --env accept? The human knows from experience. The agent has to guess.
```
$ deploy --help
Usage: deploy [options] <service-name>

Deploy a service to the specified environment.

Arguments:
  service-name  Name of the service (e.g. "api", "web", "worker")

Options:
  -e, --env <name>  Target environment: staging | production (default: staging)
  -f, --force       Skip confirmation prompt and deploy immediately

Examples:
  deploy api --env production
  deploy web --env staging --force
```
The fix is almost always the same: explicit argument names, enumerated option values, and examples. This isn't just good for agents — it's good for humans too.
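As a sketch of how this pattern looks in practice, here is a hypothetical reimplementation of the `deploy` help text above using Python's argparse (the service names and environments are the illustrative ones from the example, not a real CLI):

```python
import argparse

# Explicit argument names, enumerated option values, and examples --
# the three fixes named above, expressed as an argparse definition.
parser = argparse.ArgumentParser(
    prog="deploy",
    description="Deploy a service to the specified environment.",
    epilog="Examples:\n  deploy api --env production\n  deploy web --env staging --force",
    formatter_class=argparse.RawDescriptionHelpFormatter,  # keep epilog line breaks
)
parser.add_argument(
    "service_name",
    metavar="<service-name>",
    help='Name of the service (e.g. "api", "web", "worker")',
)
parser.add_argument(
    "-e", "--env",
    choices=["staging", "production"],  # enumerated values show up in --help
    default="staging",
    help="Target environment (default: staging)",
)
parser.add_argument(
    "-f", "--force",
    action="store_true",
    help="Skip confirmation prompt and deploy immediately",
)

help_text = parser.format_help()
print(help_text)
```

Because the valid values live in `choices`, argparse both renders them in the help output and rejects anything else with an error that lists the valid set, so the agent gets the same contract whether it reads the help or hits the error path.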
2. Error messages that don't explain the fix
When a command fails, the error message is the agent's only signal for what to do next. A message like Error: invalid configuration is a dead end. The agent will retry the same command, try random variations, or give up.
```
$ deploy api --env prod
Error: invalid environment
```
```
$ deploy api --env prod
Error: Unknown environment "prod". Valid environments: staging, production.
Hint: Did you mean --env production?
```
The second version gives the agent everything it needs to self-correct in one turn. The first version burns 3-5 turns of retries before the agent either figures it out or fails. In benchmarks, this single pattern accounts for the majority of “agent needed too many turns” failures.
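The "did you mean" hint is cheap to produce. A minimal sketch in Python, using the standard library's difflib to find the closest valid value (the environment list and function name are illustrative):

```python
import difflib

VALID_ENVS = ["staging", "production"]  # illustrative environment list

def validate_env(env: str) -> str:
    """Return an actionable error message for an unknown environment,
    or an empty string if the value is valid."""
    if env in VALID_ENVS:
        return ""
    message = (
        f'Error: Unknown environment "{env}". '
        f'Valid environments: {", ".join(VALID_ENVS)}.'
    )
    # Suggest the closest valid value so an agent can self-correct in one turn.
    close = difflib.get_close_matches(env, VALID_ENVS, n=1, cutoff=0.5)
    if close:
        message += f"\nHint: Did you mean --env {close[0]}?"
    return message

print(validate_env("prod"))
```

The error states what was wrong, enumerates the valid options, and proposes the likely fix: everything an agent needs to recover in a single turn.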
3. Output that's hard to parse
Agents need to use the output of one command as input to the next. Pretty tables with box-drawing characters, colored output, and alignment spaces look great in a terminal but are painful to extract data from.
```
$ services list
╔══════════╦══════════╦═══════════╗
║ Service  ║ Status   ║ Replicas  ║
╠══════════╬══════════╬═══════════╣
║ api      ║ running  ║ 3         ║
║ web      ║ running  ║ 2         ║
╚══════════╩══════════╩═══════════╝
```
```
$ services list --json
[
  {"name": "api", "status": "running", "replicas": 3},
  {"name": "web", "status": "running", "replicas": 2}
]
```

The best CLIs support both: pretty output for humans by default, and a --json or --output json flag for machines. GitHub's gh CLI does this well, with nearly every command supporting --json with field selection.
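Supporting both modes rarely requires more than one branch at the render layer. A minimal sketch in Python (the service data and function names are illustrative, not a real CLI's internals):

```python
import json

# Illustrative data -- in a real CLI this would come from the backend.
SERVICES = [
    {"name": "api", "status": "running", "replicas": 3},
    {"name": "web", "status": "running", "replicas": 2},
]

def render(services, as_json: bool) -> str:
    """Render service data for humans (default) or machines (--json)."""
    if as_json:
        # Machine-readable: stable keys, no alignment padding, no colors.
        return json.dumps(services, indent=2)
    # Human-readable: simple aligned columns.
    lines = [f"{'SERVICE':<10}{'STATUS':<10}{'REPLICAS':>8}"]
    for s in services:
        lines.append(f"{s['name']:<10}{s['status']:<10}{s['replicas']:>8}")
    return "\n".join(lines)

print(render(SERVICES, as_json=False))
print(render(SERVICES, as_json=True))
```

Keeping the data model separate from the renderer means the JSON path is always in sync with the table, so the machine view can never drift out of date.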
Measuring agent-readiness
Fixing these patterns is straightforward once you know where they are. The harder question is: how do you know your CLI's current state? Manual testing against one model doesn't scale. Models behave differently — Claude might handle ambiguous help text by asking clarifying questions, while GPT-4o might just guess.
This is why we built CLIWatch. It runs your CLI against real LLM agents with real tasks and measures three things:
- Pass/fail — did the agent complete the task? This is your headline metric. A task that fails across every model is a CLI problem, not a model problem.
- Number of turns — how many back-and-forth exchanges did the agent need? A task that passes in 2 turns is well-designed. A task that passes in 12 turns means the agent is fighting your CLI's UX — retrying commands, reading help text multiple times, recovering from bad error messages.
- Token usage — how much context did the agent consume? High token usage on a simple task usually means verbose or unstructured output is flooding the agent's context window. This is the missing --json flag problem: a pretty table costs 10x more tokens than the equivalent JSON.
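To get a rough feel for the gap, compare the size of the box-drawn table from earlier with its JSON equivalent. This uses character count as a crude proxy for tokens; real token counts depend on the model's tokenizer, and box-drawing characters typically tokenize even less efficiently than this suggests:

```python
import json

# The box-drawn table from the example above, as a string.
table = (
    "╔══════════╦══════════╦═══════════╗\n"
    "║ Service  ║ Status   ║ Replicas  ║\n"
    "╠══════════╬══════════╬═══════════╣\n"
    "║ api      ║ running  ║ 3         ║\n"
    "║ web      ║ running  ║ 2         ║\n"
    "╚══════════╩══════════╩═══════════╝"
)
# The same data as compact JSON.
data = json.dumps([
    {"name": "api", "status": "running", "replicas": 3},
    {"name": "web", "status": "running", "replicas": 2},
])

print(len(table), "chars of table vs", len(data), "chars of JSON")
```

The table spends most of its characters on borders and padding that carry no information an agent can use.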
```
$ npm install -g @cliwatch/cli
$ cliwatch init
$ cliwatch bench --dry-run
```
You define tasks (what should the agent accomplish?), CLIWatch runs them against multiple models, and uploads the results to your dashboard. Run it in CI and you catch regressions before they ship — just like you catch bugs with tests.
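In CI, this could look something like the following. This is a hypothetical GitHub Actions workflow; the job layout is an assumption, and the commands are the ones shown above:

```yaml
# .github/workflows/agent-readiness.yml (hypothetical example)
name: agent-readiness
on: [push]
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm install -g @cliwatch/cli
      - run: cliwatch bench
```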
Beyond pass/fail: turns and tokens tell the real story
Pass rate gets you started, but turns and tokens are where you find the optimizations. Consider two tasks that both pass:
```
Task: "List running services and get the API service's URL"

Claude Sonnet   ✓    2 turns   1,200 tokens
GPT-4o          ✓    8 turns   7,400 tokens
```
Same task, same CLI, both pass. But GPT-4o needed 8 turns and burned 6x more tokens because it couldn't parse the output format on the first try. The CLIWatch dashboard surfaces exactly this — per-task, per-model breakdowns of turns and token usage — so you know which commands to prioritize.
Over time, you track these across releases. Did that help text rewrite actually reduce turns? Did adding --json cut token usage? The trend charts answer these questions without guesswork.
Show your score
Once you're benchmarking, you can add a badge to your README that shows your pass rate. It updates automatically on every run.
[](https://app.cliwatch.com)
Think of it like the “CI passing” badge, but for agent-readiness. It signals to your users (and their agents) that your CLI is designed to work with AI tools.
Why this matters now
The shift is happening fast. A year ago, “agent-readiness” wasn't a concept. Today, developers evaluate CLIs partly on how well they work with their AI coding assistant. A CLI that requires ten turns to do what a competitor does in two will lose users — not because it's worse, but because it's harder for agents to operate.
The good news: the fixes are usually small. Better help text, clearer error messages, a --json flag. The hard part was knowing where to look. Now you can measure it.
Evaluate your CLI today
Free for individuals and open-source projects. See how your CLI scores against real LLM agents.