What Does It Actually Cost When an Agent Uses Your CLI?
We ran 4 models on the same Docker CLI tasks. The cheapest model per token isn't always the smartest choice — and your CLI's design is the biggest cost lever.
Every time an agent runs docker build or docker inspect, it consumes tokens. Input tokens to read help text, output tokens to form commands, more of both when it retries on errors. Those tokens cost money.
Most teams pick a model based on list price per million tokens and call it a day. But that's like choosing a contractor by hourly rate without asking how many hours the job takes. The metric that actually matters is cost per successful task.
The “cheapest model” trap
Here's a comparison of the four models we benchmarked, by list price:
| Model | Input | Output | “Cheapest?” |
|---|---|---|---|
| Gemini 3 Flash | $0.50/M | $3/M | Looks like it |
| Claude Haiku 4.5 | $0.80/M | $4/M | Close second |
| GPT-5.2 | $1.75/M | $14/M | Mid-range |
| Claude Opus 4.6 | $5/M | $25/M | Premium |
Based on this, you'd pick Gemini 3 Flash for everything. It's 10x cheaper per token than Opus. Case closed?
Not even close.
Cost per successful task
Cheaper models fail more, retry more, and burn more tokens on the attempts that do fail. To get the real cost, you need to account for all of this:
cost_per_success = avg_tokens_per_attempt × price_per_token
÷ pass_rate
# Failed attempts aren't free — you pay for every retry.
# And failed attempts often use MORE tokens (retries, backtracking).

We ran 4 Docker CLI tasks (build, run, inspect, volume-mount) across all four models, each repeated 3 times, for a total of 48 evaluations. Here's what the real numbers look like:
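To sanity-check the table below, here is the formula as a small Python helper, using the ~75% input / 25% output token split noted beneath the table. The function is an illustrative sketch, not CLIWatch code:

```python
def cost_per_success(avg_tokens, pass_rate, in_price, out_price,
                     input_share=0.75):
    """Estimated dollars per successful task.

    Prices are per million tokens. Dividing by pass_rate is what
    charges you for the failed attempts.
    """
    in_cost = avg_tokens * input_share * in_price / 1e6
    out_cost = avg_tokens * (1 - input_share) * out_price / 1e6
    return (in_cost + out_cost) / pass_rate

# (avg tokens, pass rate, $/M input, $/M output) from the benchmark
models = {
    "Gemini 3 Flash":   (4_753, 0.75, 0.50, 3.0),
    "Claude Haiku 4.5": (8_068, 0.92, 0.80, 4.0),
    "GPT-5.2":          (3_705, 0.92, 1.75, 14.0),
    "Claude Opus 4.6":  (8_288, 0.83, 5.00, 25.0),
}

for name, args in models.items():
    print(f"{name}: ${cost_per_success(*args):.3f}")
```

Run it and you get the right-hand column of the table: $0.007, $0.014, $0.019, $0.100.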
| Model | Pass rate | Avg turns | Avg tokens | Est. cost / success |
|---|---|---|---|---|
| Gemini 3 Flash | 75% | 2.0 | 4,753 | $0.007 |
| Claude Haiku 4.5 | 92% | 2.7 | 8,068 | $0.014 |
| GPT-5.2 | 92% | 1.9 | 3,705 | $0.019 |
| Claude Opus 4.6 | 83% | 3.0 | 8,288 | $0.100 |
Cost estimates assume ~75% input / 25% output token split, typical for CLI agent workloads. Data from CLIWatch benchmark run #88.
Gemini 3 Flash is cheapest per success at $0.007 — but it fails 25% of tasks. The “premium” Claude Opus 4.6 costs 14x more per success yet still trails the mid-range models on pass rate (83% vs 92%). Meanwhile GPT-5.2 and Haiku 4.5 both hit 92% at moderate cost. The cheapest per token and the most expensive per token both underperform the mid-range options.
Failed tasks aren't free
The table above only counts token costs. In practice, a failed agent task has three costs:
- Wasted tokens — you pay for the attempt even though it produced nothing useful. In our benchmark, failed runs often consumed more tokens than successes because the agent retried, backtracked, and hit the max turn limit.
- Human intervention — someone has to finish the task manually. If the agent was running in CI, a developer gets paged. Developer time dwarfs any token cost.
- Pipeline latency — the agent spent 5 turns and 30 seconds getting nowhere. In a CI pipeline, that's blocking time. In an interactive session, it's frustrating.
Once you factor in human escalation cost, the economics flip. GPT-5.2 at $0.019/task with 92% success is almost always cheaper than Gemini 3 Flash at $0.007/task when Flash needs a human 25% of the time. A single minute of developer time costs about as much as a hundred successful agent tasks.
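You can make "almost always" precise by solving for the break-even escalation cost: the dollar value of a human finishing a failed task at which the cheap-but-flaky model stops being cheap. A minimal sketch using the benchmark numbers above (the linear cost model is our simplifying assumption, not CLIWatch's accounting):

```python
def total_cost_per_task(cost_per_success, failure_rate, escalation_cost):
    """Token cost plus the expected cost of a human finishing failures."""
    return cost_per_success + failure_rate * escalation_cost

# Break-even: escalation cost at which Flash and GPT-5.2 cost the same.
#   0.007 + 0.25 * x == 0.019 + 0.08 * x   =>   x = 0.012 / 0.17
break_even = (0.019 - 0.007) / (0.25 - 0.08)
print(f"break-even escalation cost: ${break_even:.2f}")
```

The break-even lands around seven cents. If a human finishing a failed task costs your team more than $0.07 — and it always does — GPT-5.2 is the cheaper model.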
What surprised us
More expensive doesn't mean more reliable. Claude Opus 4.6 is the most expensive model we tested — 14x the cost per success of Gemini 3 Flash — yet its 83% pass rate trails the 92% both mid-range models achieved. On Docker tasks, Opus spent 8,288 tokens per task without improving outcomes over the mid-range options.
The sweet spot is in the middle. GPT-5.2 and Claude Haiku 4.5 both hit 92% pass rate. GPT-5.2 did it in just 1.9 average turns with the fewest tokens (3,705). Haiku used more tokens (8,068) but costs less per token. Either is a better choice than the cheapest or the most expensive option.
Token count varies wildly by model architecture. Anthropic models used roughly 2x the tokens of other providers for the same tasks. This isn't necessarily wasteful — Haiku's extra tokens bought it a 92% pass rate — but it means per-token pricing comparisons across providers are misleading.
Your CLI is the biggest cost lever
Here's the part most teams miss: the cost difference between models shrinks dramatically when the CLI is well-designed. A CLI with clear help text, actionable error messages, and --json output needs fewer turns on every model.
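What does agent-friendly look like in practice? Here's a hypothetical sketch (the mycli name, flag, and messages are invented for illustration) of the two properties that cut turns the most: machine-readable output, and errors that name the fix:

```python
import argparse
import json

def main(argv):
    """A hypothetical 'inspect'-style command; all names are illustrative."""
    parser = argparse.ArgumentParser(prog="mycli")
    parser.add_argument("name", help="container name")
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable output (one JSON object)")
    args = parser.parse_args(argv)

    containers = {"web": {"status": "running", "ip": "172.17.0.2"}}
    if args.name not in containers:
        # Actionable error: say what failed AND what to run next.
        return (f"error: no container named {args.name!r}; "
                f"run 'mycli ls' to list containers")
    info = containers[args.name]
    # --json lets an agent parse the answer in a single turn,
    # instead of regex-scraping prose output across retries.
    if args.json:
        return json.dumps(info)
    return f"{args.name}: {info['status']} at {info['ip']}"
```

An agent that can call `main(["web", "--json"])` and parse one JSON object never needs a second turn; one that hits the error string knows exactly which command to try next.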
In our benchmark, the “run-and-capture” task (simple, clear intent) completed in 1 turn on every model. The “inspect-container” task (more ambiguous, requires chaining commands) took 2–5 turns and was the primary source of failures across all models.
| Task | Turns | Tokens | Pass rate |
|---|---|---|---|
| run-and-capture (clear intent) | 1 | 3,682 | 100% |
| inspect-container (ambiguous, multi-step) | 2–5 | 6,089–10,842 | 67% |

Same models, same CLI. The clearer task used up to 3x fewer tokens.
That improvement applies to every model, every task, every run. It compounds. Over thousands of CI runs, better CLI design saves more money than model selection ever will.
So which model should you pick?
It depends on the failure tolerance of the workload:
- CI gates (must pass) — use GPT-5.2 or Claude Haiku 4.5. At 92% pass rate, one automatic retry gives you >99% effective success. The cost difference is pennies; the cost of a blocked pipeline is hours.
- Bulk / batch operations — Claude Haiku 4.5 hits the sweet spot. 92% pass rate at $0.014/task. If you're running 500 tasks overnight, automatic retry covers the 8% failures cheaply.
- Cost-sensitive, high-volume — Gemini 3 Flash at $0.007/task is hard to beat on raw cost. Just plan for the 25% that need retries or fallback to a stronger model.
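The retry arithmetic behind these recommendations is simple geometric math, sketched below. One assumption to flag: this treats attempts as independent, which real agent failures may not be (a model that misreads your help text once will often misread it again).

```python
def effective_success(pass_rate, max_attempts):
    """Probability that at least one of max_attempts tries succeeds,
    assuming attempts are independent."""
    return 1 - (1 - pass_rate) ** max_attempts

# A 92% model clears 99% with a single retry...
print(effective_success(0.92, 2))   # 0.9936
# ...while a 75% model is still under 94% after one retry
print(effective_success(0.75, 2))   # 0.9375
# and needs four total attempts to reach ~99.6%.
print(effective_success(0.75, 4))
```

That gap is why the 92% models dominate for CI gates: one cheap retry buys reliability the 75% model can't match even with several.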
CLIWatch tracks all of this per model, per task, per CLI — so you can make the choice with data instead of gut feel.
The takeaway
Stop comparing models by token price alone. Start comparing them by cost per successful task — and then invest in the thing that actually moves the needle: making your CLI easier for agents to use. Better help text, clearer errors, and structured output reduce cost on every model, every run, forever.
Track cost per task for your CLI
CLIWatch measures pass rate, turns, and token usage per model. See what your CLI actually costs agents to use.