What Does It Actually Cost When an Agent Uses Your CLI?
We ran 4 models on the same Docker CLI tasks. The cheapest model per token isn't always the smartest choice — and your CLI's design is the biggest cost lever.
Every time an agent runs docker build or docker inspect, it consumes tokens. Input tokens to read help text, output tokens to form commands, more of both when it retries on errors. Those tokens cost money.
Most teams pick a model based on list price per million tokens and call it a day. But that's like choosing a contractor by hourly rate without asking how many hours the job takes. The metric that actually matters is cost per successful task.
The “cheapest model” trap
Here's a comparison of the four models we benchmarked, by list price:
| Model | Input | Output | “Cheapest?” |
|---|---|---|---|
| Gemini 3 Flash | $0.50/M | $3/M | Looks like it |
| Claude Haiku 4.5 | $0.80/M | $4/M | Close second |
| GPT-5.2 | $1.75/M | $14/M | Mid-range |
| Claude Opus 4.6 | $5/M | $25/M | Premium |
Based on this, you'd pick Gemini 3 Flash for everything. It's 10x cheaper per token than Opus. Case closed?
Not even close.
Cost per successful task
Cheaper models fail more, retry more, and burn more tokens on the attempts that do fail. To get the real cost, you need to account for all of this:
cost_per_success = avg_tokens_per_attempt × price_per_token
÷ pass_rate
# Failed attempts aren't free — you pay for every retry.
# And failed attempts often use MORE tokens (retries, backtracking).

We ran 4 Docker CLI tasks (build, run, inspect, volume-mount) across all four models, each repeated 3 times, for a total of 48 evaluations. Here's what the real numbers look like:
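To sanity-check the table below, here is the formula as a small Python helper, using the ~75% input / 25% output token split noted beneath the table. The function is an illustrative sketch, not CLIWatch code:

```python
def cost_per_success(avg_tokens, pass_rate, in_price, out_price,
                     input_share=0.75):
    """Estimated dollars per successful task.

    Prices are per million tokens. Dividing by pass_rate is what
    charges you for the failed attempts.
    """
    in_cost = avg_tokens * input_share * in_price / 1e6
    out_cost = avg_tokens * (1 - input_share) * out_price / 1e6
    return (in_cost + out_cost) / pass_rate

# (avg tokens, pass rate, $/M input, $/M output) from the benchmark
models = {
    "Gemini 3 Flash":   (4_753, 0.75, 0.50, 3.0),
    "Claude Haiku 4.5": (8_068, 0.92, 0.80, 4.0),
    "GPT-5.2":          (3_705, 0.92, 1.75, 14.0),
    "Claude Opus 4.6":  (8_288, 0.83, 5.00, 25.0),
}

for name, args in models.items():
    print(f"{name}: ${cost_per_success(*args):.3f}")
```

Run it and you get the right-hand column of the table: $0.007, $0.014, $0.019, $0.100.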
| Model | Pass rate | Avg turns | Avg tokens | Est. cost / success |
|---|---|---|---|---|
| Gemini 3 Flash | 75% | 2.0 | 4,753 | $0.007 |
| Claude Haiku 4.5 | 92% | 2.7 | 8,068 | $0.014 |
| GPT-5.2 | 92% | 1.9 | 3,705 | $0.019 |
| Claude Opus 4.6 | 83% | 3.0 | 8,288 | $0.100 |
Cost estimates assume ~75% input / 25% output token split, typical for CLI agent workloads. Data from CLIWatch benchmark run #88.
Gemini 3 Flash is cheapest per success at $0.007 — but it fails 25% of tasks. The “premium” Claude Opus 4.6 costs 14x more per success yet still trails the mid-range models on pass rate (83% vs 92%). Meanwhile GPT-5.2 and Haiku 4.5 both hit 92% at moderate cost. The cheapest per token and the most expensive per token both underperform the mid-range options.
Failed tasks aren't free
The table above only counts token costs. In practice, a failed agent task has three costs:
- Wasted tokens — you pay for the attempt even though it produced nothing useful. In our benchmark, failed runs often consumed more tokens than successes because the agent retried, backtracked, and hit the max turn limit.
- Human intervention — someone has to finish the task manually. If the agent was running in CI, a developer gets paged. Developer time dwarfs any token cost.
- Pipeline latency — the agent spent 5 turns and 30 seconds getting nowhere. In a CI pipeline, that's blocking time. In an interactive session, it's frustrating.
Once you factor in human escalation cost, the economics flip. GPT-5.2 at $0.019/task with 92% success is almost always cheaper than Gemini 3 Flash at $0.007/task when Flash needs a human 25% of the time. A single minute of developer time costs about as much as a hundred successful agent tasks.
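You can make "almost always" precise by solving for the break-even escalation cost: the dollar value of a human finishing a failed task at which the cheap-but-flaky model stops being cheap. A minimal sketch using the benchmark numbers above (the linear cost model is our simplifying assumption, not CLIWatch's accounting):

```python
def total_cost_per_task(cost_per_success, failure_rate, escalation_cost):
    """Token cost plus the expected cost of a human finishing failures."""
    return cost_per_success + failure_rate * escalation_cost

# Break-even: escalation cost at which Flash and GPT-5.2 cost the same.
#   0.007 + 0.25 * x == 0.019 + 0.08 * x   =>   x = 0.012 / 0.17
break_even = (0.019 - 0.007) / (0.25 - 0.08)
print(f"break-even escalation cost: ${break_even:.2f}")
```

The break-even lands around seven cents. If a human finishing a failed task costs your team more than $0.07 — and it always does — GPT-5.2 is the cheaper model.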
What surprised us
More expensive doesn't mean more reliable. Claude Opus 4.6 is the most expensive model we tested — 14x the cost per success of Gemini 3 Flash — yet its 83% pass rate trails the 92% both mid-range models achieved. On Docker tasks, Opus spent 8,288 tokens per task without improving outcomes over the mid-range options.
The sweet spot is in the middle. GPT-5.2 and Claude Haiku 4.5 both hit 92% pass rate. GPT-5.2 did it in just 1.9 average turns with the fewest tokens (3,705). Haiku used more tokens (8,068) but costs less per token. Either is a better choice than the cheapest or the most expensive option.
Token count varies wildly by model architecture. Anthropic models used roughly 2x the tokens of other providers for the same tasks. This isn't necessarily wasteful — Haiku's extra tokens bought it a 92% pass rate — but it means per-token pricing comparisons across providers are misleading.
Your CLI is the biggest cost lever
Here's the part most teams miss: the cost difference between models shrinks dramatically when the CLI is well-designed. A CLI with clear help text, actionable error messages, and --json output needs fewer turns on every model.
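What does agent-friendly look like in practice? Here's a hypothetical sketch (the mycli name, flag, and messages are invented for illustration) of the two properties that cut turns the most: machine-readable output, and errors that name the fix:

```python
import argparse
import json

def main(argv):
    """A hypothetical 'inspect'-style command; all names are illustrative."""
    parser = argparse.ArgumentParser(prog="mycli")
    parser.add_argument("name", help="container name")
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable output (one JSON object)")
    args = parser.parse_args(argv)

    containers = {"web": {"status": "running", "ip": "172.17.0.2"}}
    if args.name not in containers:
        # Actionable error: say what failed AND what to run next.
        return (f"error: no container named {args.name!r}; "
                f"run 'mycli ls' to list containers")
    info = containers[args.name]
    # --json lets an agent parse the answer in a single turn,
    # instead of regex-scraping prose output across retries.
    if args.json:
        return json.dumps(info)
    return f"{args.name}: {info['status']} at {info['ip']}"
```

An agent that can call `main(["web", "--json"])` and parse one JSON object never needs a second turn; one that hits the error string knows exactly which command to try next.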
In our benchmark, the “run-and-capture” task (simple, clear intent) completed in 1 turn on every model. The “inspect-container” task (more ambiguous, requires chaining commands) took 2–5 turns and was the primary source of failures across all models.
| Task | Turns | Tokens | Pass rate |
|---|---|---|---|
| run-and-capture (clear intent) | 1 | 3,682 | 100% |
| inspect-container (ambiguous, multi-step) | 2–5 | 6,089–10,842 | 67% |

Same models, same CLI. The clearer task used up to 3x fewer tokens.
That improvement applies to every model, every task, every run. It compounds. Over thousands of CI runs, better CLI design saves more money than model selection ever will.
So which model should you pick?
It depends on the failure tolerance of the workload:
- CI gates (must pass) — use GPT-5.2 or Claude Haiku 4.5. At 92% pass rate, one automatic retry gives you >99% effective success. The cost difference is pennies; the cost of a blocked pipeline is hours.
- Bulk / batch operations — Claude Haiku 4.5 hits the sweet spot. 92% pass rate at $0.014/task. If you're running 500 tasks overnight, automatic retry covers the 8% failures cheaply.
- Cost-sensitive, high-volume — Gemini 3 Flash at $0.007/task is hard to beat on raw cost. Just plan for the 25% that need retries or fallback to a stronger model.
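The retry arithmetic behind these recommendations is simple geometric math, sketched below. One assumption to flag: this treats attempts as independent, which real agent failures may not be (a model that misreads your help text once will often misread it again).

```python
def effective_success(pass_rate, max_attempts):
    """Probability that at least one of max_attempts tries succeeds,
    assuming attempts are independent."""
    return 1 - (1 - pass_rate) ** max_attempts

# A 92% model clears 99% with a single retry...
print(effective_success(0.92, 2))   # 0.9936
# ...while a 75% model is still under 94% after one retry
print(effective_success(0.75, 2))   # 0.9375
# and needs four total attempts to reach ~99.6%.
print(effective_success(0.75, 4))
```

That gap is why the 92% models dominate for CI gates: one cheap retry buys reliability the 75% model can't match even with several.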
CLIWatch tracks all of this per model, per task, per CLI — so you can make the choice with data instead of gut feel.
The takeaway
Stop comparing models by token price alone. Start comparing them by cost per successful task — and then invest in the thing that actually moves the needle: making your CLI easier for agents to use. Better help text, clearer errors, and structured output reduce cost on every model, every run, forever.
Track cost per task for your CLI
CLIWatch measures pass rate, turns, and token usage per model. See what your CLI actually costs agents to use.