Blog

Thoughts on agent-readiness, CLI design, and evaluating tools against real LLM agents.

# commands that don't exist - seen in BOTH runs
cliwatch runs list --limit N    # both models tried
cliwatch auth status            # both guessed before 'whoami'
cliwatch validate --config FILE # both used --config, not --file
 
Opus 4.7 made the same mistakes; it just verified less afterward

Analysis · 6 min read

Opus 4.7 vs 4.6 on our own CLI: what the traces actually say

Same 12 tasks, same CLI, 100% pass on both. The headline is a 36% drop in turns, but the traces show both models hallucinating the exact same commands.

April 18, 2026 · Read post →
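
One way to act on guesses like these is to answer them with a corrective hint instead of a bare "unknown command". The sketch below is illustrative only: the `HINTS` table, the `hint_for` helper, the hint wording, and the `cliwatch.yaml` file name are all hypothetical. Only the two guess/real pairs (`auth status` vs `whoami`, `--config` vs `--file`) come from the trace summary above.

```python
from typing import List, Optional

# Hypothetical hint table: guessed invocation prefix -> corrective message.
# The pairs mirror the trace summary; the message wording is invented here.
HINTS = {
    ("auth", "status"): "unknown command 'auth status'; did you mean 'cliwatch whoami'?",
    ("validate", "--config"): "unknown flag '--config'; did you mean '--file'?",
}


def hint_for(argv: List[str]) -> Optional[str]:
    """Return a corrective hint if argv starts with a known wrong guess."""
    for prefix, hint in HINTS.items():
        if tuple(argv[: len(prefix)]) == prefix:
            return hint
    return None


if __name__ == "__main__":
    print(hint_for(["auth", "status"]))
    print(hint_for(["validate", "--config", "cliwatch.yaml"]))  # file name is a placeholder
    print(hint_for(["whoami"]))  # a valid command gets no hint
```

The prefix match is deliberately naive: it catches a wrong flag only when it directly follows the subcommand, which is enough to show the idea.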

# docker benchmark: 4 models × 3 approaches
Model          Upfront  --help  Skills
GPT-5-nano       83%     67%     67%
GPT-5.2          83%     33%     50%
Gemini 3 Flash  100%    100%    100%
Haiku 4.5       100%    100%    100%
 
Progressive --help breaks on complex CLIs

Research · 5 min read

Designing a CLI Skills Protocol for AI Agents

We benchmarked three CLI discovery approaches across four models. Progressive --help breaks down on complex CLIs. A structured skills command recovers accuracy and cuts tokens.

February 16, 2026 · Read post →
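
To make the preview table easier to compare, here is a small sketch that averages the pass rates per approach. The numbers are copied from the table above; taking an unweighted mean across the four models is an assumption of this sketch, not a metric the post itself reports.

```python
# Pass rates (%) copied from the preview table: (Upfront, --help, Skills).
RESULTS = {
    "GPT-5-nano": (83, 67, 67),
    "GPT-5.2": (83, 33, 50),
    "Gemini 3 Flash": (100, 100, 100),
    "Haiku 4.5": (100, 100, 100),
}
APPROACHES = ("Upfront", "--help", "Skills")


def mean_by_approach(results):
    """Unweighted mean pass rate per approach across all models."""
    n = len(results)
    return {
        name: sum(row[i] for row in results.values()) / n
        for i, name in enumerate(APPROACHES)
    }


means = mean_by_approach(RESULTS)

if __name__ == "__main__":
    for name, value in means.items():
        print(f"{name:8s} {value:.2f}%")
```

On these four rows the means are 91.50% for upfront loading, 75.00% for progressive `--help`, and 79.25% for the skills command.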

# cost per successful task — real benchmark data
Model            Pass    Avg tokens  Cost/success
Gemini 3 Flash   75%     4,753       $0.007
Claude Haiku     92%     8,068       $0.014
GPT-5.2          92%     3,705       $0.019
Claude Opus 4.6  83%     8,288       $0.100
 
Cheapest per token ≠ best value per task

Analysis · 5 min read

What Does It Actually Cost When an Agent Uses Your CLI?

We benchmarked 4 models on Docker CLI tasks. The cheapest model per token isn't always the smartest choice — and your CLI's design is the biggest cost lever.

February 13, 2026 · Read post →
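
The table above has a different "winner" depending on which column you rank by. A minimal sketch, using only the four rows shown (the `Row` type and variable names are introduced here for illustration):

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    pass_rate: float         # fraction of tasks passed
    avg_tokens: int
    cost_per_success: float  # USD, as listed in the table


# Values copied from the preview table.
ROWS = [
    Row("Gemini 3 Flash", 0.75, 4753, 0.007),
    Row("Claude Haiku", 0.92, 8068, 0.014),
    Row("GPT-5.2", 0.92, 3705, 0.019),
    Row("Claude Opus 4.6", 0.83, 8288, 0.100),
]

best_pass = max(r.pass_rate for r in ROWS)
top_pass = [r.model for r in ROWS if r.pass_rate == best_pass]  # keeps ties
fewest_tokens = min(ROWS, key=lambda r: r.avg_tokens).model
cheapest = min(ROWS, key=lambda r: r.cost_per_success)

# Cost per success relative to the cheapest row.
relative_cost = {r.model: r.cost_per_success / cheapest.cost_per_success for r in ROWS}

if __name__ == "__main__":
    print("highest pass rate:", ", ".join(top_pass))
    print("fewest tokens:", fewest_tokens)
    print("cheapest per success:", cheapest.model)
    for model, ratio in relative_cost.items():
        print(f"{model:16s} {ratio:.1f}x")
```

Ranked this way, Claude Haiku and GPT-5.2 tie on pass rate, GPT-5.2 uses the fewest tokens, and Gemini 3 Flash is cheapest per success, with Claude Opus 4.6 costing roughly 14 times as much per successful task.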

# progressive discovery vs upfront loading
$ kubectl --help             # 809 tokens
$ kubectl rollout --help     # 247 tokens
$ kubectl rollout restart    # 493 tokens
 
Total: ~1,500 tokens
Full docs: ~30,000 tokens     95% saved

Research · 4 min read

SKILLS Docs vs CLI --help: Where Do Agent Tokens Actually Go?

We measured token consumption for two approaches to giving agents CLI knowledge. The answer depends on whether the model has already seen your CLI.

February 9, 2026 · Read post →
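
The arithmetic behind the "95% saved" figure in the preview can be checked directly. This sketch uses only the token counts shown above; the variable names are introduced here.

```python
# Token counts copied from the preview: one entry per progressive --help step.
STEPS = {
    "kubectl --help": 809,
    "kubectl rollout --help": 247,
    "kubectl rollout restart": 493,
}
FULL_DOCS_TOKENS = 30_000  # approximate size of the full docs, per the preview

progressive_total = sum(STEPS.values())
saved = 1 - progressive_total / FULL_DOCS_TOKENS

if __name__ == "__main__":
    print(f"{progressive_total} tokens, {saved:.0%} saved")  # prints "1549 tokens, 95% saved"
```

The exact sum is 1,549 tokens, which the preview rounds to ~1,500.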

# what agents see vs what humans see
$ deploy --help
Usage: deploy <service-name>
 
Options:
  -e, --env <name>   staging | production
 
$ deploy api --env production
✓ Deployed api to production

Guide · 8 min read

Can AI Agents Actually Use Your CLI?

Why help text, error messages, and output format matter more than you think — and how to measure it.

February 8, 2026 · Read post →