Blog
Thoughts on agent-readiness, CLI design, and evaluating tools against real LLM agents.
    # commands that don't exist - seen in BOTH runs
    cliwatch runs list --limit N       # both models tried
    cliwatch auth status               # both guessed before 'whoami'
    cliwatch validate --config FILE    # both used --config, not --file
4.7 made the same mistakes, just verified less after
Opus 4.7 vs 4.6 on our own CLI: what the traces actually say
Same 12 tasks, same CLI, 100% pass on both. The headline is a 36% drop in turns. Read the traces, though, and both models hallucinated the exact same commands.
    # docker benchmark: 4 models × 3 approaches
    Model            Upfront   --help   Skills
    GPT-5-nano       83%       67%      67%
    GPT-5.2          83%       33%      50%
    Gemini 3 Flash   100%      100%     100%
    Haiku 4.5        100%      100%     100%
Progressive --help breaks on complex CLIs
Designing a CLI Skills Protocol for AI Agents
We benchmarked three CLI discovery approaches across four models. Progressive --help breaks down on complex CLIs. A structured skills command recovers accuracy and cuts tokens.
    # cost per successful task - real benchmark data
    Model             Pass   Avg tokens   Cost/success
    Gemini 3 Flash    75%    4,753        $0.007
    Claude Haiku      92%    8,068        $0.014
    GPT-5.2           92%    3,705        $0.019
    Claude Opus 4.6   83%    8,288        $0.100
Cheapest per token ≠ best value per task
What Does It Actually Cost When an Agent Uses Your CLI?
We benchmarked 4 models on Docker CLI tasks. The cheapest model per token isn't always the smartest choice — and your CLI's design is the biggest cost lever.
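To illustrate why the cheapest model per attempt can still lose on value, here is a minimal sketch of one plausible way to compute cost per success. It assumes failed attempts are still paid for, so the cost of an attempt is divided by the pass rate. The function name and the two example models are hypothetical, and the post's own formula may differ.

```python
def cost_per_success(cost_per_attempt_usd: float, pass_rate: float) -> float:
    """Expected spend per successful task, assuming failed attempts still cost money."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt_usd / pass_rate


# Hypothetical numbers, NOT taken from the benchmark:
cheap_but_flaky = cost_per_success(0.010, 0.5)       # cheaper per attempt, passes half the time
pricier_but_reliable = cost_per_success(0.015, 1.0)  # dearer per attempt, always passes

print(f"cheap but flaky:      ${cheap_but_flaky:.3f} per success")
print(f"pricier but reliable: ${pricier_but_reliable:.3f} per success")
```

With these made-up inputs the model that costs less per attempt ends up costing more per success, which is the pattern the table's closing line describes.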
    # progressive discovery vs upfront loading
    $ kubectl --help             # 809 tokens
    $ kubectl rollout --help     # 247 tokens
    $ kubectl rollout restart    # 493 tokens
    Total:     ~1,500 tokens
    Full docs: ~30,000 tokens
95% saved
SKILLS Docs vs CLI --help: Where Do Agent Tokens Actually Go?
We measured token consumption for two approaches to giving agents CLI knowledge. The answer depends on whether the model has already seen your CLI.
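The arithmetic behind the "95% saved" figure in the kubectl preview can be checked with a short sketch. The token counts are copied from the preview. The variable names are illustrative, and the 30,000 figure is the preview's own approximation.

```python
# Token counts copied from the kubectl preview.
progressive_steps = {
    "kubectl --help": 809,
    "kubectl rollout --help": 247,
    "kubectl rollout restart": 493,
}
full_docs_tokens = 30_000  # approximate figure quoted in the preview

progressive_total = sum(progressive_steps.values())
saved_fraction = 1 - progressive_total / full_docs_tokens

print(f"progressive total: {progressive_total} tokens")  # 1549, shown as ~1,500
print(f"saved vs full docs: {saved_fraction:.0%}")       # 95%
```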
    # what agents see vs what humans see
    $ deploy --help
    Usage: deploy <service-name>
    Options:
      -e, --env <name>    staging | production
    $ deploy api --env production
    ✓ Deployed api to production
Can AI Agents Actually Use Your CLI?
Why help text, error messages, and output format matter more than you think — and how to measure it.