Blog

Thoughts on agent-readiness, CLI design, and evaluating tools against real LLM agents.

# docker benchmark: 4 models × 3 approaches
Model          Upfront  --help  Skills
GPT-5-nano       83%     67%     67%
GPT-5.2          83%     33%     50%
Gemini 3 Flash  100%    100%    100%
Haiku 4.5       100%    100%    100%
 
Progressive --help breaks on complex CLIs
Research · 5 min read

Designing a CLI Skills Protocol for AI Agents

We benchmarked three CLI discovery approaches across four models. Progressive --help breaks down on complex CLIs. A structured skills command recovers accuracy and cuts tokens.

Read post →
# cost per successful task — real benchmark data
Model            Pass    Avg tokens  Cost/success
Gemini 3 Flash   75%     4,753       $0.007
Claude Haiku     92%     8,068       $0.014
GPT-5.2          92%     3,705       $0.019
Claude Opus 4.6  83%     8,288       $0.100
 
Cheapest per token ≠ best value per task
Analysis · 5 min read

What Does It Actually Cost When an Agent Uses Your CLI?

We benchmarked four models on Docker CLI tasks. The cheapest model per token isn't always the smartest choice — and your CLI's design is the biggest cost lever.

Read post →
# progressive discovery vs upfront loading
$ kubectl --help             # 809 tokens
$ kubectl rollout --help     # 247 tokens
$ kubectl rollout restart    # 493 tokens
 
Total: ~1,500 tokens
Full docs: ~30,000 tokens     95% saved
Research · 4 min read

SKILLS Docs vs CLI --help: Where Do Agent Tokens Actually Go?

We measured token consumption for two approaches to giving agents CLI knowledge. The answer depends on whether the model has already seen your CLI.

Read post →
# what agents see vs what humans see
$ deploy --help
Usage: deploy <service-name>
 
Options:
  -e, --env <name>   staging | production
 
$ deploy api --env production
✓ Deployed api to production
Guide · 8 min read

Can AI Agents Actually Use Your CLI?

Why help text, error messages, and output format matter more than you think — and how to measure it.

Read post →