Blog
Thoughts on agent-readiness, CLI design, and evaluating tools against real LLM agents.
```
# docker benchmark: 4 models × 3 approaches
Model            Upfront  --help  Skills
GPT-5-nano       83%      67%     67%
GPT-5.2          83%      33%     50%
Gemini 3 Flash   100%     100%    100%
Haiku 4.5        100%     100%    100%

Progressive --help breaks on complex CLIs
```
Research · 5 min read
Designing a CLI Skills Protocol for AI Agents
We benchmarked three CLI discovery approaches across four models. Progressive --help breaks down on complex CLIs. A structured skills command recovers accuracy and cuts tokens.
Read post →
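The post designs its own skills protocol; the exact shape isn't shown on this page. As a purely hypothetical illustration of the idea, a CLI could expose a structured, machine-readable listing of its capabilities in one call (the tool name and skill entries below are made-up examples, not the post's actual format):

```python
# Hypothetical sketch: a structured "skills" listing an agent could
# fetch in one call instead of walking nested --help screens.
# The schema here is illustrative only; the post's protocol may differ.
import json

skills = {
    "tool": "docker",  # example CLI from the benchmark
    "skills": [
        {
            "name": "run-container",
            "command": "docker run [OPTIONS] IMAGE [COMMAND]",
            "summary": "Create and start a container from an image",
        },
        {
            "name": "list-containers",
            "command": "docker ps [OPTIONS]",
            "summary": "List running containers",
        },
    ],
}

print(json.dumps(skills, indent=2))
```

A single structured payload like this is cheap to parse and doesn't require the model to guess which subcommand's `--help` to read next.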
```
# cost per successful task — real benchmark data
Model            Pass  Avg tokens  Cost/success
Gemini 3 Flash   75%   4,753       $0.007
Claude Haiku     92%   8,068       $0.014
GPT-5.2          92%   3,705       $0.019
Claude Opus 4.6  83%   8,288       $0.100

Cheapest per token ≠ best value per task
```
Analysis · 5 min read
What Does It Actually Cost When an Agent Uses Your CLI?
We benchmarked four models on Docker CLI tasks. The cheapest model per token isn't always the smartest choice — and your CLI's design is the biggest cost lever.
Read post →
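The intuition behind a cost-per-success metric can be sketched in a few lines. This assumes the natural definition (per-attempt cost divided by pass rate); the price used below is a made-up illustrative rate, not any model's real pricing:

```python
def cost_per_success(avg_tokens: int, price_per_mtok: float, pass_rate: float) -> float:
    """Average spend to get one successful task completion.

    Failed attempts still burn tokens, so cost/success = per-attempt
    cost / pass rate: a cheap model that fails often can cost more
    per *success* than a pricier, more reliable one.
    """
    cost_per_attempt = avg_tokens / 1_000_000 * price_per_mtok
    return cost_per_attempt / pass_rate

# Hypothetical price ($1.00/Mtok) with the post's 4,753-token, 75%-pass profile:
print(round(cost_per_success(4_753, 1.00, 0.75), 5))
```

Dividing by the pass rate is what flips the rankings: a 75% pass rate inflates every per-attempt dollar by a third.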
```
# progressive discovery vs upfront loading
$ kubectl --help            # 809 tokens
$ kubectl rollout --help    # 247 tokens
$ kubectl rollout restart   # 493 tokens

Total: ~1,500 tokens
Full docs: ~30,000 tokens
95% saved
```
Research · 4 min read
Skills Docs vs CLI --help: Where Do Agent Tokens Actually Go?
We measured token consumption for two approaches to giving agents CLI knowledge. The answer depends on whether the model has already seen your CLI.
Read post →
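The arithmetic behind the "95% saved" figure in the preview is straightforward, using the token counts from the kubectl example above:

```python
# Token budget comparison from the kubectl example:
# three progressive --help calls vs loading the full docs upfront.
progressive = 809 + 247 + 493   # tokens across the three --help calls
full_docs = 30_000              # approximate full-documentation dump

savings = 1 - progressive / full_docs
print(f"{progressive} vs {full_docs} tokens -> {savings:.0%} saved")
```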
```
# what agents see vs what humans see
$ deploy --help
Usage: deploy <service-name>
Options:
  -e, --env <name>   staging | production

$ deploy api --env production
✓ Deployed api to production
```
Guide · 8 min read
Can AI Agents Actually Use Your CLI?
Why help text, error messages, and output format matter more than you think — and how to measure it.
Read post →
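One concrete way to get help text like the `deploy` example above — where valid values are enumerated rather than left implicit — is to declare them in the argument parser. A minimal Python sketch (the `deploy` tool itself is hypothetical, taken from the card's mockup):

```python
# Sketch of a "deploy"-style CLI whose --help spells out valid values,
# so an agent can construct a correct invocation on the first try.
import argparse

parser = argparse.ArgumentParser(
    prog="deploy",
    description="Deploy a service to an environment.",
)
parser.add_argument("service_name", help="name of the service to deploy")
parser.add_argument(
    "-e", "--env",
    choices=["staging", "production"],  # enumerated values appear in --help
    default="staging",
    help="target environment (default: %(default)s)",
)

args = parser.parse_args(["api", "--env", "production"])
print(f"Deployed {args.service_name} to {args.env}")
```

With `choices` declared, a bad value fails fast with an error that lists the allowed options — exactly the kind of recoverable feedback an agent can act on.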