Designing a CLI Skills Protocol for AI Agents
Part 2: --help wasn't designed for agents. What if CLIs shipped a discovery mechanism that was?
In Part 1, we showed that progressive --help discovery can save 82-98% of help text tokens compared to loading full docs upfront. But we also showed the catch: those savings only hold when help text is descriptive enough for the agent to navigate in one step. Vague descriptions like “Manage things” cause backtracking that burns 3x the tokens and wastes turns.
The deeper problem is that --help is freeform text written for humans. It can't express workflows, it doesn't distinguish between common and advanced commands, and every CLI formats it differently. All of that text lands in the context window, and the agent has to reason through it before it can decide what to do.
What if there were a standard way for CLIs to tell agents what they can do?
What --help can't express
Consider a developer who wants to deploy a Cloudflare Worker. A human would scan the docs, notice the “Getting Started” guide, and follow the steps. An agent running wrangler --help sees a flat list of 20+ commands with no sense of which ones matter for the task.
Commands:
  docs      Open Wrangler's command documentation
  init      Create a new project
  generate  ...
  dev       Start a local development server
  deploy    Deploy to Cloudflare
  delete    Delete a project
  tail      Start a live tail session
  secret    Manage secrets
  ...14 more commands
An agent can probably figure out init then deploy from these descriptions. But does it need secret before deploy? Should it use dev to test first? What flags does deploy need? The help text lists commands but not the order, prerequisites, or which flags are required vs. optional. For multi-step workflows, the agent has to read help for each subcommand and piece together the sequence itself.
Three things are missing from --help that agents need:
- Workflows - “to deploy a Worker, run init then deploy”
- Priority - which commands are the common path vs. advanced/rare
- Structure - machine-readable output that doesn't require parsing freeform text
The skills command
We propose a convention: CLIs ship a skills subcommand (or --skills flag) that returns structured JSON describing what the CLI can do. It's progressive, like --help, but designed for machines.
Level 0: Overview. Running mycli skills returns a compact summary of capabilities, grouped by workflow.
{
  "name": "mycli",
  "version": "2.1.0",
  "skills": [
    {
      "id": "deploy",
      "summary": "Deploy a service to staging or production",
      "workflow": ["build", "deploy"],
      "common": true
    },
    {
      "id": "manage-secrets",
      "summary": "Create, rotate, and delete encrypted secrets",
      "workflow": ["secret set", "secret list"],
      "common": false
    },
    {
      "id": "logs",
      "summary": "Tail or search logs for a running service",
      "workflow": ["logs tail", "logs search"],
      "common": true
    }
  ]
}

This is about 180 tokens. The agent now knows: there are 3 capabilities, “deploy” and “logs” are common, deploying involves build then deploy. No guessing. No backtracking.
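To show how little work the agent side needs, here is a minimal Python sketch that consumes the Level 0 overview. The helper name `pick_skill` and the word-overlap heuristic are our assumptions for illustration. In practice the model itself would choose the skill. The point is only that the overview can be read with a JSON parser instead of text scraping.

```python
import json
from typing import Optional, Tuple

# Level 0 payload from the example above, as an agent would receive it on stdout.
OVERVIEW = """
{"name": "mycli", "version": "2.1.0", "skills": [
  {"id": "deploy", "summary": "Deploy a service to staging or production",
   "workflow": ["build", "deploy"], "common": true},
  {"id": "manage-secrets", "summary": "Create, rotate, and delete encrypted secrets",
   "workflow": ["secret set", "secret list"], "common": false},
  {"id": "logs", "summary": "Tail or search logs for a running service",
   "workflow": ["logs tail", "logs search"], "common": true}
]}
"""


def pick_skill(overview_json: str, task: str) -> Optional[str]:
    """Return the id of the skill whose summary shares the most words with the task.

    Hypothetical heuristic, not part of the proposed protocol.
    """
    skills = json.loads(overview_json)["skills"]
    task_words = set(task.lower().split())

    def score(skill: dict) -> Tuple[int, bool]:
        summary_words = set(skill["summary"].lower().replace(",", "").split())
        # Rank by word overlap first, then prefer skills flagged as common.
        return (len(task_words & summary_words), skill["common"])

    best = max(skills, key=score)
    return best["id"] if score(best)[0] > 0 else None


if __name__ == "__main__":
    skill_id = pick_skill(OVERVIEW, "deploy the api service to staging")
    # The next discovery step is the Level 1 call for the chosen skill.
    print(f"mycli skills {skill_id}")
```

The chosen id feeds directly into the next call, so discovery takes two commands regardless of how many subcommands the CLI has.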
Level 1: Skill detail. The agent picks a skill and drills in with mycli skills deploy.
{
  "id": "deploy",
  "summary": "Deploy a service to staging or production",
  "steps": [
    {
      "command": "mycli build <service>",
      "description": "Build the service image. Required before deploy.",
      "args": {
        "service": "Service name from mycli services list"
      }
    },
    {
      "command": "mycli deploy <service> --env <environment>",
      "description": "Deploy the built image.",
      "args": {
        "service": "Same service name used in build",
        "environment": "staging | production"
      },
      "flags": {
        "--force": "Skip confirmation prompt",
        "--dry-run": "Show what would be deployed without deploying"
      }
    }
  ],
  "examples": [
    "mycli build api && mycli deploy api --env staging",
    "mycli deploy api --env production --force"
  ],
  "output_format": "JSON on success, exit code 1 with error message on failure"
}

About 250 tokens total. After reading it, the agent knows the exact command sequence, valid argument values, available flags, and what to expect from the output. A --short flag can cut that further by minifying keys and removing whitespace. Compare that to running mycli build --help and mycli deploy --help separately, parsing freeform text, and hoping the flag names are consistent.
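One way the --short idea could work is sketched below in Python. The key abbreviations in `SHORT_KEYS` are hypothetical, since we have not fixed a mapping, and the sketch compares character counts rather than tokens. The `args` and `flags` maps are left untouched because their keys are user-facing names, not schema keys.

```python
import json

# Hypothetical abbreviations; the --short mapping is not defined by the proposal.
SHORT_KEYS = {
    "summary": "s", "steps": "st", "command": "c", "description": "d",
    "args": "a", "flags": "f", "examples": "ex", "output_format": "o",
}
# Maps keyed by argument or flag names: rename the key, never the contents.
FREEFORM = {"args", "flags"}

DETAIL = {
    "id": "deploy",
    "summary": "Deploy a service to staging or production",
    "steps": [
        {
            "command": "mycli deploy <service> --env <environment>",
            "description": "Deploy the built image.",
            "args": {
                "service": "Same service name used in build",
                "environment": "staging | production",
            },
            "flags": {"--force": "Skip confirmation prompt"},
        }
    ],
}


def shorten(node):
    """Recursively rename known schema keys, leaving values untouched."""
    if isinstance(node, dict):
        return {
            SHORT_KEYS.get(key, key): (value if key in FREEFORM else shorten(value))
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [shorten(item) for item in node]
    return node


def render(detail: dict, short: bool = False) -> str:
    """Pretty-print by default; with short=True, abbreviate keys and drop whitespace."""
    if short:
        return json.dumps(shorten(detail), separators=(",", ":"))
    return json.dumps(detail, indent=2)


if __name__ == "__main__":
    pretty, compact = render(DETAIL), render(DETAIL, short=True)
    print(f"{len(pretty)} chars pretty, {len(compact)} chars with --short")
```

The trade-off is that abbreviated keys are only useful if the mapping is itself part of the convention. Otherwise the agent spends tokens guessing what each key means.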
Why not just add this to --help? Because --help serves humans, and changing its output risks breaking existing scripts, docs generators, and workflows. You could add --help --json, but then you are building a structured agent interface anyway - just with a different name. The skills command makes the intent explicit: this is for agents, it is workflow-oriented, and it follows a convention that transfers across CLIs.
Benchmarking it for real: Docker
Theory is nice. Let's look at real data. We used cli-bench, our open-source CLI benchmarking tool, to run the experiment. cli-bench gives an LLM agent shell access, sends it a task intent, and scores completion by checking which commands were run and what the output contained. We ran 6 Docker tasks across 4 models with three different starting contexts:
- Full docs upfront - cli-bench crawls docker --help and every subcommand's help text, then injects it all into the system prompt before the agent starts. Complete knowledge, high token cost.
- Progressive --help - the agent starts with no CLI documentation. It can run docker --help and docker <cmd> --help on demand during the task, discovering commands as it goes.
- Skills protocol - same as progressive, but the system prompt includes an AGENTS.md-style hint: “Docker supports the skills protocol. Run docker skills for an overview.” We wrapped Docker with a thin docker skills subcommand returning structured JSON about workflows, relevant flags, and examples.
| Model | Full docs upfront | Progressive --help | Skills protocol |
|---|---|---|---|
| GPT-5-nano | 83% / 4.2 turns / 16K tok | 67% / 3.8 turns / 10K tok | 67% / 3.5 turns / 10K tok |
| GPT-5.2 | 83% / 5.6 turns / 8K tok | 33% / 8.5 turns / 13K tok | 50% / 7.0 turns / 8K tok |
| Gemini 3 Flash | 100% / 4.3 turns / 8K tok | 100% / 3.8 turns / 3K tok | 100% / 4.2 turns / 3K tok |
| Haiku 4.5 | 100% / 4.7 turns / 13K tok | 100% / 4.7 turns / 9K tok | 100% / 4.8 turns / 7K tok |
Pass rate / avg turns / avg tokens. 6 tasks per cell, single run (no repeats). A single task changes the pass rate by ~17 percentage points, so treat these as directional signals.
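The granularity is easy to check. With 6 tasks and a single run, a cell can only land on one of seven pass rates, which is why every percentage in the table is a multiple of roughly 17:

```python
TASKS_PER_CELL = 6

# Every pass rate a single 6-task run can produce, rounded to whole percentages.
possible_rates = [
    round(100 * passed / TASKS_PER_CELL) for passed in range(TASKS_PER_CELL + 1)
]
step = 100 / TASKS_PER_CELL  # percentage points gained or lost per task

print(possible_rates)  # [0, 17, 33, 50, 67, 83, 100]
print(round(step, 1))  # 16.7
```

So a drop from 83% to 33% means 5 of 6 tasks passed in one setup and 2 of 6 in the other.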
ⓘ We plan to re-run these benchmarks with more repeats and additional models. Check back for updated numbers.
View the 6 Docker tasks and how they are scored
Each task is defined in a cli-bench.yaml file with an intent (what the agent is told to do) and assertions (how we check if it succeeded). Here is one example:
- id: network-ping
  intent: "Create a Docker network called 'bench-net'.
    Run an nginx container named 'bench-web' on that
    network in detached mode. Then run an Alpine
    container on the same network and use
    'wget -qO- http://bench-web/' to verify connectivity."
  difficulty: medium
  max_turns: 10
  assert:
    - ran: "docker network"
    - ran: "docker run"
    - output_contains: "nginx"

A task passes when all assertions are met: the agent ran the expected commands and the terminal output contained the expected strings. The full list:
1. build-and-run (easy) - Create a Dockerfile, build an image, run it
2. detach-and-logs (easy) - Run a detached container, retrieve its logs
3. network-ping (medium) - Create a network, run nginx, verify connectivity from another container
4. volume-share (medium) - Write to a named volume, read from a second container
5. build-tag-inspect (medium) - Build with labels, tag the image, inspect metadata
6. full-stack (hard) - Create network + volume, run nginx, fetch page via wget, persist to volume, verify
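The pass rule described above can be sketched in a few lines of Python. This is not cli-bench's actual implementation. The `Transcript` type and the substring matching for `ran` are assumptions made to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Transcript:
    """What a harness is assumed to record for one task attempt."""
    commands: List[str] = field(default_factory=list)  # shell commands the agent ran
    output: str = ""                                   # combined terminal output


def check(assertion: Dict[str, str], transcript: Transcript) -> bool:
    """Evaluate a single assertion. Substring matching is an assumption."""
    if "ran" in assertion:
        return any(assertion["ran"] in cmd for cmd in transcript.commands)
    if "output_contains" in assertion:
        return assertion["output_contains"] in transcript.output
    raise ValueError(f"unknown assertion type: {assertion}")


def task_passes(assertions: List[Dict[str, str]], transcript: Transcript) -> bool:
    """A task passes only when every assertion holds."""
    return all(check(a, transcript) for a in assertions)


# Assertions from the network-ping task shown above.
NETWORK_PING = [
    {"ran": "docker network"},
    {"ran": "docker run"},
    {"output_contains": "nginx"},
]

if __name__ == "__main__":
    attempt = Transcript(
        commands=[
            "docker network create bench-net",
            "docker run -d --name bench-web --network bench-net nginx",
            "docker run --rm --network bench-net alpine wget -qO- http://bench-web/",
        ],
        output="<title>Welcome to nginx!</title>",
    )
    print(task_passes(NETWORK_PING, attempt))
```

Assertions of this kind check that the right commands ran and the right text appeared. They do not check flag correctness or ordering, which is worth keeping in mind when reading the pass rates.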
Let's be honest: this is not a clean “skills protocol wins” story. The data tells a more nuanced, and arguably more interesting, one.
Strong models do not need help discovering Docker. Gemini 3 Flash and Haiku 4.5 hit 100% on all three approaches. Docker is one of the most extensively documented CLIs in existence - tutorials, Stack Overflow answers, and Dockerfile examples are ubiquitous in training corpora. These models do not need --help or skills to know that docker network create comes before docker run --network.
Progressive --help breaks down hard for some models. GPT-5.2 drops from 83% to 33% (5 of 6 tasks to 2 of 6) when forced to discover Docker through --help. Docker's help output is dense, with 90+ flags on docker run alone and nested subcommand hierarchies. The agent gets lost in help text, burns turns parsing irrelevant flags, and runs out of attempts on the harder tasks. GPT-5-nano shows a smaller drop in the same direction, from 83% to 67% (one task).
Skills partially recovers the damage. The skills protocol brings GPT-5.2 back from 33% to 50% (2 of 6 tasks to 3 of 6) and cuts its average tokens from 13K to 8K. It does not fully close the gap to full docs upfront, but it gives the agent workflows and relevant flags without the noise of the full help tree.
The takeaway is not “skills protocol beats everything.” It is that progressive --help is fragile on complex CLIs, and a structured discovery mechanism makes that fragility less painful. For well-known CLIs, strong models compensate with training knowledge. For less well-known CLIs, we expect the gap to be wider.
Start with AGENTS.md today
You do not need the full skills protocol to improve things. There is a natural ladder:
Step 1: Add a short pointer to AGENTS.md. Tell agents how to discover your CLI. Two lines, about 40 tokens. This alone eliminates blind guessing.
## mycli
Data pipeline CLI. Run `mycli --help` for commands, then `mycli <command> --help` for flags and examples.
Step 2: Add a skills command. When you are ready, ship mycli skills and update the pointer. The implementation is a single JSON file shipped with your binary - you can generate it from existing command definitions in commander, yargs, or click.
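As a rough idea of the effort involved, here is a minimal Python sketch of a `skills` subcommand serving both levels from one data structure. It uses the standard library's argparse as a stand-in for click, and it embeds the data inline instead of loading a shipped JSON file, so the example stays self-contained. The `SKILLS` contents, the step listed for `logs`, and the error shape are illustrative assumptions.

```python
import argparse
import json

# In a real CLI this would be loaded from the JSON file shipped with the binary.
SKILLS = {
    "deploy": {
        "summary": "Deploy a service to staging or production",
        "workflow": ["build", "deploy"],
        "common": True,
        "steps": [
            {"command": "mycli build <service>"},
            {"command": "mycli deploy <service> --env <environment>"},
        ],
    },
    "logs": {
        "summary": "Tail or search logs for a running service",
        "workflow": ["logs tail", "logs search"],
        "common": True,
        "steps": [{"command": "mycli logs tail <service>"}],  # illustrative only
    },
}
OVERVIEW_FIELDS = ("summary", "workflow", "common")


def skills_payload(skill_id=None):
    """Return the Level 0 overview when skill_id is None, else Level 1 detail."""
    if skill_id is None:
        return {
            "name": "mycli",
            "version": "2.1.0",
            "skills": [
                {"id": sid, **{key: skill[key] for key in OVERVIEW_FIELDS}}
                for sid, skill in SKILLS.items()
            ],
        }
    skill = SKILLS[skill_id]  # raises KeyError for unknown ids
    return {"id": skill_id, "summary": skill["summary"], "steps": skill["steps"]}


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="mycli")
    sub = parser.add_subparsers(dest="command", required=True)
    skills = sub.add_parser("skills", help="Describe capabilities as JSON for agents")
    skills.add_argument("skill_id", nargs="?", help="Skill to describe in detail")
    args = parser.parse_args(argv)
    try:
        print(json.dumps(skills_payload(args.skill_id), indent=2))
    except KeyError:
        print(json.dumps({"error": f"unknown skill: {args.skill_id}"}))
        return 1
    return 0


if __name__ == "__main__":
    # Demo calls; a real entry point would pass sys.argv[1:] and exit with the result.
    main(["skills"])            # Level 0 overview
    main(["skills", "deploy"])  # Level 1 detail
```

Keeping the overview and the detail in one source means the two levels cannot drift apart, and the overview stays small because it only projects three fields per skill.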
This pattern is already gaining traction. Vercel's Skills.sh registry takes the complementary approach: a central directory where agents discover tools externally. The skills command is the local counterpart - built into the CLI, versioned with the binary. Both layers work together.
Coming in Part 3
Docker gave us real data, but it also revealed the limitation of benchmarking well-known tools: strong models compensate with training knowledge. In Part 3, we will repeat this experiment with CLIs that models have never seen - tools with zero presence in training data. That is where the skills protocol should have the biggest impact, and where progressive --help should struggle the most.