
Designing a CLI Skills Protocol for AI Agents

Part 2: --help wasn't designed for agents. What if CLIs shipped a discovery mechanism that was?

In Part 1, we showed that progressive --help discovery can save 82-98% of help text tokens compared to loading full docs upfront. But we also showed the catch: those savings only hold when help text is descriptive enough for the agent to navigate in one step. Vague descriptions like “Manage things” cause backtracking that burns 3x the tokens and wastes turns.

The deeper problem is that --help is freeform text written for humans. It can't express workflows, it doesn't distinguish between common and advanced commands, and every CLI formats it differently. All of that text lands in the context window, and the agent has to reason through it before it can decide what to do.

What if there were a standard way for CLIs to tell agents what they can do?

What --help can't express

Consider a developer who wants to deploy a Cloudflare Worker. A human would scan the docs, notice the “Getting Started” guide, and follow the steps. An agent running wrangler --help sees a flat list of 20+ commands with no sense of which ones matter for the task.

What an agent sees from wrangler --help
Commands:
  docs        Open Wrangler's command documentation
  init        Create a new project
  generate    ...
  dev         Start a local development server
  deploy      Deploy to Cloudflare
  delete      Delete a project
  tail        Start a live tail session
  secret      Manage secrets
  ...14 more commands

An agent can probably figure out init then deploy from these descriptions. But does it need secret before deploy? Should it use dev to test first? What flags does deploy need? The help text lists commands but not the order, prerequisites, or which flags are required vs. optional. For multi-step workflows, the agent has to read help for each subcommand and piece together the sequence itself.

Three things are missing from --help that agents need:

  • Workflows - “to deploy a Worker, run init then deploy”
  • Priority - which commands are the common path vs. advanced/rare
  • Structure - machine-readable output that doesn't require parsing freeform text

The skills command

We propose a convention: CLIs ship a skills subcommand (or --skills flag) that returns structured JSON describing what the CLI can do. It's progressive, like --help, but designed for machines.

Level 0: Overview. Running mycli skills returns a compact summary of capabilities, grouped by workflow.

$ mycli skills
{
  "name": "mycli",
  "version": "2.1.0",
  "skills": [
    {
      "id": "deploy",
      "summary": "Deploy a service to staging or production",
      "workflow": ["build", "deploy"],
      "common": true
    },
    {
      "id": "manage-secrets",
      "summary": "Create, rotate, and delete encrypted secrets",
      "workflow": ["secret set", "secret list"],
      "common": false
    },
    {
      "id": "logs",
      "summary": "Tail or search logs for a running service",
      "workflow": ["logs tail", "logs search"],
      "common": true
    }
  ]
}

This is about 180 tokens. The agent now knows: there are 3 capabilities, “deploy” and “logs” are common, deploying involves build then deploy. No guessing. No backtracking.
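On the agent side, consuming the Level 0 overview takes a few lines of JSON handling instead of freeform-text parsing. A minimal sketch in Python, using the example payload above (the schema is our proposed convention, not a published standard):

```python
import json

# Level 0 output as returned by `mycli skills` (the example from above).
overview = json.loads("""
{
  "name": "mycli",
  "version": "2.1.0",
  "skills": [
    {"id": "deploy", "summary": "Deploy a service to staging or production",
     "workflow": ["build", "deploy"], "common": true},
    {"id": "manage-secrets", "summary": "Create, rotate, and delete encrypted secrets",
     "workflow": ["secret set", "secret list"], "common": false},
    {"id": "logs", "summary": "Tail or search logs for a running service",
     "workflow": ["logs tail", "logs search"], "common": true}
  ]
}
""")

# Surface the common path first; drill into a specific skill only on demand.
common = [s["id"] for s in overview["skills"] if s["common"]]
deploy_steps = next(s["workflow"] for s in overview["skills"] if s["id"] == "deploy")

print(common)        # ['deploy', 'logs']
print(deploy_steps)  # ['build', 'deploy']
```

The agent can now rank candidate skills without reasoning over prose at all; the expensive model call is saved for deciding which workflow matches the task.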

Level 1: Skill detail. The agent picks a skill and drills in with mycli skills deploy.

$ mycli skills deploy
{
  "id": "deploy",
  "summary": "Deploy a service to staging or production",
  "steps": [
    {
      "command": "mycli build <service>",
      "description": "Build the service image. Required before deploy.",
      "args": {
        "service": "Service name from mycli services list"
      }
    },
    {
      "command": "mycli deploy <service> --env <environment>",
      "description": "Deploy the built image.",
      "args": {
        "service": "Same service name used in build",
        "environment": "staging | production"
      },
      "flags": {
        "--force": "Skip confirmation prompt",
        "--dry-run": "Show what would be deployed without deploying"
      }
    }
  ],
  "examples": [
    "mycli build api && mycli deploy api --env staging",
    "mycli deploy api --env production --force"
  ],
  "output_format": "JSON on success, exit code 1 with error message on failure"
}

About 250 tokens total. After reading it, the agent knows the exact command sequence, valid argument values, available flags, and what to expect from the output. A --short flag can cut that further by minifying keys and removing whitespace. Compare that to running mycli build --help and mycli deploy --help separately, parsing freeform text, and hoping the flag names are consistent.

Why not just add this to --help? Because --help serves humans, and changing its output risks breaking existing scripts, docs generators, and workflows. You could add --help --json, but then you are building a structured agent interface anyway - just with a different name. The skills command makes the intent explicit: this is for agents, it is workflow-oriented, and it follows a convention that transfers across CLIs.

Benchmarking it for real: Docker

Theory is nice. Let's look at real data. We used cli-bench, our open-source CLI benchmarking tool, to run the experiment. cli-bench gives an LLM agent shell access, sends it a task intent, and scores completion by checking which commands were run and what the output contained. We ran 6 Docker tasks across 4 models with three different starting contexts:

  • Full docs upfront - cli-bench crawls docker --help and every subcommand's help text, then injects it all into the system prompt before the agent starts. Complete knowledge, high token cost.
  • Progressive --help - the agent starts with no CLI documentation. It can run docker --help and docker <cmd> --help on demand during the task, discovering commands as it goes.
  • Skills protocol - same as progressive, but the system prompt includes an AGENTS.md-style hint: “Docker supports the skills protocol. Run docker skills for an overview.” We wrapped Docker with a thin docker skills subcommand returning structured JSON about workflows, relevant flags, and examples.
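The shim itself is small: intercept the `skills` subcommand, pass everything else through to the real binary. A sketch of the approach in Python (the `SKILLS` payload here is illustrative; the one we used in the benchmark covers more workflows):

```python
import json
import os
import sys

# Illustrative Level 0 payload; the real file describes more Docker workflows.
SKILLS = {
    "name": "docker",
    "skills": [
        {
            "id": "run-on-network",
            "summary": "Run containers that can reach each other by name",
            "workflow": ["network create", "run --network"],
            "common": True,
        },
    ],
}

def handle(argv):
    """Answer `skills` queries ourselves; return None to delegate to real docker."""
    if argv[:1] == ["skills"]:
        return json.dumps(SKILLS, indent=2)
    return None

if __name__ == "__main__" and len(sys.argv) > 1:
    out = handle(sys.argv[1:])
    if out is not None:
        print(out)
    else:
        # Any other subcommand is passed through to the real binary untouched.
        os.execvp("docker", ["docker", *sys.argv[1:]])
```

Because the wrapper only adds a subcommand, existing `docker` invocations behave identically; the agent just gains one extra discovery entry point.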
Model           Full docs upfront           Progressive --help          Skills protocol
GPT-5-nano      83% / 4.2 turns / 16K tok   67% / 3.8 turns / 10K tok   67% / 3.5 turns / 10K tok
GPT-5.2         83% / 5.6 turns / 8K tok    33% / 8.5 turns / 13K tok   50% / 7.0 turns / 8K tok
Gemini 3 Flash  100% / 4.3 turns / 8K tok   100% / 3.8 turns / 3K tok   100% / 4.2 turns / 3K tok
Haiku 4.5       100% / 4.7 turns / 13K tok  100% / 4.7 turns / 9K tok   100% / 4.8 turns / 7K tok

Pass rate / avg turns / avg tokens. 6 tasks per cell, single run (no repeats). With only 6 tasks, a single task flipping moves the pass rate by ~17 percentage points, so treat these as directional signals.

ⓘ We plan to re-run these benchmarks with more repeats and additional models. Check back for updated numbers.

The 6 Docker tasks and how they are scored

Each task is defined in a cli-bench.yaml file with an intent (what the agent is told to do) and assertions (how we check if it succeeded). Here is one example:

- id: network-ping
  intent: "Create a Docker network called 'bench-net'.
    Run an nginx container named 'bench-web' on that
    network in detached mode. Then run an Alpine
    container on the same network and use
    'wget -qO- http://bench-web/' to verify connectivity."
  difficulty: medium
  max_turns: 10
  assert:
    - ran: "docker network"
    - ran: "docker run"
    - output_contains: "nginx"

A task passes when all assertions are met: the agent ran the expected commands and the terminal output contained the expected strings. The full list:

1. build-and-run (easy) - Create a Dockerfile, build an image, run it

2. detach-and-logs (easy) - Run a detached container, retrieve its logs

3. network-ping (medium) - Create a network, run nginx, verify connectivity from another container

4. volume-share (medium) - Write to a named volume, read from a second container

5. build-tag-inspect (medium) - Build with labels, tag the image, inspect metadata

6. full-stack (hard) - Create network + volume, run nginx, fetch page via wget, persist to volume, verify
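The scoring logic reduces to substring checks over the agent's transcript. A simplified sketch (not cli-bench's actual implementation), applied to the network-ping task above:

```python
def task_passed(assertions, commands_run, terminal_output):
    """A task passes only if every assertion holds against the transcript."""
    for a in assertions:
        if "ran" in a:
            # Substring match against any command the agent executed.
            if not any(a["ran"] in cmd for cmd in commands_run):
                return False
        elif "output_contains" in a:
            if a["output_contains"] not in terminal_output:
                return False
    return True

# The network-ping task from the YAML above:
assertions = [
    {"ran": "docker network"},
    {"ran": "docker run"},
    {"output_contains": "nginx"},
]
transcript_cmds = [
    "docker network create bench-net",
    "docker run -d --name bench-web --network bench-net nginx",
    "docker run --rm --network bench-net alpine wget -qO- http://bench-web/",
]
print(task_passed(assertions, transcript_cmds, "...Welcome to nginx!..."))  # True
```

Substring assertions are deliberately loose: they accept any flag ordering or container naming the agent chooses, as long as the right commands ran and the right output appeared.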

Let's be honest: this is not a clean “skills protocol wins” story. The data tells a more nuanced, and arguably more interesting, one.

Strong models do not need help discovering Docker. Gemini 3 Flash and Haiku 4.5 hit 100% on all three approaches. Docker is one of the most extensively documented CLIs in existence - tutorials, Stack Overflow answers, and Dockerfile examples are ubiquitous in training corpora. These models do not need --help or skills to know that docker network create comes before docker run --network.

Progressive --help breaks down, hard. GPT-5.2 drops from 83% to 33% when forced to discover Docker through --help. Docker's help output is dense, with 90+ flags on docker run alone and nested subcommand hierarchies. The agent gets lost in help text, burns turns parsing irrelevant flags, and runs out of attempts on the harder tasks. GPT-5-nano shows a smaller but consistent drop from 83% to 67%.

Skills partially recovers the damage. The skills protocol brings GPT-5.2 back from 33% to 50% and cuts its average tokens from 13K to 8K. It does not fully close the gap to upfront help, but it gives the agent workflows and relevant flags without the noise of the full help tree.

The takeaway is not “skills protocol beats everything.” It is that progressive --help is fragile on complex CLIs, and a structured discovery mechanism makes that fragility less painful. For well-known CLIs, strong models compensate with training knowledge. For less well-known CLIs, the gap should be wider.

Start with AGENTS.md today

You do not need the full skills protocol to improve things. There is a natural ladder:

Step 1: Add a one-liner to AGENTS.md. Tell agents how to discover your CLI. Two lines, about 40 tokens. This alone eliminates blind guessing.

AGENTS.md
## mycli
Data pipeline CLI. Run `mycli --help` for commands,
then `mycli <command> --help` for flags and examples.

Step 2: Add a skills command. When you are ready, ship mycli skills and update the pointer. The implementation is a single JSON file shipped with your binary - you can generate it from existing command definitions in commander, yargs, or click.
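For a click-based CLI, for instance, a Level 0 payload can be derived from the command group at build time rather than maintained by hand. A sketch with a hypothetical build/deploy pair (the workflow and `common` metadata would still come from annotations you add; here every command is marked common for brevity):

```python
import json

import click

@click.group()
def cli():
    """Data pipeline CLI."""

@cli.command(help="Build the service image. Required before deploy.")
@click.argument("service")
def build(service):
    ...

@cli.command(help="Deploy the built image.")
@click.argument("service")
@click.option("--env", type=click.Choice(["staging", "production"]))
def deploy(service, env):
    ...

def skills_overview(group: click.Group) -> str:
    """Derive a Level 0 skills payload from existing click command definitions."""
    skills = [
        # 'common' would come from per-command annotations in a real CLI.
        {"id": name, "summary": cmd.help or "", "common": True}
        for name, cmd in group.commands.items()
    ]
    return json.dumps({"name": "mycli", "skills": skills}, indent=2)

print(skills_overview(cli))
```

Because the summaries come from the same help strings click already renders, the skills output cannot drift out of sync with `--help`.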

This pattern is already gaining traction. Vercel's Skills.sh registry takes the complementary approach: a central directory where agents discover tools externally. The skills command is the local counterpart - built into the CLI, versioned with the binary. Both layers work together.

Coming in Part 3

Docker gave us real data, but it also revealed the limitation of benchmarking well-known tools: strong models compensate with training knowledge. In Part 3, we will repeat this experiment with CLIs that models have never seen - tools with zero presence in training data. That is where the skills protocol should have the biggest impact, and where progressive --help should struggle the most.

Measure your CLI's agent efficiency

Track turns and tokens per task across models. See how your CLI's discoverability stacks up.