It’s 2026. Your software has a new type of user. Agents.

Millions of developers let AI agents choose and run their tools. We test how well agents can use yours, with real evals, real models, and results you can act on.

“Run npm install -g @cliwatch/cli, then run cliwatch skills to read the setup docs.”

Paste into any AI coding assistant

Claude Code · Cursor · OpenAI Codex

See CLIWatch in action

We evaluated 19 popular CLIs against real LLM agents on tasks derived from their own documentation. Imagine this for your CLI.

Terraform CLI · HashiCorp

Pass Rate: 94% · Tasks: 18 · Source: Docs

Tasks generated from HashiCorp's official documentation, tested across multiple models.

View the full benchmark results · Open dashboard →

Want to know how well AI agents can use your CLI?

Track pass rates, compare models, and catch regressions across releases.

Git v2.53.0
Based on run #6 · 2h ago · 1 model
Tabs: Releases · Models · Agent Harnesses

Pass rate by model
  • claude-opus-4.6 · 100%

Releases
  • v2.53.0 · 2h ago · 100%
  • v2.52.0 · Feb 13 · 92%

Stability
  • Stable: 4 since v2.52.0
  • create-branch-and-merge: improved

Activity (6)
  • Release v2.53.0 · 100% · 2h ago
  • CI a3f2c1e · 100% · 3h ago
  • CI e8b4d2a · 95% · 5h ago
  • Release v2.52.0 · 92% · Feb 13
  • CI 1c9a7f3 · 88% · Feb 12
  • CI 7d5e0b8 · 85% · Feb 10

Track agent-readiness over time

See pass rates, model comparisons, and regressions at a glance. Run evals in CI and track your CLI's agent compatibility across releases.

Task suites

Write tasks yourself, or let CLIWatch generate them automatically from your docs. See pass/fail results per task, and click any task to inspect the full agent transcript.
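A hand-written task pairs a natural-language intent with checks on the outcome. As a rough sketch of what such a file could look like (the schema, field names, and check keys below are hypothetical illustrations, not CLIWatch's documented format):

```yaml
# Hypothetical task definition: illustrative only, not CLIWatch's actual schema.
id: build-docker-image
intent: "Build a Docker image tagged bench:latest"  # the prompt given to the agent
checks:
  # Assertions evaluated after the agent finishes; these keys are invented for this sketch.
  - command_succeeds: docker image inspect bench:latest
```

The point is the shape, not the syntax: the agent sees only the intent, chooses its own commands, and the checks decide pass or fail.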

Run #8 · v0.4.19 · 5h ago · 2668ef3 · main
Evaluated against superfly/docs
17 source files · 19 tasks generated · HEAD
Results: 18/19 · Source docs: 17 · Task suite: 6 files / 422 lines
1 task fails on all models

fly_launch.md · 7/8
  • launch-and-inspect: Passed
  • launch-dockerfile-app: Passed
  • launch-empty-dir: Passed
  • launch-go-app: Passed
  • launch-node-app: Passed
  • launch-static-site: Failed
  • launch-name-conflict: Passed
  • launch-with-custom-port: Passed

fly_config.md · 3/3
  • create-fly-toml: Passed
  • launch-then-customize: Passed
  • multi-process-config: Passed

Test in CI, catch regressions in PRs

Add a single GitHub Actions workflow and CLIWatch evaluates your CLI on every pull request. The GitHub App posts a comment with pass rates, regressions, and a link to the full comparison dashboard.

  • Auto PR comments with per-model pass rates and deltas
  • GitHub Check Runs that block merge on regressions
  • Configurable thresholds to gate CI on pass rate targets
# .github/workflows/cliwatch.yml
name: CLIWatch Evals
on:
  pull_request:
  push:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm i -g @cliwatch/cli-bench
      - run: cli-bench
        env:
          CLIWATCH_API_KEY:    ${{ secrets.CLIWATCH_API_KEY }}
          AI_GATEWAY_API_KEY:  ${{ secrets.AI_GATEWAY_API_KEY }}
cliwatchbot commented 2 hours ago · edited

CLIWatch Eval Results

| Model | Pass rate | Delta | Turns | Tokens |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 95% (19/20) | +5% | 1.4 | 8.2k |
| GPT-5.2 | 90% (18/20) | +5% | 1.6 | 10.8k |
| Gemini 3 Flash | 85% (17/20) | -5% | 1.7 | 6.2k |

1 regression · 1 improved · 1 unchanged

View full comparison

AI agents are running your CLI

Developers ask Claude Code to “deploy with vercel” and Cursor to “check kubectl pods.” Your CLI is being used by AI agents, whether you’ve optimized for it or not.

Test with intents, not scripts

Each task is a natural language intent: “build a Docker image tagged bench:latest.” Derive them from your docs, or write your own. The agent figures out the commands.

Agent-ready CLIs win adoption

CLIs that work well with AI get recommended by coding assistants and adopted by teams. Those that don’t get replaced by alternatives that do.

Real data, open methodology

Each CLI has its own task suite, tested across the models you choose. Task suites are open source. Results are reproducible.

20+
CLIs evaluated
250+
Real-world tasks
Any LLM
via Vercel AI Gateway

Evaluating CLIs from

Docker · Kubernetes · Terraform · Git · GitHub CLI · Go · Cargo · npm · pnpm · Vercel · Stripe · Supabase · curl · Python · PostgreSQL
