It’s 2026. Your software has a new type of user. Agents.

Millions of developers let AI agents choose and run their tools. We test how well agents can use yours, with real evals, real models, and results you can act on.

“Run npm install -g @cliwatch/cli, then run cliwatch skills to read the setup docs.”

Paste into any AI coding assistant

Claude Code · Cursor · OpenAI Codex

See CLIWatch in action

We evaluated 19 popular CLIs against real LLM agents on tasks derived from their own documentation. Imagine this for your CLI.

Terraform CLI · HashiCorp

Pass Rate: 94% · Tasks: 18 · Source: Docs

Tasks generated from HashiCorp's official documentation, tested across multiple models.

View the full benchmark results · Open dashboard →

Want to know how well AI agents can use your CLI?

Track pass rates, compare models, and catch regressions across releases.

Git v2.53.0
Based on run #6 · 2h ago · 1 model
Tabs: Releases · Models · Agent Harnesses

Pass rate by model
  • claude-opus-4.6 · 100%

Releases
  • v2.53.0 · 2h ago · 100%
  • v2.52.0 · Feb 13 · 92%

Stability
  • Stable: 4 since v2.52.0
  • create-branch-and-merge: improved

Activity (6)
  • Release v2.53.0 · 100% · 2h ago
  • CI a3f2c1e · 100% · 3h ago
  • CI e8b4d2a · 95% · 5h ago
  • Release v2.52.0 · 92% · Feb 13
  • CI 1c9a7f3 · 88% · Feb 12
  • CI 7d5e0b8 · 85% · Feb 10

Track agent-readiness over time

See pass rates, model comparisons, and regressions at a glance. Run evals in CI and track your CLI's agent compatibility across releases.

Task suites

Write tasks yourself, or let CLIWatch generate them automatically from your docs. See pass/fail results per task, and click any task to inspect the full agent transcript.
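A hand-written task pairs a natural-language intent with checks on the outcome. As a rough sketch of what such a file could look like (the schema, field names, and check keys below are hypothetical illustrations, not CLIWatch's documented format):

```yaml
# Hypothetical task definition: illustrative only, not CLIWatch's actual schema.
id: build-docker-image
intent: "Build a Docker image tagged bench:latest"  # the prompt given to the agent
checks:
  # Assertions evaluated after the agent finishes; these keys are invented for this sketch.
  - command_succeeds: docker image inspect bench:latest
```

The point is the shape, not the syntax: the agent sees only the intent, chooses its own commands, and the checks decide pass or fail.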

Run #8 · v0.4.19 · 5h ago · 2668ef3 · main
Evaluated against superfly/docs
17 source files · 19 tasks generated · HEAD
Results: 18/19 · Source docs: 17 · Task suite: 6 files / 422 lines
1 task fails on all models

fly_launch.md · 7/8
  • launch-and-inspect: Passed
  • launch-dockerfile-app: Passed
  • launch-empty-dir: Passed
  • launch-go-app: Passed
  • launch-node-app: Passed
  • launch-static-site: Failed
  • launch-name-conflict: Passed
  • launch-with-custom-port: Passed

fly_config.md · 3/3
  • create-fly-toml: Passed
  • launch-then-customize: Passed
  • multi-process-config: Passed

Test in CI, catch regressions in PRs

Add a single GitHub Actions workflow and CLIWatch evaluates your CLI on every pull request. The GitHub App posts a comment with pass rates, regressions, and a link to the full comparison dashboard.

  • Auto PR comments with per-model pass rates and deltas
  • GitHub Check Runs that block merge on regressions
  • Configurable thresholds to gate CI on pass rate targets
# .github/workflows/cliwatch.yml
name: CLIWatch Evals
on:
  pull_request:
  push:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm i -g @cliwatch/cli-bench
      - run: cli-bench
        env:
          CLIWATCH_API_KEY:    ${{ secrets.CLIWATCH_API_KEY }}
          AI_GATEWAY_API_KEY:  ${{ secrets.AI_GATEWAY_API_KEY }}
cliwatchbot commented 2 hours ago · edited

CLIWatch Eval Results

| Model | Pass rate | Delta | Turns | Tokens |
|---|---|---|---|---|
| Claude Sonnet 4.6 | 95% (19/20) | +5% | 1.4 | 8.2k |
| GPT-5.2 | 90% (18/20) | +5% | 1.6 | 10.8k |
| Gemini 3 Flash | 85% (17/20) | -5% | 1.7 | 6.2k |

1 regression · 1 improved · 1 unchanged

View full comparison

AI agents are running your CLI

Developers ask Claude Code to “deploy with vercel” and Cursor to “check kubectl pods.” Your CLI is being used by AI agents, whether you’ve optimized for it or not.

Test with intents, not scripts

Each task is a natural language intent: “build a Docker image tagged bench:latest.” Derive them from your docs, or write your own. The agent figures out the commands.

Agent-ready CLIs win adoption

CLIs that work well with AI get recommended by coding assistants and adopted by teams. Those that don’t get replaced by alternatives that do.

Real data, open methodology

Each CLI has its own task suite, tested across the models you choose. Task suites are open source. Results are reproducible.

20+
CLIs evaluated
250+
Real-world tasks
Any LLM
via Vercel AI Gateway

Evaluating CLIs from

Docker · Kubernetes · Terraform · Git · GitHub CLI · Go · Cargo · npm · pnpm · Vercel · Stripe · Supabase · curl · Python · PostgreSQL
