CLI Evals
Evals for the agent interface layer
Your CLI is being used by AI agents. Are you testing that experience? CLIWatch runs structured benchmarks across models and gives you pass rates, regressions, and task-level results — in CI.
What the evals measure
Pass Rate
What percentage of tasks does each model complete successfully with your CLI?
Turns & Tokens
How efficiently can agents use your CLI? Fewer turns = better discoverability.
Task-level Breakdown
See exactly which tasks pass and fail. Pinpoint broken help text, bad error messages, or missing flags.
How it works
Define your task suite
Write YAML tasks that describe what agents should accomplish with your CLI. "Create a new project", "Deploy to staging", "List all running services."
Run in CI or locally
A single GitHub Actions workflow benchmarks every PR. Or run locally with `cli-bench` to iterate fast.
Get results per model
Pass rates, turn counts, and token usage across Claude, GPT, and Gemini. PR comments show deltas so you catch regressions before merge.
Track over time
See trends across releases. Know if your v2.3 help text changes helped or hurt agent success rates.
Tasks are plain YAML
Describe what an agent should accomplish in natural language. CLIWatch handles the model orchestration, sandboxing, and result validation.
```yaml
tasks:
  - name: list-services
    prompt: >
      List all running services
      using the mycli tool
    expected:
      - stdout_contains: "web-api"
      - exit_code: 0

  - name: create-project
    prompt: >
      Create a new project called
      "demo" with the typescript template
    expected:
      - file_exists: "demo/package.json"
      - stdout_contains: "created"
```
CLIWatch evals vs. manual testing
| | Manual | CLIWatch |
|---|---|---|
| Test runner | Bash scripts + vibes | Structured YAML tasks with validation |
| Models tested | Whatever you have access to | Claude, GPT, Gemini via AI Gateway |
| CI integration | Custom, fragile | One workflow, PR comments, check runs |
| Regression detection | "Did something break?" | Automatic deltas on every PR |
| Historical tracking | None | Per-model trends over time |
| Cost | Engineering time | Free tier, 5 min setup |
See CLIWatch in action
We evaluated 19 popular CLIs against real LLM agents on tasks derived from their own documentation. Imagine this for your CLI.
Terraform CLI
HashiCorp
Pass Rate
94%
Tasks
18
Source
Docs
Tasks generated from HashiCorp's official documentation, tested across multiple LLM models.
Want to know how well AI agents can use your CLI?
Track pass rates, compare models, and catch regressions across releases.
Test in CI, catch regressions in PRs
Add a single GitHub Actions workflow and CLIWatch evaluates your CLI on every pull request. The GitHub App posts a comment with pass rates, regressions, and a link to the full comparison dashboard.
- Auto PR comments with per-model pass rates and deltas
- GitHub Check Runs that block merge on regressions
- Configurable thresholds to gate CI on pass rate targets
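A threshold gate could be expressed in a small config file along these lines. This is an illustrative sketch only: the file name and keys (`thresholds`, `min_pass_rate`, `max_regression`) are assumptions, not documented CLIWatch options.

```yaml
# cliwatch.yml — hypothetical sketch; key names are assumptions, not documented options
thresholds:
  min_pass_rate: 0.90     # fail the check run if any model's pass rate drops below 90%
  max_regression: 0.05    # fail if pass rate falls more than 5 points vs. the base branch
```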
```yaml
name: CLIWatch Evals
on:
  pull_request:
  push:
    branches: [main]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm i -g @cliwatch/cli-bench
      - run: cli-bench
        env:
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
```
CLIWatch Eval Results
| Model | Pass Rate | Delta | Turns | Tokens | Status |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 95% (19/20) | +5% | 1.4 | 8.2k | ✅ |
| GPT-5.2 | 90% (18/20) | +5% | 1.6 | 10.8k | ✅ |
| Gemini 3 Flash | 85% (17/20) | -5% | 1.7 | 6.2k | ✅ |
1 regression · 1 improved · 1 unchanged
View full comparison
