CLI Evals

Evals for the agent interface layer

Your CLI is being used by AI agents. Are you testing that experience? CLIWatch runs structured benchmarks across models and gives you pass rates, regressions, and task-level results — in CI.

What the evals measure

Pass Rate

What percentage of tasks does each model complete successfully with your CLI?

Turns & Tokens

How efficiently can agents use your CLI? Fewer turns = better discoverability.

Task-level Breakdown

See exactly which tasks pass and fail. Pinpoint broken help text, bad error messages, or missing flags.

How it works

01

Define your task suite

Write YAML tasks that describe what agents should accomplish with your CLI. "Create a new project", "Deploy to staging", "List all running services."

02

Run in CI or locally

A single GitHub Actions workflow benchmarks every PR. Or run locally with `cli-bench` to iterate fast.
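For local runs, the same two commands the CI workflow uses are enough. A sketch, with placeholder credentials you supply from your own accounts:

```shell
# Install the benchmark runner globally (same package the CI workflow installs)
npm i -g @cliwatch/cli-bench

# Provide credentials, then run the suite from the repo root
export CLIWATCH_API_KEY=...        # from your CLIWatch account
export AI_GATEWAY_API_KEY=...      # from your AI Gateway account
cli-bench
```

The runner picks up `cli-bench.yaml` from the current directory, so you can edit tasks and re-run without touching CI.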

03

Get results per model

Pass rates, turn counts, and token usage across Claude, GPT, and Gemini. PR comments show deltas so you catch regressions before merge.

04

Track over time

See trends across releases. Know if your v2.3 help text changes helped or hurt agent success rates.

Tasks are plain YAML

Describe what an agent should accomplish in natural language. CLIWatch handles the model orchestration, sandboxing, and result validation.

# cli-bench.yaml
tasks:
  - name: list-services
    prompt: >
      List all running services
      using the mycli tool
    expected:
      - stdout_contains: "web-api"
      - exit_code: 0

  - name: create-project
    prompt: >
      Create a new project called
      "demo" with the typescript template
    expected:
      - file_exists: "demo/package.json"
      - stdout_contains: "created"

CLIWatch evals vs. manual testing

|                      | Manual                       | CLIWatch                              |
|----------------------|------------------------------|---------------------------------------|
| Test runner          | Bash scripts + vibes         | Structured YAML tasks with validation |
| Models tested        | Whatever you have access to  | Claude, GPT, Gemini via AI Gateway    |
| CI integration       | Custom, fragile              | One workflow, PR comments, check runs |
| Regression detection | "Did something break?"       | Automatic deltas on every PR          |
| Historical tracking  | None                         | Per-model trends over time            |
| Cost                 | Engineering time             | Free tier, 5 min setup                |

See CLIWatch in action

We evaluated 19 popular CLIs against real LLM agents on tasks derived from their own documentation. Imagine this for your CLI.

Terraform CLI · HashiCorp

Pass Rate: 94% · Tasks: 18 · Source: Docs

Tasks generated from HashiCorp's official documentation, tested across multiple LLM models.

View the full benchmark results · Open dashboard →

Want to know how well AI agents can use your CLI?

Track pass rates, compare models, and catch regressions across releases.

Test in CI, catch regressions in PRs

Add a single GitHub Actions workflow and CLIWatch evaluates your CLI on every pull request. The GitHub App posts a comment with pass rates, regressions, and a link to the full comparison dashboard.

  • Auto PR comments with per-model pass rates and deltas
  • GitHub Check Runs that block merge on regressions
  • Configurable thresholds to gate CI on pass rate targets
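A threshold gate might look like the following in `cli-bench.yaml`. The `thresholds` keys here are illustrative assumptions, not confirmed configuration names; check the CLIWatch docs for the actual schema:

```yaml
# cli-bench.yaml (hypothetical keys; verify against the CLIWatch docs)
thresholds:
  min_pass_rate: 0.90     # fail the check run if pass rate drops below 90%
  max_regression: 0.05    # block merge if any model regresses more than 5 points
```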
# .github/workflows/cliwatch.yml
name: CLIWatch Evals
on:
  pull_request:
  push:
    branches: [main]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm i -g @cliwatch/cli-bench
      - run: cli-bench
        env:
          CLIWATCH_API_KEY:    ${{ secrets.CLIWATCH_API_KEY }}
          AI_GATEWAY_API_KEY:  ${{ secrets.AI_GATEWAY_API_KEY }}
cliwatchbot commented 2 hours ago · edited

CLIWatch Eval Results

| Model             | Pass Rate   | Delta | Turns | Tokens | Status |
|-------------------|-------------|-------|-------|--------|--------|
| Claude Sonnet 4.6 | 95% (19/20) | +5%   | 1.4   | 8.2k   |        |
| GPT-5.2           | 90% (18/20) | +5%   | 1.6   | 10.8k  |        |
| Gemini 3 Flash    | 85% (17/20) | -5%   | 1.7   | 6.2k   |        |

1 regression · 1 improved · 1 unchanged

View full comparison

Your first eval in 5 minutes

Free for individuals. No credit card. Real results.

Start Benchmarking