CLI Evals
Evals for the agent interface layer
Your CLI is being used by AI agents. Are you testing that experience? CLIWatch runs structured benchmarks across models and gives you pass rates, regressions, and task-level results — in CI.
What the evals measure
Pass Rate
What percentage of tasks does each model complete successfully with your CLI?
Turns & Tokens
How efficiently can agents use your CLI? Fewer turns = better discoverability.
Task-level Breakdown
See exactly which tasks pass and fail. Pinpoint broken help text, bad error messages, or missing flags.
How it works
Define your task suite
Write YAML tasks that describe what agents should accomplish with your CLI. "Create a new project", "Deploy to staging", "List all running services."
Run in CI or locally
A single GitHub Actions workflow benchmarks every PR. Or run locally with `cli-bench` to iterate fast.
Get results per model
Pass rates, turn counts, and token usage across Claude, GPT, and Gemini. PR comments show deltas so you catch regressions before merge.
Track over time
See trends across releases. Know if your v2.3 help text changes helped or hurt agent success rates.
Tasks are plain YAML
Describe what an agent should accomplish in natural language. CLIWatch handles the model orchestration, sandboxing, and result validation.
```yaml
tasks:
  - name: list-services
    prompt: >
      List all running services
      using the mycli tool
    expected:
      - stdout_contains: "web-api"
      - exit_code: 0

  - name: create-project
    prompt: >
      Create a new project called
      "demo" with the typescript template
    expected:
      - file_exists: "demo/package.json"
      - stdout_contains: "created"
```
CLIWatch evals vs. manual testing
| | Manual | CLIWatch |
|---|---|---|
| Test runner | Bash scripts + vibes | Structured YAML tasks with validation |
| Models tested | Whatever you have access to | Claude, GPT, Gemini via AI Gateway |
| CI integration | Custom, fragile | One workflow, PR comments, check runs |
| Regression detection | "Did something break?" | Automatic deltas on every PR |
| Historical tracking | None | Per-model trends over time |
| Cost | Engineering time | Free tier, 5 min setup |
See CLIWatch in action
We evaluated 19 popular CLIs against real LLM agents on tasks derived from their own documentation. Imagine this for your CLI.
Terraform CLI
HashiCorp
Pass Rate
94%
Tasks
18
Source
Docs
Tasks generated from HashiCorp's official documentation, tested across multiple LLM models.
Want to know how well AI agents can use your CLI?
Track pass rates, compare models, and catch regressions across releases.
Test in CI, catch regressions in PRs
Add a single GitHub Actions workflow and CLIWatch evaluates your CLI on every pull request. The GitHub App posts a comment with pass rates, regressions, and a link to the full comparison dashboard.
- Auto PR comments with per-model pass rates and deltas
- GitHub Check Runs that block merge on regressions
- Configurable thresholds to gate CI on pass rate targets
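A threshold gate could be expressed in a small config file along these lines. This is an illustrative sketch only: the file name and keys (`thresholds`, `min_pass_rate`, `max_regression`) are assumptions, not documented CLIWatch options.

```yaml
# cliwatch.yml — hypothetical sketch; key names are assumptions, not documented options
thresholds:
  min_pass_rate: 0.90     # fail the check run if any model's pass rate drops below 90%
  max_regression: 0.05    # fail if pass rate falls more than 5 points vs. the base branch
```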
```yaml
name: CLIWatch Evals
on:
  pull_request:
  push:
    branches: [main]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm i -g @cliwatch/cli-bench
      - run: cli-bench
        env:
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
```
CLIWatch Eval Results
| Model | Pass Rate | Delta | Turns | Tokens | Status |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 95% (19/20) | +5% | 1.4 | 8.2k | ✅ |
| GPT-5.2 | 90% (18/20) | +5% | 1.6 | 10.8k | ✅ |
| Gemini 3 Flash | 85% (17/20) | -5% | 1.7 | 6.2k | ✅ |
1 regression · 1 improved · 1 unchanged
View full comparison
