It’s 2026. Your software has a new type of user. Agents.
Millions of developers let AI agents choose and run their tools. We test how well agents can use yours, with real evals, real models, and results you can act on.
“Run npm install -g @cliwatch/cli, then run cliwatch skills to read the setup docs.”
Paste this into any AI coding assistant
See CLIWatch in action
We evaluated 19 popular CLIs against real LLM agents on tasks derived from their own documentation. Imagine this for your CLI.
Terraform CLI
HashiCorp
Pass Rate
94%
Tasks
18
Source
Docs
Tasks generated from HashiCorp's official documentation, tested across multiple models.
Want to know how well AI agents can use your CLI?
Track pass rates, compare models, and catch regressions across releases.
Track agent-readiness over time
See pass rates, model comparisons, and regressions at a glance. Run evals in CI and track your CLI's agent compatibility across releases.
Task suites
Write tasks yourself, or let CLIWatch generate them automatically from your docs. See pass/fail results per task, and click any task to inspect the full agent transcript.
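As an illustration, a task suite could look something like the sketch below. The field names and check types here are hypothetical, chosen to show the shape of intent-based tasks, not CLIWatch's actual schema:

```yaml
# Hypothetical task-suite format -- illustrative only,
# not CLIWatch's actual schema.
suite: docker-cli
tasks:
  - id: build-tagged-image
    intent: "Build a Docker image tagged bench:latest from the current directory"
    checks:
      - image_exists: "bench:latest"
  - id: list-containers
    intent: "List all running containers in JSON format"
    checks:
      - exit_code: 0
```

Each task pairs a natural-language intent with a verifiable outcome, so the agent is free to choose the commands while the checks decide pass or fail.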
Test in CI, catch regressions in PRs
Add a single GitHub Actions workflow and CLIWatch evaluates your CLI on every pull request. The GitHub App posts a comment with pass rates, regressions, and a link to the full comparison dashboard.
- Auto PR comments with per-model pass rates and deltas
- GitHub Check Runs that block merge on regressions
- Configurable thresholds to gate CI on pass rate targets
```yaml
name: CLIWatch Evals
on:
  pull_request:
  push:
    branches: [main]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm i -g @cliwatch/cli-bench
      - run: cli-bench
        env:
          CLIWATCH_API_KEY: ${{ secrets.CLIWATCH_API_KEY }}
          AI_GATEWAY_API_KEY: ${{ secrets.AI_GATEWAY_API_KEY }}
```

CLIWatch Eval Results
| Model | Pass Rate | Delta | Turns | Tokens | Status |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 95% (19/20) | +5% | 1.4 | 8.2k | ✅ |
| GPT-5.2 | 90% (18/20) | +5% | 1.6 | 10.8k | ✅ |
| Gemini 3 Flash | 85% (17/20) | -5% | 1.7 | 6.2k | ✅ |
1 regression · 2 improved
View full comparison
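The regression logic behind a comment like this fits in a few lines. The sketch below is illustrative, not CLIWatch's actual gating code: it compares each model's pass rate against a baseline run and a minimum threshold, using the numbers from the sample table above.

```python
# Illustrative regression gate -- a sketch of the idea, not
# CLIWatch's actual implementation. `results` and `baseline`
# map model name -> (passed, total).

def gate(results, baseline, threshold=0.80):
    """Return (ok, regressions).

    A model regresses if its pass rate drops versus baseline;
    the gate fails if any pass rate falls below `threshold`.
    """
    regressions = []
    ok = True
    for model, (passed, total) in results.items():
        rate = passed / total
        base_passed, base_total = baseline.get(model, (passed, total))
        if rate < base_passed / base_total:
            regressions.append(model)
        if rate < threshold:
            ok = False
    return ok, regressions

# With the sample numbers above: Gemini 3 Flash drops from 90%
# to 85% and is flagged as a regression, but every model stays
# above the 80% threshold, so the gate still passes.
ok, regs = gate(
    {"Claude Sonnet 4.6": (19, 20), "GPT-5.2": (18, 20), "Gemini 3 Flash": (17, 20)},
    {"Claude Sonnet 4.6": (18, 20), "GPT-5.2": (17, 20), "Gemini 3 Flash": (18, 20)},
)
print(ok, regs)  # True ['Gemini 3 Flash']
```

In practice the threshold would be the configurable CI gate, while the regression list feeds the PR comment and check run.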
AI agents are running your CLI
Developers ask Claude Code to “deploy with vercel” and Cursor to “check kubectl pods.” Your CLI is being used by AI agents, whether you’ve optimized for it or not.
Test with intents, not scripts
Each task is a natural language intent: “build a Docker image tagged bench:latest.” Derive them from your docs, or write your own. The agent figures out the commands.
Agent-ready CLIs win adoption
CLIs that work well with AI get recommended by coding assistants and adopted by teams. Those that don’t get replaced by alternatives that do.
Real data, open methodology
Each CLI has its own task suite, tested across the models you choose. Task suites are open source. Results are reproducible.
