# agent creates a PR
$ gh pr create --title "Fix auth bug" --body "..."
https://github.com/org/repo/pull/42
$ gh pr view 42 --json state,statusCheckRollup
{"state": "OPEN", "statusCheckRollup": [...]}
Can AI agents use gh?
GitHub's official CLI. Agents use it to create PRs, manage issues, trigger workflows, and query repository data.
See the latest run →
gh eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 75% | 3.7 | 5.5k |
gh task results by model
| Task | Difficulty | Description | gpt-5-nano |
|---|---|---|---|
| list-repos | easy | List the public repositories for the 'cli' GitHub organization, showing just the name and description. Limit to 5 results. | ✓ 3t |
| view-repo-details | easy | Show details about the 'cli/cli' repository including its description, star count, and primary language. | ✗ 1t |
| search-issues | medium | Search for open issues in the 'vercel/next.js' repository that contain the word 'build' in the title. Show the top 3 results with their number and title. | ✓ 4t |
| api-query | medium | Use the gh api command to get the latest release tag name for the 'docker/cli' repository. | ✓ 4t |
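
For orientation, here is one gh invocation per task that would plausibly satisfy the intent. This is a sketch, not a record of what any model ran: the `CANDIDATES` mapping and the `run` helper are hypothetical names, and actually executing the commands assumes gh is installed and authenticated.

```python
import shlex
import subprocess

# Hypothetical mapping: task id -> one gh invocation that fits the task intent.
CANDIDATES = {
    "list-repos": [
        "gh", "repo", "list", "cli", "--visibility", "public",
        "--limit", "5", "--json", "name,description",
    ],
    "view-repo-details": [
        "gh", "repo", "view", "cli/cli",
        "--json", "description,stargazerCount,primaryLanguage",
    ],
    "search-issues": [
        "gh", "search", "issues", "build", "--repo", "vercel/next.js",
        "--state", "open", "--match", "title", "--limit", "3",
        "--json", "number,title",
    ],
    "api-query": [
        "gh", "api", "repos/docker/cli/releases/latest", "--jq", ".tag_name",
    ],
}


def run(argv):
    """Run one command and return (exit_code, stdout). Needs gh on PATH."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    return proc.returncode, proc.stdout


# Print the commands instead of running them, so the sketch has no side effects.
for task_id, argv in CANDIDATES.items():
    print(f"{task_id}: {shlex.join(argv)}")
```

Passing the arguments as a list avoids shell quoting problems; call `run(CANDIDATES["api-query"])` to execute one of them for real.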
Task suite source · 42 lines · YAML
- id: list-repos
  intent: List the public repositories for the 'cli' GitHub organization, showing
    just the name and description. Limit to 5 results.
  assert:
    - ran: gh.*repo.*list|gh.*api
    - exit_code: 0
  setup: []
  max_turns: 3
  difficulty: easy
  category: query
- id: view-repo-details
  intent: Show details about the 'cli/cli' repository including its description,
    star count, and primary language.
  assert:
    - ran: gh.*repo.*view|gh.*api
    - output_contains: cli
  setup: []
  max_turns: 3
  difficulty: easy
  category: query
- id: search-issues
  intent: Search for open issues in the 'vercel/next.js' repository that contain
    the word 'build' in the title. Show the top 3 results with their number and
    title.
  assert:
    - ran: gh.*search.*issues|gh.*issue.*list|gh.*api
    - exit_code: 0
  setup: []
  max_turns: 4
  difficulty: medium
  category: search
- id: api-query
  intent: Use the gh api command to get the latest release tag name for the
    'docker/cli' repository.
  assert:
    - ran: gh api
    - exit_code: 0
  setup: []
  max_turns: 4
  difficulty: medium
  category: api
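
The three assertion kinds in the suite (`ran`, `exit_code`, `output_contains`) could be evaluated roughly as follows. This is a minimal sketch under assumptions: the `Transcript` shape, the function names, and the exact matching semantics (regex search for `ran`, case-sensitive substring for `output_contains`) are guesses for illustration, not the actual behaviour of @cliwatch/cli-bench.

```python
import re
from dataclasses import dataclass


@dataclass
class Transcript:
    """What one agent attempt produced (hypothetical shape)."""
    commands: list   # every shell command the agent ran, as strings
    exit_code: int   # exit code of the last command
    output: str      # captured output of the last command


def check(assertion, transcript):
    """Evaluate a single-key assertion such as {"ran": "gh api"}."""
    kind, expected = next(iter(assertion.items()))
    if kind == "ran":
        # Assumed semantics: the pattern matches at least one command.
        return any(re.search(expected, cmd) for cmd in transcript.commands)
    if kind == "exit_code":
        return transcript.exit_code == expected
    if kind == "output_contains":
        return expected in transcript.output
    raise ValueError(f"unknown assertion kind: {kind}")


def task_passed(assertions, transcript):
    """A task passes only if every assertion holds."""
    return all(check(a, transcript) for a in assertions)


def pass_rate(results):
    """Percentage of passed tasks."""
    return 100.0 * sum(results) / len(results)


# Assertions copied from the view-repo-details task.
view_repo_details = [{"ran": r"gh.*repo.*view|gh.*api"}, {"output_contains": "cli"}]
attempt = Transcript(["gh repo view cli/cli"], 0, "name: cli/cli")
print(task_passed(view_repo_details, attempt))
print(pass_rate([True, False, True, True]))
```

Three passes out of four tasks gives the 75% pass rate reported for gpt-5-nano.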
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for gh — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%
Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
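
A quality gate of this kind reduces to comparing each model's pass rate against its threshold and returning a non-zero exit code on any shortfall. The sketch below assumes the sample pass rates from the table above belong to the model ids used in the thresholds config; the function names are hypothetical and this is not CLIWatch's implementation.

```python
# Thresholds from the config above, as percentages.
THRESHOLDS = {"claude-sonnet-4-5": 80, "gpt-4.1": 75, "claude-haiku-4-5": 60}

# Sample pass rates from the table above (model id mapping is an assumption).
PASS_RATES = {"claude-sonnet-4-5": 95, "gpt-4.1": 80, "claude-haiku-4-5": 65}


def failing_models(pass_rates, thresholds):
    """Return the models below their threshold; a missing result counts as 0."""
    return sorted(
        model
        for model, minimum in thresholds.items()
        if pass_rates.get(model, 0) < minimum
    )


def gate(pass_rates, thresholds):
    """Return a process exit code: 0 if every model meets its threshold, else 1."""
    failures = failing_models(pass_rates, thresholds)
    for model in failures:
        print(f"FAIL {model}: {pass_rates.get(model, 0)}% < {thresholds[model]}%")
    return 1 if failures else 0


exit_code = gate(PASS_RATES, THRESHOLDS)
print("exit code:", exit_code)
```

In a CI step the returned value would be passed to `sys.exit`, so a regression below any threshold fails the job.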
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.