# agent checks repo status and history
$ git status --porcelain
M src/index.ts
?? new-file.ts
$ git log --oneline -3
a1b2c3d fix: resolve auth timeout
Can AI agents use Git?
Git is the ubiquitous version control system. Agents use it to commit changes, manage branches, resolve conflicts, and navigate repository history.
Git eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 100% | 4.3 | 6.6k |
Git task results by model
| Task | Difficulty | Intent | gpt-5-nano |
|---|---|---|---|
| init-and-commit | easy | Create a file called hello.txt with the content 'Hello World', stage it, and commit with the message 'initial commit' | ✓ (3 turns) |
| create-branch-and-merge | medium | Create a file README.md with content 'v1' and commit it. Then create a branch called 'feature', add a file feature.txt with content 'new feature', and commit it. Switch back to main and merge the feature branch. | ✓ (9 turns) |
| log-formatting | easy | Show the git log with a custom format showing only the short hash and commit message, one line per commit. | ✓ (1 turn) |
| stash-and-apply | medium | There are uncommitted changes to dirty.txt. Stash them, verify the working directory is clean, then pop the stash to restore the changes. | ✓ (4 turns) |
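As a quick sanity check, the summary row can be recomputed from the per-task results. This is only a sketch: it assumes the number shown next to each ✓ is the count of agent turns, and the dictionary layout is invented for illustration rather than taken from the eval tool.

```python
from statistics import mean

# Per-task results for gpt-5-nano, read off the task table above.
# Assumption: the number beside each check mark is the number of agent turns.
results = {
    "init-and-commit": {"passed": True, "turns": 3},
    "create-branch-and-merge": {"passed": True, "turns": 9},
    "log-formatting": {"passed": True, "turns": 1},
    "stash-and-apply": {"passed": True, "turns": 4},
}

# Fraction of tasks that passed, and the mean number of turns per task.
pass_rate = sum(r["passed"] for r in results.values()) / len(results)
avg_turns = mean(r["turns"] for r in results.values())

print(f"pass rate: {pass_rate:.0%}")
print(f"avg turns: {avg_turns}")
```

Under that reading the mean is 4.25 turns, which is consistent with the 4.3 shown in the summary table.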
Task suite source (76 lines, YAML)
- id: init-and-commit
  intent: Create a file called hello.txt with the content 'Hello World', stage it,
    and commit with the message 'initial commit'
  assert:
    - file_exists: hello.txt
    - file_contains:
        path: hello.txt
        text: Hello World
    - verify:
        run: git log --oneline
        output_contains: initial commit
  setup:
    - git init -b main
    - git config user.email 'test@test.com'
    - git config user.name 'Test'
  max_turns: 5
  difficulty: easy
  category: basics
- id: create-branch-and-merge
  intent: Create a file README.md with content 'v1' and commit it. Then create a
    branch called 'feature', add a file feature.txt with content 'new feature',
    and commit it. Switch back to main and merge the feature branch.
  assert:
    - file_exists: feature.txt
    - file_contains:
        path: feature.txt
        text: new feature
    - verify:
        run: git log --oneline
        output_contains: feature
    - verify:
        run: git branch
        output_contains: main
  setup:
    - git init -b main
    - git config user.email 'test@test.com'
    - git config user.name 'Test'
  max_turns: 10
  difficulty: medium
  category: branching
- id: log-formatting
  intent: Show the git log with a custom format showing only the short hash and
    commit message, one line per commit.
  assert:
    - ran: git log.*--pretty|git log.*--format|git log.*--oneline
    - output_contains: add a
    - output_contains: add c
  setup:
    - git init -b main
    - git config user.email 'test@test.com'
    - git config user.name 'Test'
    - echo 'first' > a.txt && git add a.txt && git commit -m 'add a'
    - echo 'second' > b.txt && git add b.txt && git commit -m 'add b'
    - echo 'third' > c.txt && git add c.txt && git commit -m 'add c'
  max_turns: 4
  difficulty: easy
  category: query
- id: stash-and-apply
  intent: There are uncommitted changes to dirty.txt. Stash them, verify the
    working directory is clean, then pop the stash to restore the changes.
  assert:
    - ran: git stash
    - ran: git stash pop|git stash apply
    - file_contains:
        path: dirty.txt
        text: modified content
  setup:
    - git init -b main
    - git config user.email 'test@test.com'
    - git config user.name 'Test'
    - echo 'original' > dirty.txt && git add dirty.txt && git commit -m 'init'
    - echo 'modified content' > dirty.txt
  max_turns: 5
  difficulty: medium
  category: workflow
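The `ran` and `output_contains` assertions in the suite above could plausibly be evaluated along the following lines. This is a sketch under stated assumptions, not the eval tool's actual implementation: it assumes `ran` is an unanchored regular expression searched against each command the agent executed, and `output_contains` is a plain substring test. The function names and the sample transcript (including the commit hashes) are hypothetical.

```python
import re

def check_ran(pattern: str, commands: list[str]) -> bool:
    """True if any executed command matches the pattern (unanchored regex search)."""
    return any(re.search(pattern, cmd) for cmd in commands)

def check_output_contains(text: str, output: str) -> bool:
    """True if the expected text appears anywhere in the captured output."""
    return text in output

# Pattern copied from the log-formatting task's `ran` assertion.
LOG_PATTERN = r"git log.*--pretty|git log.*--format|git log.*--oneline"

# Hypothetical agent transcript for log-formatting; the hashes are made up.
commands = ["git log --pretty=format:'%h %s'"]
output = "c3c3c3c add c\nb2b2b2b add b\na1a1a1a add a"

passed = (
    check_ran(LOG_PATTERN, commands)
    and check_output_contains("add a", output)
    and check_output_contains("add c", output)
)
print("log-formatting passed:", passed)
```

One consequence of unanchored matching is worth noting: under this assumed semantics, `check_ran("git stash", ["git stash pop"])` is also true, so the first `ran` assertion in stash-and-apply would not by itself prove that a separate stash step happened. A runner that anchors its patterns would behave differently.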
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for Git — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
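To make the idea of a regression concrete, the sample table above can be reduced to the models whose pass rate dropped. The sketch below assumes the Delta column is a change in percentage points relative to the previous run; the `regressions` helper and the output format are hypothetical and are not the format of the actual PR comment.

```python
# Sample figures copied from the table above: (current pass rate %, delta in points).
sample = {
    "Sonnet 4.5": (95, +5),
    "GPT-4.1": (80, -5),
    "Haiku 4.5": (65, -10),
}

def regressions(rows):
    """Return (model, previous, current, delta) for models that dropped, worst first."""
    dropped = [
        (model, current - delta, current, delta)
        for model, (current, delta) in rows.items()
        if delta < 0
    ]
    return sorted(dropped, key=lambda row: row[3])

for model, previous, current, delta in regressions(sample):
    print(f"{model}: {previous}% -> {current}% ({delta} points)")
```

With the sample figures, this flags Haiku 4.5 (75% to 65%) ahead of GPT-4.1 (85% to 80%), while Sonnet 4.5 is not listed because its rate improved.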
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
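A gate of this kind amounts to comparing each model's pass rate with its threshold and returning a non-zero exit code on any shortfall. The sketch below illustrates that logic only. The thresholds are the ones in the sample config above; the pass rates reuse the sample table's figures, mapped onto the config's model ids by assumption. The function names, and the choice to treat a model with no result as failing, are also assumptions rather than documented behavior.

```python
# Thresholds from the sample config (percent).
thresholds = {"claude-sonnet-4-5": 80, "gpt-4.1": 75, "claude-haiku-4-5": 60}

# Assumed mapping of the sample table's rows onto the config's model ids (percent).
pass_rates = {"claude-sonnet-4-5": 95, "gpt-4.1": 80, "claude-haiku-4-5": 65}

def failing_models(pass_rates, thresholds):
    """Models whose pass rate is below threshold; a missing result counts as 0%."""
    return sorted(
        model
        for model, minimum in thresholds.items()
        if pass_rates.get(model, 0) < minimum
    )

def gate(pass_rates, thresholds):
    """Return a process exit code: 0 if every model meets its threshold, else 1."""
    failures = failing_models(pass_rates, thresholds)
    for model in failures:
        print(f"FAIL {model}: {pass_rates.get(model, 0)}% < {thresholds[model]}%")
    return 1 if failures else 0

exit_code = gate(pass_rates, thresholds)
print("exit code:", exit_code)
```

In a CI job, the returned code would typically be passed to `sys.exit` so that a shortfall fails the build.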
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.