# agent checks repo status and history
$ git status --porcelain
  M  src/index.ts
  ?? new-file.ts
 
$ git log --oneline -3
  a1b2c3d fix: resolve auth timeout

Can AI agents use Git?

Git is the ubiquitous version control system. Agents use it to commit changes, manage branches, resolve conflicts, and navigate repository history.

100% overall pass rate · 1 model tested · 4 tasks · v2.53.0 · 3/6/2026

Git eval results by model

| Model | Pass rate | Avg turns | Avg tokens |
| --- | --- | --- | --- |
| gpt-5-nano | 100% | 4.3 | 6.6k |

Git task results by model

| Task | Difficulty | Description | gpt-5-nano |
| --- | --- | --- | --- |
| init-and-commit | easy | Create a file called hello.txt with the content 'Hello World', stage it, and commit with the message 'initial commit' | 3t |
| create-branch-and-merge | medium | Create a file README.md with content 'v1' and commit it. Then create a branch called 'feature', add a file feature.txt with content 'new feature', and commit it. Switch back to main and merge the feature branch. | 9t |
| log-formatting | easy | Show the git log with a custom format showing only the short hash and commit message, one line per commit. | 1t |
| stash-and-apply | medium | There are uncommitted changes to dirty.txt. Stash them, verify the working directory is clean, then pop the stash to restore the changes. | 4t |
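
As a quick sanity check on the numbers above, the sketch below recomputes the average turn count for gpt-5-nano and compares each task against its `max_turns` budget from the task suite. It assumes that the "t" suffix (3t, 9t, 1t, 4t) denotes turns; the page does not state this.

```python
# Per-task turn counts for gpt-5-nano, read from the results above
# (assumption: the "t" suffix denotes turns).
turns = {
    "init-and-commit": 3,
    "create-branch-and-merge": 9,
    "log-formatting": 1,
    "stash-and-apply": 4,
}

# max_turns values as listed for each task in the task suite source.
max_turns = {
    "init-and-commit": 5,
    "create-branch-and-merge": 10,
    "log-formatting": 4,
    "stash-and-apply": 5,
}

mean_turns = sum(turns.values()) / len(turns)
within_budget = all(turns[task] <= max_turns[task] for task in turns)

print(mean_turns)     # 4.25, which the summary appears to display as 4.3
print(within_budget)  # True
```

Under that assumption, every task finished inside its turn budget, with create-branch-and-merge coming closest to the limit (9 of 10).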
Task suite source · 76 lines · YAML
- id: init-and-commit
  intent: Create a file called hello.txt with the content 'Hello World', stage it,
    and commit with the message 'initial commit'
  assert:
    - file_exists: hello.txt
    - file_contains:
        path: hello.txt
        text: Hello World
    - verify:
        run: git log --oneline
        output_contains: initial commit
  setup:
    - git init -b main
    - git config user.email 'test@test.com'
    - git config user.name 'Test'
  max_turns: 5
  difficulty: easy
  category: basics
- id: create-branch-and-merge
  intent: Create a file README.md with content 'v1' and commit it. Then create a
    branch called 'feature', add a file feature.txt with content 'new feature',
    and commit it. Switch back to main and merge the feature branch.
  assert:
    - file_exists: feature.txt
    - file_contains:
        path: feature.txt
        text: new feature
    - verify:
        run: git log --oneline
        output_contains: feature
    - verify:
        run: git branch
        output_contains: main
  setup:
    - git init -b main
    - git config user.email 'test@test.com'
    - git config user.name 'Test'
  max_turns: 10
  difficulty: medium
  category: branching
- id: log-formatting
  intent: Show the git log with a custom format showing only the short hash and
    commit message, one line per commit.
  assert:
    - ran: git log.*--pretty|git log.*--format|git log.*--oneline
    - output_contains: add a
    - output_contains: add c
  setup:
    - git init -b main
    - git config user.email 'test@test.com'
    - git config user.name 'Test'
    - echo 'first' > a.txt && git add a.txt && git commit -m 'add a'
    - echo 'second' > b.txt && git add b.txt && git commit -m 'add b'
    - echo 'third' > c.txt && git add c.txt && git commit -m 'add c'
  max_turns: 4
  difficulty: easy
  category: query
- id: stash-and-apply
  intent: There are uncommitted changes to dirty.txt. Stash them, verify the
    working directory is clean, then pop the stash to restore the changes.
  assert:
    - ran: git stash
    - ran: git stash pop|git stash apply
    - file_contains:
        path: dirty.txt
        text: modified content
  setup:
    - git init -b main
    - git config user.email 'test@test.com'
    - git config user.name 'Test'
    - echo 'original' > dirty.txt && git add dirty.txt && git commit -m 'init'
    - echo 'modified content' > dirty.txt
  max_turns: 5
  difficulty: medium
  category: workflow
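
To make the assertion keys above concrete, here is a minimal sketch of how a harness might evaluate them. It is not the cli-bench implementation: the function names `check_assertion` and `task_passes` are hypothetical, and the semantics (substring matching for `file_contains` and `output_contains`, regular-expression search for `ran`) are inferred from the key names and values in the suite. Each assertion is assumed to arrive as a dictionary, as a YAML parser would produce.

```python
import re
import subprocess
from pathlib import Path


def check_assertion(assertion, workdir, commands_run, agent_output):
    """Return True if one assertion from the task suite holds.

    Hypothetical evaluator: semantics are inferred from the key names,
    not taken from the cli-bench source.
    """
    workdir = Path(workdir)
    if "file_exists" in assertion:
        return (workdir / assertion["file_exists"]).is_file()
    if "file_contains" in assertion:
        spec = assertion["file_contains"]
        path = workdir / spec["path"]
        return path.is_file() and spec["text"] in path.read_text()
    if "verify" in assertion:
        spec = assertion["verify"]
        # Run the verification command inside the task's working directory.
        result = subprocess.run(
            spec["run"], shell=True, cwd=workdir,
            capture_output=True, text=True,
        )
        return spec["output_contains"] in result.stdout
    if "ran" in assertion:
        # Assumption: the value is a regex searched in each command the agent ran.
        pattern = re.compile(assertion["ran"])
        return any(pattern.search(cmd) for cmd in commands_run)
    if "output_contains" in assertion:
        return assertion["output_contains"] in agent_output
    raise ValueError(f"unknown assertion: {assertion!r}")


def task_passes(assertions, workdir, commands_run=(), agent_output=""):
    """A task passes only if every one of its assertions holds."""
    return all(
        check_assertion(a, workdir, commands_run, agent_output)
        for a in assertions
    )
```

One consequence of reading `ran` as a regex search: the pattern `git stash` in stash-and-apply would also match `git stash pop`, so under this interpretation the first `ran` assertion alone does not prove that a separate stash step happened.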

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for Git; see the latest run. Set up the same for your CLI in minutes.

| Model | Pass Rate | Delta |
| --- | --- | --- |
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.

[Chart: pass rate over the last 30 days, releases v1.0 to v1.6]

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.
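
The gating logic can be sketched in a few lines. This is an illustration under stated assumptions, not CLIWatch's implementation: `parse_percent` and `failing_models` are hypothetical helpers, the pass rates are made-up values for demonstration, and a model with a threshold but no reported result is assumed to fail the gate.

```python
def parse_percent(value):
    """Convert a string such as '80%' to the float 80.0."""
    return float(value.strip().rstrip("%"))


def failing_models(thresholds, pass_rates):
    """Return the models whose pass rate is below their threshold.

    Assumption: a model with a threshold but no reported pass rate fails.
    """
    failures = []
    for model, minimum in thresholds.items():
        rate = pass_rates.get(model)
        if rate is None or parse_percent(rate) < parse_percent(minimum):
            failures.append(model)
    return failures


# Thresholds as shown in the configuration above.
thresholds = {
    "claude-sonnet-4-5": "80%",
    "gpt-4.1": "75%",
    "claude-haiku-4-5": "60%",
}

# Hypothetical results for illustration only; not measured data.
pass_rates = {
    "claude-sonnet-4-5": "95%",
    "gpt-4.1": "70%",
    "claude-haiku-4-5": "65%",
}

failed = failing_models(thresholds, pass_rates)
print(failed)  # ['gpt-4.1']
```

A CI step built on this idea would exit with a non-zero status whenever the returned list is non-empty, which is what causes the merge to be blocked.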

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds, all from a single GitHub Actions workflow.
