
Opus 4.7 vs 4.6 on our own CLI: what the traces actually say

Same 12 tasks, same CLI, 100% pass on both. The headline is a 36% drop in turns. The traces tell a different story: the two models made different specific mistakes, but revealed the same underlying CLI gaps.

We dogfood CLIWatch on our own CLI. Anthropic shipped Opus 4.7 this week so we ran the 12-task suite on 4.7 and compared it to 4.6 on the same commit. Public runs #13 (4.6) and #14 (4.7).

Metric                        Opus 4.6   Opus 4.7   Delta
Pass rate                     12/12      12/12      tie
Avg turns / task              4.33       2.75       -36%
Avg tokens / task             7,879      7,957      +1%
Failed commands (exit != 0)   4          6          +2

Different hallucinations, same underlying gaps

Only one hallucination was identical across both runs: cliwatch auth status. Beyond that, the models guessed different ways to do the same things, and both guesses were wrong.

Same underlying gap, different guess per model
# Filtering 'cliwatch runs' output
4.6  cliwatch runs list --limit 1        # subcommand guess
4.7  cliwatch runs --limit 5              # flag guess

# Telling 'validate' what file to read
4.6  cliwatch validate cli-bench.yaml     # positional guess
4.7  cliwatch validate --config FILE      # --config guess

# Identifying a CLI in run drill-down
4.7  cliwatch runs 13 --cli "CLIWatch CLI"  # display name, needs slug

None of these are wrong by any outside standard. Pagination flags, positional file args, display-name-in-quotes: all conventions from other CLIs. Our CLI just doesn't accept them. Swapping 4.6 for 4.7 doesn't remove the tax; it shifts where the agent trips over it.

The turn gap is about verification, not intelligence

So why 36% fewer turns? Read the two setup-full-flow traces side by side:

setup-full-flow: 4.6 (9 turns) vs 4.7 (4 turns)
4.6:                                       4.7:
1. cliwatch skills setup                   1. cliwatch skills setup
2. cliwatch --version                      2. cliwatch init --cli cliwatch --ci
3. cliwatch --help                         3. cliwatch validate --config FILE  # wrong
4. cliwatch init --cli cliwatch --ci       4. cliwatch validate --file FILE    # ok
5. cat cliwatch/tasks/02-usage.yaml
6. cat cliwatch/tasks/01-basics.yaml
7. cat cliwatch/cli-bench.yaml
8. cat .github/workflows/cliwatch.yml
9. cliwatch validate --file FILE

4.6 runs init, then cats four files to check what was written. 4.7 trusts init's success message and goes straight to validate. 4.7 isn't smarter on this task; it's less paranoid. On a CLI with clean success output that's efficiency. On a CLI with silent partial failures it would be reckless. The same trace signal can read either way depending on your CLI.

The one task where 4.7 took more turns was latest-failures: 4.7 drilled deeper and found --cli rejects display names (“CLIWatch CLI” needs to be the slug cliwatch). 4.6 never got that far.

What we're fixing

  • Alias auth status to whoami: the one hallucination shared exactly across both runs.
  • Make cliwatch runs filterable: support --limit and map runs list to runs.
  • Accept both forms on validate: positional and --config as an alias for --file.
  • Better error when --cli gets a non-slug value.
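The first three fixes can all live in one pre-parse pass. Here's a minimal sketch of that idea: rewrite the known hallucinated forms into canonical ones before the real parser sees argv. The command and flag names come from the fix list above; the function name and the pre-parse approach are assumptions for illustration, not CLIWatch's actual implementation.

```typescript
// Rewrites known agent-hallucinated forms into the canonical CLI surface.
// Runs once on raw argv, before argument parsing.
function normalizeArgv(argv: string[]): string[] {
  const out: string[] = [];
  for (let i = 0; i < argv.length; i++) {
    const a = argv[i];
    if (a === "auth" && argv[i + 1] === "status") {
      out.push("whoami"); // alias `auth status` -> `whoami`
      i++;
    } else if (a === "runs" && argv[i + 1] === "list") {
      out.push("runs"); // map `runs list` -> `runs`
      i++;
    } else if (a === "--config") {
      out.push("--file"); // accept `--config` as an alias for `--file`
    } else {
      out.push(a);
    }
  }
  return out;
}
```

In a Node CLI this would be called on process.argv.slice(2) before handing off to the parser, so every alias is honored uniformly across subcommands.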

None of these are 4.7-specific. They're things either run would have flagged. A better model papers over a bad CLI. It doesn't fix it.

Run it on your CLI

Compare Opus 4.6 vs 4.7 on your CLI
npx @cliwatch/cli-bench init
npx @cliwatch/cli-bench \
  --models anthropic/claude-opus-4.6,anthropic/claude-opus-4.7

Don't look at the pass rate. Look at which commands each model tried that didn't exist, and which flags they guessed wrong. That's your next quarter's CLI roadmap.
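If your traces land as structured records, that analysis is a short fold. A sketch, assuming a hypothetical trace shape with `cmd` and `exit` fields (CLIWatch's real schema may differ):

```typescript
// Hypothetical trace record: one entry per command the agent ran.
interface TraceEntry {
  cmd: string;  // full command line the agent tried
  exit: number; // exit code; non-zero means it failed
}

// Tally failing commands by their first two words, so hallucinated
// subcommands like `auth status` surface at the top of the list.
function failedCommandCounts(trace: TraceEntry[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of trace) {
    if (t.exit === 0) continue;
    const prefix = t.cmd.trim().split(/\s+/).slice(0, 2).join(" ");
    counts.set(prefix, (counts.get(prefix) ?? 0) + 1);
  }
  return counts;
}
```

Sort the map by count descending and you have the roadmap ranking this post describes.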

n=1 per model; same commit, same suite, back to back. Judge: Haiku 4.5. Minor caveat on tokens: run #14 executed after run #13 was persisted, so cliwatch runs returned one extra row for 4.7. That's a handful of extra tokens, biasing the +1% delta marginally higher.

See what agents hallucinate on your CLI

CLIWatch aggregates every command agents try and flags the ones that don't exist. The next quarter of your CLI roadmap is probably in that list.