# agent lists EC2 instances
$ aws ec2 describe-instances \
    --query 'Reservations[].Instances[].{Id:InstanceId,State:State.Name}' \
    --output json
  [{"Id":"i-0a1b2c","State":"running"}]

Can AI agents use AWS?

Amazon Web Services CLI. Agents manage cloud resources, configure services, and query infrastructure across AWS.


100% overall pass rate · 1 model tested · 4 tasks · v2.34.0 · 3/6/2026

AWS eval results by model

Model	Pass rate	Avg turns	Avg tokens
gpt-5-nano	100%	3.3	5.5k

AWS task results by model

Task	Difficulty	Intent	gpt-5-nano
help-navigation	easy	Show the help text for the 'aws s3 cp' command.	✓ 3 turns · 3.9k tokens
generate-skeleton	medium	Generate the CLI input skeleton JSON for the 'aws ec2 run-instances' command and save it to skeleton.json.	✓ 3 turns · 7.3k tokens
configure-profile	easy	Use 'aws configure set' to set the region to 'us-east-1' and the output format to 'json' for a profile called 'bench'.	✓ 4 turns · 4.3k tokens
list-public-s3	medium	List the contents of the public S3 bucket 's3://aws-roda-hcls-datalake' using --no-sign-request. Show just the first few results.	✓ 3 turns · 6.7k tokens
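
As a sanity check, the per-model summary (3.3 average turns, 5.5k average tokens) can be roughly reproduced from the per-task figures in the task table. The sketch below is illustrative only: the `RESULTS` name is made up here, and the token values are the rounded ones displayed on the page, so the last digit may differ from an average over exact counts.

```python
from statistics import mean

# (turns, tokens in thousands) per task, copied from the task results table.
RESULTS = {
    "help-navigation": (3, 3.9),
    "generate-skeleton": (3, 7.3),
    "configure-profile": (4, 4.3),
    "list-public-s3": (3, 6.7),
}

avg_turns = mean(turns for turns, _ in RESULTS.values())
avg_tokens_k = mean(tokens for _, tokens in RESULTS.values())

print(f"avg turns: {avg_turns:.2f}")
print(f"avg tokens: {avg_tokens_k:.2f}k")
```

The unrounded means are about 3.25 turns and 5.55k tokens, consistent with the displayed 3.3 and 5.5k up to rounding.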

Task suite source · 44 lines · YAML

- id: help-navigation
  intent: Show the help text for the 'aws s3 cp' command.
  assert:
    - ran: aws s3 cp help|aws help|aws s3 cp --help
    - exit_code: 0
  setup: []
  max_turns: 3
  difficulty: easy
  category: query
- id: generate-skeleton
  intent: Generate the CLI input skeleton JSON for the 'aws ec2 run-instances'
    command and save it to skeleton.json.
  assert:
    - file_exists: skeleton.json
    - file_contains:
        path: skeleton.json
        text: InstanceType
  setup: []
  max_turns: 5
  difficulty: medium
  category: generate
- id: configure-profile
  intent: Use 'aws configure set' to set the region to 'us-east-1' and the output
    format to 'json' for a profile called 'bench'.
  assert:
    - ran: aws configure set
    - verify:
        run: aws configure get region --profile bench
        output_contains: us-east-1
  setup: []
  max_turns: 4
  difficulty: easy
  category: config
- id: list-public-s3
  intent: List the contents of the public S3 bucket 's3://aws-roda-hcls-datalake'
    using --no-sign-request. Show just the first few results.
  assert:
    - ran: aws s3.*--no-sign-request
    - exit_code: 0
  setup: []
  max_turns: 5
  difficulty: medium
  category: s3
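
To make the assertion keys in the suite concrete, here is a minimal Python sketch of how a harness might evaluate them. It is not the @cliwatch/cli-bench implementation: the `RunRecord` type, the `check` and `task_passes` functions, and the reading of `ran` as a regular expression searched against each executed command are all assumptions inferred from the YAML. The `verify` key is left out because it needs a follow-up command to be run.

```python
import re
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RunRecord:
    """Hypothetical record of one agent run: commands executed and the final exit code."""
    commands: list[str] = field(default_factory=list)
    exit_code: int = 0


def check(assertion: dict, run: RunRecord, workdir: Path) -> bool:
    """Evaluate a single-key assertion mapping such as {"ran": "aws configure set"}."""
    (kind, arg), = assertion.items()
    if kind == "ran":
        # Assumption: the pattern is a regex that may match anywhere in any command.
        return any(re.search(arg, cmd) for cmd in run.commands)
    if kind == "exit_code":
        return run.exit_code == arg
    if kind == "file_exists":
        return (workdir / arg).is_file()
    if kind == "file_contains":
        target = workdir / arg["path"]
        return target.is_file() and arg["text"] in target.read_text()
    raise ValueError(f"unsupported assertion: {kind}")


def task_passes(assertions: list[dict], run: RunRecord, workdir: Path) -> bool:
    """A task passes only if every one of its assertions holds."""
    return all(check(a, run, workdir) for a in assertions)


# Example: the list-public-s3 assertions against a made-up run record.
example_run = RunRecord(
    commands=["aws s3 ls s3://aws-roda-hcls-datalake --no-sign-request"],
    exit_code=0,
)
print(task_passes(
    [{"ran": "aws s3.*--no-sign-request"}, {"exit_code": 0}],
    example_run,
    Path("."),
))
```

Treating `ran` as a regex is what lets a single pattern such as `aws s3 cp help|aws help|aws s3 cp --help` accept several equivalent commands.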

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for AWS — see the latest run. Set up the same for your CLI in minutes.

Model	Pass Rate	Delta
Sonnet 4.5	95%	+5%
GPT-4.1	80%	-5%
Haiku 4.5	65%	-10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
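
A PR comment of this kind boils down to comparing the current pass rates with those of the previous run. The following Python sketch shows one way to render such a table. The `render_comment` function is hypothetical, the previous-run rates are back-computed from the Delta column of the sample table, and posting the comment to a pull request is out of scope.

```python
def render_comment(current: dict[str, float], previous: dict[str, float]) -> str:
    """Build a Markdown table of per-model pass rates and the change since the last run."""
    lines = ["| Model | Pass rate | Delta |", "| --- | --- | --- |"]
    for model, rate in current.items():
        if model in previous:
            # Delta in percentage points, rounded to the nearest whole point.
            delta = f"{round((rate - previous[model]) * 100):+d}%"
        else:
            delta = "n/a"  # no earlier run to compare against
        lines.append(f"| {model} | {rate:.0%} | {delta} |")
    return "\n".join(lines)


# Sample values taken from the table on this page; previous rates are derived, not measured.
current = {"Sonnet 4.5": 0.95, "GPT-4.1": 0.80, "Haiku 4.5": 0.65}
previous = {"Sonnet 4.5": 0.90, "GPT-4.1": 0.85, "Haiku 4.5": 0.75}
print(render_comment(current, previous))
```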

Pass rate · Last 30 days (chart spanning v1.0 to v1.6)

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.
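
The gating logic can be sketched in a few lines of Python. This is an illustration under assumptions, not CLIWatch's own code: the `failing_models` and `gate` names are hypothetical, the thresholds are the example values from the config shown here converted to fractions, and a model with no result is treated as failing.

```python
# Thresholds copied from the example config (80%, 75%, 60%), expressed as fractions.
THRESHOLDS = {"claude-sonnet-4-5": 0.80, "gpt-4.1": 0.75, "claude-haiku-4-5": 0.60}


def failing_models(pass_rates: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the models whose pass rate is below its threshold.

    Assumption: a model that has a threshold but no result counts as failing.
    """
    return [
        model
        for model, minimum in thresholds.items()
        if pass_rates.get(model, 0.0) < minimum
    ]


def gate(pass_rates: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> int:
    """Return a process exit code: 0 if every threshold is met, 1 otherwise."""
    failed = failing_models(pass_rates, thresholds)
    for model in failed:
        print(f"FAIL {model}: {pass_rates.get(model, 0.0):.0%} < {thresholds[model]:.0%}")
    return 1 if failed else 0
```

In a CI step, a script like this would pass the return value of `gate(...)` to `sys.exit`, so that a non-zero code fails the job.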

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.


Compare other CLI evals

git

npm

fly