# agent lists EC2 instances
$ aws ec2 describe-instances \
    --query 'Reservations[].Instances[].{Id:InstanceId,State:State.Name}' \
    --output json
[{"Id":"i-0a1b2c","State":"running"}]

Can AI agents use AWS?

Amazon Web Services CLI: agents use it to manage cloud resources, configure services, and query infrastructure across AWS.

100% overall pass rate · 1 model tested · 4 tasks · v2.34.0 · 3/6/2026

AWS eval results by model

Model       Pass rate  Avg turns  Avg tokens
gpt-5-nano  100%       3.3        5.5k

AWS task results by model

Task: help-navigation (easy)
Show the help text for the 'aws s3 cp' command.
gpt-5-nano: 3t

Task: generate-skeleton (medium)
Generate the CLI input skeleton JSON for the 'aws ec2 run-instances' command and save it to skeleton.json.
gpt-5-nano: 3t

Task: configure-profile (easy)
Use 'aws configure set' to set the region to 'us-east-1' and the output format to 'json' for a profile called 'bench'.
gpt-5-nano: 4t

Task: list-public-s3 (medium)
List the contents of the public S3 bucket 's3://aws-roda-hcls-datalake' using --no-sign-request. Show just the first few results.
gpt-5-nano: 3t
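
The 3.3 average turns reported for gpt-5-nano lines up with the per-task counts. A quick check, assuming the `3t` and `4t` labels mean turns used and that the page rounds half up to one decimal place (both are our reading of the results, not something the page states):

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-task turn counts for gpt-5-nano, read from the results above.
# Assumption: "3t" means 3 turns.
turns = {
    "help-navigation": 3,
    "generate-skeleton": 3,
    "configure-profile": 4,
    "list-public-s3": 3,
}

# Exact mean of the four counts.
mean_turns = Decimal(sum(turns.values())) / Decimal(len(turns))

# Round half up to one decimal place (assumed display rounding).
rounded = mean_turns.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print(mean_turns, rounded)
```

Decimal is used here because Python's built-in `round` rounds halves to the nearest even digit, which would not reproduce the displayed figure.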
Task suite source (44 lines · YAML)
- id: help-navigation
  intent: Show the help text for the 'aws s3 cp' command.
  assert:
    - ran: aws s3 cp help|aws help|aws s3 cp --help
    - exit_code: 0
  setup: []
  max_turns: 3
  difficulty: easy
  category: query
- id: generate-skeleton
  intent: Generate the CLI input skeleton JSON for the 'aws ec2 run-instances'
    command and save it to skeleton.json.
  assert:
    - file_exists: skeleton.json
    - file_contains:
        path: skeleton.json
        text: InstanceType
  setup: []
  max_turns: 5
  difficulty: medium
  category: generate
- id: configure-profile
  intent: Use 'aws configure set' to set the region to 'us-east-1' and the output
    format to 'json' for a profile called 'bench'.
  assert:
    - ran: aws configure set
    - verify:
        run: aws configure get region --profile bench
        output_contains: us-east-1
  setup: []
  max_turns: 4
  difficulty: easy
  category: config
- id: list-public-s3
  intent: List the contents of the public S3 bucket 's3://aws-roda-hcls-datalake'
    using --no-sign-request. Show just the first few results.
  assert:
    - ran: aws s3.*--no-sign-request
    - exit_code: 0
  setup: []
  max_turns: 5
  difficulty: medium
  category: s3
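
To make the assertions concrete, here is a minimal sketch of how `ran` and `file_contains` checks like the ones above could be evaluated. It assumes `ran` is a regular expression searched against each command the agent executed, and that `file_contains` is a plain substring check; the function names and the sample command list are hypothetical, and this is not the actual @cliwatch/cli-bench implementation.

```python
import re
import tempfile
from pathlib import Path


def ran(pattern, commands):
    """True if any executed command matches the pattern (assumed regex search)."""
    return any(re.search(pattern, cmd) for cmd in commands)


def file_contains(path, text):
    """True if the file exists and contains the text (assumed substring check)."""
    p = Path(path)
    return p.is_file() and text in p.read_text()


# Hypothetical commands an agent might have run for three of the tasks.
commands = [
    "aws s3 cp help",
    "aws configure set region us-east-1 --profile bench",
    "aws s3 ls s3://aws-roda-hcls-datalake --no-sign-request",
]

# Patterns copied from the task suite's `ran:` assertions.
print(ran(r"aws s3 cp help|aws help|aws s3 cp --help", commands))
print(ran(r"aws configure set", commands))
print(ran(r"aws s3.*--no-sign-request", commands))

# Stand-in file for the generate-skeleton task; the content is a placeholder,
# not real output from the AWS CLI.
with tempfile.TemporaryDirectory() as tmp:
    skeleton = Path(tmp) / "skeleton.json"
    skeleton.write_text('{"InstanceType": ""}')
    print(file_contains(skeleton, "InstanceType"))
```

Under this reading, `aws s3.*--no-sign-request` accepts any `aws s3` subcommand as long as the flag appears later in the same command.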

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for AWS (see the latest run). Set up the same for your CLI in minutes.

Model       Pass Rate  Delta
Sonnet 4.5  95%        +5%
GPT-4.1     80%        -5%
Haiku 4.5   65%        -10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.

[Chart: pass rate over the last 30 days, from v1.0 to v1.6]

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.
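
As a rough illustration of the gating logic, the sketch below compares per-model pass rates against the thresholds shown above. The pass rates are the sample values from the PR comment table, mapped to the threshold keys by our own assumption (for example, "Sonnet 4.5" to `claude-sonnet-4-5`); `failing_models` is a hypothetical helper, not part of CLIWatch.

```python
# Thresholds copied from the sample config above (percent).
thresholds = {
    "claude-sonnet-4-5": 80,
    "gpt-4.1": 75,
    "claude-haiku-4-5": 60,
}

# Sample pass rates from the PR comment table (percent).
# Assumption: the display names map to these model keys.
pass_rates = {
    "claude-sonnet-4-5": 95,
    "gpt-4.1": 80,
    "claude-haiku-4-5": 65,
}


def failing_models(pass_rates, thresholds):
    """Return models below their threshold; a missing result counts as failing."""
    return sorted(
        model
        for model, minimum in thresholds.items()
        if pass_rates.get(model, 0) < minimum
    )


failures = failing_models(pass_rates, thresholds)
exit_code = 1 if failures else 0
print("failing:", failures, "exit code:", exit_code)
```

In a CI step, passing `exit_code` to `sys.exit` would fail the job whenever any model drops below its target. With the sample numbers, all three models clear their thresholds.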

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.
