# agent lists EC2 instances
$ aws ec2 describe-instances --query 'Reservations[].Instances[].{Id:InstanceId, State:State.Name}' --output json
[{"Id":"i-0a1b2c","State":"running"}]
Can AI agents use AWS?
Amazon Web Services CLI. Agents manage cloud resources, configure services, and query infrastructure across AWS.
AWS eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 100% | 3.3 | 5.5k |
AWS task results by model
| Task | Difficulty | Intent | gpt-5-nano |
|---|---|---|---|
| help-navigation | easy | Show the help text for the 'aws s3 cp' command. | ✓ (3 turns) |
| generate-skeleton | medium | Generate the CLI input skeleton JSON for the 'aws ec2 run-instances' command and save it to skeleton.json. | ✓ (3 turns) |
| configure-profile | easy | Use 'aws configure set' to set the region to 'us-east-1' and the output format to 'json' for a profile called 'bench'. | ✓ (4 turns) |
| list-public-s3 | medium | List the contents of the public S3 bucket 's3://aws-roda-hcls-datalake' using --no-sign-request. Show just the first few results. | ✓ (3 turns) |
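As a quick arithmetic check, the summary row can be recomputed from the per-task results. The sketch below reads the number next to each ✓ as the turn count, which is an assumption about the table's notation, and the variable names are illustrative only.

```python
from statistics import mean

# Per-task results for gpt-5-nano as listed in the table above:
# (task id, passed, turns). Reading the number after each check mark as
# "turns" is an assumption about the page's notation.
results = [
    ("help-navigation", True, 3),
    ("generate-skeleton", True, 3),
    ("configure-profile", True, 4),
    ("list-public-s3", True, 3),
]

# Pass rate is the share of tasks that passed, as a percentage.
pass_rate = 100 * sum(passed for _, passed, _ in results) / len(results)

# Average turns is the plain arithmetic mean over all tasks.
avg_turns = mean(turns for _, _, turns in results)

print(f"pass rate: {pass_rate:.0f}%, average turns: {avg_turns}")
```

Under that reading, all four tasks pass and the mean is 3.25 turns, which is consistent with the 100% and 3.3 in the summary table if the average is rounded half-up to one decimal.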
Task suite source (44 lines, YAML)
- id: help-navigation
  intent: Show the help text for the 'aws s3 cp' command.
  assert:
    - ran: aws s3 cp help|aws help|aws s3 cp --help
    - exit_code: 0
  setup: []
  max_turns: 3
  difficulty: easy
  category: query
- id: generate-skeleton
  intent: Generate the CLI input skeleton JSON for the 'aws ec2 run-instances'
    command and save it to skeleton.json.
  assert:
    - file_exists: skeleton.json
    - file_contains:
        path: skeleton.json
        text: InstanceType
  setup: []
  max_turns: 5
  difficulty: medium
  category: generate
- id: configure-profile
  intent: Use 'aws configure set' to set the region to 'us-east-1' and the output
    format to 'json' for a profile called 'bench'.
  assert:
    - ran: aws configure set
    - verify:
        run: aws configure get region --profile bench
        output_contains: us-east-1
  setup: []
  max_turns: 4
  difficulty: easy
  category: config
- id: list-public-s3
  intent: List the contents of the public S3 bucket 's3://aws-roda-hcls-datalake'
    using --no-sign-request. Show just the first few results.
  assert:
    - ran: aws s3.*--no-sign-request
    - exit_code: 0
  setup: []
  max_turns: 5
  difficulty: medium
  category: s3
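To make the `ran` and `exit_code` assertions concrete, here is a minimal sketch of how they could be evaluated. It assumes `ran` is a regular expression searched against each command the agent executed, which the `|` and `.*` in the patterns above suggest but the source does not confirm. The function names `check_ran` and `check_exit_code` are hypothetical and are not part of @cliwatch/cli-bench.

```python
import re


def check_ran(pattern: str, commands: list[str]) -> bool:
    """Return True if any executed command matches the regex pattern.

    Assumption: `ran` patterns are regular expressions matched anywhere
    in the command line (re.search), not anchored full matches.
    """
    return any(re.search(pattern, command) for command in commands)


def check_exit_code(expected: int, actual: int) -> bool:
    """Return True if the last command exited with the expected status."""
    return expected == actual


# Hypothetical command log for the list-public-s3 task.
executed = ["aws s3 ls s3://aws-roda-hcls-datalake --no-sign-request"]

passed = (
    check_ran(r"aws s3.*--no-sign-request", executed)
    and check_exit_code(0, 0)
)
print("list-public-s3 passed:", passed)
```

With this reading, the `help-navigation` pattern accepts any of its three alternatives, so an agent that runs `aws s3 cp help` satisfies it as long as the command also exits with status 0.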
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for AWS — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%
Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
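The gating logic can be sketched as a comparison of each model's measured pass rate against its threshold. This is an illustration under stated assumptions, not CLIWatch's implementation: the function names are hypothetical, the thresholds are copied from the sample config above, the pass rates come from the sample table earlier in this section, and the mapping from display names such as "Sonnet 4.5" to IDs such as `claude-sonnet-4-5` is assumed.

```python
def parse_percent(value: str) -> float:
    """Convert a string such as '80%' to the number 80.0."""
    return float(value.rstrip("%"))


def failing_models(thresholds: dict[str, str], pass_rates: dict[str, float]) -> list[str]:
    """List models whose pass rate is missing or below the threshold."""
    failures = []
    for model, minimum in thresholds.items():
        actual = pass_rates.get(model)
        if actual is None or actual < parse_percent(minimum):
            failures.append(model)
    return failures


def gate_exit_code(thresholds: dict[str, str], pass_rates: dict[str, float]) -> int:
    """Return 1 (fail the CI job) if any model misses its threshold, else 0."""
    return 1 if failing_models(thresholds, pass_rates) else 0


# Thresholds from the sample config; pass rates from the sample table.
thresholds = {"claude-sonnet-4-5": "80%", "gpt-4.1": "75%", "claude-haiku-4-5": "60%"}
pass_rates = {"claude-sonnet-4-5": 95.0, "gpt-4.1": 80.0, "claude-haiku-4-5": 65.0}

print("failing models:", failing_models(thresholds, pass_rates))
print("exit code:", gate_exit_code(thresholds, pass_rates))
```

With the sample numbers, every model clears its threshold, so the gate passes. A model with no recorded pass rate is treated as failing in this sketch, which is a conservative choice and may differ from the real tool.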
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.