# agent retrieves a customer $ stripe customers list --limit 1 {"data": [{"id": "cus_abc123", "email": "user@example.com"}]}
Can AI agents use Stripe?
Payment infrastructure CLI. Agents listen to webhooks, trigger test events, and manage Stripe resources.
See the latest run →Stripe eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 86% | 4.8 | 13.8k |
Stripe task results by model
| Task | gpt-5-nano |
|---|---|
check-balancemedium I just configured the Stripe CLI. Check my test account balance to verify everything is connected. | ✓2t |
explore-trigger-eventsmedium I'm building a checkout flow and need to test my webhook handler. What payment-related events can I trigger with the Stripe CLI? | ✗0t |
trigger-payment-eventmedium Tool call result: the user has a webhook handler running locally and wants to test it. Trigger a payment_intent.succeeded event using the Stripe CLI. | ✓5t |
create-customermedium I need a test customer for local dev. Create one named 'bench-Dev Customer' with email 'bench-dev@example.com'. | ✓1t |
update-producthard I need to update my product description. List my Stripe products, find the one named 'bench-Update Me', and change its description to 'Updated by benchmark test'. | ✓3t |
create-product-with-pricemedium Tool call result: the user's app needs a subscription product. Create a Stripe product named 'bench-Pro Plan' and then create a recurring monthly price of $29 for that product. | ✓5t |
create-couponmedium I'm launching soon and want a discount code. Create a Stripe coupon called 'bench-LAUNCH20' that gives 20% off, valid for 3 months. | ✗5t |
list-customers-limitmedium Show me the 3 most recent customers on my Stripe test account. | ✓2t |
list-recent-eventshard Something weird happened with payments. Show me the 5 most recent events on my Stripe test account. | ✓4t |
refund-chargehard Tool call result: a customer is requesting a refund. List the most recent charges on the account, find the latest one, and create a full refund for it. | ✓6t |
create-and-update-customerhard Tool call result: the user wants a test customer with metadata. Create a Stripe customer named 'bench-Update Test' with email 'bench-update@example.com', then update that customer to add metadata with key 'tier' and value 'premium'. | ✓5t |
product-catalog-setuphard Tool call result: the user needs a pricing page with two tiers. Create a product named 'bench-Starter' with a monthly price of $9, and a product named 'bench-Growth' with a monthly price of $49. | ✓7t |
full-payment-flowhard Tool call result: the user is building a checkout flow and needs test data. Create a customer named 'bench-Checkout User' with email 'bench-checkout@example.com', then create a product named 'bench-Widget', then create a one-time price of $25 (2500 cents USD) for it, then create a payment intent for that amount attached to the customer. | ✓6t |
subscription-setuphard Tool call result: the user wants to test their subscription flow. Create a customer named 'bench-Sub User' with email 'bench-sub@example.com', create a product named 'bench-Monthly' with a recurring monthly price of $15, then subscribe the customer to that price. | ✓12t |
Task suite source168 lines · YAML
- id: check-balance
intent: I just configured the Stripe CLI. Check my test account balance to
verify everything is connected.
assert:
- ran: stripe balance retrieve
- exit_code: 0
setup: []
max_turns: 5
difficulty: medium
category: human
- id: explore-trigger-events
intent: I'm building a checkout flow and need to test my webhook handler. What
payment-related events can I trigger with the Stripe CLI?
assert:
- ran: stripe trigger.*(--help|--list|-l)
- exit_code: 0
setup: []
max_turns: 5
difficulty: medium
category: human
- id: trigger-payment-event
intent: "Tool call result: the user has a webhook handler running locally and
wants to test it. Trigger a payment_intent.succeeded event using the Stripe
CLI."
assert:
- ran: stripe trigger.*payment_intent
setup: []
max_turns: 5
difficulty: medium
category: agent
- id: create-customer
intent: I need a test customer for local dev. Create one named 'bench-Dev
Customer' with email 'bench-dev@example.com'.
assert:
- ran: stripe customers create
- output_contains: bench-Dev Customer
- output_contains: bench-dev@example.com
setup: []
max_turns: 5
difficulty: medium
category: human
- id: create-product-with-price
intent: "Tool call result: the user's app needs a subscription product. Create a
Stripe product named 'bench-Pro Plan' and then create a recurring monthly
price of $29 for that product."
assert:
- ran: stripe products create
- ran: stripe prices create
setup: []
max_turns: 8
difficulty: medium
category: agent
- id: create-coupon
intent: I'm launching soon and want a discount code. Create a Stripe coupon
called 'bench-LAUNCH20' that gives 20% off, valid for 3 months.
assert:
- ran: stripe coupons create
- output_contains: bench-LAUNCH20
setup: []
max_turns: 5
difficulty: medium
category: human
- id: list-customers-limit
intent: Show me the 3 most recent customers on my Stripe test account.
assert:
- ran: stripe customers list
- ran: .*(-l|--limit).*3
setup: []
max_turns: 5
difficulty: medium
category: human
- id: list-recent-events
intent: Something weird happened with payments. Show me the 5 most recent events
on my Stripe test account.
assert:
- ran: stripe events list
- ran: .*(-l|--limit).*5
setup: []
max_turns: 5
difficulty: hard
category: human
- id: refund-charge
intent: "Tool call result: a customer is requesting a refund. List the most
recent charges on the account, find the latest one, and create a full refund
for it."
assert:
- ran: stripe charges list
- ran: stripe refunds create
setup:
- stripe charges create --amount 2000 --currency usd --source tok_visa
2>/dev/null || true
max_turns: 8
difficulty: hard
category: agent
- id: update-product
intent: I need to update my product description. List my Stripe products, find
the one named 'bench-Update Me', and change its description to 'Updated by
benchmark test'.
assert:
- ran: stripe products list
- ran: stripe products update
- output_contains: Updated by benchmark test
setup:
- stripe products create --name 'bench-Update Me' --description 'Original
description' 2>/dev/null || true
max_turns: 8
difficulty: hard
category: human
- id: create-and-update-customer
intent: "Tool call result: the user wants a test customer with metadata. Create
a Stripe customer named 'bench-Update Test' with email
'bench-update@example.com', then update that customer to add metadata with
key 'tier' and value 'premium'."
assert:
- ran: stripe customers create
- ran: stripe customers update
- output_contains: premium
setup: []
max_turns: 10
difficulty: hard
category: agent
- id: product-catalog-setup
intent: "Tool call result: the user needs a pricing page with two tiers. Create
a product named 'bench-Starter' with a monthly price of $9, and a product
named 'bench-Growth' with a monthly price of $49."
assert:
- ran: stripe products create
- ran: stripe prices create
- run_count:
pattern: stripe products create
min: 2
- run_count:
pattern: stripe prices create
min: 2
setup: []
max_turns: 12
difficulty: hard
category: agent
- id: full-payment-flow
intent: "Tool call result: the user is building a checkout flow and needs test
data. Create a customer named 'bench-Checkout User' with email
'bench-checkout@example.com', then create a product named 'bench-Widget',
then create a one-time price of $25 (2500 cents USD) for it, then create a
payment intent for that amount attached to the customer."
assert:
- ran: stripe customers create
- ran: stripe products create
- ran: stripe prices create
- ran: stripe payment_intents create
setup: []
max_turns: 12
difficulty: hard
category: agent
- id: subscription-setup
intent: "Tool call result: the user wants to test their subscription flow.
Create a customer named 'bench-Sub User' with email 'bench-sub@example.com',
create a product named 'bench-Monthly' with a recurring monthly price of
$15, then subscribe the customer to that price."
assert:
- ran: stripe customers create
- ran: stripe products create
- ran: stripe prices create
- ran: stripe subscriptions create
setup: []
max_turns: 12
difficulty: hard
category: agent
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 14 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for Stripe — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
claude-sonnet-4-5: 80%
gpt-4.1: 75%
claude-haiku-4-5: 60%Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.