# agent retrieves a customer
$ stripe customers list --limit 1
  {"data": [{"id": "cus_abc123",
   "email": "user@example.com"}]}

Can AI agents use Stripe?

Payment infrastructure CLI. Agents listen to webhooks, trigger test events, and manage Stripe resources.

Docs →GitHub →

See the latest run →

86% overall pass rate1 model tested14 tasksv1.37.23/6/2026

Stripe eval results by model

Model	Pass rate	Avg turns	Avg tokens
gpt-5-nano	86%	4.8	13.8k

Stripe task results by model

Task	gpt-5-nano
check-balancemedium I just configured the Stripe CLI. Check my test account balance to verify everything is connected.	✓2t2 turns · 4.3k tokens
explore-trigger-eventsmedium I'm building a checkout flow and need to test my webhook handler. What payment-related events can I trigger with the Stripe CLI?	✗0t0 turns · 4.0k tokens
trigger-payment-eventmedium Tool call result: the user has a webhook handler running locally and wants to test it. Trigger a payment_intent.succeeded event using the Stripe CLI.	✓5t5 turns · 16.6k tokens
create-customermedium I need a test customer for local dev. Create one named 'bench-Dev Customer' with email 'bench-dev@example.com'.	✓1t1 turn · 2.8k tokens
update-producthard I need to update my product description. List my Stripe products, find the one named 'bench-Update Me', and change its description to 'Updated by benchmark test'.	✓3t3 turns · 7.4k tokens
create-product-with-pricemedium Tool call result: the user's app needs a subscription product. Create a Stripe product named 'bench-Pro Plan' and then create a recurring monthly price of $29 for that product.	✓5t5 turns · 11.6k tokens
create-couponmedium I'm launching soon and want a discount code. Create a Stripe coupon called 'bench-LAUNCH20' that gives 20% off, valid for 3 months.	✗5t5 turns · 13.3k tokens
list-customers-limitmedium Show me the 3 most recent customers on my Stripe test account.	✓2t2 turns · 6.6k tokens
list-recent-eventshard Something weird happened with payments. Show me the 5 most recent events on my Stripe test account.	✓4t4 turns · 19.9k tokens
refund-chargehard Tool call result: a customer is requesting a refund. List the most recent charges on the account, find the latest one, and create a full refund for it.	✓6t6 turns · 26.6k tokens
create-and-update-customerhard Tool call result: the user wants a test customer with metadata. Create a Stripe customer named 'bench-Update Test' with email 'bench-update@example.com', then update that customer to add metadata with key 'tier' and value 'premium'.	✓5t5 turns · 13.6k tokens
product-catalog-setuphard Tool call result: the user needs a pricing page with two tiers. Create a product named 'bench-Starter' with a monthly price of $9, and a product named 'bench-Growth' with a monthly price of $49.	✓7t7 turns · 19.8k tokens
full-payment-flowhard Tool call result: the user is building a checkout flow and needs test data. Create a customer named 'bench-Checkout User' with email 'bench-checkout@example.com', then create a product named 'bench-Widget', then create a one-time price of $25 (2500 cents USD) for it, then create a payment intent for that amount attached to the customer.	✓6t6 turns · 18.5k tokens
subscription-setuphard Tool call result: the user wants to test their subscription flow. Create a customer named 'bench-Sub User' with email 'bench-sub@example.com', create a product named 'bench-Monthly' with a recurring monthly price of $15, then subscribe the customer to that price.	✓12t12 turns · 28.6k tokens

Task suite source168 lines · YAML

- id: check-balance
  intent: I just configured the Stripe CLI. Check my test account balance to
    verify everything is connected.
  assert:
    - ran: stripe balance retrieve
    - exit_code: 0
  setup: []
  max_turns: 5
  difficulty: medium
  category: human
- id: explore-trigger-events
  intent: I'm building a checkout flow and need to test my webhook handler. What
    payment-related events can I trigger with the Stripe CLI?
  assert:
    - ran: stripe trigger.*(--help|--list|-l)
    - exit_code: 0
  setup: []
  max_turns: 5
  difficulty: medium
  category: human
- id: trigger-payment-event
  intent: "Tool call result: the user has a webhook handler running locally and
    wants to test it. Trigger a payment_intent.succeeded event using the Stripe
    CLI."
  assert:
    - ran: stripe trigger.*payment_intent
  setup: []
  max_turns: 5
  difficulty: medium
  category: agent
- id: create-customer
  intent: I need a test customer for local dev. Create one named 'bench-Dev
    Customer' with email 'bench-dev@example.com'.
  assert:
    - ran: stripe customers create
    - output_contains: bench-Dev Customer
    - output_contains: bench-dev@example.com
  setup: []
  max_turns: 5
  difficulty: medium
  category: human
- id: create-product-with-price
  intent: "Tool call result: the user's app needs a subscription product. Create a
    Stripe product named 'bench-Pro Plan' and then create a recurring monthly
    price of $29 for that product."
  assert:
    - ran: stripe products create
    - ran: stripe prices create
  setup: []
  max_turns: 8
  difficulty: medium
  category: agent
- id: create-coupon
  intent: I'm launching soon and want a discount code. Create a Stripe coupon
    called 'bench-LAUNCH20' that gives 20% off, valid for 3 months.
  assert:
    - ran: stripe coupons create
    - output_contains: bench-LAUNCH20
  setup: []
  max_turns: 5
  difficulty: medium
  category: human
- id: list-customers-limit
  intent: Show me the 3 most recent customers on my Stripe test account.
  assert:
    - ran: stripe customers list
    - ran: .*(-l|--limit).*3
  setup: []
  max_turns: 5
  difficulty: medium
  category: human
- id: list-recent-events
  intent: Something weird happened with payments. Show me the 5 most recent events
    on my Stripe test account.
  assert:
    - ran: stripe events list
    - ran: .*(-l|--limit).*5
  setup: []
  max_turns: 5
  difficulty: hard
  category: human
- id: refund-charge
  intent: "Tool call result: a customer is requesting a refund. List the most
    recent charges on the account, find the latest one, and create a full refund
    for it."
  assert:
    - ran: stripe charges list
    - ran: stripe refunds create
  setup:
    - stripe charges create --amount 2000 --currency usd --source tok_visa
      2>/dev/null || true
  max_turns: 8
  difficulty: hard
  category: agent
- id: update-product
  intent: I need to update my product description. List my Stripe products, find
    the one named 'bench-Update Me', and change its description to 'Updated by
    benchmark test'.
  assert:
    - ran: stripe products list
    - ran: stripe products update
    - output_contains: Updated by benchmark test
  setup:
    - stripe products create --name 'bench-Update Me' --description 'Original
      description' 2>/dev/null || true
  max_turns: 8
  difficulty: hard
  category: human
- id: create-and-update-customer
  intent: "Tool call result: the user wants a test customer with metadata. Create
    a Stripe customer named 'bench-Update Test' with email
    'bench-update@example.com', then update that customer to add metadata with
    key 'tier' and value 'premium'."
  assert:
    - ran: stripe customers create
    - ran: stripe customers update
    - output_contains: premium
  setup: []
  max_turns: 10
  difficulty: hard
  category: agent
- id: product-catalog-setup
  intent: "Tool call result: the user needs a pricing page with two tiers. Create
    a product named 'bench-Starter' with a monthly price of $9, and a product
    named 'bench-Growth' with a monthly price of $49."
  assert:
    - ran: stripe products create
    - ran: stripe prices create
    - run_count:
        pattern: stripe products create
        min: 2
    - run_count:
        pattern: stripe prices create
        min: 2
  setup: []
  max_turns: 12
  difficulty: hard
  category: agent
- id: full-payment-flow
  intent: "Tool call result: the user is building a checkout flow and needs test
    data. Create a customer named 'bench-Checkout User' with email
    'bench-checkout@example.com', then create a product named 'bench-Widget',
    then create a one-time price of $25 (2500 cents USD) for it, then create a
    payment intent for that amount attached to the customer."
  assert:
    - ran: stripe customers create
    - ran: stripe products create
    - ran: stripe prices create
    - ran: stripe payment_intents create
  setup: []
  max_turns: 12
  difficulty: hard
  category: agent
- id: subscription-setup
  intent: "Tool call result: the user wants to test their subscription flow.
    Create a customer named 'bench-Sub User' with email 'bench-sub@example.com',
    create a product named 'bench-Monthly' with a recurring monthly price of
    $15, then subscribe the customer to that price."
  assert:
    - ran: stripe customers create
    - ran: stripe products create
    - ran: stripe prices create
    - ran: stripe subscriptions create
  setup: []
  max_turns: 12
  difficulty: hard
  category: agent

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 14 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for Stripe — see the latest run. Set up the same for your CLI in minutes.

Model	Pass Rate	Delta
Sonnet 4.5	95%	+5%
GPT-4.1	80%	-5%
Haiku 4.5	65%	-10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.

Pass rateLast 30 days

v1.0v1.6

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.

Start Free Read the guide

Compare other CLI evals

git

npm

aws

fly