# agent retrieves a customer
$ stripe customers list --limit 1
  {"data": [{"id": "cus_abc123",
   "email": "user@example.com"}]}

Can AI agents use Stripe?

Payment infrastructure CLI. Agents listen to webhooks, trigger test events, and manage Stripe resources.

See the latest run →
86% overall pass rate1 model tested14 tasksv1.37.23/6/2026

Stripe eval results by model

ModelPass rateAvg turnsAvg tokens
gpt-5-nano86%4.813.8k

Stripe task results by model

Taskgpt-5-nano
check-balancemedium
I just configured the Stripe CLI. Check my test account balance to verify everything is connected.
2t
explore-trigger-eventsmedium
I'm building a checkout flow and need to test my webhook handler. What payment-related events can I trigger with the Stripe CLI?
0t
trigger-payment-eventmedium
Tool call result: the user has a webhook handler running locally and wants to test it. Trigger a payment_intent.succeeded event using the Stripe CLI.
5t
create-customermedium
I need a test customer for local dev. Create one named 'bench-Dev Customer' with email 'bench-dev@example.com'.
1t
update-producthard
I need to update my product description. List my Stripe products, find the one named 'bench-Update Me', and change its description to 'Updated by benchmark test'.
3t
create-product-with-pricemedium
Tool call result: the user's app needs a subscription product. Create a Stripe product named 'bench-Pro Plan' and then create a recurring monthly price of $29 for that product.
5t
create-couponmedium
I'm launching soon and want a discount code. Create a Stripe coupon called 'bench-LAUNCH20' that gives 20% off, valid for 3 months.
5t
list-customers-limitmedium
Show me the 3 most recent customers on my Stripe test account.
2t
list-recent-eventshard
Something weird happened with payments. Show me the 5 most recent events on my Stripe test account.
4t
refund-chargehard
Tool call result: a customer is requesting a refund. List the most recent charges on the account, find the latest one, and create a full refund for it.
6t
create-and-update-customerhard
Tool call result: the user wants a test customer with metadata. Create a Stripe customer named 'bench-Update Test' with email 'bench-update@example.com', then update that customer to add metadata with key 'tier' and value 'premium'.
5t
product-catalog-setuphard
Tool call result: the user needs a pricing page with two tiers. Create a product named 'bench-Starter' with a monthly price of $9, and a product named 'bench-Growth' with a monthly price of $49.
7t
full-payment-flowhard
Tool call result: the user is building a checkout flow and needs test data. Create a customer named 'bench-Checkout User' with email 'bench-checkout@example.com', then create a product named 'bench-Widget', then create a one-time price of $25 (2500 cents USD) for it, then create a payment intent for that amount attached to the customer.
6t
subscription-setuphard
Tool call result: the user wants to test their subscription flow. Create a customer named 'bench-Sub User' with email 'bench-sub@example.com', create a product named 'bench-Monthly' with a recurring monthly price of $15, then subscribe the customer to that price.
12t
Task suite source168 lines · YAML
- id: check-balance
  intent: I just configured the Stripe CLI. Check my test account balance to
    verify everything is connected.
  assert:
    - ran: stripe balance retrieve
    - exit_code: 0
  setup: []
  max_turns: 5
  difficulty: medium
  category: human
- id: explore-trigger-events
  intent: I'm building a checkout flow and need to test my webhook handler. What
    payment-related events can I trigger with the Stripe CLI?
  assert:
    - ran: stripe trigger.*(--help|--list|-l)
    - exit_code: 0
  setup: []
  max_turns: 5
  difficulty: medium
  category: human
- id: trigger-payment-event
  intent: "Tool call result: the user has a webhook handler running locally and
    wants to test it. Trigger a payment_intent.succeeded event using the Stripe
    CLI."
  assert:
    - ran: stripe trigger.*payment_intent
  setup: []
  max_turns: 5
  difficulty: medium
  category: agent
- id: create-customer
  intent: I need a test customer for local dev. Create one named 'bench-Dev
    Customer' with email 'bench-dev@example.com'.
  assert:
    - ran: stripe customers create
    - output_contains: bench-Dev Customer
    - output_contains: bench-dev@example.com
  setup: []
  max_turns: 5
  difficulty: medium
  category: human
- id: create-product-with-price
  intent: "Tool call result: the user's app needs a subscription product. Create a
    Stripe product named 'bench-Pro Plan' and then create a recurring monthly
    price of $29 for that product."
  assert:
    - ran: stripe products create
    - ran: stripe prices create
  setup: []
  max_turns: 8
  difficulty: medium
  category: agent
- id: create-coupon
  intent: I'm launching soon and want a discount code. Create a Stripe coupon
    called 'bench-LAUNCH20' that gives 20% off, valid for 3 months.
  assert:
    - ran: stripe coupons create
    - output_contains: bench-LAUNCH20
  setup: []
  max_turns: 5
  difficulty: medium
  category: human
- id: list-customers-limit
  intent: Show me the 3 most recent customers on my Stripe test account.
  assert:
    - ran: stripe customers list
    - ran: .*(-l|--limit).*3
  setup: []
  max_turns: 5
  difficulty: medium
  category: human
- id: list-recent-events
  intent: Something weird happened with payments. Show me the 5 most recent events
    on my Stripe test account.
  assert:
    - ran: stripe events list
    - ran: .*(-l|--limit).*5
  setup: []
  max_turns: 5
  difficulty: hard
  category: human
- id: refund-charge
  intent: "Tool call result: a customer is requesting a refund. List the most
    recent charges on the account, find the latest one, and create a full refund
    for it."
  assert:
    - ran: stripe charges list
    - ran: stripe refunds create
  setup:
    - stripe charges create --amount 2000 --currency usd --source tok_visa
      2>/dev/null || true
  max_turns: 8
  difficulty: hard
  category: agent
- id: update-product
  intent: I need to update my product description. List my Stripe products, find
    the one named 'bench-Update Me', and change its description to 'Updated by
    benchmark test'.
  assert:
    - ran: stripe products list
    - ran: stripe products update
    - output_contains: Updated by benchmark test
  setup:
    - stripe products create --name 'bench-Update Me' --description 'Original
      description' 2>/dev/null || true
  max_turns: 8
  difficulty: hard
  category: human
- id: create-and-update-customer
  intent: "Tool call result: the user wants a test customer with metadata. Create
    a Stripe customer named 'bench-Update Test' with email
    'bench-update@example.com', then update that customer to add metadata with
    key 'tier' and value 'premium'."
  assert:
    - ran: stripe customers create
    - ran: stripe customers update
    - output_contains: premium
  setup: []
  max_turns: 10
  difficulty: hard
  category: agent
- id: product-catalog-setup
  intent: "Tool call result: the user needs a pricing page with two tiers. Create
    a product named 'bench-Starter' with a monthly price of $9, and a product
    named 'bench-Growth' with a monthly price of $49."
  assert:
    - ran: stripe products create
    - ran: stripe prices create
    - run_count:
        pattern: stripe products create
        min: 2
    - run_count:
        pattern: stripe prices create
        min: 2
  setup: []
  max_turns: 12
  difficulty: hard
  category: agent
- id: full-payment-flow
  intent: "Tool call result: the user is building a checkout flow and needs test
    data. Create a customer named 'bench-Checkout User' with email
    'bench-checkout@example.com', then create a product named 'bench-Widget',
    then create a one-time price of $25 (2500 cents USD) for it, then create a
    payment intent for that amount attached to the customer."
  assert:
    - ran: stripe customers create
    - ran: stripe products create
    - ran: stripe prices create
    - ran: stripe payment_intents create
  setup: []
  max_turns: 12
  difficulty: hard
  category: agent
- id: subscription-setup
  intent: "Tool call result: the user wants to test their subscription flow.
    Create a customer named 'bench-Sub User' with email 'bench-sub@example.com',
    create a product named 'bench-Monthly' with a recurring monthly price of
    $15, then subscribe the customer to that price."
  assert:
    - ran: stripe customers create
    - ran: stripe products create
    - ran: stripe prices create
    - ran: stripe subscriptions create
  setup: []
  max_turns: 12
  difficulty: hard
  category: agent

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 14 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for Stripe see the latest run. Set up the same for your CLI in minutes.

ModelPass RateDelta
Sonnet 4.595%+5%
GPT-4.180%-5%
Haiku 4.565%-10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.

Pass rateLast 30 days
v1.0v1.6

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.

Compare other CLI evals