# agent checks app status
$ fly status --json
  {"Name":"api","Status":"deployed",
   "Machines":[{"region":"iad","state":"started"}]}

Can AI agents use Fly.io?

Edge computing platform CLI. Agents deploy apps, manage machines, scale regions, and monitor deployments.


58% overall pass rate · 1 model tested · 19 tasks · v0.4.19 · 3/6/2026

Fly.io eval results by model

Model	Pass rate	Avg turns	Avg tokens
gpt-5-nano	58%	3.5	20.3k

Fly.io task results by model

Task	Difficulty	Intent	gpt-5-nano
discover-version	easy	Check what version of the Fly.io CLI is installed and print the output.	✓ · 1 turn · 1.8k tokens
launch-name-conflict	hard	Tool call result: previous `fly launch --name fly-builder-cliwatch` failed with 'name already taken'. The user wants their Node.js app on Fly. Use `fly launch` with the name 'bench-conflict-retry' instead. Do not deploy.	✓ · 5 turns · 20.3k tokens
discover-auth-whoami	easy	Check which Fly.io account is currently authenticated. Print the username or email.	✓ · 3 turns · 3.4k tokens
discover-apps-list	easy	List all apps in the cliwatch-benchmarks organization on Fly.io.	✓ · 4 turns · 5.4k tokens
discover-doctor	easy	Run the Fly.io doctor command to check that the local environment is configured correctly.	✓ · 1 turn · 3.0k tokens
create-fly-toml	medium	Create a fly.toml config file for an app named 'bench-api-config' in the iad region, with an HTTP service on internal port 8080, auto_stop set to 'suspend', and min_machines_running of 1. After creating it, print the file contents to verify.	✗ · 6 turns · 30.8k tokens
multi-process-config	medium	I need a fly.toml for 'bench-multi-proc' with two processes, a web server (node server.js on port 3000) and a background worker (node worker.js). Also need a Dockerfile and .dockerignore that skips node_modules and .git. After creating the files, show me the fly.toml contents.	✗ · 8 turns · 36.5k tokens
launch-with-custom-port	medium	I've got a Node service in this directory that listens on port 5000. Set it up on Fly as 'bench-flags-app' in the iad region, but don't deploy yet, just generate the config.	✗ · 2 turns · 16.1k tokens
launch-node-app	medium	Tool call result: `ls` returned package.json, server.js. `cat package.json` shows express dependency, start script is 'node server.js'. The user asked to deploy their API. Use `fly launch` to configure this Node.js project for Fly.io as 'bench-node-app'. Do not deploy.	✓ · 5 turns · 24.4k tokens
launch-static-site	medium	Tool call result: `ls` returned index.html (no package.json, no Dockerfile). This appears to be a static HTML site. The user wants it hosted on Fly.io as 'bench-static-site'. Configure for deployment but do not deploy yet.	✗ · 1 turn · 11.3k tokens
launch-dockerfile-app	medium	Tool call result: `ls` returned Dockerfile, server.js. The project already has a Dockerfile (node:20-slim, EXPOSE 3000). The user asked to 'get this on Fly'. Configure for Fly.io as 'bench-dockerfile-app', skip deployment.	✗ · 5 turns · 25.3k tokens
launch-go-app	medium	Tool call result: `ls` returned main.go, go.mod. `head main.go` shows net/http ListenAndServe on :8080. The user wants this Go service on Fly.io as 'bench-go-app'. Set up config only, no deploy.	✗ · 4 turns · 23.6k tokens
doctor-check	hard	Something feels off with my Fly setup. Can you run the doctor command and tell me if anything looks wrong?	✓ · 1 turn · 3.3k tokens
status-no-app	hard	I'm trying to check on my app but there's no fly.toml in this directory. Run fly status anyway and tell me what the error says.	✓ · 1 turn · 2.4k tokens
launch-empty-dir	hard	Tool call result: `ls -la` returned only '.' and '..'. The directory is empty but the user asked to deploy this project to Fly.io as 'bench-empty-app'. Attempt to configure it, handle any scanner failures and ensure a fly.toml is produced. Do not deploy.	✓ · 6 turns · 33.7k tokens
launch-and-inspect	hard	Tool call result: `ls` returned package.json, server.js. The user asked 'set up my app on Fly and tell me what it picked'. Configure this Node.js project as 'bench-inspect-app' on Fly.io without deploying, then read the generated fly.toml and report what framework was detected, what port was set, and the auto_stop setting.	✗ · 10 turns · 50.5k tokens
launch-then-customize	hard	Tool call result: `ls` returned package.json, server.js. The user wants their Node.js app on Fly.io as 'bench-modify-app'. Use `fly launch` to generate the initial config, then customize fly.toml: change internal_port to 3000, add NODE_ENV=production as an env var, and set min_machines_running to 2. Do not deploy.	✓ · 6 turns · 36.8k tokens
launch-flask-with-healthcheck	hard	Tool call result: `ls` returned app.py, requirements.txt. `grep route app.py` shows '/' and '/health' endpoints. `cat requirements.txt` shows flask==3.0.0. The user wants this deployed with health checks. Configure for Fly.io as 'bench-flask-app' with an HTTP health check on /health. Do not deploy.	✗ · 7 turns · 46.5k tokens
validate-config	hard	Tool call result: `ls` returned fly.toml. The user wants to verify their Fly config is valid before deploying. Use `fly config validate` to check the fly.toml, then use `fly config show` to display the resolved configuration. Report any warnings or errors.	✓ · 5 turns · 11.1k tokens
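
The headline numbers can be re-derived from the per-task rows. The following is a minimal sketch in which the results are transcribed by hand into a hypothetical `RESULTS` list; the tuple layout and variable names are illustrative and not part of CLIWatch. It reproduces the 58% pass rate and the 20.3k average tokens. The 3.5 average-turns figure matches the mean over passing tasks only (38 / 11), not over all 19 tasks (81 / 19 ≈ 4.3). The page does not say which definition it uses, so treat that reading as an inference.

```python
from collections import defaultdict

# (task id, difficulty, passed, turns, tokens in thousands), copied from the task table.
RESULTS = [
    ("discover-version", "easy", True, 1, 1.8),
    ("launch-name-conflict", "hard", True, 5, 20.3),
    ("discover-auth-whoami", "easy", True, 3, 3.4),
    ("discover-apps-list", "easy", True, 4, 5.4),
    ("discover-doctor", "easy", True, 1, 3.0),
    ("create-fly-toml", "medium", False, 6, 30.8),
    ("multi-process-config", "medium", False, 8, 36.5),
    ("launch-with-custom-port", "medium", False, 2, 16.1),
    ("launch-node-app", "medium", True, 5, 24.4),
    ("launch-static-site", "medium", False, 1, 11.3),
    ("launch-dockerfile-app", "medium", False, 5, 25.3),
    ("launch-go-app", "medium", False, 4, 23.6),
    ("doctor-check", "hard", True, 1, 3.3),
    ("status-no-app", "hard", True, 1, 2.4),
    ("launch-empty-dir", "hard", True, 6, 33.7),
    ("launch-and-inspect", "hard", False, 10, 50.5),
    ("launch-then-customize", "hard", True, 6, 36.8),
    ("launch-flask-with-healthcheck", "hard", False, 7, 46.5),
    ("validate-config", "hard", True, 5, 11.1),
]

passed = [r for r in RESULTS if r[2]]

# Overall pass rate as a whole percentage.
pass_rate = round(100 * len(passed) / len(RESULTS))

# Mean tokens over all tasks, and mean turns under both candidate definitions.
avg_tokens_k = round(sum(r[4] for r in RESULTS) / len(RESULTS), 1)
avg_turns_passing = round(sum(r[3] for r in passed) / len(passed), 1)
avg_turns_all = round(sum(r[3] for r in RESULTS) / len(RESULTS), 1)

# Passed / total per difficulty label.
by_difficulty = defaultdict(lambda: [0, 0])
for _, difficulty, ok, _, _ in RESULTS:
    by_difficulty[difficulty][0] += int(ok)
    by_difficulty[difficulty][1] += 1

print(f"pass rate: {pass_rate}%")
print(f"avg tokens: {avg_tokens_k}k")
print(f"avg turns (passing only): {avg_turns_passing}")
print(f"avg turns (all tasks): {avg_turns_all}")
for difficulty, (ok, total) in sorted(by_difficulty.items()):
    print(f"{difficulty}: {ok}/{total}")
```

Split by difficulty, the same rows give 4/4 on easy, 1/7 on medium, and 6/8 on hard, so the failures cluster in the medium config and launch tasks rather than in the tasks labelled hard.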

Task suite source · 393 lines · YAML

- id: discover-version
  intent: Check what version of the Fly.io CLI is installed and print the output.
  assert:
    - ran: fly version|flyctl version
    - output_contains: fly
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_version.md#Usage
- id: discover-auth-whoami
  intent: Check which Fly.io account is currently authenticated. Print the
    username or email.
  assert:
    - ran: fly auth whoami|flyctl auth whoami
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_auth_whoami.md#Usage
- id: discover-apps-list
  intent: List all apps in the cliwatch-benchmarks organization on Fly.io.
  assert:
    - ran: fly apps list|flyctl apps list
  setup: []
  max_turns: 4
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_apps_list.md#Usage
- id: discover-doctor
  intent: Run the Fly.io doctor command to check that the local environment is
    configured correctly.
  assert:
    - ran: fly doctor|flyctl doctor
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_doctor.md#Usage
- id: create-fly-toml
  intent: Create a fly.toml config file for an app named 'bench-api-config' in the
    iad region, with an HTTP service on internal port 8080, auto_stop set to
    'suspend', and min_machines_running of 1. After creating it, print the file
    contents to verify.
  assert:
    - ran: cat fly.toml
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-api-config
    - file_contains:
        path: fly.toml
        text: "8080"
    - file_contains:
        path: fly.toml
        text: min_machines_running
  setup: []
  max_turns: 6
  difficulty: medium
  category: config
  docs_origin: flyctl/cmd/fly_config.md#Usage
- id: multi-process-config
  intent: I need a fly.toml for 'bench-multi-proc' with two processes, a web
    server (node server.js on port 3000) and a background worker (node
    worker.js). Also need a Dockerfile and .dockerignore that skips node_modules
    and .git. After creating the files, show me the fly.toml contents.
  assert:
    - ran: cat fly.toml|less fly.toml|fly config
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: processes
    - file_contains:
        path: fly.toml
        text: web
    - file_contains:
        path: fly.toml
        text: worker
    - file_exists: Dockerfile
    - file_exists: .dockerignore
    - file_contains:
        path: .dockerignore
        text: node_modules
  setup: []
  max_turns: 8
  difficulty: medium
  category: config
  docs_origin: flyctl/cmd/fly_config.md#Usage
- id: launch-with-custom-port
  intent: I've got a Node service in this directory that listens on port 5000. Set
    it up on Fly as 'bench-flags-app' in the iad region, but don't deploy yet,
    just generate the config.
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-flags-app
    - file_contains:
        path: fly.toml
        text: "5000"
  setup:
    - fly apps destroy bench-flags-app -y 2>/dev/null || true
    - echo '{"name":"bench-flags-app","scripts":{"start":"node server.js"}}' >
      package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(5000)' >
      server.js
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Options
- id: launch-node-app
  intent: "Tool call result: `ls` returned package.json, server.js. `cat
    package.json` shows express dependency, start script is 'node server.js'.
    The user asked to deploy their API. Use `fly launch` to configure this
    Node.js project for Fly.io as 'bench-node-app'. Do not deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-node-app
  setup:
    - fly apps destroy bench-node-app -y 2>/dev/null || true
    - echo '{"name":"bench-node-app","scripts":{"start":"node
      server.js"},"dependencies":{"express":"^4.18.0"}}' > package.json
    - echo 'const http = require("http"); http.createServer((req,res) => {
      res.end("ok"); }).listen(8080);' > server.js
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-static-site
  intent: "Tool call result: `ls` returned index.html (no package.json, no
    Dockerfile). This appears to be a static HTML site. The user wants it hosted
    on Fly.io as 'bench-static-site'. Configure for deployment but do not deploy
    yet."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-static-site
  setup:
    - fly apps destroy bench-static-site -y 2>/dev/null || true
    - echo '<!DOCTYPE html><html><body><h1>Hello Fly</h1></body></html>' >
      index.html
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-dockerfile-app
  intent: "Tool call result: `ls` returned Dockerfile, server.js. The project
    already has a Dockerfile (node:20-slim, EXPOSE 3000). The user asked to 'get
    this on Fly'. Configure for Fly.io as 'bench-dockerfile-app', skip
    deployment."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-dockerfile-app
  setup:
    - fly apps destroy bench-dockerfile-app -y 2>/dev/null || true
    - |
      cat > Dockerfile << 'DEOF'
      FROM node:20-slim
      WORKDIR /app
      COPY . .
      EXPOSE 3000
      CMD ["node", "server.js"]
      DEOF
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(3000)' >
      server.js
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-go-app
  intent: "Tool call result: `ls` returned main.go, go.mod. `head main.go` shows
    net/http ListenAndServe on :8080. The user wants this Go service on Fly.io
    as 'bench-go-app'. Set up config only, no deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-go-app
  setup:
    - fly apps destroy bench-go-app -y 2>/dev/null || true
    - |
      cat > main.go << 'GOEOF'
      package main

      import (
          "fmt"
          "net/http"
      )

      func main() {
          http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
              fmt.Fprintf(w, "Hello from Go")
          })
          http.ListenAndServe(":8080", nil)
      }
      GOEOF
    - go mod init bench-go-app
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: doctor-check
  intent: Something feels off with my Fly setup. Can you run the doctor command
    and tell me if anything looks wrong?
  assert:
    - ran: fly doctor
  setup: []
  max_turns: 5
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_doctor.md#Usage
- id: status-no-app
  intent: I'm trying to check on my app but there's no fly.toml in this directory.
    Run fly status anyway and tell me what the error says.
  assert:
    - ran: fly status
  setup: []
  max_turns: 5
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_status.md#Usage
- id: launch-empty-dir
  intent: "Tool call result: `ls -la` returned only '.' and '..'. The directory is
    empty but the user asked to deploy this project to Fly.io as
    'bench-empty-app'. Attempt to configure it, handle any scanner failures and
    ensure a fly.toml is produced. Do not deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
  setup:
    - fly apps destroy bench-empty-app -y 2>/dev/null || true
  max_turns: 8
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-name-conflict
  intent: "Tool call result: previous `fly launch --name fly-builder-cliwatch`
    failed with 'name already taken'. The user wants their Node.js app on Fly.
    Use `fly launch` with the name 'bench-conflict-retry' instead. Do not
    deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-conflict-retry
  setup:
    - fly apps destroy bench-conflict-retry -y 2>/dev/null || true
    - echo '{"name":"app","scripts":{"start":"node server.js"}}' > package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(8080)' >
      server.js
  max_turns: 8
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_launch.md#Options
- id: launch-and-inspect
  intent: "Tool call result: `ls` returned package.json, server.js. The user asked
    'set up my app on Fly and tell me what it picked'. Configure this Node.js
    project as 'bench-inspect-app' on Fly.io without deploying, then read the
    generated fly.toml and report what framework was detected, what port was
    set, and the auto_stop setting."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - ran: cat fly.toml|less fly.toml
  setup:
    - fly apps destroy bench-inspect-app -y 2>/dev/null || true
    - echo '{"name":"bench-inspect-app","scripts":{"start":"node
      server.js"},"dependencies":{"express":"^4.18.0"}}' > package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(8080)' >
      server.js
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-then-customize
  intent: "Tool call result: `ls` returned package.json, server.js. The user wants
    their Node.js app on Fly.io as 'bench-modify-app'. Use `fly launch` to
    generate the initial config, then customize fly.toml: change internal_port
    to 3000, add NODE_ENV=production as an env var, and set min_machines_running
    to 2. Do not deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: "3000"
    - file_contains:
        path: fly.toml
        text: NODE_ENV
    - file_contains:
        path: fly.toml
        text: min_machines_running
  setup:
    - fly apps destroy bench-modify-app -y 2>/dev/null || true
    - echo '{"name":"bench-modify-app","scripts":{"start":"node server.js"}}' >
      package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(3000)' >
      server.js
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_config.md#Usage
- id: launch-flask-with-healthcheck
  intent: "Tool call result: `ls` returned app.py, requirements.txt. `grep route
    app.py` shows '/' and '/health' endpoints. `cat requirements.txt` shows
    flask==3.0.0. The user wants this deployed with health checks. Configure for
    Fly.io as 'bench-flask-app' with an HTTP health check on /health. Do not
    deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-flask-app
    - file_contains:
        path: fly.toml
        text: /health
  setup:
    - fly apps destroy bench-flask-app -y 2>/dev/null || true
    - |
      cat > app.py << 'PYEOF'
      from flask import Flask
      app = Flask(__name__)

      @app.route("/")
      def hello():
          return "Hello from Flask"

      @app.route("/health")
      def health():
          return "ok"

      if __name__ == "__main__":
          app.run(host="0.0.0.0", port=8080)
      PYEOF
    - echo 'flask==3.0.0' > requirements.txt
  max_turns: 12
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_checks.md#Usage
- id: validate-config
  intent: "Tool call result: `ls` returned fly.toml. The user wants to verify
    their Fly config is valid before deploying. Use `fly config validate` to
    check the fly.toml, then use `fly config show` to display the resolved
    configuration. Report any warnings or errors."
  assert:
    - ran: fly config validate|fly config show
  setup:
    - fly apps destroy bench-validate-app -y 2>/dev/null || true
    - |
      cat > fly.toml << 'TOMLEOF'
      app = 'bench-validate-app'
      primary_region = 'iad'

      [build]
        [build.args]
          NODE_ENV = 'production'

      [http_service]
        internal_port = 8080
        force_https = true
        auto_stop_machines = 'stop'
        auto_start_machines = true
        min_machines_running = 0

      [[vm]]
        memory = '256mb'
        cpu_kind = 'shared'
        cpus = 1
      TOMLEOF
    - fly apps create bench-validate-app --org cliwatch-benchmarks 2>/dev/null
      || true
  max_turns: 8
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_config_validate.md#Usage
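
Each task's `assert` list combines three kinds of checks: `ran`, `file_exists`, and `file_contains`. The sketch below shows one way such checks might be evaluated. It is not the @cliwatch/cli-bench implementation. It assumes that `|` in a `ran` pattern separates alternatives and that a command matches when it contains one of them as a substring; the function names, the `workdir` argument, and the sample command string are all hypothetical.

```python
import os
import tempfile


def ran(pattern: str, commands: list[str]) -> bool:
    """True if any executed command contains any '|'-separated alternative."""
    alternatives = [alt.strip() for alt in pattern.split("|")]
    return any(alt in cmd for cmd in commands for alt in alternatives)


def file_exists(workdir: str, path: str) -> bool:
    """True if `path` is a regular file inside the task's working directory."""
    return os.path.isfile(os.path.join(workdir, path))


def file_contains(workdir: str, path: str, text: str) -> bool:
    """True if the file exists and `text` occurs anywhere in it."""
    full_path = os.path.join(workdir, path)
    if not os.path.isfile(full_path):
        return False
    with open(full_path, encoding="utf-8") as handle:
        return text in handle.read()


# Replay the launch-with-custom-port assertions against a made-up transcript.
with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "fly.toml"), "w", encoding="utf-8") as handle:
        handle.write("app = 'bench-flags-app'\n[http_service]\n  internal_port = 5000\n")

    # Illustrative command history; a real run would record what the agent executed.
    commands = ["fly launch --name bench-flags-app --region iad --no-deploy"]

    checks = {
        "ran fly launch": ran("fly launch", commands),
        "fly.toml exists": file_exists(workdir, "fly.toml"),
        "contains app name": file_contains(workdir, "fly.toml", "bench-flags-app"),
        "contains port": file_contains(workdir, "fly.toml", "5000"),
    }

print(checks)
print("task passed:", all(checks.values()))
```

Under this reading, `file_contains` is a plain substring test, so an assertion such as `text: "8080"` would be satisfied by the string appearing anywhere in fly.toml, not only as the `internal_port` value.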

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 19 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for Fly.io — see the latest run. Set up the same for your CLI in minutes.

Model	Pass Rate	Delta
Sonnet 4.5	95%	+5%
GPT-4.1	80%	-5%
Haiku 4.5	65%	-10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
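
A comment of this kind could be assembled from two runs. The sketch below is illustrative only: `render_pr_comment` is a hypothetical helper, not a CLIWatch API, and the previous pass rates are back-calculated from the Delta column of the sample table (for example, 95% with +5% implies 90% before).

```python
def render_pr_comment(current: dict[str, int], previous: dict[str, int]) -> str:
    """Build a Markdown table of per-model pass rates with deltas and list regressions."""
    lines = ["| Model | Pass rate | Delta |", "| --- | --- | --- |"]
    regressions = []
    for model, rate in current.items():
        delta = rate - previous[model]
        # The '+d' format always shows a sign, e.g. +5 or -10.
        lines.append(f"| {model} | {rate}% | {delta:+d}% |")
        if delta < 0:
            regressions.append(model)
    if regressions:
        lines.append("")
        lines.append("Regressions: " + ", ".join(regressions))
    return "\n".join(lines)


# Sample values from the table above; `previous` is derived from its Delta column.
current = {"Sonnet 4.5": 95, "GPT-4.1": 80, "Haiku 4.5": 65}
previous = {"Sonnet 4.5": 90, "GPT-4.1": 85, "Haiku 4.5": 75}

comment = render_pr_comment(current, previous)
print(comment)
```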

[Chart: pass rate over the last 30 days, v1.0 to v1.6]

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.
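
The gate logic amounts to comparing each model's pass rate with its threshold. Here is a minimal sketch using the thresholds shown above and the sample pass rates from the earlier table. The function names and the mapping between display names such as "Sonnet 4.5" and keys such as `claude-sonnet-4-5` are assumptions made for illustration, not CLIWatch's actual interface.

```python
# Thresholds from the sample config, as whole percentages.
THRESHOLDS = {"claude-sonnet-4-5": 80, "gpt-4.1": 75, "claude-haiku-4-5": 60}


def failing_models(pass_rates: dict[str, int], thresholds: dict[str, int]) -> list[str]:
    """Return models whose pass rate is below threshold; a missing model counts as 0%."""
    return sorted(
        model
        for model, minimum in thresholds.items()
        if pass_rates.get(model, 0) < minimum
    )


def gate(pass_rates: dict[str, int]) -> int:
    """Return a process exit code: 0 if every threshold is met, 1 otherwise."""
    failures = failing_models(pass_rates, THRESHOLDS)
    for model in failures:
        print(f"FAIL {model}: {pass_rates.get(model, 0)}% < {THRESHOLDS[model]}%")
    return 1 if failures else 0


# Sample pass rates from the table above: all three clear their thresholds.
sample_rates = {"claude-sonnet-4-5": 95, "gpt-4.1": 80, "claude-haiku-4-5": 65}
exit_code = gate(sample_rates)
print("exit code:", exit_code)
```

A CI step would typically pass the returned code to `sys.exit` so that a non-zero value fails the job.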

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.


Compare other CLI evals

git

npm

aws