# agent checks app status
$ fly status --json
  {"Name":"api","Status":"deployed",
   "Machines":[{"region":"iad","state":"started"}]}

Can AI agents use Fly.io?

Edge computing platform CLI. Agents deploy apps, manage machines, scale regions, and monitor deployments.
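
An agent that monitors deployments usually has to turn the JSON from `fly status --json` into a decision. The following is a minimal sketch, assuming the payload has only the fields shown in the snippet at the top of this page (`Name`, `Status`, `Machines`, `region`, `state`); a real response may carry more. The helper name `summarize_status` is hypothetical.

```python
import json

# Sample payload copied from the status snippet on this page.
RAW = '{"Name":"api","Status":"deployed","Machines":[{"region":"iad","state":"started"}]}'


def summarize_status(payload: str) -> dict:
    """Reduce a status payload to the fields an agent would branch on."""
    data = json.loads(payload)
    machines = data.get("Machines", [])
    return {
        "app": data.get("Name"),
        "status": data.get("Status"),
        # Unique regions that have at least one machine.
        "regions": sorted({m["region"] for m in machines if "region" in m}),
        # True only when there is at least one machine and all are started.
        "all_started": bool(machines)
        and all(m.get("state") == "started" for m in machines),
    }


summary = summarize_status(RAW)
print(summary)
```

Reading fields with `.get` keeps the sketch tolerant of keys that are missing from a given response.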

58% overall pass rate · 1 model tested · 19 tasks · v0.4.19 · 3/6/2026

Fly.io eval results by model

Model        Pass rate   Avg turns   Avg tokens
gpt-5-nano   58%         3.5         20.3k
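
The headline figure can be cross-checked with a little arithmetic. This sketch assumes the pass rate is the number of passed tasks divided by the 19 tasks, rounded to the nearest percent; the page does not state the formula, so treat that as an assumption. It lists the pass counts that are consistent with the reported 58%.

```python
TOTAL_TASKS = 19     # from the summary line
REPORTED_RATE = 58   # percent, from the results table


def passes_matching(rate_pct: int, total: int) -> list[int]:
    """Return every pass count whose rounded percentage equals rate_pct."""
    return [k for k in range(total + 1) if round(100 * k / total) == rate_pct]


candidates = passes_matching(REPORTED_RATE, TOTAL_TASKS)
print(candidates)  # [11], since 11 / 19 is about 57.9%
```

Under that assumption, 58% corresponds to 11 of the 19 tasks passing.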

Fly.io task results by model

Task · Difficulty · gpt-5-nano

discover-version · easy · 1t
Check what version of the Fly.io CLI is installed and print the output.

launch-name-conflict · hard · 5t
Tool call result: previous `fly launch --name fly-builder-cliwatch` failed with 'name already taken'. The user wants their Node.js app on Fly. Use `fly launch` with the name 'bench-conflict-retry' instead. Do not deploy.

discover-auth-whoami · easy · 3t
Check which Fly.io account is currently authenticated. Print the username or email.

discover-apps-list · easy · 4t
List all apps in the cliwatch-benchmarks organization on Fly.io.

discover-doctor · easy · 1t
Run the Fly.io doctor command to check that the local environment is configured correctly.

create-fly-toml · medium · 6t
Create a fly.toml config file for an app named 'bench-api-config' in the iad region, with an HTTP service on internal port 8080, auto_stop set to 'suspend', and min_machines_running of 1. After creating it, print the file contents to verify.

multi-process-config · medium · 8t
I need a fly.toml for 'bench-multi-proc' with two processes, a web server (node server.js on port 3000) and a background worker (node worker.js). Also need a Dockerfile and .dockerignore that skips node_modules and .git. After creating the files, show me the fly.toml contents.

launch-with-custom-port · medium · 2t
I've got a Node service in this directory that listens on port 5000. Set it up on Fly as 'bench-flags-app' in the iad region, but don't deploy yet, just generate the config.

launch-node-app · medium · 5t
Tool call result: `ls` returned package.json, server.js. `cat package.json` shows express dependency, start script is 'node server.js'. The user asked to deploy their API. Use `fly launch` to configure this Node.js project for Fly.io as 'bench-node-app'. Do not deploy.

launch-static-site · medium · 1t
Tool call result: `ls` returned index.html (no package.json, no Dockerfile). This appears to be a static HTML site. The user wants it hosted on Fly.io as 'bench-static-site'. Configure for deployment but do not deploy yet.

launch-dockerfile-app · medium · 5t
Tool call result: `ls` returned Dockerfile, server.js. The project already has a Dockerfile (node:20-slim, EXPOSE 3000). The user asked to 'get this on Fly'. Configure for Fly.io as 'bench-dockerfile-app', skip deployment.

launch-go-app · medium · 4t
Tool call result: `ls` returned main.go, go.mod. `head main.go` shows net/http ListenAndServe on :8080. The user wants this Go service on Fly.io as 'bench-go-app'. Set up config only, no deploy.

doctor-check · hard · 1t
Something feels off with my Fly setup. Can you run the doctor command and tell me if anything looks wrong?

status-no-app · hard · 1t
I'm trying to check on my app but there's no fly.toml in this directory. Run fly status anyway and tell me what the error says.

launch-empty-dir · hard · 6t
Tool call result: `ls -la` returned only '.' and '..'. The directory is empty but the user asked to deploy this project to Fly.io as 'bench-empty-app'. Attempt to configure it, handle any scanner failures and ensure a fly.toml is produced. Do not deploy.

launch-and-inspect · hard · 10t
Tool call result: `ls` returned package.json, server.js. The user asked 'set up my app on Fly and tell me what it picked'. Configure this Node.js project as 'bench-inspect-app' on Fly.io without deploying, then read the generated fly.toml and report what framework was detected, what port was set, and the auto_stop setting.

launch-then-customize · hard · 6t
Tool call result: `ls` returned package.json, server.js. The user wants their Node.js app on Fly.io as 'bench-modify-app'. Use `fly launch` to generate the initial config, then customize fly.toml: change internal_port to 3000, add NODE_ENV=production as an env var, and set min_machines_running to 2. Do not deploy.

launch-flask-with-healthcheck · hard · 7t
Tool call result: `ls` returned app.py, requirements.txt. `grep route app.py` shows '/' and '/health' endpoints. `cat requirements.txt` shows flask==3.0.0. The user wants this deployed with health checks. Configure for Fly.io as 'bench-flask-app' with an HTTP health check on /health. Do not deploy.

validate-config · hard · 5t
Tool call result: `ls` returned fly.toml. The user wants to verify their Fly config is valid before deploying. Use `fly config validate` to check the fly.toml, then use `fly config show` to display the resolved configuration. Report any warnings or errors.
Task suite source · 393 lines · YAML
- id: discover-version
  intent: Check what version of the Fly.io CLI is installed and print the output.
  assert:
    - ran: fly version|flyctl version
    - output_contains: fly
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_version.md#Usage
- id: discover-auth-whoami
  intent: Check which Fly.io account is currently authenticated. Print the
    username or email.
  assert:
    - ran: fly auth whoami|flyctl auth whoami
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_auth_whoami.md#Usage
- id: discover-apps-list
  intent: List all apps in the cliwatch-benchmarks organization on Fly.io.
  assert:
    - ran: fly apps list|flyctl apps list
  setup: []
  max_turns: 4
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_apps_list.md#Usage
- id: discover-doctor
  intent: Run the Fly.io doctor command to check that the local environment is
    configured correctly.
  assert:
    - ran: fly doctor|flyctl doctor
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_doctor.md#Usage
- id: create-fly-toml
  intent: Create a fly.toml config file for an app named 'bench-api-config' in the
    iad region, with an HTTP service on internal port 8080, auto_stop set to
    'suspend', and min_machines_running of 1. After creating it, print the file
    contents to verify.
  assert:
    - ran: cat fly.toml
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-api-config
    - file_contains:
        path: fly.toml
        text: "8080"
    - file_contains:
        path: fly.toml
        text: min_machines_running
  setup: []
  max_turns: 6
  difficulty: medium
  category: config
  docs_origin: flyctl/cmd/fly_config.md#Usage
- id: multi-process-config
  intent: I need a fly.toml for 'bench-multi-proc' with two processes, a web
    server (node server.js on port 3000) and a background worker (node
    worker.js). Also need a Dockerfile and .dockerignore that skips node_modules
    and .git. After creating the files, show me the fly.toml contents.
  assert:
    - ran: cat fly.toml|less fly.toml|fly config
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: processes
    - file_contains:
        path: fly.toml
        text: web
    - file_contains:
        path: fly.toml
        text: worker
    - file_exists: Dockerfile
    - file_exists: .dockerignore
    - file_contains:
        path: .dockerignore
        text: node_modules
  setup: []
  max_turns: 8
  difficulty: medium
  category: config
  docs_origin: flyctl/cmd/fly_config.md#Usage
- id: launch-with-custom-port
  intent: I've got a Node service in this directory that listens on port 5000. Set
    it up on Fly as 'bench-flags-app' in the iad region, but don't deploy yet,
    just generate the config.
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-flags-app
    - file_contains:
        path: fly.toml
        text: "5000"
  setup:
    - fly apps destroy bench-flags-app -y 2>/dev/null || true
    - echo '{"name":"bench-flags-app","scripts":{"start":"node server.js"}}' >
      package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(5000)' >
      server.js
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Options
- id: launch-node-app
  intent: "Tool call result: `ls` returned package.json, server.js. `cat
    package.json` shows express dependency, start script is 'node server.js'.
    The user asked to deploy their API. Use `fly launch` to configure this
    Node.js project for Fly.io as 'bench-node-app'. Do not deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-node-app
  setup:
    - fly apps destroy bench-node-app -y 2>/dev/null || true
    - echo '{"name":"bench-node-app","scripts":{"start":"node
      server.js"},"dependencies":{"express":"^4.18.0"}}' > package.json
    - echo 'const http = require("http"); http.createServer((req,res) => {
      res.end("ok"); }).listen(8080);' > server.js
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-static-site
  intent: "Tool call result: `ls` returned index.html (no package.json, no
    Dockerfile). This appears to be a static HTML site. The user wants it hosted
    on Fly.io as 'bench-static-site'. Configure for deployment but do not deploy
    yet."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-static-site
  setup:
    - fly apps destroy bench-static-site -y 2>/dev/null || true
    - echo '<!DOCTYPE html><html><body><h1>Hello Fly</h1></body></html>' >
      index.html
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-dockerfile-app
  intent: "Tool call result: `ls` returned Dockerfile, server.js. The project
    already has a Dockerfile (node:20-slim, EXPOSE 3000). The user asked to 'get
    this on Fly'. Configure for Fly.io as 'bench-dockerfile-app', skip
    deployment."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-dockerfile-app
  setup:
    - fly apps destroy bench-dockerfile-app -y 2>/dev/null || true
    - |
      cat > Dockerfile << 'DEOF'
      FROM node:20-slim
      WORKDIR /app
      COPY . .
      EXPOSE 3000
      CMD ["node", "server.js"]
      DEOF
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(3000)' >
      server.js
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-go-app
  intent: "Tool call result: `ls` returned main.go, go.mod. `head main.go` shows
    net/http ListenAndServe on :8080. The user wants this Go service on Fly.io
    as 'bench-go-app'. Set up config only, no deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-go-app
  setup:
    - fly apps destroy bench-go-app -y 2>/dev/null || true
    - |
      cat > main.go << 'GOEOF'
      package main

      import (
          "fmt"
          "net/http"
      )

      func main() {
          http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
              fmt.Fprintf(w, "Hello from Go")
          })
          http.ListenAndServe(":8080", nil)
      }
      GOEOF
    - go mod init bench-go-app
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: doctor-check
  intent: Something feels off with my Fly setup. Can you run the doctor command
    and tell me if anything looks wrong?
  assert:
    - ran: fly doctor
  setup: []
  max_turns: 5
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_doctor.md#Usage
- id: status-no-app
  intent: I'm trying to check on my app but there's no fly.toml in this directory.
    Run fly status anyway and tell me what the error says.
  assert:
    - ran: fly status
  setup: []
  max_turns: 5
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_status.md#Usage
- id: launch-empty-dir
  intent: "Tool call result: `ls -la` returned only '.' and '..'. The directory is
    empty but the user asked to deploy this project to Fly.io as
    'bench-empty-app'. Attempt to configure it, handle any scanner failures and
    ensure a fly.toml is produced. Do not deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
  setup:
    - fly apps destroy bench-empty-app -y 2>/dev/null || true
  max_turns: 8
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-name-conflict
  intent: "Tool call result: previous `fly launch --name fly-builder-cliwatch`
    failed with 'name already taken'. The user wants their Node.js app on Fly.
    Use `fly launch` with the name 'bench-conflict-retry' instead. Do not
    deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-conflict-retry
  setup:
    - fly apps destroy bench-conflict-retry -y 2>/dev/null || true
    - echo '{"name":"app","scripts":{"start":"node server.js"}}' > package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(8080)' >
      server.js
  max_turns: 8
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_launch.md#Options
- id: launch-and-inspect
  intent: "Tool call result: `ls` returned package.json, server.js. The user asked
    'set up my app on Fly and tell me what it picked'. Configure this Node.js
    project as 'bench-inspect-app' on Fly.io without deploying, then read the
    generated fly.toml and report what framework was detected, what port was
    set, and the auto_stop setting."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - ran: cat fly.toml|less fly.toml
  setup:
    - fly apps destroy bench-inspect-app -y 2>/dev/null || true
    - echo '{"name":"bench-inspect-app","scripts":{"start":"node
      server.js"},"dependencies":{"express":"^4.18.0"}}' > package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(8080)' >
      server.js
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-then-customize
  intent: "Tool call result: `ls` returned package.json, server.js. The user wants
    their Node.js app on Fly.io as 'bench-modify-app'. Use `fly launch` to
    generate the initial config, then customize fly.toml: change internal_port
    to 3000, add NODE_ENV=production as an env var, and set min_machines_running
    to 2. Do not deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: "3000"
    - file_contains:
        path: fly.toml
        text: NODE_ENV
    - file_contains:
        path: fly.toml
        text: min_machines_running
  setup:
    - fly apps destroy bench-modify-app -y 2>/dev/null || true
    - echo '{"name":"bench-modify-app","scripts":{"start":"node server.js"}}' >
      package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(3000)' >
      server.js
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_config.md#Usage
- id: launch-flask-with-healthcheck
  intent: "Tool call result: `ls` returned app.py, requirements.txt. `grep route
    app.py` shows '/' and '/health' endpoints. `cat requirements.txt` shows
    flask==3.0.0. The user wants this deployed with health checks. Configure for
    Fly.io as 'bench-flask-app' with an HTTP health check on /health. Do not
    deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-flask-app
    - file_contains:
        path: fly.toml
        text: /health
  setup:
    - fly apps destroy bench-flask-app -y 2>/dev/null || true
    - |
      cat > app.py << 'PYEOF'
      from flask import Flask
      app = Flask(__name__)

      @app.route("/")
      def hello():
          return "Hello from Flask"

      @app.route("/health")
      def health():
          return "ok"

      if __name__ == "__main__":
          app.run(host="0.0.0.0", port=8080)
      PYEOF
    - echo 'flask==3.0.0' > requirements.txt
  max_turns: 12
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_checks.md#Usage
- id: validate-config
  intent: "Tool call result: `ls` returned fly.toml. The user wants to verify
    their Fly config is valid before deploying. Use `fly config validate` to
    check the fly.toml, then use `fly config show` to display the resolved
    configuration. Report any warnings or errors."
  assert:
    - ran: fly config validate|fly config show
  setup:
    - fly apps destroy bench-validate-app -y 2>/dev/null || true
    - |
      cat > fly.toml << 'TOMLEOF'
      app = 'bench-validate-app'
      primary_region = 'iad'

      [build]
        [build.args]
          NODE_ENV = 'production'

      [http_service]
        internal_port = 8080
        force_https = true
        auto_stop_machines = 'stop'
        auto_start_machines = true
        min_machines_running = 0

      [[vm]]
        memory = '256mb'
        cpu_kind = 'shared'
        cpus = 1
      TOMLEOF
    - fly apps create bench-validate-app --org cliwatch-benchmarks 2>/dev/null
      || true
  max_turns: 8
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_config_validate.md#Usage
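
The suite above uses four assertion types: `ran`, `output_contains`, `file_exists`, and `file_contains`. How the benchmark evaluates them is not documented on this page, so the following is only a sketch under stated assumptions: `ran` treats `|` as separating alternatives and passes if any executed command contains one of them, and the text checks are plain substring matches. The function names `check_assertion` and `task_passes` are hypothetical. The example data is transcribed from the create-fly-toml task, and the fly.toml keys follow the validate-config fixture above.

```python
import tempfile
from pathlib import Path


def check_assertion(assertion: dict, commands: list[str], output: str, workdir: Path) -> bool:
    """Evaluate one single-key `assert` entry. Matching rules are assumptions."""
    kind, value = next(iter(assertion.items()))
    if kind == "ran":
        # Assumed: `a|b` lists alternatives; any executed command may match.
        return any(alt in cmd for cmd in commands for alt in value.split("|"))
    if kind == "output_contains":
        return value in output
    if kind == "file_exists":
        return (workdir / value).is_file()
    if kind == "file_contains":
        target = workdir / value["path"]
        return target.is_file() and value["text"] in target.read_text()
    raise ValueError(f"unknown assertion type: {kind}")


def task_passes(assertions: list[dict], commands: list[str], output: str, workdir: Path) -> bool:
    """A task passes only if every assertion holds."""
    return all(check_assertion(a, commands, output, workdir) for a in assertions)


# Assertions transcribed from the create-fly-toml task.
CREATE_FLY_TOML_ASSERTS = [
    {"ran": "cat fly.toml"},
    {"file_exists": "fly.toml"},
    {"file_contains": {"path": "fly.toml", "text": "bench-api-config"}},
    {"file_contains": {"path": "fly.toml", "text": "8080"}},
    {"file_contains": {"path": "fly.toml", "text": "min_machines_running"}},
]

# One config an agent might write for that task.
FLY_TOML = """\
app = 'bench-api-config'
primary_region = 'iad'

[http_service]
  internal_port = 8080
  auto_stop_machines = 'suspend'
  min_machines_running = 1
"""

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    # Before the file exists, the file assertions fail.
    passed_empty = task_passes(CREATE_FLY_TOML_ASSERTS, ["cat fly.toml"], "", workdir)
    (workdir / "fly.toml").write_text(FLY_TOML)
    passed_written = task_passes(CREATE_FLY_TOML_ASSERTS, ["cat fly.toml"], FLY_TOML, workdir)

print(passed_empty, passed_written)  # False True
```

Note that the create-fly-toml assertions check the app name, the port, and the presence of `min_machines_running`, but not the 'suspend' value or the region. Under the matching rules assumed here, a file that omits those would still pass.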

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 19 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for Fly.io; see the latest run. Set up the same for your CLI in minutes.

Model        Pass Rate   Delta
Sonnet 4.5   95%         +5%
GPT-4.1      80%         -5%
Haiku 4.5    65%         -10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
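
As an illustration of what such a comment could contain, here is a minimal sketch that renders per-model pass rates, flags negative deltas as regressions, and appends a dashboard link. The figures are the sample values from the table above. The function name `render_comment`, the comment layout, and the URL are all hypothetical and do not describe CLIWatch's actual output.

```python
# Sample figures from the table above: (model, pass rate %, delta in points).
RESULTS = [("Sonnet 4.5", 95, 5), ("GPT-4.1", 80, -5), ("Haiku 4.5", 65, -10)]


def render_comment(results: list[tuple[str, int, int]], dashboard_url: str) -> str:
    """Build a Markdown comment body with a results table and a regression list."""
    lines = ["| Model | Pass rate | Delta |", "| --- | --- | --- |"]
    for model, rate, delta in results:
        # `+d` forces an explicit sign, so 5 renders as +5 and -5 stays -5.
        lines.append(f"| {model} | {rate}% | {delta:+d}% |")
    regressions = [model for model, _, delta in results if delta < 0]
    if regressions:
        lines += ["", "Regressions: " + ", ".join(regressions)]
    lines += ["", f"Full comparison: {dashboard_url}"]
    return "\n".join(lines)


# Placeholder URL, for illustration only.
comment = render_comment(RESULTS, "https://example.com/dashboard")
print(comment)
```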

[Chart: pass rate over the last 30 days, releases v1.0 to v1.6]

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.
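
The gating logic can be sketched in a few lines. This example uses the thresholds shown above and the sample pass rates from the earlier table. Mapping the display names (Sonnet 4.5, GPT-4.1, Haiku 4.5) onto the threshold keys is an assumption, as are the function names `failing_models` and `gate` and the choice to treat a model with no result as failing.

```python
THRESHOLDS = {"claude-sonnet-4-5": 80, "gpt-4.1": 75, "claude-haiku-4-5": 60}

# Sample pass rates in percent; the key mapping is assumed.
PASS_RATES = {"claude-sonnet-4-5": 95, "gpt-4.1": 80, "claude-haiku-4-5": 65}


def failing_models(pass_rates: dict[str, int], thresholds: dict[str, int]) -> list[str]:
    """Return models below their threshold; a missing result counts as 0%."""
    return sorted(
        model
        for model, minimum in thresholds.items()
        if pass_rates.get(model, 0) < minimum
    )


def gate(pass_rates: dict[str, int], thresholds: dict[str, int]) -> int:
    """Return a process exit code: 0 when every model meets its threshold."""
    failures = failing_models(pass_rates, thresholds)
    for model in failures:
        print(f"FAIL {model}: {pass_rates.get(model, 0)}% < {thresholds[model]}%")
    return 1 if failures else 0


exit_code = gate(PASS_RATES, THRESHOLDS)
print("exit code:", exit_code)  # exit code: 0
```

In a CI script, the returned value would be passed to `sys.exit` so that a non-zero code fails the job.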

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.
