# agent checks app status
$ fly status --json
{"Name":"api","Status":"deployed","Machines":[{"region":"iad","state":"started"}]}
Can AI agents use Fly.io?
Edge computing platform CLI. Agents deploy apps, manage machines, scale regions, and monitor deployments.
Fly.io eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 58% | 3.5 | 20.3k |
Fly.io task results by model
| Task | Difficulty | Prompt | gpt-5-nano |
|---|---|---|---|
| discover-version | easy | Check what version of the Fly.io CLI is installed and print the output. | ✓ 1t |
| launch-name-conflict | hard | Tool call result: previous `fly launch --name fly-builder-cliwatch` failed with 'name already taken'. The user wants their Node.js app on Fly. Use `fly launch` with the name 'bench-conflict-retry' instead. Do not deploy. | ✓ 5t |
| discover-auth-whoami | easy | Check which Fly.io account is currently authenticated. Print the username or email. | ✓ 3t |
| discover-apps-list | easy | List all apps in the cliwatch-benchmarks organization on Fly.io. | ✓ 4t |
| discover-doctor | easy | Run the Fly.io doctor command to check that the local environment is configured correctly. | ✓ 1t |
| create-fly-toml | medium | Create a fly.toml config file for an app named 'bench-api-config' in the iad region, with an HTTP service on internal port 8080, auto_stop set to 'suspend', and min_machines_running of 1. After creating it, print the file contents to verify. | ✗ 6t |
| multi-process-config | medium | I need a fly.toml for 'bench-multi-proc' with two processes, a web server (node server.js on port 3000) and a background worker (node worker.js). Also need a Dockerfile and .dockerignore that skips node_modules and .git. After creating the files, show me the fly.toml contents. | ✗ 8t |
| launch-with-custom-port | medium | I've got a Node service in this directory that listens on port 5000. Set it up on Fly as 'bench-flags-app' in the iad region, but don't deploy yet, just generate the config. | ✗ 2t |
| launch-node-app | medium | Tool call result: `ls` returned package.json, server.js. `cat package.json` shows express dependency, start script is 'node server.js'. The user asked to deploy their API. Use `fly launch` to configure this Node.js project for Fly.io as 'bench-node-app'. Do not deploy. | ✓ 5t |
| launch-static-site | medium | Tool call result: `ls` returned index.html (no package.json, no Dockerfile). This appears to be a static HTML site. The user wants it hosted on Fly.io as 'bench-static-site'. Configure for deployment but do not deploy yet. | ✗ 1t |
| launch-dockerfile-app | medium | Tool call result: `ls` returned Dockerfile, server.js. The project already has a Dockerfile (node:20-slim, EXPOSE 3000). The user asked to 'get this on Fly'. Configure for Fly.io as 'bench-dockerfile-app', skip deployment. | ✗ 5t |
| launch-go-app | medium | Tool call result: `ls` returned main.go, go.mod. `head main.go` shows net/http ListenAndServe on :8080. The user wants this Go service on Fly.io as 'bench-go-app'. Set up config only, no deploy. | ✗ 4t |
| doctor-check | hard | Something feels off with my Fly setup. Can you run the doctor command and tell me if anything looks wrong? | ✓ 1t |
| status-no-app | hard | I'm trying to check on my app but there's no fly.toml in this directory. Run fly status anyway and tell me what the error says. | ✓ 1t |
| launch-empty-dir | hard | Tool call result: `ls -la` returned only '.' and '..'. The directory is empty but the user asked to deploy this project to Fly.io as 'bench-empty-app'. Attempt to configure it, handle any scanner failures and ensure a fly.toml is produced. Do not deploy. | ✓ 6t |
| launch-and-inspect | hard | Tool call result: `ls` returned package.json, server.js. The user asked 'set up my app on Fly and tell me what it picked'. Configure this Node.js project as 'bench-inspect-app' on Fly.io without deploying, then read the generated fly.toml and report what framework was detected, what port was set, and the auto_stop setting. | ✗ 10t |
| launch-then-customize | hard | Tool call result: `ls` returned package.json, server.js. The user wants their Node.js app on Fly.io as 'bench-modify-app'. Use `fly launch` to generate the initial config, then customize fly.toml: change internal_port to 3000, add NODE_ENV=production as an env var, and set min_machines_running to 2. Do not deploy. | ✓ 6t |
| launch-flask-with-healthcheck | hard | Tool call result: `ls` returned app.py, requirements.txt. `grep route app.py` shows '/' and '/health' endpoints. `cat requirements.txt` shows flask==3.0.0. The user wants this deployed with health checks. Configure for Fly.io as 'bench-flask-app' with an HTTP health check on /health. Do not deploy. | ✗ 7t |
| validate-config | hard | Tool call result: `ls` returned fly.toml. The user wants to verify their Fly config is valid before deploying. Use `fly config validate` to check the fly.toml, then use `fly config show` to display the resolved configuration. Report any warnings or errors. | ✓ 5t |
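The 58% pass rate reported for gpt-5-nano can be re-derived from the per-task results above. The following is a minimal sketch, not CLIWatch's own scoring code: the `RESULTS` list is transcribed by hand from the table, and the helper name `pass_rate` is hypothetical.

```python
from collections import defaultdict

# (task id, difficulty, passed) transcribed by hand from the gpt-5-nano table.
RESULTS = [
    ("discover-version", "easy", True),
    ("launch-name-conflict", "hard", True),
    ("discover-auth-whoami", "easy", True),
    ("discover-apps-list", "easy", True),
    ("discover-doctor", "easy", True),
    ("create-fly-toml", "medium", False),
    ("multi-process-config", "medium", False),
    ("launch-with-custom-port", "medium", False),
    ("launch-node-app", "medium", True),
    ("launch-static-site", "medium", False),
    ("launch-dockerfile-app", "medium", False),
    ("launch-go-app", "medium", False),
    ("doctor-check", "hard", True),
    ("status-no-app", "hard", True),
    ("launch-empty-dir", "hard", True),
    ("launch-and-inspect", "hard", False),
    ("launch-then-customize", "hard", True),
    ("launch-flask-with-healthcheck", "hard", False),
    ("validate-config", "hard", True),
]


def pass_rate(rows):
    """Return the share of passing rows as a whole-number percentage."""
    passed = sum(1 for _, _, ok in rows if ok)
    return round(100 * passed / len(rows))


# Group rows by their difficulty label so each level can be scored separately.
by_difficulty = defaultdict(list)
for row in RESULTS:
    by_difficulty[row[1]].append(row)

overall = pass_rate(RESULTS)
breakdown = {level: pass_rate(rows) for level, rows in by_difficulty.items()}

print(f"overall: {overall}% of {len(RESULTS)} tasks")
for level in ("easy", "medium", "hard"):
    print(f"{level}: {breakdown[level]}%")
```

On this run, 11 of 19 tasks pass. All four easy tasks pass, and most of the failures fall in the medium config and launch tasks.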
Task suite source (393 lines, YAML)
- id: discover-version
  intent: Check what version of the Fly.io CLI is installed and print the output.
  assert:
    - ran: fly version|flyctl version
    - output_contains: fly
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_version.md#Usage
- id: discover-auth-whoami
  intent: Check which Fly.io account is currently authenticated. Print the
    username or email.
  assert:
    - ran: fly auth whoami|flyctl auth whoami
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_auth_whoami.md#Usage
- id: discover-apps-list
  intent: List all apps in the cliwatch-benchmarks organization on Fly.io.
  assert:
    - ran: fly apps list|flyctl apps list
  setup: []
  max_turns: 4
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_apps_list.md#Usage
- id: discover-doctor
  intent: Run the Fly.io doctor command to check that the local environment is
    configured correctly.
  assert:
    - ran: fly doctor|flyctl doctor
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: flyctl/cmd/fly_doctor.md#Usage
- id: create-fly-toml
  intent: Create a fly.toml config file for an app named 'bench-api-config' in the
    iad region, with an HTTP service on internal port 8080, auto_stop set to
    'suspend', and min_machines_running of 1. After creating it, print the file
    contents to verify.
  assert:
    - ran: cat fly.toml
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-api-config
    - file_contains:
        path: fly.toml
        text: "8080"
    - file_contains:
        path: fly.toml
        text: min_machines_running
  setup: []
  max_turns: 6
  difficulty: medium
  category: config
  docs_origin: flyctl/cmd/fly_config.md#Usage
- id: multi-process-config
  intent: I need a fly.toml for 'bench-multi-proc' with two processes, a web
    server (node server.js on port 3000) and a background worker (node
    worker.js). Also need a Dockerfile and .dockerignore that skips node_modules
    and .git. After creating the files, show me the fly.toml contents.
  assert:
    - ran: cat fly.toml|less fly.toml|fly config
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: processes
    - file_contains:
        path: fly.toml
        text: web
    - file_contains:
        path: fly.toml
        text: worker
    - file_exists: Dockerfile
    - file_exists: .dockerignore
    - file_contains:
        path: .dockerignore
        text: node_modules
  setup: []
  max_turns: 8
  difficulty: medium
  category: config
  docs_origin: flyctl/cmd/fly_config.md#Usage
- id: launch-with-custom-port
  intent: I've got a Node service in this directory that listens on port 5000. Set
    it up on Fly as 'bench-flags-app' in the iad region, but don't deploy yet,
    just generate the config.
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-flags-app
    - file_contains:
        path: fly.toml
        text: "5000"
  setup:
    - fly apps destroy bench-flags-app -y 2>/dev/null || true
    - echo '{"name":"bench-flags-app","scripts":{"start":"node server.js"}}' >
      package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(5000)' >
      server.js
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Options
- id: launch-node-app
  intent: "Tool call result: `ls` returned package.json, server.js. `cat
    package.json` shows express dependency, start script is 'node server.js'.
    The user asked to deploy their API. Use `fly launch` to configure this
    Node.js project for Fly.io as 'bench-node-app'. Do not deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-node-app
  setup:
    - fly apps destroy bench-node-app -y 2>/dev/null || true
    - echo '{"name":"bench-node-app","scripts":{"start":"node
      server.js"},"dependencies":{"express":"^4.18.0"}}' > package.json
    - echo 'const http = require("http"); http.createServer((req,res) => {
      res.end("ok"); }).listen(8080);' > server.js
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-static-site
  intent: "Tool call result: `ls` returned index.html (no package.json, no
    Dockerfile). This appears to be a static HTML site. The user wants it hosted
    on Fly.io as 'bench-static-site'. Configure for deployment but do not deploy
    yet."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-static-site
  setup:
    - fly apps destroy bench-static-site -y 2>/dev/null || true
    - echo '<!DOCTYPE html><html><body><h1>Hello Fly</h1></body></html>' >
      index.html
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-dockerfile-app
  intent: "Tool call result: `ls` returned Dockerfile, server.js. The project
    already has a Dockerfile (node:20-slim, EXPOSE 3000). The user asked to 'get
    this on Fly'. Configure for Fly.io as 'bench-dockerfile-app', skip
    deployment."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-dockerfile-app
  setup:
    - fly apps destroy bench-dockerfile-app -y 2>/dev/null || true
    - |
      cat > Dockerfile << 'DEOF'
      FROM node:20-slim
      WORKDIR /app
      COPY . .
      EXPOSE 3000
      CMD ["node", "server.js"]
      DEOF
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(3000)' >
      server.js
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-go-app
  intent: "Tool call result: `ls` returned main.go, go.mod. `head main.go` shows
    net/http ListenAndServe on :8080. The user wants this Go service on Fly.io
    as 'bench-go-app'. Set up config only, no deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-go-app
  setup:
    - fly apps destroy bench-go-app -y 2>/dev/null || true
    - |
      cat > main.go << 'GOEOF'
      package main
      import (
          "fmt"
          "net/http"
      )
      func main() {
          http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
              fmt.Fprintf(w, "Hello from Go")
          })
          http.ListenAndServe(":8080", nil)
      }
      GOEOF
    - go mod init bench-go-app
  max_turns: 8
  difficulty: medium
  category: launch
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: doctor-check
  intent: Something feels off with my Fly setup. Can you run the doctor command
    and tell me if anything looks wrong?
  assert:
    - ran: fly doctor
  setup: []
  max_turns: 5
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_doctor.md#Usage
- id: status-no-app
  intent: I'm trying to check on my app but there's no fly.toml in this directory.
    Run fly status anyway and tell me what the error says.
  assert:
    - ran: fly status
  setup: []
  max_turns: 5
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_status.md#Usage
- id: launch-empty-dir
  intent: "Tool call result: `ls -la` returned only '.' and '..'. The directory is
    empty but the user asked to deploy this project to Fly.io as
    'bench-empty-app'. Attempt to configure it, handle any scanner failures and
    ensure a fly.toml is produced. Do not deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
  setup:
    - fly apps destroy bench-empty-app -y 2>/dev/null || true
  max_turns: 8
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-name-conflict
  intent: "Tool call result: previous `fly launch --name fly-builder-cliwatch`
    failed with 'name already taken'. The user wants their Node.js app on Fly.
    Use `fly launch` with the name 'bench-conflict-retry' instead. Do not
    deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-conflict-retry
  setup:
    - fly apps destroy bench-conflict-retry -y 2>/dev/null || true
    - echo '{"name":"app","scripts":{"start":"node server.js"}}' > package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(8080)' >
      server.js
  max_turns: 8
  difficulty: hard
  category: error-recovery
  docs_origin: flyctl/cmd/fly_launch.md#Options
- id: launch-and-inspect
  intent: "Tool call result: `ls` returned package.json, server.js. The user asked
    'set up my app on Fly and tell me what it picked'. Configure this Node.js
    project as 'bench-inspect-app' on Fly.io without deploying, then read the
    generated fly.toml and report what framework was detected, what port was
    set, and the auto_stop setting."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - ran: cat fly.toml|less fly.toml
  setup:
    - fly apps destroy bench-inspect-app -y 2>/dev/null || true
    - echo '{"name":"bench-inspect-app","scripts":{"start":"node
      server.js"},"dependencies":{"express":"^4.18.0"}}' > package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(8080)' >
      server.js
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_launch.md#Usage
- id: launch-then-customize
  intent: "Tool call result: `ls` returned package.json, server.js. The user wants
    their Node.js app on Fly.io as 'bench-modify-app'. Use `fly launch` to
    generate the initial config, then customize fly.toml: change internal_port
    to 3000, add NODE_ENV=production as an env var, and set min_machines_running
    to 2. Do not deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: "3000"
    - file_contains:
        path: fly.toml
        text: NODE_ENV
    - file_contains:
        path: fly.toml
        text: min_machines_running
  setup:
    - fly apps destroy bench-modify-app -y 2>/dev/null || true
    - echo '{"name":"bench-modify-app","scripts":{"start":"node server.js"}}' >
      package.json
    - echo
      'require("http").createServer((req,res)=>{res.end("ok")}).listen(3000)' >
      server.js
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_config.md#Usage
- id: launch-flask-with-healthcheck
  intent: "Tool call result: `ls` returned app.py, requirements.txt. `grep route
    app.py` shows '/' and '/health' endpoints. `cat requirements.txt` shows
    flask==3.0.0. The user wants this deployed with health checks. Configure for
    Fly.io as 'bench-flask-app' with an HTTP health check on /health. Do not
    deploy."
  assert:
    - ran: fly launch
    - file_exists: fly.toml
    - file_contains:
        path: fly.toml
        text: bench-flask-app
    - file_contains:
        path: fly.toml
        text: /health
  setup:
    - fly apps destroy bench-flask-app -y 2>/dev/null || true
    - |
      cat > app.py << 'PYEOF'
      from flask import Flask
      app = Flask(__name__)
      @app.route("/")
      def hello():
          return "Hello from Flask"
      @app.route("/health")
      def health():
          return "ok"
      if __name__ == "__main__":
          app.run(host="0.0.0.0", port=8080)
      PYEOF
    - echo 'flask==3.0.0' > requirements.txt
  max_turns: 12
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_checks.md#Usage
- id: validate-config
  intent: "Tool call result: `ls` returned fly.toml. The user wants to verify
    their Fly config is valid before deploying. Use `fly config validate` to
    check the fly.toml, then use `fly config show` to display the resolved
    configuration. Report any warnings or errors."
  assert:
    - ran: fly config validate|fly config show
  setup:
    - fly apps destroy bench-validate-app -y 2>/dev/null || true
    - |
      cat > fly.toml << 'TOMLEOF'
      app = 'bench-validate-app'
      primary_region = 'iad'
      [build]
      [build.args]
      NODE_ENV = 'production'
      [http_service]
      internal_port = 8080
      force_https = true
      auto_stop_machines = 'stop'
      auto_start_machines = true
      min_machines_running = 0
      [[vm]]
      memory = '256mb'
      cpu_kind = 'shared'
      cpus = 1
      TOMLEOF
    - fly apps create bench-validate-app --org cliwatch-benchmarks 2>/dev/null
      || true
  max_turns: 8
  difficulty: hard
  category: multi-step-workflow
  docs_origin: flyctl/cmd/fly_config_validate.md#Usage
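To make the `assert` entries in the suite concrete, here is a rough sketch of how they might be checked against a recorded agent run. The semantics are assumptions, not the documented behavior of @cliwatch/cli-bench: `ran` is treated as `|`-separated alternatives matched as substrings of the commands the agent ran, and `file_contains` as a plain substring check. The function names, the recorded command list, and the sample fly.toml are all hypothetical.

```python
import os
import tempfile


def check_assertion(assertion, commands, output, workdir):
    """Evaluate one single-key assertion dict (assumed semantics, see above)."""
    kind, value = next(iter(assertion.items()))
    if kind == "ran":
        # Any alternative may appear in any command the agent executed.
        alternatives = value.split("|")
        return any(alt in cmd for alt in alternatives for cmd in commands)
    if kind == "output_contains":
        return value in output
    if kind == "file_exists":
        return os.path.isfile(os.path.join(workdir, value))
    if kind == "file_contains":
        path = os.path.join(workdir, value["path"])
        if not os.path.isfile(path):
            return False
        with open(path, encoding="utf-8") as handle:
            return value["text"] in handle.read()
    raise ValueError(f"unknown assertion type: {kind}")


def task_passes(assertions, commands, output, workdir):
    """A task passes only if every assertion holds."""
    return all(check_assertion(a, commands, output, workdir) for a in assertions)


# The launch-then-customize assertions, as a YAML parser would load them.
ASSERTIONS = [
    {"ran": "fly launch"},
    {"file_exists": "fly.toml"},
    {"file_contains": {"path": "fly.toml", "text": "3000"}},
    {"file_contains": {"path": "fly.toml", "text": "NODE_ENV"}},
    {"file_contains": {"path": "fly.toml", "text": "min_machines_running"}},
]

# Hypothetical fly.toml an agent might leave behind after customizing it.
SAMPLE_FLY_TOML = """app = 'bench-modify-app'

[env]
  NODE_ENV = 'production'

[http_service]
  internal_port = 3000
  min_machines_running = 2
"""

with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "fly.toml"), "w", encoding="utf-8") as handle:
        handle.write(SAMPLE_FLY_TOML)

    # Hypothetical command trace for a successful run.
    commands = ["fly launch --name bench-modify-app --no-deploy", "cat fly.toml"]
    result = task_passes(ASSERTIONS, commands, "", workdir)

    # Same files, but the agent wrote fly.toml by hand and never ran fly launch.
    missing_launch = task_passes(ASSERTIONS, ["cat fly.toml"], "", workdir)

print("with fly launch:", result)
print("without fly launch:", missing_launch)
```

Under these assumptions, a task that requires `ran: fly launch` would fail if the agent produced a correct fly.toml by hand without invoking the command.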
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 19 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for Fly.io — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
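As an illustration of what such a comment could contain, the sketch below flags regressions from the sample table above. It assumes the Delta column is the change in percentage points since the previous run. The row data is copied from that table, while the `pr_comment` helper and the extra Status column are hypothetical and not CLIWatch's actual output format.

```python
# (model, current pass rate, delta vs. previous run) from the sample table.
ROWS = [
    ("Sonnet 4.5", 95, 5),
    ("GPT-4.1", 80, -5),
    ("Haiku 4.5", 65, -10),
]


def pr_comment(rows):
    """Render a markdown table that marks every negative delta as a regression."""
    lines = ["| Model | Pass Rate | Delta | Status |", "|---|---|---|---|"]
    for model, rate, delta in rows:
        status = "regression" if delta < 0 else "ok"
        lines.append(f"| {model} | {rate}% | {delta:+d}% | {status} |")
    return "\n".join(lines)


regressions = [model for model, _, delta in ROWS if delta < 0]
comment = pr_comment(ROWS)

print(comment)
print("regressed models:", ", ".join(regressions))
```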
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%
Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
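A quality gate of this kind reduces to comparing each model's pass rate with its threshold and returning a non-zero exit code on any shortfall. The sketch below uses the threshold values shown above and the pass rates from the sample table. Mapping display names such as "Sonnet 4.5" to ids such as `claude-sonnet-4-5` is an assumption, as is treating a model with no result as a failure. `failing_gates` is a hypothetical helper, not part of CLIWatch.

```python
# Minimum pass rate per model, in percent (values from the thresholds snippet).
THRESHOLDS = {
    "claude-sonnet-4-5": 80,
    "gpt-4.1": 75,
    "claude-haiku-4-5": 60,
}

# Pass rates from the sample table; the name-to-id mapping is assumed.
PASS_RATES = {
    "claude-sonnet-4-5": 95,
    "gpt-4.1": 80,
    "claude-haiku-4-5": 65,
}


def failing_gates(pass_rates, thresholds):
    """Return (model, rate, minimum) for every model below its threshold.

    A model with a threshold but no recorded pass rate counts as failing.
    """
    failures = []
    for model, minimum in thresholds.items():
        rate = pass_rates.get(model)
        if rate is None or rate < minimum:
            failures.append((model, rate, minimum))
    return failures


failures = failing_gates(PASS_RATES, THRESHOLDS)
exit_code = 1 if failures else 0  # a CI step would exit with this code

for model, rate, minimum in failures:
    print(f"{model}: {rate}% is below the {minimum}% gate")
print("gate passed" if exit_code == 0 else "gate failed")
```

With the sample numbers, every model clears its threshold, so the gate passes. Dropping GPT-4.1 to 70% would fail it.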
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.