# agent queries a database $ psql -c "SELECT name, score FROM users ORDER BY score DESC LIMIT 3" name | score -------+------- alice | 95 bob | 87
Can AI agents use psql?
The PostgreSQL interactive terminal. Agents use it to run queries, manage schemas, import/export data, and administer databases.
See the latest run →psql eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 50% | 5.5 | 12.1k |
psql task results by model
| Task | gpt-5-nano |
|---|---|
insert-and-querymedium Using psql, create a table called 'products' with columns id (serial), name (text), and price (numeric). Insert three products: 'Widget' at 9.99, 'Gadget' at 24.50, and 'Gizmo' at 14.75. Then query all products ordered by price descending and save the output to query-result.txt. | ✓6t |
export-csvmedium Using psql, create a table 'scores' with columns name (text) and points (int). Insert 5 rows of sample data. Export the table to a CSV file called scores.csv using the COPY command or \copy. | ✓5t |
create-tablemedium Using psql, create a table called 'users' with columns: id (serial primary key), name (varchar 100, not null), email (varchar 255, unique), and created_at (timestamp with default now()). Then verify the table exists by listing the table structure. | ✗2t |
create-sql-scripteasy Create a SQL script file called 'setup.sql' that creates a 'tasks' table (id serial, title text not null, done boolean default false) and inserts 3 sample tasks. Then execute the script using psql. | ✗5t |
Task suite source55 lines · YAML
- id: create-table
intent: "Using psql, create a table called 'users' with columns: id (serial
primary key), name (varchar 100, not null), email (varchar 255, unique), and
created_at (timestamp with default now()). Then verify the table exists by
listing the table structure."
assert:
- ran: psql.*CREATE TABLE|psql.*create table
- ran: psql.*\\d|psql.*\\dt
setup:
- sudo -u postgres createdb bench_test 2>/dev/null || true
max_turns: 5
difficulty: medium
category: schema
- id: insert-and-query
intent: "Using psql, create a table called 'products' with columns id (serial),
name (text), and price (numeric). Insert three products: 'Widget' at 9.99,
'Gadget' at 24.50, and 'Gizmo' at 14.75. Then query all products ordered by
price descending and save the output to query-result.txt."
assert:
- ran: psql.*INSERT|psql.*insert
- ran: psql.*SELECT|psql.*select
- file_exists: query-result.txt
setup:
- sudo -u postgres createdb bench_test2 2>/dev/null || true
max_turns: 6
difficulty: medium
category: query
- id: export-csv
intent: Using psql, create a table 'scores' with columns name (text) and points
(int). Insert 5 rows of sample data. Export the table to a CSV file called
scores.csv using the COPY command or \copy.
assert:
- ran: psql
- file_exists: scores.csv
setup:
- sudo -u postgres createdb bench_test3 2>/dev/null || true
max_turns: 6
difficulty: medium
category: export
- id: create-sql-script
intent: Create a SQL script file called 'setup.sql' that creates a 'tasks' table
(id serial, title text not null, done boolean default false) and inserts 3
sample tasks. Then execute the script using psql.
assert:
- file_exists: setup.sql
- file_contains:
path: setup.sql
text: CREATE TABLE
- ran: psql.*setup.sql|psql.*-f
setup:
- sudo -u postgres createdb bench_test4 2>/dev/null || true
max_turns: 5
difficulty: easy
category: workflow
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for psql — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
claude-sonnet-4-5: 80%
gpt-4.1: 75%
claude-haiku-4-5: 60%Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.