# agent checks pod health
$ kubectl get pods -n production -o json
{"items": [{"metadata": {"name": "api-7d4.."}, "status": {"phase": "Running"}}]}
$ kubectl logs api-7d4.. -n production --tail=20
2026-02-08 INFO Server started on :8080
Can AI agents use kubectl?
The Kubernetes command-line tool. Used by agents to manage clusters, inspect workloads, debug pods, and apply manifests.
See the latest run →
kubectl eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 50% | 3.0 | 7.3k |
kubectl task results by model
| Task | Difficulty | Description | gpt-5-nano |
|---|---|---|---|
| create-pod-yaml | easy | Generate a Pod manifest YAML file called pod.yaml for a pod named 'web' running the nginx:alpine image on port 80, using kubectl create with --dry-run=client and -o yaml. | ✗ 4t |
| create-deployment-yaml | easy | Generate a Deployment manifest for a deployment named 'api' with 3 replicas running the node:20-alpine image. Save it to deployment.yaml using --dry-run=client -o yaml. | ✓ 2t |
| explain-resource | easy | Use kubectl explain to show the documentation for a Pod's spec.containers field. | ✗ 2t |
| kustomize-build | hard | Create a kustomization.yaml file that includes a resource file called deployment.yaml. Then use kubectl kustomize to build and output the result. | ✓ 4t |
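The summary row for gpt-5-nano can be re-derived from the per-task results above. The short check below is only a sketch: it assumes the `t` suffix in the result column means turns used, and the `results` list is transcribed by hand from the table rather than read from any CLIWatch export.

```python
# (task id, passed, turns) transcribed from the per-task table.
# Assumption: "4t" / "2t" in the table mean 4 and 2 turns.
results = [
    ("create-pod-yaml", False, 4),
    ("create-deployment-yaml", True, 2),
    ("explain-resource", False, 2),
    ("kustomize-build", True, 4),
]

# Pass rate is the fraction of tasks that passed.
pass_rate = sum(passed for _, passed, _ in results) / len(results)

# Average turns is the mean over all tasks, passed or not.
avg_turns = sum(turns for _, _, turns in results) / len(results)

print(f"pass rate: {pass_rate:.0%}, avg turns: {avg_turns:.1f}")
# prints: pass rate: 50%, avg turns: 3.0
```

Both values agree with the 50% pass rate and 3.0 average turns reported in the model table.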
Task suite source (56 lines · YAML)
- id: create-pod-yaml
  intent: Generate a Pod manifest YAML file called pod.yaml for a pod named 'web'
    running the nginx:alpine image on port 80, using kubectl create with
    --dry-run=client and -o yaml.
  assert:
    - file_exists: pod.yaml
    - file_contains:
        path: pod.yaml
        text: nginx
    - file_contains:
        path: pod.yaml
        text: web
    - ran: kubectl.*--dry-run
  setup: []
  max_turns: 4
  difficulty: easy
  category: generate
- id: create-deployment-yaml
  intent: Generate a Deployment manifest for a deployment named 'api' with 3
    replicas running the node:20-alpine image. Save it to deployment.yaml using
    --dry-run=client -o yaml.
  assert:
    - file_exists: deployment.yaml
    - file_contains:
        path: deployment.yaml
        text: replicas
    - file_contains:
        path: deployment.yaml
        text: api
  setup: []
  max_turns: 4
  difficulty: easy
  category: generate
- id: explain-resource
  intent: Use kubectl explain to show the documentation for a Pod's
    spec.containers field.
  assert:
    - ran: kubectl explain
    - output_contains: containers
  setup: []
  max_turns: 3
  difficulty: easy
  category: query
- id: kustomize-build
  intent: Create a kustomization.yaml file that includes a resource file called
    deployment.yaml. Then use kubectl kustomize to build and output the result.
  assert:
    - file_exists: kustomization.yaml
    - ran: kubectl kustomize|kustomize build
  setup:
    - kubectl create deployment api --image=node:20-alpine --replicas=2
      --dry-run=client -o yaml > deployment.yaml
  max_turns: 8
  difficulty: hard
  category: workflow
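To make the assertion types in this suite concrete, here is a minimal sketch of how a harness might evaluate them. It is not the @cliwatch/cli-bench implementation: the function name `check_assertion` and its parameters are hypothetical, and the semantics are assumptions (`ran` is treated as a regular expression searched against each command the agent executed, `file_contains` and `output_contains` as plain substring checks).

```python
import re
import tempfile
from pathlib import Path


def check_assertion(assertion: dict, commands: list[str], output: str, workdir: Path) -> bool:
    """Evaluate one single-key assertion mapping (hypothetical semantics)."""
    (kind, value), = assertion.items()  # each assertion has exactly one key
    if kind == "file_exists":
        return (workdir / value).is_file()
    if kind == "file_contains":
        target = workdir / value["path"]
        return target.is_file() and value["text"] in target.read_text()
    if kind == "ran":
        # Assumption: the value is a regex matched anywhere in a command line.
        pattern = re.compile(value)
        return any(pattern.search(cmd) for cmd in commands)
    if kind == "output_contains":
        return value in output
    raise ValueError(f"unknown assertion type: {kind}")


# Usage: replay the create-pod-yaml assertions against a simulated run.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "pod.yaml").write_text("metadata:\n  name: web\nspec:\n  image: nginx:alpine\n")
    commands = ["kubectl run web --image=nginx:alpine --port=80 --dry-run=client -o yaml > pod.yaml"]
    assertions = [
        {"file_exists": "pod.yaml"},
        {"file_contains": {"path": "pod.yaml", "text": "nginx"}},
        {"file_contains": {"path": "pod.yaml", "text": "web"}},
        {"ran": "kubectl.*--dry-run"},
    ]
    all_passed = all(check_assertion(a, commands, "", workdir) for a in assertions)
    print("task passed:", all_passed)
```

Under these assumptions a task passes only when every assertion in its `assert` list holds, which is why a run can produce the right file and still fail if the expected command was never executed.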
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for kubectl — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%
Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
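A quality gate of this kind reduces to comparing each model's pass rate with its threshold. The sketch below illustrates that logic only and makes no claim about how CLIWatch implements it: `parse_percent` and `failing_models` are hypothetical helpers, the pass rates are the sample values from the table on this page, and mapping display names such as "Sonnet 4.5" to the ID `claude-sonnet-4-5` is an assumption.

```python
def parse_percent(text: str) -> float:
    """Turn a string such as '80%' into the number 80.0."""
    return float(text.strip().rstrip("%"))


def failing_models(thresholds: dict[str, str], pass_rates: dict[str, float]) -> list[str]:
    """Return the models whose pass rate is below threshold.

    A model with no recorded pass rate counts as failing (a design choice
    made for this sketch, so that a missing eval cannot slip past the gate).
    """
    failures = []
    for model, minimum in thresholds.items():
        rate = pass_rates.get(model)
        if rate is None or rate < parse_percent(minimum):
            failures.append(model)
    return failures


# Thresholds as shown on this page.
thresholds = {
    "claude-sonnet-4-5": "80%",
    "gpt-4.1": "75%",
    "claude-haiku-4-5": "60%",
}
# Sample pass rates from the table on this page (name-to-ID mapping assumed).
pass_rates = {"claude-sonnet-4-5": 95.0, "gpt-4.1": 80.0, "claude-haiku-4-5": 65.0}

failed = failing_models(thresholds, pass_rates)
exit_code = 1 if failed else 0  # a CI wrapper would pass this to sys.exit
print("failing models:", failed, "exit code:", exit_code)
```

With the sample numbers every model clears its threshold, so the gate passes; a pass rate exactly equal to the threshold also passes, since only values below the target count as failures here.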
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.