# agent checks pod health
$ kubectl get pods -n production -o json
  {"items": [{"metadata": {"name": "api-7d4.."},
   "status": {"phase": "Running"}}]}
 
$ kubectl logs api-7d4.. -n production --tail=20
  2026-02-08 INFO  Server started on :8080

Can AI agents use kubectl?

The Kubernetes command-line tool. Used by agents to manage clusters, inspect workloads, debug pods, and apply manifests.

50% overall pass rate · 1 model tested · 4 tasks · v1.35.2 · 3/6/2026

kubectl eval results by model

Model	Pass rate	Avg turns	Avg tokens
gpt-5-nano	50%	3.0	7.3k

kubectl task results by model

Task	Difficulty	Intent	gpt-5-nano
create-pod-yaml	easy	Generate a Pod manifest YAML file called pod.yaml for a pod named 'web' running the nginx:alpine image on port 80, using kubectl create with --dry-run=client and -o yaml.	✗ 4 turns · 9.5k tokens
create-deployment-yaml	easy	Generate a Deployment manifest for a deployment named 'api' with 3 replicas running the node:20-alpine image. Save it to deployment.yaml using --dry-run=client -o yaml.	✓ 2 turns · 3.5k tokens
explain-resource	easy	Use kubectl explain to show the documentation for a Pod's spec.containers field.	✗ 2 turns · 7.9k tokens
kustomize-build	hard	Create a kustomization.yaml file that includes a resource file called deployment.yaml. Then use kubectl kustomize to build and output the result.	✓ 4 turns · 8.5k tokens
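
The headline figures can be recomputed from the per-task rows above. The following is a minimal sketch, assuming the aggregates are plain means over the four tasks (the page does not state how CLIWatch aggregates); the `results` list and its tuple layout are illustrative, not part of any CLIWatch API.

```python
# Per-task results for gpt-5-nano, copied from the table above.
# The tuple layout (task, passed, turns, tokens_k) is illustrative only.
results = [
    ("create-pod-yaml", False, 4, 9.5),
    ("create-deployment-yaml", True, 2, 3.5),
    ("explain-resource", False, 2, 7.9),
    ("kustomize-build", True, 4, 8.5),
]


def summarize(rows):
    """Return (pass_rate_percent, avg_turns, avg_tokens_k) as plain means."""
    n = len(rows)
    passed = sum(1 for _, ok, _, _ in rows if ok)
    pass_rate = 100.0 * passed / n
    avg_turns = sum(turns for _, _, turns, _ in rows) / n
    avg_tokens_k = sum(tokens for _, _, _, tokens in rows) / n
    return pass_rate, avg_turns, avg_tokens_k


pass_rate, avg_turns, avg_tokens_k = summarize(results)
print(f"{pass_rate:.0f}% pass rate, {avg_turns:.1f} avg turns, ~{avg_tokens_k:.2f}k avg tokens")
```

Under that assumption, two passes out of four tasks give 50%, the mean turn count is 3.0, and the mean token count is roughly 7.35k, which is consistent with the reported 7.3k given that the per-task token counts are themselves rounded.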

Task suite source · 56 lines · YAML

- id: create-pod-yaml
  intent: Generate a Pod manifest YAML file called pod.yaml for a pod named 'web'
    running the nginx:alpine image on port 80, using kubectl create with
    --dry-run=client and -o yaml.
  assert:
    - file_exists: pod.yaml
    - file_contains:
        path: pod.yaml
        text: nginx
    - file_contains:
        path: pod.yaml
        text: web
    - ran: kubectl.*--dry-run
  setup: []
  max_turns: 4
  difficulty: easy
  category: generate
- id: create-deployment-yaml
  intent: Generate a Deployment manifest for a deployment named 'api' with 3
    replicas running the node:20-alpine image. Save it to deployment.yaml using
    --dry-run=client -o yaml.
  assert:
    - file_exists: deployment.yaml
    - file_contains:
        path: deployment.yaml
        text: replicas
    - file_contains:
        path: deployment.yaml
        text: api
  setup: []
  max_turns: 4
  difficulty: easy
  category: generate
- id: explain-resource
  intent: Use kubectl explain to show the documentation for a Pod's
    spec.containers field.
  assert:
    - ran: kubectl explain
    - output_contains: containers
  setup: []
  max_turns: 3
  difficulty: easy
  category: query
- id: kustomize-build
  intent: Create a kustomization.yaml file that includes a resource file called
    deployment.yaml. Then use kubectl kustomize to build and output the result.
  assert:
    - file_exists: kustomization.yaml
    - ran: kubectl kustomize|kustomize build
  setup:
    - kubectl create deployment api --image=node:20-alpine --replicas=2
      --dry-run=client -o yaml > deployment.yaml
  max_turns: 8
  difficulty: hard
  category: workflow
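
To make the `assert` entries above concrete, here is a rough sketch of how such checks could be evaluated against an agent run. This is not the @cliwatch/cli-bench implementation: the function names, the idea that `ran` is a regular expression searched in each executed command, and the rule that every assertion must hold are all assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path


def check_assertion(assertion, workdir, commands, output):
    """Evaluate one assert entry (a single-key dict mirroring the YAML above).

    workdir:  directory the agent worked in
    commands: list of shell command strings the agent ran
    output:   combined text output of those commands
    """
    kind, arg = next(iter(assertion.items()))
    if kind == "file_exists":
        return (Path(workdir) / arg).is_file()
    if kind == "file_contains":
        target = Path(workdir) / arg["path"]
        return target.is_file() and arg["text"] in target.read_text()
    if kind == "ran":
        # Assumption: the value is a regex searched in each command line.
        return any(re.search(arg, cmd) for cmd in commands)
    if kind == "output_contains":
        return arg in output
    raise ValueError(f"unknown assertion kind: {kind}")


def task_passes(assertions, workdir, commands, output):
    """Assumption: a task passes only if every assertion holds."""
    return all(check_assertion(a, workdir, commands, output) for a in assertions)


# The create-pod-yaml assertions, checked against a simulated agent run.
pod_assertions = [
    {"file_exists": "pod.yaml"},
    {"file_contains": {"path": "pod.yaml", "text": "nginx"}},
    {"file_contains": {"path": "pod.yaml", "text": "web"}},
    {"ran": "kubectl.*--dry-run"},
]

with tempfile.TemporaryDirectory() as tmp:
    # Hand-written stand-in for the manifest an agent would generate.
    (Path(tmp) / "pod.yaml").write_text(
        "apiVersion: v1\nkind: Pod\nmetadata:\n  name: web\n"
        "spec:\n  containers:\n  - name: web\n    image: nginx:alpine\n"
    )
    # Simulated command history; nothing is executed here.
    ran = ["kubectl run web --image=nginx:alpine --port=80 --dry-run=client -o yaml > pod.yaml"]
    ok = task_passes(pod_assertions, tmp, ran, output="")
    missing_cmd = task_passes(pod_assertions, tmp, ["cat pod.yaml"], output="")

print(ok, missing_cmd)
```

In this sketch the same file on disk passes or fails depending on the command history, which illustrates why a task can fail on the `ran` check even when the expected file exists.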

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for kubectl — see the latest run. Set up the same for your CLI in minutes.

Model	Pass Rate	Delta
Sonnet 4.5	95%	+5%
GPT-4.1	80%	-5%
Haiku 4.5	65%	-10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.

[Chart: pass rate over the last 30 days, v1.0 to v1.6]

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.
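
As an illustration of the gate logic, the sketch below compares per-model pass rates against thresholds like the ones shown and derives a CI exit code. It is a hypothetical helper, not CLIWatch's code: the function names, the mapping of the sample table's model labels onto the threshold keys, and the choice to treat a model with no recorded result as failing are all assumptions.

```python
def parse_percent(value):
    """Turn a string such as '80%' into 80.0."""
    return float(value.strip().rstrip("%"))


def failing_models(thresholds, pass_rates):
    """Return {model: (rate, minimum)} for every model below its threshold.

    Assumption: a model with no recorded pass rate counts as failing.
    """
    failures = {}
    for model, raw in thresholds.items():
        minimum = parse_percent(raw)
        rate = pass_rates.get(model)
        if rate is None or rate < minimum:
            failures[model] = (rate, minimum)
    return failures


thresholds = {
    "claude-sonnet-4-5": "80%",
    "gpt-4.1": "75%",
    "claude-haiku-4-5": "60%",
}
# Sample pass rates from the PR-comment table earlier on the page,
# keyed by the threshold names (the label-to-key mapping is assumed).
pass_rates = {"claude-sonnet-4-5": 95.0, "gpt-4.1": 80.0, "claude-haiku-4-5": 65.0}

failures = failing_models(thresholds, pass_rates)
exit_code = 1 if failures else 0  # a non-zero exit code fails the CI job
print(exit_code, failures)

# A hypothetical regression: gpt-4.1 drops to 70%, below its 75% target.
regressed = failing_models(thresholds, {**pass_rates, "gpt-4.1": 70.0})
print(regressed)
```

With the sample numbers every model clears its threshold, so the gate passes; in the hypothetical regression only gpt-4.1 is reported.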

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.

Compare other CLI evals

git · npm · aws · fly