# agent inspects a running container
$ docker ps --format json
  {"ID":"a1b2c3","Names":"api","Status":"Up 2h"}
 
$ docker logs api --tail 10
  Listening on port 8080

Can AI agents use Docker?

Container management CLI. Agents build images, run containers, manage volumes, and inspect running services.

78% overall pass rate · 1 model tested · 18 tasks · v28.0.4, 3/6/2026

Docker eval results by model

Model | Pass rate | Avg turns | Avg tokens
gpt-5-nano | 78% | 2.7 | 7.5k

Docker task results by model

Task (difficulty) | gpt-5-nano
quickstart-run-hello (easy) | 1t
  Run the 'hello-world' Docker image. Docker should pull the image if needed and print a hello message.
quickstart-run-command (easy) | 1t
  Run an alpine container that executes the command 'echo bench-ping-ok' and prints the output.
quickstart-pull-image (easy) | 3t
  Pull the 'busybox' image from Docker Hub. Print the image ID or digest after pulling.
discover-version-info (easy) | 1t
  Check what version of Docker is installed and print the full version output, including the Go version and build information.
discover-system-info (easy) | 4t
  Use Docker to display system-wide information about the Docker installation, including the number of containers and images.
build-simple-image (medium) | 2t
  Create a Dockerfile for an Alpine-based image that runs 'echo bench-build-success' as its default command. Build the image with the tag 'bench-simple:latest'.
build-with-arg (medium) | 6t
  Create a Dockerfile that uses a build argument called GREETING with a default value of 'hello'. The Dockerfile should be Alpine-based and use 'echo $GREETING' as its CMD. Build the image as 'bench-arg:latest' passing GREETING='bench-greet-ok' as a build arg.
tag-image (medium) | 3t
  Pull the 'alpine:latest' image if not already present, then create a new tag 'bench-alpine:v1' pointing to it. Verify the new tag exists by listing images filtered to 'bench-alpine'.
run-detached-and-list (medium) | 2t
  Run an alpine container named 'bench-bg' in detached mode with the command 'sleep 300'. Then list running containers and confirm 'bench-bg' appears in the output.
exec-in-container (medium) | 2t
  Start an alpine container named 'bench-exec' in detached mode running 'sleep 300'. Then use docker exec to run 'cat /etc/os-release' inside the container and print the output.
container-logs (medium) | 3t
  Run an alpine container named 'bench-logger' that executes 'echo bench-log-line-1 && echo bench-log-line-2 && echo bench-log-line-3'. Then retrieve the last 2 lines of its logs using docker logs with the --tail flag.
error-nonexistent-image (hard) | 2t
  Try to run a container from the image 'bench-does-not-exist:fake'. Capture the exit code and write it to bench-exit-code.txt. The command should fail because the image does not exist.
error-remove-running (hard) | 4t
  Run an alpine container named 'bench-running' in detached mode (sleep 300). Try to remove it without the force flag (docker rm bench-running). The removal should fail. Then remove it with --force and confirm it is gone by listing containers.
error-stop-and-remove (hard) | 4t
  Run an alpine container named 'bench-stopper' in detached mode (sleep 300). Stop it with docker stop, then remove it with docker rm. Verify it no longer appears in docker ps -a.
workflow-volume-roundtrip (hard) | 3t
  Create a Docker volume named 'bench-data-vol'. Run an alpine container that mounts this volume at /data and writes 'volume-roundtrip-ok' to /data/bench-result.txt. Then run a second alpine container mounting the same volume and read the file with 'cat /data/bench-result.txt'. Print the contents.
workflow-copy-file-out (hard) | 4t
  Run an alpine container named 'bench-copier' in detached mode (sleep 300). Use docker exec to create a file /tmp/bench-report.txt inside the container with the content 'copy-extract-ok'. Then use docker cp to copy that file from the container to the current directory. Print the contents of the local bench-report.txt.
workflow-build-run-inspect (hard) | 10t
  Create a Dockerfile based on alpine that sets the environment variable APP_MODE=benchmark and runs 'sleep 300' as its command. Build it as 'bench-full:latest'. Run a container named 'bench-full-test' from that image in detached mode. Then use docker inspect with a format template to print only the APP_MODE environment variable value.
workflow-network-communication (hard) | 3t
  Create a Docker network named 'bench-net'. Run an alpine container named 'bench-server' in detached mode on that network (sleep 300). Run a second alpine container on the same network that pings 'bench-server' by name with 'ping -c 1 bench-server'. The ping should succeed, proving DNS-based container discovery works.
Task suite source · 264 lines · YAML
- id: quickstart-run-hello
  intent: Run the 'hello-world' Docker image. Docker should pull the image if
    needed and print a hello message.
  assert:
    - ran: docker run
    - output_contains: Hello from Docker
  setup:
    - docker rm -f bench-hello-world 2>/dev/null || true
  max_turns: 3
  difficulty: easy
  category: getting-started
  docs_origin: data/cli/engine/docker_run.yaml#run
- id: quickstart-run-command
  intent: Run an alpine container that executes the command 'echo bench-ping-ok'
    and prints the output.
  assert:
    - ran: docker run
    - output_contains: bench-ping-ok
  setup: []
  max_turns: 3
  difficulty: easy
  category: getting-started
  docs_origin: data/cli/engine/docker_run.yaml#run
- id: quickstart-pull-image
  intent: Pull the 'busybox' image from Docker Hub. Print the image ID or digest
    after pulling.
  assert:
    - ran: docker pull
    - output_contains: busybox
  setup: []
  max_turns: 3
  difficulty: easy
  category: getting-started
  docs_origin: data/cli/engine/docker_pull.yaml#pull
- id: discover-version-info
  intent: Check what version of Docker is installed and print the full version
    output, including the Go version and build information.
  assert:
    - ran: docker.*version
    - output_contains: Version
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: data/cli/engine/docker_version.yaml#version
- id: discover-system-info
  intent: Use Docker to display system-wide information about the Docker
    installation, including the number of containers and images.
  assert:
    - ran: docker.*info
    - output_contains: Containers
  setup: []
  max_turns: 4
  difficulty: easy
  category: command-discovery
  docs_origin: data/cli/engine/docker_info.yaml#info
- id: build-simple-image
  intent: Create a Dockerfile for an Alpine-based image that runs 'echo
    bench-build-success' as its default command. Build the image with the tag
    'bench-simple:latest'.
  assert:
    - ran: docker build
    - file_exists: Dockerfile
    - verify:
        run: docker images bench-simple --format '{{.Repository}}:{{.Tag}}'
        output_contains: bench-simple:latest
  setup:
    - docker rmi -f bench-simple:latest 2>/dev/null || true
  max_turns: 5
  difficulty: medium
  category: build
  docs_origin: data/cli/engine/docker_build.yaml#build
- id: build-with-arg
  intent: Create a Dockerfile that uses a build argument called GREETING with a
    default value of 'hello'. The Dockerfile should be Alpine-based and use
    'echo $GREETING' as its CMD. Build the image as 'bench-arg:latest' passing
    GREETING='bench-greet-ok' as a build arg.
  assert:
    - ran: docker build
    - ran: docker.*--build-arg
    - verify:
        run: docker run --rm bench-arg:latest
        output_contains: bench-greet-ok
  setup:
    - docker rmi -f bench-arg:latest 2>/dev/null || true
  max_turns: 6
  difficulty: medium
  category: build
  docs_origin: data/cli/engine/docker_build.yaml#build-arg
- id: tag-image
  intent: Pull the 'alpine:latest' image if not already present, then create a new
    tag 'bench-alpine:v1' pointing to it. Verify the new tag exists by listing
    images filtered to 'bench-alpine'.
  assert:
    - ran: docker tag
    - verify:
        run: docker images bench-alpine --format '{{.Repository}}:{{.Tag}}'
        output_contains: bench-alpine:v1
  setup:
    - docker rmi -f bench-alpine:v1 2>/dev/null || true
  max_turns: 5
  difficulty: medium
  category: build
  docs_origin: data/cli/engine/docker_tag.yaml#tag
- id: run-detached-and-list
  intent: Run an alpine container named 'bench-bg' in detached mode with the
    command 'sleep 300'. Then list running containers and confirm 'bench-bg'
    appears in the output.
  assert:
    - ran: docker run
    - ran: docker ps
    - output_contains: bench-bg
  setup:
    - docker rm -f bench-bg 2>/dev/null || true
  max_turns: 5
  difficulty: medium
  category: containers
  docs_origin: _vendor/github.com/docker/cli/docs/reference/run.md#Foreground and background
- id: exec-in-container
  intent: Start an alpine container named 'bench-exec' in detached mode running
    'sleep 300'. Then use docker exec to run 'cat /etc/os-release' inside the
    container and print the output.
  assert:
    - ran: docker run
    - ran: docker exec
    - output_contains: Alpine
  setup:
    - docker rm -f bench-exec 2>/dev/null || true
  max_turns: 5
  difficulty: medium
  category: containers
  docs_origin: data/cli/engine/docker_exec.yaml#exec
- id: container-logs
  intent: Run an alpine container named 'bench-logger' that executes 'echo
    bench-log-line-1 && echo bench-log-line-2 && echo bench-log-line-3'. Then
    retrieve the last 2 lines of its logs using docker logs with the --tail
    flag.
  assert:
    - ran: docker run
    - ran: docker logs
    - output_contains: bench-log-line
  setup:
    - docker rm -f bench-logger 2>/dev/null || true
  max_turns: 6
  difficulty: medium
  category: containers
  docs_origin: data/cli/engine/docker_logs.yaml#logs
- id: error-nonexistent-image
  intent: Try to run a container from the image 'bench-does-not-exist:fake'.
    Capture the exit code and write it to bench-exit-code.txt. The command
    should fail because the image does not exist.
  assert:
    - ran: docker run
    - file_exists: bench-exit-code.txt
  setup: []
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: data/cli/engine/docker_run.yaml#run
- id: error-remove-running
  intent: Run an alpine container named 'bench-running' in detached mode (sleep
    300). Try to remove it without the force flag (docker rm bench-running). The
    removal should fail. Then remove it with --force and confirm it is gone by
    listing containers.
  assert:
    - ran: docker run
    - ran: docker rm
    - verify:
        run: docker ps -a --filter name=bench-running --format '{{.Names}}'
        output_equals: ""
  setup:
    - docker rm -f bench-running 2>/dev/null || true
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: data/cli/engine/docker_rm.yaml#rm
- id: error-stop-and-remove
  intent: Run an alpine container named 'bench-stopper' in detached mode (sleep
    300). Stop it with docker stop, then remove it with docker rm. Verify it no
    longer appears in docker ps -a.
  assert:
    - ran: docker stop
    - ran: docker rm
    - verify:
        run: docker ps -a --filter name=bench-stopper --format '{{.Names}}'
        output_equals: ""
  setup:
    - docker rm -f bench-stopper 2>/dev/null || true
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: data/cli/engine/docker_stop.yaml#stop
- id: workflow-volume-roundtrip
  intent: Create a Docker volume named 'bench-data-vol'. Run an alpine container
    that mounts this volume at /data and writes 'volume-roundtrip-ok' to
    /data/bench-result.txt. Then run a second alpine container mounting the same
    volume and read the file with 'cat /data/bench-result.txt'. Print the
    contents.
  assert:
    - ran: docker volume
    - ran: docker run
    - output_contains: volume-roundtrip-ok
  setup:
    - docker volume rm bench-data-vol 2>/dev/null || true
  max_turns: 8
  difficulty: hard
  category: multi-step-workflow
  docs_origin: data/cli/engine/docker_volume_create.yaml#volume create
- id: workflow-copy-file-out
  intent: Run an alpine container named 'bench-copier' in detached mode (sleep
    300). Use docker exec to create a file /tmp/bench-report.txt inside the
    container with the content 'copy-extract-ok'. Then use docker cp to copy
    that file from the container to the current directory. Print the contents of
    the local bench-report.txt.
  assert:
    - ran: docker run
    - ran: docker cp
    - file_exists: bench-report.txt
    - file_contains:
        path: bench-report.txt
        text: copy-extract-ok
  setup:
    - docker rm -f bench-copier 2>/dev/null || true
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: data/cli/engine/docker_cp.yaml#cp
- id: workflow-build-run-inspect
  intent: Create a Dockerfile based on alpine that sets the environment variable
    APP_MODE=benchmark and runs 'sleep 300' as its command. Build it as
    'bench-full:latest'. Run a container named 'bench-full-test' from that image
    in detached mode. Then use docker inspect with a format template to print
    only the APP_MODE environment variable value.
  assert:
    - ran: docker build
    - ran: docker run
    - ran: docker inspect
    - output_contains: benchmark
  setup:
    - docker rm -f bench-full-test 2>/dev/null || true
    - docker rmi -f bench-full:latest 2>/dev/null || true
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: _vendor/github.com/docker/cli/docs/reference/run.md#ENV
    (environment variables)
- id: workflow-network-communication
  intent: Create a Docker network named 'bench-net'. Run an alpine container named
    'bench-server' in detached mode on that network (sleep 300). Run a second
    alpine container on the same network that pings 'bench-server' by name with
    'ping -c 1 bench-server'. The ping should succeed, proving DNS-based
    container discovery works.
  assert:
    - ran: docker network
    - ran: docker run
    - output_contains: bench-server
  setup:
    - docker rm -f bench-server 2>/dev/null || true
    - docker network rm bench-net 2>/dev/null || true
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: data/cli/engine/docker_network_create.yaml#network create

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 18 tasks using @cliwatch/cli-bench.
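
To make the `assert` entries in the task suite more concrete, here is a minimal Python sketch of how three of them might be evaluated. It is not the @cliwatch/cli-bench implementation: the function name `check_assertions`, the transcript shape, and the reading of `ran` as a regular expression (suggested by patterns such as `docker.*version`) are all assumptions made for illustration.

```python
import re


def check_assertions(assertions, commands, output):
    """Evaluate a task's assertions against a hypothetical agent transcript.

    assertions: list of single-key dicts, e.g. {"ran": "docker.*version"}
    commands:   command lines the agent executed
    output:     combined text printed by those commands
    Returns a list of (kind, passed) pairs; passed is None when not modeled.
    """
    results = []
    for assertion in assertions:
        (kind, expected), = assertion.items()
        if kind == "ran":
            # Assumption: `ran` is a regex searched within each executed command.
            passed = any(re.search(expected, cmd) for cmd in commands)
        elif kind == "output_contains":
            passed = expected in output
        elif kind == "output_equals":
            passed = output.strip() == expected
        else:
            # file_exists, file_contains and verify need a real sandbox,
            # so this sketch leaves them unevaluated.
            passed = None
        results.append((kind, passed))
    return results


# Illustrative transcript for the discover-version-info task (not a real run).
assertions = [{"ran": "docker.*version"}, {"output_contains": "Version"}]
commands = ["docker version"]
output = "Client:\n Version: 28.0.4\n"
results = check_assertions(assertions, commands, output)
```

Under these assumptions both assertions pass for the sample transcript. A task would count as passed only if every assertion, including the ones left unmodeled here, held.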

What you get with CLIWatch

Everything below is running live for Docker; see the latest run. Set up the same for your CLI in minutes.

Model | Pass Rate | Delta
Sonnet 4.5 | 95% | +5%
GPT-4.1 | 80% | -5%
Haiku 4.5 | 65% | -10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
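
As a rough illustration of what such a comment could contain, the Python sketch below compares two runs and flags regressions. The function `render_comment` and the comment layout are hypothetical, not CLIWatch's actual format. The "previous" pass rates are back-computed from the sample deltas shown above, assuming those deltas are percentage points.

```python
def render_comment(current, previous):
    """Build a plain-text comment body with per-model pass rates and deltas.

    current, previous: dicts mapping model name to integer pass rate (percent).
    A model missing from `previous` is reported with a delta of +0.
    """
    lines = ["Model | Pass rate | Delta"]
    for model, rate in current.items():
        delta = rate - previous.get(model, rate)
        flag = " (regression)" if delta < 0 else ""
        lines.append(f"{model} | {rate}% | {delta:+d}%{flag}")
    return "\n".join(lines)


# Sample values taken from the table above; the previous run is inferred.
current = {"Sonnet 4.5": 95, "GPT-4.1": 80, "Haiku 4.5": 65}
previous = {"Sonnet 4.5": 90, "GPT-4.1": 85, "Haiku 4.5": 75}
comment = render_comment(current, previous)
print(comment)
```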

[Chart: pass rate over the last 30 days, v1.0 to v1.6]

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.
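
The gating logic can be sketched in a few lines of Python. This is an assumption about how a threshold check might work, not CLIWatch's own code: `failed_gates` is a hypothetical helper, the pass rates are made up for the example, and treating a model with no result as a failure is a design choice of this sketch.

```python
def parse_percent(value):
    """Convert a threshold such as '80%' (or the number 80) to a float."""
    return float(str(value).rstrip("%"))


def failed_gates(thresholds, pass_rates):
    """Return the models whose pass rate is below threshold or missing."""
    failures = []
    for model, minimum in thresholds.items():
        rate = pass_rates.get(model)
        if rate is None or rate < parse_percent(minimum):
            failures.append(model)
    return failures


# Thresholds copied from the config shown above.
thresholds = {
    "claude-sonnet-4-5": "80%",
    "gpt-4.1": "75%",
    "claude-haiku-4-5": "60%",
}
# Made-up pass rates for illustration only; gpt-4.1 is set below its gate.
pass_rates = {"claude-sonnet-4-5": 95.0, "gpt-4.1": 72.0, "claude-haiku-4-5": 65.0}

failures = failed_gates(thresholds, pass_rates)
exit_code = 1 if failures else 0  # a non-zero exit code is what fails a CI step
```

In a CI job, a script like this would finish with `sys.exit(exit_code)` so that any model below its threshold fails the step.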

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.
