    # agent inspects a running container
    $ docker ps --format json
    {"ID":"a1b2c3","Names":"api","Status":"Up 2h"}
    $ docker logs api --tail 10
    Listening on port 8080

Can AI agents use Docker?
Container management CLI. Agents build images, run containers, manage volumes, and inspect running services.
See the latest run →

Docker eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 78% | 2.7 | 7.5k |
Docker task results by model
| Task | Difficulty | Description | gpt-5-nano |
|---|---|---|---|
| quickstart-run-hello | easy | Run the 'hello-world' Docker image. Docker should pull the image if needed and print a hello message. | ✓ 1t |
| quickstart-run-command | easy | Run an alpine container that executes the command 'echo bench-ping-ok' and prints the output. | ✓ 1t |
| quickstart-pull-image | easy | Pull the 'busybox' image from Docker Hub. Print the image ID or digest after pulling. | ✗ 3t |
| discover-version-info | easy | Check what version of Docker is installed and print the full version output, including the Go version and build information. | ✓ 1t |
| discover-system-info | easy | Use Docker to display system-wide information about the Docker installation, including the number of containers and images. | ✗ 4t |
| build-simple-image | medium | Create a Dockerfile for an Alpine-based image that runs 'echo bench-build-success' as its default command. Build the image with the tag 'bench-simple:latest'. | ✓ 2t |
| build-with-arg | medium | Create a Dockerfile that uses a build argument called GREETING with a default value of 'hello'. The Dockerfile should be Alpine-based and use 'echo $GREETING' as its CMD. Build the image as 'bench-arg:latest' passing GREETING='bench-greet-ok' as a build arg. | ✓ 6t |
| tag-image | medium | Pull the 'alpine:latest' image if not already present, then create a new tag 'bench-alpine:v1' pointing to it. Verify the new tag exists by listing images filtered to 'bench-alpine'. | ✓ 3t |
| run-detached-and-list | medium | Run an alpine container named 'bench-bg' in detached mode with the command 'sleep 300'. Then list running containers and confirm 'bench-bg' appears in the output. | ✓ 2t |
| exec-in-container | medium | Start an alpine container named 'bench-exec' in detached mode running 'sleep 300'. Then use docker exec to run 'cat /etc/os-release' inside the container and print the output. | ✓ 2t |
| container-logs | medium | Run an alpine container named 'bench-logger' that executes 'echo bench-log-line-1 && echo bench-log-line-2 && echo bench-log-line-3'. Then retrieve the last 2 lines of its logs using docker logs with the --tail flag. | ✗ 3t |
| error-nonexistent-image | hard | Try to run a container from the image 'bench-does-not-exist:fake'. Capture the exit code and write it to bench-exit-code.txt. The command should fail because the image does not exist. | ✓ 2t |
| error-remove-running | hard | Run an alpine container named 'bench-running' in detached mode (sleep 300). Try to remove it without the force flag (docker rm bench-running). The removal should fail. Then remove it with --force and confirm it is gone by listing containers. | ✓ 4t |
| error-stop-and-remove | hard | Run an alpine container named 'bench-stopper' in detached mode (sleep 300). Stop it with docker stop, then remove it with docker rm. Verify it no longer appears in docker ps -a. | ✓ 4t |
| workflow-volume-roundtrip | hard | Create a Docker volume named 'bench-data-vol'. Run an alpine container that mounts this volume at /data and writes 'volume-roundtrip-ok' to /data/bench-result.txt. Then run a second alpine container mounting the same volume and read the file with 'cat /data/bench-result.txt'. Print the contents. | ✓ 3t |
| workflow-copy-file-out | hard | Run an alpine container named 'bench-copier' in detached mode (sleep 300). Use docker exec to create a file /tmp/bench-report.txt inside the container with the content 'copy-extract-ok'. Then use docker cp to copy that file from the container to the current directory. Print the contents of the local bench-report.txt. | ✓ 4t |
| workflow-build-run-inspect | hard | Create a Dockerfile based on alpine that sets the environment variable APP_MODE=benchmark and runs 'sleep 300' as its command. Build it as 'bench-full:latest'. Run a container named 'bench-full-test' from that image in detached mode. Then use docker inspect with a format template to print only the APP_MODE environment variable value. | ✗ 10t |
| workflow-network-communication | hard | Create a Docker network named 'bench-net'. Run an alpine container named 'bench-server' in detached mode on that network (sleep 300). Run a second alpine container on the same network that pings 'bench-server' by name with 'ping -c 1 bench-server'. The ping should succeed, proving DNS-based container discovery works. | ✓ 3t |
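
As a cross-check, the summary figures for gpt-5-nano can be recomputed from the per-task marks above. The sketch below transcribes each mark as a (passed, turns) pair. Reading "3t" as "3 turns" is an assumption, and so is the observation that the reported 2.7 average matches the mean over passing tasks only; the page does not say how the average is defined.

```python
# (passed, turns) per task, transcribed from the gpt-5-nano column above.
results = {
    "quickstart-run-hello": (True, 1),
    "quickstart-run-command": (True, 1),
    "quickstart-pull-image": (False, 3),
    "discover-version-info": (True, 1),
    "discover-system-info": (False, 4),
    "build-simple-image": (True, 2),
    "build-with-arg": (True, 6),
    "tag-image": (True, 3),
    "run-detached-and-list": (True, 2),
    "exec-in-container": (True, 2),
    "container-logs": (False, 3),
    "error-nonexistent-image": (True, 2),
    "error-remove-running": (True, 4),
    "error-stop-and-remove": (True, 4),
    "workflow-volume-roundtrip": (True, 3),
    "workflow-copy-file-out": (True, 4),
    "workflow-build-run-inspect": (False, 10),
    "workflow-network-communication": (True, 3),
}

passed_turns = [turns for ok, turns in results.values() if ok]
pass_rate = len(passed_turns) / len(results)               # 14 of 18 tasks
avg_turns_passed = sum(passed_turns) / len(passed_turns)   # 38 turns over 14 tasks
avg_turns_all = sum(t for _, t in results.values()) / len(results)  # 58 over 18

print(f"pass rate: {pass_rate:.0%}")                          # 78%
print(f"avg turns (passing tasks): {avg_turns_passed:.1f}")   # 2.7
print(f"avg turns (all tasks): {avg_turns_all:.1f}")          # 3.2
```

The 78% pass rate follows directly from 14 passes out of 18 tasks. The all-task mean of 3.2 turns does not match the reported 2.7, which is why the passing-only reading seems the more likely one.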
Task suite source (264 lines · YAML)
- id: quickstart-run-hello
  intent: Run the 'hello-world' Docker image. Docker should pull the image if
    needed and print a hello message.
  assert:
    - ran: docker run
    - output_contains: Hello from Docker
  setup:
    - docker rm -f bench-hello-world 2>/dev/null || true
  max_turns: 3
  difficulty: easy
  category: getting-started
  docs_origin: data/cli/engine/docker_run.yaml#run
- id: quickstart-run-command
  intent: Run an alpine container that executes the command 'echo bench-ping-ok'
    and prints the output.
  assert:
    - ran: docker run
    - output_contains: bench-ping-ok
  setup: []
  max_turns: 3
  difficulty: easy
  category: getting-started
  docs_origin: data/cli/engine/docker_run.yaml#run
- id: quickstart-pull-image
  intent: Pull the 'busybox' image from Docker Hub. Print the image ID or digest
    after pulling.
  assert:
    - ran: docker pull
    - output_contains: busybox
  setup: []
  max_turns: 3
  difficulty: easy
  category: getting-started
  docs_origin: data/cli/engine/docker_pull.yaml#pull
- id: discover-version-info
  intent: Check what version of Docker is installed and print the full version
    output, including the Go version and build information.
  assert:
    - ran: docker.*version
    - output_contains: Version
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: data/cli/engine/docker_version.yaml#version
- id: discover-system-info
  intent: Use Docker to display system-wide information about the Docker
    installation, including the number of containers and images.
  assert:
    - ran: docker.*info
    - output_contains: Containers
  setup: []
  max_turns: 4
  difficulty: easy
  category: command-discovery
  docs_origin: data/cli/engine/docker_info.yaml#info
- id: build-simple-image
  intent: Create a Dockerfile for an Alpine-based image that runs 'echo
    bench-build-success' as its default command. Build the image with the tag
    'bench-simple:latest'.
  assert:
    - ran: docker build
    - file_exists: Dockerfile
    - verify:
        run: docker images bench-simple --format '{{.Repository}}:{{.Tag}}'
        output_contains: bench-simple:latest
  setup:
    - docker rmi -f bench-simple:latest 2>/dev/null || true
  max_turns: 5
  difficulty: medium
  category: build
  docs_origin: data/cli/engine/docker_build.yaml#build
- id: build-with-arg
  intent: Create a Dockerfile that uses a build argument called GREETING with a
    default value of 'hello'. The Dockerfile should be Alpine-based and use
    'echo $GREETING' as its CMD. Build the image as 'bench-arg:latest' passing
    GREETING='bench-greet-ok' as a build arg.
  assert:
    - ran: docker build
    - ran: docker.*--build-arg
    - verify:
        run: docker run --rm bench-arg:latest
        output_contains: bench-greet-ok
  setup:
    - docker rmi -f bench-arg:latest 2>/dev/null || true
  max_turns: 6
  difficulty: medium
  category: build
  docs_origin: data/cli/engine/docker_build.yaml#build-arg
- id: tag-image
  intent: Pull the 'alpine:latest' image if not already present, then create a new
    tag 'bench-alpine:v1' pointing to it. Verify the new tag exists by listing
    images filtered to 'bench-alpine'.
  assert:
    - ran: docker tag
    - verify:
        run: docker images bench-alpine --format '{{.Repository}}:{{.Tag}}'
        output_contains: bench-alpine:v1
  setup:
    - docker rmi -f bench-alpine:v1 2>/dev/null || true
  max_turns: 5
  difficulty: medium
  category: build
  docs_origin: data/cli/engine/docker_tag.yaml#tag
- id: run-detached-and-list
  intent: Run an alpine container named 'bench-bg' in detached mode with the
    command 'sleep 300'. Then list running containers and confirm 'bench-bg'
    appears in the output.
  assert:
    - ran: docker run
    - ran: docker ps
    - output_contains: bench-bg
  setup:
    - docker rm -f bench-bg 2>/dev/null || true
  max_turns: 5
  difficulty: medium
  category: containers
  docs_origin: _vendor/github.com/docker/cli/docs/reference/run.md#Foreground and background
- id: exec-in-container
  intent: Start an alpine container named 'bench-exec' in detached mode running
    'sleep 300'. Then use docker exec to run 'cat /etc/os-release' inside the
    container and print the output.
  assert:
    - ran: docker run
    - ran: docker exec
    - output_contains: Alpine
  setup:
    - docker rm -f bench-exec 2>/dev/null || true
  max_turns: 5
  difficulty: medium
  category: containers
  docs_origin: data/cli/engine/docker_exec.yaml#exec
- id: container-logs
  intent: Run an alpine container named 'bench-logger' that executes 'echo
    bench-log-line-1 && echo bench-log-line-2 && echo bench-log-line-3'. Then
    retrieve the last 2 lines of its logs using docker logs with the --tail
    flag.
  assert:
    - ran: docker run
    - ran: docker logs
    - output_contains: bench-log-line
  setup:
    - docker rm -f bench-logger 2>/dev/null || true
  max_turns: 6
  difficulty: medium
  category: containers
  docs_origin: data/cli/engine/docker_logs.yaml#logs
- id: error-nonexistent-image
  intent: Try to run a container from the image 'bench-does-not-exist:fake'.
    Capture the exit code and write it to bench-exit-code.txt. The command
    should fail because the image does not exist.
  assert:
    - ran: docker run
    - file_exists: bench-exit-code.txt
  setup: []
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: data/cli/engine/docker_run.yaml#run
- id: error-remove-running
  intent: Run an alpine container named 'bench-running' in detached mode (sleep
    300). Try to remove it without the force flag (docker rm bench-running). The
    removal should fail. Then remove it with --force and confirm it is gone by
    listing containers.
  assert:
    - ran: docker run
    - ran: docker rm
    - verify:
        run: docker ps -a --filter name=bench-running --format '{{.Names}}'
        output_equals: ""
  setup:
    - docker rm -f bench-running 2>/dev/null || true
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: data/cli/engine/docker_rm.yaml#rm
- id: error-stop-and-remove
  intent: Run an alpine container named 'bench-stopper' in detached mode (sleep
    300). Stop it with docker stop, then remove it with docker rm. Verify it no
    longer appears in docker ps -a.
  assert:
    - ran: docker stop
    - ran: docker rm
    - verify:
        run: docker ps -a --filter name=bench-stopper --format '{{.Names}}'
        output_equals: ""
  setup:
    - docker rm -f bench-stopper 2>/dev/null || true
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: data/cli/engine/docker_stop.yaml#stop
- id: workflow-volume-roundtrip
  intent: Create a Docker volume named 'bench-data-vol'. Run an alpine container
    that mounts this volume at /data and writes 'volume-roundtrip-ok' to
    /data/bench-result.txt. Then run a second alpine container mounting the same
    volume and read the file with 'cat /data/bench-result.txt'. Print the
    contents.
  assert:
    - ran: docker volume
    - ran: docker run
    - output_contains: volume-roundtrip-ok
  setup:
    - docker volume rm bench-data-vol 2>/dev/null || true
  max_turns: 8
  difficulty: hard
  category: multi-step-workflow
  docs_origin: data/cli/engine/docker_volume_create.yaml#volume create
- id: workflow-copy-file-out
  intent: Run an alpine container named 'bench-copier' in detached mode (sleep
    300). Use docker exec to create a file /tmp/bench-report.txt inside the
    container with the content 'copy-extract-ok'. Then use docker cp to copy
    that file from the container to the current directory. Print the contents of
    the local bench-report.txt.
  assert:
    - ran: docker run
    - ran: docker cp
    - file_exists: bench-report.txt
    - file_contains:
        path: bench-report.txt
        text: copy-extract-ok
  setup:
    - docker rm -f bench-copier 2>/dev/null || true
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: data/cli/engine/docker_cp.yaml#cp
- id: workflow-build-run-inspect
  intent: Create a Dockerfile based on alpine that sets the environment variable
    APP_MODE=benchmark and runs 'sleep 300' as its command. Build it as
    'bench-full:latest'. Run a container named 'bench-full-test' from that image
    in detached mode. Then use docker inspect with a format template to print
    only the APP_MODE environment variable value.
  assert:
    - ran: docker build
    - ran: docker run
    - ran: docker inspect
    - output_contains: benchmark
  setup:
    - docker rm -f bench-full-test 2>/dev/null || true
    - docker rmi -f bench-full:latest 2>/dev/null || true
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: _vendor/github.com/docker/cli/docs/reference/run.md#ENV
    (environment variables)
- id: workflow-network-communication
  intent: Create a Docker network named 'bench-net'. Run an alpine container named
    'bench-server' in detached mode on that network (sleep 300). Run a second
    alpine container on the same network that pings 'bench-server' by name with
    'ping -c 1 bench-server'. The ping should succeed, proving DNS-based
    container discovery works.
  assert:
    - ran: docker network
    - ran: docker run
    - output_contains: bench-server
  setup:
    - docker rm -f bench-server 2>/dev/null || true
    - docker network rm bench-net 2>/dev/null || true
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: data/cli/engine/docker_network_create.yaml#network create
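
To make the `assert` entries in the suite above more concrete, here is a rough sketch of how a harness might evaluate them. It is not the @cliwatch/cli-bench implementation, which the page does not show. The `Transcript` class, the `check` and `task_passes` functions, and the reading of `ran` as a regular expression searched in each command (suggested by patterns like `docker.*version`) are all assumptions. The `verify` kind is left out because it would require running a real Docker command.

```python
import re
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Transcript:
    """Hypothetical record of one task run: commands issued and output seen."""
    commands: list[str] = field(default_factory=list)
    output: str = ""
    workdir: Path = Path(".")


def check(assertion: dict, t: Transcript) -> bool:
    """Evaluate one single-key assertion, e.g. {'ran': 'docker.*version'}."""
    (kind, arg), = assertion.items()
    if kind == "ran":
        # Assumption: the value is a regex searched within each command line.
        return any(re.search(arg, cmd) for cmd in t.commands)
    if kind == "output_contains":
        return arg in t.output
    if kind == "file_exists":
        return (t.workdir / arg).is_file()
    if kind == "file_contains":
        path = t.workdir / arg["path"]
        return path.is_file() and arg["text"] in path.read_text()
    # 'verify' (and anything else) is deliberately unsupported in this sketch.
    raise ValueError(f"unsupported assertion kind: {kind}")


def task_passes(assertions: list[dict], t: Transcript) -> bool:
    """A task passes only if every one of its assertions holds."""
    return all(check(a, t) for a in assertions)


# Made-up transcript for the discover-version-info task.
demo = Transcript(commands=["docker version"], output="Client:\n Version: example\n")
print(task_passes([{"ran": "docker.*version"}, {"output_contains": "Version"}], demo))
```

Under these assumptions the demo prints `True`. Checks like `output_contains: busybox` are fairly loose, so a pass here confirms that the expected command ran and the marker text appeared, not that every detail of the intent was satisfied.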
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 18 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for Docker — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
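
The gate logic can be sketched in a few lines. The example below reuses the threshold values from the config above and the sample pass rates from the table earlier in this section. Mapping display names such as "Sonnet 4.5" to IDs such as `claude-sonnet-4-5` is an assumption, as is the `failing_models` helper and the choice to treat a model with no result as failing. How CLIWatch actually implements the gate is not shown on this page.

```python
# Minimum pass rates (percent), copied from the example thresholds config.
thresholds = {"claude-sonnet-4-5": 80, "gpt-4.1": 75, "claude-haiku-4-5": 60}

# Sample pass rates (percent) from the table above; the ID mapping is assumed.
pass_rates = {"claude-sonnet-4-5": 95, "gpt-4.1": 80, "claude-haiku-4-5": 65}


def failing_models(pass_rates: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return models whose pass rate is below its threshold.

    Design choice for this sketch: a model with no recorded pass rate
    is treated as 0% and therefore fails the gate.
    """
    return sorted(
        model
        for model, minimum in thresholds.items()
        if pass_rates.get(model, 0) < minimum
    )


failed = failing_models(pass_rates, thresholds)
exit_code = 1 if failed else 0  # a CI step would exit with this code
print("gate:", "FAIL " + ", ".join(failed) if failed else "PASS")
```

With the sample numbers every model clears its threshold, so the gate passes. Lowering the `gpt-4.1` pass rate to 70 would make the helper return `["gpt-4.1"]` and the CI step would fail.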
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.