# agent runs build targets
$ make build
gcc -Wall -O2 -o app main.c
$ make test
Running tests...
All 12 tests passed.
Can AI agents use GNU Make?
The classic build automation tool. Agents use it to run build targets, manage dependencies between tasks, and automate project workflows.
See the latest run →
GNU Make eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 100% | 2.8 | 14.5k |
GNU Make task results by model
| Task | gpt-5-nano |
|---|---|
| basic-makefile (easy): Create a Makefile with a 'hello' target that prints 'Hello, Make!' to stdout. Run the target. | ✓ 2t |
| multi-target (medium): Create a Makefile with three targets: 'build' that creates a file called build.txt containing 'built', 'clean' that removes build.txt, and 'all' that depends on build. Make 'all' the default target. Run 'make' and verify build.txt exists. | ✓ 3t |
| variables-and-patterns (easy): Create a Makefile that defines a variable CC=gcc and a variable CFLAGS=-Wall -O2. Add a target 'info' that prints both variables. Run 'make info'. | ✓ 4t |
| phony-and-help (medium): Create a Makefile with .PHONY declarations for 'test', 'lint', and 'help' targets. The 'help' target should parse the Makefile and print each target with a description from ## comments. The 'test' target should echo 'running tests'. The 'lint' target should echo 'running linter'. Run 'make help'. | ✓ 2t |
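The summary row can be checked against the per-task results. The sketch below assumes that the number after each ✓ is the turn count for that task (for example, `2t` read as 2 turns); the variable names are illustrative only.

```python
# Per-task turn counts, read from the task table (assumption: "2t" means 2 turns).
turns = {
    "basic-makefile": 2,
    "multi-target": 3,
    "variables-and-patterns": 4,
    "phony-and-help": 2,
}
# Every task shows a check mark for gpt-5-nano.
passed = {task: True for task in turns}

pass_rate = 100 * sum(passed.values()) / len(passed)  # percentage of tasks passed
avg_turns = sum(turns.values()) / len(turns)          # mean turns per task

print(pass_rate, avg_turns)
```

Under that reading, the mean is 2.75 turns, which is consistent with the 2.8 shown in the model table once rounded to one decimal place.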
Task suite source (61 lines, YAML)
- id: basic-makefile
  intent: Create a Makefile with a 'hello' target that prints 'Hello, Make!' to
    stdout. Run the target.
  assert:
    - file_exists: Makefile
    - ran: make.*hello|make$
    - output_contains: Hello, Make!
  setup: []
  max_turns: 3
  difficulty: easy
  category: basics
- id: multi-target
  intent: "Create a Makefile with three targets: 'build' that creates a file
    called build.txt containing 'built', 'clean' that removes build.txt, and
    'all' that depends on build. Make 'all' the default target. Run 'make' and
    verify build.txt exists."
  assert:
    - file_exists: Makefile
    - ran: make
    - file_exists: build.txt
    - file_contains:
        path: build.txt
        text: built
  setup: []
  max_turns: 5
  difficulty: medium
  category: targets
- id: variables-and-patterns
  intent: Create a Makefile that defines a variable CC=gcc and a variable
    CFLAGS=-Wall -O2. Add a target 'info' that prints both variables. Run 'make
    info'.
  assert:
    - file_exists: Makefile
    - file_contains:
        path: Makefile
        text: CC
    - ran: make info
    - output_contains: gcc
  setup: []
  max_turns: 4
  difficulty: easy
  category: variables
- id: phony-and-help
  intent: "Create a Makefile with .PHONY declarations for 'test', 'lint', and
    'help' targets. The 'help' target should parse the Makefile and print each
    target with a description from ## comments. The 'test' target should echo
    'running tests'. The 'lint' target should echo 'running linter'. Run 'make
    help'."
  assert:
    - file_exists: Makefile
    - file_contains:
        path: Makefile
        text: .PHONY
    - ran: make help
    - output_contains: test
    - output_contains: lint
  setup: []
  max_turns: 6
  difficulty: medium
  category: workflow
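To make the `assert` entries in the suite concrete, here is a minimal sketch of how such checks might be evaluated. It is not the @cliwatch/cli-bench implementation. The function `check_assertions` and the semantics are assumptions: `ran` is treated as a regular expression searched against each command the agent ran, `output_contains` and `file_contains` as substring tests, and `file_exists` as a lookup in an in-memory mapping of paths to file contents.

```python
import re

def check_assertions(assertions, commands, output, files):
    """Return a list of (kind, passed) pairs, one per assertion.

    Assumed semantics (hypothetical, not taken from cli-bench):
      ran             -> regex searched against each command the agent ran
      output_contains -> substring test on the combined output
      file_exists     -> path is a key of `files` (a path -> content dict)
      file_contains   -> substring test on that file's content
    """
    results = []
    for assertion in assertions:
        # Each assertion is a single-key mapping such as {"ran": "make"}.
        (kind, arg), = assertion.items()
        if kind == "ran":
            passed = any(re.search(arg, cmd) for cmd in commands)
        elif kind == "output_contains":
            passed = arg in output
        elif kind == "file_exists":
            passed = arg in files
        elif kind == "file_contains":
            passed = arg["text"] in files.get(arg["path"], "")
        else:
            raise ValueError(f"unknown assertion: {kind}")
        results.append((kind, passed))
    return results

# The basic-makefile assertions, written out as Python data.
basic_makefile = [
    {"file_exists": "Makefile"},
    {"ran": r"make.*hello|make$"},
    {"output_contains": "Hello, Make!"},
]

# A hypothetical agent run: one command, its output, and the file it wrote.
results = check_assertions(
    basic_makefile,
    commands=["make hello"],
    output="Hello, Make!\n",
    files={"Makefile": "hello:\n\t@echo 'Hello, Make!'\n"},
)
print(all(passed for _, passed in results))
```

Under these assumed semantics, the pattern `make.*hello|make$` accepts either an explicit `make hello` or a bare `make`, which works when `hello` is the first target in the Makefile.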
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for GNU Make — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%
Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
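A quality gate of this kind reduces to comparing each model's pass rate with its threshold. The sketch below is illustrative only: `failed_gates` is a hypothetical helper, the pass rates are the sample figures from the table above, and the mapping of display names to model IDs (for example, Sonnet 4.5 to `claude-sonnet-4-5`) is an assumption.

```python
def failed_gates(pass_rates, thresholds):
    """Return {model: (rate, threshold)} for every model below its threshold."""
    failures = {}
    for model, threshold in thresholds.items():
        # Assumption: a model with no recorded result counts as 0%.
        rate = pass_rates.get(model, 0)
        if rate < threshold:
            failures[model] = (rate, threshold)
    return failures

# Thresholds from the sample configuration, in percent.
thresholds = {"claude-sonnet-4-5": 80, "gpt-4.1": 75, "claude-haiku-4-5": 60}
# Sample pass rates from the table above, in percent.
pass_rates = {"claude-sonnet-4-5": 95, "gpt-4.1": 80, "claude-haiku-4-5": 65}

failures = failed_gates(pass_rates, thresholds)
# A CI step would typically exit non-zero when any gate fails.
exit_code = 1 if failures else 0
print(failures, exit_code)
```

With the sample numbers every model clears its threshold, so the gate passes. Lowering the GPT-4.1 rate to 70% would put it below its 75% target and fail the check.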
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.