# agent runs build targets
$ make build
  gcc -Wall -O2 -o app main.c
 
$ make test
  Running tests...
  All 12 tests passed.

Can AI agents use GNU Make?

The classic build automation tool. Agents use it to run build targets, manage dependencies between tasks, and automate project workflows.


100% overall pass rate · 1 model tested · 4 tasks · GNU Make 4.3 · 3/6/2026

GNU Make eval results by model

Model	Pass rate	Avg turns	Avg tokens
gpt-5-nano	100%	2.8	14.5k

GNU Make task results by model

Task	Difficulty	Description	gpt-5-nano
basic-makefile	easy	Create a Makefile with a 'hello' target that prints 'Hello, Make!' to stdout. Run the target.	✓ 2 turns · 8.0k tokens
multi-target	medium	Create a Makefile with three targets: 'build' that creates a file called build.txt containing 'built', 'clean' that removes build.txt, and 'all' that depends on build. Make 'all' the default target. Run 'make' and verify build.txt exists.	✓ 3 turns · 13.5k tokens
variables-and-patterns	easy	Create a Makefile that defines a variable CC=gcc and a variable CFLAGS=-Wall -O2. Add a target 'info' that prints both variables. Run 'make info'.	✓ 4 turns · 7.6k tokens
phony-and-help	medium	Create a Makefile with .PHONY declarations for 'test', 'lint', and 'help' targets. The 'help' target should parse the Makefile and print each target with a description from ## comments. The 'test' target should echo 'running tests'. The 'lint' target should echo 'running linter'. Run 'make help'.	✓ 2 turns · 28.8k tokens
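
The averages in the gpt-5-nano summary row can be reproduced from the per-task figures. The following is a quick check, assuming the page reports simple unweighted means rounded to one decimal place; the dictionary layout and variable names are illustrative only.

```python
# Per-task figures for gpt-5-nano, copied from the task results table.
results = {
    "basic-makefile": {"turns": 2, "tokens_k": 8.0},
    "multi-target": {"turns": 3, "tokens_k": 13.5},
    "variables-and-patterns": {"turns": 4, "tokens_k": 7.6},
    "phony-and-help": {"turns": 2, "tokens_k": 28.8},
}

# Assumption: averages are unweighted means over the four tasks.
avg_turns = sum(r["turns"] for r in results.values()) / len(results)
avg_tokens_k = sum(r["tokens_k"] for r in results.values()) / len(results)

print(f"avg turns: {avg_turns:.2f}")        # reported on the page as 2.8
print(f"avg tokens: {avg_tokens_k:.1f}k")   # reported on the page as 14.5k
```

Under that assumption the numbers agree with the summary row, so the phony-and-help task alone accounts for about half of the total token usage.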

Task suite source (61 lines, YAML)

- id: basic-makefile
  intent: Create a Makefile with a 'hello' target that prints 'Hello, Make!' to
    stdout. Run the target.
  assert:
    - file_exists: Makefile
    - ran: make.*hello|make$
    - output_contains: Hello, Make!
  setup: []
  max_turns: 3
  difficulty: easy
  category: basics
- id: multi-target
  intent: "Create a Makefile with three targets: 'build' that creates a file
    called build.txt containing 'built', 'clean' that removes build.txt, and
    'all' that depends on build. Make 'all' the default target. Run 'make' and
    verify build.txt exists."
  assert:
    - file_exists: Makefile
    - ran: make
    - file_exists: build.txt
    - file_contains:
        path: build.txt
        text: built
  setup: []
  max_turns: 5
  difficulty: medium
  category: targets
- id: variables-and-patterns
  intent: Create a Makefile that defines a variable CC=gcc and a variable
    CFLAGS=-Wall -O2. Add a target 'info' that prints both variables. Run 'make
    info'.
  assert:
    - file_exists: Makefile
    - file_contains:
        path: Makefile
        text: CC
    - ran: make info
    - output_contains: gcc
  setup: []
  max_turns: 4
  difficulty: easy
  category: variables
- id: phony-and-help
  intent: "Create a Makefile with .PHONY declarations for 'test', 'lint', and
    'help' targets. The 'help' target should parse the Makefile and print each
    target with a description from ## comments. The 'test' target should echo
    'running tests'. The 'lint' target should echo 'running linter'. Run 'make
    help'."
  assert:
    - file_exists: Makefile
    - file_contains:
        path: Makefile
        text: .PHONY
    - ran: make help
    - output_contains: test
    - output_contains: lint
  setup: []
  max_turns: 6
  difficulty: medium
  category: workflow
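
To make the assertion types in the suite concrete, here is a minimal sketch of how they might be evaluated. This is not the @cliwatch/cli-bench implementation: the `check` function, its parameters, and the choice to treat `ran` as a regular expression searched against each command are all assumptions made for illustration.

```python
import os
import re
import tempfile


def check(assertion, workdir, commands, output):
    """Evaluate one single-key assertion dict; return True if it holds."""
    (kind, arg), = assertion.items()
    if kind == "file_exists":
        return os.path.isfile(os.path.join(workdir, arg))
    if kind == "file_contains":
        path = os.path.join(workdir, arg["path"])
        if not os.path.isfile(path):
            return False
        with open(path, encoding="utf-8") as fh:
            return arg["text"] in fh.read()
    if kind == "ran":
        # Assumption: the value is a regex searched against each command run.
        return any(re.search(arg, cmd) for cmd in commands)
    if kind == "output_contains":
        return arg in output
    raise ValueError(f"unknown assertion kind: {kind}")


# The multi-target task's assertions, written as Python literals.
multi_target = [
    {"file_exists": "Makefile"},
    {"ran": "make"},
    {"file_exists": "build.txt"},
    {"file_contains": {"path": "build.txt", "text": "built"}},
]

with tempfile.TemporaryDirectory() as workdir:
    # Simulate the files an agent would leave behind; make is not invoked here.
    with open(os.path.join(workdir, "Makefile"), "w", encoding="utf-8") as fh:
        fh.write("all: build\n\nbuild:\n\techo built > build.txt\n\n"
                 "clean:\n\trm -f build.txt\n")
    with open(os.path.join(workdir, "build.txt"), "w", encoding="utf-8") as fh:
        fh.write("built\n")
    passed = all(check(a, workdir, ["make"], "") for a in multi_target)

print(passed)
```

Read as a regex, the `make.*hello|make$` pattern in basic-makefile would accept either `make hello` or a bare `make` at the end of a command, which fits a Makefile whose only target is `hello`.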

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for GNU Make — see the latest run. Set up the same for your CLI in minutes.

Model	Pass Rate	Delta
Sonnet 4.5	95%	+5%
GPT-4.1	80%	-5%
Haiku 4.5	65%	-10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.

[Chart: pass rate over the last 30 days, v1.0 to v1.6]

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.
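
The gating rule can be sketched in a few lines. This is an illustrative sketch only, not CLIWatch's code: the `failing_models` helper is hypothetical, the pass rates are the sample figures from the PR comment table, and mapping names such as "Sonnet 4.5" to `claude-sonnet-4-5` is an assumption.

```python
# Thresholds from the sample config, as integer percentages.
thresholds = {"claude-sonnet-4-5": 80, "gpt-4.1": 75, "claude-haiku-4-5": 60}

# Sample pass rates from the PR comment table (model-name mapping assumed).
pass_rates = {"claude-sonnet-4-5": 95, "gpt-4.1": 80, "claude-haiku-4-5": 65}


def failing_models(pass_rates, thresholds):
    """Return models whose pass rate is below its threshold.

    A model with a threshold but no recorded pass rate counts as 0%.
    """
    return sorted(
        model
        for model, minimum in thresholds.items()
        if pass_rates.get(model, 0) < minimum
    )


failures = failing_models(pass_rates, thresholds)
exit_code = 1 if failures else 0  # a CI step would exit with this code

print("failing models:", failures)
print("exit code:", exit_code)
```

With the sample numbers every model clears its threshold, so the gate passes; lowering any pass rate below its threshold would produce a non-zero exit code.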

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.

