# agent builds and tests a Rust project $ cargo test Compiling mylib v0.1.0 Running unittests src/lib.rs test result: ok. 3 passed; 0 failed
Can AI agents use Cargo?
The Rust package manager and build system. Agents use it to create projects, manage dependencies, run tests, and build optimized binaries.
See the latest run →Cargo eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 63% | 3.6 | 10.4k |
Cargo task results by model
| Task | gpt-5-nano |
|---|---|
error-fix-test-failurehard A library project bench-failtest has a failing test. Run cargo test to see which test fails, then fix the code in src/lib.rs so all tests pass. The function 'double' should return its argument multiplied by 2. | ✗6t |
error-missing-mainhard A project bench-nomain was created as a binary but its src/main.rs is missing. Try to build it, observe the error, then create a valid src/main.rs that prints 'recovered' and build again successfully. | ✗6t |
workflow-create-test-buildhard Create a new library project called bench-mathlib. Add a public function 'add(a: i32, b: i32) -> i32' that returns the sum, and a test that verifies add(2, 3) == 5. Run the tests to confirm they pass, then build the project in release mode. | ✗6t |
workflow-workspace-depshard Create a Cargo workspace called bench-ws with two members: 'bench-ws/utils' (a library with a public function 'greet(name: &str) -> String' that returns a greeting) and 'bench-ws/app' (a binary that depends on utils and calls greet). Edit bench-ws/app/Cargo.toml to add utils as a path dependency. Run the app and confirm it prints a greeting. | ✓2t |
workflow-profile-comparehard Create a new binary project called bench-profiles. Edit the Cargo.toml to add a custom profile called 'bench-fast' that inherits from release and sets opt-level to 3 and lto to true. Build the project once in dev mode and once with the bench-fast profile (using --profile bench-fast). Write the sizes of both binaries (in bytes) to a file called bench-sizes.txt, one per line, labeled 'dev: <size>' and 'bench-fast: <size>'. | ✗8t |
workflow-fmt-clippy-testhard Create a new library project called bench-quality. Write a function 'is_even(n: i32) -> bool' that returns true if n is even, and add a test for it. Then run cargo fmt to format the code, cargo clippy to check for lints, and cargo test to verify correctness. All three commands should succeed. | ✓6t |
quickstart-new-binaryeasy Create a new Rust binary project called bench-hello using cargo, then build it. | ✓2t |
quickstart-new-libraryeasy Create a new Rust library project called bench-mylib using cargo with the --lib flag. | ✓3t |
quickstart-build-runeasy Create a new Rust binary project called bench-greeter. Edit src/main.rs so it prints 'Hello CLIWatch'. Then run the project using cargo run. | ✓3t |
discover-versioneasy Check what version of cargo is installed and print it. | ✓1t |
discover-help-subcommandseasy Show the list of available cargo subcommands by running cargo's help. The output should include subcommands like build, test, and run. | ✓1t |
config-edition-metadatamedium Create a new binary project called bench-meta using cargo. Then edit its Cargo.toml to set the description to 'A benchmark project', the license to 'MIT', and the edition to '2021'. Print the contents of the final Cargo.toml. | ✓5t |
config-featuresmedium Create a new library project called bench-featured using cargo. Edit its Cargo.toml to define two features: 'logging' (no dependencies) and 'full' (which enables the 'logging' feature). Also add a default feature that enables 'logging'. Print the final Cargo.toml. | ✓4t |
config-workspacemedium Create a Cargo workspace in a directory called bench-workspace. The workspace should contain two member packages: bench-workspace/core (a library) and bench-workspace/cli (a binary). The root Cargo.toml should define the workspace members, and each member should have its own Cargo.toml. Verify the setup compiles by running cargo check from the workspace root. | ✓8t |
flags-release-buildmedium Create a new binary project called bench-release using cargo. Build it in release mode. Verify that the release binary exists at bench-release/target/release/bench-release. | ✓3t |
flags-test-filtermedium Create a new library project called bench-tests. Add two test functions in src/lib.rs: 'test_addition' that asserts 2+3==5, and 'test_subtraction' that asserts 5-2==3. Run only the test_addition test by name using cargo test. | ✓5t |
flags-message-format-jsonmedium Create a new binary project called bench-jsonbuild. Build it using cargo build with --message-format=json so the build output is JSON. Save the JSON build output to a file called bench-build-output.json. | ✗5t |
flags-metadatamedium Create a new binary project called bench-metaquery. Run cargo metadata on it with --no-deps and --format-version=1. Save the JSON output to bench-metadata.json. | ✗5t |
error-fix-compilehard A binary project bench-fixme has a compile error in src/main.rs (a missing semicolon). Try to build it, observe the error, then fix the code so it compiles successfully and run it. | ✗6t |
Task suite source319 lines · YAML
- id: quickstart-new-binary
intent: Create a new Rust binary project called bench-hello using cargo, then
build it.
assert:
- ran: cargo
- file_exists: bench-hello/Cargo.toml
- file_exists: bench-hello/src/main.rs
setup: []
max_turns: 3
difficulty: easy
category: getting-started
docs_origin: src/doc/src/getting-started/first-steps.md#First Steps with Cargo
- id: quickstart-new-library
intent: Create a new Rust library project called bench-mylib using cargo with
the --lib flag.
assert:
- ran: cargo
- file_exists: bench-mylib/Cargo.toml
- file_exists: bench-mylib/src/lib.rs
setup: []
max_turns: 3
difficulty: easy
category: getting-started
docs_origin: src/doc/src/guide/creating-a-new-project.md#Creating a New Package
- id: quickstart-build-run
intent: Create a new Rust binary project called bench-greeter. Edit src/main.rs
so it prints 'Hello CLIWatch'. Then run the project using cargo run.
assert:
- ran: cargo
- file_exists: bench-greeter/Cargo.toml
- output_contains: Hello CLIWatch
setup: []
max_turns: 5
difficulty: easy
category: getting-started
docs_origin: src/doc/src/getting-started/first-steps.md#First Steps with Cargo
- id: discover-version
intent: Check what version of cargo is installed and print it.
assert:
- ran: cargo
- output_contains: cargo
setup: []
max_turns: 3
difficulty: easy
category: command-discovery
docs_origin: src/doc/src/commands/cargo.md#SYNOPSIS
- id: discover-help-subcommands
intent: Show the list of available cargo subcommands by running cargo's help.
The output should include subcommands like build, test, and run.
assert:
- ran: cargo.*--help|cargo.*-h|cargo help|cargo --list
- output_contains: build
setup: []
max_turns: 3
difficulty: easy
category: command-discovery
docs_origin: src/doc/src/commands/cargo.md#DESCRIPTION
- id: config-edition-metadata
intent: Create a new binary project called bench-meta using cargo. Then edit its
Cargo.toml to set the description to 'A benchmark project', the license to
'MIT', and the edition to '2021'. Print the contents of the final
Cargo.toml.
assert:
- ran: cargo
- file_exists: bench-meta/Cargo.toml
- file_contains:
path: bench-meta/Cargo.toml
text: A benchmark project
- file_contains:
path: bench-meta/Cargo.toml
text: MIT
setup: []
max_turns: 5
difficulty: medium
category: config
docs_origin: src/doc/src/reference/manifest.md#The `[package]` section
- id: config-features
intent: "Create a new library project called bench-featured using cargo. Edit
its Cargo.toml to define two features: 'logging' (no dependencies) and
'full' (which enables the 'logging' feature). Also add a default feature
that enables 'logging'. Print the final Cargo.toml."
assert:
- ran: cargo
- file_exists: bench-featured/Cargo.toml
- file_contains:
path: bench-featured/Cargo.toml
text: logging
- file_contains:
path: bench-featured/Cargo.toml
text: full
- file_contains:
path: bench-featured/Cargo.toml
text: default
setup: []
max_turns: 6
difficulty: medium
category: config
docs_origin: src/doc/src/reference/features.md#The `[features]` section
- id: config-workspace
intent: "Create a Cargo workspace in a directory called bench-workspace. The
workspace should contain two member packages: bench-workspace/core (a
library) and bench-workspace/cli (a binary). The root Cargo.toml should
define the workspace members, and each member should have its own
Cargo.toml. Verify the setup compiles by running cargo check from the
workspace root."
assert:
- ran: cargo
- file_exists: bench-workspace/Cargo.toml
- file_exists: bench-workspace/core/Cargo.toml
- file_exists: bench-workspace/cli/Cargo.toml
- file_contains:
path: bench-workspace/Cargo.toml
text: workspace
setup: []
max_turns: 8
difficulty: medium
category: config
docs_origin: src/doc/src/reference/workspaces.md#Workspaces
- id: flags-release-build
intent: Create a new binary project called bench-release using cargo. Build it
in release mode. Verify that the release binary exists at
bench-release/target/release/bench-release.
assert:
- ran: cargo.*--release|cargo.*-r
- file_exists: bench-release/target/release/bench-release
setup: []
max_turns: 5
difficulty: medium
category: flag-parsing
docs_origin: src/doc/src/commands/cargo-build.md#Compilation Options
- id: flags-test-filter
intent: "Create a new library project called bench-tests. Add two test functions
in src/lib.rs: 'test_addition' that asserts 2+3==5, and 'test_subtraction'
that asserts 5-2==3. Run only the test_addition test by name using cargo
test."
assert:
- ran: cargo test
- output_contains: test_addition
- output_contains: 1 passed
setup: []
max_turns: 6
difficulty: medium
category: flag-parsing
docs_origin: src/doc/src/commands/cargo-test.md#DESCRIPTION
- id: flags-message-format-json
intent: Create a new binary project called bench-jsonbuild. Build it using cargo
build with --message-format=json so the build output is JSON. Save the JSON
build output to a file called bench-build-output.json.
assert:
- ran: cargo.*--message-format
- file_exists: bench-build-output.json
- file_contains:
path: bench-build-output.json
text: reason
setup: []
max_turns: 6
difficulty: medium
category: flag-parsing
docs_origin: src/doc/src/commands/cargo-build.md#Output Options
- id: flags-metadata
intent: Create a new binary project called bench-metaquery. Run cargo metadata
on it with --no-deps and --format-version=1. Save the JSON output to
bench-metadata.json.
assert:
- ran: cargo metadata
- file_exists: bench-metadata.json
- file_contains:
path: bench-metadata.json
text: bench-metaquery
setup: []
max_turns: 5
difficulty: medium
category: flag-parsing
docs_origin: src/doc/src/commands/cargo-metadata.md#DESCRIPTION
- id: error-fix-compile
intent: A binary project bench-fixme has a compile error in src/main.rs (a
missing semicolon). Try to build it, observe the error, then fix the code so
it compiles successfully and run it.
assert:
- ran: cargo
- output_contains: x is 5
setup:
- cargo new bench-fixme
- |
cat > bench-fixme/src/main.rs << 'EOF'
fn main() {
let x = 5
println!("x is {}", x);
}
EOF
max_turns: 6
difficulty: hard
category: error-recovery
docs_origin: src/doc/src/getting-started/first-steps.md#First Steps with Cargo
- id: error-fix-test-failure
intent: A library project bench-failtest has a failing test. Run cargo test to
see which test fails, then fix the code in src/lib.rs so all tests pass. The
function 'double' should return its argument multiplied by 2.
assert:
- ran: cargo test
- output_contains: 2 passed
setup:
- cargo new bench-failtest --lib
- |
cat > bench-failtest/src/lib.rs << 'EOF'
pub fn double(x: i32) -> i32 {
x + x + 1
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_double_two() {
assert_eq!(double(2), 4);
}
#[test]
fn test_double_zero() {
assert_eq!(double(0), 0);
}
}
EOF
max_turns: 6
difficulty: hard
category: error-recovery
docs_origin: src/doc/src/guide/tests.md#Tests
- id: error-missing-main
intent: A project bench-nomain was created as a binary but its src/main.rs is
missing. Try to build it, observe the error, then create a valid src/main.rs
that prints 'recovered' and build again successfully.
assert:
- ran: cargo
- file_exists: bench-nomain/src/main.rs
- output_contains: recovered
setup:
- cargo new bench-nomain
- rm bench-nomain/src/main.rs
max_turns: 6
difficulty: hard
category: error-recovery
docs_origin: src/doc/src/guide/project-layout.md#Package Layout
- id: workflow-create-test-build
intent: "Create a new library project called bench-mathlib. Add a public
function 'add(a: i32, b: i32) -> i32' that returns the sum, and a test that
verifies add(2, 3) == 5. Run the tests to confirm they pass, then build the
project in release mode."
assert:
- ran: cargo test
- ran: cargo build.*--release|cargo build.*-r
- file_contains:
path: bench-mathlib/src/lib.rs
text: fn add
- output_contains: 1 passed
setup: []
max_turns: 8
difficulty: hard
category: multi-step-workflow
docs_origin: src/doc/src/guide/tests.md#Tests
- id: workflow-workspace-deps
intent: "Create a Cargo workspace called bench-ws with two members:
'bench-ws/utils' (a library with a public function 'greet(name: &str) ->
String' that returns a greeting) and 'bench-ws/app' (a binary that depends
on utils and calls greet). Edit bench-ws/app/Cargo.toml to add utils as a
path dependency. Run the app and confirm it prints a greeting."
assert:
- ran: cargo
- file_exists: bench-ws/Cargo.toml
- file_exists: bench-ws/utils/src/lib.rs
- file_exists: bench-ws/app/src/main.rs
- file_contains:
path: bench-ws/app/Cargo.toml
text: utils
setup: []
max_turns: 10
difficulty: hard
category: multi-step-workflow
docs_origin: src/doc/src/reference/workspaces.md#Workspaces
- id: workflow-profile-compare
intent: "Create a new binary project called bench-profiles. Edit the Cargo.toml
to add a custom profile called 'bench-fast' that inherits from release and
sets opt-level to 3 and lto to true. Build the project once in dev mode and
once with the bench-fast profile (using --profile bench-fast). Write the
sizes of both binaries (in bytes) to a file called bench-sizes.txt, one per
line, labeled 'dev: <size>' and 'bench-fast: <size>'."
assert:
- ran: cargo build
- ran: cargo.*--profile.*bench-fast
- file_exists: bench-sizes.txt
- file_contains:
path: bench-sizes.txt
text: "dev:"
- file_contains:
path: bench-sizes.txt
text: "bench-fast:"
setup: []
max_turns: 10
difficulty: hard
category: multi-step-workflow
docs_origin: src/doc/src/reference/profiles.md#Profiles
- id: workflow-fmt-clippy-test
intent: "Create a new library project called bench-quality. Write a function
'is_even(n: i32) -> bool' that returns true if n is even, and add a test for
it. Then run cargo fmt to format the code, cargo clippy to check for lints,
and cargo test to verify correctness. All three commands should succeed."
assert:
- ran: cargo fmt
- ran: cargo clippy
- ran: cargo test
- file_contains:
path: bench-quality/src/lib.rs
text: is_even
setup: []
max_turns: 10
difficulty: hard
category: multi-step-workflow
docs_origin: src/doc/src/commands/cargo-fmt.md#DESCRIPTION
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 19 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for Cargo — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
claude-sonnet-4-5: 80%
gpt-4.1: 75%
claude-haiku-4-5: 60%Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.