# agent builds and tests a Rust project
$ cargo test
   Compiling mylib v0.1.0
     Running unittests src/lib.rs
  test result: ok. 3 passed; 0 failed

Can AI agents use Cargo?

Cargo is the Rust package manager and build system. Agents use it to create projects, manage dependencies, run tests, and build optimized binaries.

63% overall pass rate · 1 model tested · 19 tasks · v1.93.1 · 3/6/2026

Cargo eval results by model

Model	Pass rate	Avg turns	Avg tokens
gpt-5-nano	63%	3.6	10.4k
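The aggregate figures can be cross-checked against the per-task results listed on this page. The sketch below transcribes those results into a table and recomputes the averages; the tuple layout and function names are illustrative assumptions. One observation worth hedging: the 3.6 average-turns figure matches the mean over the 12 passing tasks (43/12), not over all 19 (85/19, about 4.5). That reading is inferred from the numbers and is not stated by the page.

```rust
// Per-task results for gpt-5-nano, transcribed from the task table on this page:
// (passed, turns, tokens in thousands).
const RESULTS: [(bool, u32, f64); 19] = [
    (false, 6, 5.7),  // error-fix-test-failure
    (false, 6, 11.1), // error-missing-main
    (false, 6, 10.7), // workflow-create-test-build
    (true, 2, 18.8),  // workflow-workspace-deps
    (false, 8, 22.5), // workflow-profile-compare
    (true, 6, 12.5),  // workflow-fmt-clippy-test
    (true, 2, 2.3),   // quickstart-new-binary
    (true, 3, 2.5),   // quickstart-new-library
    (true, 3, 3.7),   // quickstart-build-run
    (true, 1, 1.0),   // discover-version
    (true, 1, 2.9),   // discover-help-subcommands
    (true, 5, 9.5),   // config-edition-metadata
    (true, 4, 26.8),  // config-features
    (true, 8, 23.6),  // config-workspace
    (true, 3, 4.4),   // flags-release-build
    (true, 5, 12.7),  // flags-test-filter
    (false, 5, 7.6),  // flags-message-format-json
    (false, 5, 11.3), // flags-metadata
    (false, 6, 7.2),  // error-fix-compile
];

// Share of tasks that passed, as a percentage.
fn pass_rate_percent() -> f64 {
    let passed = RESULTS.iter().filter(|r| r.0).count();
    100.0 * passed as f64 / RESULTS.len() as f64
}

// Mean token usage (in thousands) over all tasks.
fn avg_tokens_k() -> f64 {
    RESULTS.iter().map(|r| r.2).sum::<f64>() / RESULTS.len() as f64
}

// Mean turn count over passing tasks only (assumed averaging rule).
fn avg_turns_passing() -> f64 {
    let turns: Vec<u32> = RESULTS.iter().filter(|r| r.0).map(|r| r.1).collect();
    turns.iter().sum::<u32>() as f64 / turns.len() as f64
}

fn main() {
    println!("pass rate: {:.0}%", pass_rate_percent()); // 12 of 19 passed
    println!("avg tokens: {:.1}k", avg_tokens_k());
    println!("avg turns (passing tasks): {:.1}", avg_turns_passing());
}
```

Rounded to the precision used in the table, the three values come out to 63%, 10.4k, and 3.6.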

Cargo task results by model

Task	Difficulty	Intent	gpt-5-nano
error-fix-test-failure	hard	A library project bench-failtest has a failing test. Run cargo test to see which test fails, then fix the code in src/lib.rs so all tests pass. The function 'double' should return its argument multiplied by 2.	✗ · 6 turns · 5.7k tokens
error-missing-main	hard	A project bench-nomain was created as a binary but its src/main.rs is missing. Try to build it, observe the error, then create a valid src/main.rs that prints 'recovered' and build again successfully.	✗ · 6 turns · 11.1k tokens
workflow-create-test-build	hard	Create a new library project called bench-mathlib. Add a public function 'add(a: i32, b: i32) -> i32' that returns the sum, and a test that verifies add(2, 3) == 5. Run the tests to confirm they pass, then build the project in release mode.	✗ · 6 turns · 10.7k tokens
workflow-workspace-deps	hard	Create a Cargo workspace called bench-ws with two members: 'bench-ws/utils' (a library with a public function 'greet(name: &str) -> String' that returns a greeting) and 'bench-ws/app' (a binary that depends on utils and calls greet). Edit bench-ws/app/Cargo.toml to add utils as a path dependency. Run the app and confirm it prints a greeting.	✓ · 2 turns · 18.8k tokens
workflow-profile-compare	hard	Create a new binary project called bench-profiles. Edit the Cargo.toml to add a custom profile called 'bench-fast' that inherits from release and sets opt-level to 3 and lto to true. Build the project once in dev mode and once with the bench-fast profile (using --profile bench-fast). Write the sizes of both binaries (in bytes) to a file called bench-sizes.txt, one per line, labeled 'dev: <size>' and 'bench-fast: <size>'.	✗ · 8 turns · 22.5k tokens
workflow-fmt-clippy-test	hard	Create a new library project called bench-quality. Write a function 'is_even(n: i32) -> bool' that returns true if n is even, and add a test for it. Then run cargo fmt to format the code, cargo clippy to check for lints, and cargo test to verify correctness. All three commands should succeed.	✓ · 6 turns · 12.5k tokens
quickstart-new-binary	easy	Create a new Rust binary project called bench-hello using cargo, then build it.	✓ · 2 turns · 2.3k tokens
quickstart-new-library	easy	Create a new Rust library project called bench-mylib using cargo with the --lib flag.	✓ · 3 turns · 2.5k tokens
quickstart-build-run	easy	Create a new Rust binary project called bench-greeter. Edit src/main.rs so it prints 'Hello CLIWatch'. Then run the project using cargo run.	✓ · 3 turns · 3.7k tokens
discover-version	easy	Check what version of cargo is installed and print it.	✓ · 1 turn · 1.0k tokens
discover-help-subcommands	easy	Show the list of available cargo subcommands by running cargo's help. The output should include subcommands like build, test, and run.	✓ · 1 turn · 2.9k tokens
config-edition-metadata	medium	Create a new binary project called bench-meta using cargo. Then edit its Cargo.toml to set the description to 'A benchmark project', the license to 'MIT', and the edition to '2021'. Print the contents of the final Cargo.toml.	✓ · 5 turns · 9.5k tokens
config-features	medium	Create a new library project called bench-featured using cargo. Edit its Cargo.toml to define two features: 'logging' (no dependencies) and 'full' (which enables the 'logging' feature). Also add a default feature that enables 'logging'. Print the final Cargo.toml.	✓ · 4 turns · 26.8k tokens
config-workspace	medium	Create a Cargo workspace in a directory called bench-workspace. The workspace should contain two member packages: bench-workspace/core (a library) and bench-workspace/cli (a binary). The root Cargo.toml should define the workspace members, and each member should have its own Cargo.toml. Verify the setup compiles by running cargo check from the workspace root.	✓ · 8 turns · 23.6k tokens
flags-release-build	medium	Create a new binary project called bench-release using cargo. Build it in release mode. Verify that the release binary exists at bench-release/target/release/bench-release.	✓ · 3 turns · 4.4k tokens
flags-test-filter	medium	Create a new library project called bench-tests. Add two test functions in src/lib.rs: 'test_addition' that asserts 2+3==5, and 'test_subtraction' that asserts 5-2==3. Run only the test_addition test by name using cargo test.	✓ · 5 turns · 12.7k tokens
flags-message-format-json	medium	Create a new binary project called bench-jsonbuild. Build it using cargo build with --message-format=json so the build output is JSON. Save the JSON build output to a file called bench-build-output.json.	✗ · 5 turns · 7.6k tokens
flags-metadata	medium	Create a new binary project called bench-metaquery. Run cargo metadata on it with --no-deps and --format-version=1. Save the JSON output to bench-metadata.json.	✗ · 5 turns · 11.3k tokens
error-fix-compile	hard	A binary project bench-fixme has a compile error in src/main.rs (a missing semicolon). Try to build it, observe the error, then fix the code so it compiles successfully and run it.	✗ · 6 turns · 7.2k tokens
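To make the error-fix-test-failure task concrete: the task suite source on this page seeds bench-failtest/src/lib.rs with a `double` that returns `x + x + 1`, so both seeded tests fail. A minimal sketch of a corrected file, keeping the two tests from the setup block unchanged, might look like this (one possible fix, not a record of what any model produced):

```rust
// Sketch of a corrected bench-failtest/src/lib.rs.
// The seeded version returned `x + x + 1`; the task asks for the argument times 2.
pub fn double(x: i32) -> i32 {
    x * 2
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_double_two() {
        assert_eq!(double(2), 4);
    }

    #[test]
    fn test_double_zero() {
        assert_eq!(double(0), 0);
    }
}
```

With this change, `cargo test` should report both tests as passing, which is what the task's `output_contains: 2 passed` assertion looks for.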

Task suite source · 319 lines · YAML

- id: quickstart-new-binary
  intent: Create a new Rust binary project called bench-hello using cargo, then
    build it.
  assert:
    - ran: cargo
    - file_exists: bench-hello/Cargo.toml
    - file_exists: bench-hello/src/main.rs
  setup: []
  max_turns: 3
  difficulty: easy
  category: getting-started
  docs_origin: src/doc/src/getting-started/first-steps.md#First Steps with Cargo
- id: quickstart-new-library
  intent: Create a new Rust library project called bench-mylib using cargo with
    the --lib flag.
  assert:
    - ran: cargo
    - file_exists: bench-mylib/Cargo.toml
    - file_exists: bench-mylib/src/lib.rs
  setup: []
  max_turns: 3
  difficulty: easy
  category: getting-started
  docs_origin: src/doc/src/guide/creating-a-new-project.md#Creating a New Package
- id: quickstart-build-run
  intent: Create a new Rust binary project called bench-greeter. Edit src/main.rs
    so it prints 'Hello CLIWatch'. Then run the project using cargo run.
  assert:
    - ran: cargo
    - file_exists: bench-greeter/Cargo.toml
    - output_contains: Hello CLIWatch
  setup: []
  max_turns: 5
  difficulty: easy
  category: getting-started
  docs_origin: src/doc/src/getting-started/first-steps.md#First Steps with Cargo
- id: discover-version
  intent: Check what version of cargo is installed and print it.
  assert:
    - ran: cargo
    - output_contains: cargo
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: src/doc/src/commands/cargo.md#SYNOPSIS
- id: discover-help-subcommands
  intent: Show the list of available cargo subcommands by running cargo's help.
    The output should include subcommands like build, test, and run.
  assert:
    - ran: cargo.*--help|cargo.*-h|cargo help|cargo --list
    - output_contains: build
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: src/doc/src/commands/cargo.md#DESCRIPTION
- id: config-edition-metadata
  intent: Create a new binary project called bench-meta using cargo. Then edit its
    Cargo.toml to set the description to 'A benchmark project', the license to
    'MIT', and the edition to '2021'. Print the contents of the final
    Cargo.toml.
  assert:
    - ran: cargo
    - file_exists: bench-meta/Cargo.toml
    - file_contains:
        path: bench-meta/Cargo.toml
        text: A benchmark project
    - file_contains:
        path: bench-meta/Cargo.toml
        text: MIT
  setup: []
  max_turns: 5
  difficulty: medium
  category: config
  docs_origin: src/doc/src/reference/manifest.md#The `[package]` section
- id: config-features
  intent: "Create a new library project called bench-featured using cargo. Edit
    its Cargo.toml to define two features: 'logging' (no dependencies) and
    'full' (which enables the 'logging' feature). Also add a default feature
    that enables 'logging'. Print the final Cargo.toml."
  assert:
    - ran: cargo
    - file_exists: bench-featured/Cargo.toml
    - file_contains:
        path: bench-featured/Cargo.toml
        text: logging
    - file_contains:
        path: bench-featured/Cargo.toml
        text: full
    - file_contains:
        path: bench-featured/Cargo.toml
        text: default
  setup: []
  max_turns: 6
  difficulty: medium
  category: config
  docs_origin: src/doc/src/reference/features.md#The `[features]` section
- id: config-workspace
  intent: "Create a Cargo workspace in a directory called bench-workspace. The
    workspace should contain two member packages: bench-workspace/core (a
    library) and bench-workspace/cli (a binary). The root Cargo.toml should
    define the workspace members, and each member should have its own
    Cargo.toml. Verify the setup compiles by running cargo check from the
    workspace root."
  assert:
    - ran: cargo
    - file_exists: bench-workspace/Cargo.toml
    - file_exists: bench-workspace/core/Cargo.toml
    - file_exists: bench-workspace/cli/Cargo.toml
    - file_contains:
        path: bench-workspace/Cargo.toml
        text: workspace
  setup: []
  max_turns: 8
  difficulty: medium
  category: config
  docs_origin: src/doc/src/reference/workspaces.md#Workspaces
- id: flags-release-build
  intent: Create a new binary project called bench-release using cargo. Build it
    in release mode. Verify that the release binary exists at
    bench-release/target/release/bench-release.
  assert:
    - ran: cargo.*--release|cargo.*-r
    - file_exists: bench-release/target/release/bench-release
  setup: []
  max_turns: 5
  difficulty: medium
  category: flag-parsing
  docs_origin: src/doc/src/commands/cargo-build.md#Compilation Options
- id: flags-test-filter
  intent: "Create a new library project called bench-tests. Add two test functions
    in src/lib.rs: 'test_addition' that asserts 2+3==5, and 'test_subtraction'
    that asserts 5-2==3. Run only the test_addition test by name using cargo
    test."
  assert:
    - ran: cargo test
    - output_contains: test_addition
    - output_contains: 1 passed
  setup: []
  max_turns: 6
  difficulty: medium
  category: flag-parsing
  docs_origin: src/doc/src/commands/cargo-test.md#DESCRIPTION
- id: flags-message-format-json
  intent: Create a new binary project called bench-jsonbuild. Build it using cargo
    build with --message-format=json so the build output is JSON. Save the JSON
    build output to a file called bench-build-output.json.
  assert:
    - ran: cargo.*--message-format
    - file_exists: bench-build-output.json
    - file_contains:
        path: bench-build-output.json
        text: reason
  setup: []
  max_turns: 6
  difficulty: medium
  category: flag-parsing
  docs_origin: src/doc/src/commands/cargo-build.md#Output Options
- id: flags-metadata
  intent: Create a new binary project called bench-metaquery. Run cargo metadata
    on it with --no-deps and --format-version=1. Save the JSON output to
    bench-metadata.json.
  assert:
    - ran: cargo metadata
    - file_exists: bench-metadata.json
    - file_contains:
        path: bench-metadata.json
        text: bench-metaquery
  setup: []
  max_turns: 5
  difficulty: medium
  category: flag-parsing
  docs_origin: src/doc/src/commands/cargo-metadata.md#DESCRIPTION
- id: error-fix-compile
  intent: A binary project bench-fixme has a compile error in src/main.rs (a
    missing semicolon). Try to build it, observe the error, then fix the code so
    it compiles successfully and run it.
  assert:
    - ran: cargo
    - output_contains: x is 5
  setup:
    - cargo new bench-fixme
    - |
      cat > bench-fixme/src/main.rs << 'EOF'
      fn main() {
          let x = 5
          println!("x is {}", x);
      }
      EOF
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: src/doc/src/getting-started/first-steps.md#First Steps with Cargo
- id: error-fix-test-failure
  intent: A library project bench-failtest has a failing test. Run cargo test to
    see which test fails, then fix the code in src/lib.rs so all tests pass. The
    function 'double' should return its argument multiplied by 2.
  assert:
    - ran: cargo test
    - output_contains: 2 passed
  setup:
    - cargo new bench-failtest --lib
    - |
      cat > bench-failtest/src/lib.rs << 'EOF'
      pub fn double(x: i32) -> i32 {
          x + x + 1
      }

      #[cfg(test)]
      mod tests {
          use super::*;

          #[test]
          fn test_double_two() {
              assert_eq!(double(2), 4);
          }

          #[test]
          fn test_double_zero() {
              assert_eq!(double(0), 0);
          }
      }
      EOF
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: src/doc/src/guide/tests.md#Tests
- id: error-missing-main
  intent: A project bench-nomain was created as a binary but its src/main.rs is
    missing. Try to build it, observe the error, then create a valid src/main.rs
    that prints 'recovered' and build again successfully.
  assert:
    - ran: cargo
    - file_exists: bench-nomain/src/main.rs
    - output_contains: recovered
  setup:
    - cargo new bench-nomain
    - rm bench-nomain/src/main.rs
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: src/doc/src/guide/project-layout.md#Package Layout
- id: workflow-create-test-build
  intent: "Create a new library project called bench-mathlib. Add a public
    function 'add(a: i32, b: i32) -> i32' that returns the sum, and a test that
    verifies add(2, 3) == 5. Run the tests to confirm they pass, then build the
    project in release mode."
  assert:
    - ran: cargo test
    - ran: cargo build.*--release|cargo build.*-r
    - file_contains:
        path: bench-mathlib/src/lib.rs
        text: fn add
    - output_contains: 1 passed
  setup: []
  max_turns: 8
  difficulty: hard
  category: multi-step-workflow
  docs_origin: src/doc/src/guide/tests.md#Tests
- id: workflow-workspace-deps
  intent: "Create a Cargo workspace called bench-ws with two members:
    'bench-ws/utils' (a library with a public function 'greet(name: &str) ->
    String' that returns a greeting) and 'bench-ws/app' (a binary that depends
    on utils and calls greet). Edit bench-ws/app/Cargo.toml to add utils as a
    path dependency. Run the app and confirm it prints a greeting."
  assert:
    - ran: cargo
    - file_exists: bench-ws/Cargo.toml
    - file_exists: bench-ws/utils/src/lib.rs
    - file_exists: bench-ws/app/src/main.rs
    - file_contains:
        path: bench-ws/app/Cargo.toml
        text: utils
  setup: []
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: src/doc/src/reference/workspaces.md#Workspaces
- id: workflow-profile-compare
  intent: "Create a new binary project called bench-profiles. Edit the Cargo.toml
    to add a custom profile called 'bench-fast' that inherits from release and
    sets opt-level to 3 and lto to true. Build the project once in dev mode and
    once with the bench-fast profile (using --profile bench-fast). Write the
    sizes of both binaries (in bytes) to a file called bench-sizes.txt, one per
    line, labeled 'dev: <size>' and 'bench-fast: <size>'."
  assert:
    - ran: cargo build
    - ran: cargo.*--profile.*bench-fast
    - file_exists: bench-sizes.txt
    - file_contains:
        path: bench-sizes.txt
        text: "dev:"
    - file_contains:
        path: bench-sizes.txt
        text: "bench-fast:"
  setup: []
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: src/doc/src/reference/profiles.md#Profiles
- id: workflow-fmt-clippy-test
  intent: "Create a new library project called bench-quality. Write a function
    'is_even(n: i32) -> bool' that returns true if n is even, and add a test for
    it. Then run cargo fmt to format the code, cargo clippy to check for lints,
    and cargo test to verify correctness. All three commands should succeed."
  assert:
    - ran: cargo fmt
    - ran: cargo clippy
    - ran: cargo test
    - file_contains:
        path: bench-quality/src/lib.rs
        text: is_even
  setup: []
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: src/doc/src/commands/cargo-fmt.md#DESCRIPTION
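Each task in the suite passes only if every entry under `assert` holds. The sketch below shows one way such checks could be evaluated for three of the assertion kinds (`file_exists`, `file_contains`, `output_contains`). It is a hypothetical model, not the @cliwatch/cli-bench implementation: the `Assertion` enum, `RunState` struct, and function names are all assumptions, and the `ran` assertion is left out because its values appear to be regular expressions, which the Rust standard library does not provide.

```rust
use std::collections::HashMap;

// Hypothetical model of three assertion kinds seen in the suite.
// An illustration only, not the @cliwatch/cli-bench implementation.
enum Assertion {
    FileExists(&'static str),
    FileContains { path: &'static str, text: &'static str },
    OutputContains(&'static str),
}

// Assumed snapshot of a finished run: files on disk (path -> contents)
// and everything the agent's commands printed.
struct RunState {
    files: HashMap<String, String>,
    output: String,
}

// Check a single assertion against the snapshot.
fn holds(assertion: &Assertion, state: &RunState) -> bool {
    match assertion {
        Assertion::FileExists(path) => state.files.contains_key(*path),
        Assertion::FileContains { path, text } => state
            .files
            .get(*path)
            .map_or(false, |contents| contents.contains(*text)),
        Assertion::OutputContains(text) => state.output.contains(*text),
    }
}

// A task passes only when all of its assertions hold.
fn task_passes(assertions: &[Assertion], state: &RunState) -> bool {
    assertions.iter().all(|a| holds(a, state))
}

fn main() {
    let mut files = HashMap::new();
    files.insert(
        "bench-meta/Cargo.toml".to_string(),
        "description = \"A benchmark project\"\nlicense = \"MIT\"\n".to_string(),
    );
    let state = RunState { files, output: "Hello CLIWatch\n".to_string() };

    // File assertions transcribed from the config-edition-metadata task.
    let metadata_checks = [
        Assertion::FileExists("bench-meta/Cargo.toml"),
        Assertion::FileContains { path: "bench-meta/Cargo.toml", text: "A benchmark project" },
        Assertion::FileContains { path: "bench-meta/Cargo.toml", text: "MIT" },
    ];
    println!("metadata checks pass: {}", task_passes(&metadata_checks, &state));

    // Output assertion transcribed from the quickstart-build-run task.
    let output_check = Assertion::OutputContains("Hello CLIWatch");
    println!("output check holds: {}", holds(&output_check, &state));
}
```

Keeping the run state as plain data makes the checks easy to test in isolation; a real harness would presumably read the filesystem and captured command output instead.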

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 19 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for Cargo — see the latest run. Set up the same for your CLI in minutes.

Model	Pass Rate	Delta
Sonnet 4.5	95%	+5%
GPT-4.1	80%	-5%
Haiku 4.5	65%	-10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.

[Chart: pass rate over the last 30 days, v1.0 to v1.6]

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.
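The gating rule described here can be sketched in a few lines. The thresholds below mirror the sample config on this page; the function name, the data layout, and the choice to treat a model with no recorded result as failing are assumptions made for illustration, not documented CLIWatch behavior.

```rust
// Hypothetical quality-gate check: returns the models whose pass rate
// is below their configured threshold (or that have no result at all).
fn failing_models<'a>(
    thresholds: &[(&'a str, f64)],
    pass_rates: &[(&str, f64)],
) -> Vec<&'a str> {
    let mut failing = Vec::new();
    for &(model, min) in thresholds {
        // Look up this model's measured pass rate, if any.
        let rate = pass_rates
            .iter()
            .find(|&&(m, _)| m == model)
            .map(|&(_, r)| r);
        match rate {
            Some(r) if r >= min => {} // meets the threshold
            _ => failing.push(model), // below threshold, or missing (assumption)
        }
    }
    failing
}

fn main() {
    // Thresholds from the sample config on this page (percent).
    let thresholds = [
        ("claude-sonnet-4-5", 80.0),
        ("gpt-4.1", 75.0),
        ("claude-haiku-4-5", 60.0),
    ];
    // Sample pass rates from the PR-comment table, keyed by the same ids (assumed mapping).
    let pass_rates = [
        ("claude-sonnet-4-5", 95.0),
        ("gpt-4.1", 80.0),
        ("claude-haiku-4-5", 65.0),
    ];

    let failing = failing_models(&thresholds, &pass_rates);
    if failing.is_empty() {
        println!("quality gate passed");
    } else {
        println!("quality gate failed for: {}", failing.join(", "));
        std::process::exit(1); // a non-zero exit is what would fail a CI job
    }
}
```

With the sample numbers every model clears its threshold, so the gate passes; dropping any model below its minimum would make the function return that model's id.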

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.

Compare other CLI evals

git

npm

aws

fly