# agent manages dependencies
$ npm ls --depth=0 --json
  {"dependencies":{"express":{"version":"4.21.0"},
   "typescript":{"version":"5.7.0"}}}
 
$ npm audit --json
  {"vulnerabilities":{},"metadata":{"totalDependencies":42}}

Can AI agents use npm?

The Node.js package manager. Agents use it to install dependencies, run scripts, manage versions, and publish packages.

See the latest run →
89% overall pass rate1 model tested19 tasksv10.9.43/6/2026

npm eval results by model

ModelPass rateAvg turnsAvg tokens
gpt-5-nano89%3.16.3k

npm task results by model

Taskgpt-5-nano
quickstart-init-custom-nameeasy
Create a new npm project non-interactively, then use npm pkg to set the package name to 'bench-app'. The package.json should contain the name 'bench-app'.
3t
flags-version-readmedium
There is a package.json in the current directory. Use npm to read and print just the current version from the package.json. Do not bump the version, just display it.
5t
workflow-pack-and-inspecthard
There is a project with a package.json, an index.js, and a README.md. Run npm pack to create a tarball. Then list the contents of the generated tarball (using tar) and write the file listing to bench-pack-contents.txt.
3t
quickstart-init-projecteasy
Initialize a new npm project non-interactively with default values. The resulting package.json should exist in the current directory.
2t
quickstart-run-scripteasy
There is a package.json with a script called 'hello' that runs 'echo Hello from benchmark'. Execute that script using npm.
1t
discover-versioneasy
Check what version of npm is installed and print it.
1t
discover-help-listeasy
List all available npm commands. npm has a way to display usage info for all commands at once.
2t
config-set-and-getmedium
Use npm config to set the 'init-author-name' config value to 'Benchmark Author', then retrieve and display it to confirm it was set.
2t
config-pkg-set-scriptsmedium
There is a package.json in the current directory. Use npm pkg to add a 'build' script that runs 'echo built successfully' and a 'lint' script that runs 'echo lint passed'. Then list all scripts using npm pkg get.
3t
config-workspace-setupmedium
Set up an npm workspace project. Create a root package.json with a 'workspaces' field pointing to 'packages/*'. Then create two workspace packages: packages/bench-core (with its own package.json, name '@bench/core', version '1.0.0') and packages/bench-utils (with its own package.json, name '@bench/utils', version '1.0.0'). Run npm install at the root to link them.
2t
flags-ls-jsonmedium
There is a project with some dependencies installed. List the installed packages in JSON format using npm ls with the --json flag. Print the output to stdout.
1t
flags-ls-depthmedium
There is a project with dependencies installed. List only the top-level (depth 0) dependencies. Do not show nested sub-dependencies.
1t
flags-pack-dry-runmedium
There is a package.json in the current directory. Run npm pack in dry-run mode to see what files would be included in the tarball without actually creating it.
1t
error-missing-scripthard
There is a package.json that does not have a 'test' script defined. Try running npm test. It will fail. Then add a 'test' script that runs 'echo tests passed' to the package.json and run npm test again so it succeeds.
6t
error-install-nonexistenthard
Try to install a package called 'bench-nonexistent-pkg-xyz-99999' using npm. This package does not exist, so the install will fail. Capture or display the error. Then write the npm error code to a file called bench-error.txt (just the error code, like 'E404').
2t
error-fix-invalid-jsonhard
The package.json file has invalid JSON (a trailing comma). Fix the package.json so it is valid JSON, then run npm install to confirm it works.
5t
workflow-init-add-deps-runhard
Initialize an npm project non-interactively. Add a 'start' script that runs 'node index.js'. Create an index.js file that prints 'bench-app running'. Then run npm start to execute it.
8t
workflow-version-bumphard
There is a package.json at version 1.0.0. First, initialize a git repo and make an initial commit (so npm version can tag). Then use npm version to bump the patch version (from 1.0.0 to 1.0.1). Verify the package.json now shows 1.0.1.
6t
workflow-local-install-from-tarballhard
There are two project directories: bench-lib/ (a library) and bench-app/ (a consumer). First, run npm pack in bench-lib/ to create a tarball. Then, in bench-app/, install the tarball as a dependency using npm install with a file path to the .tgz file. Verify that bench-app/node_modules/bench-lib exists.
4t
Task suite source366 lines · YAML
- id: quickstart-init-project
  intent: Initialize a new npm project non-interactively with default values. The
    resulting package.json should exist in the current directory.
  assert:
    - ran: npm
    - file_exists: package.json
  setup: []
  max_turns: 3
  difficulty: easy
  category: getting-started
  docs_origin: docs/lib/content/commands/npm-init.md#Description
- id: quickstart-init-custom-name
  intent: Create a new npm project non-interactively, then use npm pkg to set the
    package name to 'bench-app'. The package.json should contain the name
    'bench-app'.
  assert:
    - ran: npm
    - file_exists: package.json
    - file_contains:
        path: package.json
        text: bench-app
  setup: []
  max_turns: 4
  difficulty: easy
  category: getting-started
  docs_origin: docs/lib/content/commands/npm-pkg.md#Description
- id: quickstart-run-script
  intent: There is a package.json with a script called 'hello' that runs 'echo
    Hello from benchmark'. Execute that script using npm.
  assert:
    - ran: npm
    - output_contains: Hello from benchmark
  setup:
    - |
      cat > package.json << 'SETUP'
      {
        "name": "bench-project",
        "version": "1.0.0",
        "scripts": {
          "hello": "echo Hello from benchmark"
        }
      }
      SETUP
  max_turns: 3
  difficulty: easy
  category: getting-started
  docs_origin: docs/lib/content/commands/npm-run-script.md#Description
- id: discover-version
  intent: Check what version of npm is installed and print it.
  assert:
    - ran: npm.*--version|npm.*-v|npm version
    - output_contains: .
  setup: []
  max_turns: 3
  difficulty: easy
  category: command-discovery
  docs_origin: docs/lib/content/commands/npm.md#Description
- id: discover-help-list
  intent: List all available npm commands. npm has a way to display usage info for
    all commands at once.
  assert:
    - ran: npm.*-l|npm help|npm --help
  setup: []
  max_turns: 4
  difficulty: easy
  category: command-discovery
  docs_origin: docs/lib/content/commands/npm-help.md#Description
- id: config-set-and-get
  intent: Use npm config to set the 'init-author-name' config value to 'Benchmark
    Author', then retrieve and display it to confirm it was set.
  assert:
    - ran: npm config set
    - ran: npm config get|npm config list
    - output_contains: Benchmark Author
  setup:
    - npm init -y
  max_turns: 5
  difficulty: medium
  category: config
  docs_origin: docs/lib/content/commands/npm-config.md#set
- id: config-pkg-set-scripts
  intent: There is a package.json in the current directory. Use npm pkg to add a
    'build' script that runs 'echo built successfully' and a 'lint' script that
    runs 'echo lint passed'. Then list all scripts using npm pkg get.
  assert:
    - ran: npm pkg set
    - file_contains:
        path: package.json
        text: build
    - file_contains:
        path: package.json
        text: lint
  setup:
    - npm init -y
  max_turns: 6
  difficulty: medium
  category: config
  docs_origin: docs/lib/content/commands/npm-pkg.md#Description
- id: config-workspace-setup
  intent: "Set up an npm workspace project. Create a root package.json with a
    'workspaces' field pointing to 'packages/*'. Then create two workspace
    packages: packages/bench-core (with its own package.json, name
    '@bench/core', version '1.0.0') and packages/bench-utils (with its own
    package.json, name '@bench/utils', version '1.0.0'). Run npm install at the
    root to link them."
  assert:
    - ran: npm
    - file_exists: package.json
    - file_exists: packages/bench-core/package.json
    - file_exists: packages/bench-utils/package.json
    - file_contains:
        path: package.json
        text: workspaces
  setup: []
  max_turns: 8
  difficulty: medium
  category: config
  docs_origin: docs/lib/content/commands/npm-init.md#Description
- id: flags-ls-json
  intent: There is a project with some dependencies installed. List the installed
    packages in JSON format using npm ls with the --json flag. Print the output
    to stdout.
  assert:
    - ran: npm ls.*--json|npm list.*--json
    - output_contains: abbrev
  setup:
    - |
      cat > package.json << 'SETUP'
      {
        "name": "bench-flags-project",
        "version": "1.0.0",
        "dependencies": {}
      }
      SETUP
    - npm install --save --prefer-offline abbrev@2.0.0 2>/dev/null || npm
      install --save abbrev@2.0.0
  max_turns: 5
  difficulty: medium
  category: flag-parsing
  docs_origin: docs/lib/content/commands/npm-ls.md#Description
- id: flags-ls-depth
  intent: There is a project with dependencies installed. List only the top-level
    (depth 0) dependencies. Do not show nested sub-dependencies.
  assert:
    - ran: npm ls.*--depth|npm list.*--depth
    - output_contains: abbrev
  setup:
    - |
      cat > package.json << 'SETUP'
      {
        "name": "bench-depth-project",
        "version": "1.0.0",
        "dependencies": {}
      }
      SETUP
    - npm install --save --prefer-offline abbrev@2.0.0 2>/dev/null || npm
      install --save abbrev@2.0.0
  max_turns: 5
  difficulty: medium
  category: flag-parsing
  docs_origin: docs/lib/content/commands/npm-ls.md#Description
- id: flags-pack-dry-run
  intent: There is a package.json in the current directory. Run npm pack in
    dry-run mode to see what files would be included in the tarball without
    actually creating it.
  assert:
    - ran: npm pack.*--dry-run
    - output_contains: package.json
  setup:
    - |
      cat > package.json << 'SETUP'
      {
        "name": "bench-pack-test",
        "version": "1.0.0",
        "main": "index.js"
      }
      SETUP
    - echo 'module.exports = 42;' > index.js
  max_turns: 5
  difficulty: medium
  category: flag-parsing
  docs_origin: docs/lib/content/commands/npm-pack.md#Description
- id: flags-version-read
  intent: There is a package.json in the current directory. Use npm to read and
    print just the current version from the package.json. Do not bump the
    version, just display it.
  assert:
    - ran: npm
    - output_contains: 2.5.3
  setup:
    - |
      cat > package.json << 'SETUP'
      {
        "name": "bench-version-project",
        "version": "2.5.3",
        "description": "test project"
      }
      SETUP
  max_turns: 5
  difficulty: medium
  category: flag-parsing
  docs_origin: docs/lib/content/commands/npm-version.md#Description
- id: error-missing-script
  intent: There is a package.json that does not have a 'test' script defined. Try
    running npm test. It will fail. Then add a 'test' script that runs 'echo
    tests passed' to the package.json and run npm test again so it succeeds.
  assert:
    - ran: npm test
    - file_contains:
        path: package.json
        text: test
    - output_contains: tests passed
  setup:
    - |
      cat > package.json << 'SETUP'
      {
        "name": "bench-error-project",
        "version": "1.0.0",
        "scripts": {}
      }
      SETUP
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: docs/lib/content/commands/npm-test.md#Description
- id: error-install-nonexistent
  intent: Try to install a package called 'bench-nonexistent-pkg-xyz-99999' using
    npm. This package does not exist, so the install will fail. Capture or
    display the error. Then write the npm error code to a file called
    bench-error.txt (just the error code, like 'E404').
  assert:
    - ran: npm install.*bench-nonexistent-pkg-xyz-99999|npm
        i.*bench-nonexistent-pkg-xyz-99999
    - file_exists: bench-error.txt
    - file_contains:
        path: bench-error.txt
        text: E404
  setup:
    - npm init -y
  max_turns: 6
  difficulty: hard
  category: error-recovery
  docs_origin: docs/lib/content/commands/npm-install.md#Description
- id: error-fix-invalid-json
  intent: The package.json file has invalid JSON (a trailing comma). Fix the
    package.json so it is valid JSON, then run npm install to confirm it works.
  assert:
    - ran: npm
    - verify:
        run: node -e
          "JSON.parse(require('fs').readFileSync('package.json','utf8'));console.log('VALID')"
        output_contains: VALID
  setup:
    - |
      cat > package.json << 'SETUP'
      {
        "name": "bench-broken",
        "version": "1.0.0",
        "description": "broken json",
      }
      SETUP
  max_turns: 8
  difficulty: hard
  category: error-recovery
  docs_origin: docs/lib/content/commands/npm-install.md#Description
- id: workflow-init-add-deps-run
  intent: Initialize an npm project non-interactively. Add a 'start' script that
    runs 'node index.js'. Create an index.js file that prints 'bench-app
    running'. Then run npm start to execute it.
  assert:
    - ran: npm
    - file_exists: package.json
    - file_exists: index.js
    - file_contains:
        path: package.json
        text: start
    - output_contains: bench-app running
  setup: []
  max_turns: 8
  difficulty: hard
  category: multi-step-workflow
  docs_origin: docs/lib/content/commands/npm-init.md#Description
- id: workflow-version-bump
  intent: There is a package.json at version 1.0.0. First, initialize a git repo
    and make an initial commit (so npm version can tag). Then use npm version to
    bump the patch version (from 1.0.0 to 1.0.1). Verify the package.json now
    shows 1.0.1.
  assert:
    - ran: npm version
    - file_contains:
        path: package.json
        text: 1.0.1
  setup:
    - |
      cat > package.json << 'SETUP'
      {
        "name": "bench-versioned",
        "version": "1.0.0",
        "description": "version bump test"
      }
      SETUP
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: docs/lib/content/commands/npm-version.md#Description
- id: workflow-pack-and-inspect
  intent: There is a project with a package.json, an index.js, and a README.md.
    Run npm pack to create a tarball. Then list the contents of the generated
    tarball (using tar) and write the file listing to bench-pack-contents.txt.
  assert:
    - ran: npm pack
    - file_exists: bench-pack-contents.txt
    - file_contains:
        path: bench-pack-contents.txt
        text: package.json
    - file_contains:
        path: bench-pack-contents.txt
        text: index.js
  setup:
    - |
      cat > package.json << 'SETUP'
      {
        "name": "bench-packable",
        "version": "1.0.0",
        "main": "index.js"
      }
      SETUP
    - echo 'module.exports = true;' > index.js
    - echo '# Bench Packable' > README.md
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: docs/lib/content/commands/npm-pack.md#Description
- id: workflow-local-install-from-tarball
  intent: "There are two project directories: bench-lib/ (a library) and
    bench-app/ (a consumer). First, run npm pack in bench-lib/ to create a
    tarball. Then, in bench-app/, install the tarball as a dependency using npm
    install with a file path to the .tgz file. Verify that
    bench-app/node_modules/bench-lib exists."
  assert:
    - ran: npm pack
    - ran: npm install
    - file_exists: bench-app/node_modules/bench-lib/package.json
  setup:
    - mkdir -p bench-lib bench-app
    - |
      cat > bench-lib/package.json << 'SETUP'
      {
        "name": "bench-lib",
        "version": "1.0.0",
        "main": "index.js"
      }
      SETUP
    - "echo 'module.exports = { greet: () => \"hello\" };' > bench-lib/index.js"
    - |
      cat > bench-app/package.json << 'SETUP'
      {
        "name": "bench-app",
        "version": "1.0.0"
      }
      SETUP
  max_turns: 10
  difficulty: hard
  category: multi-step-workflow
  docs_origin: docs/lib/content/commands/npm-install.md#Description

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 19 tasks using @cliwatch/cli-bench.

What you get with CLIWatch

Everything below is running live for npm see the latest run. Set up the same for your CLI in minutes.

ModelPass RateDelta
Sonnet 4.595%+5%
GPT-4.180%-5%
Haiku 4.565%-10%

CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.

Pass rateLast 30 days
v1.0v1.6

Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%

Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.

Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.

Compare other CLI evals