# agent manages dependencies $ npm ls --depth=0 --json {"dependencies":{"express":{"version":"4.21.0"}, "typescript":{"version":"5.7.0"}}} $ npm audit --json {"vulnerabilities":{},"metadata":{"totalDependencies":42}}
Can AI agents use npm?
The Node.js package manager. Agents use it to install dependencies, run scripts, manage versions, and publish packages.
See the latest run →npm eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 89% | 3.1 | 6.3k |
npm task results by model
| Task | gpt-5-nano |
|---|---|
quickstart-init-custom-nameeasy Create a new npm project non-interactively, then use npm pkg to set the package name to 'bench-app'. The package.json should contain the name 'bench-app'. | ✓3t |
flags-version-readmedium There is a package.json in the current directory. Use npm to read and print just the current version from the package.json. Do not bump the version, just display it. | ✗5t |
workflow-pack-and-inspecthard There is a project with a package.json, an index.js, and a README.md. Run npm pack to create a tarball. Then list the contents of the generated tarball (using tar) and write the file listing to bench-pack-contents.txt. | ✓3t |
quickstart-init-projecteasy Initialize a new npm project non-interactively with default values. The resulting package.json should exist in the current directory. | ✓2t |
quickstart-run-scripteasy There is a package.json with a script called 'hello' that runs 'echo Hello from benchmark'. Execute that script using npm. | ✓1t |
discover-versioneasy Check what version of npm is installed and print it. | ✓1t |
discover-help-listeasy List all available npm commands. npm has a way to display usage info for all commands at once. | ✓2t |
config-set-and-getmedium Use npm config to set the 'init-author-name' config value to 'Benchmark Author', then retrieve and display it to confirm it was set. | ✓2t |
config-pkg-set-scriptsmedium There is a package.json in the current directory. Use npm pkg to add a 'build' script that runs 'echo built successfully' and a 'lint' script that runs 'echo lint passed'. Then list all scripts using npm pkg get. | ✓3t |
config-workspace-setupmedium Set up an npm workspace project. Create a root package.json with a 'workspaces' field pointing to 'packages/*'. Then create two workspace packages: packages/bench-core (with its own package.json, name '@bench/core', version '1.0.0') and packages/bench-utils (with its own package.json, name '@bench/utils', version '1.0.0'). Run npm install at the root to link them. | ✓2t |
flags-ls-jsonmedium There is a project with some dependencies installed. List the installed packages in JSON format using npm ls with the --json flag. Print the output to stdout. | ✓1t |
flags-ls-depthmedium There is a project with dependencies installed. List only the top-level (depth 0) dependencies. Do not show nested sub-dependencies. | ✓1t |
flags-pack-dry-runmedium There is a package.json in the current directory. Run npm pack in dry-run mode to see what files would be included in the tarball without actually creating it. | ✗1t |
error-missing-scripthard There is a package.json that does not have a 'test' script defined. Try running npm test. It will fail. Then add a 'test' script that runs 'echo tests passed' to the package.json and run npm test again so it succeeds. | ✓6t |
error-install-nonexistenthard Try to install a package called 'bench-nonexistent-pkg-xyz-99999' using npm. This package does not exist, so the install will fail. Capture or display the error. Then write the npm error code to a file called bench-error.txt (just the error code, like 'E404'). | ✓2t |
error-fix-invalid-jsonhard The package.json file has invalid JSON (a trailing comma). Fix the package.json so it is valid JSON, then run npm install to confirm it works. | ✓5t |
workflow-init-add-deps-runhard Initialize an npm project non-interactively. Add a 'start' script that runs 'node index.js'. Create an index.js file that prints 'bench-app running'. Then run npm start to execute it. | ✓8t |
workflow-version-bumphard There is a package.json at version 1.0.0. First, initialize a git repo and make an initial commit (so npm version can tag). Then use npm version to bump the patch version (from 1.0.0 to 1.0.1). Verify the package.json now shows 1.0.1. | ✓6t |
workflow-local-install-from-tarballhard There are two project directories: bench-lib/ (a library) and bench-app/ (a consumer). First, run npm pack in bench-lib/ to create a tarball. Then, in bench-app/, install the tarball as a dependency using npm install with a file path to the .tgz file. Verify that bench-app/node_modules/bench-lib exists. | ✓4t |
Task suite source366 lines · YAML
- id: quickstart-init-project
intent: Initialize a new npm project non-interactively with default values. The
resulting package.json should exist in the current directory.
assert:
- ran: npm
- file_exists: package.json
setup: []
max_turns: 3
difficulty: easy
category: getting-started
docs_origin: docs/lib/content/commands/npm-init.md#Description
- id: quickstart-init-custom-name
intent: Create a new npm project non-interactively, then use npm pkg to set the
package name to 'bench-app'. The package.json should contain the name
'bench-app'.
assert:
- ran: npm
- file_exists: package.json
- file_contains:
path: package.json
text: bench-app
setup: []
max_turns: 4
difficulty: easy
category: getting-started
docs_origin: docs/lib/content/commands/npm-pkg.md#Description
- id: quickstart-run-script
intent: There is a package.json with a script called 'hello' that runs 'echo
Hello from benchmark'. Execute that script using npm.
assert:
- ran: npm
- output_contains: Hello from benchmark
setup:
- |
cat > package.json << 'SETUP'
{
"name": "bench-project",
"version": "1.0.0",
"scripts": {
"hello": "echo Hello from benchmark"
}
}
SETUP
max_turns: 3
difficulty: easy
category: getting-started
docs_origin: docs/lib/content/commands/npm-run-script.md#Description
- id: discover-version
intent: Check what version of npm is installed and print it.
assert:
- ran: npm.*--version|npm.*-v|npm version
- output_contains: .
setup: []
max_turns: 3
difficulty: easy
category: command-discovery
docs_origin: docs/lib/content/commands/npm.md#Description
- id: discover-help-list
intent: List all available npm commands. npm has a way to display usage info for
all commands at once.
assert:
- ran: npm.*-l|npm help|npm --help
setup: []
max_turns: 4
difficulty: easy
category: command-discovery
docs_origin: docs/lib/content/commands/npm-help.md#Description
- id: config-set-and-get
intent: Use npm config to set the 'init-author-name' config value to 'Benchmark
Author', then retrieve and display it to confirm it was set.
assert:
- ran: npm config set
- ran: npm config get|npm config list
- output_contains: Benchmark Author
setup:
- npm init -y
max_turns: 5
difficulty: medium
category: config
docs_origin: docs/lib/content/commands/npm-config.md#set
- id: config-pkg-set-scripts
intent: There is a package.json in the current directory. Use npm pkg to add a
'build' script that runs 'echo built successfully' and a 'lint' script that
runs 'echo lint passed'. Then list all scripts using npm pkg get.
assert:
- ran: npm pkg set
- file_contains:
path: package.json
text: build
- file_contains:
path: package.json
text: lint
setup:
- npm init -y
max_turns: 6
difficulty: medium
category: config
docs_origin: docs/lib/content/commands/npm-pkg.md#Description
- id: config-workspace-setup
intent: "Set up an npm workspace project. Create a root package.json with a
'workspaces' field pointing to 'packages/*'. Then create two workspace
packages: packages/bench-core (with its own package.json, name
'@bench/core', version '1.0.0') and packages/bench-utils (with its own
package.json, name '@bench/utils', version '1.0.0'). Run npm install at the
root to link them."
assert:
- ran: npm
- file_exists: package.json
- file_exists: packages/bench-core/package.json
- file_exists: packages/bench-utils/package.json
- file_contains:
path: package.json
text: workspaces
setup: []
max_turns: 8
difficulty: medium
category: config
docs_origin: docs/lib/content/commands/npm-init.md#Description
- id: flags-ls-json
intent: There is a project with some dependencies installed. List the installed
packages in JSON format using npm ls with the --json flag. Print the output
to stdout.
assert:
- ran: npm ls.*--json|npm list.*--json
- output_contains: abbrev
setup:
- |
cat > package.json << 'SETUP'
{
"name": "bench-flags-project",
"version": "1.0.0",
"dependencies": {}
}
SETUP
- npm install --save --prefer-offline abbrev@2.0.0 2>/dev/null || npm
install --save abbrev@2.0.0
max_turns: 5
difficulty: medium
category: flag-parsing
docs_origin: docs/lib/content/commands/npm-ls.md#Description
- id: flags-ls-depth
intent: There is a project with dependencies installed. List only the top-level
(depth 0) dependencies. Do not show nested sub-dependencies.
assert:
- ran: npm ls.*--depth|npm list.*--depth
- output_contains: abbrev
setup:
- |
cat > package.json << 'SETUP'
{
"name": "bench-depth-project",
"version": "1.0.0",
"dependencies": {}
}
SETUP
- npm install --save --prefer-offline abbrev@2.0.0 2>/dev/null || npm
install --save abbrev@2.0.0
max_turns: 5
difficulty: medium
category: flag-parsing
docs_origin: docs/lib/content/commands/npm-ls.md#Description
- id: flags-pack-dry-run
intent: There is a package.json in the current directory. Run npm pack in
dry-run mode to see what files would be included in the tarball without
actually creating it.
assert:
- ran: npm pack.*--dry-run
- output_contains: package.json
setup:
- |
cat > package.json << 'SETUP'
{
"name": "bench-pack-test",
"version": "1.0.0",
"main": "index.js"
}
SETUP
- echo 'module.exports = 42;' > index.js
max_turns: 5
difficulty: medium
category: flag-parsing
docs_origin: docs/lib/content/commands/npm-pack.md#Description
- id: flags-version-read
intent: There is a package.json in the current directory. Use npm to read and
print just the current version from the package.json. Do not bump the
version, just display it.
assert:
- ran: npm
- output_contains: 2.5.3
setup:
- |
cat > package.json << 'SETUP'
{
"name": "bench-version-project",
"version": "2.5.3",
"description": "test project"
}
SETUP
max_turns: 5
difficulty: medium
category: flag-parsing
docs_origin: docs/lib/content/commands/npm-version.md#Description
- id: error-missing-script
intent: There is a package.json that does not have a 'test' script defined. Try
running npm test. It will fail. Then add a 'test' script that runs 'echo
tests passed' to the package.json and run npm test again so it succeeds.
assert:
- ran: npm test
- file_contains:
path: package.json
text: test
- output_contains: tests passed
setup:
- |
cat > package.json << 'SETUP'
{
"name": "bench-error-project",
"version": "1.0.0",
"scripts": {}
}
SETUP
max_turns: 6
difficulty: hard
category: error-recovery
docs_origin: docs/lib/content/commands/npm-test.md#Description
- id: error-install-nonexistent
intent: Try to install a package called 'bench-nonexistent-pkg-xyz-99999' using
npm. This package does not exist, so the install will fail. Capture or
display the error. Then write the npm error code to a file called
bench-error.txt (just the error code, like 'E404').
assert:
- ran: npm install.*bench-nonexistent-pkg-xyz-99999|npm
i.*bench-nonexistent-pkg-xyz-99999
- file_exists: bench-error.txt
- file_contains:
path: bench-error.txt
text: E404
setup:
- npm init -y
max_turns: 6
difficulty: hard
category: error-recovery
docs_origin: docs/lib/content/commands/npm-install.md#Description
- id: error-fix-invalid-json
intent: The package.json file has invalid JSON (a trailing comma). Fix the
package.json so it is valid JSON, then run npm install to confirm it works.
assert:
- ran: npm
- verify:
run: node -e
"JSON.parse(require('fs').readFileSync('package.json','utf8'));console.log('VALID')"
output_contains: VALID
setup:
- |
cat > package.json << 'SETUP'
{
"name": "bench-broken",
"version": "1.0.0",
"description": "broken json",
}
SETUP
max_turns: 8
difficulty: hard
category: error-recovery
docs_origin: docs/lib/content/commands/npm-install.md#Description
- id: workflow-init-add-deps-run
intent: Initialize an npm project non-interactively. Add a 'start' script that
runs 'node index.js'. Create an index.js file that prints 'bench-app
running'. Then run npm start to execute it.
assert:
- ran: npm
- file_exists: package.json
- file_exists: index.js
- file_contains:
path: package.json
text: start
- output_contains: bench-app running
setup: []
max_turns: 8
difficulty: hard
category: multi-step-workflow
docs_origin: docs/lib/content/commands/npm-init.md#Description
- id: workflow-version-bump
intent: There is a package.json at version 1.0.0. First, initialize a git repo
and make an initial commit (so npm version can tag). Then use npm version to
bump the patch version (from 1.0.0 to 1.0.1). Verify the package.json now
shows 1.0.1.
assert:
- ran: npm version
- file_contains:
path: package.json
text: 1.0.1
setup:
- |
cat > package.json << 'SETUP'
{
"name": "bench-versioned",
"version": "1.0.0",
"description": "version bump test"
}
SETUP
max_turns: 10
difficulty: hard
category: multi-step-workflow
docs_origin: docs/lib/content/commands/npm-version.md#Description
- id: workflow-pack-and-inspect
intent: There is a project with a package.json, an index.js, and a README.md.
Run npm pack to create a tarball. Then list the contents of the generated
tarball (using tar) and write the file listing to bench-pack-contents.txt.
assert:
- ran: npm pack
- file_exists: bench-pack-contents.txt
- file_contains:
path: bench-pack-contents.txt
text: package.json
- file_contains:
path: bench-pack-contents.txt
text: index.js
setup:
- |
cat > package.json << 'SETUP'
{
"name": "bench-packable",
"version": "1.0.0",
"main": "index.js"
}
SETUP
- echo 'module.exports = true;' > index.js
- echo '# Bench Packable' > README.md
max_turns: 10
difficulty: hard
category: multi-step-workflow
docs_origin: docs/lib/content/commands/npm-pack.md#Description
- id: workflow-local-install-from-tarball
intent: "There are two project directories: bench-lib/ (a library) and
bench-app/ (a consumer). First, run npm pack in bench-lib/ to create a
tarball. Then, in bench-app/, install the tarball as a dependency using npm
install with a file path to the .tgz file. Verify that
bench-app/node_modules/bench-lib exists."
assert:
- ran: npm pack
- ran: npm install
- file_exists: bench-app/node_modules/bench-lib/package.json
setup:
- mkdir -p bench-lib bench-app
- |
cat > bench-lib/package.json << 'SETUP'
{
"name": "bench-lib",
"version": "1.0.0",
"main": "index.js"
}
SETUP
- "echo 'module.exports = { greet: () => \"hello\" };' > bench-lib/index.js"
- |
cat > bench-app/package.json << 'SETUP'
{
"name": "bench-app",
"version": "1.0.0"
}
SETUP
max_turns: 10
difficulty: hard
category: multi-step-workflow
docs_origin: docs/lib/content/commands/npm-install.md#Description
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 19 tasks using @cliwatch/cli-bench.
What you get with CLIWatch
Everything below is running live for npm — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
thresholds:
claude-sonnet-4-5: 80%
gpt-4.1: 75%
claude-haiku-4-5: 60%Quality Gates
Set per-model pass rate thresholds. CI fails if evals drop below your targets.
Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.