# agent manages Python dependencies

```shell
$ pip3 install requests
Successfully installed requests-2.31.0
$ pip3 freeze > requirements.txt
$ pip3 list --format=json
[{"name":"requests","version":"2.31.0"}]
```
## Can AI agents use pip?
pip is the Python package installer. Agents use it to install libraries, manage virtual environments, freeze requirements, and audit dependencies.
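A typical agent session follows this shape. The sketch below is illustrative: `demo-venv` is an arbitrary name, and no third-party packages are installed, so it runs offline (a fresh venv's `pip freeze` output is simply empty):

```shell
# Create an isolated environment; the venv ships with its own pip,
# so installs never touch the system Python
python3 -m venv demo-venv

# Call the venv's pip directly rather than relying on 'activate';
# this is more robust in non-interactive (agent) sessions
./demo-venv/bin/pip list --format=json

# Snapshot installed third-party packages for reproducibility
# (empty for a fresh venv, since pip itself is excluded from freeze)
./demo-venv/bin/pip freeze > requirements.txt
```

Invoking `venv/bin/pip` directly sidesteps the shell-specific `activate` step, which is one reason agents tend to pass these tasks without interactive tricks.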
## pip eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 100% | 2.5 | 4.4k |
## pip task results by model
| Task | Difficulty | gpt-5-nano |
|---|---|---|
| install-and-freeze: Create a Python virtual environment called 'venv', activate it, install the 'requests' package, and save the installed packages to requirements.txt using pip freeze. | medium | ✓ (2 turns) |
| install-from-requirements: Create a requirements.txt file with 'click>=8.0' and 'rich>=13.0' on separate lines. Then create a virtual environment and install from the requirements file. | easy | ✓ (4 turns) |
| show-package-info: Install the 'urllib3' package and then show detailed information about it using pip show. | easy | ✓ (1 turn) |
| check-outdated: Create a virtual environment, install 'setuptools', and then list all outdated packages in the environment. | medium | ✓ (3 turns) |
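The show-package-info pattern is easy to reproduce locally. A minimal sketch (it queries the bundled `pip` package itself as a stand-in, so no network install is needed; `query-venv` is an arbitrary name):

```shell
# A fresh venv already contains pip, so there is something to inspect
python3 -m venv query-venv

# 'pip show' prints package metadata: Name, Version, Location, dependencies
./query-venv/bin/pip show pip
```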
## Task suite source (52 lines · YAML)
```yaml
- id: install-and-freeze
  intent: Create a Python virtual environment called 'venv', activate it, install
    the 'requests' package, and save the installed packages to requirements.txt
    using pip freeze.
  assert:
    - ran: python.*-m venv|virtualenv
    - ran: pip.*install.*requests
    - ran: pip.*freeze
    - file_exists: requirements.txt
    - file_contains:
        path: requirements.txt
        text: requests
  setup: []
  max_turns: 6
  difficulty: medium
  category: basics
- id: install-from-requirements
  intent: Create a requirements.txt file with 'click>=8.0' and 'rich>=13.0' on
    separate lines. Then create a virtual environment and install from the
    requirements file.
  assert:
    - file_exists: requirements.txt
    - file_contains:
        path: requirements.txt
        text: click
    - ran: pip.*install.*-r
  setup: []
  max_turns: 5
  difficulty: easy
  category: basics
- id: show-package-info
  intent: Install the 'urllib3' package and then show detailed information about
    it using pip show.
  assert:
    - ran: pip.*install.*urllib3
    - ran: pip.*show.*urllib3
    - output_contains: urllib3
  setup: []
  max_turns: 4
  difficulty: easy
  category: query
- id: check-outdated
  intent: Create a virtual environment, install 'setuptools', and then list all
    outdated packages in the environment.
  assert:
    - ran: pip.*install.*setuptools
    - ran: pip.*list.*--outdated|pip.*list.*-o
  setup: []
  max_turns: 6
  difficulty: medium
  category: workflow
```
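Each `ran:` entry in the suite is a regular expression matched against the commands the agent actually executed. A rough illustration of that matching, assuming plain extended-regex semantics (the real harness may match differently):

```shell
# Hypothetical transcript line from an agent run
LOGGED_CMD="pip install -r requirements.txt"

# The 'ran: pip.*install.*-r' assertion is satisfied if the regex
# matches anywhere in the logged command
if echo "$LOGGED_CMD" | grep -Eq 'pip.*install.*-r'; then
  echo "assertion satisfied"
fi
```

The loose `.*` patterns are deliberate: they accept `pip`, `pip3`, or `venv/bin/pip`, and any flag ordering, so the assertion checks intent rather than exact syntax.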
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.
## What you get with CLIWatch
Everything below is running live for pip — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
## CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
## Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
## Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.

```yaml
thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%
```
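Conceptually, a gate like this reduces to a numeric comparison in CI. A hypothetical sketch, not CLIWatch's actual implementation (the variable names and the 82%/80% figures are illustrative):

```shell
# Illustrative values: measured pass rate vs. configured threshold
PASS_RATE=82
THRESHOLD=80

# Fail the CI job (non-zero exit) when the measured rate drops
# below the target for this model
if [ "$PASS_RATE" -lt "$THRESHOLD" ]; then
  echo "Pass rate ${PASS_RATE}% is below the ${THRESHOLD}% threshold" >&2
  exit 1
fi
echo "quality gate passed"
```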
## Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.