# agent manages Python dependencies

```shell
$ pip3 install requests
Successfully installed requests-2.31.0
$ pip3 freeze > requirements.txt
$ pip3 list --format=json
[{"name":"requests","version":"2.31.0"}]
```
## Can AI agents use pip?
pip is the Python package installer. Agents use it to install libraries, manage virtual environments, freeze requirements, and audit dependencies.
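A typical agent session follows this shape. The sketch below is illustrative: `demo-venv` is an arbitrary name, and no third-party packages are installed, so it runs offline (a fresh venv's `pip freeze` output is simply empty):

```shell
# Create an isolated environment; the venv ships with its own pip,
# so installs never touch the system Python
python3 -m venv demo-venv

# Call the venv's pip directly rather than relying on 'activate';
# this is more robust in non-interactive (agent) sessions
./demo-venv/bin/pip list --format=json

# Snapshot installed third-party packages for reproducibility
# (empty for a fresh venv, since pip itself is excluded from freeze)
./demo-venv/bin/pip freeze > requirements.txt
```

Invoking `venv/bin/pip` directly sidesteps the shell-specific `activate` step, which is one reason agents tend to pass these tasks without interactive tricks.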
## pip eval results by model
| Model | Pass rate | Avg turns | Avg tokens |
|---|---|---|---|
| gpt-5-nano | 100% | 2.5 | 4.4k |
## pip task results by model
| Task | Difficulty | gpt-5-nano |
|---|---|---|
| install-and-freeze: Create a Python virtual environment called 'venv', activate it, install the 'requests' package, and save the installed packages to requirements.txt using pip freeze. | medium | ✓ (2 turns) |
| install-from-requirements: Create a requirements.txt file with 'click>=8.0' and 'rich>=13.0' on separate lines. Then create a virtual environment and install from the requirements file. | easy | ✓ (4 turns) |
| show-package-info: Install the 'urllib3' package and then show detailed information about it using pip show. | easy | ✓ (1 turn) |
| check-outdated: Create a virtual environment, install 'setuptools', and then list all outdated packages in the environment. | medium | ✓ (3 turns) |
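The show-package-info pattern is easy to reproduce locally. A minimal sketch (it queries the bundled `pip` package itself as a stand-in, so no network install is needed; `query-venv` is an arbitrary name):

```shell
# A fresh venv already contains pip, so there is something to inspect
python3 -m venv query-venv

# 'pip show' prints package metadata: Name, Version, Location, dependencies
./query-venv/bin/pip show pip
```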
## Task suite source (52 lines · YAML)
```yaml
- id: install-and-freeze
  intent: Create a Python virtual environment called 'venv', activate it, install
    the 'requests' package, and save the installed packages to requirements.txt
    using pip freeze.
  assert:
    - ran: python.*-m venv|virtualenv
    - ran: pip.*install.*requests
    - ran: pip.*freeze
    - file_exists: requirements.txt
    - file_contains:
        path: requirements.txt
        text: requests
  setup: []
  max_turns: 6
  difficulty: medium
  category: basics
- id: install-from-requirements
  intent: Create a requirements.txt file with 'click>=8.0' and 'rich>=13.0' on
    separate lines. Then create a virtual environment and install from the
    requirements file.
  assert:
    - file_exists: requirements.txt
    - file_contains:
        path: requirements.txt
        text: click
    - ran: pip.*install.*-r
  setup: []
  max_turns: 5
  difficulty: easy
  category: basics
- id: show-package-info
  intent: Install the 'urllib3' package and then show detailed information about
    it using pip show.
  assert:
    - ran: pip.*install.*urllib3
    - ran: pip.*show.*urllib3
    - output_contains: urllib3
  setup: []
  max_turns: 4
  difficulty: easy
  category: query
- id: check-outdated
  intent: Create a virtual environment, install 'setuptools', and then list all
    outdated packages in the environment.
  assert:
    - ran: pip.*install.*setuptools
    - ran: pip.*list.*--outdated|pip.*list.*-o
  setup: []
  max_turns: 6
  difficulty: medium
  category: workflow
```
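Each `ran:` entry in the suite is a regular expression matched against the commands the agent actually executed. A rough illustration of that matching, assuming plain extended-regex semantics (the real harness may match differently):

```shell
# Hypothetical transcript line from an agent run
LOGGED_CMD="pip install -r requirements.txt"

# The 'ran: pip.*install.*-r' assertion is satisfied if the regex
# matches anywhere in the logged command
if echo "$LOGGED_CMD" | grep -Eq 'pip.*install.*-r'; then
  echo "assertion satisfied"
fi
```

The loose `.*` patterns are deliberate: they accept `pip`, `pip3`, or `venv/bin/pip`, and any flag ordering, so the assertion checks intent rather than exact syntax.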
Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.
## What you get with CLIWatch
Everything below is running live for pip — see the latest run. Set up the same for your CLI in minutes.
| Model | Pass Rate | Delta |
|---|---|---|
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |
## CI & PR Comments
Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.
## Track Over Time
See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.
## Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.

```yaml
thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%
```
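Conceptually, a gate like this reduces to a numeric comparison in CI. A hypothetical sketch, not CLIWatch's actual implementation (the variable names and the 82%/80% figures are illustrative):

```shell
# Illustrative values: measured pass rate vs. configured threshold
PASS_RATE=82
THRESHOLD=80

# Fail the CI job (non-zero exit) when the measured rate drops
# below the target for this model
if [ "$PASS_RATE" -lt "$THRESHOLD" ]; then
  echo "Pass rate ${PASS_RATE}% is below the ${THRESHOLD}% threshold" >&2
  exit 1
fi
echo "quality gate passed"
```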
## Get this for your CLI
Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.