# agent manages Python dependencies
$ pip3 install requests
  Successfully installed requests-2.31.0
 
$ pip3 freeze > requirements.txt
$ pip3 list --format=json
  [{"name":"requests","version":"2.31.0"}]

Can AI agents use pip?

pip is the Python package installer. Agents use it to install libraries into virtual environments, freeze requirements, and audit dependencies.

See the latest run →
100% overall pass rate · 1 model tested · 4 tasks · pip 24.0 from /usr/lib/python3/dist-packages/pip (python 3.12) · 3/6/2026

## pip eval results by model

| Model | Pass rate | Avg turns | Avg tokens |
| --- | --- | --- | --- |
| gpt-5-nano | 100% | 2.5 | 4.4k |

## pip task results by model

Turn counts below are for gpt-5-nano:

- **install-and-freeze** (medium, 2 turns): Create a Python virtual environment called 'venv', activate it, install the 'requests' package, and save the installed packages to requirements.txt using pip freeze.
- **install-from-requirements** (easy, 4 turns): Create a requirements.txt file with 'click>=8.0' and 'rich>=13.0' on separate lines. Then create a virtual environment and install from the requirements file.
- **show-package-info** (easy, 1 turn): Install the 'urllib3' package and then show detailed information about it using pip show.
- **check-outdated** (medium, 3 turns): Create a virtual environment, install 'setuptools', and then list all outdated packages in the environment.
Task suite source (52 lines · YAML)

```yaml
- id: install-and-freeze
  intent: Create a Python virtual environment called 'venv', activate it, install
    the 'requests' package, and save the installed packages to requirements.txt
    using pip freeze.
  assert:
    - ran: python.*-m venv|virtualenv
    - ran: pip.*install.*requests
    - ran: pip.*freeze
    - file_exists: requirements.txt
    - file_contains:
        path: requirements.txt
        text: requests
  setup: []
  max_turns: 6
  difficulty: medium
  category: basics
- id: install-from-requirements
  intent: Create a requirements.txt file with 'click>=8.0' and 'rich>=13.0' on
    separate lines. Then create a virtual environment and install from the
    requirements file.
  assert:
    - file_exists: requirements.txt
    - file_contains:
        path: requirements.txt
        text: click
    - ran: pip.*install.*-r
  setup: []
  max_turns: 5
  difficulty: easy
  category: basics
- id: show-package-info
  intent: Install the 'urllib3' package and then show detailed information about
    it using pip show.
  assert:
    - ran: pip.*install.*urllib3
    - ran: pip.*show.*urllib3
    - output_contains: urllib3
  setup: []
  max_turns: 4
  difficulty: easy
  category: query
- id: check-outdated
  intent: Create a virtual environment, install 'setuptools', and then list all
    outdated packages in the environment.
  assert:
    - ran: pip.*install.*setuptools
    - ran: pip.*list.*--outdated|pip.*list.*-o
  setup: []
  max_turns: 6
  difficulty: medium
  category: workflow
```
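The `ran:` entries in the suite are regular expressions matched against the commands the agent actually executed. `grep -E` is a quick way to sanity-check a pattern (a sketch of the matching idea, not cli-bench's internal matcher):

```shell
# check-outdated accepts either flag spelling; grep prints each matching line
pattern='pip.*list.*--outdated|pip.*list.*-o'
echo "pip list --outdated" | grep -E "$pattern"
echo "pip list -o"         | grep -E "$pattern"
```

Both echoed commands match, so either form of the outdated check passes the assertion.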

Evals are a snapshot, not a verdict. We run identical tasks across all models to keep comparisons fair. Results vary with CLI version, task selection, and model updates. Evals run weekly on 4 tasks using @cliwatch/cli-bench.

## What you get with CLIWatch

Everything below is running live for pip; see the latest run. Set up the same for your CLI in minutes.

| Model | Pass Rate | Delta |
| --- | --- | --- |
| Sonnet 4.5 | 95% | +5% |
| GPT-4.1 | 80% | -5% |
| Haiku 4.5 | 65% | -10% |

### CI & PR Comments

Get automated PR comments with per-model pass rates, regressions, and a link to the full comparison dashboard.

[Chart: pass rate over the last 30 days, releases v1.0 through v1.6]

### Track Over Time

See how your CLI's agent compatibility changes across releases. Spot trends and regressions at a glance.

```yaml
thresholds:
  claude-sonnet-4-5: 80%
  gpt-4.1: 75%
  claude-haiku-4-5: 60%
```

### Quality Gates

Set per-model pass rate thresholds. CI fails if evals drop below your targets.

## Get this for your CLI

Run evals in CI, get PR comments with regressions, track pass rates over time, and gate merges on quality thresholds — all from a single GitHub Actions workflow.
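A minimal sketch of what such a workflow could look like; the action reference and input names here are illustrative assumptions, not the published cli-bench interface:

```yaml
# Hypothetical workflow shape; the action name and inputs are assumptions
name: cli-evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: cliwatch/cli-bench@v1    # hypothetical action reference
        with:
          tasks: tasks.yaml            # the task suite shown above
          thresholds: thresholds.yaml  # per-model quality gates
```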

Compare other CLI evals