Coming soon

CLI Intelligence

Your evals show pass/fail. Intelligence tells you why agents struggle with your CLI and exactly what to fix. AI-powered analysis of your benchmark traces, with projected impact for every recommendation.


Example Report Preview

Executive Summary

Only 38% of first command attempts succeed, and agents consistently stumble on the plural subcommand name. Help text is effective once found (85% success after reading --help), which suggests the CLI's functionality is well documented but poorly surfaced through command naming.

High: Add singular subcommand alias (9/24 traces)

High: Add --format as alias for -o (7/24 traces)

Medium: Accept case-insensitive enum values (2/24 traces)
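
For a sense of what these three fixes can look like in practice, here is a minimal sketch using Python's argparse. It assumes a hypothetical CLI named "mycli" with a "checks" subcommand and an output option that accepts "json" or "table"; those names and values are illustrative and are not taken from a real report.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical CLI called 'mycli'."""
    parser = argparse.ArgumentParser(prog="mycli")
    sub = parser.add_subparsers(dest="invoked", required=True)

    # Fix 1: accept the singular "check" as an alias for "checks".
    checks = sub.add_parser("checks", aliases=["check"])
    # Record a canonical command name regardless of which spelling was typed.
    checks.set_defaults(command="checks")

    # Fix 2: expose the same option as both -o and --format.
    # Fix 3: lowercase the value before validation, so "JSON" matches "json".
    checks.add_argument(
        "-o",
        "--format",
        dest="output",
        type=str.lower,
        choices=["json", "table"],  # assumed enum values
        default="table",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["check", "--format", "JSON"])
    print(args.command, args.output)
```

With this setup, "mycli check --format JSON" and "mycli checks -o json" parse to the same result.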

Pass Rate

88%→100%

Discovery Cost

1.3→~0.3

Avg Turns

6.3→~3.5

How it works

Actionable Recommendations

Specific CLI changes ranked by severity. Not generic advice, but concrete fixes like "add a singular alias for 'checks'" with the exact frequency from your traces.

Projected Impact

See how your pass rate, discovery cost, and average turns would improve if you implemented each recommendation. Data-driven prioritization.
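
As a rough illustration of frequency-based prioritization, the sketch below ranks the three recommendations from the example report by the share of traces each one touches. It only reproduces that simple ratio; how Intelligence actually projects pass rate, discovery cost, and turns is not described here, so treat the function names and the ranking rule as assumptions.

```python
from typing import List, Tuple

# Counts copied from the example report above: (title, severity, affected traces).
TOTAL_TRACES = 24
RECOMMENDATIONS: List[Tuple[str, str, int]] = [
    ("Accept case-insensitive enum values", "Medium", 2),
    ("Add singular subcommand alias", "High", 9),
    ("Add --format as alias for -o", "High", 7),
]


def affected_share(count: int, total: int = TOTAL_TRACES) -> float:
    """Fraction of traces in which the issue appeared."""
    return count / total


def rank_by_frequency(recs: List[Tuple[str, str, int]]) -> List[Tuple[str, str, int]]:
    """Order recommendations from most to least frequently observed (assumed rule)."""
    return sorted(recs, key=lambda rec: rec[2], reverse=True)


if __name__ == "__main__":
    for title, severity, count in rank_by_frequency(RECOMMENDATIONS):
        share = affected_share(count) * 100
        print(f"{severity}: {title} ({count}/{TOTAL_TRACES} traces, {share:.1f}%)")
```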

Built on Your Eval Data

Intelligence analyzes the traces from your actual benchmark runs. It sees what agents try, where they fail, and why. No synthetic data, no guesswork.

What Intelligence surfaces

Every insight is derived from your real agent traces.

Command Chain Analysis

Agents chain "list", then "get", then "show" when a single "describe" could do it all. Intelligence surfaces composite command opportunities that cut agent turns.

Go from 5 chained commands to 2

Error Recovery Patterns

Agents hit an error, then try 3 flag variations before finding the right one. Intelligence identifies which error messages need better suggestions and where "did you mean?" prompts would eliminate retry loops.

Eliminate 60% of wrong-flag retries
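
A "did you mean?" prompt can be as small as a fuzzy match against the flags the CLI already knows. The sketch below uses Python's standard difflib module; the flag list and the 0.6 similarity cutoff are assumptions chosen for illustration, not values from any report.

```python
import difflib
from typing import List, Optional

# Hypothetical set of flags a CLI might accept.
KNOWN_FLAGS = ["--format", "--output", "--verbose"]


def suggest_flag(unknown: str, known: List[str] = KNOWN_FLAGS) -> Optional[str]:
    """Return the closest known flag, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(unknown, known, n=1, cutoff=0.6)
    return matches[0] if matches else None


def unknown_flag_message(unknown: str) -> str:
    """Build an error message that includes a suggestion when one exists."""
    suggestion = suggest_flag(unknown)
    if suggestion is None:
        return f"unknown flag {unknown}"
    return f"unknown flag {unknown}; did you mean {suggestion}?"


if __name__ == "__main__":
    print(unknown_flag_message("--fromat"))
```

Putting the suggestion in the error itself gives an agent the right flag on the very next turn, instead of leaving it to cycle through variations.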

Help Text Effectiveness

After reading --help, 85% of attempts succeed. But only 40% of agents try it first. Intelligence shows where the discoverability bottlenecks are, and what to surface in error messages to short-circuit the help lookup.

Double first-attempt success rate

PR Impact Preview

Renaming a subcommand? Intelligence predicts the impact: "This rename will break 3/12 agent tasks. Here's the predicted new pass rate." Regression prevention, not just detection.

Catch regressions before merge

Start with free evals today

Intelligence builds on your eval data. Start running benchmarks now and you'll be first in line when Intelligence launches.

Get Started Free