Why two axes
A single score conflates two different things. A tool can survive the agent shift because it has a moat (Stripe, Plaid, Costco) or because it's already living inside the agent stack (a polished MCP server, a Claude Skill, a usage-priced API). Most rubrics confuse the two. We separate them.
The result is a 2x2. The cells aren't value judgments — they're strategic postures, and each has a different path forward.
Axis 1 — Defensibility (0-100)
Judged by Claude. Yes, really. We give Claude the tool's scraped homepage, pricing, docs, and metadata, and ask it which of seven moats apply:
- Physical-world coupling — hardware, logistics, terminals.
- Regulatory moat — licenses, HIPAA, banking charters.
- Network liquidity — two/N-sided marketplaces where users bring users.
- Proprietary refreshing data — data nobody else has, that updates.
- High-trust catastrophic workflows — legal, financial, medical.
- Multi-stakeholder coordination — Stripe, Plaid, ServiceTitan-class.
- Brand / community / taste — cultural authority.
Each moat that applies adds points. Most tools have 0-2 real moats. We're strict — "we have integrations" is not a moat.
The model also writes a one-paragraph verdict in Stork voice, plus a defense plan when the score is mid or low. The verdict is where the editorial brand lives.
Axis 2 — Agent-Readiness (0-100)
Deterministic probes. No LLM. Seven signals, weighted by how much they actually move adoption among agentic users:
| Signal | Points |
|---|---|
| Verified working MCP server | 25 |
| Listed on agent surfaces (registries, Cursor, Claude Desktop, etc.) | 20 |
| Usage-based pricing available | 15 |
| Headless agent auth (API key, sandbox, no sales gate) | 15 |
| Public OpenAPI / API contract | 10 |
| Active changelog (last 90 days) | 10 |
| llms.txt present | 5 |
We over-weight verified MCPs and agent-surface presence because they're the hardest signals to fake and the most predictive of agentic usage. We under-weight /llms.txt because it's trivially shippable.
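Because the signal weights sum to 100, the Axis 2 score reduces to a weighted checklist. A minimal sketch, assuming each probe returns a plain pass/fail with no partial credit; the key names and the function name are hypothetical, and only the point values come from the table above:

```python
# Point values copied from the signal table; key names are illustrative, not Stork's.
SIGNAL_POINTS = {
    "verified_mcp_server": 25,
    "listed_on_agent_surfaces": 20,
    "usage_based_pricing": 15,
    "headless_agent_auth": 15,
    "public_openapi_contract": 10,
    "active_changelog_90d": 10,
    "llms_txt_present": 5,
}


def agent_readiness_score(probe_results: dict) -> int:
    """Sum the points for every signal whose probe passed.

    Assumes binary probes: a signal earns all of its points or none.
    Signals missing from probe_results count as failed.
    """
    unknown = set(probe_results) - set(SIGNAL_POINTS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return sum(
        points
        for signal, points in SIGNAL_POINTS.items()
        if probe_results.get(signal, False)
    )
```

Under these assumptions, a tool with a verified MCP server and usage-based pricing but nothing else would score 25 + 15 = 40.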
The four cells
- Compounding — High defensibility AND high agent-readiness. Wins twice. (Stripe, Plaid, Twilio with MCPs.)
- Becomes the API — Low defensibility, high agent-readiness. The UI dies, the API survives because agents call it.
- Sleeping Giant — High defensibility, low agent-readiness. Safe but invisible. Add an MCP and you climb.
- Dead Man Walking — No moat, no agent presence. An LLM can do most of what the UI promises. Either build a moat or pivot.
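The mapping from two scores to a cell can be sketched as a pair of threshold checks. The cutoff between "low" and "high" is not stated here, so the value of 50 below is an assumption, as are the function and parameter names:

```python
def classify_cell(defensibility: int, agent_readiness: int, threshold: int = 50) -> str:
    """Place a tool in one of the four cells.

    The threshold of 50 is an illustrative assumption; the real cutoff
    between "low" and "high" may differ.
    """
    for name, value in (("defensibility", defensibility),
                        ("agent_readiness", agent_readiness)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be in 0-100, got {value}")

    high_defensibility = defensibility >= threshold
    high_readiness = agent_readiness >= threshold

    if high_defensibility and high_readiness:
        return "Compounding"
    if high_readiness:
        return "Becomes the API"
    if high_defensibility:
        return "Sleeping Giant"
    return "Dead Man Walking"
```

Keeping the threshold as a parameter makes it easy to see how many tools change cells when the cutoff moves.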
Exemptions
Some tools aren't meant to be agent surfaces — hardware companies, regulated insurance carriers, brick-and-mortar retail, pure consumer media. These get marked Not an Agent Surface and scored on defensibility alone. The exempt list is small and conservative; we'd rather under-exempt than over-exempt.
Refresh cadence
Axis 2 probes are designed to re-run weekly (cron currently disabled — see admin). Axis 1 LLM judgment re-runs when:
- Anthropic or OpenAI ships a major capability that changes the field.
- A tool owner requests a re-score (rate-limited to once per week per tool).
- An admin manually triggers it.
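The three triggers above can be sketched as a single gate. This assumes "once per week" means a rolling seven-day window per tool and that only owner requests are rate-limited; the trigger names and the function are hypothetical:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: "once per week" is a rolling 7-day cooldown, not a calendar week.
OWNER_RESCORE_COOLDOWN = timedelta(weeks=1)


def should_rescore(trigger: str,
                   last_owner_rescore: Optional[datetime],
                   now: datetime) -> bool:
    """Decide whether an Axis 1 re-score should run for one tool.

    trigger is one of the hypothetical labels "major_model_release",
    "owner_request", or "admin". Only owner requests are rate-limited here.
    """
    if trigger in ("major_model_release", "admin"):
        return True
    if trigger == "owner_request":
        if last_owner_rescore is None:
            return True  # no earlier owner request on record
        return now - last_owner_rescore >= OWNER_RESCORE_COOLDOWN
    raise ValueError(f"unknown trigger: {trigger!r}")


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    three_days_ago = now - timedelta(days=3)
    print(should_rescore("owner_request", three_days_ago, now))
```

Run as a script, the example asks for an owner re-score three days after the previous one, which the cooldown rejects.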
New submissions are scored with Sonnet 4.6 at submission time. Backfilled tools start on Haiku 4.5 and are upgraded to Sonnet on owner request or feature event.
Honest disclosure
We use Claude to judge tools. Claude is made by Anthropic, which pays Stork zero dollars for any of this. The framing "Claude judges Claude" is on the nose and we like it that way.
The probes can have false positives or false negatives. If your score is wrong, request a re-score from your tool page. Manual overrides per signal are available to admins; if you have a working MCP we missed, tell us.
What this isn't
It isn't a stock recommendation. It isn't a moral judgment. It isn't permanent. Tools move. Models improve. We re-score.