AWS Just Killed the AI Pilot Phase

A widely cited MIT report found that 95% of enterprise AI pilots fail. AWS just launched three core features in AgentCore designed to fix the trust and control problems that kill AI projects before they ever reach production.


AI's 95% Failure Rate Is Real

Ninety-five percent of enterprise AI pilots fail. That number, from a widely cited MIT report, hit boardrooms like a fire alarm this year, because it exposes a brutal reality: most corporate AI never makes it past the cool-demo stage. Budgets get burned, slide decks look great, and then the pilot quietly dies before touching a real customer or a production workflow.

Underneath that failure rate sits a simple problem: enterprises do not trust non-deterministic systems they cannot fully control. Traditional software behaves predictably; the same input yields the same output every time. Large language models improvise. They hallucinate, misinterpret policy, and occasionally invent data—behaviors that are unacceptable when you are moving money, touching medical records, or hitting internal APIs.

A slick chatbot demo in a conference room runs on handpicked prompts, curated data, and a forgiving audience. A production-grade AI system runs on messy tickets, half-complete CRM entries, angry customers, and compliance officers who assume everything will go wrong. That gap between demo and deployment is where pilots go to die. The system that looked magical in a sandbox suddenly needs audit trails, rate limits, error budgets, and incident playbooks.

Most enterprises discover this only after the pilot “succeeds” technically but fails organizationally. Security teams block access to critical tools. Legal demands hard guarantees on data usage. Ops teams cannot debug why an agent decided to refund $5,000 instead of $50. Without guardrails, evaluations, and observability built in, AI becomes an unaccountable black box bolted onto mission-critical systems.

This is why “agentic” AI has stalled in what many teams now call pilot purgatory. Agents can call tools, trigger workflows, and act autonomously, but companies lack a way to systematically prove they are safe, measurable, and improvable over time. The industry does not just need better models; it needs infrastructure that treats policy, evaluation, and memory as first-class citizens, not afterthoughts.

That is the shift AWS is now openly targeting: turning AI from experimental toy into governed infrastructure that enterprises can actually run at scale.

AWS's Answer to the Enterprise Dilemma


AWS re:Invent has turned into a live-fire exercise for enterprise AI, and AgentCore is AWS’s answer to the 95% pilot failure rate hanging over CIOs’ heads. Instead of another “build your own agent” SDK, AgentCore arrives as a production platform: a managed gateway, policy engine, eval system, and memory layer designed to keep agents from going rogue at scale.

AWS is blunt about the target customer: enterprises that already ran flashy demos and then slammed into security, compliance, and reliability walls. AgentCore promises agents that can run across any model, hit internal tools and APIs, and still respect corporate rules, SLAs, and audit trails. No infra babysitting, no one-off glue code.

At re:Invent, AWS elevated three ideas to first-class, always-on components of AgentCore: Policy, Evaluations, and Episodic Memory. These are not optional add-ons; they sit directly in the agent execution path, inspecting every request and every tool call.

Policy turns natural-language rules into executable guardrails. You can write constraints like “forbid Slack messages unless user has messaging right scope” or “block URLs containing ‘internal’ unless username starts with admin,” and AgentCore compiles that into code that runs in milliseconds. The policy engine sits behind the AgentCore gateway, deciding which tools an agent may call before anything touches Salesforce, Slack, or internal systems.

Evaluations attack the other half of the trust problem: quality drift and silent failure. AgentCore ships with off‑the‑shelf evals for correctness, safety, instruction following, and tool use, plus hooks for custom metrics, from brand voice to domain‑specific accuracy. Teams can run evals on demand or continuously, then wire the scores into monitoring stacks to decide when an agent is ready to leave “pilot” purgatory.

Episodic memory completes the picture by letting agents learn from prior successes and failures across many sessions, not just a single chat thread. Those memories feed back into both runtime behavior and evaluations, so enterprises can track whether agents actually improve instead of just improvising faster.

Building Unbreakable AI Guardrails

Policy in AgentCore is AWS’s attempt to hard-code corporate common sense into AI. Instead of burying rules in fragile prompts, AgentCore exposes Policy as a first-class control layer that sits between agents and the tools, data, and systems they want to touch. Every request hits this policy engine before anything else happens.

That design matters because modern models are no longer just autocomplete toys. Research from Anthropic and others documents capabilities like deception, strategic misrepresentation, and data exfiltration attempts when models get access to sensitive tools or internal networks. Enterprises cannot rely on vibes and red-team anecdotes when a misstep could leak customer data or trigger a financial transaction.

Policy gives enterprises a centralized, scalable way to say what agents may and may not do, then enforce it at runtime. You describe constraints in natural language—“forbid Slack messages unless user has messaging right scope,” “block URLs containing ‘internal’ unless username starts with admin”—and AgentCore generates the programmatic policy code automatically. That code executes in milliseconds, fast enough to sit in the hot path for thousands of requests per second.

Under the hood, every agent call routes through the AgentCore gateway, which consults the policy engine before exposing any tool. If the policy denies access, the agent never even sees the capability, whether that’s a Salesforce API, an S3 bucket, or a payments endpoint. Policy operates at the infrastructure layer, not at the mercy of whatever the model “feels like” doing.
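
To make the pattern concrete, here is a minimal sketch of what a compiled policy check boils down to once it reaches the hot path. This is plain Python for illustration, not the AgentCore SDK; the tool names, scopes, and request fields are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolRequest:
    username: str
    scopes: set[str]
    tool: str
    url: str | None = None


def is_allowed(req: ToolRequest) -> bool:
    """Deterministic gate evaluated before a tool is ever exposed to the agent."""
    # Rule: forbid Slack messages unless the user has the messaging scope.
    if req.tool == "slack.send_message" and "messaging" not in req.scopes:
        return False
    # Rule: block URLs containing 'internal' unless the username starts with 'admin'.
    if req.url and "internal" in req.url and not req.username.startswith("admin"):
        return False
    return True


# The gateway runs a check like this on every tool call; a denied request means
# the agent never even sees the capability.
print(is_allowed(ToolRequest("admin-kay", {"messaging"}, "slack.send_message")))  # True
print(is_allowed(ToolRequest("jdoe", set(), "slack.send_message")))               # False
```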

Contrast that with how most teams ship agents today. They stuff a paragraph of “don’t leak secrets, don’t browse internal sites, don’t approve refunds over $100” into a system prompt and hope the model obeys. That works in a demo; it breaks the moment you scale to hundreds of workflows, dozens of tools, and millions of calls.

Prompt-level instructions also fail silently. Models hallucinate, ignore instructions under pressure, or get jailbroken by clever inputs, and you rarely know until after something goes wrong. Policy in AgentCore flips that: governance lives outside the model, centrally managed, versioned, auditable, and testable with automated reasoning techniques that formally check for hallucinations and rule violations.

For enterprises trying to move beyond AI pilots, that shift is the difference between “please behave” and “cannot misbehave by design.” AWS is betting that kind of hard control plane, documented on the Amazon Bedrock AgentCore - Official Product Page, is what finally gets agents into production at scale.

From Plain English to Policy Code

Policies in AgentCore start as plain English, not YAML or JSON. Developers type instructions into a prompt box exactly the way they’d explain them to a security team: “Forbid Slack messages unless user has messaging right scope. Viewing websites with URL containing internal is forbidden unless the username starts with admin. Allow Slack messages when the user is within the allowed group.”

Behind that deceptively simple interface, AgentCore treats those sentences as source code. A policy compiler parses the natural language, resolves entities like “Slack messages,” “messaging right scope,” and “username,” and emits programmatic rules that bind directly to tools, resources, and identity attributes in your stack.

That generated policy is not a slow LLM call at runtime. AgentCore turns it into low-level, executable policy code that runs as deterministic logic, so each request hits compiled checks instead of re-prompting a model. You write the rule once in English, then AgentCore locks it in as fast, testable code.

AWS pushes you to validate those guardrails like any other production system. After generating the policy, you run test cases against it in the console, confirming that a user without the “messaging right scope” cannot send a Slack message, while an admin user can open an internal URL. No redeploys, no re-architecting—just adjust the text, regenerate, and re-test.
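
Treating those rules like production code also means testing them like production code. A sketch of what that might look like in a team's own repo, reusing the hypothetical `is_allowed` gate from the earlier snippet (run with pytest); in the AgentCore console the equivalent step is running test cases against the regenerated policy.

```python
# Hypothetical unit tests against the is_allowed() sketch shown earlier.
from policy_sketch import ToolRequest, is_allowed  # assumed local module


def test_slack_requires_messaging_scope():
    assert not is_allowed(ToolRequest("jdoe", set(), "slack.send_message"))
    assert is_allowed(ToolRequest("jdoe", {"messaging"}, "slack.send_message"))


def test_internal_urls_require_admin_prefix():
    internal = "https://wiki.internal.example.com"
    assert not is_allowed(ToolRequest("jdoe", {"browse"}, "browser.open", url=internal))
    assert is_allowed(ToolRequest("admin-kay", {"browse"}, "browser.open", url=internal))
```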

Scale is where this stops looking like a toy and starts looking like infrastructure. AgentCore’s policy engine sits on the hot path and evaluates rules in milliseconds, even as agents fan out across tools like Slack, Salesforce, and internal APIs. AWS explicitly targets “thousands of requests per second,” which pushes this closer to a firewall than a chatbot plugin.

AgentCore Gateway is the traffic cop that makes it work at that volume. Every agent request—whether from an internal assistant, MCP client, or external app—routes through the Gateway before it ever touches a tool or data source. The Gateway calls the policy engine, which decides, per request, which tools and resources the agent can actually use.

That means a single natural-language rule like “forbid Slack messages unless user has messaging right scope” becomes a global control surface. Any agent trying to hit the Slack tool gets checked, every time, at wire speed. No shadow agents, no forgotten scripts, no bypass paths.

For enterprises burned by that 95% AI pilot failure rate, this is the critical shift: policy moves from slideware to code, from documentation to the execution path.

Your AI Agent's Performance Review


Trust, not features, is what kills most AI pilots, and AWS knows it. After Policy, the second pillar of AgentCore is Evaluations—a built-in performance review system for agents that treats quality as part of the execution path, not a dashboard you bolt on later.

Most enterprises do evaluation backwards. Teams hack together an agent, ship a pilot, then scramble to measure whether it works. AgentCore flips that: AWS wants you to define evals first, establish a baseline, and only then start iterating, so every change has a measurable impact instead of “it feels smarter.”

Out of the box, AgentCore ships with a battery of standard evaluation signals. AWS calls out dimensions like:
- correctness
- helpfulness
- conciseness
- instruction following
- faithfulness
- response relevance
- coherence
- refusal behavior

Those signals matter because agents are non-deterministic. A demo might look flawless, then quietly degrade once you wire in real tools, noisy context, and messy customer data. Continuous monitoring across these eval dimensions is how you catch drift before a VP gets a hallucinated refund policy in their inbox.

AgentCore lets you run evaluations on demand or continuously. You can gate a new agent version behind a quality threshold, or run rolling evals in production to compare behavior week over week. That baseline becomes your north star: if correctness drops 10% after adding a new tool, you know exactly when you broke trust.
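
What that gating can look like on the consuming side is straightforward. The sketch below is a hypothetical CI check with made-up baseline numbers, not AgentCore defaults or its API:

```python
# Illustrative promotion gate: block rollout of a new agent version unless its
# eval scores stay within a tolerated regression of the baseline.
BASELINE = {"correctness": 0.92, "instruction_following": 0.95, "safety": 0.99}
MAX_REGRESSION = 0.02  # tolerate at most a 2-point drop on any dimension


def ready_to_promote(candidate_scores: dict) -> bool:
    for metric, baseline in BASELINE.items():
        if candidate_scores.get(metric, 0.0) < baseline - MAX_REGRESSION:
            print(f"blocked: {metric} fell below {baseline - MAX_REGRESSION:.2f}")
            return False
    return True


# Wired into CI/CD, a failing gate keeps the new version in pilot until scores recover.
print(ready_to_promote({"correctness": 0.94, "instruction_following": 0.96, "safety": 0.995}))  # True
print(ready_to_promote({"correctness": 0.82, "instruction_following": 0.96, "safety": 0.995}))  # False
```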

Custom evals plug the gap between generic quality and business reality. If your support bot must mirror a specific brand voice, you can codify that as a custom signal. If your compliance team needs hard guarantees around refusal in regulated workflows, you can write an eval that fails any response that strays outside policy.

Because Evaluations live inside AgentCore, not off to the side in a BI tool, every score ties back to a traceable decision path. When an agent goes off script, you can walk the chain from prompt, to tools, to memory, to final output and fix the actual failure mode, not just the symptom.

Custom Evals: Is Your AI a Pirate?

Off-the-shelf evals only get enterprises halfway. AgentCore’s real power move is custom evaluations, where teams define exactly what “good” looks like for their own agents and score against it continuously, not just in a lab benchmark once a quarter. That shift turns evals from a static QA checklist into a live governance system.

AWS’s own demo goes silly on purpose: a “talk like a pirate” eval. You literally specify that the agent must respond in pirate-speak—“Ahoy,” “matey,” nautical slang—and the custom eval checks every response. If the output sounds like LinkedIn instead of Blackbeard, the eval fails and logs it.

That pirate bit is a joke with sharp edges. Swap the theme and you get a serious enterprise pattern: enforce a brand voice across every customer-facing agent. A retailer can require friendly, concise, emoji-free responses; a bank can demand formal tone, hedged language, and explicit risk disclaimers. A custom eval scores each reply against those rules and feeds that data into dashboards and alerts.
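
A custom voice eval does not need to be exotic; conceptually it is a scoring function over each response. The sketch below is illustrative Python with invented markers, not the AgentCore evaluator interface:

```python
import re

# Toy "pirate voice" scorer: 1.0 when the reply sounds like Blackbeard,
# 0.0 when it sounds like LinkedIn. Markers are invented for illustration.
PIRATE_MARKERS = re.compile(r"\b(ahoy|matey|arr+|ye|aye|landlubber)\b", re.IGNORECASE)
FORBIDDEN_CORPORATE = re.compile(r"\b(synergy|leverage|circle back)\b", re.IGNORECASE)


def pirate_voice_score(response: str) -> float:
    hits = len(PIRATE_MARKERS.findall(response))
    misses = len(FORBIDDEN_CORPORATE.findall(response))
    if misses:
        return 0.0
    return min(1.0, hits / 2)  # expect at least a couple of nautical markers


print(pirate_voice_score("Ahoy matey, yer refund be on its way!"))          # 1.0
print(pirate_voice_score("Let's circle back and leverage our synergies."))  # 0.0
```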

More complex use cases go beyond tone. A healthcare agent might need to:
- Follow a multi-step triage workflow
- Surface specific regulatory disclaimers
- Escalate to a human under defined risk conditions

A custom eval can replay real conversations, verify each step, and assign pass/fail on workflow adherence, not just “helpfulness.” That’s how teams stop guessing whether an agent is safe to unleash on patients, traders, or field technicians.
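
A workflow-adherence eval can be as simple as replaying the agent's logged action trace and checking ordering and escalation rules. The step names below are invented for illustration:

```python
# Hypothetical workflow-adherence check over a logged action trace.
REQUIRED_STEPS = ["collect_symptoms", "show_disclaimer", "risk_assessment"]
ESCALATION_STEP = "escalate_to_human"


def workflow_adherence(trace: list, high_risk: bool) -> bool:
    # Every required step must appear, in order, in the agent's action trace.
    positions = []
    for step in REQUIRED_STEPS:
        if step not in trace:
            return False
        positions.append(trace.index(step))
    if positions != sorted(positions):
        return False
    # High-risk conversations must end with a human escalation.
    return ESCALATION_STEP in trace if high_risk else True


trace = ["collect_symptoms", "show_disclaimer", "risk_assessment", "escalate_to_human"]
print(workflow_adherence(trace, high_risk=True))  # True: workflow followed
```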

All of this plugs directly into Amazon CloudWatch. Standard metrics like latency and error rate sit next to custom scores for correctness, workflow compliance, or pirate-speak on a single timeline. Engineering, legal, and marketing teams can stare at the same graphs, and when something drifts, they can trace it back through AgentCore logs and the policies described in Introducing Amazon Bedrock AgentCore - AWS Blog.

The Agent That Learns From Its Mistakes

Episodic memory turns AgentCore from a clever chatbot router into something closer to an institutional brain. Instead of treating every request as a one-off transaction, agents can now store and retrieve experiences: what they tried, which tools they called, what worked, and what backfired.

Traditional enterprise agents behave like goldfish. They answer a ticket, call an API, close the loop, and forget everything the moment the response goes out. Episodic memory flips that model, giving AgentCore a persistent, queryable record of agent behavior over time.

Crucially, this memory is global, not personal. It does not cling to a single user’s chat thread or a specific session ID. When an agent figures out the right remediation steps for a nasty S3 permissions bug, those steps become part of the shared memory that every future instance of that agent can draw on.

That propagation changes how organizations think about “training.” Instead of retraining models or rewriting prompts every time a new edge case appears, the agent logs the episode, captures the context, tags the outcome as a success or failure, and reuses it. One support interaction in January can quietly improve thousands of similar cases in March.
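
Conceptually, episodic memory behaves like a shared store of tagged experiences that any agent instance can record to and recall from. The sketch below is a toy in-memory version of that idea, not AgentCore's internal format:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Episode:
    task: str                 # e.g. "fix S3 permissions error"
    tools_used: list
    outcome: str              # "success" or "failure"
    notes: str = ""
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class EpisodeStore:
    def __init__(self):
        self.episodes = []

    def record(self, episode: Episode) -> None:
        self.episodes.append(episode)

    def recall(self, task: str) -> list:
        """Return prior successes for a similar task so the agent can reuse them."""
        return [e for e in self.episodes if task in e.task and e.outcome == "success"]


store = EpisodeStore()
store.record(Episode("fix S3 permissions error",
                     ["iam.get_policy", "s3.put_bucket_policy"],
                     "success",
                     notes="missing s3:GetObject on the reporting role"))
print([e.tools_used for e in store.recall("S3 permissions")])
```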

Pattern recognition becomes the killer feature. With enough logged episodes, agents can start to spot that:
- 80% of failed order lookups trace back to a single legacy API
- Certain tools consistently time out under specific load patterns
- A particular policy rule triggers unnecessary refusals for safe requests

Those patterns feed back into decision-making. The agent can preemptively avoid flaky tools, escalate high-risk flows faster, or choose safer paths when previous attempts produced policy violations. Over time, the agent behaves less like a stateless function and more like a continuously improving operations runbook.

Because evaluations sit in the same execution path, AgentCore can score each episode and store the result alongside the memory. That closes the loop: policy constrains behavior, evaluations judge outcomes, and episodic memory makes sure every hard-earned lesson sticks across the entire deployment.

Connecting Memory to Measurable Improvement


Memory stops being a party trick once you wire it directly into evaluations. AgentCore now treats episodic memory as another data source for its quality checks, so every interaction feeds into a tight feedback loop: act, score, learn, repeat. That loop runs continuously, not as a quarterly MLOps science project.

Instead of judging an agent only on a single response, evaluations can now ask, “Given what you learned last week, did you actually do better today?” AgentCore can compare performance on recurring tasks across episodes: identical tickets, similar support flows, or repeat refund scenarios. If accuracy, latency, or policy compliance does not trend up over dozens or hundreds of runs, your “learning” agent is just hoarding logs.

Because memory is first-class, evals can enforce longitudinal goals, not just one-off correctness. You can define targets like “reduce tool-call failures by 30% over 500 episodes” or “cut average handle time by 10% for repeat customers.” Those metrics tie directly to business KPIs instead of abstract model scores.
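
Checking a longitudinal target is ultimately a trend calculation over episode history. A sketch with hypothetical numbers matching the "30% over 500 episodes" example:

```python
# Illustrative longitudinal check: has the tool-call failure rate dropped by
# the target amount between the earliest and latest window of episodes?
def failure_rate(episodes: list) -> float:
    """episodes: list of booleans, True means the tool call failed."""
    return sum(episodes) / len(episodes) if episodes else 0.0


def hit_longitudinal_target(history: list, window: int = 500, target_drop: float = 0.30) -> bool:
    old = failure_rate(history[:window])
    new = failure_rate(history[-window:])
    return old > 0 and (old - new) / old >= target_drop


# 500 early episodes at ~20% failures vs. 500 recent episodes at ~13%: a ~37% relative drop.
history = [i % 5 == 0 for i in range(500)] + [i % 8 == 0 for i in range(500)]
print(hit_longitudinal_target(history))  # True
```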

Observability gets sharper too. When an agent fails a custom eval—hallucinating a price, misrouting a ticket, leaking internal data—you can trace the entire reasoning path. AgentCore lets you walk back through the episodic memory: which tools it called, which prior conversations it reused, which policy decisions it hit or ignored.

That trace turns postmortems from guesswork into root-cause analysis. You can see whether the agent:
- Learned from a bad example and propagated the error
- Misinterpreted a previous success pattern
- Skipped a relevant memory that should have changed its plan

Once you know which memory led it astray, you can prune or rewrite that episode, then re-run the same eval set to verify the fix. The feedback loop closes: memory changes, behavior changes, metrics move—or they do not, and you know immediately.

Static AI tools behave like forms: same inputs, same outputs, no sense of history. With episodic memory wired into live quality evaluations, agents start to look like digital workers who onboard, get coached, and improve. Policy keeps them inside the lines, evaluations grade their performance, and memory gives them something to build on.

Why 'Built-In' Beats 'Bolted-On'

Built-in policy, evaluation, and memory inside AgentCore are not just convenience features; they sit directly on the execution path of every agent step. Every tool call, every resource access, every response routes through the same gateway that enforces policy and records episodic memory before the model ever touches sensitive data.

That architecture choice matters. Because policy lives at the gateway, AgentCore can apply guardrails to thousands of requests per second with millisecond latency, rather than bolting on a slow, separate “governance service” that runs after the fact. Evaluations tap into the same low-level traces, so quality checks see the exact context the agent used, not a lossy summary.

Most rival frameworks treat safety and monitoring as sidecars. You wire up:
- A separate policy proxy in front of tools
- A separate eval pipeline in a notebook or CI job
- A separate logging system for observability

Those pieces often drift out of sync, miss edge cases, or silently break when someone adds a new tool or changes a prompt.

AgentCore’s first-class design means new tools and workflows automatically inherit the same policies, evals, and memory behavior. When a developer registers an API or MCP tool, the gateway immediately subjects it to the existing policy engine and evaluation hooks—no extra SDK calls, no custom middleware, no bespoke wrappers per team.

Production teams care about failure modes, not demos. With AgentCore, a hallucinated refund, a data exfiltration attempt, or a broken workflow all surface through the same evaluation and trace pipeline that operations teams already monitor. Because episodic memory also sits in that core path, those failures feed back into the agent’s long-term behavior rather than disappearing into logs.

Contrast that with common “bolt-on” eval stacks, where quality checks run on sampled logs hours later. By the time a bad decision shows up in a dashboard, the agent may have repeated it thousands of times. Deep integration lets AgentCore run evals continuously and reactively, gating deployments or routing to humans when scores drop.

AWS is effectively saying that guardrails, measurement, and learning are table stakes, not plugins. AgentCore bakes that stance into its architecture, aligning with the broader re:Invent push toward opinionated, production-first AI platforms highlighted in Top Announcements of AWS re:Invent 2025 - AWS Blog.

The New Blueprint for Production AI

Ninety-five percent of enterprise AI pilots die in the sandbox because nobody can both trust and control what the models do at scale. AgentCore’s Policy, Evaluations, and Episodic Memory stack attacks that failure loop directly: hard guardrails define what agents may access, evals verify how they behave, and memory lets them improve instead of repeating the same mistakes forever.

Policy moves governance from slide decks into the execution path. Plain-English rules like “forbid Slack messages unless user has messaging right scope” compile into code that gates every tool call through the AgentCore gateway in milliseconds, across thousands of requests per second, with automated reasoning catching hallucinations and sketchy behavior before it hits production systems.

Evaluations turn fuzzy “is this working?” debates into dashboards and regression tests. Off‑the‑shelf metrics track correctness, safety, instruction following, and tool choice, while custom evals encode domain quirks—brand tone, legal constraints, even “talk like a pirate” if that matters—so teams can ship agents with the same rigor they use for APIs and microservices.

Episodic Memory closes the loop. Agents no longer operate as amnesiacs; they carry forward patterns from past successes and failures across users, workflows, and environments, and Evaluations can directly measure whether those memories translate into higher scores and fewer incidents over time.

Taken together, this trifecta looks less like a feature release and more like a new blueprint for production AI. Instead of brittle one‑off bots, enterprises get a governed, observable, self‑improving agent fabric that can actually graduate from pilot to company‑wide rollout.

AgentCore is aiming for the same category as Kubernetes or IAM: invisible when it works, impossible to ignore when it breaks. As automated agents start handling tickets, invoices, security checks, and code changes, platforms that bake control, measurement, and learning into the core runtime will decide which companies escape the 95% and which stay stuck in endless “experiments.”

Frequently Asked Questions

What are the three main new features in AWS AgentCore?

The three key announcements are Policy for natural language-based guardrails, Evaluations for continuous quality and performance monitoring, and Episodic Memory for agents to learn from past interactions.

How does AgentCore Policy ensure AI safety?

It converts plain English rules into programmatic code. These policies are checked at a central gateway in milliseconds before an agent can act, preventing unauthorized or unsafe operations.

Is AgentCore tied to a specific AI model like Claude or Llama?

No, AgentCore is designed to be model and framework agnostic. This allows enterprises to build and manage agents using any underlying large language model that fits their needs.

What makes AgentCore's new features different from other solutions?

The primary differentiator is that Policy, Evaluations, and Memory are built in as 'first-class citizens' at the lowest level of the agent execution path, rather than being added on as an afterthought.

Tags

#aws #agentcore #enterprise-ai #ai-agents #reinvent
