AI Codes for 24 Hours. The Future is Here (and Flawed).
We gave Claude a complex coding task and let it run nonstop for a full day inside a special test harness. The results reveal a stunning glimpse into the future of autonomous software development—and its current, very real limits.
The 24-Hour AI Gauntlet
Cole Medin wanted to know what happens when you stop treating AI as a coding autocomplete and start treating it like a junior engineer who never sleeps. So he spun up Anthropic’s new Claude Code “long‑running agent harness” and forced an AI to work for a full 24 hours, no breaks, no “I’m done” button. The result: a stress test not of raw model IQ, but of whether agentic systems can grind through a real software project end to end.
Instead of asking for a to‑do list app or a single Python script, Medin set a brutal target: a functional web clone of claude.ai. That means chat history, conversation flows, artifacts, and a responsive UI that behaves like the real product, not just a static landing page. The harness framed success as a working full‑stack app, not a passing code snippet.
Medin wired the experiment around test‑driven development from the start. Before the AI wrote a line of code, he defined automated end‑to‑end tests that spin up a dev server, launch a headless browser, and click through core flows. The agent’s job: keep editing code until those tests go green.
Anthropic’s open‑sourced harness, which Medin pulled from GitHub, glues this all together. An “initializer” agent lays out specs, tasks, and test suites, then a dedicated coding agent repeatedly edits files, runs tests, and inspects failures. Each session behaves like a mini sprint, and the harness chains dozens of these sprints back to back.
Over roughly 24 hours, the system cycled through more than 50 coding sessions, touching backend APIs, frontend components, and test fixtures. The browser tests acted as a ruthless referee: they either confirmed a feature worked or shoved the agent back into the editor. Progress came in bursts as the AI fixed a failing flow, then hit a new integration edge case.
By the end, a bit over half of the total tests passed, enough to produce a recognizable claude.ai‑style interface but far from a pixel‑perfect clone. The harness showed that “24 hours of AI” does not magically equal “production‑ready SaaS,” yet it also proved that modern agents can sustain nontrivial, multi‑layered software work when given structure, persistence, and hard metrics for done.
Beyond 'Chat-to-Code': The Agent Harness
Anthropic’s open-source agent harness turns Claude from a chatty autocomplete into something closer to a junior engineer who never clocks out. Instead of a single prompt and a blob of code, the harness wires Claude into a scaffold that can run for hours, even days, without someone babysitting every step.
At its core, the harness enforces a loop: plan → code → test → refine. Claude proposes a change, edits files, runs automated tests or a dev server, inspects the results, then decides what to do next. That cycle repeats dozens of times, exactly what Cole Medin leans on when he asks Claude to chase a claude.ai clone for 24 straight hours.
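None of that orchestration code appears in the video itself, but the shape of a single session is easy to sketch. Here is a minimal, illustrative version of the loop in TypeScript, with the model call and the file edits stubbed out as hypothetical helpers rather than Anthropic's actual harness internals:

```typescript
// Illustrative sketch of one plan -> code -> test -> refine session.
// proposeEdits and applyEdits are hypothetical stand-ins, not harness APIs.
import { spawnSync } from "node:child_process";

interface TestResult {
  passed: boolean;
  output: string;
}

function runTests(): TestResult {
  // The real harness drives a dev server plus browser tests; `npm test` stands in here.
  const result = spawnSync("npm", ["test"], { encoding: "utf8" });
  return { passed: result.status === 0, output: `${result.stdout}${result.stderr}` };
}

async function codingSession(maxIterations = 20): Promise<void> {
  for (let i = 0; i < maxIterations; i++) {
    const tests = runTests();                       // test: measure reality
    if (tests.passed) return;                       // the suite, not the model, decides "done"

    const diff = await proposeEdits(tests.output);  // plan + code: model reads the failures
    applyEdits(diff);                               // refine: write the proposed changes to disk
  }
}

// Hypothetical stand-ins for the model call and the file-editing tool.
declare function proposeEdits(failingOutput: string): Promise<string>;
declare function applyEdits(diff: string): void;
```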
Single-shot prompts give you a one-and-done answer based on a static snapshot of context. A long-running, stateful session keeps accumulating project history: failing tests, prior diffs, architectural decisions, even TODO comments. Over 50+ coding sessions, the agent can refactor earlier choices, untangle regressions, and pursue multi-step strategies that would be impossible in a single response window.
Anthropic’s design splits this into distinct roles. An initializer agent runs first, reading the repo, specs, and predefined tests, then drafting a high-level plan: tech stack, directory layout, milestones, and which tests define “done.” It can even generate or refine test suites so the system has an objective scoreboard before writing serious code.
Once the initializer sets the stage, a dedicated coding agent takes over. That agent loops through concrete tasks: create React components, wire API routes, adjust database schemas, or fix a specific failing Playwright test. Each loop uses tools exposed by the harness—file edit commands, test runners, headless browser checks—to make and verify changes.
Because the harness persists state to disk and threads it back into prompts, Claude can reason about yesterday’s migrations or that one brittle UI test that keeps flaking. Medin’s 24-hour run shows the result: the harness doesn’t just generate code, it orchestrates an ongoing negotiation between plan and reality, measured in passing tests rather than pretty demos.
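The on-disk format isn't shown in the video, but the mechanism is simple to picture: a small state file written after each session and spliced back into the next prompt. A hedged sketch, assuming plain JSON:

```typescript
// Hypothetical per-session state file; the real harness's format may differ.
import { existsSync, readFileSync, writeFileSync } from "node:fs";

interface SessionState {
  session: number;
  passing: number;
  failing: string[]; // names of tests still red
  notes: string[];   // decisions and known flaky tests carried forward
}

const STATE_FILE = "agent-state.json";

export function loadState(): SessionState {
  if (!existsSync(STATE_FILE)) {
    return { session: 0, passing: 0, failing: [], notes: [] };
  }
  return JSON.parse(readFileSync(STATE_FILE, "utf8")) as SessionState;
}

export function saveState(state: SessionState): void {
  writeFileSync(STATE_FILE, JSON.stringify(state, null, 2));
}

// Threaded back into the next session's prompt so the agent "remembers" yesterday.
export function stateAsPromptContext(state: SessionState): string {
  return [
    `Session ${state.session}: ${state.passing} tests passing.`,
    `Still failing: ${state.failing.join(", ") || "none"}.`,
    `Carry-over notes: ${state.notes.join(" | ") || "none"}.`,
  ].join("\n");
}
```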
Your Tests Are the Real Prompt
Your tests, not your prompts, really drove this 24‑hour stunt. Cole Medin treated test‑driven development (TDD) as the steering wheel: define what “done” means in code, then let Claude Code grind until reality matches the spec. No vibes, no “looks good to me,” just red or green.
Before the agent wrote a single line of UI, Cole wired up a full test suite that captured the core claude.ai flows. The harness knew about conversation creation, message history, and artifacts as explicit requirements, not vague goals. Success meant those tests passed, or the agent kept working.
That test suite acted as a contract between human and agent. Instead of micromanaging every component, Cole only said: here are the behaviors, here are the assertions, satisfy them. The agent’s autonomy lived entirely inside that contract, with the harness enforcing it run after run.
Progress stopped being subjective almost immediately. After each coding session, the harness ran the tests and produced a simple scoreboard: X of Y passing, plus stack traces for failures. Across more than 50 sessions over 24 hours, that number crept from zero to “a bit over half” of the tests passing.
Tests doubled as navigation and guardrails. When a refactor broke an earlier flow, the red tests yanked the agent back, forcing it to reconcile new code with old promises. That feedback loop replaced human code review with something colder and more reliable: automated checks.
Cole leaned heavily on end‑to‑end tests that simulated a real user in a headless browser. Using tools like Playwright or Puppeteer, the harness would:
- Boot the dev server
- Open a headless Chromium instance
- Click through login, new chat, and artifact creation
- Assert on DOM content, network calls, and persisted state
Those browser tests turned abstract requirements into concrete steps: “click this button,” “type this prompt,” “expect this response shape.” When they failed, the agent saw exact selectors, error messages, and expected vs. actual values, then patched code and reran the suite.
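Cole's exact suite isn't published, but a single check in that style is easy to imagine. A minimal Playwright sketch, where the port, selectors, and copy are assumptions rather than the real test code:

```typescript
// e2e/new-chat.spec.ts: illustrative only; selectors, port, and copy are assumptions.
import { test, expect } from "@playwright/test";

test("sending a message shows it in the thread", async ({ page }) => {
  await page.goto("http://localhost:3000"); // dev server booted by the harness

  await page.getByRole("button", { name: "New chat" }).click();
  await page.getByRole("textbox").fill("Hello, Claude clone");
  await page.keyboard.press("Enter");

  // Assert on rendered DOM state, not implementation details.
  await expect(page.getByText("Hello, Claude clone")).toBeVisible();

  // Reloading checks that the conversation actually persisted on the backend.
  await page.reload();
  await expect(page.getByText("Hello, Claude clone")).toBeVisible();
});
```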
By the end, the passing tests described a partial but real claude.ai clone. The failing ones mapped precisely to missing or broken behaviors, not hand‑wavy disappointment.
The First Few Hours: A Flurry of Progress
Momentum hits almost immediately. Claude Code, wired into Anthropic’s long‑running agent harness, spins up a fresh project, installs dependencies, and scaffolds a full‑stack app before a human would finish sketching the architecture. Within the first hour, it generates a React front end, a basic backend API, and the wiring needed to run end‑to‑end tests against a local dev server.
UI work comes fast and confident. The agent recreates a claude.ai‑style layout: sidebar for conversations, main chat pane, and an artifacts panel that can render code blocks and formatted text. It stubs out components for message bubbles, input areas, and conversation lists, then hooks them to placeholder data so the interface feels alive even before real logic lands.
Because Cole Medin front‑loads a battery of TDD checks, progress has a scoreboard. Early tests cover fundamentals: app boots without crashing, chat view renders, messages display in order, and basic routing works. Claude chews through these like a senior engineer on a greenfield sprint, often fixing failing tests in a single iteration.
Low‑level plumbing follows. The agent wires API routes for creating conversations, posting messages, and fetching history, then updates the front end to call them. TypeScript types, simple error handling, and environment config appear without prompting, a side effect of the harness constantly re‑running tests and surfacing stack traces.
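The video doesn't name the backend framework, so treat this as a hedged sketch of the kind of plumbing that shows up in this phase, assuming Express and an in-memory store rather than the agent's actual code:

```typescript
// Illustrative only: Express and the in-memory Map are assumptions, not the real stack.
import express from "express";
import { randomUUID } from "node:crypto";

type Message = { role: "user" | "assistant"; content: string };
type Conversation = { id: string; title: string; messages: Message[] };

const app = express();
app.use(express.json());

const conversations = new Map<string, Conversation>(); // stand-in for a real database

// Create a conversation.
app.post("/api/conversations", (req, res) => {
  const convo: Conversation = {
    id: randomUUID(),
    title: req.body?.title ?? "New chat",
    messages: [],
  };
  conversations.set(convo.id, convo);
  res.status(201).json(convo);
});

// Post a message to an existing conversation.
app.post("/api/conversations/:id/messages", (req, res) => {
  const convo = conversations.get(req.params.id);
  if (!convo) return res.status(404).json({ error: "conversation not found" });
  convo.messages.push({ role: "user", content: req.body.content });
  return res.status(201).json(convo.messages);
});

// Fetch conversation history.
app.get("/api/conversations/:id", (req, res) => {
  const convo = conversations.get(req.params.id);
  if (!convo) return res.status(404).json({ error: "conversation not found" });
  return res.json(convo);
});

app.listen(3001);
```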
During this “low‑hanging fruit” window, the system looks uncannily like magic. You watch commits stack up: new components, CSS tweaks, utility functions, test files. Each green test unlocks the next layer of ambition—multi‑message flows, loading states, basic artifacts rendering—without a human touching the keyboard.
For a few hours, the bottleneck is not intelligence, but I/O. The agent waits on `npm install`, browser tests, and dev server restarts more than it waits on ideas, ripping through the easy 30–40% of the test suite before the work gets truly hard.
Hitting the Plateau: Where AI Gets Stuck
Momentum doesn’t fail with a crash; it thins out into repetition. After roughly a dozen hours and dozens of agent sessions, Cole Medin’s claude.ai clone stops leaping forward and starts pacing in circles. New commits still land, but they mostly reshuffle existing logic, tweak selectors, or rename components without unlocking new passing tests.
Complexity stops being local and becomes systemic. The agent now wrestles with multi-hop problems: browser flows that depend on auth state, conversation threads that must persist across reloads, and artifact rendering that touches backend APIs, front-end routing, and UI state. Each change fixes one edge case while quietly breaking another two.
Flaky tests become the main antagonist. Headless browser checks occasionally fail due to race conditions, timing issues, or minor DOM differences. The harness dutifully treats every red test as a real bug, so the agent spends run after run chasing non-deterministic failures that a human would quickly tag as “test is bad, not app.”
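The specific flaky tests aren't shown, but the failure mode is familiar to anyone who runs browser suites. A human would typically swap a fixed sleep for a retrying, web-first assertion; a minimal Playwright sketch of that kind of fix, with the URL and selectors assumed:

```typescript
import { test, expect } from "@playwright/test";

test("artifact panel opens", async ({ page }) => {
  await page.goto("http://localhost:3000"); // assumed dev-server URL
  await page.getByRole("button", { name: "Open artifact" }).click();

  // Flaky: a fixed sleep races the render and fails on slow runs.
  // await page.waitForTimeout(500);
  // expect(await page.isVisible("#artifact-panel")).toBe(true);

  // Stable: web-first assertion retries until the panel appears or times out.
  await expect(page.locator("#artifact-panel")).toBeVisible({ timeout: 10_000 });
});
```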
You can see the plateau in the numbers. After 24 hours, the harness reports a bit over half the end-to-end tests passing—impressive for an automated system, but a hard ceiling rather than a smooth curve. Early hours knock out the obvious wins; later hours grind against integration tests that encode product nuance, not just syntax correctness.
As tasks get fuzzier, architectural intuition starts to matter and the agent doesn’t have any. It can refactor React components, shuffle API handlers, and adjust TypeScript types, but it lacks a strong mental model of the entire claude.ai-style app. When browser flows misalign with backend assumptions, the agent reacts locally instead of redesigning the flow.
Senior engineers handle this phase by stepping back and changing the shape of the system. They:
- Collapse leaky abstractions
- Introduce clearer boundaries between UI, state, and API
- Rewrite brittle tests that encode the wrong contract
The agent does none of that on its own. It treats every failure as a patchable defect, not a signal that the architecture or test suite needs a rethink. That makes it a powerful implementer—a tireless junior developer that never stops coding—but not the person you want deciding how your app should actually work.
This plateau, more than the flashy first-hour demo, shows where state-of-the-art autonomous coding really sits: brilliant at execution, still naïve at design.
The Final Scorecard: Success or Failure?
By hour 24, Cole Medin’s experiment ended with a very un-Silicon-Valley metric: only “a little over half” of the automated tests passed. No victory lap, no polished claude.ai clone, just a harness quietly reporting that roughly 50–60% of its own spec had been met.
Framed another way, a mostly unsupervised AI coding agent spent an entire day grinding on a real full‑stack app and shipped something that actually runs, routes, and renders. For a hands‑off system, that’s wild progress compared to the “toy CRUD app in one prompt” era, yet it still falls well short of production‑grade software.
The passing tests clustered around what current models excel at: structure, boilerplate, and predictable flows. UI rendering checks, component layout, basic navigation, and simple API endpoints mostly went green because they map cleanly onto patterns large language models already know.
Failures stacked up where messy, interconnected state lived. Complex conversation threading, artifact lifecycle rules, multi‑step flows, and edge‑case error handling produced a graveyard of red tests, exposing how brittle autonomous refactoring becomes when every change can break three other subsystems. The agent often fixed one failing test only to resurrect a previous one.
Cole’s harness leaned heavily on browser‑based end‑to‑end tests, spinning up a headless environment and clicking through the faux claude.ai interface. Those tests validated real behavior—buttons, modals, network calls—rather than just function signatures, which made every passing test more meaningful and every failing one harder to brute‑force away.
Cost-wise, the system behaved less like an infinite token firehose and more like a CPU‑bound CI server. Real‑world test runs, not prompt length, dominated wall‑clock time, so you got dozens of full iterations without crossing into absurd, million‑token‑per‑hour territory.
That tradeoff exposes an important constraint for long‑running agents: wall‑clock latency creates a natural throttle on token burn, but it also limits how many times the system can explore, fail, and recover. You can’t just “scale to more tokens” and expect the remaining 40‑plus percent of tests—often the gnarliest integration cases—to fall like dominoes.
Why TDD is Non-Negotiable for AI Coders
Code agents do not need vibes, they need tests. Cole Medin’s 24‑hour Claude Code marathon only stayed sane because every important behavior for the claude.ai clone existed first as automated checks. The agent’s job was not “build an app,” it was “make these tests go green,” which turned a vague prompt into a concrete contract.
That test harness acted like rails for an otherwise stochastic system. Each coding loop looked the same: propose edits, run the test suite, inspect failures, repeat. Over 50+ sessions, that rhythm created something rare in AI coding experiments: repeatable progress instead of a pile of unrelated code dumps.
TDD also gave the agent regression armor. When Claude refactored the React front end or rewired API handlers, the harness immediately re‑ran end‑to‑end browser tests that clicked through conversations, artifacts, and sidebar flows. If a “fix” broke message history or artifact rendering, a red test yanked the agent back before the bug spread.
That safety net encouraged aggressive, risky changes you would never trust in a pure “prompt-and-ship” workflow. The agent could rip out whole components, reorganize routes, or rename data structures because the tests preserved behavior. Intent lived in the assertions; implementation became an interchangeable detail the model could keep shuffling.
TDD also cleanly separated intent from implementation, which maps almost perfectly to how LLMs operate. Human engineers encoded product expectations as Jest and Playwright tests: “When I send a message, it appears in the thread,” “Artifacts open in a panel with metadata.” Claude only had to search the codebase for ways to satisfy those statements.
That externalization matters because models hallucinate requirements when prompts stay high‑level. Here, intent existed outside the model’s context window, pinned to disk as code. Even after thousands of tokens and dozens of tool calls, the ground truth for “done” remained the same: pass the suite, not please the prompter.
Compare that to the usual prompt‑and‑pray coding people try in chat UIs. You paste a fuzzy spec, get a blob of TypeScript, eyeball it, then discover three prompts later that a “small tweak” silently broke authentication or state management. No automated regression checks, no stable target, just vibes and manual clicking.
Medin’s experiment makes the trade‑off obvious. Structured TDD plus a harness produced a claude.ai‑style app with over half the tests passing after 24 hours. Prompt‑only workflows rarely survive 24 minutes without collapsing into inconsistent, unreproducible code.
The Human's New Role: AI Architect
Human effort in Cole Medin’s 24‑hour experiment did not go into writing React components or tweaking Tailwind classes. It went into defining the system the AI would inhabit: the repo layout, the claude.ai‑style feature set, and the rules of engagement the agent had to follow. Once that scaffolding existed, Claude Code became more like a very fast, very literal contractor than a colleague.
Medin’s most leveraged moves happened before the first line of AI‑written code. He chose the tech stack, wired up the long‑running harness from Anthropic’s GitHub repo, and decided that “done” meant passing a battery of automated tests. That foundation dictated everything the agent could and could not do over those 24 hours.
The harness itself effectively encoded a new job description for humans. An “initializer” agent set specs, tasks, and tests; a “coding agent” iterated on the codebase, ran the suite, and chased green checkmarks. Medin’s role was to design that loop, not micromanage each function or CSS rule.
Future developers who thrive in this world will obsess over problem framing, not syntax. They will define:
- The problem space: what the app must do, which flows matter, which edge cases count
- The constraints: stack choices, performance budgets, security rules, integration points
- The success criteria: end‑to‑end tests, acceptance thresholds, and non‑negotiable behaviors
Those decisions shaped why Claude could get “a little over half” of the tests passing and also why it stalled there. Missing or ambiguous tests meant the agent had no incentive to fix certain integration bugs. Overly broad goals left it thrashing on complex UI flows instead of prioritizing core functionality.
Value shifts toward engineering the harness itself: the prompts, tools, and feedback signals that keep agents pointed at the right hill. That includes writing ruthless test suites, designing observability around agent runs, and deciding when to reset context or refactor the spec. Humans become AI architects, responsible for the blueprint and the measuring tape, while the model handles the drywall and wiring.
Where This Fits in the AI Coding Explosion
Agentic coding no longer lives only in research papers and demo reels. Cole Medin’s 24‑hour gauntlet drops Anthropic’s long‑running agent harness squarely into the same conversation as GitHub Copilot, Codeium, and Replit Ghostwriter—except this thing doesn’t just autocomplete a function, it runs an entire software sprint by itself. The system scaffolds a claude.ai clone, wires UI flows, and pounds away at end‑to‑end tests for an entire day.
That jump from “smart autocomplete” to “persistent worker” is the real story. Tools like GitHub Copilot operate at the keystroke level: they predict the next line, maybe the next block. Medin’s setup operates at the task level: “implement artifacts, wire conversations, satisfy these 40+ tests,” then grind through dozens of agent sessions until reality matches the spec—at least halfway.
Agent frameworks used to feel like DeepMind‑only toys, hidden behind internal orchestration stacks. Anthropic’s open‑sourced claude-code-harness flips that dynamic. A solo developer can now spin up:
- An initializer agent that defines specs and tests
- A coding agent that edits code and runs browsers
- A feedback loop that keeps going for 50+ sessions
That accessibility changes who gets to experiment with autonomous agents. You no longer need a custom infra team to run long‑lived tools that call CLIs, spin up headless browsers, and manage project state. You need a GitHub repo, a test suite, and a credit card.
Industry‑wise, this points toward a new layer in the stack: “AI build pipelines” that sit next to CI/CD. IDE copilots help humans type faster; harnessed agents execute roadmaps, refactor codebases overnight, or hammer away at flaky integration tests. Medin’s 24‑hour run looks messy and incomplete, but it previews a near future where every serious engineering org has at least one repo where the primary contributor is a bot.
Your First Step into Agent-Driven Dev
Most developers do not need a 24‑hour AI marathon running in a tmux pane. The real upgrade comes from adopting the agentic habits behind Cole Medin’s stunt: encode your goals as tests, give the model tools, and let it grind through a feedback loop while you supervise at the system level.
Start with a single feature, not a full claude.ai clone. Pick something like “add OAuth login,” “implement CSV import,” or “build a settings page,” and define 3–10 automated tests that describe “done” more precisely than any prompt ever will.
Wrap that feature in a tight test‑and‑refine loop. Have your AI of choice write the implementation, run the tests, then ask it to fix whatever fails. Resist the urge to hand‑patch immediately; instead, treat yourself as the architect who adjusts specs, clarifies edge cases, and occasionally rewrites a gnarly function.
Anthropic’s own repos give you a concrete starting point. The long‑running agent harness Cole used lives at github.com/anthropics/claude-agentic-coding, and the broader Claude Code examples show how to wire up file editing, test runners, and browser automation into one loop.
You do not have to copy Cole’s 50‑plus coding sessions or spin up a headless browser farm on day one. You can get 80% of the benefit by letting an agent repeatedly call `pytest`, `npm test`, or Playwright scripts and only stepping in when it obviously plateaus or starts thrashing.
A practical starter recipe looks like this (sketched in code right after the list):
- Write a short spec and tests for one feature
- Give the AI your repo, the spec, and the test command
- Let it iterate until tests pass or progress stalls
- Refine tests, architecture, or prompts, then repeat
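A minimal version of that supervision layer, assuming a Jest-style “N passed” summary and whatever model-calling step you already have (here a hypothetical iterateOnce):

```typescript
// DIY supervision loop: run the suite, let the agent iterate, stop when progress stalls.
// The test command and the "passed" regex are assumptions; adjust for your runner.
import { spawnSync } from "node:child_process";

const TEST_CMD = ["npm", "test"]; // swap in pytest or your Playwright command
const MAX_STALLED_ROUNDS = 3;

function countPassing(): number {
  const run = spawnSync(TEST_CMD[0], TEST_CMD.slice(1), { encoding: "utf8" });
  const match = `${run.stdout}${run.stderr}`.match(/(\d+)\s+passed/); // Jest-style summary
  return match ? Number(match[1]) : 0;
}

async function supervise(): Promise<void> {
  let best = countPassing();
  let stalled = 0;

  while (stalled < MAX_STALLED_ROUNDS) {
    await iterateOnce();         // one agent round: read failures, edit files
    const passing = countPassing();

    if (passing > best) {
      best = passing;
      stalled = 0;
    } else {
      stalled++;                 // no new green tests this round
    }
  }
  console.log(`Plateaued at ${best} passing tests; time to step in as the architect.`);
}

// Hypothetical: your existing "ask the model to fix the failing tests" step.
declare function iterateOnce(): Promise<void>;

void supervise();
```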
Used this way, agent‑driven development does not replace you; it widens your reach. You can attempt features you would have shelved as “too big for this sprint,” explore more ambitious refactors, and keep a higher bar for quality because tests, not your patience, enforce it.
Frequently Asked Questions
What is an AI agent harness?
An AI agent harness is a framework that provides an AI model with tools, memory, and a structured loop (plan, code, test, refine) to perform complex, long-running tasks autonomously, like coding an entire application.
Did the AI successfully build the app in 24 hours?
The AI made significant progress, completing over half of the required tests for a claude.ai clone. However, it did not fully complete the project, highlighting the current limitations of AI agents on complex integration tasks.
Is this a practical way to build software today?
While still experimental, the test-driven approach shown is highly practical. It demonstrates that defining success with automated tests allows AI to work more reliably and produce better results than simple conversational prompting.
What AI model was used in the experiment?
The experiment used Claude Code, Anthropic's agentic coding tool built on the Claude family of models, running inside the company's open-sourced long-running agent harness.