Claude Coded for 24 Hours. The Results Are Wild.

We pushed Anthropic's new AI coding agent to its absolute limit with a nonstop 24-hour coding marathon. The results reveal a shocking glimpse into the future of software development.


The Impossible Challenge: An AI Codes for 24 Hours

Anthropic’s latest coding experiment sounds like a dare: wire up Claude to a long‑running agent “harness,” hit go, and let it code for 24 hours straight. No coffee, no breaks, just an AI model grinding through a massive software spec while you sleep. The goal: see whether a modern coding model can behave less like autocomplete and more like a tireless junior dev team.

Long tasks usually break AI agents in boring, predictable ways. After a few hours, they swamp their context window, forget earlier decisions, and either hallucinate structure or simply declare the project “done” while half the features live only in the prompt. Traditional tools reset state, lose thread history, and force humans to babysit every major refactor.

Anthropic’s open‑sourced harness attacks that failure mode head‑on. Instead of one giant monologue with the model, the harness coordinates multiple agents, splits work across separate context windows, and persists state to disk. It leans on test‑driven development: define hundreds of test cases and a detailed app spec up front, then let agents iterate until the tests finally go green.

Cole Medin’s experiment pushes this harness to an extreme: a 24‑hour coding marathon to build a working clone of Claude’s own web app, complete with projects, conversations, artifacts, and file uploads. The harness spins up an initializer agent to generate a feature list with roughly 200+ granular test cases, scaffolds the project, and wires up Git from the start so every change has a trail. After that, coding agents cycle for hours, implementing and fixing features against those tests.

Framed as YouTube spectacle, this still previews a serious future for agentic coding. Long‑running AI agents that quietly build MVPs, background prototypes, and full UI shells overnight could compress weeks of setup into a single calendar day. The 24‑hour stunt just shows what happens when you stop treating AI as a chat box and start treating it as a process.

Breaking the AI Stamina Barrier


Stamina, not raw IQ, quietly kills most AI coding experiments. Long-running agents drift, overwrite their own plans, or simply “decide” they’re done once the context window fills up with half-baked code and meandering instructions. The Anthropic setup attacks that failure mode directly: a harness that remembers what the agent can’t.

Rather than a smart new agent, the harness acts as a coordination layer wrapped around ordinary Claude Code sessions. It tracks files, tasks, and test results across hours of execution, spinning up fresh conversations whenever one thread gets too bloated to stay coherent. Each new session starts with a distilled snapshot of what matters, not a messy transcript of everything that came before.

Massive projects turn into structured todo lists. The harness starts from a plain-text app spec or PRD, then explodes it into a feature list with hundreds of tiny, testable behaviors. Cole Medin’s run targeted 200+ test cases for a Claude.ai-style clone, all generated up front from that single spec.

Those features don’t live as vague bullet points. They become JSON objects with fields like description, files touched, and specific acceptance criteria. The harness can then pick one feature at a time, feed the relevant context into Claude, and ask it to implement or fix just that slice of the system.
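
Concretely, one of those entries might look something like the sketch below. The field names (`id`, `files`, `acceptance_criteria`, `status`) are illustrative assumptions rather than the harness's published schema, but they capture the idea: small, checkable units of work that live on disk, not in the chat.

```python
import json

# Sketch of a single feature entry. Field names are assumptions,
# not the harness's actual schema.
feature = {
    "id": "conversation-rename",
    "description": "User can rename a conversation from the sidebar",
    "files": ["components/Sidebar.tsx", "app/api/conversations/route.ts"],
    "acceptance_criteria": [
        "Clicking the rename action shows an inline text input",
        "Submitting a new title persists it and updates the sidebar",
    ],
    "status": "pending",  # flipped to "passing" once its tests go green
}

# Writing entries like this to disk is what lets a fresh agent session
# pick up exactly one slice of work without replaying the whole chat history.
with open("feature_list.json", "w") as f:
    json.dump([feature], f, indent=2)
```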

Instead of one 24-hour mega-chat, the system runs dozens or hundreds of focused “sprints.” Each sprint is a short-lived agent session with a narrow goal: add a component, wire up an API call, make a test pass. When the context window starts to clog, the harness closes that session and opens a new one seeded with the current repo state and task list.

State lives on disk and in git, not in the model’s memory. The harness leans on:

  • The codebase itself
  • The feature list JSON
  • A growing test suite and logs

By externalizing state, the harness turns a flaky, forgetful agent into something that behaves more like a deterministic build pipeline—one that can keep coding for 24 hours without losing the plot.

The Mission: Clone Claude.ai From Scratch

Claude’s 24-hour mission had a brutally clear brief: rebuild the Claude.ai web app from scratch, no human copiloting, no mid-course corrections. Not a toy chat box, but a working clone of the interface millions of users hit every day. Same core flows, same sense of polish, running end-to-end on code written entirely by an AI that never sleeps.

That means replicating the full conversational surface area. The agent had to stand up persistent conversation management with message history, sidebar threads, and proper routing to different projects. It also needed real file uploads and attachments, not stubs—handling documents, code, and PDFs that flow into the model and back out as references in the UI.

On top of that, Cole Medin’s spec demanded project-level organization and a clean, modern front end. The clone needed:

  • Project creation and switching
  • Grouped conversations per project
  • Support for “artifacts” or rich outputs
  • A responsive, Claude-style layout with light UX chrome, not raw Bootstrap

This is exactly the kind of thing long-running agents should excel at: a dense mix of front-end React or Next.js work, back-end API plumbing, and glue code to keep state consistent. It forces Claude to juggle routing, auth, persistence, and UI state while staying aligned with a human-readable product spec. No single prompt can cover that; only a system that decomposes work and revisits context over and over has a shot.

Anthropic’s own “Effective Harnesses for Long-Running Agents” article uses a Claude.ai-style clone as its poster child, complete with hundreds of tests and a multi-agent workflow. On paper, the harness coordinates initializer and coding agents, spins up scaffolding, and grinds through 200+ test cases until the app passes. On YouTube, that glossy diagram turns into a brutal question: can the same setup actually ship a Claude.ai clone in 24 hours with zero human edits, or did the blog post quietly lean on hand-tuning and cherry-picked screenshots?

Those stakes make this more than a novelty benchmark. If a harness plus Claude can really build a production-adjacent Claude.ai clone unattended, that hints at a near future where “start a new app” means writing a spec, hitting run, and coming back to a working SaaS skeleton the next morning.

The Architect: Meet the Initializer Agent

The Initializer Agent acts like the project’s chief architect, but with zero ego and unlimited patience. It’s the first process the Anthropic harness spins up, and everything downstream lives or dies on the quality of its work. Before a single feature gets coded, this agent sits with the app spec—the pseudo-PRD for the Claude.ai clone—and turns it into a fully structured plan.

Its job sounds simple: “analyze requirements and set up the project.” In practice, that means converting a few pages of text into a machine-readable blueprint that other agents can follow for 24 hours straight without wandering off. No debugging, no UI polish, no refactors—just setup.

The harness forces the Initializer Agent to create four core artifacts that define the entire build:

  • A feature list JSON with 200+ granular test cases
  • An initialization script to spin up the project
  • Boilerplate code scaffolding for the full stack
  • A freshly initialized Git repository

That feature list JSON quietly does the heaviest lifting. It explodes the Claude.ai clone spec into hundreds of tiny, verifiable behaviors: starting a new conversation, uploading a file, switching projects, rendering artifacts, handling empty states, and more. Each test case becomes a target for later coding agents, enforcing a kind of AI-native test-driven development.

The initialization script glues the environment together so future agents don’t waste tokens reinventing setup steps. It encodes decisions like framework choice, package managers, and dev commands—think `npm install`, database bootstraps, and `npm run dev` equivalents captured in one reproducible entry point.
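
A rough sketch of what such a script could look like, written here in Python since that is what the harness itself runs. The specific commands (an npm install, a Prisma migration, a lint pass) are assumptions standing in for whatever stack the Initializer Agent actually chooses:

```python
"""Hypothetical init script the Initializer Agent might emit.

It captures environment setup as one reproducible entry point so later
coding agents never re-derive these steps. The commands and stack choices
below are illustrative, not what the harness actually generates.
"""
import subprocess

SETUP_STEPS = [
    ["npm", "install"],                   # install frontend and backend dependencies
    ["npx", "prisma", "migrate", "dev"],  # example database bootstrap
    ["npm", "run", "lint"],               # sanity-check that the scaffold builds cleanly
]

def main() -> None:
    for cmd in SETUP_STEPS:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)   # fail fast so a broken setup is visible immediately

if __name__ == "__main__":
    main()
```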

Scaffolding gives the coding agents a map of the codebase before they touch a single component. You get predefined directories for frontend, backend, API routes, and shared utilities, plus placeholder files that hint at architecture: routing, state management, and integration points for Claude’s chat, artifacts, and file handling.

Git is the final non-negotiable piece. The Initializer Agent creates a new repo, establishing version history from line one, so subsequent agents can commit, diff, and roll back safely. For long-running agentic coding systems, that history is the only thing preventing a 24-hour session from collapsing into chaos.

The Unrelenting Logic of the Coding Loop


Coding harnesses live or die on their main workhorse: the Coding Agent. Once the Initializer Agent sketches the blueprint, this agent enters a relentless loop, waking up with a fresh context window, rereading the project state, and marching through features one by one. No chatting, no brainstorming—just a tight feedback cycle of tests, edits, and commits.

At the center sits a rigid test-driven development (TDD) discipline. Before a single line of production code changes, the system already knows what “done” looks like via a massive feature list JSON, often with 200+ granular test cases. The Coding Agent’s job is not to be creative; it is to make those tests go green.

Each loop starts with the agent loading a progress artifact: a structured file that tracks which features exist, which tests pass, and what broke recently. From there, it picks the next target—say, “support uploading multiple files to a project” or “render conversation history with artifacts”—based on priority and dependencies. That choice happens inside the prompt, but the state guiding it lives on disk.

Before touching the codebase, the agent runs the full regression suite. That means every iteration begins by revalidating everything built so far, catching regressions immediately instead of hours later. If a previously passing test fails, the agent pivots to fixing that before adding anything new.

Only after the regression tests pass does the agent implement the new feature. It edits source files, updates components, tweaks API handlers, and wires up UI behavior, all via the same tool interface. Then it reruns tests, iterating until the new case passes or it hits a configured limit on attempts.

When the feature works, the harness forces the agent to externalize its memory. It updates the progress file with details: which feature was implemented, which tests now pass, known limitations, and next logical steps. This file becomes a compact, machine-readable changelog for the next session.

Every loop ends with a Git commit. The harness treats Git not as an afterthought but as a core memory substrate: diffs tell the next Coding Agent instance exactly what changed, commit messages summarize intent, and history guards against catastrophic mistakes. Combined with the progress file, these commits let a brand-new context window “remember” 18 hours of work without rereading the entire codebase.
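
Boiled down, one pass of the loop looks roughly like the sketch below. The file name, the test command, and the `implement()` callable are placeholders for the real harness's prompts and SDK calls, and the progress artifact is folded into the feature list here for brevity:

```python
import json
import subprocess
from pathlib import Path
from typing import Callable

FEATURES = Path("feature_list.json")  # assumed filename; doubles as the progress artifact here

def tests_pass() -> bool:
    """Stand-in for running the project's full regression suite."""
    return subprocess.run(["npm", "test"], capture_output=True).returncode == 0

def commit(message: str) -> None:
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)

def coding_loop(implement: Callable[[dict], None], max_attempts: int = 3) -> None:
    features = json.loads(FEATURES.read_text())
    for feature in features:
        if feature.get("status") == "passing":
            continue
        # 1. Revalidate everything built so far before touching new work.
        if not tests_pass():
            implement({"description": "Fix the failing regression tests"})
            continue
        # 2. Implement the next feature, iterating until it passes or attempts run out.
        for _ in range(max_attempts):
            implement(feature)  # one focused agent session per attempt
            if tests_pass():
                feature["status"] = "passing"
                break
        # 3. Externalize memory: update the progress artifact, then commit.
        FEATURES.write_text(json.dumps(features, indent=2))
        commit(f"feat: {feature['description']}")
```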

Beyond the CLI: The Power of the SDK

Command-line tools like Claude Code feel powerful, but this 24-hour experiment quietly steps around them. Instead of shelling out to a CLI, the harness talks straight to Claude through the Claude Agent SDK in Python, treating the model like a first-class software component rather than a fancy terminal command.

Anthropic’s harness spins up agents, schedules work, and inspects git state entirely through SDK calls. The Python process orchestrates everything: creating sessions, streaming tool calls, reading and writing files, and even restarting agents when they stall. No human ever types a command into the Claude Code CLI once the run starts.

Direct SDK access also turns model choice into a config detail instead of a rebuild. The same harness could call:

  • Claude Sonnet 4.5 for cost-efficient iterations
  • Claude Opus 4.5 for gnarlier refactors
  • Third-party models like Code Llama or GPT-style coders via compatible APIs

Model swapping becomes a one-line change in a client initializer, not a whole new workflow. The harness already treats “Claude” as an abstraction: a coding agent with tools, context, and a contract. Underneath, that contract can point at any model that speaks JSON and respects the protocol.
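
A minimal sketch of that abstraction, with a hypothetical config class and illustrative model identifiers (the exact strings depend on the provider's API):

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    """What the harness needs to treat 'a coding agent' as an abstraction."""
    model: str                                    # the one-line change when swapping models
    allowed_tools: tuple[str, ...] = ("read_file", "write_file", "run_tests", "git")
    max_attempts_per_feature: int = 3

# Hypothetical presets; model names are illustrative.
CHEAP_ITERATIONS = AgentConfig(model="claude-sonnet-4-5")
HEAVY_REFACTORS = AgentConfig(model="claude-opus-4-5")
THIRD_PARTY = AgentConfig(model="my-openai-compatible-coder")  # served via a compatible endpoint
```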

This is why SDKs look like the real future of agentic coding. CLIs shine for quick one-off fixes or interactive debugging; they break down when you need persistent state, background jobs, or cross-agent coordination. Long-running systems like this harness demand programmatic hooks for logging, retries, metrics, and security controls.

Anthropic’s own Autonomous Coding Quickstart repository on GitHub bakes this assumption in. The repo is just Python, prompts, and wiring around the Agent SDK, making the whole thing feel less like a dev tool and more like an extensible microservice for software creation.

How to Run Your Own 24-Hour AI Coder

Running your own 24-hour Claude coder starts with Anthropic’s open-source harness on GitHub. Head to the autonomous coding quickstart in the claude-quickstarts repo, specifically the `autonomous-coding` directory, and clone it locally. You get a ready-made scaffold: prompts, agent wiring, and scripts for spinning up long-running Claude coding agents.

Setup feels closer to configuring a dev toolchain than a toy demo. You install dependencies (Python, Node, and project packages via `npm install` or `pnpm install`), drop your environment variables into a `.env` file, and point the harness at your Claude credentials. The repo ships with example configs for the Claude.ai clone, so you can mostly tweak instead of invent.

Cost control becomes the non-obvious killer feature. Cole Medin calls out a crucial trick from the video: use a Claude subscription token (the same subscription login Claude Code uses) instead of a metered API key. If you wire this to a pay-per-use key and let it run 24 hours, you risk waking up to a three or four-figure bill.

Kicking off the whole process comes down to a single command from the repo root, something like:

- `python main.py --app-spec=app_spec.txt`

After you hit enter, nothing exciting happens for 10–20 minutes. That’s the Initializer Agent quietly generating 200+ test cases, scaffolding the project, writing the init script, and bootstrapping a git repo before any visible UI appears.

Everything lives or dies on your app spec file. Anthropic’s harness expects a brutally detailed PRD-style text file describing pages, flows, edge cases, roles, and non-functional requirements. If you hand it a hand-wavy “chat app clone” paragraph, you get a hand-wavy product.

A strong app spec for a Claude.ai clone reads like something you’d hand a human team: URL structure, conversation states, file upload limits, artifact behavior, keyboard shortcuts, error copy, and even empty-state designs. The Initializer Agent explodes that into granular tests, so every vague sentence in your spec turns into a vague or missing feature 12 hours later.

The Gauntlet Begins: Claude is Unleashed


Midnight hits, the command runs, and the harness quietly flips from setup to execution. The Initializer Agent spins up its first session, pulling in the app spec, generating that sprawling feature_list.json with roughly 200 granular test cases, and wiring up the initial Next.js-style scaffolding plus a fresh git repo. Once it writes those artifacts, control hands off to the workhorse: the Coding Agent loop.

Your terminal stops looking like a normal dev console and starts reading like a live system log from an alien pair programmer. Tool calls stream by every few seconds: `read_file`, `write_file`, `run_tests`, `git diff`, `git commit`. You watch directories like `app/`, `components/`, and `lib/` fill with TypeScript, React components, and API route handlers, all authored by Claude with no prompts from you after that first kickoff command.

Output lines stack up at a pace no human could sustain. One moment the agent is scaffolding a sidebar for projects, the next it is wiring conversation threads, then patching a flaky test in the artifacts panel. The harness keeps sessions small, rotating context and spinning up new Coding Agent runs while preserving state through the filesystem, git history, and the feature list JSON.

Hands stay off the keyboard by design. No “approve” buttons, no manual retries, no mid-course prompt tweaks. Once you kick off the harness, the system owns the next 24 hours: planning, coding, running tests, and committing code. The only human activity is watching the scroll and occasionally checking system metrics to make sure the machine itself does not melt.

Security and validation threads through almost every action. The harness wraps shell commands to block anything dangerous, constrains file writes to the project directory, and uses Puppeteer via an MCP server to visually verify the Claude.ai clone in a headless browser. The agent can:

  • Boot the dev server
  • Open localhost in Chromium
  • Click through projects, conversations, and file uploads
  • Compare the rendered UI to its spec and test expectations

Each Puppeteer pass feeds back into the loop as another signal: did the app actually behave, or does the next commit need to rip out and rewrite half the UI?
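
The harness's exact guardrails aren't spelled out in the video, but the pattern is easy to sketch: a denylist for shell commands and a path check that keeps writes inside the project directory. Everything below (the root path, the blocked binaries) is illustrative, not the harness's real policy:

```python
from pathlib import Path

PROJECT_ROOT = Path("/srv/claude-clone").resolve()  # hypothetical project directory
BLOCKED_BINARIES = {"rm", "sudo", "curl", "ssh"}    # illustrative denylist

def write_allowed(relative_path: str) -> bool:
    """Only allow file writes that resolve inside the project directory."""
    target = (PROJECT_ROOT / relative_path).resolve()
    return target == PROJECT_ROOT or PROJECT_ROOT in target.parents

def command_allowed(command: list[str]) -> bool:
    """Reject shell invocations whose binary is on the denylist."""
    return bool(command) and command[0] not in BLOCKED_BINARIES

assert write_allowed("app/page.tsx")
assert not write_allowed("../../etc/passwd")
assert not command_allowed(["rm", "-rf", "/"])
```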

The Final Verdict: What an AI Builds in 24 Hours

Twenty-four hours and hundreds of agent cycles later, Claude emerged with something real: a working, full-stack Claude.ai-style web app. Not a toy, not a static mockup, but a React front end, API backend, and a test suite wired into the same harness that drove the build. Cole Medin scrolls through it on video like any normal SaaS product, because functionally, that’s what it is.

Visually, the clone lands surprisingly close. The sidebar layout, chat threads, project list, and overall Claude aesthetic all show up: light, clean, and familiar. You can start conversations, rename them, and see them populate in a persistent history panel.

Core interaction works too. The app sends messages to Claude, streams responses, and preserves context across turns in a conversation. File uploads function for basic use cases, attaching documents to a chat and surfacing them in the UI, though edge cases around large or unusual files still break.

Artifacts, Claude’s distinctive “inline apps” feature, arrive in partial form. The clone can render simple artifacts, display them in a dedicated panel, and keep them linked to a conversation. More advanced flows—multi-artifact sessions, complex stateful tools, or editing artifacts in place—either fail silently or behave inconsistently.

Project management lands somewhere in the middle. The harness-driven agent implements:

  • Project creation and deletion
  • Assigning conversations to projects
  • Basic filtering of chats by project

But bulk operations, robust search, and cross-project views remain flaky or missing, often surfaced as unimplemented buttons or dead UI states.

Under the hood, the test-driven strategy pays off. Out of roughly 200+ generated test cases, a large majority pass by the end of the 24 hours, with failures clustering around advanced UX polish and obscure error handling. The harness keeps cycling until progress flattens out, not when Claude gets “tired” or decides it’s done.

Medin calls the harness “legit” on camera, and it doesn’t feel like hype. He stresses that this is not production-grade engineering yet, but as a proof that agentic coding can autonomously assemble a complex, multi-featured web app in a day, the demo lands hard. Paired with Anthropic’s broader advances in long-running agents and models like Claude Opus 4.5, detailed in the “Introducing Claude Opus 4.5” announcement, the takeaway is blunt: this workflow is early, but it already works.

Your New AI Coworker Clocks In Tomorrow

Your current “AI pair programmer” is about to feel quaint. Long-running harnesses like Anthropic’s open-source agent harness turn models such as Claude from chatty assistants into background workers that quietly grind through a backlog for 24 hours or more, without losing the plot halfway through a refactor.

Instead of babysitting a prompt window, you can hand an agent a PRD, a repo, and a test suite, then come back to a working prototype. Cole Medin’s Claude experiment shows this concretely: a harness-coordinated Claude Code instance scaffolds a Claude.ai-style interface, wires projects and conversations, and iterates through hundreds of tests over a full day of compute.

For developers, this looks less like a novelty and more like a new tier of infrastructure. Think of agents as:

  • Overnight prototype builders
  • Continuous refactoring daemons
  • Test-generation and coverage bots
  • Documentation and migration assistants

Give one of these systems 24 hours and a feature list JSON with 200+ cases, and it will dutifully chase green checks while you sleep.

None of this feels “production-ready” yet. The harness in Anthropic’s quickstart repo is experimental, brittle around flaky tests, and prone to the same hallucinations as any LLM. But the strategies it encodes—test-driven prompts, strict success criteria, Git as the source of truth, multi-agent coordination—map directly onto how you harden real-world AI systems.

You can already lift these patterns into your stack. Use an initializer agent to generate specs, scaffolding, and tests; constrain a coding agent to only modify certain directories; wire CI to run the same harness-driven checks before merge. Each step makes your AI helpers less like autocomplete and more like deterministic workers attached to your pipeline.

Agentic engineering will change what “writing software” even means. Human engineers define architectures, constraints, and review gates, while fleets of specialized agents chew through implementation, tests, and integration over dozens of hours. The Claude clone experiment is a rough sketch of that future: codebases shaped less by keystrokes and more by orchestrating legions of tireless, test-obsessed collaborators.

Frequently Asked Questions

What is the Anthropic Harness for long-running agents?

It's an open-source coordination layer that allows AI coding agents to work on complex tasks for extended periods (hours or days) by managing context windows and breaking work into smaller, testable chunks.

Can this harness be used with models other than Claude?

Yes. The harness is model-agnostic. Because it's a system of prompts and artifact files, you can swap out Claude Code for other models like OpenAI's or open-source alternatives by adapting the client SDK.

Is this autonomous coding system ready for production use?

No, it's still highly experimental. It's best suited for rapid prototyping, proof-of-concept generation, and exploring the future of agentic engineering, rather than building production-ready applications.

How does the harness avoid context window limitations?

It creates a new, fresh context window for each coding agent session. The agent catches up on progress by reading core artifact files like a progress summary, a feature list, and the existing codebase, ensuring it only needs relevant context for the next granular task.

Tags

#AI Agents · #Claude · #Anthropic · #Software Development · #Automation
