Claude 4.5's Shocking Coding Test
We put Anthropic's new Claude Opus 4.5 to the test on a real-world coding project. The results show a new era for AI-assisted development is here, but it's not what you think.
The AI Hype vs. Reality Check
Hype cycles move fast in AI, and Claude Opus 4.5 arrives at full sprint. Anthropic claims its new Claude Opus model now tops a wall of benchmarks, from code generation leaderboards to long-context reasoning tests, positioning it as the “frontier” system for serious software work. Charts look great in a launch blog, but they do not ship product.
For developers, one metric actually matters: does this model make shipping features faster and safer than the last one? If a tool cannot cut the number of edits, rollbacks, and “why is this broken in prod?” moments, benchmark scores become trivia. The only scoreboard that counts lives in git history and incident logs.
To test that, you need more than contrived LeetCode puzzles or toy CRUD apps. You need a Real Coding Workflow inside a real product, with messy legacy code, half-documented components, and UX requirements that change mid-implementation. That is where Claude Opus 4.5 either earns its hype or exposes the gap between leaderboard wins and day-to-day engineering reality.
So the test bed is Composalo, a production web app that already has paying users and a non-trivial codebase. The setup: run Opus 4.5 through Cursor and the terminal-based Claude Code, keep everything In Production, and see whether the AI can behave like a competent pair programmer rather than a code autocomplete on steroids. No cherry-picked greenfield projects, no sanitized repos.
The workflow centers on three concrete tasks that ramp in difficulty. First, a simple UI tweak: add an interaction-mode button to a node toolbar so users understand they can double-click and scroll an embedded screen. Pure front end, small scope, perfect for testing whether Opus can read an existing component and thread a feature through without collateral damage.
Next comes a medium-hard feature: a database-backed “duplicate node” action that copies a node, preserves the right data, generates fresh IDs, and avoids dragging along prompt history or stale versions. Finally, a complex streaming UI behavior pushes the model into multi-file reasoning, state management, and edge-case handling. Benchmarks say Claude Opus 4.5 can handle it. Reality will decide.
The 'Plan Mode' Flywheel
Plan mode quietly runs the whole workflow. Before Claude Opus touches a single line of code, Moritz drops into plan mode and narrates what he wants: where the feature lives, how it behaves, which components it touches. That up-front description turns the model from autocomplete-on-steroids into something closer to a junior engineer taking a spec.
Claude Opus then does something deceptively powerful: it interrogates the spec. Questions like “Which icon would you prefer?” and “Where should the button be positioned?” sound trivial, but they kill off entire classes of rework. Instead of hallucinating UI decisions, the model locks in details such as a cursor pointer icon, placement after the preview button, or whether a duplicate node should tweak the title.
Those clarifying questions act as a guardrail against misaligned intent. Moritz doesn’t wait to discover that the button ended up in the wrong toolbar or that duplication logic copied the wrong version history. He answers a handful of targeted prompts, and Claude Opus bakes those constraints into its plan before touching the codebase.
For simple UI tweaks, that lightweight back-and-forth inside Claude Code is enough. Moritz stays in plan mode, approves the file list—node toolbar, web app node, button styling—and then auto-accepts edits. The result: a working interaction toggle, correct cursor behavior, and even edge-case handling like deactivating when clicking outside, all on the first run.
Complex features flip the workflow into a heavier gear. When database writes, Supabase schemas, or multi-version logic enter the picture, Moritz asks Claude Opus to generate a separate planning document. That doc captures the agreed behavior: which fields to duplicate, which to skip (prompt history, versions), and rules like “only duplicate the version the user is currently viewing.”
That planning document becomes a durable artifact. Moritz can:
- Reuse it with a fresh Claude Opus session
- Hand it to a faster model like Cursor's Composer
- Revisit it weeks later to understand why an implementation behaves a certain way
Instead of relying on fragile chat history, the workflow builds a stable implementation path that survives context resets, model swaps, and human second thoughts.
The Easy Win: Nailing a UI Tweak
Adding a simple “interact with content” control should be the sort of softball any modern coding model can hit. For this first feature, Claude Opus had to wire a new button into Composalo’s node toolbar so users could explicitly toggle an interaction mode for the embedded screen instead of discovering it via a hidden double-click gesture.
Moritz drops into plan mode and describes the change: a new icon-only button in `NodeToolbar`, positioned after the preview button, that toggles an interaction mode on the `WebAppNode` iframe. Opus immediately identifies the two key components—`NodeToolbar` for the UI and `WebAppNode` for the iframe behavior—and proposes edits confined to those files, no shotgun refactor, no wandering into unrelated modules.
Implementation runs in a single pass. Opus adds the toolbar button, wires it to the existing double-click interaction logic, and updates styling so the active state clearly signals “you are now interacting with the embedded app.” Moritz boots the local dev server, clicks the new “interact with content” button, and sees the cursor flip to a hand over the iframe while scroll and interaction work as expected.
Then comes the unscripted part. Without being asked, Claude Opus implements logic to automatically deactivate interaction mode when the user clicks outside the node. Click away from the iframe and the mode turns off, returning the cursor and behavior to normal. Moritz flags this as the kind of edge-case handling Sonnet or another model might easily skip.
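To make that behavior concrete, here is a minimal sketch of a click-outside deactivation hook, assuming a React setup; the hook name, ref, and structure are illustrative assumptions, not Composalo's actual code.

```tsx
// Minimal sketch: an interaction-mode toggle that deactivates when the user
// clicks outside the node container. Hook and ref names are assumptions.
import { useEffect, useRef, useState } from "react";

export function useInteractionMode() {
  const [active, setActive] = useState(false);
  const containerRef = useRef<HTMLDivElement>(null);

  useEffect(() => {
    if (!active) return;
    const handlePointerDown = (event: MouseEvent) => {
      // If the click landed outside the node, switch interaction mode off,
      // mirroring the unprompted edge-case handling Opus added.
      if (containerRef.current && !containerRef.current.contains(event.target as Node)) {
        setActive(false);
      }
    };
    document.addEventListener("mousedown", handlePointerDown);
    return () => document.removeEventListener("mousedown", handlePointerDown);
  }, [active]);

  return { active, toggle: () => setActive((prev) => !prev), containerRef };
}
```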
That extra behavior pushes Opus out of the “autocomplete on steroids” bucket and into something closer to a junior engineer who anticipates UX pitfalls. It doesn’t just follow instructions; it infers that a sticky interaction mode would confuse users and quietly fixes it. Anthropic leans hard on this kind of “proactive reasoning” in its own pitch for the model, Introducing Claude Opus 4.5, and here it shows up in a very practical, very shippable way.
Thinking in Edge Cases
Edge cases usually show up at the end of a sprint, when a QA tester clicks somewhere they “shouldn’t” and everything falls apart. Here, Claude Opus quietly handled one of those cases upfront: when Moritz clicked outside the node while the new “interact with content” mode was active, the feature automatically deactivated. No one asked for that behavior, but the model wired it in anyway.
That small detail matters. A sticky interaction mode that never turns off is exactly the kind of UX footgun that ships, annoys users, and then burns cycles in bug reports and hotfixes. By defaulting to a click-outside-to-dismiss pattern, Opus aligned with a familiar web idiom used in modals, dropdowns, and popovers.
More interesting than the code is the product thinking embedded in it. Moritz only requested a button that toggles interaction; he never specified what should happen when focus shifts away. Opus inferred a sane lifecycle for the feature: activate on click or double-click, visually signal the mode with a cursor change, and gracefully exit when the user clicks elsewhere.
That kind of anticipatory behavior is what Anthropic means when it talks about improved reasoning in frontier models. This is not just pattern matching on React snippets; it is modeling user intent and failure modes, then baking those assumptions into the implementation. For a “simple” UI tweak, Opus still reasoned about state, focus, and escape hatches.
Tiny edges like this compound into real time savings in a Real Coding Workflow. Each missed edge case usually costs at least one extra loop:
- Manual testing to discover the bug
- Re-explaining context to the model
- Regenerating patches and re-running the app
Avoiding even 2–3 of those loops per feature translates into hours reclaimed across a week of development. Moritz barely touched the implementation beyond a tooltip copy edit; he did not have to specify interaction teardown, add event listeners, or chase weird stuck states.
Scaled across dozens of features, that behavior starts to look less like “autocomplete for code” and more like a junior product engineer embedded in your editor. Not perfect, not omniscient, but increasingly able to think about how real users will actually click around a screen.
The Medium Challenge: Database Duplication
Medium difficulty arrived with a deceptively simple request: a duplicate node button that touched both React UI and the backing database. Moritz wanted it slotted into the node toolbar’s action dropdown, living alongside existing actions, and spawning a copy of the current node inline on the canvas. No mock data, no client-only hack—this had to hit the real persistence layer.
The prompt he fed Claude Opus 4.5 was brutally specific. The model had to clone node data, generate a fresh UUID, and skip entire categories of state: no prompt history, no old versions, no hidden metadata that would accidentally drag along stale context. It also needed to respect Composalo’s versioning model, where edits stack as separate “versions” a user can cycle through.
Instead of naively copying every column, Opus 4.5 inferred a minimal duplication set. It kept the core node fields—title, content, layout, type—while omitting history tables and version records. For the visible label, it appended a “copy” suffix to the title, so “Landing Page v2” became something like “Landing Page v2 (copy),” a tiny UX detail that reduces confusion in dense canvases.
On the database side, the model wired up a proper insert with a new UUID rather than reusing or manually tweaking the original ID. That step sounds trivial, but it is exactly where many AI-generated patches fail, either by colliding IDs, mutating the original row, or forgetting foreign-key relationships. Here, Opus 4.5 created a clean new row that behaved like any other node created through the normal UI.
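A rough sketch of what that duplication logic might look like, assuming a Supabase client and a hypothetical `nodes` table; the field and table names are placeholders rather than Composalo's real schema.

```ts
// Illustrative duplication helper: copy core fields, mint a fresh UUID, append
// a "(copy)" suffix, and leave prompt history and version records behind.
// Table and column names are assumptions for the sketch.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

interface NodeRow {
  id: string;
  title: string;
  content: string;
  layout: unknown;
  type: string;
}

export async function duplicateNode(original: NodeRow): Promise<NodeRow> {
  const copy = {
    id: crypto.randomUUID(),            // fresh UUID, never reuse or mutate the original ID
    title: `${original.title} (copy)`,  // visible suffix so duplicates stand out on the canvas
    content: original.content,
    layout: original.layout,
    type: original.type,
  };

  const { data, error } = await supabase.from("nodes").insert(copy).select().single();
  if (error) throw error;
  return data as NodeRow;
}
```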
The flow spanned multiple layers: toolbar button, click handler, API call, database write, and UI refresh. Opus 4.5 kept those pieces consistent, passing the right node identifier from the React component into the backend and returning the newly created node so the frontend could place it “right next to” the original. No orphaned state, no ghost nodes, no manual cleanup.
This medium challenge exposed something benchmarks rarely measure: stateful reasoning across a stack. Opus 4.5 did not just write a SQL statement or a React component in isolation; it coordinated them, respected app-specific rules about versions and history, and shipped a feature that would survive real users hammering the duplicate button all day.
The Hard Test: Real-Time UI Streaming
Retrofitting Composalo’s existing “edit node” flow exposed where Claude Opus 4.5 shines and where it still stumbles. Moritz asked Opus to wire the app’s new real-time streaming UI into an already gnarly edit pipeline, not as greenfield code but as surgery on live tissue.
The catch: edits come in two flavors. A global edit rewrites the entire node—title, content, metadata—while a section edit only targets a specific slice, like a paragraph or block. The streaming layer had to understand that distinction and only re-render the relevant region, without nuking the rest of the React tree.
That sounds simple until you factor in existing state, optimistic updates, and server responses. The app already had non-streaming edit logic, so Opus had to thread streaming callbacks through:
- The node editor UI
- The backend mutation handlers
- The per-section rendering components
Opus 4.5 mapped that architecture surprisingly well. It introduced streaming handlers that piped partial responses into the right components, wired them to both global and section edit paths, and ensured the rest of the node stayed stable while tokens flowed in.
On a global edit, the updated content streamed into the full node view, progressively replacing the old version. On a section edit, only that subsection updated in real time, while surrounding content stayed frozen. No full-page flicker, no wholesale state reset, no obvious race conditions during the demo.
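A hedged sketch of how that routing might work: accumulated streamed text replaces either the whole node draft or a single section's draft, depending on the edit scope. The types and setter names below are assumptions, not the actual Composalo wiring.

```ts
// Route accumulated streamed text to the right part of the UI based on edit scope.
type EditScope =
  | { kind: "global" }
  | { kind: "section"; sectionId: string };

interface StreamTarget {
  setNodeDraft: (text: string) => void;                       // re-renders the full node view
  setSectionDraft: (sectionId: string, text: string) => void; // re-renders one section only
}

// Called with the accumulated text each time a new chunk arrives.
export function applyStreamUpdate(scope: EditScope, accumulated: string, target: StreamTarget): void {
  if (scope.kind === "global") {
    // Global edit: progressively replace the entire node content as tokens flow in.
    target.setNodeDraft(accumulated);
  } else {
    // Section edit: only the targeted slice updates; surrounding content stays frozen.
    target.setSectionDraft(scope.sectionId, accumulated);
  }
}
```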
The implementation still showed seams. Some edge cases—like rapidly switching sections mid-stream or canceling an edit halfway—didn’t have airtight guards. A few abstractions felt bolted on rather than integrated, with streaming logic bleeding into components that ideally should not know about transport details.
Moritz had to step in to clean up naming, factor out duplicated code, and tighten some TypeScript typings around the streaming payload. Opus got the core behavior working, but it did not deliver the kind of polished, library-grade API a senior engineer might extract.
For developers wiring similar patterns, tools like the Anthropic Python SDK on GitHub highlight how much ergonomic streaming support still needs to move from model prompts into stable client abstractions.
When 'Good Enough' Isn't Perfect
Good enough shipped, but it didn’t ship perfectly. When Moritz wired Claude Opus into his real-time editing UI, the new streaming behavior technically worked: text updates flowed in as the model generated them, the network calls fired correctly, and the feature matched the spec. Yet every time streaming kicked in, the editor flashed with a small but unmistakable UI flicker before the updates started.
That tiny glitch captured the 90/10 rule of modern AI-assisted development. Claude Opus handled the heavy lifting: understanding an existing “edit node” feature, threading through the new streaming pipeline, and touching the right React components and API handlers. But that last 10%—the part users actually feel—still depended on a human who understands render timing, state transitions, and how a React tree behaves under stress.
Under the hood, the model respected the app’s logic but missed the nuance of the render lifecycle. It updated state in the right places and wired streaming callbacks correctly, yet it didn’t anticipate that an intermediate empty state or double-render would cause the component to briefly unmount and remount. To a compiler, everything looked fine; to a user, the editor blinked at precisely the wrong moment.
That gap exposes what current frontier models actually optimize for. Claude Opus excels at structural reasoning: mapping data flow, inferring types, tracing API boundaries, and refactoring multi-file features without losing context. But frame-by-frame UX quality—avoiding layout thrash, preventing hydration mismatches, smoothing out loading skeletons—lives in a space of tacit knowledge that benchmarks do not measure.
Moritz didn’t need another plan; he needed taste and experience. He had to step in, recognize that the flicker came from how the component initialized streaming state, and adjust the rendering path so the view stayed stable while tokens arrived. The model got him from “nonexistent feature” to “working feature” in minutes, but “feels native to the app” still required manual debugging.
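One plausible shape for that fix, under the assumption that the flicker came from rendering an empty draft before the first chunk arrived: keep showing the committed content until streamed text actually exists. Component and prop names here are illustrative, not the real implementation.

```tsx
interface NodeContentViewProps {
  committed: string;     // last saved node content
  draft: string | null;  // streamed content so far; null when no stream is in flight
}

export function NodeContentView({ committed, draft }: NodeContentViewProps) {
  // Fall back to the committed content until the first streamed chunk arrives,
  // so the view never swaps through an empty intermediate state (the flicker).
  const visible = draft && draft.length > 0 ? draft : committed;
  return <div className="node-content">{visible}</div>;
}
```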
That tradeoff is the realistic picture of Claude Opus In Production. AI acts as a force multiplier for scaffolding features, exploring approaches, and handling boilerplate. Humans still own the last mile: the polish, the guardrails, and the invisible details that separate a demo from a product.
The Human-AI Handoff: A Practical Workflow
Human-AI pairing only works if it feels like a production habit, not a demo gimmick. Moritz’s Composalo build turns Claude Opus into a three-step loop that looks suspiciously like a real team: architect, implementer, reviewer. The result: shipping multiple features in a single sitting without turning the repo into spaghetti.
Step one is Architect & Plan. The human defines the “what” and “why” in plain language, then uses Claude Opus as a critical partner, not a code vending machine. Moritz drops into “plan mode,” tags the relevant component (“node toolbar”), states constraints (no scroll bar, minimal icon, interaction mode toggle), and forces the model to ask clarifying questions before touching a file.
That planning pass does two things. It surfaces UX decisions early (cursor icon, button placement, click-outside behavior) and it creates a lightweight spec that can live in a separate planning doc. For database work, Moritz adds a hard rule: ask for the schema first, then propose changes, which prevents the AI from hallucinating table names or fields.
Step two is AI Generates. Once the plan looks sane, Claude Opus handles the boring parts: boilerplate React components, wiring handlers, and threading state through existing hooks. In the “interact with content” feature, the model modified the toolbar, updated the iframe container, and wired activation/deactivation logic in one pass.
The same pattern scaled to the “duplicate node” feature, which touched both UI and database. Moritz let Opus sketch out the end-to-end flow: new toolbar action, server call, node cloning logic, and placement of the duplicate right next to the original. For long-horizon changes like the streaming editor, the model effectively acted as a mid-level engineer, connecting multiple files without losing the mental stack.
Step three is Human Refines & Polishes. Moritz drops back into the code as a senior reviewer: running the app, poking at edge cases, and making micro-edits faster by hand. That’s how he caught the real-time streaming bug—an initial UI flicker before streamed content rendered—and forced a second iteration that preserved state more cleanly.
Crucially, he does not outsource judgment. Small copy tweaks (“double click to interact”), visual polish, and production paranoia around data integrity stay human-owned. The AI moves first; the human decides what ships.
Run as a loop—plan, generate, review—this workflow maximizes speed without surrendering quality. Claude Opus accelerates the 80% grunt work, while the developer guards UX, architecture, and correctness where a single bad assumption can still burn a release.
Beyond the Demo: What This Means for You
Benchmarks and marketing copy tell one story; Moritz’s Real Coding Workflow shows what Anthropic’s new knobs actually mean when you ship code. The much‑touted effort parameter and computer-use upgrades like the zoom action stop being abstract once you watch Opus quietly thread changes through a real codebase without detonating the build.
For day‑to‑day development, effort maps cleanly to intent. You use low effort for the boring stuff: generating React boilerplate, wiring a button into an existing toolbar, sketching out a basic API handler, or drafting test stubs. You flip to high effort when you need the model to reason about tangled state, race conditions, or user flows, like the streaming edit UI that had to coordinate server responses, client state, and existing UX expectations.
That split suggests a practical pattern for most teams: - Low effort for scaffolding and repetitive glue code - Medium effort for feature work inside familiar patterns - High effort for cross‑cutting logic, data modeling, and tricky async flows
Moritz’s session also hints at why Anthropic keeps talking about reliability. Across multiple features, Opus generated edits that ran with minimal tool‑calling drama and no catastrophic build failures, aligning with external reports of 50–75% fewer tool and lint errors in production‑style tests. For a CI pipeline that runs dozens of times per day, shaving even 10–15% off failure noise can reclaim hours of engineering attention.
Viewed that way, Claude Opus 4.5 stops looking like “just a code generator” and starts to resemble a system‑aware collaborator. It remembers component boundaries, respects database contracts when guided, and navigates an existing architecture instead of bulldozing it. If you care about hard numbers, the Claude Opus 4.5 benchmark breakdown from Vellum AI backs up that qualitative feel with pass‑rate and token‑efficiency data.
For you, the takeaway is simple: wire Opus into your actual stack, treat effort as a budget dial, and reserve your own time for the parts of the system an LLM still can’t see—product trade‑offs, architectural bets, and what “good enough” really means for your users.
The New Job Description for Developers
AI does not erase the developer role; it rewrites it. Watching Claude Opus 4.5 grind through Moritz’s backlog makes that obvious: the model chews through boilerplate, wiring, and refactors, while the human keeps steering the product. The job stops being “person who types code all day” and becomes “person who decides what should exist and when the AI is good enough.”
What Claude Opus actually automates looks suspiciously like the parts senior engineers complain about anyway. It scaffolds UI components, threads new buttons into existing toolbars, and mirrors data structures across frontend and backend. In Moritz’s Real Coding Workflow, Opus handled the “interact with content” button and the database-backed duplicate node feature with almost no human typing beyond prompts.
Where the model falters, the new developer steps in as editor-in-chief. That streaming UI retrofit worked functionally but introduced a subtle flicker—no benchmark catches that, but a human with product taste does. The developer’s job morphs into spotting UX seams, enforcing performance budgets, and deciding when to rip out AI-generated code for a cleaner architectural move.
Future-proof engineers lean harder into architecture and product thinking. You decide event flows, error boundaries, and data ownership before you ever ask Opus to write a single line. You define constraints—latency caps, accessibility rules, test coverage—and then judge whether the AI’s implementation actually respects them.
Day to day, that looks like a repeatable loop:
- Frame the problem in plan mode with precise constraints
- Let Claude Opus propose a design and patch set
- Review diffs like a staff engineer, not a code monkey
- Manually refine the 10–20% that touches UX, security, or performance
Developers who master this human-AI handoff gain leverage similar to moving from junior to tech lead. You are still accountable for correctness, maintainability, and user experience; you just delegate the repetitive labor to a system that never gets bored. The job description does not shrink—it expands into something more strategic, more creative, and, for those who adapt, far more powerful.
Frequently Asked Questions
What is Claude Opus 4.5?
Claude Opus 4.5 is the latest frontier reasoning model from Anthropic, specifically optimized for complex software engineering tasks, agentic workflows, and enhanced coding performance.
How does Claude Opus 4.5 improve coding workflows?
It improves workflows by understanding complex requirements, asking clarifying questions, handling edge cases proactively, and generating both frontend and backend code, significantly reducing initial development time.
Is Claude Opus 4.5 better than other models for coding?
While 'better' is subjective, Opus 4.5 shows significant improvements in long-horizon coding tasks and a deeper understanding of context, often requiring fewer iterations to achieve a working result, as shown in real-world tests.
What was the hardest task shown in the test?
The most challenging task was implementing a real-time streaming preview for an 'edit node' feature. While the model successfully implemented the core logic, it introduced minor UI bugs (a flicker), requiring human refinement.