TL;DR / Key Takeaways
The Best AI Coder Just Got A Pay Rise
Anthropic’s latest flagship model, Opus 4.7, arrives as a formidable upgrade for developers and creators, promising significant leaps in AI capabilities. Released on April 16, 2026, this iteration genuinely enhances the Claude experience, pushing boundaries in critical areas like code generation and visual understanding. Its arrival immediately positions it as a top-tier contender in the rapidly evolving AI landscape.
Despite its impressive performance gains, Opus 4.7 carries a hidden cost that developers must navigate. While Anthropic maintains consistent pricing per token, an updated tokenizer and new default settings mean the same input prompts can now consume substantially more tokens in practice, translating to higher operational expenses. This subtle shift introduces a "secret tax" on what initially appears to be a free performance boost.
Opus 4.7 posts radically improved scores on coding benchmarks. It achieved a 10% leap over its predecessor, Opus 4.6, on SWE-bench Pro, reaching an impressive 64.3%. On SWE-bench Verified, the model scored 87.6%, reflecting a 7% gain. These numbers solidify its standing as a superior tool for complex code reasoning, systems engineering, and long-horizon autonomous tasks.
Beyond coding, Opus 4.7 dramatically elevates its multimodal support. The model now processes higher-resolution images, accepting inputs up to 2,576 pixels on the longest edge, approximately 3.75 megapixels—three times the resolution of previous models. This enhancement significantly improves tasks like data extraction from intricate documents and charts, alongside generating more "tasteful and creative" UI designs, as demonstrated in recent tests creating responsive cafe websites.
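For anyone feeding large scans or dashboards to the model, it can be worth downscaling images to that cap before upload so you are not transmitting pixels beyond what the model will use. A minimal Pillow sketch, assuming the 2,576-pixel longest-edge limit described above (the file names are placeholders):

```python
from PIL import Image

MAX_EDGE = 2576  # assumed longest-edge limit for Opus 4.7 image inputs

def fit_to_limit(src: str, dst: str) -> None:
    """Downscale an image so its longest edge is at most MAX_EDGE pixels."""
    img = Image.open(src)
    if max(img.size) > MAX_EDGE:
        img.thumbnail((MAX_EDGE, MAX_EDGE))  # preserves aspect ratio, never upscales
    img.save(dst)

fit_to_limit("quarterly_chart.png", "quarterly_chart_resized.png")
```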
The increased token consumption stems from two primary changes. An updated tokenizer maps the same input content to roughly 1.0 to 1.35 times as many tokens as before, depending on the data type. Furthermore, Opus 4.7 “thinks more” at higher effort levels, and Claude Code now defaults to the new `xhigh` effort level on all plans. This deeper reasoning improves reliability but invariably burns through more output tokens, directly impacting costs.
Crushing Code, Fumbling Facts?
Opus 4.7 truly excels in code generation and problem-solving, showcasing robust advancements over its predecessor. The model achieved a 10% leap on SWE-bench Pro, reaching an impressive 64.3% accuracy, significantly outperforming Opus 4.6. This substantial improvement positions Opus 4.7 as a formidable tool for developers tackling complex software engineering challenges.
Further solidifying its coding dominance, Opus 4.7 also posted a 7% gain on SWE-bench Verified, scoring 87.6%. These benchmark victories underscore Anthropic's commitment to enhancing the model's ability to handle long-horizon autonomy, systems engineering, and intricate code reasoning tasks.
These coding triumphs arrive alongside a dip in cybersecurity scores, but it is no accident. Anthropic's own benchmarks reveal a slight decline in this area, a deliberate outcome of new, stringent safeguards implemented within Opus 4.7. The company intentionally built these enhanced cyber protections to block requests indicating prohibited or high-risk cybersecurity uses.
This strategic choice means the model artificially keeps its cybersecurity performance lower than it might otherwise achieve. Anthropic aims to learn from these interactions, informing the development of even more powerful, yet safer, future models like the unreleased Mythos-class, highlighting a tension between raw capability and responsible AI design.
Despite its programming prowess, Opus 4.7 harbors a concerning regression in long-context understanding, a critical capability for many advanced AI applications. Internal "needle-in-a-haystack" evaluations reported a dramatic "nose dive" in long-context performance compared to Opus 4.6. This suggests Opus 4.7 struggles more to retrieve specific information buried deep within vast amounts of text.
This unexpected drop raises significant questions about the model's reliability when processing extensive documents, summarizing lengthy conversations, or maintaining coherence over prolonged, multi-session tasks. For users relying on Claude for deep contextual awareness, this potential degradation could severely impact real-world usage.
Say Goodbye to Your Old Prompts
Opus 4.7 introduces a radically different approach to instruction following, demanding a complete re-evaluation of established prompting strategies. Unlike previous Claude models that often interpreted directives loosely or even skipped less emphasized parts, Opus 4.7 is engineered for unparalleled literalism and precision. This fundamental shift means the model now adheres strictly to every instruction provided, executing commands with an exactness that fundamentally alters how users must interact with it.
Users deploying prompts designed for older, more forgiving models will almost certainly encounter unexpected or overly literal outputs. Where a previous iteration might have inferred intent or prioritized certain instructions over others, Opus 4.7 will execute all parts of a prompt with equal weight. This can lead to undesirable results if prompts haven't been meticulously crafted to account for its newfound rigor, potentially derailing complex workflows and demanding significant debugging.
This critical change necessitates a comprehensive audit and re-evaluation of existing prompt libraries. Developers and creators must now meticulously refine their prompts, stripping away any ambiguity and ensuring every instruction is explicit and intentional. Adapting to this precise paradigm is not merely an option but a requirement to fully leverage Opus 4.7's enhanced power, particularly for complex coding and agentic tasks where exact adherence to multi-step instructions is paramount.
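As an illustration of that discipline, here is a sketch of an API call with an explicitly enumerated system prompt. The model identifier is hypothetical and the prompt wording is ours, not Anthropic's guidance; the point is simply that nothing is left for the model to infer:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Spell out every requirement; a literal model will not infer intent you leave implicit.
system_prompt = (
    "You are a code reviewer. Follow every numbered instruction exactly.\n"
    "1. Review only the diff supplied by the user.\n"
    "2. Report at most five issues, ordered by severity.\n"
    "3. For each issue, cite the exact line and propose a concrete fix.\n"
    "4. Do not comment on style unless it causes a bug."
)

diff_text = "--- a/app.py\n+++ b/app.py\n..."  # placeholder diff

response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical identifier for illustration
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": f"Review this diff:\n{diff_text}"}],
)
print(response.content[0].text)
```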
Harnessing the model's improved instruction-following means embracing a more disciplined approach to prompt engineering. The payoff, however, is a model capable of delivering highly accurate and predictable results, provided the input matches its literalism. This investment in prompt refinement will unlock Opus 4.7's true potential, transforming it into a more reliable and powerful tool for intricate tasks. For those planning extensive prompt overhauls, understanding the latest tokenization and pricing structures is vital; refer to Anthropic's Model Pricing page for detailed information.
The 35% 'Tokenizer Tax' You're Now Paying
Opus 4.7 introduces a fundamental shift in how Anthropic's flagship model processes text, directly impacting operational costs for developers and power users. Anthropic updated the model's tokenizer, the internal mechanism breaking down input text into discrete units for the AI to understand. This technical adjustment, while improving internal processing, carries a significant financial implication for users.
Previously, a given input prompt mapped to a predictable number of tokens for API billing. With Opus 4.7, that exact same input can now map to 1.0 to 1.35 times as many tokens, depending on content type. Users effectively pay more for identical information, despite Anthropic maintaining its original per-token pricing. This increased token consumption functions as an insidious "tokenizer tax" on every API call, silently inflating operational expenses.
Consider a practical example for an API developer using Opus 4.7 for a complex coding task. An input prompt that previously consumed 1,000 tokens on Opus 4.6, at Anthropic's input price of $15.00 per 1 million tokens, would have cost $0.015. This was a straightforward calculation.
With the new tokenizer, that same 1,000-token input could now translate to as many as 1,350 tokens for Opus 4.7. This directly translates to a new input cost of $0.02025 for the identical prompt, a stark 35% increase in expenditure solely due to the tokenizer change. This "tax" applies even before accounting for Opus 4.7's tendency to "think more" at higher effort levels, which further inflates overall token consumption.
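The arithmetic is simple enough to script into a budget check. A quick sketch using the $15-per-million-token input price quoted above and the 1.35x worst-case multiplier; the multiplier you actually see will depend on your content mix:

```python
INPUT_PRICE_PER_MTOK = 15.00  # USD per 1M input tokens, unchanged between 4.6 and 4.7

def input_cost(tokens: int) -> float:
    """Return the input cost in USD for a given token count."""
    return tokens * INPUT_PRICE_PER_MTOK / 1_000_000

old_tokens = 1_000                    # what the prompt tokenized to on Opus 4.6
new_tokens = int(old_tokens * 1.35)   # worst-case count under the new tokenizer

old_cost, new_cost = input_cost(old_tokens), input_cost(new_tokens)
print(f"Opus 4.6: ${old_cost:.5f}   Opus 4.7: ${new_cost:.5f}   "
      f"increase: {100 * (new_cost / old_cost - 1):.0f}%")
# Opus 4.6: $0.01500   Opus 4.7: $0.02025   increase: 35%
```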
Developers must now meticulously monitor token counts and adjust prompting strategies to mitigate these escalating costs. The seemingly minor technical update to the tokenizer demands a complete re-evaluation of budget forecasts and prompt optimization, turning a powerful upgrade into a more expensive proposition. Predictable token usage has ended, ushering in an era of careful cost management.
Your AI Is Working Overtime (By Default)
Opus 4.7 introduces a new `xhigh` effort level, positioned between `high` and `max` reasoning settings. This addition offers developers finer control over the model's processing, balancing deeper computational thought against response latency. At these higher effort levels, Opus 4.7 "thinks more," particularly during later turns in agentic settings, which significantly improves its reliability on complex, hard problems.
This enhanced reasoning comes with a crucial, often hidden, cost: Anthropic has made `xhigh` the default effort level in Claude Code for all plans. Without user intervention, Opus 4.7 is now working overtime by default, consuming substantially more tokens than users might anticipate for their prompts. This change compounds the "tokenizer tax" discussed previously, as the model generates more reasoning output.
To put this into perspective, the new `xhigh` effort level in Opus 4.7 uses roughly the same number of tokens as Opus 4.6's `max` effort level. This means users accustomed to Opus 4.6's token burn at its highest setting now get a similar burn rate as the default in Opus 4.7, even for routine tasks, with a correspondingly dramatic impact on operational costs.
Savvy developers, however, can navigate this new cost landscape strategically. Experts strongly advise testing the various effort levels to find an optimal balance. A key recommendation: change the default setting in Claude Code to `high` instead of `xhigh`.
This seemingly minor adjustment yields significant benefits. Opus 4.7's `high` effort level actually *outscores* Opus 4.6's `max` effort level, all while utilizing fewer tokens. By making this single configuration change, users can achieve superior performance compared to the previous generation's peak, but with a notable reduction in token consumption and corresponding costs. This presents a clear path to optimizing both output quality and expenditure.
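To see what that single setting change means for a bill, here is a back-of-the-envelope sketch. The per-task output token counts are illustrative assumptions, not measured or published figures, and the $75-per-million-token output price is the standard Opus-tier rate:

```python
OUTPUT_PRICE_PER_MTOK = 75.00  # USD per 1M output tokens (Opus-tier rate, assumed unchanged)

# Hypothetical per-task output token counts, chosen only to mirror the article's claims:
# xhigh burns roughly what 4.6's max did, while high burns noticeably less.
tokens_per_task = {
    "Opus 4.6 @ max":   30_000,
    "Opus 4.7 @ xhigh": 30_000,
    "Opus 4.7 @ high":  18_000,
}

for config, tokens in tokens_per_task.items():
    cost = tokens * OUTPUT_PRICE_PER_MTOK / 1_000_000
    print(f"{config:<17} {tokens:>6} output tokens  ~${cost:.2f} per task")
```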
The Ultimate UI Design Showdown
Beyond raw coding prowess, Anthropic also touted Opus 4.7's improved UI design capabilities. A straightforward 'cafe website' test, requiring only an `index.html` file, pitted Opus 4.7 against its predecessor, Opus 4.6, alongside competitors Gemini 3.1 and GPT-5.4. This real-world scenario aimed to assess the models' creative flair and ability to translate a simple concept into a visually appealing web page.
Opus 4.7 delivered a "pretty nice" result, generating a responsive cafe website with a tasteful font and well-integrated Unsplash images. The design evoked a genuine cafe feel, showcasing a clear step up from Opus 4.6's output. Its predecessor produced a less polished version, featuring a less appealing gradient background and a generally less refined aesthetic, making Opus 4.7's improvement tangible.
The real surprise, however, came from Gemini 3.1, which emerged as the preferred design for this specific creative task. Its output boasted a striking fixed background, well-executed image sections, and a nicely laid-out menu that impressed testers. Gemini 3.1 demonstrated a strong visual flair, proving that raw coding ability does not always equate to superior aesthetic judgment in UI design.
Conversely, GPT-5.4 landed in a distant last place. Its generated website suffered from a generic, immediately recognizable "GPT look and feel," characterized by an overuse of blurred card elements. This design failed to capture the desired cafe ambiance, highlighting the model's struggle with creative, stylistic interpretation compared to its peers.
This UI design showdown underscores that while benchmarks quantify technical performance, subjective creative tasks often reveal distinct model personalities and strengths. Understanding these nuances is crucial for developers choosing the right AI for diverse projects. For further details on the economic implications of these models and their evolving token usage, readers can explore Finout's analysis, “Claude Opus 4.7 Pricing: The Real Cost Story Behind the ‘Unchanged’ Price Tag.”
Building a Fullstack App in One Shot
Moving beyond simple single-page websites, the ultimate test for modern AI coding prowess involves building a fullstack application from scratch. We challenged the leading models to construct a comprehensive personal finance dashboard, granting them full autonomy to select their preferred tech stack and implement core functionalities. This complex task probes not just UI design, but backend logic, data management, and architectural decision-making.
Opus 4.7 delivered a genuinely impressive initial result, showcasing a level of integration and design coherence unmatched by its rivals. The generated application featured a clean, intuitive user interface with a thoughtfully chosen color scheme. Its aesthetic appeal immediately stood out, reflecting the model's touted improvements in "tasteful and creative" UI generation.
Functionally, the frontend components were robust and well-implemented. Users could interact with various elements, inputting financial data and navigating through different sections of the dashboard. The code demonstrated a strong grasp of modern web development principles, producing a responsive and engaging user experience that felt production-ready at first glance.
However, Opus 4.7's sophisticated output harbored a critical design flaw deep within its chosen architecture. Despite the impressive frontend, the model opted for an in-memory database solution to handle all user data. This fundamental choice severely undermined the application's real-world utility, introducing a fatal flaw for any finance tracking tool.
An in-memory database means all information, from user accounts to transaction histories, resides solely in the application's active memory. Consequently, any restart of the server or application process instantly wipes every piece of stored data. This complete lack of data persistence renders the finance dashboard utterly impractical for its intended purpose.
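The distinction is easy to see in miniature. A hedged sketch (the table and column names are ours, not from the generated app): rows appended to an in-memory structure vanish when the process restarts, while rows written to a SQLite file on disk survive it.

```python
import sqlite3

# In-memory store: every record disappears the moment the process exits or restarts.
transactions_in_memory = []
transactions_in_memory.append({"payee": "Corner Cafe", "amount": -4.50})

# Persistent store: written to a file on disk, still there after a restart.
conn = sqlite3.connect("finance.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS transactions ("
    "id INTEGER PRIMARY KEY, payee TEXT, amount REAL)"
)
conn.execute(
    "INSERT INTO transactions (payee, amount) VALUES (?, ?)", ("Corner Cafe", -4.50)
)
conn.commit()
conn.close()
```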
While Opus 4.7 demonstrated exceptional skill in generating complex, well-structured code and attractive UIs, its architectural decision revealed a significant blind spot. The model failed to prioritize the most crucial aspect of a personal finance application: the secure and permanent storage of sensitive financial information. This oversight highlights a continued challenge for even the most advanced AI coders: understanding implicit user requirements beyond explicit instructions.
How The Competition Stacks Up
After Opus 4.7 successfully architected and built a personal finance dashboard, choosing its own tech stack and delivering a runnable application in a single pass, the comparative analysis reveals stark differences across leading models. Its ability to generate a coherent, functional full-stack solution from a high-level prompt sets a formidable standard for the competition in practical software development.
Anthropic's previous iteration, Opus 4.6, presented a more mixed result. While its user interface proved less impressive and aesthetically refined than 4.7's output, the model demonstrated a stronger grasp of backend persistence. It correctly implemented a persistent SQLite database and delivered more working features crucial for a functional application. This older model prioritized core application logic, highlighting a nuanced trade-off between visual polish and robust foundational functionality.
OpenAI's GPT-5.4 struggled significantly with the full-stack task, producing an attempt that was functionally unusable. It failed to deliver a cohesive or runnable application, generating fragmented code that required substantial manual intervention. Furthermore, the model opted for a technically basic approach, relying on plain JavaScript and HTML instead of demonstrating proficiency with modern frameworks. This output positions GPT-5.4 far behind in complex, multi-component application generation.
Google's Gemini 3.1 delivered the least effective performance in this demanding test. It fundamentally failed to produce a running application from the initial prompt, requiring multiple follow-up interactions and extensive user guidance to achieve even partial functionality. This inability to generate a self-contained, executable project without significant external intervention underscores its current limitations in autonomous, full-stack development scenarios, ranking it as the least capable in this benchmark.
These results from the simple cafe website UI test and the more complex personal finance dashboard full-stack challenge paint a clear picture of the current AI coding landscape. Opus 4.7 excels at producing polished, runnable applications with modern design sensibilities, even if its in-memory database choice is a reminder that polish does not guarantee sound architecture. Its rivals, meanwhile, fall short on aesthetic quality, functional completeness, or the critical ability to deliver a working product without extensive user guidance and iterative prompting. On balance, this keeps Opus 4.7 in the lead for complex, multi-faceted code generation.
Meet Mythos: The AI We Can't Have Yet
While Opus 4.7 reigns as the most capable publicly available model, Anthropic's own benchmarks reveal a hidden, more powerful AI: Mythos. This advanced model, showcased in internal evaluations, demonstrably surpasses even the latest Claude iteration, yet remains inaccessible to developers and creators. Its existence underscores the rapid, often unseen, progress occurring within AI research labs.
Anthropic currently withholds Mythos from public release due to critical safety concerns. The model's immense capabilities, particularly its potential for misuse, necessitate robust guardrails and extensive testing before it can be deployed broadly. This cautious approach highlights the industry's ongoing struggle to balance innovation with responsible AI development.
Opus 4.7 plays a pivotal, strategic role in this delicate balancing act. Anthropic specifically engineered it as a vital testbed for new cyber safeguards, actively blocking requests that indicate prohibited or high-risk cybersecurity uses. This deliberate design choice explains Opus 4.7's unique performance anomaly: a slight, controlled dip in its cybersecurity benchmark scores compared to Opus 4.6, a reduction serving as an artificial constraint to mitigate potential risks.
Real-world data gathered from Opus 4.7's deployment under these strict protocols is invaluable. It allows Anthropic to rigorously assess the effectiveness of its safety mechanisms and understand the complex interactions between powerful AI and potential threats. This iterative learning process is fundamental for refining future models.
Ultimately, Opus 4.7 represents a crucial, foundational step towards the eventual, safe introduction of Mythos-class models. Its public release provides a controlled environment to validate advanced safety features, paving the way for more powerful, yet secure, AI systems. When Mythos or its successors finally arrive, they promise to radically reshape software development, offering unprecedented capabilities only after rigorous safety validation.
The Verdict: A Flawed Masterpiece?
Opus 4.7 presents a nuanced picture, delivering groundbreaking coding and UI design capabilities. Its 10% leap on SWE-bench Pro and 7% gain on Verified benchmarks over Opus 4.6 showcase its raw power, demonstrating impressive full-stack app generation in our tests. However, this enhanced performance arrives with a significant cost increase, specifically a potential 35% 'tokenizer tax' on existing prompts. Furthermore, questions persist about its long-context reliability, with some needle-in-a-haystack benchmarks suggesting a performance dip compared to Opus 4.6.
Anthropic's decision to default to the new `xhigh` effort level in Claude Code further exacerbates token consumption. While this setting promises deeper reasoning and improved reliability on hard problems, it translates directly into higher operational costs for developers. Users must actively manage these settings, exploring the `high` effort level to strike a better balance between performance and economic efficiency. This vigilance becomes crucial for any sustained development work, especially in agentic settings.
For complex coding tasks, intricate systems engineering, and sophisticated UI design, Opus 4.7 stands as a phenomenal tool, arguably the best publicly available model. Its more literal instruction following demands prompt refinement, but rewards precision with highly accurate outputs. Developers can leverage its enhanced multimodal support and self-verification for remarkably robust outcomes, even across multi-session workflows.
Ultimately, Opus 4.7 is a flawed masterpiece: unparalleled in specific domains, but with hidden costs and potential long-context weaknesses that users must navigate. It demands a more strategic approach to usage, making conscious choices about effort levels and prompt optimization. What are your thoughts on Anthropic's latest update? Which AI model currently serves as your go-to for development, and what do you think of Opus 4.7's trade-offs?
Frequently Asked Questions
What is the main upgrade in Claude Opus 4.7?
Opus 4.7 offers major improvements in coding, agentic reasoning, and higher-resolution vision, showing a 10% leap on the SWE-bench Pro benchmark over its predecessor.
Why does Opus 4.7 cost more to use for the same prompt?
It uses an updated tokenizer that can map the same text to up to 35% more tokens. Combined with the new `xhigh` effort level being the default in Claude Code, this effectively increases the cost per task despite unchanged per-token pricing.
How does Opus 4.7's long-context performance compare to 4.6?
Some user tests and benchmarks, like the needle-in-a-haystack test, show a significant regression in long-context retrieval, suggesting a potential trade-off was made to boost other capabilities.
Is Claude Opus 4.7 better than GPT-5.4 for coding?
Based on a full-stack application test, Opus 4.7 produced a significantly more complete and well-designed application with a clean UI, while GPT-5.4 generated a basic and unusable project.