
Anthropic's AI Gamble: Does Opus 4.7 Suck?

Anthropic just dropped Claude Opus 4.7, promising god-tier AI capabilities. But top experts like Matthew Berman are uncovering serious flaws that could make it a massive step backward.


TL;DR / Key Takeaways

Anthropic released Claude Opus 4.7 on April 16, 2026, touting agentic coding, advanced vision, and enterprise-grade reasoning behind a 1M context window. Early testing tells a messier story: long-context retrieval on the MRCR benchmark fell from 78.3% to 32.2%, a new tokenizer raises effective costs by up to 35%, the user-controlled 'Extended Thinking' toggle is gone, and Claude Code flags benign snippets as risky. Matthew Berman's live teardown and community reports point to a flagship release with serious trade-offs.

The AI World Holds Its Breath

Matthew Berman, a prominent AI expert and founder of Forward Future, didn't mince words. His YouTube video, "Seeing if Opus 4.7 sucks [LIVE]," immediately set a provocative tone, challenging the default narrative of progress in artificial intelligence. This direct, no-holds-barred approach captured the attention of a community already brimming with anticipation for Anthropic's latest flagship model, Claude Opus 4.7. Berman’s title alone signaled a critical deep dive, moving beyond marketing hype to scrutinize real-world performance, echoing the sentiment of his resources like "The Subtle Art of Not Being Replaced" and "Humanity's Last Prompt Engineering Guide."

Anthropic positioned Claude Opus 4.7 as its most capable Opus model to date, a hybrid reasoning powerhouse featuring an impressive 1M context window. Released on April 16, 2026, this model arrived with significant expectations. The industry looked for a definitive leap in AI capabilities, particularly in areas like:

- Agentic coding
- Advanced vision processing
- Complex multi-step reasoning

The AI community, spanning individual developers to large enterprise users, eagerly awaited Opus 4.7. Its broad availability across major platforms promised widespread integration:

- Claude Pro, Max, Team, and Enterprise users
- Developers via the Claude Platform API
- Integrations on Amazon Bedrock, Google Cloud's Vertex AI, and Microsoft Foundry
- A rollout on GitHub Copilot

Developers hoped for a robust tool to tackle more ambitious projects, while enterprises sought efficiency gains and innovative solutions, justifying the model's base pricing of $5 per million input tokens and $25 per million output tokens. However, an updated tokenizer could increase the real cost by up to 35% for the same input, adding another layer of scrutiny.

Beneath the surface of official claims and initial excitement, a critical question simmered: Did Opus 4.7 deliver on its promise, or did Anthropic stumble? Despite touted improvements, whispers and expert analyses, including Berman's, suggested potential regressions. Reports indicated a significant decrease in long-context retrieval performance, with the MRCR benchmark score reportedly dropping from 78.3% in Opus 4.6 to 32.2%. The community braced for an answer: was this an innovative leap forward for Anthropic, or a significant misstep that could redefine expectations for frontier AI models?

What Anthropic Promised: A New Frontier


Anthropic officially unveiled Claude Opus 4.7 on April 16, 2026, positioning it as their most capable and ambitious model to date. The company presented this new iteration as a significant leap forward, built upon three core pillars: enhanced agentic coding, advanced vision capabilities, and robust enterprise-grade reasoning. This release aimed to redefine the boundaries of what autonomous AI could achieve, setting a high bar for its performance expectations.

Anthropic’s claims for Opus 4.7 were particularly bold, focusing on its ability to tackle sophisticated, multi-step challenges. They asserted the model could autonomously build complex software from high-level instructions, a significant step towards more independent AI agents. Furthermore, its advanced vision allowed for the analysis of high-resolution documents and intricate visual data, facilitating deeper understanding and extraction of insights from diverse formats. The model’s 1M context window underpinned these capabilities, enabling it to process and reason over vast amounts of information.

Broad availability marked another strategic move for Anthropic. Opus 4.7 became generally accessible to a wide range of users, including Claude Pro, Max, Team, and Enterprise subscribers. For developers and large organizations, Anthropic ensured seamless integration through multiple platforms:

- The Claude Platform API
- Amazon Bedrock
- Google Cloud's Vertex AI
- Microsoft Foundry

This widespread deployment strategy underscored Anthropic's intent to embed Opus 4.7 deeply within the existing AI ecosystem, making it a ubiquitous tool for development and deployment. Its rollout on GitHub Copilot further solidified its presence in the developer workflow.

Anthropic's marketing language for Opus 4.7 was unequivocally assertive, positioning the model directly against leading competitors in the LLM space. The company highlighted Opus 4.7's superior performance in complex, multi-modal tasks and its "enterprise-grade" moniker, signaling its suitability for critical business applications requiring high reliability and accuracy. This strategic messaging aimed to capture the high-value enterprise market, emphasizing the model's capacity for intricate problem-solving and robust deployment.

The pricing structure for Opus 4.7 reflected its premium positioning. Anthropic set the base cost at $5 per million input tokens and $25 per million output tokens. However, a crucial detail often overlooked was the impact of an updated tokenizer, which could increase the effective cost by up to 35% for processing the same input volume. This cost consideration became a critical factor for organizations planning large-scale deployments, adding another layer to the model's overall value proposition.

The Elephant in the Room: Context Failure

Anthropic's Opus 4.7 faces its most alarming regression in long-context retrieval, a foundational capability for any advanced AI. Benchmarks reveal a catastrophic drop in its MRCR score, plummeting from 78.3% in the previous Opus 4.6 to a dismal 32.2%. This isn't a minor performance dip; it represents a severe degradation in the model's ability to process and accurately recall information from extensive, multi-page inputs.

MRCR serves as a critical metric, quantifying how effectively an AI model can locate a specific "needle" of information within a vast "haystack" of text. A higher MRCR indicates the model identifies the correct answer quickly, often among its top initial suggestions, signifying robust contextual understanding. The precipitous fall to 32.2% means Opus 4.7 now frequently fails to identify crucial details or buries them so deep within its output that they become practically inaccessible. This severely compromises the utility of its expansive 1M context window, making it unreliable for complex document analysis.
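
To make that scoring intuition concrete, here is a minimal sketch of reciprocal-rank scoring over needle-in-a-haystack probes. It illustrates the general metric family described above, not Anthropic's actual benchmark harness; the probe data and helper names are our own.

```python
def reciprocal_rank(ranked_answers: list[str], needle: str) -> float:
    """1/position of the first answer containing the needle, or 0.0 if it never surfaces."""
    for position, answer in enumerate(ranked_answers, start=1):
        if needle in answer:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results: list[tuple[list[str], str]]) -> float:
    """Average reciprocal rank over many needle-in-a-haystack probes."""
    return sum(reciprocal_rank(ranked, needle) for ranked, needle in results) / len(results)


# Toy probes: each tuple is (model's ranked candidate answers, the planted fact).
probes = [
    (["the contract renews on June 1", "payment is net-30"], "renews on June 1"),  # rank 1 -> 1.0
    (["payment is net-30", "the contract renews on June 1"], "renews on June 1"),  # rank 2 -> 0.5
    (["no relevant clause found"], "renews on June 1"),                            # miss   -> 0.0
]
print(f"MRR: {mean_reciprocal_rank(probes):.2f}")  # MRR: 0.50
```

A model that buries or misses the planted fact drags the score toward zero, which is why a fall from the high seventies to the low thirties matters so much for long-document work.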

This profound failure in needle-in-a-haystack scenarios undermines many of the enterprise-grade applications Anthropic promoted. Consider the practical implications for professionals relying on accurate, timely information from large datasets:

- Researchers attempting to synthesize findings from extensive scientific literature, legal precedents, or historical archives cannot trust the model to pinpoint critical facts or counter-arguments.
- Developers navigating sprawling codebases, debugging complex systems, or interpreting vast API documentation may find the model missing a crucial function definition or an obscure error message.
- Financial and market analysts needing to extract precise data points, trends, or regulatory clauses from reports spanning hundreds of pages risk overlooking a single figure, which could lead to significant errors.

For these users, Opus 4.7's inability to reliably recall specific facts renders it significantly less useful, even counterproductive. The model effectively "forgets" or overlooks critical information embedded within the very context it is supposed to understand, turning its large context window into a liability rather than an asset.

Anthropic touted Opus 4.7 as a superior model, boasting advancements in agentic coding, advanced vision, and sophisticated enterprise-grade reasoning. Therefore, the drastic degradation of such a fundamental capability raises immediate and serious questions about its development and testing. How could a supposedly more capable model exhibit such a severe, counterintuitive step backward in a core function, especially one so vital to its advertised strengths? This glaring oversight directly contradicts the narrative of progress and casts a long shadow over the model’s overall reliability. For further details on the model's announced features, refer to Anthropic’s official release: Introducing Claude Opus 4.7 - Anthropic.

The Cost You Didn't See Coming

Anthropic's Opus 4.7 arrived with an unadvertised financial impact, immediately evident to developers monitoring their API usage. A new, more verbose tokenizer significantly inflates token counts for identical input text, effectively raising real-world costs by up to 35%. While the published rates remain $5 per million input tokens and $25 per million output tokens, this behind-the-scenes change means developers pay considerably more for the same computational effort, creating a hidden surcharge on every interaction.
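
As a rough illustration of what that surcharge means in practice, here is a back-of-the-envelope sketch using the figures cited in this article ($5 per million input tokens, $25 per million output tokens, up to 35% more tokens for the same input text). The document sizes in the example are hypothetical.

```python
# Worst-case effective cost estimate under the article's cited figures.
INPUT_RATE = 5.00 / 1_000_000    # USD per input token
OUTPUT_RATE = 25.00 / 1_000_000  # USD per output token
TOKENIZER_INFLATION = 1.35       # up to 35% more tokens for the same input text


def effective_cost(old_input_tokens: int, output_tokens: int) -> float:
    """Cost of a request whose input used to measure old_input_tokens under the prior tokenizer.

    Only the input side is inflated here, since the cited figure is "for the same input".
    """
    return old_input_tokens * TOKENIZER_INFLATION * INPUT_RATE + output_tokens * OUTPUT_RATE


# Example: a 200k-token document summarized into 2k tokens of output.
baseline = 200_000 * INPUT_RATE + 2_000 * OUTPUT_RATE  # nominal rate-card cost: $1.05
inflated = effective_cost(200_000, 2_000)              # worst-case effective cost: $1.40
print(f"nominal: ${baseline:.2f}, worst-case effective: ${inflated:.2f}")
```

The rate card never changes, yet the bill for identical workloads does, which is exactly why teams monitoring API usage noticed the difference before any announcement did.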

Further exacerbating this financial opacity, Anthropic inexplicably removed transparency around thinking tokens. Previous Opus iterations provided crucial insight into the internal processing steps, allowing developers to anticipate and manage API consumption with greater precision. This sudden lack of visibility now forces engineers to operate in the dark, hindering their ability to accurately forecast expenses and optimize complex prompt engineering strategies.

This new cost paradigm fundamentally shifts Opus 4.7's competitive standing against both its predecessor, Opus 4.6, and rival models. Opus 4.6 offered a more predictable cost model, crucial for budget-conscious enterprise deployments. Now, Anthropic's flagship model presents a less transparent, potentially far more expensive proposition compared to offerings from OpenAI or Google, where developers often find clearer pricing structures for comparable capabilities.

The critical question remains: do Opus 4.7's touted performance gains truly justify this increased, less predictable expenditure? Anthropic highlights advancements in agentic coding, advanced vision, and enterprise-grade reasoning as key selling points. Yet, these improvements must now be weighed against a higher effective price point and the model's alarming regression in long-context retrieval, as evidenced by the MRCR benchmark. For many developers, the value proposition has become considerably murkier, demanding a careful re-evaluation of their AI investment strategy.

'Adaptive Thinking': A Feature or a Flaw?


Anthropic controversially removed the Extended Thinking toggle, a crucial feature that previously granted users granular control over Claude Opus's reasoning depth. This user-controlled mechanism allowed professionals to explicitly guide the model through intricate problem-solving, ensuring thoroughness for high-stakes applications. Its disappearance marks a significant shift in how users interact with the model's cognitive processes.
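
For context, explicit control over reasoning depth has also been exposed on the API side in earlier Claude releases through the documented thinking parameter, which is roughly what the in-app toggle surfaced. The sketch below is a minimal illustration of that explicit control using the Anthropic Python SDK; the model ID is a hypothetical stand-in for an Opus 4.6-era release, and this is not a confirmed description of how Opus 4.7 handles the parameter.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical predecessor model ID; the point is the explicit, user-set reasoning
# budget rather than an opaque, automatic decision made by the model.
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},  # caller decides how deeply to reason
    messages=[{"role": "user", "content": "Audit this contract for renewal clauses and list each one."}],
)
print(response.content[-1].text)  # final answer block (thinking blocks precede it when enabled)
```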

Replacing this explicit control is Adaptive Thinking, an autonomous feature that operates without user input or transparency. Anthropic offers little clarity on how this new system functions, when it activates, or what parameters it considers. Users now face a black box, unable to influence or even understand the model's internal deliberative phases.

For complex, multi-step tasks—like agentic coding or enterprise-grade reasoning—the ability to direct the model’s thought process proves indispensable. Losing this direct user control feels like a substantial downgrade, undermining the predictability and reliability essential for critical workflows. This change forces users to cede agency to an opaque, automated system.

User feedback immediately highlighted widespread frustration over the loss of a valuable tool. Many professionals relied on the 'Extended Thinking' toggle to prevent superficial responses and ensure comprehensive analysis. The transition to an uncontrollable 'Adaptive Thinking' system has left many feeling disempowered, questioning Anthropic's commitment to user agency in advanced AI interactions.

Matthew Berman's Live Teardown

Matthew Berman’s live stream, provocatively titled "Seeing if Opus 4.7 sucks," offered a stark, real-world evaluation of Anthropic’s latest flagship model. As an influential voice for prompt engineers and AI builders, Berman’s teardown quickly surfaced critical discrepancies between Anthropic’s promises and Opus 4.7’s actual performance. His rigorous testing provided tangible evidence of the model’s regressions.

Berman’s live demonstrations repeatedly exposed Opus 4.7's struggles with long-context retrieval, echoing the alarming drop in the MRCR benchmark. He presented specific prompts where the model either hallucinated or entirely failed to recall information from earlier in the conversation, a task its predecessor, Opus 4.6, handled with far greater reliability. This directly undermined claims of "enterprise-grade reasoning" for complex, multi-step operations.

His expert opinion highlighted Opus 4.7's diminished practical utility for his audience. Berman, whose resources include "The Subtle Art of Not Being Replaced" and "Humanity's Last Prompt Engineering Guide," emphasized that unpredictable context handling renders the model unreliable for professional AI development. He pointed out that while Anthropic touted improvements in agentic coding and advanced vision, these features become largely irrelevant if the model cannot maintain coherent understanding over extended interactions.

Berman's findings resonate deeply with the broader user sentiment swirling online. Numerous reports from the developer community corroborate his observations of inconsistent performance and a noticeable degradation in core capabilities. This widespread dissatisfaction intensifies given the hidden cost increases; the new tokenizer effectively inflates real expenses by up to 35% for identical inputs, compounding the frustration over reduced efficacy.

The removal of the user-controlled 'Extended Thinking' toggle further exacerbated Berman’s concerns, suggesting a lack of transparency and user agency. His live teardown served as a crucial public audit, solidifying the narrative that Opus 4.7, despite its official claims, represents a significant step backward for many critical applications. For more details on Anthropic's official announcements and how to access the model, readers can consult resources like Anthropic releases Claude Opus 4.7: How to try it, benchmarks, safety | Mashable.

When Good Code Goes Bad

Reports quickly emerged detailing Opus 4.7’s overly cautious Claude Code, frequently flagging benign snippets as harmful. This aggressive security posture immediately raised concerns among developers relying on Anthropic’s promise of advanced agentic coding. The model’s hyper-vigilance proved more hindrance than help.

Developers shared numerous instances of simple, innocuous code triggering alerts. Basic Python functions for file handling, common utility scripts, or even standard library imports sometimes received "malware" or "security risk" warnings, despite being perfectly safe. This created a frustrating and inefficient user experience.
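
For illustration, here is a hypothetical example of the kind of innocuous file-handling utility developers described being flagged; the snippet is ours, not a reproduced user report.

```python
# A routine settings loader of the sort reportedly flagged as a "security risk":
# it only reads a local JSON config file and returns the parsed settings.
import json
from pathlib import Path


def load_settings(path: str = "config.json") -> dict:
    """Read a JSON settings file from disk, returning {} if it does not exist."""
    config_path = Path(path)
    if not config_path.exists():
        return {}
    with config_path.open("r", encoding="utf-8") as f:
        return json.load(f)
```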

This constant stream of false positives severely erodes developer trust in Opus 4.7 as a reliable coding assistant. Each incorrect flag demands manual review and override, disrupting efficient workflows and negating the very productivity gains AI coding tools are supposed to deliver. Engineers cannot confidently delegate tasks to an overly suspicious AI.

For enterprise users, where code integrity and security are paramount, this unreliability presents a significant barrier. Integrating a model that frequently misidentifies harmless code introduces unacceptable friction and potential delays in critical development cycles. The cost of false alarms quickly outweighs any perceived benefit.

Industry experts speculate Opus 4.7's over-sensitivity stems from aggressive updates to its safety alignment protocols. Anthropic may have tightened guardrails significantly to prevent any potential misuse or generation of harmful code, inadvertently creating a system prone to excessive caution. This trade-off prioritized safety above practical utility.

Balancing robust safety with practical utility remains a critical challenge for all large language model developers. Opus 4.7’s coding issues highlight the fine line between preventing truly harmful outputs and stifling legitimate development with an overly restrictive, cautious approach. The current implementation leans too heavily into the latter.

Matthew Berman’s live teardown likely observed these significant coding struggles, adding another layer to his provocative "sucks" assessment. The model’s inability to accurately discern safe from unsafe code diminishes its value, particularly for its vaunted agentic coding features, which require trust and precision.

Benchmarking Against Ghosts


Anthropic’s persistent tactic of benchmarking Opus 4.7 against its unreleased 'Mythos' model increasingly frustrates the AI community. This hypothetical competitor, perpetually on the horizon, serves more as a marketing phantom than a tangible yardstick, leaving users and developers alike questioning the relevance of such comparisons. The strategy feels less like a demonstration of current prowess and more like a deliberate distraction from Opus 4.7’s immediate, observed performance challenges.

Instead of demonstrating Opus 4.7’s capabilities against actual, formidable rivals like GPT-5.4 or Gemini 1.5 Pro, Anthropic continually points to a future, unverified ideal. This practice sidesteps crucial, real-world evaluations, making it exceedingly difficult for enterprises and developers to accurately assess Opus 4.7’s true competitive standing. Objective comparisons against available market leaders become virtually impossible without official, transparent data.

This marketing approach actively erodes trust. Benchmarking against a ghost model suggests either an unwillingness to face current competition head-on or, perhaps, an implicit admission that Opus 4.7 struggles in direct, objective comparisons. Such tactics force potential adopters to speculate about the model’s true value, rather than relying on verifiable, head-to-head performance metrics crucial for high-stakes AI integrations.

The industry demands more transparency and accountability. Anthropic must pivot to openly benchmark Opus 4.7 against existing market leaders, providing concrete, comparable data that genuinely informs purchasing and development decisions. Moving beyond the 'Mythos' mirage is critical for rebuilding community confidence and fostering an environment of honest, competitive innovation, where models are judged by what they deliver today, not what they promise tomorrow.

The Verdict: Is Opus 4.7 a Step Back?

Anthropic’s Opus 4.7 presents a stark dichotomy: advertised advancements against documented regressions. While Anthropic touted significant strides in agentic coding, advanced vision, and enterprise-grade reasoning, the model also introduced critical setbacks that challenge its overall utility. This isn't a simple upgrade; it's a complex re-prioritization of capabilities.

Does Opus 4.7 "suck"? Not entirely, but it certainly disappoints in crucial areas. The catastrophic drop in long-context retrieval performance, evidenced by the MRCR benchmark's decline from 78.3% in Opus 4.6 to 32.2%, represents a severe regression for many users. Furthermore, the new tokenizer's impact, increasing effective costs by up to 35%, adds an unexpected financial burden.

The removal of the user-controlled 'Extended Thinking' toggle and reports of Claude Code being overly cautious with false positives further complicate the picture. Matthew Berman’s live teardown and community feedback consistently highlight these issues, painting a portrait of an upgrade with significant trade-offs.

Recommendations for users are nuanced:

- Upgrade: Developers or enterprises prioritizing the new agentic coding and advanced vision capabilities, where Opus 4.7 shows demonstrable gains, should consider it.
- Wait: Users heavily reliant on long-context retrieval or those sensitive to the increased effective costs should hold off.
- Avoid: If your workflow depends on the 'Extended Thinking' feature or if your applications are critically impacted by Claude Code's reported caution, Opus 4.7 might be a downgrade.

Progress in AI is rarely linear. Opus 4.7 underscores this reality, demonstrating that new features can arrive alongside significant, and sometimes inexplicable, regressions. While Opus 4.7 is generally available across various platforms, including Vertex AI (see Claude Opus 4.7 on Vertex AI | Google Cloud Blog), users must carefully evaluate its specific strengths against its considerable weaknesses before deployment. Anthropic's continued benchmarking against their unreleased 'Mythos' model only adds to the community's fatigue and uncertainty regarding the true state of their current offerings.

Anthropic at a Crossroads

Anthropic navigates a hyper-competitive AI landscape, where rivals like OpenAI and Google push aggressive release cycles. This intense environment amplifies every misstep, placing immense pressure on the company to innovate while simultaneously upholding reliability and user trust—a delicate balance Opus 4.7 demonstrably failed to strike, risking its standing in the fiercely contested market.

Promised gains in agentic coding and advanced vision arrived alongside alarming regressions impacting core functionality. The catastrophic drop in the MRCR benchmark for long-context retrieval directly contradicted the narrative of advancement. Furthermore, a new tokenizer effectively increased operational costs by up to 35% for the same input, creating a hidden financial burden for enterprise users and developers.

Removing the user-controlled "Extended Thinking" toggle, replacing it with the opaque "Adaptive Thinking" feature, further eroded user confidence. This change limited granular control and contributed to reports of Claude Code on Opus 4.7 being overly cautious and prone to false positives. Community fatigue with Anthropic's consistent benchmarking against their unreleased "Mythos" model also highlights a growing demand for transparency over aspirational comparisons.

To win back trust, Anthropic must prioritize stability and transparency. Addressing the core regressions, particularly the context failure and the hidden cost increases, is paramount. Reinstating user control over model behavior and providing clear, actionable roadmaps, rather than vague benchmarks, would signal a renewed commitment to its user base. Future releases must demonstrate tangible improvements in real-world scenarios.

This episode serves as a stark lesson for the entire AI industry. Marketing hype and internal benchmarks mean little when demonstrable, consistent real-world performance falters. Transparency in development, honest communication about limitations, and a relentless focus on reliability must precede grand claims about future capabilities. Matthew Berman’s provocative title, "Seeing if Opus 4.7 sucks," unfortunately proved prescient, underscoring the community's urgent demand for unvarnished truth.

Frequently Asked Questions

What are the main new features of Claude Opus 4.7?

Anthropic claims Opus 4.7 has enhanced performance in agentic coding, substantially better vision capabilities for analyzing complex documents, and improved reasoning for professional tasks like financial analysis.

What are the biggest criticisms of Opus 4.7?

Major criticisms include a severe drop in long-context retrieval performance, a new tokenizer that increases costs by up to 35%, the removal of user controls like 'Extended Thinking', and an overly sensitive code interpreter.

Should I upgrade to Claude Opus 4.7?

It depends on your use case. If you need state-of-the-art vision or agentic coding, it may be worth testing. However, if you rely on long-context retrieval or predictable costs, you might want to stick with a previous version or competitor for now.

Who is Matthew Berman?

Matthew Berman is an AI expert and creator behind the 'Forward Future' brand. He is known for providing critical, no-hype reviews and practical guides on new AI tools and models.


Topics Covered

#Claude Opus 4.7 · #Anthropic · #Matthew Berman · #AI Models · #LLM