This model is kind of a disaster.
Theo - t3.gg · Tech


TL;DR

Claude Opus 4.7 disappoints in practice due to a broken Claude Code harness, over-aggressive safety filters, and outputs that are inconsistent compared to GPT models.

Key Points

1. Opus 4.7 is Anthropic's latest public model but not their best. The Claude Mythos preview is more capable; Opus 4.7 is positioned as a test bed for new cyber safeguards ahead of a broader Mythos release, priced at $5/M input and $25/M output tokens (see the cost sketch after this list).
2. Over-aggressive safety filters lobotomize normal workflows. Claude Code's system prompt caused Opus 4.7 to flag Theo's own website as malware on first use, and a Defcon cryptography puzzle (no hacking involved) triggered a hard chat lock that redirected him to the weaker Sonnet 4 model.
3. Improved instruction-following paradoxically causes new failures. Opus 4.7 follows prompts so literally that it skips proactive research: it repeatedly recommended Next.js 15 instead of the actual latest release, Next.js 16, even when told to use the 'latest versions'.
4. GPT-o3/5.4 outperformed Opus 4.7 on the same modernization task. Given the identical prompt, GPT-5.4 independently searched for current package versions, correctly identified Next.js 16, and completed the migration without the version errors Opus 4.7 repeatedly made.
5. The core problem is Claude Code's harness, not the model itself. Theo argues Anthropic's engineers are shipping poorly maintained tooling: a broken bypass-permissions mode, vibe-coded system prompts, and a harness Anthropic's own employees don't use internally.
6. Anthropic uses an entirely different internal stack than public users. Unlike OpenAI and Google, whose employees use the same tools as customers, Anthropic staff have proprietary versions of Claude Code with features unavailable externally, producing internal hype that doesn't translate to the public experience.
7. New Opus 4.7 features include vision up to 2,576px (4MP), memory improvements, and an 'extra-high' effort level. It also adds an Ultra Review slash command and an auto-permissions mode in Claude Code, though Theo found auto mode broken and bypass-permissions inexplicably stopped working mid-session.
8. Benchmark gains are narrow and inconsistent. Opus 4.7 scores best on SWE-Bench Pro and Verified agentic coding, but actually performs worse than Opus 4.6 on Agentic Search bench and slightly lower on cybersecurity vulnerability reproduction, while OpenAI dominates Humanity's Last Exam with tools (58.7% vs. Opus 4.7's lower score).
9. Theo concludes the model has a higher ceiling but worse consistency than Opus 4.5/4.6. He recommends upgrading from 4.6 since there's no extra cost, but warns the experience is unpredictable: impressive on one task, broken on the next five. He attributes the ongoing regressions to a deteriorating engineering culture at Anthropic.
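
As a rough illustration of the list pricing in point 1, here is a minimal sketch of per-request cost at the quoted $5/$25-per-million-token rates. The token counts in the example are hypothetical, chosen only to show the arithmetic; they are not from the video.

```python
# Rough cost estimate at the Opus 4.7 list prices quoted above:
# $5 per 1M input tokens, $25 per 1M output tokens.
INPUT_PRICE_PER_M = 5.00    # USD per million input tokens (from the summary)
OUTPUT_PRICE_PER_M = 25.00  # USD per million output tokens (from the summary)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted list prices."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical coding-agent turn: large context in, a long diff out.
print(f"${request_cost(50_000, 8_000):.2f}")  # $0.25 + $0.20 -> $0.45
```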
