Did Claude really get dumber again?
Theo - t3.gg · Tech

TL;DR

Claude's performance regressions are real and stem from multiple compounding causes: bad harness code, tokenizer changes, thinking redaction, and 1M context window defaults.

Key Points

1. Claude Code's harness engineering is a primary regression driver. Benchmarks show Opus performs 15% worse in Claude Code versus Cursor, and on TerminalBench, Claude Code scores only 58% while competing harnesses like Forge Code and Cappy reach 75–82%.
2. Anthropic's 1M context window rollout is a confirmed quality downgrade. Their own September postmortem revealed that routing requests to the 1M-token model version degrades quality, and as of March, 1M context is the default for all Claude Code users with no easy in-app toggle.
3. Tokenizer changes in Opus 4.7 inflate context by up to 47%. Anthropic confirmed a 1.0–1.35x token increase; independent testing found 1.47x on tech docs, meaning users burn context faster and the model processes proportionally more noise (see the arithmetic sketch after this list).
4. Thinking redaction starting March 5th correlates precisely with the quality regression. AMD's AI director analyzed 17,000 thinking blocks across 235,000 tool calls and found 100% of thinking was redacted by March 12th, with thinking depth dropping 73%, stripping models of reasoning context between API calls (a log-analysis sketch covering these metrics follows the list).
5. Behavioral metrics from AMD's team show dramatic degradation. The read-to-edit ratio collapsed from 6.6:1 to 2:1, permission-seeking phrases appeared 173 times in 17 days (previously zero), and user frustration indicators jumped 68%, all correlating with the thinking redaction rollout.
6. API token consumption exploded 80x with no productivity gain. AMD's team maintained the same number of user prompts (~5,700) but saw 80x more API requests and 170x more input tokens, producing demonstrably worse results (worked through below).
7. Harness pollution from bad system-prompt code wastes tokens and misleads models. A file-read enforcement bug caused unnecessary extra API calls per edit (a toy illustration follows the list); prompt-injection false positives added garbage context; and excessive MCPs and skills added by users further degrade outputs.
8. Refusal behavior and 'getting lost' are distinct regression types. Refusals include API hard blocks (e.g., rejecting a Gold Bug cryptographic puzzle as hacking) and model soft refusals (Claude Code declining to help with a Dropbox menu bar issue as 'outside software engineering'); getting lost is when the model misinterprets prior context and makes wrong edits.
9. OpenAI/Codex models have not shown comparable regression patterns. Community consensus and benchmark data confirm no sustained regressions from OpenAI, with Tibo confirming they don't adjust thinking budgets post-release, contrasting sharply with Anthropic's repeated infrastructure missteps.
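
To make the tokenizer numbers in point 3 concrete, here is a quick back-of-the-envelope calculation. The 1.0–1.35x and 1.47x multipliers are from the video; the 200K-token nominal window is an illustrative assumption, not a figure quoted above.

```python
# Effective context capacity under the token-inflation figures in point 3.
# Multipliers are from the video; the 200K nominal window is an assumption.

NOMINAL_WINDOW = 200_000  # tokens (illustrative, not quoted in the video)

for label, mult in [("Anthropic lower bound", 1.00),
                    ("Anthropic upper bound", 1.35),
                    ("Independent test, tech docs", 1.47)]:
    # If the same text now costs `mult` times as many tokens, the amount
    # of content that fits in a fixed window shrinks by the inverse.
    fits = NOMINAL_WINDOW / mult
    loss = (1 - 1 / mult) * 100
    print(f"{label}: {mult:.2f}x -> ~{fits:,.0f} tokens of content ({loss:.0f}% less)")
```

At the measured 1.47x, a fixed window holds roughly a third less actual content, which compounds with the 1M-context quality issue in point 2.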
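Points 4 and 5 rest on log analysis of tool calls and thinking blocks. Below is a minimal sketch of that kind of analysis, assuming a hypothetical JSONL export with `type`, `tool`, and `redacted` fields; AMD's actual pipeline and log schema were not shown.

```python
# Sketch of the log analysis behind points 4 and 5. The JSONL schema
# ("type", "tool", "redacted") is a hypothetical stand-in.
import json
from collections import Counter

def analyze(log_path: str) -> None:
    tools: Counter[str] = Counter()
    thinking_total = 0
    thinking_redacted = 0

    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event["type"] == "tool_call":
                tools[event["tool"]] += 1
            elif event["type"] == "thinking":
                thinking_total += 1
                if event.get("redacted", False):
                    thinking_redacted += 1

    # Point 5's headline metric: reads per edit (cited collapse: 6.6:1 -> 2:1).
    reads, edits = tools["read_file"], tools["edit_file"]
    if edits:
        print(f"read:edit ratio = {reads / edits:.1f}:1")

    # Point 4's headline metric: share of thinking blocks redacted.
    if thinking_total:
        pct = 100 * thinking_redacted / thinking_total
        print(f"redacted thinking: {pct:.0f}% of {thinking_total} blocks")

# analyze("session_logs.jsonl")  # point at your own exported log
```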
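The consumption figures in point 6 are easier to grasp when worked through. The ~5,700-prompt baseline and the 80x/170x multipliers are from the video; the per-prompt baselines below are illustrative assumptions used only to give the absolute numbers a reference point.

```python
# Scale of the consumption blow-up in point 6. The 5,700 prompts and the
# 80x/170x multipliers are from the video; the per-prompt baselines are
# illustrative assumptions.

user_prompts = 5_700                      # roughly constant across both periods
baseline_requests = user_prompts * 1      # assume ~1 API request per prompt
baseline_input = user_prompts * 10_000    # assume ~10K input tokens per prompt

after_requests = baseline_requests * 80
after_input = baseline_input * 170

print(f"API requests: {baseline_requests:,} -> {after_requests:,}")
print(f"Input tokens: {baseline_input:,} -> {after_input:,}")

# 170x tokens over 80x requests implies each request also carried about
# 170 / 80 ≈ 2.1x more input on average, i.e. heavier context per call.
```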
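Point 7's file-read enforcement bug is easiest to see as a toy harness. Everything here is hypothetical, class and method names included; Claude Code's real enforcement logic was not shown. The point is the difference between forcing a read before every edit and tracking reads per session.

```python
# Toy model of the file-read enforcement bug from point 7. All names are
# hypothetical; this is not Claude Code's actual implementation.

class ToyHarness:
    def __init__(self) -> None:
        self.files_read: set[str] = set()
        self.extra_api_calls = 0

    def force_read(self, path: str) -> None:
        # Each forced read costs a full extra model round-trip.
        self.extra_api_calls += 1
        self.files_read.add(path)

    def edit_buggy(self, path: str) -> None:
        # Buggy rule: require a fresh read before *every* edit,
        # even if the model just read (or wrote) the file.
        self.force_read(path)

    def edit_fixed(self, path: str) -> None:
        # Fixed rule: only force a read the first time a file
        # is touched in the session.
        if path not in self.files_read:
            self.force_read(path)

buggy, fixed = ToyHarness(), ToyHarness()
for _ in range(5):               # five consecutive edits to one file
    buggy.edit_buggy("app.py")
    fixed.edit_fixed("app.py")

print(f"buggy harness: {buggy.extra_api_calls} extra API calls")  # 5
print(f"fixed harness: {fixed.extra_api_calls} extra API call")   # 1
```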
