Did Claude really get dumber again?
Theo - t3.gg · Tech
TL;DR
Claude's performance regressions are real and stem from multiple compounding causes: bad harness code, tokenizer changes, thinking redaction, and 1M context window defaults.
Key Points
1. Claude Code's harness engineering is a primary regression driver. Benchmarks show Opus performs 15% worse in Claude Code versus Cursor, and on TerminalBench, Claude Code scores only 58% while competing harnesses like Forge Code and Cappy reach 75–82%.
2. Anthropic's 1M context window rollout is a confirmed quality downgrade. Their own September postmortem revealed that routing requests to the 1M-token model version degrades quality — and as of March, 1M context is the default for all Claude Code users with no easy in-app toggle.
3. Tokenizer changes in Opus 4.7 inflate context by up to 47%. Anthropic confirmed a 1.0–1.35x token increase; independent testing found 1.47x on tech docs, meaning users burn context faster and the model processes proportionally more noise.
4. Thinking redaction starting March 5th correlates precisely with the quality regression. AMD's AI director analyzed 17,000 thinking blocks across 235,000 tool calls and found 100% of thinking was redacted by March 12th, with thinking depth dropping 73%, stripping models of reasoning context between API calls.
5. Behavioral metrics from AMD's team show dramatic degradation. The read-to-edit ratio collapsed from 6.6:1 to 2:1, permission-seeking phrases appeared 173 times in 17 days (previously zero), and user frustration indicators jumped 68% — all correlating with the thinking redaction rollout.
6. API token consumption exploded 80x with no productivity gain. AMD's team maintained the same number of user prompts (~5,700) but saw 80x more API requests and 170x more input tokens, producing demonstrably worse results.
7. Harness pollution from bad system prompt code wastes tokens and misleads models. A file-read enforcement bug caused unnecessary extra API calls per edit; prompt injection false-positives added garbage context; and excessive MCPs/skills added by users further degrade outputs.
8. Refusal behavior and 'getting lost' are distinct regression types. Refusals include API hard blocks (e.g., rejecting a Gold Bug cryptographic puzzle as hacking) and model soft refusals (Claude Code refusing to help with a Dropbox menu bar issue as 'outside software engineering'); getting lost is when the model misinterprets prior context and makes wrong edits.
9. OpenAI/Codex models have not shown comparable regression patterns. Community consensus and benchmark data confirm no sustained regressions from OpenAI, with Tibo confirming they don't adjust thinking budgets post-release, contrasting sharply with Anthropic's repeated infrastructure missteps.
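The tokenizer-inflation claim in point 3 is easy to sanity-check with back-of-the-envelope arithmetic: if the same text now costs k times as many tokens, a fixed context window holds 1/k as much of it. A minimal sketch, using the 1M-token window and the 1.0–1.35x (Anthropic) and 1.47x (independent) factors reported above; the function name is illustrative, not from any API:

```python
def effective_capacity(window_tokens: int, inflation: float) -> int:
    """Tokens of old-tokenizer-equivalent text that still fit after inflation."""
    return int(window_tokens / inflation)

window = 1_000_000  # the 1M-token context window discussed above
for k in (1.0, 1.35, 1.47):  # stated range plus the independent 1.47x finding
    print(f"inflation {k:.2f}x -> ~{effective_capacity(window, k):,} effective tokens")
```

At the measured 1.47x, roughly a third of the window's effective capacity disappears before any prompt content is considered, which is why the summary says users "burn context faster."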