Did Claude really get dumber again?
Theo - t3.gg · Tech

TL;DR

Claude's performance regressions are real and stem from multiple compounding causes: bad harness code, tokenizer changes, thinking redaction, and 1M context window defaults.

Key Points

1. Claude Code's harness engineering is a primary regression driver. Benchmarks show Opus performs 15% worse in Claude Code versus Cursor, and on TerminalBench, Claude Code scores only 58% while competing harnesses like Forge Code and Cappy reach 75–82%.
2. Anthropic's 1M context window rollout is a confirmed quality downgrade. Their own September postmortem revealed that routing requests to the 1M-token model version degrades quality, and as of March, 1M context is the default for all Claude Code users with no easy in-app toggle.
3. Tokenizer changes in Opus 4.7 inflate context by up to 47%. Anthropic confirmed a 1.0–1.35x token increase; independent testing found 1.47x on tech docs, meaning users burn context faster and the model processes proportionally more noise (see the arithmetic sketch after this list).
4. Thinking redaction starting March 5th correlates precisely with the quality regression. AMD's AI director analyzed 17,000 thinking blocks across 235,000 tool calls and found 100% of thinking was redacted by March 12th, with thinking depth dropping 73%, stripping models of reasoning context between API calls (a log-analysis sketch covering these metrics follows the list).
5. Behavioral metrics from AMD's team show dramatic degradation. The read-to-edit ratio collapsed from 6.6:1 to 2:1, permission-seeking phrases appeared 173 times in 17 days (previously zero), and user frustration indicators jumped 68%, all correlating with the thinking redaction rollout.
6. API token consumption exploded 80x with no productivity gain. AMD's team maintained the same number of user prompts (~5,700) but saw 80x more API requests and 170x more input tokens, producing demonstrably worse results (worked through below).
7. Harness pollution from bad system-prompt code wastes tokens and misleads models. A file-read enforcement bug caused unnecessary extra API calls per edit (a toy illustration follows the list); prompt-injection false positives added garbage context; and excessive MCPs and skills added by users further degrade outputs.
8. Refusal behavior and 'getting lost' are distinct regression types. Refusals include API hard blocks (e.g., rejecting a Gold Bug cryptographic puzzle as hacking) and model soft refusals (Claude Code declining to help with a Dropbox menu bar issue as 'outside software engineering'); getting lost is when the model misinterprets prior context and makes wrong edits.
9. OpenAI/Codex models have not shown comparable regression patterns. Community consensus and benchmark data confirm no sustained regressions from OpenAI, with Tibo confirming they don't adjust thinking budgets post-release, contrasting sharply with Anthropic's repeated infrastructure missteps.
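
To make the tokenizer numbers in point 3 concrete, here is a quick back-of-the-envelope calculation. The 1.0–1.35x and 1.47x multipliers are from the video; the 200K-token nominal window is an illustrative assumption, not a figure quoted above.

```python
# Effective context capacity under the token-inflation figures in point 3.
# Multipliers are from the video; the 200K nominal window is an assumption.

NOMINAL_WINDOW = 200_000  # tokens (illustrative, not quoted in the video)

for label, mult in [("Anthropic lower bound", 1.00),
                    ("Anthropic upper bound", 1.35),
                    ("Independent test, tech docs", 1.47)]:
    # If the same text now costs `mult` times as many tokens, the amount
    # of content that fits in a fixed window shrinks by the inverse.
    fits = NOMINAL_WINDOW / mult
    loss = (1 - 1 / mult) * 100
    print(f"{label}: {mult:.2f}x -> ~{fits:,.0f} tokens of content ({loss:.0f}% less)")
```

At the measured 1.47x, a fixed window holds roughly a third less actual content, which compounds with the 1M-context quality issue in point 2.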
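Points 4 and 5 rest on log analysis of tool calls and thinking blocks. Below is a minimal sketch of that kind of analysis, assuming a hypothetical JSONL export with `type`, `tool`, and `redacted` fields; AMD's actual pipeline and log schema were not shown.

```python
# Sketch of the log analysis behind points 4 and 5. The JSONL schema
# ("type", "tool", "redacted") is a hypothetical stand-in.
import json
from collections import Counter

def analyze(log_path: str) -> None:
    tools: Counter[str] = Counter()
    thinking_total = 0
    thinking_redacted = 0

    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event["type"] == "tool_call":
                tools[event["tool"]] += 1
            elif event["type"] == "thinking":
                thinking_total += 1
                if event.get("redacted", False):
                    thinking_redacted += 1

    # Point 5's headline metric: reads per edit (cited collapse: 6.6:1 -> 2:1).
    reads, edits = tools["read_file"], tools["edit_file"]
    if edits:
        print(f"read:edit ratio = {reads / edits:.1f}:1")

    # Point 4's headline metric: share of thinking blocks redacted.
    if thinking_total:
        pct = 100 * thinking_redacted / thinking_total
        print(f"redacted thinking: {pct:.0f}% of {thinking_total} blocks")

# analyze("session_logs.jsonl")  # point at your own exported log
```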
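The consumption figures in point 6 are easier to grasp when worked through. The ~5,700-prompt baseline and the 80x/170x multipliers are from the video; the per-prompt baselines below are illustrative assumptions used only to give the absolute numbers a reference point.

```python
# Scale of the consumption blow-up in point 6. The 5,700 prompts and the
# 80x/170x multipliers are from the video; the per-prompt baselines are
# illustrative assumptions.

user_prompts = 5_700                      # roughly constant across both periods
baseline_requests = user_prompts * 1      # assume ~1 API request per prompt
baseline_input = user_prompts * 10_000    # assume ~10K input tokens per prompt

after_requests = baseline_requests * 80
after_input = baseline_input * 170

print(f"API requests: {baseline_requests:,} -> {after_requests:,}")
print(f"Input tokens: {baseline_input:,} -> {after_input:,}")

# 170x tokens over 80x requests implies each request also carried about
# 170 / 80 ≈ 2.1x more input on average, i.e. heavier context per call.
```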
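Point 7's file-read enforcement bug is easiest to see as a toy harness. Everything here is hypothetical, class and method names included; Claude Code's real enforcement logic was not shown. The point is the difference between forcing a read before every edit and tracking reads per session.

```python
# Toy model of the file-read enforcement bug from point 7. All names are
# hypothetical; this is not Claude Code's actual implementation.

class ToyHarness:
    def __init__(self) -> None:
        self.files_read: set[str] = set()
        self.extra_api_calls = 0

    def force_read(self, path: str) -> None:
        # Each forced read costs a full extra model round-trip.
        self.extra_api_calls += 1
        self.files_read.add(path)

    def edit_buggy(self, path: str) -> None:
        # Buggy rule: require a fresh read before *every* edit,
        # even if the model just read (or wrote) the file.
        self.force_read(path)

    def edit_fixed(self, path: str) -> None:
        # Fixed rule: only force a read the first time a file
        # is touched in the session.
        if path not in self.files_read:
            self.force_read(path)

buggy, fixed = ToyHarness(), ToyHarness()
for _ in range(5):               # five consecutive edits to one file
    buggy.edit_buggy("app.py")
    fixed.edit_fixed("app.py")

print(f"buggy harness: {buggy.extra_api_calls} extra API calls")  # 5
print(f"fixed harness: {fixed.extra_api_calls} extra API call")   # 1
```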
