Two AI Models Set to "stir government urgency", But Will This Challenge Undo Them?
16:26
AI Explained · Tech

TL;DR

OpenAI's Spud and Anthropic's Claude are poised for major leaps, but ARC-AGI 3 reveals frontier models score under 0.5% against humans' 100%.

Key Points

  • 1. OpenAI is shelving side projects to prioritize its Spud model. The Sora video app and an erotic chatbot were scrapped to free compute; Sam Altman says Spud will be ready in weeks and will 'accelerate the economy.'
  • 2. Anthropic warned US government officials that Claude's next release will supercharge offensive and defensive cyber capabilities. This warning may 'stir government urgency' and revive the Pentagon deal that broke down, with a key negotiating official saying they are still 'very close to an agreement.'
  • 3. ARC-AGI 3 is a new interactive benchmark where humans score 100% but the best current AI models score under 0.5%. Unlike static grid tests, it has no stated objectives and requires exploration, planning, memory, and goal inference across progressive levels.
  • 4. ARC-AGI 1 and 2 were gamed by frontier models through high-density task-space sampling, not direct memorization. Chain-of-thought traces from models like Gemini 3 contained giveaways that their training data resembled ARC-AGI tasks; ARC-AGI 3 uses out-of-distribution private test sets to prevent this.
  • 5. OpenAI's North Star for the next few years is a fully automated AI researcher. They aim for an 'intern-level' AI tackling specific research problems by September 2026, though historical data shows that even AI-first workflows produce only ~40% speedups, not immediate exponential gains.
  • 6. A vibe-coded hack co-opted a key open-source Python library to export secrets to the dark web, highlighting risks of agentic AI swarms. Nvidia's Jim Fan argues AI agents need many layers of nested 'shells' for oversight, especially as models can now hack benchmarks in real time.
