The RL Irony in LLMs (and its insane new meta)
TL;DR
RL improves LLMs dramatically but can't achieve true AGI — and pairing it with LoRA is the new efficient meta that makes personalized AI scalable and cheap.
Key Points
1. The irony: RL gives LLMs powerful capabilities (reasoning, self-correction, agentic tasks) but fundamentally can't achieve AGI, because it narrows the model's generalization rather than expanding it.
2. Why RL is noisy: RL only gets a single reward signal per episode (≈1 bit of information), whereas next-token prediction corrects every single token. The signal is sparse, but it enables exploration and self-correction that pre-training can't reliably produce (see the first sketch below).
3. The key discovery: RL inherently makes tiny, targeted weight updates, affecting as little as 5% of parameters, which maps perfectly onto how LoRA works.
4. LoRA matches full fine-tuning for RL: even at rank 1 (the minimum capacity), LoRA performs identically to full fine-tuning on RL tasks while using only about 2/3 of the compute (see the second sketch below).
5. The conditions that matter: apply LoRA to ALL layers (not just attention), use a roughly 10x higher learning rate than full fine-tuning, and avoid oversized batch sizes (see the third sketch below).
6. The new meta: LoRA + RL enables fast, cheap experimentation and modular personalization at scale, with swappable adapters per user or task, making "personalized AGI" the commercial reality even if true AGI isn't (see the last sketch below).
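To make the signal-density contrast in point 2 concrete, here is a minimal PyTorch sketch (the sequence length, vocabulary size, and reward value are illustrative assumptions, not values from the video). The supervised loss gets a separate correction at every position, while the REINFORCE-style loss scales every sampled token by the same single scalar reward.

```python
import torch
import torch.nn.functional as F

vocab, seq_len = 50_000, 128
logits = torch.randn(seq_len, vocab, requires_grad=True)  # model outputs for one sequence
targets = torch.randint(0, vocab, (seq_len,))              # ground-truth next tokens

# Next-token prediction: each of the 128 positions contributes its own error signal.
sft_loss = F.cross_entropy(logits, targets)

# REINFORCE-style RL: the whole episode shares one scalar reward, so every sampled
# token's log-probability is scaled by the same roughly-1-bit outcome signal.
dist = torch.distributions.Categorical(logits=logits)
sampled = dist.sample()
reward = 1.0  # e.g. "the final answer was correct"
rl_loss = -(reward * dist.log_prob(sampled)).sum()
```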
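Points 3 and 4 hinge on the LoRA update itself: the base weight matrix stays frozen and only a low-rank delta B @ A is trained. A minimal sketch, with illustrative dimensions and the standard zero-initialized delta, shows why even rank 1 leaves so few trainable parameters:

```python
import torch
import torch.nn as nn

d_out, d_in, rank = 4096, 4096, 1  # rank-1: the minimum-capacity case

W = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen base weight
A = nn.Parameter(torch.randn(rank, d_in) * 0.01)                 # trainable down-projection
B = nn.Parameter(torch.zeros(d_out, rank))                       # trainable up-projection (delta starts at 0)

def lora_linear(x: torch.Tensor) -> torch.Tensor:
    """Base projection plus the low-rank correction B @ A applied to x."""
    return x @ W.T + (x @ A.T) @ B.T

full_ft_trainable = W.numel()            # ~16.8M weights touched by full fine-tuning
lora_trainable = A.numel() + B.numel()   # 8,192 weights touched by rank-1 LoRA
print(f"trainable fraction: {lora_trainable / full_ft_trainable:.5%}")
```

One common reading of the 2/3 compute figure in point 4: full fine-tuning pays roughly three forward-equivalent passes per token (the forward pass plus backward terms that include gradients for every base weight), while LoRA skips the base-weight gradients and pays roughly two.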
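The conditions in point 5 map onto a short configuration. This is a sketch using Hugging Face PEFT, assuming a recent version that accepts target_modules="all-linear"; the checkpoint name, learning rates, and batch size are placeholders rather than values from the video.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder checkpoint

config = LoraConfig(
    r=1,                          # even rank 1 reportedly matches full fine-tuning for RL
    lora_alpha=32,
    target_modules="all-linear",  # condition 1: every layer (MLPs too), not just attention
    lora_dropout=0.0,
)
model = get_peft_model(base, config)

full_ft_lr = 1e-6  # whatever learning rate you would use for full fine-tuning
optimizer = torch.optim.AdamW(model.parameters(), lr=10 * full_ft_lr)  # condition 2: ~10x LR

rollout_batch_size = 64  # condition 3: keep batches modest; oversized batches
                         # reportedly hurt LoRA's RL performance
```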
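Finally, point 6's "swappable adapters" idea can be sketched as one frozen base model with a small LoRA adapter loaded per user. The adapter paths, user names, and the load_adapter/set_adapter calls below are assumptions based on PEFT's adapter API, not something prescribed by the video.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder checkpoint
model = PeftModel.from_pretrained(base, "adapters/default", adapter_name="default")

# Register a few per-user adapters; each is only megabytes, so many can coexist.
for user in ["alice", "bob"]:
    model.load_adapter(f"adapters/{user}", adapter_name=user)

def personalized_generate(user: str, prompt_ids):
    """Route a request through that user's adapter without reloading the base model."""
    model.set_adapter(user)
    return model.generate(prompt_ids)
```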