
GitHub Copilot CLI Rubber Duck Explained

GitHub Copilot CLI just got a sassy sidekick: Rubber Duck, a model from another AI family that critiques your main agent's plans. But after 20 years watching Valley hype cycles, I'm asking if this fixes real coding pains or just pads the bill.

Terminal screenshot of GitHub Copilot CLI with Rubber Duck critique on a data pipeline bug

Key Takeaways

  • Rubber Duck pairs rival AI models for targeted critiques, closing 74.7% of the Sonnet-to-Opus gap on SWE-Bench Pro in GitHub's eval.
  • Best for multi-file beasts; overkill for simple fixes, with token costs adding up.
  • GitHub's ecosystem lock-in: clever tech, but primes the pump for more paid API usage.

I stared at my terminal this afternoon, coffee gone cold, as GitHub Copilot CLI’s Rubber Duck tore apart a half-baked data pipeline plan—catching a Solr facet glitch that’d slipped right past Claude Sonnet.

GitHub Copilot CLI’s Rubber Duck lands in experimental mode today, promising a ‘second opinion’ from a rival AI family to spot what your primary agent misses. It’s not some fluffy toy; this thing’s built to review plans, implementations, and edge cases before you commit to disaster. Select Claude as your orchestrator (Opus, Sonnet, or Haiku), and bam—GPT-5.4 steps in as the contrarian duck, flagging assumptions, inefficiencies, the works.

But here’s the thing. We’ve heard this tune before. Back in the ’90s, we coders rubber-ducked our bugs to plush toys on our desks, talking through code to our feathered friends until the logic clicked. GitHub’s riffing on that, sure, but swapping the duck for Claude’s frenemy, GPT. Cute nod. Except now it’s paywalled behind slash commands and model access.

“Claude Sonnet 4.6 paired with Rubber Duck running GPT-5.4 achieved a resolution rate approaching Claude Opus 4.6 running alone, closing 74.7% of the performance gap between Sonnet and Opus.”

That’s straight from GitHub’s eval on SWE-Bench Pro, a benchmark of gnarly, real-world GitHub issues spanning multiple files and marathon step counts. Impressive numbers: Sonnet plus Duck beats the baseline by 3.8% on tough multi-file beasts, and by even more on the hairiest ones. The duck caught stuff like dict keys iterating wrong and Solr facets vanishing silently. Solid.
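For anyone squinting at what "closing 74.7% of the gap" means, here is a minimal sketch of the arithmetic. The function name and the three resolution rates are made up for illustration; GitHub's quote gives only the ratio, not the underlying scores.

```python
def gap_closed(baseline: float, paired: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling gap recovered by the paired setup."""
    gap = ceiling - baseline
    if gap <= 0:
        raise ValueError("ceiling must be higher than baseline")
    return (paired - baseline) / gap


# Hypothetical resolution rates, NOT GitHub's published figures.
sonnet_alone = 0.40       # cheaper model on its own
opus_alone = 0.50         # stronger model on its own
sonnet_plus_duck = 0.47   # cheaper model plus cross-family critic

print(f"{gap_closed(sonnet_alone, sonnet_plus_duck, opus_alone):.1%}")
```

The takeaway from the formula: a big gap-closure percentage can sit on top of a small absolute gain if the two models were close to begin with, so read it alongside the raw improvement.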

Does Rubber Duck Actually Fix Copilot’s Blind Spots?

Look, self-reflection in AI agents? It’s table stakes now—Claude mulling its own code before executing. But as GitHub admits, a model critiquing itself is like a fox guarding the henhouse: same biases, same training scars. Enter Rubber Duck, from a ‘complementary family.’ Claude leads, GPT snipes. Or vice versa down the line.

Invocations are smart—sparse, targeted. Auto-triggers at plan checkpoints, loop jams, or your manual /critique nudge. No endless chit-chat; it surfaces a tight list of high-impact gripes. Copilot reasons over them, tweaks, shows diffs. Feels like pair programming, minus the egos and bad breath.
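To make the trigger logic concrete, here is a rough sketch of a cross-model critique step. Everything in it is an assumption for illustration: the names, the stall limit of 3, and the cap of 3 issues are mine, and the stub functions stand in for real model calls. It is not Copilot CLI's actual implementation or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# All names below are hypothetical; none of this is Copilot CLI's real API.
Critic = Callable[[str], List[str]]        # second-family model: plan -> issues
Reviser = Callable[[str, List[str]], str]  # primary model: plan + issues -> new plan


@dataclass
class AgentState:
    plan: str
    steps_without_progress: int = 0
    critiques: List[str] = field(default_factory=list)


def should_critique(state: AgentState, at_checkpoint: bool, manual: bool,
                    stall_limit: int = 3) -> bool:
    """Fire on a plan checkpoint, a manual request, or a stalled loop."""
    return at_checkpoint or manual or state.steps_without_progress >= stall_limit


def maybe_critique(state: AgentState, critic: Critic, revise: Reviser,
                   at_checkpoint: bool = False, manual: bool = False,
                   max_issues: int = 3) -> AgentState:
    """Ask the critic only when a trigger fires, then let the primary revise."""
    if not should_critique(state, at_checkpoint, manual):
        return state
    issues = critic(state.plan)[:max_issues]  # keep the list short
    state.critiques.extend(issues)
    if issues:
        state.plan = revise(state.plan, issues)
        state.steps_without_progress = 0
    return state


# Stub models so the sketch runs without any network calls.
def stub_critic(plan: str) -> List[str]:
    return ["facet filter is dropped silently"] if "facet" in plan else []


def stub_revise(plan: str, issues: List[str]) -> str:
    return plan + " | addressed: " + "; ".join(issues)


state = AgentState(plan="rebuild Solr facet query")
state = maybe_critique(state, stub_critic, stub_revise, manual=True)
print(state.plan)
```

The design point is the gate: the critic is consulted only when a trigger fires, which is what keeps the second model's token bill from scaling with every step the primary agent takes.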

Yet. My unique gripe (and this ain’t in GitHub’s post) is how this reeks of the old Microsoft playbook. Remember Visual Studio’s IntelliSense? Solid, but always pushing you deeper into the ecosystem. Rubber Duck’s the same: it pairs premium models you pay for (hello, token costs) and locks you into Copilot CLI. Who’s making bank? Not you, debugging faster. GitHub, billing API calls from dueling AIs. Prediction: this boosts retention more than it revolutionizes dev velocity. We’ve seen ‘agentic’ loops hyped before (Devin, Cursor), and it’s the same story: incremental wins at experimental scale.

It shines on hard problems. Three-plus files, 70-step marathons? That’s where the Duck helps Sonnet close most of the gap to Opus. Examples abound: dropped query facets, unhandled edges. But for quick fixes? Probably overkill, wasting cycles.

Why Bother with GitHub Copilot CLI’s Experimental Mode?

Enable it via /experimental in Copilot CLI. Pick a Claude model from the model picker and make sure you have GPT-5.4 access (paywall alert). Critiques pop up proactively or on demand. GitHub’s iterating: more model pairs are coming, maybe GPT orchestrating with Claude ducking.

Skeptical me wonders: does this scale? SWE-Bench Pro’s no silver bullet; real repos have proprietary cruft, flaky deps. And that 74.7% gap closure? Sounds great until you hit diminishing returns on easy tasks. Plus, experimental means bugs—feedback thread’s already buzzing.

Fleet mode pairs nicely (/fleet for parallel agents), but that’s another layer. Bottom line: if you’re wrestling multi-file nightmares in CLI, try it. Otherwise? Stick to your IDE Copilot, save the tokens.

Corporate spin calls it an ‘independent reviewer.’ I call it cross-pollinating hallucinations: better than solo, sure, but still AI roulette. Twenty years in, I’ve seen tools promise to ‘think like a senior dev.’ Most fizzle. This? Might stick, if pricing doesn’t gouge.

Who wins? Power users on tough OSS bugs. Casual coders? Meh. GitHub? Token jackpot.


Frequently Asked Questions

What is Rubber Duck in GitHub Copilot CLI?

Rubber Duck is an experimental feature that uses a second AI model (like GPT-5.4 with Claude) to review and critique your primary agent’s code plans and implementations, catching misses early.

How do I enable Rubber Duck in Copilot CLI?

Install Copilot CLI, run /experimental, select a Claude model as orchestrator, and ensure GPT-5.4 access—critiques trigger automatically or via /critique.

Does Rubber Duck improve GitHub Copilot performance on hard tasks?

Yes, GitHub’s evals show it closes 74.7% of the gap between Sonnet and Opus on multi-file, long-step problems, boosting Sonnet by 3-5% where it counts.

Written by Priya Sundaram

Hardware and infrastructure reporter. Tracks GPU wars, chip design, and the compute economy.


Originally reported by GitHub Blog
