GitHub Copilot CLI Rubber Duck Explained

I stared at my terminal this afternoon, coffee gone cold, as GitHub Copilot CLI’s Rubber Duck tore apart a half-baked data pipeline plan—catching a Solr facet glitch that’d slipped right past Claude Sonnet.

GitHub Copilot CLI’s Rubber Duck lands in experimental mode today, promising a ‘second opinion’ from a rival AI family to spot what your primary agent misses. It’s not some fluffy toy; this thing’s built to review plans, implementations, and edge cases before you commit to disaster. Select Claude as your orchestrator (Opus, Sonnet, or Haiku), and bam—GPT-5.4 steps in as the contrarian duck, flagging assumptions, inefficiencies, the works.

But here’s the thing. We’ve heard this tune before. Back in the ’90s, we coders rubber-ducked our bugs to the toys on our desks, talking through code line by line until the logic clicked. GitHub’s riffing on that, sure, but swapping the bath toy for Claude’s frenemy, GPT. Cute nod. Except now it’s paywalled behind slash commands and model access.

“Claude Sonnet 4.6 paired with Rubber Duck running GPT-5.4 achieved a resolution rate approaching Claude Opus 4.6 running alone, closing 74.7% of the performance gap between Sonnet and Opus.”

That’s straight from GitHub’s eval on SWE-Bench Pro, their benchmark for gnarly, real-world GitHub issues spanning multiple files and marathon step counts. Impressive numbers: Sonnet plus Duck beats the Sonnet-only baseline by 3.8% on tough multi-file beasts, and by even more on the hairiest ones. The duck caught stuff like dict keys iterating wrong and Solr facets vanishing silently. Solid.
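If "closing 74.7% of the gap" reads as fuzzy, the arithmetic behind a figure like that is simple. Here's a minimal sketch of how I'd compute it, assuming the usual definition (improvement over baseline divided by the baseline-to-ceiling gap). The function name and the scores are mine, made up for illustration; GitHub's post is the only source for the real numbers.

```python
def gap_closed(baseline: float, combined: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling gap recovered by the combined setup.

    baseline: score of the weaker model alone (think Sonnet)
    combined: score of the weaker model plus reviewer (Sonnet + Duck)
    ceiling:  score of the stronger model alone (think Opus)
    """
    gap = ceiling - baseline
    if gap <= 0:
        raise ValueError("ceiling must be higher than baseline")
    return (combined - baseline) / gap


# Invented scores, NOT GitHub's results: chosen so the math is easy to check.
print(f"{gap_closed(baseline=40, combined=46, ceiling=48):.1%}")  # prints 75.0%
```

Worth noticing: the ratio says nothing about how wide the gap was to begin with. Closing three quarters of a two-point gap and three quarters of a twenty-point gap produce the same headline.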

Does Rubber Duck Actually Fix Copilot’s Blind Spots?

Look, self-reflection in AI agents? It’s table stakes now—Claude mulling its own code before executing. But as GitHub admits, a model critiquing itself is like a fox guarding the henhouse: same biases, same training scars. Enter Rubber Duck, from a ‘complementary family.’ Claude leads, GPT snipes. Or vice versa down the line.

Invocations are smart—sparse, targeted. Auto-triggers at plan checkpoints, loop jams, or your manual /critique nudge. No endless chit-chat; it surfaces a tight list of high-impact gripes. Copilot reasons over them, tweaks, shows diffs. Feels like pair programming, minus the egos and bad breath.
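To make the "sparse, targeted" part concrete, here's a rough sketch of what a trigger policy along those lines could look like. To be clear, this is my guess at the shape, not GitHub's implementation: `AgentState`, `critique_reason`, and the loop threshold are all hypothetical names and values I made up.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cutoff for "the agent is stuck"; the real value, if any, is unknown.
LOOP_THRESHOLD = 3


@dataclass
class AgentState:
    """Hypothetical snapshot of the orchestrating agent."""
    plan_ready: bool = False     # a plan was just drafted (checkpoint)
    stalled_attempts: int = 0    # consecutive attempts without progress
    user_asked: bool = False     # the user typed /critique


def critique_reason(state: AgentState) -> Optional[str]:
    """Return why a second-opinion review should run, or None to stay quiet."""
    if state.user_asked:
        return "manual"
    if state.plan_ready:
        return "plan-checkpoint"
    if state.stalled_attempts >= LOOP_THRESHOLD:
        return "loop-jam"
    return None  # default: no review, no extra tokens


print(critique_reason(AgentState(plan_ready=True)))       # plan-checkpoint
print(critique_reason(AgentState(stalled_attempts=1)))    # None
```

The design point is the default: the reviewer stays silent unless one of a few specific conditions fires, which is what keeps a second model from turning every step into a debate.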

Yet. My unique gripe (and this ain’t in GitHub’s post) is how this reeks of the old Microsoft playbook. Remember Visual Studio’s IntelliSense? Solid, but always pushing you deeper into the ecosystem. Rubber Duck’s the same: it pairs premium models you pay for (hello, token costs) and locks you into Copilot CLI. Who’s making bank? Not you, debugging faster. GitHub, billing API calls from dueling AIs. Prediction: this boosts retention more than it revolutionizes dev velocity. We’ve seen ‘agentic’ loops hyped before (Devin, Cursor). Same story: incremental wins at experimental scale.

It shines on hard problems. Three-plus files, 70-step marathons? That’s where Sonnet plus Duck pulls furthest ahead of Sonnet alone. Examples abound: dropped query facets, unhandled edges. But for quick fixes? Probably overkill, wasting cycles.

Why Bother with GitHub Copilot CLI’s Experimental Mode?

Enable it via /experimental in Copilot CLI. Pick Claude from the model picker, ensure GPT-5.4 access (paywall alert). Critiques pop up proactively or on demand. GitHub’s iterating: more model pairs are coming, maybe GPT orchestrating with Claude ducking.

Skeptical me wonders: does this scale? SWE-Bench Pro’s no silver bullet; real repos have proprietary cruft, flaky deps. And that 74.7% gap closure? Sounds great until you hit diminishing returns on easy tasks. Plus, experimental means bugs—feedback thread’s already buzzing.

Fleet mode pairs nicely (/fleet for parallel agents), but that’s another layer. Bottom line: if you’re wrestling multi-file nightmares in CLI, try it. Otherwise? Stick to your IDE Copilot, save the tokens.

Corporate spin calls it ‘independent reviewer.’ I call it cross-pollinating hallucinations—better than solo, sure, but still AI roulette. Twenty years in, I’ve seen tools promise to ‘think like a senior dev.’ Most fizzle. This? Might stick, if pricing doesn’t gouge.

Who wins? Power users on tough OSS bugs. Casual coders? Meh. GitHub? Token jackpot.

Frequently Asked Questions

What is Rubber Duck in GitHub Copilot CLI?

Rubber Duck is an experimental feature that uses a second AI model from a different family (for example, GPT-5.4 when Claude is the orchestrator) to review and critique your primary agent’s code plans and implementations, catching misses early.

How do I enable Rubber Duck in Copilot CLI?

Install Copilot CLI, run /experimental, select a Claude model as orchestrator, and ensure GPT-5.4 access—critiques trigger automatically or via /critique.

Does Rubber Duck improve GitHub Copilot performance on hard tasks?

Yes, GitHub’s evals on SWE-Bench Pro show Sonnet plus Rubber Duck closing 74.7% of the gap between Sonnet and Opus, and beating the Sonnet baseline by 3.8% or more on multi-file, long-step problems.
