Selective Repo Fetch: Docs-as-Code Gets Faster

Docs-as-code.

It’s the seductive promise: your documentation, treated with the same respect as your application code. Version control, pull requests, automated deployments – all the bells and whistles that make development hum. For a while, it works beautifully. Then your repo balloons. You hit 100,000 files, then 200,000, and the elegant simplicity of git clone crumbles into a slow, agonizing crawl.

This isn’t some fringe problem for hobbyists. Teams managing large documentation portals, pulling content from dozens of sprawling repositories, face a build-time nightmare. Imagine needing just a handful of markdown files and a few images from a repo that’s become a digital landfill. The traditional approach, a full git clone, transforms a quick doc build into an exercise in extreme patience. Minutes spent downloading gigabytes of data for a few kilobytes of content.

We’ve seen the workarounds. Sparse checkouts? Still bogged down by Git’s history negotiation. Shallow clones? They miss crucial context. Directly hitting provider APIs? You’ll quickly find yourself bumping against rate limits, a frustratingly common hurdle when dealing with massive file sets. Each path, a dead end or a compromise that still leaves you wanting.

Here’s the fundamental, almost comical, flaw in the old way: the manifest already declares exactly which files are needed. Your docfx.json (or your static site generator’s equivalent) meticulously lists every content glob, every resource pattern. The information is there. We just weren’t using it early enough in the process.
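As a rough illustration of how much the manifest already tells you, here is a minimal sketch that pulls the declared globs out of a docfx-style manifest. The `DocManifest` type and the `collectPatterns` helper are hypothetical names made up for this example. They mirror the `build.content` and `build.resource` sections of a docfx.json and ignore `exclude` and `src` for brevity.

```typescript
// Hypothetical, simplified shape of a docfx-style manifest.
interface FileGroup {
  files: string[]; // include globs, e.g. "**/*.md"
}

interface DocManifest {
  build: {
    content?: FileGroup[];  // pages to build
    resource?: FileGroup[]; // images and other static assets
  };
}

interface DeclaredPatterns {
  content: string[];
  resource: string[];
}

// Flatten the include globs of every group into one list per category.
function collectPatterns(manifest: DocManifest): DeclaredPatterns {
  const flatten = (groups: FileGroup[] | undefined): string[] => {
    const out: string[] = [];
    for (const group of groups || []) {
      for (const pattern of group.files) {
        out.push(pattern);
      }
    }
    return out;
  };
  return {
    content: flatten(manifest.build.content),
    resource: flatten(manifest.build.resource),
  };
}
```

Everything downstream (matching, filtering, fetching) only needs these two lists, which is why the manifest can drive the fetch before a single file is downloaded.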

The AI Imperative: Docs, Not Deploys

This problem has escalated beyond mere build speed. The rise of AI agents – those promised assistants for product Q&A, developer onboarding, or internal process queries – fundamentally shifts the stakes. These agents need access to your documentation. Not your entire codebase, not your test suites. Precisely the documented truths of your product. But how do you efficiently feed an AI RAG pipeline that needs to ingest documentation from dozens of repos when cloning all of them is, frankly, absurd? How do you enable incremental indexing when the manifest already tells you which files are docs versus code? How do you build multi-repo knowledge bases with a method that’s both fast and exquisitely selective?

The answer, it turns out, is to flip the script. Instead of the wasteful “clone everything → build → throw away 99%” cycle, we can embrace a more intelligent workflow: get the file listing → match against manifest → fetch only what matches.
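Sketched in TypeScript, the flipped workflow might look like the following. Everything here is an assumption for illustration rather than the selective-repo-fetch API: `globToRegExp` is a deliberately tiny glob matcher (only `**`, `*` and `?`), and `listFiles` and `fetchFile` are hypothetical callbacks standing in for whatever your Git provider offers.

```typescript
// Convert a simple glob into an anchored RegExp (supports **, * and ? only).
function globToRegExp(glob: string): RegExp {
  let re = "";
  let i = 0;
  while (i < glob.length) {
    const ch = glob.charAt(i);
    if (ch === "*" && glob.charAt(i + 1) === "*") {
      if (glob.charAt(i + 2) === "/") {
        re += "(?:.*/)?"; // "**/" matches zero or more directories
        i += 3;
      } else {
        re += ".*"; // bare "**" matches anything, including "/"
        i += 2;
      }
    } else if (ch === "*") {
      re += "[^/]*"; // "*" never crosses a directory boundary
      i += 1;
    } else if (ch === "?") {
      re += "[^/]";
      i += 1;
    } else {
      re += ch.replace(/[.+^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
      i += 1;
    }
  }
  return new RegExp("^" + re + "$");
}

// Step 2: keep only the listed paths that match at least one manifest pattern.
function matchListing(listing: string[], patterns: string[]): string[] {
  const matchers = patterns.map(globToRegExp);
  return listing.filter((path) => matchers.some((m) => m.test(path)));
}

type ListFiles = () => Promise<string[]>;            // hypothetical: paths only, no contents
type FetchFile = (path: string) => Promise<string>;  // hypothetical: one file's contents

// Steps 1 to 3: list, match, then fetch only what matched.
async function selectiveFetch(
  patterns: string[],
  listFiles: ListFiles,
  fetchFile: FetchFile,
): Promise<Map<string, string>> {
  const listing = await listFiles();
  const wanted = matchListing(listing, patterns);
  const files = new Map<string, string>();
  for (const path of wanted) {
    files.set(path, await fetchFile(path));
  }
  return files;
}
```

The point of the design is that `fetchFile` is never called for a path the manifest did not ask for, so the cost scales with the size of the doc set rather than the size of the repository.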

This architectural shift is elegantly captured by the open-source TypeScript library, selective-repo-fetch. It’s a small, MIT-licensed tool designed to be provider-agnostic, stripping away the cruft of full repository clones. The core idea is simple, yet profound: a file tree listing from any Git provider (think GitHub, GitLab, Azure DevOps) is a single, cheap API call that returns metadata – paths and statuses – not the file contents themselves. Match this listing against your project’s manifest, and you instantaneously know precisely what you need to fetch.

┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  Git Provider   │     │ selective-repo-fetch │     │  Doc Pipeline   │
│ (file listing)  │────▶│ (manifest matching   │────▶│ (build only     │
│                 │     │  + reference filter) │     │  matched files) │
└─────────────────┘     └──────────────────────┘     └─────────────────┘
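For GitHub specifically, one way to get that listing is the Git Trees endpoint with `recursive=1`, which returns paths, types and SHAs without any file contents. The sketch below is an assumption about how the listing step could be written, not the library's own provider code. `HttpGet`, `treeUrl`, `blobPaths` and `listRepoFiles` are hypothetical names. One caveat worth handling: GitHub sets `truncated` in the response when a tree is too large to return in full, so for very large repositories a single call may not be enough.

```typescript
// Subset of GitHub's "Get a tree" response used by this sketch.
interface GitTreeEntry {
  path: string;
  type: string; // "blob" for files, "tree" for directories
  sha: string;
}

interface GitTreeResponse {
  tree: GitTreeEntry[];
  truncated: boolean; // true when GitHub could not return the full tree
}

// Minimal fetch-like callback (hypothetical) so the HTTP client can be swapped or stubbed.
type HttpGet = (
  url: string,
  headers: Record<string, string>,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the Git Trees URL; treeSha may be a commit SHA, a tree SHA, or a branch name.
function treeUrl(owner: string, repo: string, treeSha: string): string {
  return `https://api.github.com/repos/${owner}/${repo}/git/trees/${encodeURIComponent(treeSha)}?recursive=1`;
}

// Keep only file entries; directories come back with type "tree".
function blobPaths(body: GitTreeResponse): string[] {
  return body.tree.filter((e) => e.type === "blob").map((e) => e.path);
}

async function listRepoFiles(
  owner: string,
  repo: string,
  treeSha: string,
  httpGet: HttpGet,
  token?: string,
): Promise<string[]> {
  const headers: Record<string, string> = { Accept: "application/vnd.github+json" };
  if (token) headers["Authorization"] = `Bearer ${token}`;

  const res = await httpGet(treeUrl(owner, repo, treeSha), headers);
  if (!res.ok) throw new Error(`Tree listing failed with HTTP ${res.status}`);

  const body = (await res.json()) as GitTreeResponse;
  if (body.truncated) {
    throw new Error("Tree listing was truncated; list subtrees one at a time instead");
  }
  return blobPaths(body);
}
```

With Node 18 or later, the built-in `fetch` can serve as the callback, for example `listRepoFiles("owner", "repo", "main", (url, headers) => fetch(url, { headers }))`. GitLab and Azure DevOps expose comparable tree or item listing endpoints, with different URLs and response shapes.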


This isn’t just about speed; it’s about precision. Glob matching, while useful, can be overly generous. A **/*.png pattern grabs every image in the tree, including those that no markdown file actually references. In a large repository, these unreferenced assets quietly accumulate, bloating download sizes and build times. selective-repo-fetch tackles this with a second pass: filtering resources down to those explicitly referenced within your content files. It scans content for markdown images (![](path)), markdown links ([text](path)), and HTML attributes (src="path", href="path"), so you only fetch what’s actively used.
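A second pass of this kind can be approximated with two regular expressions and a little path resolution. The sketch below is an assumption for illustration and not the library's implementation. `extractReferences`, `resolveReference` and `keepReferenced` are hypothetical names, root-relative paths are treated as relative to the repository root, and the regexes cover only the inline forms named above (no reference-style links, no paths containing spaces or parentheses).

```typescript
// Pull raw link targets out of Markdown and HTML content.
function extractReferences(content: string): string[] {
  const patterns = [
    /!?\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)/g,        // ![alt](path) and [text](path "title")
    /\b(?:src|href)\s*=\s*["']([^"']+)["']/gi,   // src="path" and href="path"
  ];
  const refs: string[] = [];
  for (const pattern of patterns) {
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(content)) !== null) {
      const target = match[1];
      if (target) refs.push(target);
    }
  }
  return refs;
}

// Resolve a link target against the file that contains it.
// Returns null for external URLs and in-page anchors.
function resolveReference(fromFile: string, target: string): string | null {
  if (/^([a-z][a-z0-9+.-]*:|\/\/|#)/i.test(target)) return null;
  const clean = target.replace(/[?#].*$/, ""); // drop query string and fragment
  if (clean === "") return null;

  // Assumption: a leading "/" means "relative to the repository root".
  const parts = clean.charAt(0) === "/" ? [] : fromFile.split("/").slice(0, -1);
  for (const segment of clean.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") parts.pop();
    else parts.push(segment);
  }
  return parts.join("/");
}

// Keep only the glob-matched resources that some content file actually references.
function keepReferenced(
  resources: string[],
  contentByPath: Record<string, string>,
): string[] {
  const referenced = new Set<string>();
  for (const file of Object.keys(contentByPath)) {
    for (const target of extractReferences(contentByPath[file] || "")) {
      const resolved = resolveReference(file, target);
      if (resolved !== null) referenced.add(resolved);
    }
  }
  return resources.filter((path) => referenced.has(path));
}
```

Note the ordering this implies: content files have to be fetched first, because their text is what decides which of the glob-matched resources are worth downloading at all.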

Consider the impact: moving from a colossal 200,000 files down to the critical 50. Achieved with a single function call. This granular control is what unlocks truly efficient CI/CD pipelines and, more importantly, provides the fresh, accurate data that AI documentation agents desperately need.

Why Does This Matter for Developers?

At its heart, this is a story about embracing the information already available in your development workflow. The manifest file, a staple of static site generators, is the latent superpower waiting to be unleashed. By integrating selective fetching early, we can dramatically reduce the overhead associated with managing documentation in large, multi-repo environments. For teams building internal tooling, developer portals, or even public-facing documentation sites, the ability to quickly and accurately pull only the relevant assets can mean the difference between a laggy, frustrating experience and a slick, responsive one.

It’s also a clear signal about the future of developer tools. As complexity grows and repositories expand, naive approaches will simply break. The industry is moving towards more intelligent, metadata-driven operations. selective-repo-fetch is a tangible example of this architectural shift, demonstrating how a deep understanding of existing tooling can lead to elegant, performant solutions.

The core workflow:

```typescript
import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

// Your manifest declares what your doc site needs
const manifest = {
  build: {
    content: [{ files: ['**/*.md'] }],
    // ...
  },
};
```
