
SQLite Scaling Puzzle: Why More Cores Didn't Help

Autentico's journey to 2.0 hit a wall, not in code, but in expectation. A dive into SQLite performance revealed a bizarre scaling plateau that defied common sense, forcing a rethink of distributed computation.

[Diagram: the Autentico and Verifico architecture, with a central Autentico server communicating with multiple Verifico worker processes.]

Key Takeaways

  • Offloading CPU-bound tasks like bcrypt verification can significantly improve application performance, but scaling isn't always linear with more cores.
  • Modern CPU architectures and their inherent efficiency can create unexpected scaling plateaus, where adding more cores provides diminishing returns.
  • Understanding a task’s specific computational characteristics is crucial to designing an effective scaling strategy, especially for performance-critical applications.

The faint hum of a laptop fan, a common soundtrack to late-night coding sessions, was the only witness to a quiet unraveling of assumptions.

Autentico 2.0, poised for release, was about to embark on a performance odyssey that would redefine its architecture. This wasn’t just a tweak; it was a week-long immersion into profiling, deep architectural rethinking, relentless benchmarking, and a humbling encounter with the hard truths of scaling.

Autentico itself is a marvel of self-sufficiency: a complete OAuth 2.0 and OpenID Connect identity provider, all packed into a single Go binary with a single SQLite database file. No external fuss, just pure, unadulterated functionality. The benchmark? A full PKCE authorization code flow – a series of requests involving authorization, password login, token exchange, introspection, and refresh. Think five HTTP requests and four to five SQLite writes, all punctuated by a single, deliberately slow bcrypt password verification. Heavy lifting, indeed.
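
To make the shape of that flow concrete, here is a minimal sketch of one benchmark iteration in Go. The endpoint paths, parameter names, and placeholder tokens are assumptions for illustration, not Autentico’s documented API; errors are ignored for brevity.

```go
// One iteration of the benchmarked PKCE flow. Paths, parameters, and
// placeholder values ("<code>", "<access-token>", ...) are assumptions.
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"net/http"
	"net/url"
)

func main() {
	// PKCE: a random verifier and its S256 challenge.
	raw := make([]byte, 32)
	rand.Read(raw)
	verifier := base64.RawURLEncoding.EncodeToString(raw)
	sum := sha256.Sum256([]byte(verifier))
	challenge := base64.RawURLEncoding.EncodeToString(sum[:])

	c := &http.Client{}
	base := "http://localhost:8080"

	// 1. Authorization request.
	c.Get(base + "/authorize?response_type=code&client_id=bench" +
		"&code_challenge=" + challenge + "&code_challenge_method=S256")
	// 2. Password login: the one request that pays the bcrypt toll.
	c.PostForm(base+"/login", url.Values{"username": {"alice"}, "password": {"..."}})
	// 3. Exchange the returned code for tokens.
	c.PostForm(base+"/token", url.Values{
		"grant_type":    {"authorization_code"},
		"code":          {"<code>"},
		"code_verifier": {verifier},
	})
	// 4. Introspect the access token.
	c.PostForm(base+"/introspect", url.Values{"token": {"<access-token>"}})
	// 5. Refresh.
	c.PostForm(base+"/token", url.Values{
		"grant_type":    {"refresh_token"},
		"refresh_token": {"<refresh-token>"},
	})
}
```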

Initial stress tests on an older i5 laptop, with a modest 100 virtual users, yielded results that were… adequate. But ‘adequate’ isn’t the stuff of platform shifts, and a quick profiling session laid bare the culprit. A staggering 90% of CPU time was swallowed whole by bcrypt.CompareHashAndPassword. This is by design, of course – bcrypt’s slowness is its strength, a bulwark against brute-force attacks. SQLite writes? Microseconds. JWT signing? Negligible. HTTP routing? Invisible. It was all bcrypt, hogging every available core.
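
Finding a hotspot like that takes one import in Go. A minimal sketch, assuming the standard net/http/pprof package and an arbitrary port:

```go
// Expose Go's built-in profiler alongside the application, then capture
// a CPU profile while the load test runs. Port 6060 is an arbitrary choice.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

With the benchmark running, `go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"` followed by `top` at the pprof prompt is enough to watch bcrypt.CompareHashAndPassword dominate the chart.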

The seemingly obvious conclusion: bcrypt was the bottleneck, and you can’t speed up a slow function. You can only run more of them in parallel. But here’s the rub: SQLite, in its beautiful simplicity, is a single-writer, single-file champion. Traditional horizontal scaling – spinning up more instances behind a load balancer – hits a wall. One writer, one file. A single point of contention.

Or is it? The bottleneck wasn’t the database; it was that single, CPU-hungry function call. So, what if you scaled just that function?

This led to a systematic exploration of options.

First, CQRS with SQLite replication. LiteFS, a nifty tool, can indeed replicate SQLite across nodes, setting up a primary for writes and read replicas. It’s a solid solution for general scaling problems, but Autentico’s issue was far more specific. It didn’t need to distribute reads and writes across a fleet; it needed to distribute the sheer computational load of bcrypt.

Then there’s the Postgres path. The go-to answer when SQLite just won’t cut it. But Postgres, while powerful, doesn’t magically solve the CPU crunch of bcrypt. The CompareHashAndPassword function still lives on the application server. More instances behind a load balancer would spread the load, sure, but you’d be provisioning entire application stacks – database connections, memory, middleware – when all you really needed was more raw CPU power for one specific task.

Child processes? Spawn separate processes for bcrypt work? Go’s goroutines and runtime scheduler already do a fantastic job of parallelizing CPU-bound tasks across available cores on a single machine. Introducing separate processes just adds the overhead of inter-process communication, a net loss.
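
A toy demonstration of that claim, assuming only the golang.org/x/crypto/bcrypt package: run one verification per core concurrently and note that total wall time stays close to the cost of a single check.

```go
// Goroutines alone spread CPU-bound bcrypt checks across all cores;
// no child processes or IPC required.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	hash, _ := bcrypt.GenerateFromPassword([]byte("hunter2"), 10)
	n := runtime.NumCPU()

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each verification runs on its own core, so n checks take
			// roughly as long as one.
			bcrypt.CompareHashAndPassword(hash, []byte("hunter2"))
		}()
	}
	wg.Wait()
	fmt.Printf("%d parallel verifications in %v\n", n, time.Since(start))
}
```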

Sticky sessions? Routing users to specific instances sounds neat, but it requires a shared lookup table, which in turn necessitates a shared database – precisely the problem we were trying to sidestep.

And then, like a bolt of lightning in a darkened server room, the idea clicked. Keep Autentico as a single, monolithic instance, master of its database, handling all core operations. But when a password verification is needed, send the hash and plaintext not to a local CPU core, but to a remote worker. These workers would be stateless, trivial to spin up, and could run on the most cost-effective hardware imaginable, dedicated solely to the computationally expensive task of bcrypt.
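
In sketch form, such a worker needs almost nothing. The route, header name, and JSON shape below are illustrative assumptions, not Verifico’s actual wire format:

```go
// A hypothetical bcrypt worker: one endpoint, one shared secret, no state.
package main

import (
	"crypto/subtle"
	"encoding/json"
	"log"
	"net/http"

	"golang.org/x/crypto/bcrypt"
)

const sharedSecret = "change-me" // in practice, injected via env or flag

func main() {
	http.HandleFunc("/verify", func(w http.ResponseWriter, r *http.Request) {
		// Constant-time comparison of the shared-secret header.
		got := r.Header.Get("X-Verify-Secret")
		if subtle.ConstantTimeCompare([]byte(got), []byte(sharedSecret)) != 1 {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		var req struct {
			Hash     string `json:"hash"`
			Password string `json:"password"`
		}
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		// The only real work: the CPU-bound bcrypt comparison.
		err := bcrypt.CompareHashAndPassword([]byte(req.Hash), []byte(req.Password))
		json.NewEncoder(w).Encode(map[string]bool{"ok": err == nil})
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```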

This new component, christened Verifico (a nod to Autentico’s Italian flair), would live as a new subcommand: autentico verifico start. A single HTTP endpoint, a shared secret for authentication, and a round-robin load balancer with automatic fallback to local bcrypt if the workers go offline. The security model evolved too: mTLS was too heavy, hand-rolled AES encryption felt like reinventing TLS poorly, and in the end a simple shared secret over a private network proved strong enough, given that the password has already traversed the public internet to reach Autentico.
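
The caller’s side, under the same assumptions, reduces to a round-robin loop over worker URLs with a local bcrypt fallback when none respond. A sketch:

```go
// Hypothetical caller: round-robin over workers, local bcrypt as fallback.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"sync/atomic"
	"time"

	"golang.org/x/crypto/bcrypt"
)

type pool struct {
	workers []string // e.g. http://10.0.0.5:9090/verify
	next    atomic.Uint64
	client  http.Client
}

func (p *pool) verify(hash, password string) bool {
	body, _ := json.Marshal(map[string]string{"hash": hash, "password": password})
	for range p.workers {
		url := p.workers[p.next.Add(1)%uint64(len(p.workers))]
		req, _ := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
		req.Header.Set("X-Verify-Secret", "change-me")
		req.Header.Set("Content-Type", "application/json")
		resp, err := p.client.Do(req)
		if err != nil {
			continue // worker unreachable: try the next one
		}
		var out struct {
			OK bool `json:"ok"`
		}
		json.NewDecoder(resp.Body).Decode(&out)
		resp.Body.Close()
		return out.OK
	}
	// All workers offline: fall back to the local, CPU-bound check.
	return bcrypt.CompareHashAndPassword([]byte(hash), []byte(password)) == nil
}

func main() {
	p := &pool{
		workers: []string{"http://10.0.0.5:9090/verify"},
		client:  http.Client{Timeout: 5 * time.Second},
	}
	fmt.Println(p.verify("$2a$10$...", "hunter2"))
}
```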

On the i5, Verifico was a revelation. With the server constrained to just 2 cores and bcrypt offloaded to dedicated workers, non-login endpoints plummeted from seconds to single-digit milliseconds. The server’s precious cores were freed up for HTTP handling, SQLite queries, and JWT signing. Throughput scaled beautifully, climbing nearly linearly with worker count up to about 6 cores. Then it flattened: at 8 cores, adding more workers yielded no further benefit.

This felt like a triumph. A clean, benchmarked solution. Ready for prime time.

But then, the same benchmarks were unleashed on a modern Ryzen 7 desktop. Sixteen cores, screaming-fast single-thread performance, gargantuan cache. The anticipation was electric.

Constraining Autentico to its 2 cores, we started adding 2-core workers: 2+2, 2+2+2, all the way up to a formidable 2+7×2, the Ryzen’s full 16 cores. On the i5, throughput had climbed with each added worker up to a point. On the Ryzen, however? Stagnation.

Config (cores)         iter/s   Login p95
2 server + 2 worker    15.4/s   3.61s
2 server + 4 worker    15.4/s   3.68s
2 server + 6 worker    15.2/s   3.58s
2 server + 10 worker   15.0/s   3.60s
2 server + 14 worker   14.7/s   3.76s

Flat. Five configurations, spanning 2 to 14 worker cores, and throughput barely budged. Adding workers did absolutely nothing. The Ryzen, in its sheer brute force, was simply faster at bcrypt: even at the default cost setting of 10, each individual core was so much more capable than the i5’s that distributing the load across more of them became an exercise in futility.
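
The per-core gap is easy to see directly. A minimal sketch, assuming only golang.org/x/crypto/bcrypt: time a single cost-10 verification on each machine and compare the two numbers.

```go
// Time one cost-10 bcrypt verification; run on each machine to compare
// per-core speed. (The article's exact timings are not reproduced here.)
package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	hash, _ := bcrypt.GenerateFromPassword([]byte("hunter2"), 10) // cost 10
	start := time.Now()
	bcrypt.CompareHashAndPassword(hash, []byte("hunter2"))
	fmt.Println("one verification:", time.Since(start))
}
```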

The lesson here isn’t about the inadequacy of SQLite, or the shortcomings of Go. It’s about the nature of computational bottlenecks. We often assume more CPU equals more speed, but this experiment demonstrates that the efficiency of each core, the architecture it runs on, and the character of the specific task can create baffling plateaus. The Verifico architecture scaled the task cleanly, yet hit a ceiling imposed not by the network or the database, but by the diminishing returns of distributing a parallelizable computation across cores that were each already fast.

This isn’t just about bcrypt; it’s about understanding the physics of computation in the age of AI. When compute itself becomes the platform, these scaling dynamics become paramount. The assumption that you can scale a complex computation indefinitely by simply adding more generic compute is a dangerous one. It’s like trying to fill a teacup with a fire hose: eventually the cup is full, and the excess simply spills over, offering no further utility. Autentico’s journey highlights a critical insight: scaling isn’t just about more, it’s about smarter. And sometimes, the smartest path involves understanding when ‘enough’ is truly enough, even when faced with the siren song of boundless cores.

This entire ordeal hammered home that for certain compute-bound tasks, especially those with a deliberate computational cost like modern password hashing, the raw power of a single modern core can dwarf the effectiveness of distributing the same load across many older or less efficient ones. It’s a subtle yet critical distinction for anyone building distributed systems or relying on compute-intensive services.

Is this a problem for SQLite itself? Not really. Autentico’s core database operations remained lightning fast. The challenge was a specific, high-CPU-bound task. But it serves as a potent reminder that while SQLite is incredibly capable, the surrounding architecture and the nature of your workloads will always dictate the true scaling frontier.

This is the future unfolding – not just faster processors, but a deeper, more nuanced understanding of how computation scales and where the true limits lie. And sometimes, those limits are far closer than we imagine.

Why Did Adding More Worker Cores Not Increase Throughput?

Even though Autentico’s architecture scaled the bcrypt task by offloading it to separate worker processes, the Ryzen 7’s cores were so much faster and more efficient at bcrypt than the older i5’s that adding cores beyond the first pair of workers yielded no measurable gain. Throughput became capped by other parts of the system and by the limits of distributing that specific computation, rather than by the supply of raw CPU cycles for bcrypt.

How Did Verifico Improve Performance on the i5?

On the older i5 laptop, Verifico dramatically improved performance by offloading the heavily CPU-bound bcrypt verification task to dedicated worker processes. This freed up the Autentico server’s limited CPU cores to focus on its primary responsibilities: handling HTTP requests, interacting with the SQLite database, and signing JWTs. As a result, response times for non-login endpoints dropped significantly, and the server’s overall throughput scaled effectively with the number of worker cores added, up to a point.

What is Autentico?

Autentico is a self-contained OAuth 2.0 and OpenID Connect identity provider. It’s built with Go and SQLite, featuring a single binary and a single database file with no external dependencies, aiming for simplicity and ease of deployment.


Written by
DevTools Feed Editorial Team

Curated insights, explainers, and analysis from the editorial team.

Originally reported by dev.to
