DevOps & Platform Eng

KVM Migrations: The Silent Disaster

They say the VM conversion completed without errors. Every workload made it across. Turns out, the migration itself was the real problem.

[Hero image: a cracked server rack with tangled wires, symbolizing a broken migration.]

Key Takeaways

  • VMware environments are more than just hypervisors; they are integrated control planes for operations.
  • Lift-and-shift KVM migrations often fail to account for the operating model and tooling built around vCenter, leading to post-migration operational failures.
  • Replacing vCenter requires not just a new management tool, but a complete rebuild of operational confidence and expertise.

Most ‘successful’ VM migrations are time bombs. The consultants pack up their briefcases, the project lead closes the ticket, and everyone breathes a sigh of relief. Three weeks later, the silence is deafening. Backup jobs are failing. Monitoring dashboards are dark. Nobody knows what ‘normal’ even looks like on the new platform. The VM conversion worked, alright. The migration? Not so much.

This is the insidious lift-and-shift KVM fallacy. And let’s be clear: it’s not a KVM problem. It’s a scoping problem. A colossal, blindingly obvious, scoping problem. Most VMware-to-KVM migration plans fixate on the hypervisor itself – the shiny new ESXi replacement. Everything built around that hypervisor? Suddenly, it’s “someone else’s project.” That’s where the Operating Model Gap creeps in, a gaping hole left by faulty assumptions.

Lift-and-shift KVM means compute moves. Disk images transfer. Network configurations get ported. VM settings are painstakingly recreated on the other side. From a purely data-plane perspective, it looks like a success. The workloads are running, aren’t they? But that’s like saying a car is fixed because the engine starts, even though the steering wheel is gone.

What doesn’t move? Oh, just the entire operational backbone.

  • Operational runbooks referencing vCenter constructs. Gone.
  • Backup architecture built against vSphere APIs. Poof.
  • Monitoring thresholds calibrated to vSphere metrics. Meaningless.
  • Provisioning workflows targeting vCenter endpoints. Dead.
  • Snapshot behavior assumptions encoded in recovery procedures. Useless.
  • Storage policy logic tied to vSAN semantics. Erased.
  • Identity and access models mapped to vCenter RBAC. Invalid.
  • Operator muscle memory built over years of vCenter navigation. Unusable.

None of this makes it into the migration plan. All of it breaks after cutover. The Operating Model Gap is the yawning chasm between what the plan claimed to capture and what the platform actually required to function. Every single item on that list is a component of the operating model. The hypervisor swap? It touches precisely none of them.
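One way to make the gap concrete is to write the inventory down as data. A minimal sketch (the item names simply restate the list above; nothing here is a real tool):

```python
from dataclasses import dataclass

@dataclass
class OperatingModelItem:
    name: str
    moved_by_hypervisor_swap: bool  # does replacing ESXi with KVM carry this across?

# Each item from the list above. A hypervisor swap carries none of them.
OPERATING_MODEL = [
    OperatingModelItem("runbooks referencing vCenter constructs", False),
    OperatingModelItem("backup architecture built on vSphere APIs", False),
    OperatingModelItem("monitoring thresholds calibrated to vSphere metrics", False),
    OperatingModelItem("provisioning workflows targeting vCenter endpoints", False),
    OperatingModelItem("snapshot assumptions encoded in recovery procedures", False),
    OperatingModelItem("storage policy logic tied to vSAN semantics", False),
    OperatingModelItem("identity and access models mapped to vCenter RBAC", False),
    OperatingModelItem("operator muscle memory from years of vCenter use", False),
]

unscoped = [item.name for item in OPERATING_MODEL if not item.moved_by_hypervisor_swap]
print(f"{len(unscoped)}/{len(OPERATING_MODEL)} operating-model items missing from a lift-and-shift plan")
```

If your migration plan has no line item for each entry in `unscoped`, that work is still happening; it is just happening after cutover, unbudgeted.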

The framing that spawns these disastrous lift-and-shift KVM plans is deceptively simple: VMware equals ESXi. Replace ESXi with KVM. Migration complete. That framing is, to put it mildly, horseshit. VMware was never just ESXi. VMware was the control plane your entire operating model was built around, the invisible hand guiding everything.

What the plan says: ESXi → KVM. What actually changes:

  • vCenter (lifecycle and provisioning control)
  • vMotion semantics (live migration behavior)
  • vSAN (storage abstraction and policy model)
  • NSX (network policy and microsegmentation)
  • vROps / vRealize (observability and alerting logic)
  • VADP (backup API framework)
  • DRS (scheduling and placement policy)
  • Snapshot behavior (application-consistent logic)

A VMware environment isn’t some hypervisor with a few tacked-on features. It’s a deeply integrated control surface. Compute scheduling, storage policy, network segmentation, observability, recovery operations—they all converge there. Replace ESXi with KVM, and every single one of those layers needs a replacement or a complete rebuild. And unlike ESXi, KVM doesn’t ship with an instruction manual for assembling them all.

KVM is a kernel module. The management plane, the storage architecture, the network abstraction, the observability stack—that’s all on you to assemble, integrate, and operate. That assembly is the real migration work. The work that most lift-and-shift plans conveniently forget to scope.
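That assembly starts at the very bottom: on plain KVM there is no wizard generating VM specifications for you. A minimal sketch of building a libvirt domain definition by hand (disk path, device choices, and machine type are illustrative, not a production template):

```python
import xml.etree.ElementTree as ET

def minimal_domain_xml(name: str, memory_mib: int, vcpus: int, disk_path: str) -> str:
    """Build a minimal libvirt/KVM domain definition.

    On vSphere, vCenter produced and validated this kind of specification
    for you; with KVM plus libvirt, that responsibility is yours.
    """
    dom = ET.Element("domain", type="kvm")
    ET.SubElement(dom, "name").text = name
    ET.SubElement(dom, "memory", unit="MiB").text = str(memory_mib)
    ET.SubElement(dom, "vcpu").text = str(vcpus)
    os_el = ET.SubElement(dom, "os")
    ET.SubElement(os_el, "type", arch="x86_64", machine="q35").text = "hvm"
    devices = ET.SubElement(dom, "devices")
    disk = ET.SubElement(devices, "disk", type="file", device="disk")
    ET.SubElement(disk, "driver", name="qemu", type="qcow2")
    ET.SubElement(disk, "source", file=disk_path)
    ET.SubElement(disk, "target", dev="vda", bus="virtio")
    ET.SubElement(devices, "interface", type="network")
    return ET.tostring(dom, encoding="unicode")

xml = minimal_domain_xml("app01", 4096, 2, "/var/lib/libvirt/images/app01.qcow2")
```

You would hand this to `virsh define` or libvirt's `virDomainDefineXML`. Every question vCenter answered implicitly — placement, storage backing, network wiring — now needs an explicit answer in this document.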

The Operating Model Test: What If vCenter Vanished?

If vCenter disappeared tomorrow, what percentage of your operating model would vanish with it? For most VMware shops, the honest answer is somewhere between 60% and 90%. That percentage is the scope of what a lift-and-shift to KVM fails to address. These migrations don’t fail at cutover. They fail in the trenches of operations. The failures are predictable, they arrive in a sequence, and they are almost never in the migration plan.
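A crude way to run this test before the migration is to scan your runbook and automation repositories for vSphere-era constructs. A minimal sketch (the pattern list and file extensions are illustrative; tune both for your environment):

```python
import re
from pathlib import Path

# Constructs that only exist in a vSphere world; extend for your own tooling.
VSPHERE_PATTERNS = re.compile(
    r"vcenter|vsphere|esxi|vmotion|vsan|govc|powercli|pyvmomi", re.IGNORECASE
)

def vcenter_dependency_report(repo: Path,
                              exts=(".md", ".yml", ".yaml", ".py", ".ps1", ".sh")):
    """Count scanned files vs. files that reference vSphere constructs."""
    scanned, dependent = 0, []
    for path in repo.rglob("*"):
        if path.is_file() and path.suffix in exts:
            scanned += 1
            if VSPHERE_PATTERNS.search(path.read_text(errors="ignore")):
                dependent.append(path)
    return scanned, dependent
</```

Point it at your runbooks, monitoring config, and IaC repos. The dependent-to-scanned ratio is a rough lower bound on the operating-model surface a hypervisor swap leaves stranded — a lower bound, because muscle memory and tribal knowledge never show up in a grep.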

Why Did It All Break?

You didn’t just replace ESXi. You nuked vCenter. vCenter was the operational control surface for everything: provisioning new workloads, managing VM lifecycle, enforcing placement policy, controlling access, automating tasks. Move to KVM, and vCenter is gone. Poof. And everything that pointed at it? It needs a new target.

The KVM ecosystem offers options—libvirt for direct management, Proxmox VE for a GUI-centric approach, oVirt for a vCenter-like experience, OpenStack for massive cloud-scale orchestration. Each represents a fundamentally different operating model. None is a drop-in replacement. A team that spent a decade operating vCenter doesn’t magically know how to operate any of these under pressure at 2 AM. This is the first stall point. Not because a management plane doesn’t exist, but because the operating model loses its control surface. The team has to rebuild operational confidence from absolute zero.
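A planning device that keeps this honest is a coverage matrix: list every function vCenter performed and name its replacement in the stack you actually chose, treating every unmapped entry as unscoped work. A sketch with illustrative mappings (these are examples of one possible stack, not recommendations):

```python
# Functions the vCenter control plane performed, mapped to a candidate
# replacement in one hypothetical KVM stack. None of these mappings happens
# automatically; a None entry is migration work nobody has scoped yet.
VCENTER_FUNCTIONS = {
    "lifecycle and provisioning": "oVirt engine",
    "live migration": "libvirt migration over shared storage",
    "storage abstraction and policy": "Ceph RBD pools",
    "network policy / microsegmentation": None,   # gap: no NSX analogue chosen
    "observability and alerting": "Prometheus + libvirt exporter",
    "backup API framework": None,                 # gap: VADP has no drop-in analogue
    "scheduling and placement": "oVirt scheduling policies",
    "application-consistent snapshots": "qemu-guest-agent quiesce",
}

gaps = [fn for fn, replacement in VCENTER_FUNCTIONS.items() if replacement is None]
```

If `gaps` is non-empty on cutover day, those functions don’t degrade gracefully; they simply stop existing.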

And it’s not just the control plane. You didn’t lose shared storage. You lost the storage abstraction your platform behavior depended on. vSAN provided a distributed storage fabric with defined behavior around replication, failure domains, snapshot consistency, and policy-based placement. That abstraction encoded a set of assumptions that your entire backup architecture, your recovery procedures, and your performance baselines were built against. In a KVM environment, that abstraction is gone. You’re now operating raw storage—whether that’s Ceph, NFS, or something else entirely—and all the assumptions you made about its behavior need a complete re-evaluation. Suddenly, your backup verification jobs aren’t just silently failing; they’re screaming in your face.
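The snapshot point is worth making concrete. On vSphere, VADP-driven backups got application-consistent (quiesced) snapshots largely for free; on KVM you typically have to request quiescing explicitly via qemu-guest-agent. A sketch that assembles the corresponding `virsh` invocation (`--disk-only`, `--atomic`, and `--quiesce` are real `virsh snapshot-create-as` flags; the wrapper itself is illustrative):

```python
def consistent_snapshot_cmd(domain: str, snap_name: str, quiesce: bool = True) -> list:
    """Build a virsh command for an external, disk-only snapshot.

    --quiesce asks qemu-guest-agent inside the guest to flush filesystems,
    roughly what vSphere quiescing did implicitly. If the agent is not
    installed and running, the snapshot fails outright instead of silently
    producing a merely crash-consistent image.
    """
    cmd = ["virsh", "snapshot-create-as", domain, snap_name,
           "--disk-only", "--atomic"]
    if quiesce:
        cmd.append("--quiesce")
    return cmd
```

Recovery procedures written against vSphere snapshot chains need rewriting too: external qcow2 snapshots leave overlay files that must later be merged back into the base image (for example with `virsh blockcommit`), a step that has no vSphere muscle-memory equivalent.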



Written by
DevTools Feed Editorial Team

Curated insights, explainers, and analysis from the editorial team.


Originally reported by dev.to
