I'm running a scale regression setup on:
=========================================
OCP 4.12.3
OpenShift Virtualization 4.12.3
=========================================
The setup is large-scale: 130 nodes running 6000 VMs, with an external RHCS cluster as storage. During mass VM migration testing, in which I initiated the migration of 2000 VMs, both virt-controllers started crash-looping due to a panic. I am now unable to initiate any actions and so far unable to recover the cluster.
================================================================================
virt-controller-7887c7c647-8v4t4   0/1   CrashLoopBackOff   40 (3m57s ago)   10d
virt-controller-7887c7c647-pnjpq   0/1   CrashLoopBackOff   40 (2m59s ago)   10d
================================================================================
{"component":"virt-controller","level":"info","msg":"Starting disruption budget controller.","pos":"disruptionbudget.go:316","timestamp":"2023-06-04T16:14:40.853832Z"}
{"component":"virt-controller","level":"info","msg":"Starting snapshot controller.","pos":"snapshot_base.go:199","timestamp":"2023-06-04T16:14:40.853820Z"}
{"component":"virt-controller","level":"info","msg":"Starting clone controller","pos":"clone_base.go:149","timestamp":"2023-06-04T16:14:40.853885Z"}
{"component":"virt-controller","level":"info","msg":"Starting vmi controller.","pos":"vmi.go:229","timestamp":"2023-06-04T16:14:40.853842Z"}
{"component":"virt-controller","level":"info","msg":"Starting export controller.","pos":"export.go:290","timestamp":"2023-06-04T16:14:40.854063Z"}
{"component":"virt-controller","level":"info","msg":"TSC Freqency node update status: 0 updated, 129 skipped, 0 errors","pos":"nodetopologyupdater.go:44","timestamp":"2023-06-04T16:14:41.166980Z"}
{"component":"virt-controller","level":"info","msg":"certificate with common name 'virt-controller.openshift-cnv.pod.cluster.local' retrieved.","pos":"cert-manager.go:198","timestamp":"2023-06-04T16:14:43.537128Z"}
{"component":"virt-controller","level":"info","msg":"certificate with common name 'virt-controller.openshift-cnv.pod.cluster.local' retrieved.","pos":"cert-manager.go:198","timestamp":"2023-06-04T16:14:43.537270Z"}
{"component":"virt-controller","level":"info","msg":"certificate with common name 'export.kubevirt.io@1685870363' retrieved.","pos":"cert-manager.go:198","timestamp":"2023-06-04T16:14:43.537273Z"}
{"component":"virt-controller","level":"info","msg":"certificate with common name 'export.kubevirt.io@1685870363' retrieved.","pos":"cert-manager.go:198","timestamp":"2023-06-04T16:14:43.537395Z"}
E0604 16:14:43.755257       1 runtime.go:78] Observed a panic: runtime.boundsError{x:-2, y:0, signed:true, code:0x2} (runtime error: slice bounds out of range [:-2])
goroutine 1279 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1bcac20?, 0xc02b374e10})
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x86
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00096e260?})
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1bcac20, 0xc02b374e10})
	/usr/lib/golang/src/runtime/panic.go:884 +0x212
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).sync(0xc003787880, 0xc003ead4b0, {0xc02b6635e0?, 0x4, 0x4}, {0xc00234e200?, 0x16, 0x20})
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:415 +0x997
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).execute(0xc003787880, {0xc003d12bb0, 0x9})
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:335 +0x176
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).Execute(0xc003787880)
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:296 +0x108
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).runWorker(0xc003333ea0?)
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:286 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005e69c0?, {0x212aa80, 0xc02b981da0}, 0x1, 0xc003cf8b40)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x1f776b8?, 0xc003333f88?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).Run
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:278 +0x275
panic: runtime error: slice bounds out of range [:-2] [recovered]
	panic: runtime error: slice bounds out of range [:-2]

goroutine 1279 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00096e260?})
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd7
panic({0x1bcac20, 0xc02b374e10})
	/usr/lib/golang/src/runtime/panic.go:884 +0x212
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).sync(0xc003787880, 0xc003ead4b0, {0xc02b6635e0?, 0x4, 0x4}, {0xc00234e200?, 0x16, 0x20})
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:415 +0x997
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).execute(0xc003787880, {0xc003d12bb0, 0x9})
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:335 +0x176
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).Execute(0xc003787880)
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:296 +0x108
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).runWorker(0xc003333ea0?)
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:286 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005e69c0?, {0x212aa80, 0xc02b981da0}, 0x1, 0xc003cf8b40)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x1f776b8?, 0xc003333f88?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).Run
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:278 +0x275
================================================================================
logs: http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/virt_controller_panic_during_migration.gz
================================================================================
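For anyone triaging: the boundsError above ({x:-2, y:0, signed:true}) is Go's negative-slice-bound failure. A minimal sketch of that failure class, assuming (purely for illustration; the names below are hypothetical, not the actual code at evacuation.go:415) that sync derives how many migration candidates to start by subtraction and slices with the unchecked result:
================================================================================
package main

import "fmt"

// Hypothetical reduction of the failure class: the number of candidates
// to start is computed by subtraction and used directly as a slice
// bound, with no lower-bound check.
func selectCandidates(candidates []string, maxParallel, running int) []string {
	diff := maxParallel - running // goes negative once running > maxParallel
	return candidates[:diff]      // panics: slice bounds out of range [:-2]
}

func main() {
	candidates := []string{"vmi-a", "vmi-b", "vmi-c", "vmi-d"}
	// Five migrations already in flight against a budget of three
	// reproduces the [:-2] bound seen in the stack trace.
	fmt.Println(selectCandidates(candidates, 3, 5))
}
================================================================================
Under mass migration, a burst of in-flight migrations exceeding the budget would hit such a path on every sync, which would explain why both replicas crash-loop continuously.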
This is fixed by https://github.com/kubevirt/kubevirt/commit/5d1b049a5154c72e1b888da5ca392a9b97858995, which was merged to v58 about a week ago. The fix should be included in the next z-stream release.
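For reference, a hedged sketch of the kind of guard such a fix typically adds (illustrative only; see the linked commit for the actual change):
================================================================================
package main

import "fmt"

// Same hypothetical helper as above, with the lower and upper bound
// clamps that prevent the negative slice index.
func selectCandidates(candidates []string, maxParallel, running int) []string {
	diff := maxParallel - running
	if diff <= 0 {
		return nil // budget exhausted; start nothing this pass
	}
	if diff > len(candidates) {
		diff = len(candidates)
	}
	return candidates[:diff]
}

func main() {
	candidates := []string{"vmi-a", "vmi-b", "vmi-c", "vmi-d"}
	fmt.Println(selectCandidates(candidates, 3, 5)) // []
	fmt.Println(selectCandidates(candidates, 3, 1)) // [vmi-a vmi-b]
}
================================================================================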
@iholder, are you aware of, or can you think of, any procedure to get back to a stable cluster once this bug is hit?
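Not a verified procedure, but since the panic fires while the evacuation controller works through queued migrations, one speculative avenue is to delete the VirtualMachineInstanceMigration objects that have not reached a terminal phase, so the controller has less evacuation work to re-enter after each restart. A sketch using the Kubernetes dynamic client (the terminal phase names and the approach as a whole are assumptions, not a confirmed workaround):
================================================================================
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// GVR of KubeVirt's VirtualMachineInstanceMigration custom resource.
var vmimGVR = schema.GroupVersionResource{
	Group:    "kubevirt.io",
	Version:  "v1",
	Resource: "virtualmachineinstancemigrations",
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// List migration objects cluster-wide and delete those that have
	// not finished, shrinking the queue the evacuation controller
	// re-processes (and panics on) after each restart.
	list, err := dyn.Resource(vmimGVR).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, m := range list.Items {
		phase, _, _ := unstructured.NestedString(m.Object, "status", "phase")
		if phase == "Succeeded" || phase == "Failed" { // assumed terminal phases
			continue
		}
		fmt.Printf("deleting unfinished migration %s/%s (phase %q)\n", m.GetNamespace(), m.GetName(), phase)
		if err := dyn.Resource(vmimGVR).Namespace(m.GetNamespace()).Delete(ctx, m.GetName(), metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete failed:", err)
		}
	}
}
================================================================================
Whether this actually stops the crash loop depends on what state the sync at evacuation.go:415 is reacting to, so treat it as a starting point for experimentation, not a fix.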
*** This bug has been marked as a duplicate of bug 2185068 ***