Bug 2212198

Summary: both virt-controllers are crashing due to panic
Product: Container Native Virtualization (CNV)
Component: Virtualization
Version: 4.12.3
Hardware: x86_64
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Reporter: Boaz <bbenshab>
Assignee: sgott
QA Contact: Kedar Bidarkar <kbidarka>
CC: fdeutsch, iholder
Target Milestone: ---
Target Release: 4.12.4
Last Closed: 2023-06-06 09:42:46 UTC
Type: Bug

Description Boaz 2023-06-04 16:33:14 UTC
I'm running a scale regression setup on:
=========================================
OCP 4.12.3
OpenShift Virtualization 4.12.3

It is a large-scale setup with 130 nodes running 6000 VMs, using an external RHCS cluster as storage.

During mass VM migration testing, in which I initiated the migration of 2000 VMs, both virt-controllers started crashing in a loop due to a panic. I am now unable to initiate any actions and cannot recover the cluster.
================================================================================
virt-controller-7887c7c647-8v4t4                       0/1     CrashLoopBackOff   40 (3m57s ago)   10d
virt-controller-7887c7c647-pnjpq                       0/1     CrashLoopBackOff   40 (2m59s ago)   10d
================================================================================
{"component":"virt-controller","level":"info","msg":"Starting disruption budget controller.","pos":"disruptionbudget.go:316","timestamp":"2023-06-04T16:14:40.853832Z"}
{"component":"virt-controller","level":"info","msg":"Starting snapshot controller.","pos":"snapshot_base.go:199","timestamp":"2023-06-04T16:14:40.853820Z"}
{"component":"virt-controller","level":"info","msg":"Starting clone controller","pos":"clone_base.go:149","timestamp":"2023-06-04T16:14:40.853885Z"}
{"component":"virt-controller","level":"info","msg":"Starting vmi controller.","pos":"vmi.go:229","timestamp":"2023-06-04T16:14:40.853842Z"}
{"component":"virt-controller","level":"info","msg":"Starting export controller.","pos":"export.go:290","timestamp":"2023-06-04T16:14:40.854063Z"}
{"component":"virt-controller","level":"info","msg":"TSC Freqency node update status: 0 updated, 129 skipped, 0 errors","pos":"nodetopologyupdater.go:44","timestamp":"2023-06-04T16:14:41.166980Z"}
{"component":"virt-controller","level":"info","msg":"certificate with common name 'virt-controller.openshift-cnv.pod.cluster.local' retrieved.","pos":"cert-manager.go:198","timestamp":"2023-06-04T16:14:43.537128Z"}
{"component":"virt-controller","level":"info","msg":"certificate with common name 'virt-controller.openshift-cnv.pod.cluster.local' retrieved.","pos":"cert-manager.go:198","timestamp":"2023-06-04T16:14:43.537270Z"}
{"component":"virt-controller","level":"info","msg":"certificate with common name 'export.kubevirt.io@1685870363' retrieved.","pos":"cert-manager.go:198","timestamp":"2023-06-04T16:14:43.537273Z"}
{"component":"virt-controller","level":"info","msg":"certificate with common name 'export.kubevirt.io@1685870363' retrieved.","pos":"cert-manager.go:198","timestamp":"2023-06-04T16:14:43.537395Z"}
E0604 16:14:43.755257       1 runtime.go:78] Observed a panic: runtime.boundsError{x:-2, y:0, signed:true, code:0x2} (runtime error: slice bounds out of range [:-2])
goroutine 1279 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1bcac20?, 0xc02b374e10})
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x86
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00096e260?})
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x75
panic({0x1bcac20, 0xc02b374e10})
	/usr/lib/golang/src/runtime/panic.go:884 +0x212
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).sync(0xc003787880, 0xc003ead4b0, {0xc02b6635e0?, 0x4, 0x4}, {0xc00234e200?, 0x16, 0x20})
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:415 +0x997
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).execute(0xc003787880, {0xc003d12bb0, 0x9})
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:335 +0x176
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).Execute(0xc003787880)
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:296 +0x108
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).runWorker(0xc003333ea0?)
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:286 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005e69c0?, {0x212aa80, 0xc02b981da0}, 0x1, 0xc003cf8b40)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x1f776b8?, 0xc003333f88?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).Run
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:278 +0x275
panic: runtime error: slice bounds out of range [:-2] [recovered]
	panic: runtime error: slice bounds out of range [:-2]

goroutine 1279 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00096e260?})
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0xd7
panic({0x1bcac20, 0xc02b374e10})
	/usr/lib/golang/src/runtime/panic.go:884 +0x212
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).sync(0xc003787880, 0xc003ead4b0, {0xc02b6635e0?, 0x4, 0x4}, {0xc00234e200?, 0x16, 0x20})
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:415 +0x997
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).execute(0xc003787880, {0xc003d12bb0, 0x9})
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:335 +0x176
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).Execute(0xc003787880)
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:296 +0x108
kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).runWorker(0xc003333ea0?)
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:286 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005e69c0?, {0x212aa80, 0xc02b981da0}, 0x1, 0xc003cf8b40)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x1f776b8?, 0xc003333f88?)
	/remote-source/app/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x25
created by kubevirt.io/kubevirt/pkg/virt-controller/watch/drain/evacuation.(*EvacuationController).Run
	/remote-source/app/pkg/virt-controller/watch/drain/evacuation/evacuation.go:278 +0x275
================================================================================
logs:

http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/virt_controller_panic_during_migration.gz
================================================================================
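
For context, the "slice bounds out of range [:-2]" panic above is the Go failure mode where a slice upper bound computed at runtime goes negative. A minimal standalone sketch of that failure class (illustrative only; the names candidates, freeSlots, maxParallel are hypothetical and not the actual evacuation.go code):

package main

import "fmt"

func main() {
	candidates := []string{"vmi-a", "vmi-b"}

	inFlight := 4    // migrations already running
	maxParallel := 2 // configured parallel-migration limit

	freeSlots := maxParallel - inFlight // -2 when more are in flight than allowed

	// This slice expression panics at runtime with:
	//   runtime error: slice bounds out of range [:-2]
	selected := candidates[:freeSlots]
	fmt.Println(selected)
}

Because the panic is raised inside the evacuation controller's worker loop, the process exits and the pod restarts, which matches the CrashLoopBackOff seen on both virt-controller replicas.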

Comment 1 Itamar Holder 2023-06-05 07:04:12 UTC
This is fixed by https://github.com/kubevirt/kubevirt/commit/5d1b049a5154c72e1b888da5ca392a9b97858995, which was merged to v58 about a week ago.
It should be fixed in the next z-stream release.
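
For anyone patching locally while waiting for the z-stream, the usual guard for this class of bug is to clamp the computed count to a non-negative value before slicing. A hedged sketch of that pattern (assumed shape of the fix only; selectCandidates and its arguments are hypothetical names, see the linked commit for the real change):

// Clamp the number of items to take so the slice bound can never go negative.
func selectCandidates(candidates []string, maxParallel, inFlight int) []string {
	freeSlots := maxParallel - inFlight
	if freeSlots <= 0 {
		return nil // no capacity left; schedule nothing this cycle
	}
	if freeSlots > len(candidates) {
		freeSlots = len(candidates)
	}
	return candidates[:freeSlots]
}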

Comment 2 Fabian Deutsch 2023-06-05 08:34:39 UTC
@iholder are you aware of, or can you think of, any procedure to get back to a stable cluster once this bug is hit?

Comment 3 Kedar Bidarkar 2023-06-06 09:42:46 UTC

*** This bug has been marked as a duplicate of bug 2185068 ***