Description of problem: When a VM has `runStrategy: Always`, a failing VM can get into a crash loop in which VMI pods are scheduled, fail, and are immediately recreated by the VM controller in a hot loop. If a VMI's pod never reaches phase: Running, the VM controller should back off on recreating that VMI so that it does not add load to a cluster that is likely already unhealthy. This is similar in concept to the crash loop backoff that occurs at the Pod level. We need to implement this backoff in virt-controller's handling of VMs that are unable to start successfully.
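For illustration, the requested behavior can be sketched as a capped exponential backoff. This is a minimal sketch only: the function name `backoff_delay`, the 10 second base, and the 300 second cap are assumptions chosen for the example and are not taken from the KubeVirt implementation.

```python
def backoff_delay(consecutive_failures: int,
                  base_seconds: float = 10.0,
                  max_seconds: float = 300.0) -> float:
    """Return how long to wait before recreating a VMI.

    Hypothetical helper: base_seconds and max_seconds are illustrative
    values, not the ones used by virt-controller.
    """
    if consecutive_failures <= 0:
        # No failed start yet, so recreate immediately.
        return 0.0
    delay = base_seconds
    # Double the delay once per additional failure, stopping at the cap.
    for _ in range(consecutive_failures - 1):
        delay *= 2
        if delay >= max_seconds:
            return max_seconds
    return min(delay, max_seconds)


# Delay grows 10s, 20s, 40s, ... and then stays at the cap.
for failures in range(1, 8):
    print(failures, backoff_delay(failures))
```

With these assumed values the wait doubles after each failed start and levels off at the cap, so a VM that never reaches Running is retried at a bounded, low rate instead of in a hot loop.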
A PR has been posted to the kubevirt main branch:
https://github.com/kubevirt/kubevirt/pull/5905
This PR is more invasive than anticipated, so we need to take a critical look at where we backport it.
Per comment #1, retargeting this BZ to the next release. The complexity of the fix lowers the ROI of backporting this refinement.
]$ oc describe vm vm2-rhel84-ocs
Events:
  Type    Reason                      Age                    From                       Message
  ----    ------                      ----                   ----                       -------
  Normal  SuccessfulDataVolumeCreate  145m                   virtualmachine-controller  Created DataVolume rhel84-ocs-dv2
  Normal  SuccessfulDelete            142m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance 6bee3915-9f54-47d9-a644-1e761ababb43
  Normal  SuccessfulDelete            142m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance b6776b2c-e12d-422b-b0da-181e98416666
  Normal  SuccessfulDelete            141m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance 16025a9e-da9a-4778-8a93-e627b25b3252
  Normal  SuccessfulDelete            138m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance b9b765c4-e956-4ca0-b109-7fcc7333fa91
  Normal  SuccessfulDelete            135m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance ae61e8dc-fd7a-45d1-8812-5bfdde075fc2
  Normal  SuccessfulDelete            130m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance fb565920-008d-428f-b7e7-bb4761ec8f05
  Normal  SuccessfulDelete            124m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance ef761285-8daf-4302-b9d1-93e81a48d059
  Normal  SuccessfulDelete            119m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance 68a526e2-869d-4b0c-b1d0-b02ee065fc84
  Normal  SuccessfulDelete            114m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance ce61d066-26d8-4c43-8b4d-e8f1ca63c3d2
  Normal  SuccessfulDelete            20m (x18 over 109m)    virtualmachine-controller  (combined from similar events): Stopped the virtual machine by deleting the virtual machine instance a0677d50-1132-4aa7-a1a5-929594f2cba0
  Normal  SuccessfulCreate            4m48s (x30 over 142m)  virtualmachine-controller  Started the virtual machine by creating the new virtual machine instance vm2-rhel84-ocs

[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Run Strategy"
Run Strategy: Always
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Printable Status"
Printable Status: Starting
[kbidarka@localhost secureboot]$ oc get vm
NAME             AGE    STATUS             READY
vm2-rhel84-ocs   146m   CrashLoopBackOff   False
[kbidarka@localhost secureboot]$ virtctl stop vm2-rhel84-ocs
VM vm2-rhel84-ocs was scheduled to stop
[kbidarka@localhost secureboot]$ oc get vm
NAME             AGE    STATUS    READY
vm2-rhel84-ocs   146m   Stopped   False
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Run Strategy"
Run Strategy: Halted
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Printable Status"
Printable Status: Stopped
[kbidarka@localhost secureboot]$ oc get vm
NAME             AGE    STATUS    READY
vm2-rhel84-ocs   147m   Stopped   False

---

Calling 'virtctl stop vm2-rhel84-ocs' stopped this VM successfully with runStrategy set to 'Always' or 'RerunOnFailure', even when no active VMI was present.
CrashLoop detection and exponential backoff seem to work fine.

VERIFIED with 'virt-operator-container-v4.9.0-35'
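As a rough cross-check of the backoff, the gaps between consecutive deletions can be computed from the Age column of the nine individually listed SuccessfulDelete events in the `oc describe vm` output above. This is only a sketch: the ages are copied by hand from that output and are rounded to whole minutes, so the gaps are approximate.

```python
# Ages (in minutes) of the nine individually listed SuccessfulDelete
# events, oldest first, copied from the event list.
ages_minutes = [142, 142, 141, 138, 135, 130, 124, 119, 114]

# Gap between each deletion and the next one, in minutes.
gaps = [older - newer for older, newer in zip(ages_minutes, ages_minutes[1:])]

print(gaps)  # [0, 1, 3, 3, 5, 6, 5, 5]
```

The gaps start near zero and settle at roughly five to six minutes, which is what one would expect if the retry interval grows and then stops growing, although minute-level rounding prevents any more precise conclusion.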
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Run Strategy"
Run Strategy: RerunOnFailure
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Printable Status"
Printable Status: CrashLoopBackOff
[kbidarka@localhost secureboot]$ oc get vm
NAME             AGE    STATUS             READY
vm2-rhel84-ocs   158m   CrashLoopBackOff   False
[kbidarka@localhost secureboot]$ virtctl stop vm2-rhel84-ocs
VM vm2-rhel84-ocs was scheduled to stop
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Run Strategy"
Run Strategy: Halted
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Printable Status"
Printable Status: Stopped
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4104
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days