1973852 – Introduce VM crashloop backoff

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Virtualization
Sub Component:
Version:	2.6.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.9.0
Assignee:	David Vossel
QA Contact:	Kedar Bidarkar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-06-18 20:51 UTC by David Vossel
Modified:	2024-10-01 18:45 UTC
CC List:	5 users
Fixed In Version:	virt-operator-container-v4.9.0-35 hco-bundle-registry-container-v4.9.0-155
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-11-02 15:59:33 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	CNV-12578	0	None	None	None	2024-10-01 18:45:55 UTC
Red Hat Product Errata	RHSA-2021:4104	0	None	None	None	2021-11-02 15:59:53 UTC

Description David Vossel 2021-06-18 20:51:03 UTC

Description of problem:

When a VM has `runStrategy: Always`, a failing VM can enter a crash loop: VMI pods are scheduled, fail, and the VM controller hot-loops on recreating them.

If a VMI's pod never reaches phase: Running, the VM controller should back off on recreating that VMI, so that it does not add load to a cluster that is likely already unhealthy.

This is similar in concept to the crashloop backoff that occurs at the Pod level. We need to perform backoff in the virt-controller's handling of VMs that are unable to successfully start.
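
As a rough illustration of the requested behavior, the following Python sketch shows one way a controller could track consecutive start failures and delay recreation with a capped exponential backoff. It is not the virt-controller implementation: the names `backoff_delay` and `StartFailureTracker`, the 10-second base delay, and the 300-second cap are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical tuning values, chosen for illustration only.
BASE_DELAY_SECONDS = 10
MAX_DELAY_SECONDS = 300


def backoff_delay(consecutive_failures: int,
                  base: int = BASE_DELAY_SECONDS,
                  cap: int = MAX_DELAY_SECONDS) -> int:
    """Return how many seconds to wait before recreating the VMI.

    The delay doubles with every consecutive failure and is capped.
    """
    if consecutive_failures <= 0:
        return 0
    return min(cap, base * 2 ** (consecutive_failures - 1))


@dataclass
class StartFailureTracker:
    """Per-VM bookkeeping (hypothetical) for start failures."""
    consecutive_failures: int = 0
    retry_after: float = 0.0  # timestamp before which no recreation happens

    def record_failure(self, now: float) -> None:
        # Called when a VMI fails without ever reaching phase Running.
        self.consecutive_failures += 1
        self.retry_after = now + backoff_delay(self.consecutive_failures)

    def record_running(self) -> None:
        # A successful start resets the backoff entirely.
        self.consecutive_failures = 0
        self.retry_after = 0.0

    def may_recreate(self, now: float) -> bool:
        return now >= self.retry_after
```

With these assumed values the delays would be 10, 20, 40, 80, 160, and then 300 seconds for every further failure, and a single successful start clears the counter.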

Comment 1 David Vossel 2021-06-22 16:31:09 UTC

A PR to kubevirt main branch has been posted. https://github.com/kubevirt/kubevirt/pull/5905

This PR is more invasive than what was anticipated. We need to take a critical look at where we backport this.

Comment 2 sgott 2021-06-23 18:28:45 UTC

Per comment #1, retargeting this BZ to the next release. The complexity of the fix lowers the ROI of backporting this refinement.

Comment 8 Kedar Bidarkar 2021-09-01 14:11:48 UTC

$ oc describe vm vm2-rhel84-ocs

Events:
  Type    Reason                      Age                    From                       Message
  ----    ------                      ----                   ----                       -------
  Normal  SuccessfulDataVolumeCreate  145m                   virtualmachine-controller  Created DataVolume rhel84-ocs-dv2
  Normal  SuccessfulDelete            142m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance 6bee3915-9f54-47d9-a644-1e761ababb43
  Normal  SuccessfulDelete            142m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance b6776b2c-e12d-422b-b0da-181e98416666
  Normal  SuccessfulDelete            141m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance 16025a9e-da9a-4778-8a93-e627b25b3252
  Normal  SuccessfulDelete            138m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance b9b765c4-e956-4ca0-b109-7fcc7333fa91
  Normal  SuccessfulDelete            135m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance ae61e8dc-fd7a-45d1-8812-5bfdde075fc2
  Normal  SuccessfulDelete            130m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance fb565920-008d-428f-b7e7-bb4761ec8f05
  Normal  SuccessfulDelete            124m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance ef761285-8daf-4302-b9d1-93e81a48d059
  Normal  SuccessfulDelete            119m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance 68a526e2-869d-4b0c-b1d0-b02ee065fc84
  Normal  SuccessfulDelete            114m                   virtualmachine-controller  Stopped the virtual machine by deleting the virtual machine instance ce61d066-26d8-4c43-8b4d-e8f1ca63c3d2
  Normal  SuccessfulDelete            20m (x18 over 109m)    virtualmachine-controller  (combined from similar events): Stopped the virtual machine by deleting the virtual machine instance a0677d50-1132-4aa7-a1a5-929594f2cba0
  Normal  SuccessfulCreate            4m48s (x30 over 142m)  virtualmachine-controller  Started the virtual machine by creating the new virtual machine instance vm2-rhel84-ocs

[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Run Strategy"
  Run Strategy:  Always
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Printable Status"
  Printable Status:        Starting


[kbidarka@localhost secureboot]$ oc get vm 
NAME             AGE    STATUS             READY
vm2-rhel84-ocs   146m   CrashLoopBackOff   False

[kbidarka@localhost secureboot]$ virtctl stop vm2-rhel84-ocs
VM vm2-rhel84-ocs was scheduled to stop
[kbidarka@localhost secureboot]$ oc get vm 
NAME             AGE    STATUS    READY
vm2-rhel84-ocs   146m   Stopped   False

[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Run Strategy"
  Run Strategy:  Halted
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Printable Status"
  Printable Status:        Stopped

[kbidarka@localhost secureboot]$ oc get vm 
NAME             AGE    STATUS    READY
vm2-rhel84-ocs   147m   Stopped   False


---

Calling 'virtctl stop vm2-rhel84-ocs' stopped this VM successfully with runStrategy set to 'Always' or 'RerunOnFailure', even when no active VMI was present.

Crash loop detection and exponential backoff appear to work correctly.
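
One way to sanity-check the backoff from the `oc describe` output above is to look at the spacing between the individually listed SuccessfulDelete events. The short Python sketch below uses the ages copied from that output; the helper name `gaps` is made up for this example, and because the ages are rounded to whole minutes the intervals are only approximate.

```python
# Ages in minutes of the individually listed SuccessfulDelete events,
# oldest first, copied from the Events table above.
delete_ages_min = [142, 142, 141, 138, 135, 130, 124, 119, 114]


def gaps(ages):
    """Minutes elapsed between each event and the next one."""
    return [older - newer for older, newer in zip(ages, ages[1:])]


print(gaps(delete_ages_min))  # [0, 1, 3, 3, 5, 6, 5, 5]
```

The intervals grow from under a minute to roughly five minutes and then stay there, which is consistent with a backoff that increases and then levels off.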

VERIFIED: with 'virt-operator-container-v4.9.0-35'

Comment 9 Kedar Bidarkar 2021-09-01 14:16:03 UTC

[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Run Strategy"
  Run Strategy:  RerunOnFailure
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Printable Status"
  Printable Status:        CrashLoopBackOff

[kbidarka@localhost secureboot]$ oc get vm 
NAME             AGE    STATUS             READY
vm2-rhel84-ocs   158m   CrashLoopBackOff   False

[kbidarka@localhost secureboot]$ virtctl stop vm2-rhel84-ocs
VM vm2-rhel84-ocs was scheduled to stop

[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Run Strategy"
  Run Strategy:  Halted
[kbidarka@localhost secureboot]$ oc describe vm vm2-rhel84-ocs | grep "Printable Status"
  Printable Status:        Stopped

Comment 13 errata-xmlrpc 2021-11-02 15:59:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104

Comment 14 Red Hat Bugzilla 2023-09-15 01:10:10 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
