Bug 1867904 - Exec probes on VMI pods have various issues, let's remove them
Summary: Exec probes on VMI pods have various issues, let's remove them
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.4.1
Assignee: sgott
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-08-11 07:42 UTC by Roman Mohr
Modified: 2023-12-15 18:51 UTC (History)

Fixed In Version: virt-operator-container-v2.4.1-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-03 20:31:08 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt kubevirt pull 3946 | 0 | None | closed | Remove exec readiness probes in kubevirt | 2021-01-31 10:47:28 UTC
Github | kubevirt kubevirt pull 3971 | 0 | None | closed | [release-0.30] Remove exec readiness probes in kubevirt | 2021-01-31 10:47:30 UTC
Red Hat Bugzilla | 1855067 | 0 | urgent | CLOSED | nproc * 1/2 MiB of each container memory may be taken by RHEL8 kernel slabs | 2023-12-15 18:25:13 UTC
Red Hat Issue Tracker | CNV-36418 | 0 | None | None | None | 2023-12-15 18:51:20 UTC
Red Hat Product Errata | RHBA-2020:3629 | 0 | None | None | None | 2020-09-03 20:31:20 UTC

Description Roman Mohr 2020-08-11 07:42:24 UTC
Description of problem:


We use exec probes when we launch VMIs to know when virt-launcher is fully started.


We ran into the following issues with the exec probes:

https://bugzilla.redhat.com/show_bug.cgi?id=1817057
https://bugzilla.redhat.com/show_bug.cgi?id=1855067
https://bugzilla.redhat.com/show_bug.cgi?id=1848524
https://bugzilla.redhat.com/show_bug.cgi?id=1850168

Since removing the exec probes benefits kubevirt overall, independent of seeing them fixed in OCP, I am proposing https://github.com/kubevirt/kubevirt/pull/3971; the change is already merged on master.

This has several advantages: we no longer rely on exec probes (at the moment), it removes the exec probe warning events, which we can't fully eliminate with the generic readiness mechanisms, and it speeds up VMI start.

I see this as the proper solution because, as one can see in https://bugzilla.redhat.com/show_bug.cgi?id=1855067, making exec probes work nicely with pod limits may even require kernel setting changes.
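
As a rough illustration of the scale involved, the title of bug 1855067 ("nproc * 1/2 MiB of each container memory may be taken by RHEL8 kernel slabs") can be turned into a back-of-the-envelope estimate. The helper name slab_overhead_kib is hypothetical, and the figure is only the upper bound implied by that title, not a measurement.

```shell
# Hypothetical helper: estimate the per-container slab overhead in KiB,
# assuming the "nproc * 1/2 MiB" bound from the title of bug 1855067.
# Argument: number of CPUs on the node.
slab_overhead_kib() {
    cpus="$1"
    # 1/2 MiB = 512 KiB per CPU
    echo $(( cpus * 512 ))
}

# Example: a 64-CPU node
slab_overhead_kib 64   # prints 32768 (KiB), i.e. 32 MiB
```

On a node with many CPUs, an estimate of this size could be significant next to a small container memory limit, which may be why kernel-level changes are discussed in that bug.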



Comment 2 Roman Mohr 2020-08-12 09:08:10 UTC
Merged and part of kubevirt 0.30.6 which is the base for CNV 2.4.1.

Comment 6 Kedar Bidarkar 2020-08-17 19:17:00 UTC
As seen below, the virt-launcher pods no longer have a readinessProbe (grep returns no matches):

[kbidarka@kbidarka-host osdc]$ oc get pod virt-launcher-vmi-fedora32-resource-v2d6p -o yaml | grep readinessProbe
[kbidarka@kbidarka-host osdc]$ oc get pod virt-launcher-vmi-fedora32-resource2-77vzg -o yaml | grep readinessProbe
[kbidarka@kbidarka-host osdc]$ oc get pod virt-launcher-vm-rhel81-mb2q5 -o yaml | grep readinessProbe
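
Since grep prints nothing when there is no match, the negative result above is easy to overlook. A minimal sketch of a more explicit check follows; the helper name probe_status is hypothetical, and it only looks for the literal key readinessProbe: in whatever YAML it receives on stdin.

```shell
# Hypothetical helper: read a pod manifest (YAML) on stdin and report
# whether any line contains a readinessProbe key.
probe_status() {
    if grep -q 'readinessProbe:'; then
        echo "present"
    else
        echo "absent"
    fi
}

# Usage against a live cluster (pod name taken from the commands above):
#   oc get pod virt-launcher-vm-rhel81-mb2q5 -o yaml | probe_status
```

Note that empty input also yields "absent", so a failed oc command would look the same as a pod without a probe; the exit status of oc would have to be checked separately.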

Comment 10 Kedar Bidarkar 2020-08-24 12:11:44 UTC
Summary:
1) The VM and pods have been running successfully for 5+ days, as seen above.
2) Memory consumption for volumecontainerdisk is 4Mi, as reported by the kubectl top command.
3) The cDisk and DV based VMIs are accessible and stable even after running for 5+ days.
4) We no longer have the readinessProbe in the VMI pod.

Comment 14 errata-xmlrpc 2020-09-03 20:31:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 2.4.1 images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3629

Comment 15 Chris Paquin 2020-09-10 15:40:27 UTC
Note that customer 0 plans to migrate to OCP 4.5/CNV 2.4.1, which contains the fix for this issue. The specific workload that is affected is only being tested on OCP 4.4. Not sure that we need to try to fix this in 4.4 at this point.

