Bug 2006201 - ovirt-csi-driver-node pods are crashing intermittently
Summary: ovirt-csi-driver-node pods are crashing intermittently
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Janos Bonic
QA Contact: Michael Burman
URL:
Whiteboard:
Depends On:
Blocks: 2056479
 
Reported: 2021-09-21 07:16 UTC by Aditya Deshpande
Modified: 2022-08-10 10:38 UTC
10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:37:47 UTC
Target Upstream Version:
Embargoed:
bzlotnik: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift ovirt-csi-driver-operator pull 86 0 None open Bug 2006201: Increase timeouts for CSI driver 2022-02-17 13:16:49 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:38:04 UTC

Description Aditya Deshpande 2021-09-21 07:16:33 UTC
Description of problem:
Health-check failure messages appear several times a day in the csi-liveness-probe container of the ovirt-csi-driver-node pods (managed by ovirt-csi-driver-operator).
~~~
2021-09-02T11:29:00.220309913Z E0902 11:29:00.220161       1 main.go:55] health check failed: rpc error: code = Canceled desc = context canceled
~~~

On the node where the pod is scheduled, the following kubelet messages are seen:
~~~
Sep 02 11:29:00 ocjc4-xxx-worker-0-xxx hyperkube[2116]: I0902 11:29:00.220018    2116 prober.go:117] Liveness probe for "ovirt-csi-driver-node-xxx_openshift-cluster-csi-drivers(d2d919fd-xxxx-4949-a813-9d2e51087f2e):csi-driver" failed (failure): Get "http://<POD_IP>:10300/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 02 11:29:00 ocjc4-xxx-worker-0-pro-xxx hyperkube[2116]: I0902 11:29:00.220209    2116 event.go:291] "Event occurred" object="openshift-cluster-csi-drivers/ovirt-csi-driver-node-xxx" kind="Pod" apiVersion="v1" type="Warning" reason="Unhealthy" message="Liveness probe failed: Get \"http://<POD_IP>:10300/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
~~~
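
For context, the probe chain is: the kubelet issues an HTTP GET to the csi-liveness-probe sidecar's /healthz on port 10300, and the sidecar in turn checks the CSI driver, which pings the oVirt Engine. Below is a minimal, self-contained Go sketch (not the actual kubelet or sidecar code; server, names, and timeouts are illustrative) of why a slow backend surfaces as the "context deadline exceeded" error shown above:

~~~
// Minimal sketch, not actual kubelet/sidecar code: the prober gives the GET
// a fixed deadline; if /healthz (which ultimately waits on the oVirt Engine)
// does not answer in time, the probe fails with "context deadline exceeded".
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Stand-in for the csi-liveness-probe /healthz endpoint; the sleep
	// simulates the sidecar waiting on a slow oVirt Engine ping.
	healthz := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(3 * time.Second)
		w.WriteHeader(http.StatusOK)
	}))
	defer healthz.Close()

	// Stand-in for the kubelet prober with a short client timeout.
	prober := &http.Client{Timeout: 1 * time.Second}
	_, err := prober.Get(healthz.URL + "/healthz")
	fmt.Println("probe result:", err)
	// prints: ... context deadline exceeded (Client.Timeout exceeded while awaiting headers)
}
~~~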

Version-Release number of selected component (if applicable):
OCP 4.6.40

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
ovirt-csi-driver-node pods fail/crash several times a day on multiple nodes.

Expected results:
ovirt-csi-driver-node pods should run without liveness probe failures.


Additional info:
Attaching other logs and must-gather.

Comment 7 Aditya Deshpande 2021-10-21 18:19:08 UTC
We offered the option of increasing the timeout values for the probes. Currently, no alerts are being generated and the CSI pods are not crashing. This was monitored for a whole month, and the support case was closed today.

So, I am closing this bugzilla as well.

Comment 8 Robert Sandu 2022-02-04 16:27:34 UTC
Hi Janos

As per our discussion, I'm reopening this BZ to pursue a better, more permanent solution to the issue it tracks.

Comment 9 Janos Bonic 2022-02-08 15:27:31 UTC
To record my observations: the probes only call the ping function on the oVirt client, which sends a simple HTTP request to the Engine. If this call is slow, it is because the Engine itself is responding slowly. This needs to be checked with someone from RHV who can provide tuning options so the RHV Engine can respond to 100+ clients simultaneously. @rsandu, can you loop someone from RHV in here, please?

Comment 10 Robert Sandu 2022-02-14 09:04:31 UTC
Hi Janos.

If we're confident this issue concerns RHV Engine slowness, I'd suggest moving this BZ to the RHV team and moving forward from there.

For now, I should note that the customer is rather hesitant to apply the earlier-suggested workaround (increasing the probe timeouts) in their biggest production environment (where this issue was spotted), as this would require placing the storage operator in an Unmanaged state, which would render their environment "unsupported".

In the above context, the customer's expectation is a more permanent, fully supported fix, whether in the context of the oVirt CSI driver or performance tuning of the RHV Engine. If the latter is true, the expectation is also for this to be part of our oVirt CSI perf & scale recommendations on docs.openshift.com (or, alternatively, as a KCS).

For now, I'd suggest moving this BZ to the appropriate RHV Engine team.

Comment 11 Janos Bonic 2022-02-14 09:59:05 UTC
I wouldn't call it "confident", as we haven't tested this scenario in a lab setup, nor have I been able to inspect the customer environment. However, I would recommend temporarily moving this BZ to the RHV Engine team and asking whether scale testing at this magnitude has ever been done. If not, I can assist with a client that can perform this kind of scale testing. If that avenue doesn't pan out, we need to investigate further.

As mentioned before, the only thing the CSI driver does for health checks is send pings (a GET request to /ovirt-engine/api from every node). If this request hangs, the health check fails. If the request fails immediately, the CSI driver tries to open a new connection and subsequently fails.
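
As a hedged illustration of that ping (function and host names below are made up for the example, not the driver's actual code), the shape of the call is a GET to /ovirt-engine/api bounded by a context deadline; if the Engine is slow, the deadline expires and the health check fails:

~~~
// Hedged sketch of the ping described above; names are illustrative.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// pingEngine issues a single GET to the Engine API root. If the Engine is
// slow to answer, the context deadline expires and the health check fails.
func pingEngine(ctx context.Context, engineURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, engineURL+"/ovirt-engine/api", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return fmt.Errorf("engine returned %s", resp.Status)
	}
	return nil
}

func main() {
	// "engine.example.com" is a placeholder Engine FQDN.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := pingEngine(ctx, "https://engine.example.com"); err != nil {
		fmt.Println("health check failed:", err)
	}
}
~~~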

Comment 12 Janos Bonic 2022-02-17 12:49:21 UTC
Decision from bug triage:

- Increase initialDelaySeconds to 15 seconds
- Increase timeoutSeconds to 15 seconds
- Increase periodSeconds to 300 seconds
- Backport this change to 4.8.

Comment 15 Michal Skrivanek 2022-02-18 07:25:22 UTC
the final change was to:

initialDelaySeconds: 30
timeoutSeconds: 30
periodSeconds: 180
failureThreshold: 2
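
For illustration only (this is not the operator's actual manifest or code), the final values above expressed as a Kubernetes core/v1 liveness probe in Go; the /healthz path and port 10300 are taken from the kubelet log in the description, and the ProbeHandler field name assumes k8s.io/api v0.23 or newer:

~~~
// Illustrative only; expresses the final probe values with k8s.io/api types.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	probe := corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz",
				Port: intstr.FromInt(10300),
			},
		},
		InitialDelaySeconds: 30,
		TimeoutSeconds:      30,
		PeriodSeconds:       180,
		FailureThreshold:    2,
	}
	fmt.Printf("%+v\n", probe)
}
~~~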

Comment 16 Michael Burman 2022-02-21 15:00:44 UTC
Verified on - 4.11.0-0.nightly-2022-02-18-121223

No regression found in the ovirt-csi-driver suite.

Comment 18 errata-xmlrpc 2022-08-10 10:37:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

