2006201 – ovirt-csi-driver-node pods are crashing intermittently

Summary: ovirt-csi-driver-node pods are crashing intermittently

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.6.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Janos Bonic
QA Contact:	Michael Burman
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2056479

Reported:	2021-09-21 07:16 UTC by Aditya Deshpande
Modified:	2022-08-10 10:38 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-10 10:37:47 UTC
Target Upstream Version:
Embargoed:
Flags:	bzlotnik: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift ovirt-csi-driver-operator pull 86	0	None	open	Bug 2006201: Increase timeouts for CSI driver	2022-02-17 13:16:49 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 10:38:04 UTC

Description Aditya Deshpande 2021-09-21 07:16:33 UTC

Description of problem:
Health check failure messages appear several times a day in the csi-liveness-probe container of the ovirt-csi-driver-node pod, which is related to ovirt-csi-driver-operator.
~~~
2021-09-02T11:29:00.220309913Z E0902 11:29:00.220161       1 main.go:55] health check failed: rpc error: code = Canceled desc = context canceled
~~~

On the node where the pod is scheduled, we see:
~~~
Sep 02 11:29:00 ocjc4-xxx-worker-0-xxx hyperkube[2116]: I0902 11:29:00.220018    2116 prober.go:117] Liveness probe for "ovirt-csi-driver-node-xxx_openshift-cluster-csi-drivers(d2d919fd-xxxx-4949-a813-9d2e51087f2e):csi-driver" failed (failure): Get "http://<POD_IP>:10300/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 02 11:29:00 ocjc4-xxx-worker-0-pro-xxx hyperkube[2116]: I0902 11:29:00.220209    2116 event.go:291] "Event occurred" object="openshift-cluster-csi-drivers/ovirt-csi-driver-node-xxx" kind="Pod" apiVersion="v1" type="Warning" reason="Unhealthy" message="Liveness probe failed: Get \"http://<POD_IP>:10300/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
~~~

Version-Release number of selected component (if applicable):
OCP 4.6.40

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
 ovirt-csi-driver-node pods fail or crash several times a day on multiple nodes.

Expected results:
 ovirt-csi-driver-node pods should keep running without crashing.


Additional info:
Attaching other logs and the must-gather.

Comment 7 Aditya Deshpande 2021-10-21 18:19:08 UTC

We offered the customer the option of increasing the timeout value for the probes. However, the CSI pods are currently neither generating alerts nor crashing. This was monitored for a whole month, and the support case was closed today.

So, I am closing this bugzilla as well.

Comment 8 Robert Sandu 2022-02-04 16:27:34 UTC

Hi Janos

As per our discussion, I'm reopening this one to further pursue a better, more permanent solution concerning the issue tracked in this BZ.

Comment 9 Janos Bonic 2022-02-08 15:27:31 UTC

To record my observations: the probes only call the ping function on the oVirt client, which sends a simple HTTP request to the Engine. If this call is slow, it is because the Engine itself is responding slowly. This needs to be checked with someone from RHV who can provide tuning options that let the RHV Engine respond to 100+ clients simultaneously. @rsandu can you loop someone from RHV in here please?
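To illustrate the mechanism described above, here is a minimal Go sketch of a ping with a deadline. It is not the oVirt client's actual code: `ping`, `newEngine`, and `demo` are hypothetical names, and a local `httptest` server stands in for the Engine. It only shows how a slow response turns into a failed health check once the deadline passes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// ping sends one GET request to url and fails if no response arrives within
// timeout. Hypothetical helper; not the oVirt client's real ping function.
func ping(url string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

// newEngine starts a local stand-in for the Engine (an assumption for this
// sketch) that waits for delay before answering.
func newEngine(delay time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay):
			w.WriteHeader(http.StatusOK)
		case <-r.Context().Done():
			// The client gave up; stop waiting.
		}
	}))
}

// demo pings a fast and a slow stand-in Engine with the same deadline.
func demo() (fastErr error, slowTimedOut bool) {
	fast := newEngine(0)
	defer fast.Close()
	slow := newEngine(500 * time.Millisecond)
	defer slow.Close()

	fastErr = ping(fast.URL, 100*time.Millisecond)
	slowErr := ping(slow.URL, 100*time.Millisecond)
	return fastErr, errors.Is(slowErr, context.DeadlineExceeded)
}

func main() {
	fastErr, slowTimedOut := demo()
	fmt.Println("fast engine error:", fastErr)
	fmt.Println("slow engine hit the deadline:", slowTimedOut)
}
```

The delays and the deadline here are arbitrary values chosen to keep the sketch quick to run; they are not the driver's settings.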

Comment 10 Robert Sandu 2022-02-14 09:04:31 UTC

Hi Janos.

If we're confident this issue concerns RHV Engine slowness, I'd suggest moving this BZ to the RHV team and moving forward from there.

For now, I should note that the customer is rather hesitant to apply the earlier suggested (probe-timeout increase) workaround in their biggest production environment (where this issue was spotted), as this would require them to place the storage operator in an Unmanaged state (which would render their environment as "unsupported"). 

In the above context, the customer's expectation is to have a more permanent, fully supported fix, whether it's in the context of the ovirt CSI driver or performance tuning of the RHV Engine. If the latter is true, the expectation is also for this to be part of our ovirt CSI perf & scale recommendations in docs.openshift.com (or, alternatively, as a KCS).

For now, I'd suggest moving this BZ to the proper RHV engine team.

Comment 11 Janos Bonic 2022-02-14 09:59:05 UTC

I wouldn't call it "confident" as we haven't tested this scenario in a lab setup, nor have I been able to inspect the customer environment. However, I would recommend temporarily moving this BZ to the RHV engine and asking if scale testing of this magnitude has ever been done. If not, I can assist with a client that can perform this kind of scale testing. If that avenue doesn't pan out, we need to investigate further.

As mentioned before, the only thing the CSI driver does for health checks is send pings (GET request to /ovirt-engine/api from every node). If this request hangs, the health check fails. If the request fails immediately, the CSI driver will try to open a new connection and subsequently fail.
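As a rough idea of what such a scale-testing client could look like, the Go sketch below fires many simultaneous GET requests and counts how many fail or miss the timeout. Everything in it is an assumption made for illustration: `runScaleTest`, `newSerialEngine`, and `demo` are hypothetical names, and the stand-in server deliberately handles one request at a time to make queueing visible. It says nothing about how the real RHV Engine behaves under load.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// runScaleTest fires clients simultaneous GET requests at url and returns how
// many of them failed or did not complete within timeout.
func runScaleTest(url string, clients int, timeout time.Duration) int {
	httpClient := &http.Client{Timeout: timeout}
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		failures int
	)
	for i := 0; i < clients; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := httpClient.Get(url)
			if err == nil {
				resp.Body.Close()
				return
			}
			mu.Lock()
			failures++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return failures
}

// newSerialEngine starts a stand-in server that processes one request at a
// time, each taking perRequest. The serialization is an assumption of this
// sketch, used only to make simultaneous clients queue up.
func newSerialEngine(perRequest time.Duration) *httptest.Server {
	var busy sync.Mutex
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		busy.Lock()
		defer busy.Unlock()
		time.Sleep(perRequest)
		w.WriteHeader(http.StatusOK)
	}))
}

// demo compares a single client with 100 simultaneous clients.
func demo() (failuresWithOne, failuresWithMany int) {
	engine := newSerialEngine(5 * time.Millisecond)
	defer engine.Close()

	failuresWithOne = runScaleTest(engine.URL, 1, 100*time.Millisecond)
	failuresWithMany = runScaleTest(engine.URL, 100, 100*time.Millisecond)
	return failuresWithOne, failuresWithMany
}

func main() {
	one, many := demo()
	fmt.Println("failures with 1 client:", one)
	fmt.Println("failures with 100 clients:", many)
}
```

Against a real Engine, the URL would point at the API endpoint being probed and the client would need the environment's authentication and TLS settings, which this sketch leaves out.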

Comment 12 Janos Bonic 2022-02-17 12:49:21 UTC

Decision from bug triage:

- Increase initialDelaySeconds to 15 seconds
- Increase timeoutSeconds to 15 seconds
- Increase periodSeconds to 300 seconds
- Backport this change to 4.8.

Comment 15 Michal Skrivanek 2022-02-18 07:25:22 UTC

The final change was to:

initialDelaySeconds: 30
timeoutSeconds: 30
periodSeconds: 180
failureThreshold: 2
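To make the effect of these values concrete, here is a small Go sketch of the arithmetic. It is an estimate under stated assumptions, not kubelet's exact behaviour: it assumes every failing probe runs into its full timeout, that each probe triggers one ping to the Engine per node, and it uses 100 nodes as an example figure based on the "100+ clients" mentioned earlier in this bug. The type and function names are hypothetical.

```go
package main

import "fmt"

// probeSettings mirrors the Kubernetes probe fields named in this bug.
type probeSettings struct {
	initialDelaySeconds int
	timeoutSeconds      int
	periodSeconds       int
	failureThreshold    int
}

// secondsUntilRestartDecision estimates the time from the start of the first
// failing probe until failureThreshold consecutive failures are recorded,
// assuming each failing probe takes its full timeout.
func secondsUntilRestartDecision(p probeSettings) int {
	return (p.failureThreshold-1)*p.periodSeconds + p.timeoutSeconds
}

// pingsPerMinute estimates the combined probe request rate that nodes nodes
// generate, assuming one ping per probe per node.
func pingsPerMinute(periodSeconds, nodes int) float64 {
	return float64(nodes) * 60 / float64(periodSeconds)
}

func main() {
	final := probeSettings{
		initialDelaySeconds: 30,
		timeoutSeconds:      30,
		periodSeconds:       180,
		failureThreshold:    2,
	}
	nodes := 100 // example figure, not a measured value

	fmt.Println("seconds until restart decision:", secondsUntilRestartDecision(final))
	fmt.Printf("pings per minute at periodSeconds=180: %.1f\n", pingsPerMinute(final.periodSeconds, nodes))
	fmt.Printf("pings per minute at periodSeconds=300: %.1f\n", pingsPerMinute(300, nodes))
}
```

Under these assumptions, the final values give roughly 210 seconds of sustained failure before a restart is triggered, and about 33 pings per minute from 100 nodes, compared with 20 per minute for the periodSeconds of 300 proposed at triage.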

Comment 16 Michael Burman 2022-02-21 15:00:44 UTC

Verified on - 4.11.0-0.nightly-2022-02-18-121223

No regression found in the ovirt-csi-driver suite.

Comment 18 errata-xmlrpc 2022-08-10 10:37:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
