Description of problem:
Health check failure messages appear several times a day in the csi-liveness-probe container of the ovirt-csi-driver-node pod, which is related to ovirt-csi-driver-operator.

~~~
2021-09-02T11:29:00.220309913Z E0902 11:29:00.220161 1 main.go:55] health check failed: rpc error: code = Canceled desc = context canceled
~~~

On the node where the pod is scheduled, we see:

~~~
Sep 02 11:29:00 ocjc4-xxx-worker-0-xxx hyperkube[2116]: I0902 11:29:00.220018 2116 prober.go:117] Liveness probe for "ovirt-csi-driver-node-xxx_openshift-cluster-csi-drivers(d2d919fd-xxxx-4949-a813-9d2e51087f2e):csi-driver" failed (failure): Get "http://<POD_IP>:10300/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 02 11:29:00 ocjc4-xxx-worker-0-pro-xxx hyperkube[2116]: I0902 11:29:00.220209 2116 event.go:291] "Event occurred" object="openshift-cluster-csi-drivers/ovirt-csi-driver-node-xxx" kind="Pod" apiVersion="v1" type="Warning" reason="Unhealthy" message="Liveness probe failed: Get \"http://<POD_IP>:10300/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
~~~

Version-Release number of selected component (if applicable):
OCP 4.6.40

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
ovirt-csi-driver-node pods fail or crash several times a day on multiple nodes.

Expected results:
The ovirt-csi-driver-node pod should run properly.

Additional info:
Attaching other logs and must-gather.
We gave the customer the option of increasing the timeout value for the probes. Currently, no alerts are being generated and the CSI pods are not crashing. This was monitored for a whole month, and the support case was closed today. So I am closing this Bugzilla as well.
Hi Janos,

As per our discussion, I'm reopening this one to pursue a better, more permanent solution to the issue tracked in this BZ.
To record my observations: the probes only call the ping function on the oVirt client, which sends a simple HTTP request to the Engine. If this call is slow, it is because the Engine itself is responding slowly. This needs to be checked with someone from RHV who can provide tuning options that let the RHV Engine respond to 100+ clients simultaneously. @rsandu, can you please loop in someone from RHV here?
Hi Janos,

If we're confident this issue concerns RHV Engine slowness, I'd suggest moving this BZ to the RHV team and moving forward from there.

For now, I should note that the customer is rather hesitant to apply the earlier suggested workaround (increasing the probe timeout) in their biggest production environment (where this issue was spotted), as this would require them to place the storage operator in an Unmanaged state (which would render their environment "unsupported").

In this context, the customer's expectation is to have a more permanent, fully supported fix, whether that is a change in the ovirt CSI driver or performance tuning of the RHV Engine. If the latter is true, the expectation is also for this to be part of our ovirt CSI perf & scale recommendations in docs.openshift.com (or, alternatively, a KCS).

For now, I'd suggest moving this BZ to the proper RHV engine team.
I wouldn't call it "confident", as we haven't tested this scenario in a lab setup, nor have I been able to inspect the customer environment. However, I would recommend temporarily moving this BZ to the RHV engine team and asking whether scale testing at this magnitude has ever been done. If not, I can assist with a client that can perform this kind of scale testing. If that avenue doesn't pan out, we need to investigate further.

As mentioned before, the only thing the CSI driver does for health checks is send pings (a GET request to /ovirt-engine/api from every node). If this request hangs, the health check fails. If the request fails immediately, the CSI driver will try to open a new connection and subsequently fail.
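To illustrate the two failure modes described above (a hanging request versus an immediate failure), here is a minimal Python sketch. It is not the driver's code: the real driver is written in Go and uses an authenticated oVirt client, whereas this sketch sends a plain unauthenticated GET. The names `engine_is_healthy` and `fetch`, and any URL passed in, are hypothetical.

```python
import urllib.request


def engine_is_healthy(api_url, timeout_seconds, fetch=urllib.request.urlopen):
    """Return True if a GET to api_url gets a 2xx answer within the timeout.

    `fetch` is a hypothetical injection point so the check can be exercised
    without a real Engine; it defaults to urllib.request.urlopen.
    """
    try:
        with fetch(api_url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers both cases from the comment above: a request that hangs until
        # the timeout fires (TimeoutError) and one that fails immediately
        # (URLError / HTTPError, e.g. connection refused or an error status).
        # All of these are OSError subclasses.
        return False
```

A call such as `engine_is_healthy("https://<ENGINE_FQDN>/ovirt-engine/api", 15)` would then block for at most about 15 seconds before reporting a failure. With 100+ nodes running such a check on the same schedule, a slow Engine would show up as simultaneous probe failures across nodes.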
Decision from bug triage:
- Increase initialDelaySeconds to 15 seconds
- Increase timeoutSeconds to 15 seconds
- Increase periodSeconds to 300 seconds
- Backport this change to 4.8.
The final change was to:
initialDelaySeconds: 30
timeoutSeconds: 30
periodSeconds: 180
failureThreshold: 2
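As a rough aid for reasoning about these values, the following Python sketch estimates how long the Engine must stay unresponsive before the kubelet restarts the container. It assumes every probe fails by running into its timeout and that probes start exactly one period apart, and it ignores kubelet scheduling jitter. The function name `seconds_until_restart` is hypothetical.

```python
def seconds_until_restart(period_seconds, timeout_seconds, failure_threshold):
    """Approximate time from the start of the first failing probe to restart.

    The first failing probe starts at t=0, each later probe starts one period
    after the previous one, and the last probe needed to reach the failure
    threshold fails once its timeout elapses.
    """
    return (failure_threshold - 1) * period_seconds + timeout_seconds


# Final values from this bug: periodSeconds=180, timeoutSeconds=30, failureThreshold=2
print(seconds_until_restart(180, 30, 2))  # prints 210
```

Under these assumptions, a single slow response from the Engine no longer restarts the pod. The Engine has to be unresponsive at two consecutive probes, roughly three minutes apart.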
Verified on 4.11.0-0.nightly-2022-02-18-121223.
No regression found in the ovirt-csi-driver suite.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069