2027685 – openshift-cluster-csi-drivers pods crashing on PSI

Summary: openshift-cluster-csi-drivers pods crashing on PSI

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Emilien Macchi
QA Contact:	Itay Matza
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2037080

Reported:	2021-11-30 12:56 UTC by rlobillo
Modified:	2022-03-10 16:31 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The csi-driver pod's livenessProbe was too strict. Consequence: The probe would fail on slower clouds, causing the cluster to be degraded. Fix: Relax the livenessProbe to more realistic values to accommodate slower environments. Result: The cluster is no longer degraded on clouds with slow Cinder.
Clone Of:
Environment:
Last Closed:	2022-03-10 16:31:00 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openstack-cinder-csi-driver-operator pull 63	0	None	open	Bug 2027685: relax health probes against Cinder API	2022-01-04 15:51:48 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:31:22 UTC

Description rlobillo 2021-11-30 12:56:33 UTC

Description of problem:

After a successful IPI installation, the Cinder-related pods in the openshift-cluster-csi-drivers namespace are crash-looping, so the storage cluster operator is not ready:

$ oc get pods -n openshift-cluster-csi-drivers -o wide && oc get co/storage
NAME                                                    READY   STATUS             RESTARTS          AGE   IP              NODE                                NOMINATED NODE   READINESS GATES
manila-csi-driver-operator-c9cdfd98b-n9ssp              1/1     Running            0                 46m   10.128.0.70     rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-controller-ff8bfb4d-6mw65   9/10    CrashLoopBackOff   32 (2m4s ago)     46m   192.168.0.243   rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-controller-ff8bfb4d-q4dl4   10/10   Running            15 (11m ago)      44m   192.168.3.150   rlob-storage-6lgfl-master-1         <none>           <none>
openstack-cinder-csi-driver-node-86pxz                  3/3     Running            133 (14m ago)     21h   192.168.2.93    rlob-storage-6lgfl-master-2         <none>           <none>
openstack-cinder-csi-driver-node-c8v85                  3/3     Running            121 (19m ago)     21h   192.168.0.243   rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-node-d2z28                  3/3     Running            120 (6m12s ago)   21h   192.168.0.86    rlob-storage-6lgfl-worker-0-d8rsn   <none>           <none>
openstack-cinder-csi-driver-node-gj9wb                  3/3     Running            134 (6m32s ago)   20h   192.168.2.102   rlob-storage-6lgfl-worker-0-p8hm7   <none>           <none>
openstack-cinder-csi-driver-node-qkqsz                  3/3     Running            135 (6m ago)      20h   192.168.1.197   rlob-storage-6lgfl-worker-0-rwrkk   <none>           <none>
openstack-cinder-csi-driver-node-zqd2z                  3/3     Running            156 (8m49s ago)   21h   192.168.3.150   rlob-storage-6lgfl-master-1         <none>           <none>
openstack-cinder-csi-driver-operator-85db5749f8-8cg7j   1/1     Running            0                 44m   10.130.0.96     rlob-storage-6lgfl-master-1         <none>           <none>
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.9.0-0.nightly-2021-11-29-063111   True        True          False      11m     OpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods

On master-0, the following log lines are observed in the journal:

Nov 30 12:36:15 rlob-storage-6lgfl-master-0 hyperkube[1764]: I1130 12:36:15.637364    1764 patch_prober.go:29] interesting pod/openstack-cinder-csi-driver-node-c8v85 container/csi-driver namespace/openshift-cluster-csi-drivers: Liveness probe status=failure output="Get \"http://192.168.0.243:10300/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" start-of-body=
Nov 30 12:36:15 rlob-storage-6lgfl-master-0 hyperkube[1764]: I1130 12:36:15.637508    1764 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-cluster-csi-drivers/openstack-cinder-csi-driver-node-c8v85" podUID=ab368efa-0e8c-4e66-8ea3-f444e642149e containerName="csi-driver" probeResult=failure output="Get \"http://192.168.0.243:10300/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"


The issue is observed on 4.9 and 4.10 nightlies, and only on installations performed on PSI. The same kind of installation for 4.8 on PSI works fine.


Version-Release number of selected component (if applicable): 4.9.0-0.nightly-2021-11-29-063111

How reproducible: Always on PSI

Steps to Reproduce:
1. Install any OCP QE CI profile for openstack on PSI.

Actual results: cluster operator storage is degraded.


Expected results: Pods are stable and all cluster operators are healthy.

must-gather: http://file.rdu.redhat.com/rlobillo/storage_co_degraded.tgz

Comment 3 To Hung Sze 2021-11-30 17:21:56 UTC

This is effectively blocking our testing, as it makes our cluster fail before we can run our tests.

Comment 4 To Hung Sze 2021-11-30 17:26:11 UTC

For us, it only happens on top of rhos-d OpenStack PSI.

Comment 13 Jan Safranek 2021-12-02 10:50:34 UTC

I checked the cluster; manually bumping the liveness probe log level shows:

I1202 10:44:05.042761       1 connection.go:183] GRPC call: /csi.v1.Identity/Probe
I1202 10:44:05.042765       1 connection.go:184] GRPC request: {}
I1202 10:44:08.042348       1 connection.go:186] GRPC response: {}
I1202 10:44:08.042398       1 connection.go:187] GRPC error: rpc error: code = Canceled desc = context canceled

That is, the CSI driver's Probe() call timed out after 3 seconds. This timeout causes the driver container to be declared unhealthy and restarted.
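
The 3-second figure can be read straight off the klog timestamps in the excerpt above. The following is a minimal sketch of that arithmetic; the helper name `klog_time` is hypothetical, and the only assumption is that the second whitespace-separated field of each line is the wall-clock time, as it is in the quoted lines.

```python
from datetime import datetime


def klog_time(line: str) -> datetime:
    """Parse the wall-clock time from a klog-style line (hypothetical helper).

    Assumes the second whitespace-separated field looks like 10:44:05.042765.
    The date part is ignored, so this only works within a single day.
    """
    return datetime.strptime(line.split()[1], "%H:%M:%S.%f")


# The two lines are copied from the log excerpt in this comment.
request = "I1202 10:44:05.042765       1 connection.go:184] GRPC request: {}"
response = "I1202 10:44:08.042348       1 connection.go:186] GRPC response: {}"

elapsed = (klog_time(response) - klog_time(request)).total_seconds()
print(f"{elapsed:.6f}")  # prints 2.999583
```

The gap between request and response is just under 3 seconds, which is consistent with the call being cut off by a 3-second deadline rather than returning on its own.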

Probe() should ideally be cheap (it's called often to check driver health). There is a command-line option, -probe-timeout, to increase the probe timeout if the call is expected to take longer on an arbitrary OpenStack.
https://github.com/openshift/openstack-cinder-csi-driver-operator/blob/286f1ba0f478a3443fa223721787f09f51531dc2/assets/controller.yaml#L265

I would then probably suggest increasing the liveness probe timeout to the same number and increasing the period to a much larger value than every 10 seconds, to give OpenStack some rest:
https://github.com/openshift/openstack-cinder-csi-driver-operator/blob/286f1ba0f478a3443fa223721787f09f51531dc2/assets/controller.yaml#L72-L73

(note that the driver DaemonSet needs to be changed in the same fashion).
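
The effect of a larger probe timeout can be illustrated with a toy model. This is only a sketch, not the liveness probe sidecar's code: `slow_backend_check`, `probe`, and `run_probe` are hypothetical names, and the latencies are scaled down to fractions of a second so the example runs quickly. The same health check fails under a tight deadline and passes once the deadline exceeds the backend latency.

```python
import asyncio


async def slow_backend_check(latency: float) -> bool:
    """Stand-in for a health check that depends on a slow backend API."""
    await asyncio.sleep(latency)
    return True


async def probe(latency: float, timeout: float) -> bool:
    """Run the check under a deadline; a timeout counts as a failed probe."""
    try:
        return await asyncio.wait_for(slow_backend_check(latency), timeout=timeout)
    except asyncio.TimeoutError:
        return False


def run_probe(latency: float, timeout: float) -> bool:
    return asyncio.run(probe(latency, timeout))


# Illustrative values only: the backend answers in 0.2 s.
print(run_probe(latency=0.2, timeout=0.1))  # deadline shorter than latency
print(run_probe(latency=0.2, timeout=0.5))  # deadline longer than latency
```

In this model the first call reports a failed probe and the second a healthy one, even though the backend behaves identically in both. That mirrors the suggestion above: on a cloud where the backend is merely slow, a longer deadline avoids restarting a driver that is otherwise working.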

Comment 15 ShiftStack Bugwatcher 2021-12-23 07:03:19 UTC

Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

Comment 18 Itay Matza 2022-01-11 12:48:34 UTC

Verified with OCP 4.10.0-0.nightly-2022-01-10-144202 on top of OSP RHOS-16.1-RHEL-8-20211126.n.1:

- Confirmed that the values of the parameters were updated successfully.
- Confirmed on several different CI jobs - the current timeout values are working in the PSI cloud.

Comment 22 errata-xmlrpc 2022-03-10 16:31:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
