Bug 2027685

Summary:	openshift-cluster-csi-drivers pods crashing on PSI
Product:	OpenShift Container Platform	Reporter:	rlobillo
Component:	Storage	Assignee:	Emilien Macchi <emacchi>
Storage sub component:	OpenStack CSI Drivers	QA Contact:	Itay Matza <imatza>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	aos-bugs, emacchi, jsafrane, m.andre, mbooth, ppitonak, pprinett, tsze
Version:	4.9	Keywords:	TestBlocker, Triaged
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The csi-driver pod's livenessProbe was too strict. Consequence: The probe would fail on slower clouds, causing the cluster to be degraded. Fix: Relax the livenessProbe to more realistic values to accommodate slower environments. Result: The cluster is no longer degraded on clouds with slow Cinder.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-03-10 16:31:00 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2037080

Description rlobillo 2021-11-30 12:56:33 UTC

Description of problem:

After a successful IPI installation, the Cinder-related pods in the openshift-cluster-csi-drivers namespace are crash-looping, so the storage cluster operator is not ready:

$ oc get pods -n openshift-cluster-csi-drivers -o wide && oc get co/storage
NAME                                                    READY   STATUS             RESTARTS          AGE   IP              NODE                                NOMINATED NODE   READINESS GATES
manila-csi-driver-operator-c9cdfd98b-n9ssp              1/1     Running            0                 46m   10.128.0.70     rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-controller-ff8bfb4d-6mw65   9/10    CrashLoopBackOff   32 (2m4s ago)     46m   192.168.0.243   rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-controller-ff8bfb4d-q4dl4   10/10   Running            15 (11m ago)      44m   192.168.3.150   rlob-storage-6lgfl-master-1         <none>           <none>
openstack-cinder-csi-driver-node-86pxz                  3/3     Running            133 (14m ago)     21h   192.168.2.93    rlob-storage-6lgfl-master-2         <none>           <none>
openstack-cinder-csi-driver-node-c8v85                  3/3     Running            121 (19m ago)     21h   192.168.0.243   rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-node-d2z28                  3/3     Running            120 (6m12s ago)   21h   192.168.0.86    rlob-storage-6lgfl-worker-0-d8rsn   <none>           <none>
openstack-cinder-csi-driver-node-gj9wb                  3/3     Running            134 (6m32s ago)   20h   192.168.2.102   rlob-storage-6lgfl-worker-0-p8hm7   <none>           <none>
openstack-cinder-csi-driver-node-qkqsz                  3/3     Running            135 (6m ago)      20h   192.168.1.197   rlob-storage-6lgfl-worker-0-rwrkk   <none>           <none>
openstack-cinder-csi-driver-node-zqd2z                  3/3     Running            156 (8m49s ago)   21h   192.168.3.150   rlob-storage-6lgfl-master-1         <none>           <none>
openstack-cinder-csi-driver-operator-85db5749f8-8cg7j   1/1     Running            0                 44m   10.130.0.96     rlob-storage-6lgfl-master-1         <none>           <none>
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.9.0-0.nightly-2021-11-29-063111   True        True          False      11m     OpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods

On master-0, the journal shows the following log lines:

Nov 30 12:36:15 rlob-storage-6lgfl-master-0 hyperkube[1764]: I1130 12:36:15.637364    1764 patch_prober.go:29] interesting pod/openstack-cinder-csi-driver-node-c8v85 container/csi-driver namespace/openshift-cluster-csi-drivers: Liveness probe status=failure output="Get \"http://192.168.0.243:10300/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" start-of-body=
Nov 30 12:36:15 rlob-storage-6lgfl-master-0 hyperkube[1764]: I1130 12:36:15.637508    1764 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-cluster-csi-drivers/openstack-cinder-csi-driver-node-c8v85" podUID=ab368efa-0e8c-4e66-8ea3-f444e642149e containerName="csi-driver" probeResult=failure output="Get \"http://192.168.0.243:10300/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"


The issue is observed on 4.9 and 4.10 nightlies, and only on installations performed on PSI. Equivalent 4.8 installations on PSI work fine.


Version-Release number of selected component (if applicable): 4.9.0-0.nightly-2021-11-29-063111

How reproducible: Always on PSI

Steps to Reproduce:
1. Install any OCP QE CI profile for openstack on PSI.

Actual results: cluster operator storage is degraded.


Expected results: Pods are stable and all cluster operators are healthy.

must-gather: http://file.rdu.redhat.com/rlobillo/storage_co_degraded.tgz

Comment 3 To Hung Sze 2021-11-30 17:21:56 UTC

This is blocking our testing: the cluster fails before we can run our tests.

Comment 4 To Hung Sze 2021-11-30 17:26:11 UTC

For us, it only happens on top of rhos-d OpenStack PSI.

Comment 13 Jan Safranek 2021-12-02 10:50:34 UTC

I checked the cluster; manually bumping the liveness probe log level shows:

I1202 10:44:05.042761       1 connection.go:183] GRPC call: /csi.v1.Identity/Probe
I1202 10:44:05.042765       1 connection.go:184] GRPC request: {}
I1202 10:44:08.042348       1 connection.go:186] GRPC response: {}
I1202 10:44:08.042398       1 connection.go:187] GRPC error: rpc error: code = Canceled desc = context canceled

I.e. the CSI driver Probe() call timed out after 3 seconds. This timeout causes the driver container to be declared unhealthy and restarted.
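The 3-second figure can be read straight off the klog timestamps in the excerpt above. A minimal Python sketch, assuming the standard klog header layout (severity and date, then wall-clock time as the second field); the helper name klog_time is made up for illustration:

```python
from datetime import datetime

def klog_time(line: str) -> datetime:
    # klog header looks like "I1202 10:44:05.042765 ...";
    # the second whitespace-separated field is the wall-clock time.
    return datetime.strptime(line.split()[1], "%H:%M:%S.%f")

# The request and response lines copied from the log excerpt above.
request = "I1202 10:44:05.042765       1 connection.go:184] GRPC request: {}"
response = "I1202 10:44:08.042348       1 connection.go:186] GRPC response: {}"

elapsed = (klog_time(response) - klog_time(request)).total_seconds()
print(f"Probe() returned after {elapsed:.3f}s")
```

The gap is just under 3 seconds, which matches the call being canceled at the timeout rather than completing on its own.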

Probe() should ideally be cheap (it is called often to check driver health). There is a command-line option, -probe-timeout, to increase the probe timeout if the call is expected to take longer on a given OpenStack.
https://github.com/openshift/openstack-cinder-csi-driver-operator/blob/286f1ba0f478a3443fa223721787f09f51531dc2/assets/controller.yaml#L265

I would then suggest increasing the liveness probe timeout to the same value, and increasing the period to something much larger than the current 10 seconds to give OpenStack some rest:
https://github.com/openshift/openstack-cinder-csi-driver-operator/blob/286f1ba0f478a3443fa223721787f09f51531dc2/assets/controller.yaml#L72-L73

(note that the driver DaemonSet needs to be changed in the same fashion).
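To make the reasoning behind this suggestion concrete, here is a rough Python sketch of the two relationships involved: the kubelet's liveness timeout should be at least as long as the sidecar's -probe-timeout, and a longer period means fewer Probe() calls. The 10-second period comes from this comment; the 60-second period and 10-second timeouts are purely illustrative assumptions, not the values that were merged, and both helper names are hypothetical.

```python
def probes_per_hour(period_s: int) -> int:
    # Number of liveness checks one container receives per hour.
    return 3600 // period_s

def timeouts_consistent(kubelet_timeout_s: int, sidecar_probe_timeout_s: int) -> bool:
    # The kubelet should wait at least as long as the sidecar waits for
    # Probe(); otherwise the HTTP check gives up before Probe() can answer.
    return kubelet_timeout_s >= sidecar_probe_timeout_s

print(probes_per_hour(10))          # period mentioned in this comment
print(probes_per_hour(60))          # hypothetical relaxed period
print(timeouts_consistent(10, 10))  # hypothetical: both timeouts raised together
print(timeouts_consistent(3, 10))   # hypothetical: only -probe-timeout raised
```

Raising only one of the two timeouts leaves the shorter one as the effective limit, which is why the suggestion is to set them to the same number.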

Comment 15 ShiftStack Bugwatcher 2021-12-23 07:03:19 UTC

Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

Comment 18 Itay Matza 2022-01-11 12:48:34 UTC

Verified with OCP 4.10.0-0.nightly-2022-01-10-144202 on top of OSP RHOS-16.1-RHEL-8-20211126.n.1:

- Confirmed that the values of the parameters were updated successfully.
- Confirmed on several different CI jobs - the current timeout values are working in the PSI cloud.

Comment 22 errata-xmlrpc 2022-03-10 16:31:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056