Bug 2027685 - openshift-cluster-csi-drivers pods crashing on PSI
Summary: openshift-cluster-csi-drivers pods crashing on PSI
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.10.0
Assignee: Emilien Macchi
QA Contact: Itay Matza
Depends On:
Blocks: 2037080
Reported: 2021-11-30 12:56 UTC by rlobillo
Modified: 2022-03-10 16:31 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The csi-driver pod's livenessProbe was too strict.
Consequence: The probe would fail on slower clouds, causing the cluster to be degraded.
Fix: Relax the livenessProbe to more realistic values to accommodate slower environments.
Result: The cluster is no longer degraded on clouds with a slow Cinder.
Clone Of:
Last Closed: 2022-03-10 16:31:00 UTC
Target Upstream Version:

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift openstack-cinder-csi-driver-operator pull 63 | 0 | None | open | Bug 2027685: relax health probes against Cinder API | 2022-01-04 15:51:48 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:31:22 UTC

Description rlobillo 2021-11-30 12:56:33 UTC
Description of problem:

After a successful IPI installation, the Cinder-related pods in the openshift-cluster-csi-drivers namespace are crash-looping, so the storage cluster operator is not ready:

$ oc get pods -n openshift-cluster-csi-drivers -o wide && oc get co/storage
NAME                                                    READY   STATUS             RESTARTS          AGE   IP              NODE                                NOMINATED NODE   READINESS GATES
manila-csi-driver-operator-c9cdfd98b-n9ssp              1/1     Running            0                 46m     rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-controller-ff8bfb4d-6mw65   9/10    CrashLoopBackOff   32 (2m4s ago)     46m   rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-controller-ff8bfb4d-q4dl4   10/10   Running            15 (11m ago)      44m   rlob-storage-6lgfl-master-1         <none>           <none>
openstack-cinder-csi-driver-node-86pxz                  3/3     Running            133 (14m ago)     21h    rlob-storage-6lgfl-master-2         <none>           <none>
openstack-cinder-csi-driver-node-c8v85                  3/3     Running            121 (19m ago)     21h   rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-node-d2z28                  3/3     Running            120 (6m12s ago)   21h    rlob-storage-6lgfl-worker-0-d8rsn   <none>           <none>
openstack-cinder-csi-driver-node-gj9wb                  3/3     Running            134 (6m32s ago)   20h   rlob-storage-6lgfl-worker-0-p8hm7   <none>           <none>
openstack-cinder-csi-driver-node-qkqsz                  3/3     Running            135 (6m ago)      20h   rlob-storage-6lgfl-worker-0-rwrkk   <none>           <none>
openstack-cinder-csi-driver-node-zqd2z                  3/3     Running            156 (8m49s ago)   21h   rlob-storage-6lgfl-master-1         <none>           <none>
openstack-cinder-csi-driver-operator-85db5749f8-8cg7j   1/1     Running            0                 44m     rlob-storage-6lgfl-master-1         <none>           <none>
storage   4.9.0-0.nightly-2021-11-29-063111   True        True          False      11m     OpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods

On master-0, the journal shows the following log lines:

Nov 30 12:36:15 rlob-storage-6lgfl-master-0 hyperkube[1764]: I1130 12:36:15.637364    1764 patch_prober.go:29] interesting pod/openstack-cinder-csi-driver-node-c8v85 container/csi-driver namespace/openshift-cluster-csi-drivers: Liveness probe status=failure output="Get \"\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" start-of-body=
Nov 30 12:36:15 rlob-storage-6lgfl-master-0 hyperkube[1764]: I1130 12:36:15.637508    1764 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-cluster-csi-drivers/openstack-cinder-csi-driver-node-c8v85" podUID=ab368efa-0e8c-4e66-8ea3-f444e642149e containerName="csi-driver" probeResult=failure output="Get \"\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

The issue is observed on 4.9 and 4.10 nightlies, and only on installations performed on PSI. The same kind of installation with 4.8 on PSI works fine.

Version-Release number of selected component (if applicable): 4.9.0-0.nightly-2021-11-29-063111

How reproducible: Always on PSI

Steps to Reproduce:
1. Install any OCP QE CI profile for openstack on PSI.

Actual results: cluster operator storage is degraded.

Expected results: Pods are stable and all cluster operators are healthy.

must-gather: http://file.rdu.redhat.com/rlobillo/storage_co_degraded.tgz

Comment 3 To Hung Sze 2021-11-30 17:21:56 UTC
This is effectively blocking our testing, because the cluster fails before we can run our tests.

Comment 4 To Hung Sze 2021-11-30 17:26:11 UTC
For us, it only happens on top of rhos-d OpenStack PSI.

Comment 13 Jan Safranek 2021-12-02 10:50:34 UTC
I checked the cluster; manually bumping the liveness probe log level shows:

I1202 10:44:05.042761       1 connection.go:183] GRPC call: /csi.v1.Identity/Probe
I1202 10:44:05.042765       1 connection.go:184] GRPC request: {}
I1202 10:44:08.042348       1 connection.go:186] GRPC response: {}
I1202 10:44:08.042398       1 connection.go:187] GRPC error: rpc error: code = Canceled desc = context canceled

I.e. the CSI driver's Probe() call timed out after 3 seconds. This timeout causes the driver container to be declared unhealthy and restarted.
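
The 3-second figure can be read directly off the timestamps in the log lines above. A minimal Python sketch, assuming the time-of-day field is the second whitespace-separated token of each line (the helper name `klog_time` is hypothetical, not part of any tool):

```python
from datetime import datetime

# Two of the liveness probe log lines quoted above.
request = "I1202 10:44:05.042765       1 connection.go:184] GRPC request: {}"
error = ("I1202 10:44:08.042398       1 connection.go:187] GRPC error: "
         "rpc error: code = Canceled desc = context canceled")


def klog_time(line: str) -> datetime:
    """Parse the time-of-day field of a log line (hypothetical helper).

    Assumes the second whitespace-separated token is HH:MM:SS.ffffff.
    """
    return datetime.strptime(line.split()[1], "%H:%M:%S.%f")


# Time between sending the request and the context being cancelled.
elapsed = (klog_time(error) - klog_time(request)).total_seconds()
print(f"Probe() was cancelled {elapsed:.3f}s after the request was sent")
```

The difference comes out just under 3 seconds, which matches a 3-second deadline expiring before the driver answered.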

Probe() should ideally be cheap (it's called often to check driver health). There is a command-line option, -probe-timeout, to increase the probe timeout if Probe() is expected to take longer on some OpenStack clouds.

I would then suggest increasing the liveness probe timeout to the same value, and raising the period well above the current 10 seconds to give OpenStack some rest.

(Note that the driver DaemonSet needs to be changed in the same fashion.)
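
To get a rough feel for how much the period matters, here is a back-of-the-envelope sketch in Python. It assumes each liveness probe results in one call towards OpenStack, counts the 2 controller and 6 node pods from the `oc get pods` output in the description, and uses a 60-second period purely as a hypothetical value (it is not the value chosen in the fix):

```python
def probes_per_hour(pods: int, period_seconds: int) -> float:
    """Liveness probes issued per hour across all driver pods.

    Assumption: one probe per pod every `period_seconds`.
    """
    return pods * 3600 / period_seconds


# 2 controller pods + 6 node pods, per the `oc get pods` output in the description.
DRIVER_PODS = 2 + 6

current = probes_per_hour(DRIVER_PODS, period_seconds=10)
relaxed = probes_per_hour(DRIVER_PODS, period_seconds=60)  # hypothetical period

print(f"period=10s: {current:.0f} probes/hour")
print(f"period=60s: {relaxed:.0f} probes/hour")
```

Under these assumptions, the probe load scales inversely with the period, so a longer period directly reduces the number of health checks the cloud has to answer.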

Comment 15 ShiftStack Bugwatcher 2021-12-23 07:03:19 UTC
Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing

Comment 18 Itay Matza 2022-01-11 12:48:34 UTC
Verified with OCP 4.10.0-0.nightly-2022-01-10-144202 on top of OSP RHOS-16.1-RHEL-8-20211126.n.1:

- Confirmed that the values of the parameters were updated successfully.
- Confirmed on several different CI jobs - the current timeout values are working in the PSI cloud.

Comment 22 errata-xmlrpc 2022-03-10 16:31:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

