Description of problem:

After a successful IPI installation, the Cinder-related pods in the openshift-cluster-csi-drivers namespace are crash-looping, so the storage cluster operator is not ready:

$ oc get pods -n openshift-cluster-csi-drivers -o wide && oc get co/storage
NAME                                                    READY   STATUS             RESTARTS          AGE   IP              NODE                                NOMINATED NODE   READINESS GATES
manila-csi-driver-operator-c9cdfd98b-n9ssp              1/1     Running            0                 46m   10.128.0.70     rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-controller-ff8bfb4d-6mw65   9/10    CrashLoopBackOff   32 (2m4s ago)     46m   192.168.0.243   rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-controller-ff8bfb4d-q4dl4   10/10   Running            15 (11m ago)      44m   192.168.3.150   rlob-storage-6lgfl-master-1         <none>           <none>
openstack-cinder-csi-driver-node-86pxz                  3/3     Running            133 (14m ago)     21h   192.168.2.93    rlob-storage-6lgfl-master-2         <none>           <none>
openstack-cinder-csi-driver-node-c8v85                  3/3     Running            121 (19m ago)     21h   192.168.0.243   rlob-storage-6lgfl-master-0         <none>           <none>
openstack-cinder-csi-driver-node-d2z28                  3/3     Running            120 (6m12s ago)   21h   192.168.0.86    rlob-storage-6lgfl-worker-0-d8rsn   <none>           <none>
openstack-cinder-csi-driver-node-gj9wb                  3/3     Running            134 (6m32s ago)   20h   192.168.2.102   rlob-storage-6lgfl-worker-0-p8hm7   <none>           <none>
openstack-cinder-csi-driver-node-qkqsz                  3/3     Running            135 (6m ago)      20h   192.168.1.197   rlob-storage-6lgfl-worker-0-rwrkk   <none>           <none>
openstack-cinder-csi-driver-node-zqd2z                  3/3     Running            156 (8m49s ago)   21h   192.168.3.150   rlob-storage-6lgfl-master-1         <none>           <none>
openstack-cinder-csi-driver-operator-85db5749f8-8cg7j   1/1     Running            0                 44m   10.130.0.96     rlob-storage-6lgfl-master-1         <none>           <none>

NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.9.0-0.nightly-2021-11-29-063111   True        True          False      11m     OpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods

On master-0, the following lines are observed in the journal:

Nov 30 12:36:15 rlob-storage-6lgfl-master-0 hyperkube[1764]: I1130 12:36:15.637364    1764 patch_prober.go:29] interesting pod/openstack-cinder-csi-driver-node-c8v85 container/csi-driver namespace/openshift-cluster-csi-drivers: Liveness probe status=failure output="Get \"http://192.168.0.243:10300/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" start-of-body=
Nov 30 12:36:15 rlob-storage-6lgfl-master-0 hyperkube[1764]: I1130 12:36:15.637508    1764 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-cluster-csi-drivers/openstack-cinder-csi-driver-node-c8v85" podUID=ab368efa-0e8c-4e66-8ea3-f444e642149e containerName="csi-driver" probeResult=failure output="Get \"http://192.168.0.243:10300/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

The issue is observed with 4.9 and 4.10 nightlies, and only on installations performed on PSI. The same kind of installation with 4.8 on PSI works fine.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-11-29-063111

How reproducible:
Always on PSI

Steps to Reproduce:
1. Install any OCP QE CI profile for openstack on PSI.

Actual results:
The storage cluster operator is degraded.

Expected results:
Pods are stable and all cluster operators are healthy.

must-gather: http://file.rdu.redhat.com/rlobillo/storage_co_degraded.tgz
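For anyone triaging a similar failure before a full must-gather is available, the probe failures can also be surfaced directly from the cluster. This is only a sketch; the node name is taken from the output above, and any affected node works:

# kubelet records failed liveness/readiness probes as "Unhealthy" events:
$ oc get events -n openshift-cluster-csi-drivers --field-selector reason=Unhealthy

# read the kubelet journal on an affected node without an SSH session:
$ oc adm node-logs rlob-storage-6lgfl-master-0 -u kubelet | grep 'Probe failed'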
This is blocking our testing, as it makes the cluster fail before we can even run our tests.
For us, it only happens on top of rhos-d OpenStack PSI.
I checked the cluster; a manual bump of the liveness probe log level shows:

I1202 10:44:05.042761    1 connection.go:183] GRPC call: /csi.v1.Identity/Probe
I1202 10:44:05.042765    1 connection.go:184] GRPC request: {}
I1202 10:44:08.042348    1 connection.go:186] GRPC response: {}
I1202 10:44:08.042398    1 connection.go:187] GRPC error: rpc error: code = Canceled desc = context canceled

I.e. the CSI driver's Probe() call timed out after 3 seconds, which leads to the driver container being declared unhealthy and restarted. Probe() should ideally be cheap (it is called often to check driver health). There is a command-line option, -probe-timeout, to increase the probe timeout if it is expected to take longer on a random OpenStack:

https://github.com/openshift/openstack-cinder-csi-driver-operator/blob/286f1ba0f478a3443fa223721787f09f51531dc2/assets/controller.yaml#L265

I would then suggest increasing the liveness probe timeout to the same value, and increasing the period to something much larger than every 10 seconds to give OpenStack some rest:

https://github.com/openshift/openstack-cinder-csi-driver-operator/blob/286f1ba0f478a3443fa223721787f09f51531dc2/assets/controller.yaml#L72-L73

(Note that the driver DaemonSet needs to be changed in the same fashion.)
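For reference, a quick way to inspect the values discussed above on a live cluster. This is only a sketch: the Deployment/DaemonSet and container names are inferred from the pod names in the description, and since the operator reconciles these objects, the lasting change has to land in the operator's assets rather than in a manual edit.

# arguments of the controller containers (this is where a -probe-timeout flag would show up):
$ oc -n openshift-cluster-csi-drivers get deployment openstack-cinder-csi-driver-controller \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.args}{"\n"}{end}'

# liveness probe timing of the csi-driver container, on the controller Deployment and the node DaemonSet:
$ oc -n openshift-cluster-csi-drivers get deployment openstack-cinder-csi-driver-controller \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-driver")].livenessProbe}{"\n"}'
$ oc -n openshift-cluster-csi-drivers get daemonset openstack-cinder-csi-driver-node \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-driver")].livenessProbe}{"\n"}'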
Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing
Verified with OCP 4.10.0-0.nightly-2022-01-10-144202 on top of OSP RHOS-16.1-RHEL-8-20211126.n.1:
- Confirmed that the values of the parameters were updated successfully.
- Confirmed on several different CI jobs that the current timeout values are working in the PSI cloud.
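For anyone re-verifying on their own cluster, a minimal health check (sketch only): the restart counters of the driver pods should stop climbing, and the storage ClusterOperator should settle on Available=True, Progressing=False, Degraded=False.

$ oc get pods -n openshift-cluster-csi-drivers
$ oc get co storage -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'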
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056