Description of problem:
It is not easy to detect that the CSI driver is in a non-working state.
Due to bug 1918140, openstack-cinder-csi-driver-controller was not installed on OSP, but there is no status indicating that the CSI driver does not work when checking clustercsidrivers/cinder.csi.openstack.org: nothing is reported as Degraded or unavailable, and there is no condition of type "OpenStackCinderDriverControllerServiceController".
$ oc get clustercsidrivers cinder.csi.openstack.org -o json | jq .status
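For comparison, a healthy operator would normally publish conditions under .status.conditions that a jq filter could catch. A sketch of such a check, run here against a sample payload since the actual status object is empty in this case (the condition type and message are assumptions for illustration, not what the operator actually emits):

```shell
# Sample status payload of the kind we would expect the operator to publish
# (assumed condition name/message, for illustration only). On a live cluster
# the input would come from:
#   oc get clustercsidrivers cinder.csi.openstack.org -o json | jq .status
status='{"conditions":[{"type":"OpenStackCinderDriverControllerServiceControllerDegraded","status":"True","message":"controller Deployment is missing"}]}'

# Print the message of any Degraded condition that is True.
echo "$status" | jq -r '.conditions[] | select(.type | endswith("Degraded")) | select(.status == "True") | .message'
# prints: controller Deployment is missing
```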
There is also no explicit error or warning information in the openstack-cinder-csi-driver-operator logs:
$ oc -n openshift-cluster-csi-drivers logs openstack-cinder-csi-driver-operator-557ffdc94d-r9pst | grep "^E"
$ oc -n openshift-cluster-csi-drivers logs openstack-cinder-csi-driver-operator-557ffdc94d-r9pst | grep "^W"
W0120 08:56:20.426994 1 cmd.go:204] Using insecure, self-signed certificates
W0120 08:56:21.457470 1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0120 08:56:21.457499 1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.
When checking the Cluster Storage Operator (CSO), it reports a normal status most of the time; it only shows as Degraded for a very brief moment (visible with "oc get co storage -w"), so it's hard to catch the issue early.
Note that the error message will show up in the operator logs only after 10 minutes. Here's an example:
"F0826 15:07:40.104515 1 base_controller.go:96] unable to sync caches for ConfigObserver"
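The greps above only look for "^E" and "^W"; fatal klog lines like this one start with "F", so they would be missed. A sketch of the extra filter, shown here against sample log text (in real usage, pipe the `oc -n openshift-cluster-csi-drivers logs <operator-pod>` output into it instead):

```shell
# Sample operator log lines taken from this report.
log='W0120 08:56:20.426994 1 cmd.go:204] Using insecure, self-signed certificates
F0826 15:07:40.104515 1 base_controller.go:96] unable to sync caches for ConfigObserver'

# Fatal klog entries begin with "F"; surface them explicitly.
printf '%s\n' "$log" | grep "^F"
# prints: F0826 15:07:40.104515 1 base_controller.go:96] unable to sync caches for ConfigObserver
```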
Since the error message is being recorded in the logs, I'm moving back to ON_QA.
Just a note about the issue: in order to trigger this error, the developer working on the CSI operator needs to NOT start the informers. Even though this happened once, it is unlikely to happen again and should be caught by code review. However, if it does happen again, the mistake would be caught by the presubmit job recently added for Cinder (not sure whether the Manila operator has that too), because the absence of the CSI controller Deployment would cause volume provisioning to fail, which would definitely be caught by the CI job.
Other than that, we could add a check in CSO to make sure the CSI controller Deployment has started correctly; however, I believe it's not worth the effort given the odds of this happening again.
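For reference, the kind of check described could be approximated from the command line. A sketch, where the Deployment name is taken from this report and the helper function is hypothetical; it is exercised here against sample JSON, since on a live cluster the input would come from `oc -n openshift-cluster-csi-drivers get deployment openstack-cinder-csi-driver-controller -o json`:

```shell
# Hypothetical helper: given a Deployment JSON object, report whether the
# CSI controller looks healthy based on .status.readyReplicas.
check_controller() {
    ready=$(echo "$1" | jq '.status.readyReplicas // 0')
    if [ "$ready" -eq 0 ]; then
        echo "Degraded: CSI controller Deployment has no ready replicas"
    else
        echo "OK: $ready replica(s) ready"
    fi
}

# Sample input mimicking a Deployment whose pods never became ready
# (readyReplicas is absent, as it would be if the informers never started).
check_controller '{"status":{"replicas":1}}'
# prints: Degraded: CSI controller Deployment has no ready replicas
```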
*** Bug 1918564 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.