Bug 1918562

Summary: [cinder-csi-driver-operator] does not detect csi driver work status
Product: OpenShift Container Platform
Component: Storage
Sub Component: OpenStack CSI Drivers
Status: CLOSED ERRATA
Severity: medium
Priority: low
Version: 4.7
Keywords: Triaged
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-10-18 17:29:03 UTC
Reporter: Wei Duan <wduan>
Assignee: Fabio Bertinatto <fbertina>
QA Contact: Wei Duan <wduan>
CC: aos-bugs, juriarte, mfedosin, pprinett

Description Wei Duan 2021-01-21 03:59:51 UTC
Description of problem:
It is not easy to tell that the CSI driver is not working.
Due to bug 1918140, openstack-cinder-csi-driver-controller was not installed on OSP, but checking clustercsidrivers/cinder.csi.openstack.org shows no status indicating that the CSI driver does not work: nothing is reported as degraded or unavailable, and there are no "OpenStackCinderDriverControllerServiceController" condition types at all.

$ oc get clustercsidrivers cinder.csi.openstack.org -o json | jq .status
{
  "conditions": [
    {
      "lastTransitionTime": "2021-01-20T09:48:43Z",
      "status": "False",
      "type": "ManagementStateDegraded"
    },
    {
      "lastTransitionTime": "2021-01-20T09:48:52Z",
      "status": "True",
      "type": "OpenStackCinderDriverNodeServiceControllerAvailable"
    },
    {
      "lastTransitionTime": "2021-01-20T10:02:44Z",
      "status": "False",
      "type": "OpenStackCinderDriverNodeServiceControllerProgressing"
    },
    {
      "lastTransitionTime": "2021-01-20T16:03:02Z",
      "reason": "AsExpected",
      "status": "False",
      "type": "OpenStackCinderDriverNodeServiceControllerDegraded"
    },
    {
      "lastTransitionTime": "2021-01-20T21:36:58Z",
      "reason": "AsExpected",
      "status": "False",
      "type": "OpenStackCinderDriverStaticResourcesControllerDegraded"
    }
  ],

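For reference, one way to make the gap visible is to list only the condition types and check for the controller-service ones. This is just a sketch built from the same oc/jq commands as above; the grep pattern is the substring expected in those condition names:

$ oc get clustercsidrivers cinder.csi.openstack.org -o json | jq -r '.status.conditions[].type' | grep ControllerServiceController || echo "no ControllerServiceController conditions reported"
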
There is also no explicit error or warning in the openstack-cinder-csi-driver-operator logs:
$ oc -n openshift-cluster-csi-drivers logs openstack-cinder-csi-driver-operator-557ffdc94d-r9pst | grep "^E" 
$ oc -n openshift-cluster-csi-drivers logs openstack-cinder-csi-driver-operator-557ffdc94d-r9pst | grep "^W" 
W0120 08:56:20.426994       1 cmd.go:204] Using insecure, self-signed certificates
W0120 08:56:21.457470       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0120 08:56:21.457499       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.

When checking the CSO (cluster storage operator), it reports a normal status most of the time; it only shows as degraded for a very short moment (visible with "oc get co storage -w"), so it is hard to find the issue early.
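One way to catch that transient degradation without watching interactively is to poll just the Degraded condition; a minimal sketch (the 5-second interval is arbitrary):

$ while true; do oc get co storage -o json | jq -c '.status.conditions[] | select(.type=="Degraded")'; sleep 5; done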

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-19-095812

How reproducible:
Only under specific conditions (see bug 1918140 in the Description).

Steps to Reproduce:
See Description

Actual results:

Expected results:

Comment 9 Fabio Bertinatto 2021-08-26 17:44:47 UTC
Note that the error message will show up in the operator logs only after 10 minutes. Here's an example [1]:

"F0826 15:07:40.104515       1 base_controller.go:96] unable to sync caches for ConfigObserver"

Since the error message is being recorded in the logs, I'm moving back to ON_QA.
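As a quick check, that fatal message can be grepped from the operator logs; a sketch, assuming the operator Deployment is named openstack-cinder-csi-driver-operator (matching the pod name shown in the Description):

$ oc -n openshift-cluster-csi-drivers logs deploy/openstack-cinder-csi-driver-operator | grep "unable to sync caches"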

Just a note about the issue: in order to trigger this error, the developer working on the CSI operator needs to NOT start the informers. Even though this happened once, it is unlikely to happen again and should be caught by code review. However, if it does happen again, the mistake would be caught by the presubmit job recently added for Cinder (not sure if the Manila operator has that too), because the absence of the CSI controller Deployment would cause volume provisioning to fail, which would definitely be caught by the CI job [2].

Other than that, we could add a check in CSO to make sure the CSI controller Deployment has started correctly; however, I believe it's not worth the effort given the odds of this happening again.
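In the meantime, the presence of the controller Deployment can be checked manually; a sketch, using the Deployment name openstack-cinder-csi-driver-controller from the Description (oc returns NotFound if it was never created):

$ oc -n openshift-cluster-csi-drivers get deployment openstack-cinder-csi-driver-controller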

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_openstack-cinder-csi-driver-operator/39/pull-ci-openshift-openstack-cinder-csi-driver-operator-master-e2e-openstack-csi/1430884632865280000/artifacts/e2e-openstack-csi/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_openstack-cinder-csi-driver-operator-bddfdc65b-9sdnn_openstack-cinder-csi-driver-operator_previous.log
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openstack-cinder-csi-driver-operator/39/pull-ci-openshift-openstack-cinder-csi-driver-operator-master-e2e-openstack-csi/1430884632865280000

Comment 10 Fabio Bertinatto 2021-08-26 17:47:52 UTC
*** Bug 1918564 has been marked as a duplicate of this bug. ***

Comment 11 Wei Duan 2021-08-31 12:21:26 UTC
Verified pass.

Comment 14 errata-xmlrpc 2021-10-18 17:29:03 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 15 Red Hat Bugzilla 2023-09-15 00:58:43 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days