1918562 – [cinder-csi-driver-operator] does not detect csi driver work status

Summary: [cinder-csi-driver-operator] does not detect csi driver work status

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Fabio Bertinatto
QA Contact:	Wei Duan
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1918564 (view as bug list)
Depends On:
Blocks:

Reported:	2021-01-21 03:59 UTC by Wei Duan
Modified:	2023-09-15 00:58 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-10-18 17:29:03 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openstack-cinder-csi-driver-operator pull 30	0	None	closed	Bug 1918562: bump library-go	2021-03-23 09:47:41 UTC
Red Hat Product Errata	RHSA-2021:3759	0	None	None	None	2021-10-18 17:29:49 UTC

Description Wei Duan 2021-01-21 03:59:51 UTC

Description of problem:
It is hard to tell when the CSI driver is not working.
Due to bug 1918140, openstack-cinder-csi-driver-controller was not installed on OSP, but the status of clustercsidrivers/cinder.csi.openstack.org does not show that the CSI driver is not working: no condition reports it as degraded or unavailable, and there is no condition type starting with "OpenStackCinderDriverControllerServiceController" at all.

$ oc get clustercsidrivers cinder.csi.openstack.org -o json | jq .status
{
  "conditions": [
    {
      "lastTransitionTime": "2021-01-20T09:48:43Z",
      "status": "False",
      "type": "ManagementStateDegraded"
    },
    {
      "lastTransitionTime": "2021-01-20T09:48:52Z",
      "status": "True",
      "type": "OpenStackCinderDriverNodeServiceControllerAvailable"
    },
    {
      "lastTransitionTime": "2021-01-20T10:02:44Z",
      "status": "False",
      "type": "OpenStackCinderDriverNodeServiceControllerProgressing"
    },
    {
      "lastTransitionTime": "2021-01-20T16:03:02Z",
      "reason": "AsExpected",
      "status": "False",
      "type": "OpenStackCinderDriverNodeServiceControllerDegraded"
    },
    {
      "lastTransitionTime": "2021-01-20T21:36:58Z",
      "reason": "AsExpected",
      "status": "False",
      "type": "OpenStackCinderDriverStaticResourcesControllerDegraded"
    }
  ],
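
The missing controller conditions can be detected mechanically. The following is only a minimal Python sketch, not part of the operator: it assumes the controller service conditions would share the "OpenStackCinderDriverControllerServiceController" prefix (by analogy with the node service conditions above), and the helper name missing_condition_prefixes is hypothetical.

```python
import json

# Prefixes expected among the condition types. The controller prefix is an
# assumption based on how the node service conditions are named.
EXPECTED_PREFIXES = [
    "OpenStackCinderDriverNodeServiceController",
    "OpenStackCinderDriverControllerServiceController",
]


def missing_condition_prefixes(status, expected=EXPECTED_PREFIXES):
    """Return the expected prefixes that no condition type starts with."""
    types = [c.get("type", "") for c in status.get("conditions", [])]
    return [p for p in expected if not any(t.startswith(p) for t in types)]


# Condition types and statuses copied from the output above (other fields omitted).
sample_status = json.loads("""
{"conditions": [
  {"status": "False", "type": "ManagementStateDegraded"},
  {"status": "True",  "type": "OpenStackCinderDriverNodeServiceControllerAvailable"},
  {"status": "False", "type": "OpenStackCinderDriverNodeServiceControllerProgressing"},
  {"status": "False", "type": "OpenStackCinderDriverNodeServiceControllerDegraded"},
  {"status": "False", "type": "OpenStackCinderDriverStaticResourcesControllerDegraded"}
]}
""")

print(missing_condition_prefixes(sample_status))
# → ['OpenStackCinderDriverControllerServiceController']
```

The input could be the JSON printed by the "oc get clustercsidrivers ... | jq .status" command above; an empty result would mean every expected controller reported at least one condition.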

The openstack-cinder-csi-driver-operator logs contain no explicit error or warning about this either:
$ oc -n openshift-cluster-csi-drivers logs openstack-cinder-csi-driver-operator-557ffdc94d-r9pst | grep "^E" 
$ oc -n openshift-cluster-csi-drivers logs openstack-cinder-csi-driver-operator-557ffdc94d-r9pst | grep "^W" 
W0120 08:56:20.426994       1 cmd.go:204] Using insecure, self-signed certificates
W0120 08:56:21.457470       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256' detected.
W0120 08:56:21.457499       1 secure_serving.go:69] Use of insecure cipher 'TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256' detected.
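
Note that the two greps above only match lines starting with "E" or "W". klog also prefixes fatal messages with "F", and the cache sync failure quoted in comment 9 is such a line, so it would slip through both filters. Below is a minimal Python sketch (the helper name problem_lines is hypothetical) that picks up warning, error and fatal lines in one pass, assuming the standard klog header layout seen in the output above.

```python
import re

# klog header: severity letter, mmdd, time, thread id, file:line, then "] msg".
KLOG_HEADER = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ ([^\]]+)\] (.*)$"
)
SEVERITIES = {"I": "info", "W": "warning", "E": "error", "F": "fatal"}


def problem_lines(log_text):
    """Return (severity, source, message) for every W, E or F klog line."""
    found = []
    for line in log_text.splitlines():
        match = KLOG_HEADER.match(line)
        if match and match.group(1) in "WEF":
            found.append((SEVERITIES[match.group(1)], match.group(4), match.group(5)))
    return found


# Two lines taken from this bug report: one warning and one fatal.
sample_log = "\n".join([
    "W0120 08:56:20.426994       1 cmd.go:204] Using insecure, self-signed certificates",
    "F0826 15:07:40.104515       1 base_controller.go:96] unable to sync caches for ConfigObserver",
])

for severity, source, message in problem_lines(sample_log):
    print(severity, source, message)
# → warning cmd.go:204 Using insecure, self-signed certificates
# → fatal base_controller.go:96 unable to sync caches for ConfigObserver
```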

The CSO looks healthy most of the time; it only shows as degraded for a very short moment when watching with “oc get co storage -w”, so it is hard to catch the issue early.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-19-095812

How reproducible:
On condition

Steps to Reproduce:
See Description

Actual results:

Expected results:

Comment 9 Fabio Bertinatto 2021-08-26 17:44:47 UTC

Note that the error message will show up in the operator logs only after 10 minutes. Here's an example [1]:

"F0826 15:07:40.104515       1 base_controller.go:96] unable to sync caches for ConfigObserver"

Since the error message is being recorded in the logs, I'm moving back to ON_QA.

A note about the issue: to trigger this error, the developer working on the CSI operator has to forget to start the informers. Even though this happened once, it is unlikely to happen again and should be caught by code review. If it does happen again, the mistake would be caught by the presubmit job recently added for Cinder (not sure if the Manila operator has one too): the absence of the CSI controller Deployment would cause volume provisioning to fail, which would definitely be caught by the CI job [2].
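
To illustrate the failure mode, here is a toy Python model. It is not the library-go code; ToyInformer and wait_for_cache_sync are hypothetical names, and the timeout is shortened from the 10 minutes mentioned above to a fraction of a second. A controller waits until all of its informers report that their caches are synced; an informer that was never started never reports that, so the wait can only end in a timeout.

```python
import time


class ToyInformer:
    """Stand-in for an informer: its cache is synced only after start()."""

    def __init__(self, name):
        self.name = name
        self._synced = False

    def start(self):
        # A real informer would list and watch resources before syncing;
        # the toy version marks itself synced immediately.
        self._synced = True

    def has_synced(self):
        return self._synced


def wait_for_cache_sync(informers, timeout, poll_interval=0.01):
    """Poll until every informer has synced, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if all(informer.has_synced() for informer in informers):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


started = ToyInformer("started")
started.start()
forgotten = ToyInformer("forgotten")  # start() is never called

print(wait_for_cache_sync([started], timeout=0.05))             # → True
print(wait_for_cache_sync([started, forgotten], timeout=0.05))  # → False
```

Until the timeout fires nothing is logged in this model, which mirrors why the problem stayed invisible for the first 10 minutes.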

Other than that, we could add a check in CSO to make sure the CSI controller Deployment has started correctly. However, I believe it's not worth the effort given the odds of this happening again.
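
For reference, such a check could look roughly like the Python sketch below. This is an illustration only, not a proposal for the actual CSO code: deployment_is_available is a hypothetical helper that takes a Deployment as a dict shaped like the JSON returned by "oc get deployment -o json", or None when the Deployment does not exist.

```python
def deployment_is_available(deployment):
    """Return True if the Deployment exists and reports enough available replicas."""
    if deployment is None:
        # The Deployment was never created, as in the situation from this bug.
        return False
    # Kubernetes defaults spec.replicas to 1 when it is not set.
    desired = deployment.get("spec", {}).get("replicas", 1)
    status = deployment.get("status", {})
    available = status.get("availableReplicas", 0)
    has_available_condition = any(
        c.get("type") == "Available" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    # Require at least one replica so a Deployment scaled to zero is not "working".
    return has_available_condition and available >= max(desired, 1)


healthy = {
    "spec": {"replicas": 1},
    "status": {
        "availableReplicas": 1,
        "conditions": [{"type": "Available", "status": "True"}],
    },
}

print(deployment_is_available(None))     # → False
print(deployment_is_available(healthy))  # → True
```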

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_openstack-cinder-csi-driver-operator/39/pull-ci-openshift-openstack-cinder-csi-driver-operator-master-e2e-openstack-csi/1430884632865280000/artifacts/e2e-openstack-csi/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_openstack-cinder-csi-driver-operator-bddfdc65b-9sdnn_openstack-cinder-csi-driver-operator_previous.log
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_openstack-cinder-csi-driver-operator/39/pull-ci-openshift-openstack-cinder-csi-driver-operator-master-e2e-openstack-csi/1430884632865280000

Comment 10 Fabio Bertinatto 2021-08-26 17:47:52 UTC

*** Bug 1918564 has been marked as a duplicate of this bug. ***

Comment 11 Wei Duan 2021-08-31 12:21:26 UTC

Verified pass.

Comment 14 errata-xmlrpc 2021-10-18 17:29:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 15 Red Hat Bugzilla 2023-09-15 00:58:43 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days