Bug 1952891

Summary: Upgrade failed due to cinder csi driver not deployed
Product: OpenShift Container Platform
Reporter: Wei Duan <wduan>
Component: Storage
Assignee: Matthew Booth <mbooth>
Storage sub component: OpenStack CSI Drivers
QA Contact: Jon Uriarte <juriarte>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: urgent
CC: adduarte, aos-bugs, lwan, mbooth, mfedosin
Version: 4.8
Keywords: Regression, Triaged
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 23:03:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Wei Duan 2021-04-23 12:56:49 UTC
Description of problem:
We ran a chained upgrade 4.2->4.3->4.4->4.5->4.6->4.7->4.8, and it failed at the step from 4.7 to 4.8.0-0.nightly-2021-04-21-131512. The storage cluster operator (CO) did not roll out successfully.

The must-gather log is at http://virt-openshift-05.lab.eng.nay.redhat.com/wduan/logs/must-gather.local.5100242347722931169.tar.gz

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-04-21-093400   True        True          3h      Unable to apply 4.8.0-0.nightly-2021-04-21-131512: the cluster operator storage has not yet successfully rolled out

From the CSO log:
2021-04-22T05:00:57.296086215Z I0422 05:00:57.293677       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"cluster-storage-operator", UID:"b2db9af5-a2ed-11eb-bd49-fa163e847ddc", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/storage changed: Progressing message changed from "ManilaCSIDriverOperatorCRProgressing: ManilaDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nOpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods\nOpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods" to "OpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods\nOpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods"

From the csi-driver log:
2021-04-22T06:56:03.105512023Z I0422 06:56:03.105456       1 openstack.go:130] InitOpenStackProvider configFile: /etc/kubernetes/config/cloud.conf
2021-04-22T06:56:03.106889195Z I0422 06:56:03.106822       1 openstack.go:88] Block storage opts: {0 false false}
2021-04-22T06:56:03.107006064Z W0422 06:56:03.106963       1 main.go:108] Failed to GetOpenStackProvider: failed to read and parse /etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem certificate: open /etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem: no such file or directory

From the csi-provisioner log:
2021-04-22T04:50:43.149940243Z I0422 04:50:43.149830       1 csi-provisioner.go:155] Building kube configs for running in cluster...
2021-04-22T04:50:43.180958482Z I0422 04:50:43.180881       1 connection.go:153] Connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
2021-04-22T04:50:53.181856106Z W0422 04:50:53.181806       1 connection.go:172] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
2021-04-22T04:51:03.182074976Z W0422 04:51:03.181951       1 connection.go:172] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
2021-04-22T04:51:13.181783506Z W0422 04:51:13.181689       1 connection.go:172] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
2021-04-22T04:51:23.181836266Z W0422 04:51:23.181776       1 connection.go:172] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
2021-04-22T04:51:33.181890408Z W0422 04:51:33.181796       1 connection.go:172] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

Version-Release number of selected component (if applicable):
Upgrade from 4.7 to 4.8.0-0.nightly-2021-04-21-131512

How reproducible:
Upgrade from 4.2->4.3->4.4->4.5->4.6->4.7->4.8

Steps to Reproduce:
1.
2.
3.

Actual results:
Storage CO is not available during upgrade

Expected results:
Storage CO should be available during upgrade

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 2 Mike Fedosin 2021-04-29 13:40:47 UTC
The cause of this issue is that, a long time ago, there was a bug in gophercloud utils that caused incorrect YAML marshaling of empty fields.
I fixed that bug in https://github.com/gophercloud/utils/pull/100, but the fix only became available in 4.3.
This means that in 4.2, if cacert was not in the original clouds.yaml, the utils produced `cacert: ""` in the system clouds.yaml, which is technically not a problem. But the check introduced in https://github.com/openshift/cloud-credential-operator/pull/314 only verifies that the key (cacert) exists in the file and ignores the fact that its value can be empty. To fix the issue, we need to ignore an empty cacert.
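
The difference between a presence-only check and a check that ignores an empty value can be sketched as follows. This is a minimal Python illustration only, not the operator's actual Go code; the function names `has_cacert_key` and `has_usable_cacert` and the sample dictionaries are hypothetical.

```python
def has_cacert_key(cloud: dict) -> bool:
    """Presence-only check: true even when the value is an empty string."""
    return "cacert" in cloud


def has_usable_cacert(cloud: dict) -> bool:
    """Stricter check: the key must exist and hold a non-empty string."""
    value = cloud.get("cacert")
    return isinstance(value, str) and value.strip() != ""


# Hypothetical cloud entry as a 4.2-era install might have stored it,
# with the empty field marshaled explicitly.
legacy_cloud = {"auth": {"auth_url": "http://10.0.0.104:5000"}, "cacert": ""}

# Hypothetical cloud entry with the field omitted entirely.
clean_cloud = {"auth": {"auth_url": "http://10.0.0.104:5000"}}
```

With the presence-only check, `legacy_cloud` is treated as if it had a CA certificate configured; with the stricter check, both entries are treated the same way.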

Comment 3 Matthew Booth 2021-04-29 16:03:40 UTC
I'm going to post an alternative patch shortly.

Comment 5 Matthew Booth 2021-05-12 09:12:52 UTC
Doc text not required, as this bug was never released.

Comment 7 Jon Uriarte 2021-06-18 09:17:33 UTC
Verified in 4.8.0-0.nightly-2021-06-14-145150 on top of OSP 16.1.6 (RHOS-16.1-RHEL-8-20210506.n.1), after the following upgrade chain:

4.2.0-0.nightly-2021-02-22-141219 -> 4.3.40 -> 4.4.33 -> 4.5.40 -> 4.6.35 -> 4.7.17 -> 4.8.0-0.nightly-2021-06-14-145150


The underlying OSP does not have SSL enabled, and the clouds.yaml doesn't contain the cacert param:

clouds:
...
 shiftstack:
    auth:
        auth_url: http://10.0.0.104:5000
        password: hidden
        project_domain_name: Default
        project_name: shiftstack
        project_id: a2de4b65f83341d1942c201750fffdf6
        user_domain_name: Default
        username: shiftstack_user
    identity_api_version: '3'
    region_name: regionOne

$ openstack server list
+--------------------------------------+---------------------------+--------+-------------------------------------+-------+--------+
| ID                                   | Name                      | Status | Networks                            | Image | Flavor |
+--------------------------------------+---------------------------+--------+-------------------------------------+-------+--------+
| e52f1172-dfb1-4328-bb47-db2ebf14ead8 | ostest-28klf-worker-lx9bv | ACTIVE | ostest-28klf-openshift=10.196.2.211 | rhcos |        |
| b9e7e76e-e74e-4206-99d2-c94d9b2b4e3d | ostest-28klf-worker-2fvwv | ACTIVE | ostest-28klf-openshift=10.196.0.140 | rhcos |        |
| 32f09721-a9e5-4e3a-8092-ba127f1e7609 | ostest-28klf-master-0     | ACTIVE | ostest-28klf-openshift=10.196.0.47  | rhcos |        |
| 155849f7-27d8-45d1-99be-4d837252acdc | ostest-28klf-master-1     | ACTIVE | ostest-28klf-openshift=10.196.0.32  | rhcos |        |
| ac37f988-b9cb-433e-be21-4f5328a66cd8 | ostest-28klf-master-2     | ACTIVE | ostest-28klf-openshift=10.196.1.107 | rhcos |        |
+--------------------------------------+---------------------------+--------+-------------------------------------+-------+--------+

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2021-02-22-141219   True        False         60s     Cluster version is 4.2.0-0.nightly-2021-02-22-141219

Upgrades:
        
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.40                              True        False         3m49s   Cluster version is 4.3.40

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.33                              True        False         3m33s   Cluster version is 4.4.33

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.40                              True        False         8s      Cluster version is 4.5.40

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.35                              True        False         3m53s   Cluster version is 4.6.35

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.17                              True        False         3m44s   Cluster version is 4.7.17
        
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-14-145150   True        False         12h     Cluster version is 4.8.0-0.nightly-2021-06-14-145150

$ oc get pods -A | grep csi
openshift-cluster-csi-drivers                      manila-csi-driver-operator-5f55dbb4cb-772r4               1/1     Running             0          14h
openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-controller-78f5ff7789-hrp5p   9/9     Running             0          14h
openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-bbxrc                    2/2     Running             2          14h
openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-gtljz                    2/2     Running             2          14h
openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-rwq56                    2/2     Running             2          14h
openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-w5k92                    2/2     Running             2          14h
openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-node-x86fc                    2/2     Running             2          14h
openshift-cluster-csi-drivers                      openstack-cinder-csi-driver-operator-69579789f5-5p8g8     1/1     Running             0          14h
openshift-cluster-storage-operator                 csi-snapshot-controller-84f9687fc6-6rklp                  1/1     Running             0          14h
openshift-cluster-storage-operator                 csi-snapshot-controller-84f9687fc6-jlg7n                  1/1     Running             0          14h
openshift-cluster-storage-operator                 csi-snapshot-controller-operator-694786cd9-v6v9v          1/1     Running             0          14h
openshift-cluster-storage-operator                 csi-snapshot-webhook-6b7dbc67d4-cm9dw                     1/1     Running             0          14h
openshift-cluster-storage-operator                 csi-snapshot-webhook-6b7dbc67d4-jjq7h                     1/1     Running             0          14h


$ oc get secret -n openshift-cluster-csi-drivers openstack-cloud-credentials -o yaml
apiVersion: v1
data:
  clouds.yaml: <content>
kind: Secret
...

$ echo "<content>" | base64 -d
clouds:
    openstack:
        auth:
            application_credential_id: ""
            application_credential_name: ""
            application_credential_secret: ""
            auth_url: http://10.0.0.104:5000
            default_domain: ""
            domain_id: ""
            domain_name: ""
            password: hidden
            project_domain_id: ""
            project_domain_name: Default
            project_id: a2de4b65f83341d1942c201750fffdf6
            project_name: shiftstack
            token: ""
            user_domain_id: ""
            user_domain_name: Default
            user_id: ""
            username: shiftstack_user
        auth_type: ""
        cert: ""
        cloud: ""
        identity_api_version: "3"
        key: ""
        profile: ""
        region_name: regionOne
        regions: null
        verify: true
        volume_api_version: ""


The cluster is ok after the upgrade chain.
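
The manual inspection of the secret above could also be scripted. The following is a rough sketch under stated assumptions: the base64 string from the secret's `clouds.yaml` field is passed in directly, the helper name `cacert_value` is hypothetical, and a simple line scan is used instead of a real YAML parser so that only the Python standard library is needed.

```python
import base64
import re


def cacert_value(encoded_clouds_yaml: str):
    """Return the cacert value from a base64-encoded clouds.yaml.

    Returns None when no cacert key is present, and an empty string
    when the key is present but empty (for example `cacert: ""`).
    This is a naive line scan, not a full YAML parse.
    """
    text = base64.b64decode(encoded_clouds_yaml).decode("utf-8")
    for line in text.splitlines():
        match = re.match(r"^\s*cacert:\s*(.*)$", line)
        if match:
            # Drop surrounding whitespace and quotes from the scalar.
            return match.group(1).strip().strip("\"'")
    return None
```

A result of `None` corresponds to the secret shown in this comment, where the key is absent; an empty string corresponds to the `cacert: ""` case described in comment 2.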

Comment 9 errata-xmlrpc 2021-07-27 23:03:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438