Description of problem:
We have a chained upgrade 4.2 -> 4.3 -> 4.4 -> 4.5 -> 4.6 -> 4.7 -> 4.8, which failed at the 4.7 -> 4.8.0-0.nightly-2021-04-21-131512 step. The storage CO is not successfully rolled out.

must-gather log: http://virt-openshift-05.lab.eng.nay.redhat.com/wduan/logs/must-gather.local.5100242347722931169.tar.gz

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-04-21-093400   True        True          3h      Unable to apply 4.8.0-0.nightly-2021-04-21-131512: the cluster operator storage has not yet successfully rolled out

From the CSO log:

2021-04-22T05:00:57.296086215Z I0422 05:00:57.293677 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-storage-operator", Name:"cluster-storage-operator", UID:"b2db9af5-a2ed-11eb-bd49-fa163e847ddc", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/storage changed: Progressing message changed from "ManilaCSIDriverOperatorCRProgressing: ManilaDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods\nOpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods\nOpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods" to "OpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods\nOpenStackCinderCSIDriverOperatorCRProgressing: OpenStackCinderDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods"

From the csi-driver log:

2021-04-22T06:56:03.105512023Z I0422 06:56:03.105456 1 openstack.go:130] InitOpenStackProvider configFile: /etc/kubernetes/config/cloud.conf
2021-04-22T06:56:03.106889195Z I0422 06:56:03.106822 1 openstack.go:88] Block storage opts: {0 false false}
2021-04-22T06:56:03.107006064Z W0422 06:56:03.106963 1 main.go:108] Failed to GetOpenStackProvider: failed to read and parse /etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem certificate: open /etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem: no such file or directory

From the csi-provisioner log:

2021-04-22T04:50:43.149940243Z I0422 04:50:43.149830 1 csi-provisioner.go:155] Building kube configs for running in cluster...
2021-04-22T04:50:43.180958482Z I0422 04:50:43.180881 1 connection.go:153] Connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
2021-04-22T04:50:53.181856106Z W0422 04:50:53.181806 1 connection.go:172] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
2021-04-22T04:51:03.182074976Z W0422 04:51:03.181951 1 connection.go:172] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
2021-04-22T04:51:13.181783506Z W0422 04:51:13.181689 1 connection.go:172] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
2021-04-22T04:51:23.181836266Z W0422 04:51:23.181776 1 connection.go:172] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
2021-04-22T04:51:33.181890408Z W0422 04:51:33.181796 1 connection.go:172] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

Version-Release number of selected component (if applicable):
Upgrade from 4.7 to 4.8.0-0.nightly-2021-04-21-131512

How reproducible:
Upgrade from 4.2 -> 4.3 -> 4.4 -> 4.5 -> 4.6 -> 4.7 -> 4.8

Steps to Reproduce:
1.
2.
3.
Actual results:
Storage CO is not available during the upgrade.

Expected results:
Storage CO should be available during the upgrade.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
The cause of this issue is that a long time ago there was a bug in gophercloud/utils which caused incorrect YAML marshaling of empty fields. I fixed that bug in https://github.com/gophercloud/utils/pull/100, but the fix was only available starting with 4.3. This means that on 4.2, if cacert was not in the original clouds.yaml, the utils produced `cacert: ""` in the system clouds.yaml, which by itself is technically not a problem. But the check introduced in https://github.com/openshift/cloud-credential-operator/pull/314 only verifies that the key (cacert) exists in the file, ignoring the fact that it can be empty. To fix the issue we need to ignore an empty cacert.
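For illustration, here is a minimal Go sketch of both halves of the problem. This is not the actual gophercloud/utils or cloud-credential-operator code; the type and function names are made up, and the structs are reduced to the one field that matters:

package main

import (
	"fmt"

	yaml "gopkg.in/yaml.v2"
)

// Without omitempty, yaml.Marshal emits `cacert: ""` for an unset field --
// the behavior the old (pre-fix) marshaling had on 4.2 clusters.
type buggyCloud struct {
	CACert string `yaml:"cacert"`
}

// With omitempty, an unset cacert is dropped from the output entirely,
// which is what the fixed marshaling (4.3 and later) produces.
type fixedCloud struct {
	CACert string `yaml:"cacert,omitempty"`
}

// hasCACert mimics the presence-only check: it only asks whether the key
// exists, so `cacert: ""` from a 4.2-generated clouds.yaml counts as "set".
func hasCACert(data []byte) bool {
	var m map[string]interface{}
	if err := yaml.Unmarshal(data, &m); err != nil {
		return false
	}
	_, ok := m["cacert"]
	return ok
}

// hasNonEmptyCACert is the proposed fix: treat an empty value as absent.
func hasNonEmptyCACert(data []byte) bool {
	var m map[string]string
	if err := yaml.Unmarshal(data, &m); err != nil {
		return false
	}
	return m["cacert"] != ""
}

func main() {
	buggy, _ := yaml.Marshal(buggyCloud{})
	fixed, _ := yaml.Marshal(fixedCloud{})
	fmt.Printf("old marshaler: %q\n", buggy) // "cacert: \"\"\n"
	fmt.Printf("new marshaler: %q\n", fixed) // "{}\n"

	fmt.Println(hasCACert(buggy))         // true  -> CA bundle wrongly assumed to exist
	fmt.Println(hasNonEmptyCACert(buggy)) // false -> empty cacert correctly ignored
}

With the presence-only check, the 4.2-era `cacert: ""` leads the operator to point the CSI driver at a CA bundle that was never written, which is exactly the "ca-bundle.pem: no such file or directory" failure in the csi-driver log above.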
I'm going to post an alternative patch shortly.
Doc text not required, as this bug was never released.
Verified in 4.8.0-0.nightly-2021-06-14-145150 on top of OSP 16.1.6 (RHOS-16.1-RHEL-8-20210506.n.1), after the following upgrade chain:

4.2.0-0.nightly-2021-02-22-141219 -> 4.3.40 -> 4.4.33 -> 4.5.40 -> 4.6.35 -> 4.7.17 -> 4.8.0-0.nightly-2021-06-14-145150

The underlying OSP doesn't have SSL enabled and the clouds.yaml doesn't contain the cacert param:

clouds:
  ...
  shiftstack:
    auth:
      auth_url: http://10.0.0.104:5000
      password: hidden
      project_domain_name: Default
      project_name: shiftstack
      project_id: a2de4b65f83341d1942c201750fffdf6
      user_domain_name: Default
      username: shiftstack_user
    identity_api_version: '3'
    region_name: regionOne

$ openstack server list
+--------------------------------------+---------------------------+--------+-------------------------------------+-------+--------+
| ID                                   | Name                      | Status | Networks                            | Image | Flavor |
+--------------------------------------+---------------------------+--------+-------------------------------------+-------+--------+
| e52f1172-dfb1-4328-bb47-db2ebf14ead8 | ostest-28klf-worker-lx9bv | ACTIVE | ostest-28klf-openshift=10.196.2.211 | rhcos |        |
| b9e7e76e-e74e-4206-99d2-c94d9b2b4e3d | ostest-28klf-worker-2fvwv | ACTIVE | ostest-28klf-openshift=10.196.0.140 | rhcos |        |
| 32f09721-a9e5-4e3a-8092-ba127f1e7609 | ostest-28klf-master-0     | ACTIVE | ostest-28klf-openshift=10.196.0.47  | rhcos |        |
| 155849f7-27d8-45d1-99be-4d837252acdc | ostest-28klf-master-1     | ACTIVE | ostest-28klf-openshift=10.196.0.32  | rhcos |        |
| ac37f988-b9cb-433e-be21-4f5328a66cd8 | ostest-28klf-master-2     | ACTIVE | ostest-28klf-openshift=10.196.1.107 | rhcos |        |
+--------------------------------------+---------------------------+--------+-------------------------------------+-------+--------+

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2021-02-22-141219   True        False         60s     Cluster version is 4.2.0-0.nightly-2021-02-22-141219

Upgrades:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.40    True        False         3m49s   Cluster version is 4.3.40

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.33    True        False         3m33s   Cluster version is 4.4.33

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.40    True        False         8s      Cluster version is 4.5.40

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.35    True        False         3m53s   Cluster version is 4.6.35

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.17    True        False         3m44s   Cluster version is 4.7.17

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-14-145150   True        False         12h     Cluster version is 4.8.0-0.nightly-2021-06-14-145150

$ oc get pods -A | grep csi
openshift-cluster-csi-drivers          manila-csi-driver-operator-5f55dbb4cb-772r4               1/1   Running   0   14h
openshift-cluster-csi-drivers          openstack-cinder-csi-driver-controller-78f5ff7789-hrp5p   9/9   Running   0   14h
openshift-cluster-csi-drivers          openstack-cinder-csi-driver-node-bbxrc                    2/2   Running   2   14h
openshift-cluster-csi-drivers          openstack-cinder-csi-driver-node-gtljz                    2/2   Running   2   14h
openshift-cluster-csi-drivers          openstack-cinder-csi-driver-node-rwq56                    2/2   Running   2   14h
openshift-cluster-csi-drivers          openstack-cinder-csi-driver-node-w5k92                    2/2   Running   2   14h
openshift-cluster-csi-drivers          openstack-cinder-csi-driver-node-x86fc                    2/2   Running   2   14h
openshift-cluster-csi-drivers          openstack-cinder-csi-driver-operator-69579789f5-5p8g8     1/1   Running   0   14h
openshift-cluster-storage-operator     csi-snapshot-controller-84f9687fc6-6rklp                  1/1   Running   0   14h
openshift-cluster-storage-operator     csi-snapshot-controller-84f9687fc6-jlg7n                  1/1   Running   0   14h
openshift-cluster-storage-operator     csi-snapshot-controller-operator-694786cd9-v6v9v          1/1   Running   0   14h
openshift-cluster-storage-operator     csi-snapshot-webhook-6b7dbc67d4-cm9dw                     1/1   Running   0   14h
openshift-cluster-storage-operator     csi-snapshot-webhook-6b7dbc67d4-jjq7h                     1/1   Running   0   14h

$ oc get secret -n openshift-cluster-csi-drivers openstack-cloud-credentials -o yaml
apiVersion: v1
data:
  clouds.yaml: <content>
kind: Secret
...

$ echo "<content>" | base64 -d
clouds:
  openstack:
    auth:
      application_credential_id: ""
      application_credential_name: ""
      application_credential_secret: ""
      auth_url: http://10.0.0.104:5000
      default_domain: ""
      domain_id: ""
      domain_name: ""
      password: hidden
      project_domain_id: ""
      project_domain_name: Default
      project_id: a2de4b65f83341d1942c201750fffdf6
      project_name: shiftstack
      token: ""
      user_domain_id: ""
      user_domain_name: Default
      user_id: ""
      username: shiftstack_user
    auth_type: ""
    cert: ""
    cloud: ""
    identity_api_version: "3"
    key: ""
    profile: ""
    region_name: regionOne
    regions: null
    verify: true
    volume_api_version: ""

Note that the decoded clouds.yaml contains no cacert key. The cluster is ok after the upgrade chain.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438