1940142 – 4.6->4.7 updates stick on OpenStackCinderCSIDriverOperatorCR_OpenStackCinderDriverControllerServiceController_Deploying

Bug 1940142 - 4.6->4.7 updates stick on OpenStackCinderCSIDriverOperatorCR_OpenStackCinderDriverControllerServiceController_Deploying

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Credential Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Mike Fedosin
QA Contact:	wang lin
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1940395 2027597
Depends On:
Blocks:

Reported:	2021-03-17 17:03 UTC by W. Trevor King
Modified:	2023-09-15 01:03 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Known Issue
Doc Text:	Who is impacted? Customers that deployed an OCP cluster version <4.6 on OpenStack with self-signed certificates can't upgrade to 4.7. What is the impact? The Cinder CSI driver gets an incorrect CA cert path from the clouds.yaml file and can't start. How involved is remediation? The immediate workaround is to manually modify the `clouds.yaml` key in the `openstack-credentials` secret in the `kube-system` namespace, replacing `cacert: <some value>` with `cacert: /etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem`. The long-term solution is to update CCO to generate a correct clouds.yaml. Is this a regression? The issue happens only when upgrading from 4.6 to 4.7; other versions are not affected.
Clone Of:
Environment:
Last Closed:	2021-07-27 22:53:56 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cloud-credential-operator pull 314	0	None	open	Bug 1940142: Correct incorrect CACert in secrets created prior to 4.6	2021-03-31 03:57:53 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:54:17 UTC

Description W. Trevor King 2021-03-17 17:03:43 UTC

4.6 fixed bug 1884558 around a broken cacert file path by bumping the path in the installer [1].  But that didn't fix born-before-4.6 clusters that were initialized with the broken path.  Many OpenStack providers apparently work around the broken path, but when those clusters update to 4.7 and get the new Cinder CSI handler, the update sticks with the storage ClusterOperator reporting Available=False with:

  Reason: OpenStackCinderCSIDriverOperatorCR_OpenStackCinderDriverControllerServiceController_Deploying
  Message: OpenStackCinderCSIDriverOperatorCRAvailable: OpenStackCinderDriverControllerServiceControllerAvailable: Waiting for Deployment to deploy the CSI Controller Service

and:

  W0316 15:02:07.788864       1 main.go:108] Failed to GetOpenStackProvider: Post "https://.../v3/auth/tokens": x509: certificate signed by unknown authority

in the crash-looping csi-driver container.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.  The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.  Sample answers are provided to give more context and the UpgradeBlocker keyword has been added to this bug.  The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?  Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it’s always been like this we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1884558#c6

Comment 1 Mike Fedosin 2021-03-17 17:17:31 UTC

Who is impacted?
Customers that deployed an OCP cluster version <4.6 on OpenStack with self-signed certificates can't upgrade to 4.7.

What is the impact?
The Cinder CSI driver gets an incorrect CA cert path from the clouds.yaml file and can't start.

How involved is remediation?
The immediate workaround is to manually modify the `clouds.yaml` key in the `openstack-credentials` secret in the `kube-system` namespace, replacing `cacert: <some value>` with `cacert: /etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem`. The long-term solution is to update CCO to generate a correct clouds.yaml.

Is this a regression?
The issue happens only when upgrading from 4.6 to 4.7; other versions are not affected.
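
For anyone applying the workaround above by hand, here is a rough sketch of one way to do it from a shell. The function names `fix_cacert` and `patch_openstack_credentials` are made up for illustration and are not part of any OpenShift tooling; the secret name, namespace, key, and target path are the ones given in this comment. It assumes `oc`, `sed`, and `base64` are available, and it has not been validated against a real affected cluster, so review the rewritten clouds.yaml before patching.

```shell
# Sketch only: helper names below are hypothetical.
NEW_CACERT=/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem

# Read a clouds.yaml on stdin and rewrite the value of every `cacert:` key,
# keeping the original indentation of the line.
fix_cacert() {
  sed -E "s|^([[:space:]]*cacert:).*|\1 ${NEW_CACERT}|"
}

# Fetch the secret, rewrite the path, and write the result back.
# Defined but not invoked here; run it manually with cluster-admin access.
patch_openstack_credentials() {
  local fixed
  fixed=$(oc get secret -n kube-system openstack-credentials \
            -o jsonpath='{.data.clouds\.yaml}' \
          | base64 -d | fix_cacert | base64 | tr -d '\n')
  oc patch secret -n kube-system openstack-credentials --type merge \
    -p "{\"data\":{\"clouds.yaml\":\"${fixed}\"}}"
}
```

Piping a saved copy of the decoded clouds.yaml through `fix_cacert` first is a cheap way to check the result before calling `patch_openstack_credentials`.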

Comment 4 Mike Fedosin 2021-03-22 11:15:37 UTC

*** Bug 1940395 has been marked as a duplicate of this bug. ***

Comment 5 W. Trevor King 2021-03-31 04:01:33 UTC

I'm adding ImpactStatementProposed [1], because comment 1 gives us an impact statement, and we just need to make a call on whether we need to block edges to protect folks while we get this fix out.

[1]: https://github.com/openshift/enhancements/pull/475

Comment 6 Lalatendu Mohanty 2021-03-31 15:07:57 UTC

Without knowing the actual number or percentage of clusters that will be impacted, it is not possible to mark this as an upgrade blocker, as it is very specific to clusters on OpenStack with self-signed certificates.

Comment 7 W. Trevor King 2021-04-02 03:37:09 UTC

Ok, I'm going to say we don't block edges on this, but if folks hear about more of this sort of thing going on, we can revisit.

Comment 9 wang lin 2021-04-12 07:59:46 UTC

The CA cert path issue has been fixed in 4.8.0-0.nightly-2021-04-09-222447.

1. Install a cluster with a self-signed certificate on OpenStack.
2. Edit the openstack-credentials secret in the kube-system namespace, change the CA cert path to a wrong value, and save.
3. Check the openstack-credentials secret again and verify that the path has been reset to `/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem`:
oc get secret -n kube-system openstack-credentials -o json | jq -r ".data"
{
  "clouds.yaml": "BASE64 encode string"
}

clouds:
    openstack:
        auth:
            auth_url: XXXXXXXXXXX
            password: XXXXXXXX
            project_id: 75604224364d40f0b076625b139dc6e3
            project_name: shiftstack
            user_domain_name: Default
            username: shiftstack_user
        cacert: /etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem
        endpoint_type: public
        identity_api_version: "3"
        region_name: regionOne
        verify: true

4. The component secrets are the same as the root credential.
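
The check in step 3 could be scripted along these lines. This is a minimal sketch: `cacert_path` is a hypothetical helper name, and the commented pipeline assumes the same secret, namespace, and key used in the steps above.

```shell
# Hypothetical helper: print the value of the `cacert:` key
# from a clouds.yaml supplied on stdin.
cacert_path() {
  sed -nE 's|^[[:space:]]*cacert:[[:space:]]*(.*)$|\1|p'
}

# Usage against a live cluster (not run here):
#   oc get secret -n kube-system openstack-credentials \
#     -o jsonpath='{.data.clouds\.yaml}' | base64 -d | cacert_path
```

Running the same pipeline against a component secret and comparing the printed path with the root credential's would cover step 4 as well.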

Comment 13 errata-xmlrpc 2021-07-27 22:53:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 15 Pierre Prinetti 2021-12-07 15:58:43 UTC

*** Bug 2027597 has been marked as a duplicate of this bug. ***

Comment 16 Red Hat Bugzilla 2023-09-15 01:03:35 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
