Bug 1999018 - [ASH] upgrade stuck due to Cluster cloud controller manager deployment strategy error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-30 09:16 UTC by Milind Yadav
Modified: 2022-04-11 08:33 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1997507
Environment:
job=periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node=all job=periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node=all
Last Closed: 2021-10-18 17:49:46 UTC
Target Upstream Version:
Embargoed:


Links
System                   ID                                                            Private   Priority   Status   Summary   Last Updated
Github                   openshift cluster-cloud-controller-manager-operator pull 111  0         None       None     None      2021-08-31 08:42:58 UTC
Red Hat Product Errata   RHSA-2021:3759                                                0         None       None     None      2021-10-18 17:49:58 UTC

Comment 4 sunzhaohua 2021-09-02 01:26:45 UTC
Verified
clusterversion: 4.9.0-0.nightly-2021-08-31-123131
Upgraded from 4.9.0-0.nightly-2021-08-29-010334 to 4.9.0-0.nightly-2021-08-31-123131; the upgrade was successful.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-31-123131   True        False         22m     Cluster version is 4.9.0-0.nightly-2021-08-31-123131
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-08-31-123131   True        False         False      82m
baremetal                                  4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
cloud-controller-manager                   4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
cloud-credential                           4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
cluster-autoscaler                         4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
config-operator                            4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
console                                    4.9.0-0.nightly-2021-08-31-123131   True        False         False      52m
csi-snapshot-controller                    4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
dns                                        4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
etcd                                       4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
image-registry                             4.9.0-0.nightly-2021-08-31-123131   True        False         False      16h
ingress                                    4.9.0-0.nightly-2021-08-31-123131   True        False         False      33h
insights                                   4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
kube-apiserver                             4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
kube-controller-manager                    4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
kube-scheduler                             4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
kube-storage-version-migrator              4.9.0-0.nightly-2021-08-31-123131   True        False         False      54m
machine-api                                4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
machine-approver                           4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
machine-config                             4.9.0-0.nightly-2021-08-31-123131   True        False         False      24m
marketplace                                4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
monitoring                                 4.9.0-0.nightly-2021-08-31-123131   True        False         False      33h
network                                    4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
node-tuning                                4.9.0-0.nightly-2021-08-31-123131   True        False         False      150m
openshift-apiserver                        4.9.0-0.nightly-2021-08-31-123131   True        False         False      84m
openshift-controller-manager               4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
openshift-samples                          4.9.0-0.nightly-2021-08-31-123131   True        False         False      150m
operator-lifecycle-manager                 4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-08-31-123131   True        False         False      24h
service-ca                                 4.9.0-0.nightly-2021-08-31-123131   True        False         False      34h
storage                                    4.9.0-0.nightly-2021-08-31-123131   False       True          False      34s     AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service
$ oc get node
NAME                         STATUS   ROLES    AGE   VERSION
mgahaganash-vc5vm-master-0   Ready    master   34h   v1.22.0-rc.0+1199c36
mgahaganash-vc5vm-master-1   Ready    master   34h   v1.22.0-rc.0+1199c36
mgahaganash-vc5vm-master-2   Ready    master   34h   v1.22.0-rc.0+1199c36
mgahaganash-vc5vm-worker-0   Ready    worker   33h   v1.22.0-rc.0+1199c36
mgahaganash-vc5vm-worker-1   Ready    worker   33h   v1.22.0-rc.0+1199c36
mgahaganash-vc5vm-worker-2   Ready    worker   33h   v1.22.0-rc.0+1199c36


The unavailable storage operator is tracked in a separate storage bug: https://bugzilla.redhat.com/show_bug.cgi?id=1992875
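
In a listing this long, an operator that has not settled (here, storage) is easy to miss. The following is a minimal sketch of a filter for that, assuming the default `oc get co` column order shown above (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED). The function name `unhealthy_operators` is hypothetical and not part of `oc`.

```shell
# Hypothetical helper: print the names of cluster operators that are not
# Available=True, Progressing=False, Degraded=False.
# Assumes the default column order of `oc get co` and a non-empty VERSION column.
unhealthy_operators() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False" || $5 != "False") { print $1 }'
}

# Header plus two rows copied from the listing above, used as sample input.
sample='NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.9.0-0.nightly-2021-08-31-123131   True        False         False      82m
storage          4.9.0-0.nightly-2021-08-31-123131   False       True          False      34s     AzureDiskCSIDriverOperatorCRAvailable: AzureDiskDriverNodeServiceControllerAvailable: Waiting for the DaemonSet to deploy the CSI Node Service'

printf '%s\n' "$sample" | unhealthy_operators   # prints: storage
```

On a live cluster the same filter could be fed directly with `oc get co | unhealthy_operators`.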

Comment 5 W. Trevor King 2021-09-20 18:27:56 UTC
Mike Fiedler added UpgradeBlocker, which triggers the following (delayed, sorry :/) impact-statement request:

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.  The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.  Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug.  When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label.  The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?  Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this; we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Comment 6 Joel Speed 2021-09-22 10:28:36 UTC
> Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?

Anyone using SNO (single-node OpenShift) would be impacted.

> What is the impact?  Is it serious enough to warrant blocking edges?

If someone upgraded from 4.8 to 4.9 on SNO, this issue would have blocked the upgrade completely.
As SNO was TP in 4.8, I don't think we will actually have any SNO upgrades.
As this is in 4.9.0, there should be no issues with edge blocking.

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

The user must manually edit the `cluster-cloud-controller-manager-operator` deployment to update its strategy.
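
A rough sketch of what that manual edit could look like with `oc patch` follows. This bug does not say which strategy value is needed or which namespace the deployment lives in, so both are assumptions here: `Recreate` as the target strategy type and `openshift-cloud-controller-manager-operator` as the namespace. The helper `strategy_patch` and the `APPLY` switch are hypothetical names used only for illustration.

```shell
# ASSUMPTION: namespace of the operator deployment; not stated in this bug.
NAMESPACE="${NAMESPACE:-openshift-cloud-controller-manager-operator}"
DEPLOYMENT="cluster-cloud-controller-manager-operator"

# Hypothetical helper: build a JSON merge patch that sets the strategy type
# and clears leftover rollingUpdate parameters (null removes the key).
strategy_patch() {
  printf '{"spec":{"strategy":{"type":"%s","rollingUpdate":null}}}' "$1"
}

# ASSUMPTION: "Recreate" is the desired strategy type; adjust as needed.
PATCH="$(strategy_patch Recreate)"

# Only print the command by default; set APPLY=1 to run it against the cluster.
if [ "${APPLY:-0}" = "1" ]; then
  # Show the current strategy, then apply the merge patch.
  oc -n "$NAMESPACE" get deployment "$DEPLOYMENT" -o jsonpath='{.spec.strategy}{"\n"}'
  oc -n "$NAMESPACE" patch deployment "$DEPLOYMENT" --type=merge -p "$PATCH"
else
  echo "oc -n $NAMESPACE patch deployment $DEPLOYMENT --type=merge -p '$PATCH'"
fi
```

Printing the command first gives a chance to review the patch before it touches the deployment.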

> Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

No, it has always been like this. The issue was only noticed now because someone tried an SNO upgrade.

As far as I'm aware, SNO isn't GA until 4.9 anyway, so this shouldn't be an issue.

Comment 7 W. Trevor King 2021-09-22 19:29:10 UTC
(In reply to Joel Speed from comment #6)
> If someone upgraded from 4.8 to 4.9 on SNO, this issue would have
> blocked the upgrade completely.
> As SNO was TP in 4.8, I don't think we will actually have any SNO upgrades.
> As this is in 4.9.0, there should be no issues with edge blocking.

Makes sense to me.  I'm dropping UpgradeBlocker, because we don't need to block edges in graph-data over this.

Comment 10 errata-xmlrpc 2021-10-18 17:49:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

