Bug 1955517

Summary: Failed to upgrade from 4.6.25 to 4.7.8 due to machine-config degradation

Product: OpenShift Container Platform
Reporter: oarribas <oarribas>
Component: Node
Node sub component: CRI-O
Assignee: Qi Wang <qiwan>
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Severity: high
Priority: medium
CC: aacostab, aos-bugs, jerzhang, lmohanty, minmli, morgan.peterman, nagrawal, oarribas, qiwan, rphillips, umohnani, wking
Version: 4.7
Keywords: Upgrades
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Clones: 1964568
Bug Blocks: 1964568
Last Closed: 2021-07-27 23:05:17 UTC

Doc Type: Bug Fix
Doc Text:
Cause: When upgrading from 4.6 with more than one kubeletconfig or ctrcfg configuration created under the old naming scheme (which does not support a machine config pool name suffix), the MCO generates duplicate machine configs for the same configuration.
Consequence: The upgrade fails.
Fix: Clean up the outdated, duplicated machine configs.
Result: Upgrades from 4.6 to 4.7 succeed.

Description oarribas 2021-04-30 09:50:41 UTC
Description of problem:

machine-config cluster operator degraded due to controller version mismatch

~~~
message: 'Unable to apply 4.7.8: timed out waiting for the condition during syncRequiredMachineConfigPools:
      pool master has not progressed to latest configuration: controller version mismatch
      for 99-master-generated-kubelet expected 116603ff3d7a39c0de52d7d16fe307c8471330a0
      has d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af: all 3 nodes are at latest configuration
      rendered-master-1063d0f1a527696ab1c13e7b6a9f09f0, retrying'
~~~


Version-Release number of selected component (if applicable):

Upgrade from 4.6.25 to 4.7.8



Actual results:

~~~
$ oc get clusterversion
NAME     VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version           True       True         11h    Working towards 4.7.8: 560 of 668 done (83% complete), waiting on machine-config
~~~

~~~
$ oc get co machine-config -o yaml
[...]
  - lastTransitionTime: '2021-04-29T21:12:31Z'
    message: 'Unable to apply 4.7.8: timed out waiting for the condition during syncRequiredMachineConfigPools:
      pool master has not progressed to latest configuration: controller version mismatch
      for 99-master-generated-kubelet expected 116603ff3d7a39c0de52d7d16fe307c8471330a0
      has d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af: all 3 nodes are at latest configuration
      rendered-master-1063d0f1a527696ab1c13e7b6a9f09f0, retrying'
    reason: RequiredPoolsFailed
    status: 'True'
    type: Degraded
[...]
  extension:
    infra: all 3 nodes are at latest configuration rendered-infra-8c3fb0d33b3fee705b6569a90ac5fb5d
    lastSyncError: 'pool master has not progressed to latest configuration: controller
      version mismatch for 99-master-generated-kubelet expected 116603ff3d7a39c0de52d7d16fe307c8471330a0
      has d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af: all 3 nodes are at latest configuration
      rendered-master-1063d0f1a527696ab1c13e7b6a9f09f0, retrying'
    master: all 3 nodes are at latest configuration rendered-master-1063d0f1a527696ab1c13e7b6a9f09f0
    worker: all 2 nodes are at latest configuration rendered-worker-8c3fb0d33b3fee705b6569a90ac5fb5d
~~~

The `99-xxxx-generated-kubelet` machine configs are duplicated:
~~~
$ oc get mc | grep kubelet
01-master-kubelet                                 116603ff3d7a39c0de52d7d16fe307c8471330a0  3.2.0            249d
01-worker-kubelet                                 116603ff3d7a39c0de52d7d16fe307c8471330a0  3.2.0            249d
99-infra-generated-kubelet                        d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af  3.1.0            14h
99-infra-generated-kubelet-1                      116603ff3d7a39c0de52d7d16fe307c8471330a0  3.2.0            11h
99-master-generated-kubelet                       d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af  3.1.0            14h
99-master-generated-kubelet-1                     116603ff3d7a39c0de52d7d16fe307c8471330a0  3.2.0            11h
99-worker-generated-kubelet                       d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af  3.1.0            14h
99-worker-generated-kubelet-1                     116603ff3d7a39c0de52d7d16fe307c8471330a0  3.2.0            11h
~~~
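(For reference, a quick way to confirm which copy is outdated is to compare the GENERATEDBYCONTROLLER hash above with the controller-version annotation on the MachineConfig; the annotation name below is the standard MCO bookkeeping annotation, shown here only as an illustrative check.)

~~~
# Show the generated kubelet/containerruntime MCs and the controller hash that produced them
oc get mc | grep -E 'generated-(kubelet|containerruntime)'

# Read the controller-version annotation on a single MachineConfig; this is the value
# reported as "has ..." in the "controller version mismatch" error above
oc get mc 99-master-generated-kubelet \
  -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/generated-by-controller-version}{"\n"}'
~~~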




Expected results:

Successful upgrade.


Additional info:


Similar to BZ 1953627

Comment 2 Yu Qi Zhang 2021-05-04 00:19:06 UTC
We did change that generation behaviour, although I am not sure if we deleted the old ones. If not, this may be an upgrade blocker. Raising severity and cc'ing Urvashi, who provided the original fix.

Comment 5 Lalatendu Mohanty 2021-05-06 20:28:20 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 6 W. Trevor King 2021-05-06 20:32:39 UTC
Adding the ImpactStatementRequested keyword, per [1].  When you supply an impact statement, please replace it with ImpactStatementProposed.

[1]: https://github.com/openshift/enhancements/pull/475

Comment 7 Qi Wang 2021-05-06 20:46:52 UTC
To get out of the stuck upgrade, the customer can run `oc delete mc/99-worker-generated-kubelet` (and likewise for 99-master-generated-kubelet and 99-infra-generated-kubelet) to delete the duplicate, degraded machine configs.

Basically, this bug can be reproduced on clusters that have more than one kubeletconfig or ctrcfg CR created in 4.6 and then upgrade to 4.7. We are working on a fix for it: https://github.com/openshift/machine-config-operator/pull/2570.
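A minimal sketch of that workaround for the cluster shown in the description (master, worker, and infra pools); adjust the names to whatever stale, unsuffixed generated MCs appear in your own `oc get mc` output:

~~~
# Delete only the stale, unsuffixed generated MachineConfigs (the ones still at ignition 3.1.0).
# Keep the new suffixed copies (e.g. 99-master-generated-kubelet-1) and all rendered-* MCs.
oc delete mc 99-master-generated-kubelet 99-worker-generated-kubelet 99-infra-generated-kubelet
~~~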

Comment 8 Urvashi Mohnani 2021-05-06 20:50:13 UTC
@Lalatendu, filled out upgrade blocker assessment.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?

  This is an issue in the 4.6 to 4.7 upgrade path, but only affects clusters that have more than 1 kubeletconfig CR or ctrcfg CR created.

What is the impact?  Is it serious enough to warrant blocking edges?

  Upgrades will be stuck and will show the machine-config-operator as degraded as the MC with the name 99-[pool]-generated-[kubelet or containerruntime]
will still be on the old controller version (3.1). It will remain in this stuck state until an admin goes and deletes the old MCs.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

  Admin uses oc to fix things by deleting the old MC still using the old controller version. The fix is to run `oc delete mc 99-[pool]-generated-[kubelet or containerruntime]`.
Once this fix is applied, the machine-config-operator upgrade should progress within a matter of minutes and the upgrade will be successful.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

  No, this is not a regression. The MC generation logic was changed in 4.7 to properly account for multiple kubeletconfig or ctrcfg CRs, but didn't account for when 4.6 already had multiple kubeletconfig or ctrcfg CRs before upgrading. Fix for this is in https://github.com/openshift/machine-config-operator/pull/2570 and is still being tested.

Comment 9 Urvashi Mohnani 2021-05-06 21:28:36 UTC
(In reply to Urvashi Mohnani from comment #8)
> @Lalatendu, filled out upgrade blocker assessment.
> 
> Who is impacted?  If we have to block upgrade edges based on this issue,
> which edges would need blocking?
> 
>   This is an issue in the 4.6 to 4.7 upgrade path, but only affects clusters
> that have more than 1 kubeletconfig CR or ctrcfg CR created.
> 
> What is the impact?  Is it serious enough to warrant blocking edges?
> 
>   Upgrades will be stuck and will show the machine-config-operator as
> degraded as the MC with the name 99-[pool]-generated-[kubelet or
> containerruntime]
> will still be on the old controller version (3.1). It will remain in this
> stuck state until an admin goes and deletes the old MCs.
> 
> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?
> 
>   Admin uses oc to fix things by deleting the old MC still using the old
> controller version. The fix is to run `oc delete mc
> 99-[pool]-generated-[kubelet or containerruntime].

Adding to this: For kubeletconfig CR, only delete the 99-worker-generated-kubelet and/or 99-master-generated-kubelet MCs. For the ctrcfg CR, only delete the 99-worker-generated-containerruntime and/or 99-master-generated-containerruntime MCs. We are deleting the old generated MCs in favor of the new generated MCs (new ones will have the updated ignition version).
Please note: DO NOT delete any **rendered** MCs.
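A hedged sketch of that guidance (assuming, as in this bug, that the stale copies are exactly the unsuffixed `99-<pool>-generated-*` names); it only lists candidates and leaves the actual delete commented out:

~~~
# List only the unsuffixed generated kubelet/containerruntime MCs. rendered-* MCs and the new
# suffixed copies (e.g. ...-kubelet-1) do not match the anchored pattern and are never touched.
for mc in $(oc get mc -o name | grep -E '99-.*-generated-(kubelet|containerruntime)$'); do
  echo "stale candidate: $mc"
  # oc delete "$mc"   # uncomment only after confirming these carry the old controller version
done
~~~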


Comment 10 W. Trevor King 2021-05-06 22:00:06 UTC
We peered into the Telemetry crystal ball for a bit, but we don't report Telemetry for the KubeletConfig or ContainerRuntimeConfig settings that matter here [1]. We do collect ContainerRuntimeConfigs in Insights since 4.6.18 [2], and Insights tarballs will have the ClusterOperator condition message entries where we can look for the 'controller version mismatch for 99-[pool]-generated-[kubelet or containerruntime] expected' messages. But I'm not set up to run that Insights analysis. We can still put an upper bound on the frequency in Telemetry by looking at folks who are currently reporting Degraded=True during updates to 4.7, and from that it doesn't seem too common.

Also, the impact is mostly the stuck update, which isn't regressing you much below your cluster health when you were running 4.6. There may be some degradation because the pool cannot roll out things like Kube-API X.509 certificates, and I get a bit fuzzy on how much time you have before that becomes a big concern. But we should have a fix out for this bug before you need to worry about expiring certs.

I'm dropping UpgradeBlocker based on:

* Seems somewhat rare, based on some imprecise Telemetry estimates.
* Doesn't seem like a big hit to cluster health, based on comment 8.
* Seems straightforward to mitigate based on comments 7 through 9.
* Even if folks don't find this bug with its mitigation steps, updating a stuck cluster to one with the fix that will come with this bug series will allow their cluster to self-heal.

Feel free to push back if it seems like that assessment is off :)

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/4d6bf3d9ed8187ed13854fce3d75d32a0525b1db/Documentation/data-collection.md
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1891544#c6

Comment 18 errata-xmlrpc 2021-07-27 23:05:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438