Description of problem:

machine-config cluster operator degraded due to controller version mismatch

~~~
message: 'Unable to apply 4.7.8: timed out waiting for the condition during syncRequiredMachineConfigPools:
  pool master has not progressed to latest configuration: controller version mismatch
  for 99-master-generated-kubelet expected 116603ff3d7a39c0de52d7d16fe307c8471330a0
  has d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af: all 3 nodes are at latest configuration
  rendered-master-1063d0f1a527696ab1c13e7b6a9f09f0, retrying'
~~~

Version-Release number of selected component (if applicable):

Upgrade from 4.6.25 to 4.7.8

Actual results:

~~~
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             True        True          11h     Working towards 4.7.8: 560 of 668 done (83% complete), waiting on machine-config
~~~

~~~
$ oc get co machine-config -o yaml
[...]
  - lastTransitionTime: '2021-04-29T21:12:31Z'
    message: 'Unable to apply 4.7.8: timed out waiting for the condition during syncRequiredMachineConfigPools:
      pool master has not progressed to latest configuration: controller version mismatch
      for 99-master-generated-kubelet expected 116603ff3d7a39c0de52d7d16fe307c8471330a0
      has d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af: all 3 nodes are at latest configuration
      rendered-master-1063d0f1a527696ab1c13e7b6a9f09f0, retrying'
    reason: RequiredPoolsFailed
    status: 'True'
    type: Degraded
[...]
  extension:
    infra: all 3 nodes are at latest configuration rendered-infra-8c3fb0d33b3fee705b6569a90ac5fb5d
    lastSyncError: 'pool master has not progressed to latest configuration: controller
      version mismatch for 99-master-generated-kubelet expected 116603ff3d7a39c0de52d7d16fe307c8471330a0
      has d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af: all 3 nodes are at latest configuration
      rendered-master-1063d0f1a527696ab1c13e7b6a9f09f0, retrying'
    master: all 3 nodes are at latest configuration rendered-master-1063d0f1a527696ab1c13e7b6a9f09f0
    worker: all 2 nodes are at latest configuration rendered-worker-8c3fb0d33b3fee705b6569a90ac5fb5d
~~~

The `99-xxxx-generated-kubelet` machine configs are duplicated:

~~~
$ oc get mc | grep kubelet
01-master-kubelet               116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0   249d
01-worker-kubelet               116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0   249d
99-infra-generated-kubelet      d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af   3.1.0   14h
99-infra-generated-kubelet-1    116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0   11h
99-master-generated-kubelet     d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af   3.1.0   14h
99-master-generated-kubelet-1   116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0   11h
99-worker-generated-kubelet     d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af   3.1.0   14h
99-worker-generated-kubelet-1   116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0   11h
~~~

Expected results:

Successful upgrade.

Additional info:

Similar to BZ 1953627
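For anyone triaging another cluster against this report, a minimal sketch of a check is to pull the machine-config ClusterOperator's Degraded condition and look for the "controller version mismatch" message quoted above. The jsonpath expression below is an illustration, not something taken from this bug's attachments:

~~~
# Sketch: confirm the machine-config operator is Degraded with the
# "controller version mismatch" message described in this report.
$ oc get co machine-config -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'
~~~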
We did change that generation behaviour, although I am not sure if we deleted the old ones. If not, this may be an upgrade blocker. Raising severity and cc'ing Urvashi, who provided the original fix.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1
Adding the ImpactStatementRequested keyword, per [1]. When you supply an impact statement, please replace it with ImpactStatementProposed.

[1]: https://github.com/openshift/enhancements/pull/475
To get out of the stuck upgrade, the customer can run `oc delete mc/99-worker-generated-kubelet` (and likewise for 99-master-generated-kubelet and 99-infra-generated-kubelet) to delete the duplicate, degraded machine configs.

Basically, this bug can be reproduced when a cluster has more than one kubeletconfig or ctrcfg CR created in 4.6 and is then upgraded to 4.7. We are working on a fix for it: https://github.com/openshift/machine-config-operator/pull/2570.
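As a concrete illustration of that workaround on the cluster from the description (which has master, worker, and infra pools), a sketch follows. The exact names to delete depend on which pools actually show duplicated generated configs:

~~~
# Sketch only: delete the stale 4.6-generated kubelet MachineConfigs
# (the ones still showing controller version d5dc2b51... / ignition 3.1.0),
# leaving the new "-1" suffixed copies and all rendered-* configs in place.
$ oc delete mc 99-master-generated-kubelet
$ oc delete mc 99-worker-generated-kubelet
$ oc delete mc 99-infra-generated-kubelet
~~~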
@Lalatendu, filled out upgrade blocker assessment.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

This is an issue in the 4.6 to 4.7 upgrade path, but it only affects clusters that have more than one kubeletconfig CR or ctrcfg CR created.

What is the impact? Is it serious enough to warrant blocking edges?

Upgrades will be stuck and will show the machine-config-operator as degraded, as the MC with the name 99-[pool]-generated-[kubelet or containerruntime] will still be on the old controller version (3.1). It will remain in this stuck state until an admin goes and deletes the old MCs.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

Admin uses oc to fix things by deleting the old MC still using the old controller version. The fix is to run `oc delete mc 99-[pool]-generated-[kubelet or containerruntime]`. Once this fix is applied, the machine-config-operator upgrade should progress within a matter of minutes and the upgrade will be successful.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

No, this is not a regression. The MC generation logic was changed in 4.7 to properly account for multiple kubeletconfig or ctrcfg CRs, but it didn't account for 4.6 already having multiple kubeletconfig or ctrcfg CRs before upgrading. The fix for this is in https://github.com/openshift/machine-config-operator/pull/2570 and is still being tested.
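A minimal way to watch that recovery after deleting the old MCs, purely as a sketch (standard oc commands, not output taken from this bug's attachments):

~~~
# Sketch: after removing the stale MCs, confirm the master pool and the
# machine-config ClusterOperator progress again and the upgrade completes.
$ oc get mcp master
$ oc get co machine-config
$ oc get clusterversion
~~~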
(In reply to Urvashi Mohnani from comment #8)
> @Lalatendu, filled out upgrade blocker assessment.
> 
> Who is impacted? If we have to block upgrade edges based on this issue,
> which edges would need blocking?
> 
> This is an issue in the 4.6 to 4.7 upgrade path, but only affects clusters
> that have more than 1 kubeletconfig CR or ctrcfg CR created.
> 
> What is the impact? Is it serious enough to warrant blocking edges?
> 
> Upgrades will be stuck and will show the machine-config-operator as
> degraded as the MC with the name 99-[pool]-generated-[kubelet or
> containerruntime]
> will still be on the old controller version (3.1). It will remain in this
> stuck state until an admin goes and deletes the old MCs.
> 
> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?
> 
> Admin uses oc to fix things by deleting the old MC still using the old
> controller version. The fix is to run `oc delete mc
> 99-[pool]-generated-[kubelet or containerruntime].

Adding to this:

For the kubeletconfig CR, only delete the 99-worker-generated-kubelet and/or 99-master-generated-kubelet MCs. For the ctrcfg CR, only delete the 99-worker-generated-containerruntime and/or 99-master-generated-containerruntime MCs. We are deleting the old generated MCs in favor of the new generated MCs (the new ones will have the updated ignition version).

Please note: DO NOT delete any **rendered** MCs.

> Once this fix is applied, the machine-config-operator upgrade should
> progress within a matter of minutes and the upgrade will be successful.
> 
> Is this a regression (if all previous versions were also vulnerable,
> updating to the new, vulnerable version does not increase exposure)?
> 
> No, this is not a regression. The MC generation logic was changed in 4.7
> to properly account for multiple kubeletconfig or ctrcfg CRs, but didn't
> account for when 4.6 already had multiple kubeletconfig or ctrcfg CRs before
> upgrading. Fix for this is in
> https://github.com/openshift/machine-config-operator/pull/2570 and is still
> being tested.
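Putting that caution into a sketch of a command sequence (the grep pattern is an illustration of mine; the MC names come from this comment and only apply if the corresponding kubeletconfig/ctrcfg CRs exist on the cluster):

~~~
# Sketch: list only the controller-generated kubelet/containerruntime configs;
# the pattern deliberately does not match rendered-* MachineConfigs, which
# must never be deleted. Remove only the entries still on the old 4.6
# controller version / ignition 3.1.0.
$ oc get mc | grep -E '99-.*-generated-(kubelet|containerruntime)'

# For ctrcfg CR users, the MCs named in this comment would be deleted with:
$ oc delete mc 99-master-generated-containerruntime 99-worker-generated-containerruntime
~~~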
We peered into the Telemetry crystal ball for a bit, but we don't report Telemetry for the KubeletConfig or ContainerRuntimeConfig settings that matter here [1]. We do collect ContainerRuntimeConfigs in Insights since 4.6.18 [2], and Insights tarballs will have the ClusterOperator condition message entries where we can look for the 'controller version mismatch for 99-[pool]-generated-[kubelet or containerruntime] expected' messages. But I'm not set up to run that Insights analysis. We can still put an upper bound on the frequency in Telemetry by looking at folks who are currently reporting Degraded=True during updates to 4.7, and from that it doesn't seem too common.

Also, the impact is mostly the stuck update, which isn't regressing you much below your cluster health when you were running 4.6. There may be some degradation because the pool cannot roll out things like Kube-API X.509 certificates, and I get a bit fuzzy on how much time you have before that becomes a big concern. But we should have a fix out for this bug before you need to worry about expiring certs.

I'm dropping UpgradeBlocker based on:

* Seems somewhat rare, based on some imprecise Telemetry estimates.
* Doesn't seem like a big hit to cluster health, based on comment 8.
* Seems straightforward to mitigate, based on comments 7 through 9.
* Even if folks don't find this bug with its mitigation steps, updating a stuck cluster to one with the fix that will come with this bug series will allow their cluster to self-heal.

Feel free to push back if it seems like that assessment is off :)

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/4d6bf3d9ed8187ed13854fce3d75d32a0525b1db/Documentation/data-collection.md
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1891544#c6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438