Bug 1955517
Summary: Failed to upgrade from 4.6.25 to 4.7.8 due to machine-config degradation
Product: OpenShift Container Platform
Reporter: oarribas <oarribas>
Component: Node
Assignee: Qi Wang <qiwan>
Node sub component: CRI-O
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Severity: high
Priority: medium
CC: aacostab, aos-bugs, jerzhang, lmohanty, minmli, morgan.peterman, nagrawal, oarribas, qiwan, rphillips, umohnani, wking
Version: 4.7
Keywords: Upgrades
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: When upgrading from 4.6 with more than one config created before machine config pool name suffixes were supported, the MCO generates duplicate machine configs for the same configuration.
Consequence: The upgrade fails.
Fix: Clean up the outdated duplicate machine configs.
Result: The upgrade from 4.6 to 4.7 succeeds.
Story Points: ---
Clones: 1964568 (view as bug list)
Last Closed: 2021-07-27 23:05:17 UTC
Type: Bug
Bug Blocks: 1964568
Description (oarribas, 2021-04-30 09:50:41 UTC)
We did change that generation behaviour, although I am not sure if we deleted the old ones. If not, this may be an upgrade blocker. Raising severity and cc'ing Urvashi, who provided the original fix.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it's always been like this, we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1

Adding the ImpactStatementRequested keyword, per [1]. When you supply an impact statement, please replace it with ImpactStatementProposed.
[1]: https://github.com/openshift/enhancements/pull/475

To get out of the stuck upgrade situation, the customer can run `oc delete mc/99-worker-generated-kubelet` to delete the duplicate degraded machine configs (99-master-generated-kubelet, 99-infra-generated-kubelet).

Basically, this bug can be reproduced when a cluster has more than one kubeletconfig or ctrcfg CR created in 4.6 and is then upgraded to 4.7. We are working on a fix for it: https://github.com/openshift/machine-config-operator/pull/2570.

@Lalatendu, filled out upgrade blocker assessment.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

This is an issue in the 4.6 to 4.7 upgrade path, but it only affects clusters that have more than one kubeletconfig CR or ctrcfg CR created.

What is the impact? Is it serious enough to warrant blocking edges?

Upgrades will be stuck and will show the machine-config-operator as degraded, because the MC named 99-[pool]-generated-[kubelet or containerruntime] will still be on the old controller version (3.1). It will remain in this stuck state until an admin deletes the old MCs.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

Admin uses oc to fix things by deleting the old MC still using the old controller version. The fix is to run `oc delete mc 99-[pool]-generated-[kubelet or containerruntime]`. Once this fix is applied, the machine-config-operator upgrade should progress within a matter of minutes and the upgrade will be successful.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

No, this is not a regression. The MC generation logic was changed in 4.7 to properly account for multiple kubeletconfig or ctrcfg CRs, but it didn't account for 4.6 already having multiple kubeletconfig or ctrcfg CRs before upgrading.
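As a sketch of how an admin might confirm which generated MCs are stuck on the old controller version, the filter below runs against an illustrative sample of `oc get mc` output; the names, versions, and the `oc` custom-columns query shown in the comment are assumptions for this example, not output captured from the affected cluster:

```shell
#!/bin/sh
# Illustrative `oc get mc` output: NAME and controller-version columns.
# On a live cluster, a query of this shape could produce it (assumed, not
# verified against this cluster):
#   oc get mc -o custom-columns=NAME:.metadata.name,CTRL:'.metadata.annotations.machineconfiguration\.openshift\.io/generated-by-controller-version'
sample_mc_list() {
cat <<'EOF'
99-master-generated-kubelet 4.7.0-abc
99-worker-generated-kubelet 3.1.0-old
99-worker-generated-containerruntime 4.7.0-abc
rendered-worker-0123456789abcdef 4.7.0-abc
EOF
}

# Stuck MCs: generated (not rendered) configs still on the pre-upgrade
# controller version, i.e. the ones the comment above says to delete.
stuck=$(sample_mc_list | awk '$1 !~ /^rendered-/ && $2 ~ /^3\./ {print $1}')
echo "$stuck"
```

With the sample data above this prints only `99-worker-generated-kubelet`, the one generated MC still on the 3.x controller version.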
Fix for this is in https://github.com/openshift/machine-config-operator/pull/2570 and is still being tested.

(In reply to Urvashi Mohnani from comment #8)
> Admin uses oc to fix things by deleting the old MC still using the old
> controller version. The fix is to run `oc delete mc
> 99-[pool]-generated-[kubelet or containerruntime]`.

Adding to this: for a kubeletconfig CR, only delete the 99-worker-generated-kubelet and/or 99-master-generated-kubelet MCs. For a ctrcfg CR, only delete the 99-worker-generated-containerruntime and/or 99-master-generated-containerruntime MCs. We are deleting the old generated MCs in favor of the new generated MCs (the new ones will have the updated ignition version). Please note: DO NOT delete any **rendered** MCs.

> Once this fix is applied, the machine-config-operator upgrade should
> progress within a matter of minutes and the upgrade will be successful.

We peered into the Telemetry crystal ball for a bit, but we don't report Telemetry for the KubeletConfig or ContainerRuntimeConfig settings that matter here [1]. We do collect ContainerRuntimeConfigs in Insights since 4.6.18 [2], and Insights tarballs will have the ClusterOperator condition message entries where we can look for the 'controller version mismatch for 99-[pool]-generated-[kubelet or containerruntime] expected' messages. But I'm not set up to run that Insights analysis.

We can still put an upper bound on the frequency in Telemetry by looking at folks who are currently reporting Degraded=True during updates to 4.7, and from that it doesn't seem too common. Also, the impact is mostly the stuck update, which isn't regressing you much below your cluster health when you were running 4.6. There may be some degradation because the pool cannot roll out things like Kube-API X.509 certificates, and I get a bit fuzzy on how much time you have before that becomes a big concern. But we should have a fix out for this bug before you need to worry about expiring certs.

I'm dropping UpgradeBlocker based on:
* Seems somewhat rare, based on some imprecise Telemetry estimates.
* Doesn't seem like a big hit to cluster health, based on comment 8.
* Seems straightforward to mitigate, based on comments 7 through 9.
* Even if folks don't find this bug with its mitigation steps, updating a stuck cluster to one with the fix that will come with this bug series will allow their cluster to self-heal.
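The remediation discussed in the comments above can be sketched as a dry run that guards against touching rendered MCs. The candidate list here is illustrative, and the script only echoes the `oc delete` commands instead of running them:

```shell
#!/bin/sh
# Illustrative MC names an admin might be looking at (assumed sample data).
candidates="99-worker-generated-kubelet
99-master-generated-kubelet
99-worker-generated-containerruntime
rendered-master-0123456789abcdef"

# Guard from the remediation comments: never delete rendered MCs. Only the
# 99-[pool]-generated-[kubelet|containerruntime] configs should be removed.
to_delete=$(printf '%s\n' "$candidates" \
  | grep -E '^99-(worker|master)-generated-(kubelet|containerruntime)$')

# Dry run: print the commands; on a real cluster, drop the `echo`.
printf '%s\n' "$to_delete" | while read -r mc; do
  echo "oc delete mc $mc"
done
```

With the sample list above, the rendered MC is filtered out and only the three generated MCs are echoed as `oc delete mc` commands.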
Feel free to push back if it seems like that assessment is off :)

[1]: https://github.com/openshift/cluster-monitoring-operator/blob/4d6bf3d9ed8187ed13854fce3d75d32a0525b1db/Documentation/data-collection.md
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1891544#c6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438