Bug 1820778
Summary: | kube-proxy pods keep restarting on clusters with 4.3.10 | | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Cesar Wong <cewong> |
Component: | Networking | Assignee: | Dan Winship <danw> |
Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | urgent | CC: | bbennett, choag, danw, ddelcian, ealcaniz, inecas, lmohanty, vlaad, wking |
Version: | 4.3.z | Keywords: | Regression, Upgrades |
Target Milestone: | --- | ||
Target Release: | 4.3.z | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-04-14 16:18:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1803149 | ||
Bug Blocks: | 1801742 |
Description
Cesar Wong 2020-04-03 20:42:54 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
- Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
- Up to 2 minute disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or other non-standard admin activities

(In reply to W. Trevor King from comment #2)

> Who is impacted?

All customers on 4.3.10 running a third-party network plugin that uses a standalone kube-proxy started by CNO (e.g., Calico in the case of the reporter). Customers running openshift-sdn, ovn-kubernetes, or kuryr are not affected.

> What is the impact? Is it serious enough to warrant blocking edges?

kube-proxy fails health checks shortly after starting up, every time it starts up, and is killed. Each time kube-proxy does start successfully it will correctly sync the iptables state, but eventually it will be spending more time in "CrashLoopBackoff" than in "Running", so iptables will be out of date most of the time.

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

There is no easy way to fix the problem other than upgrading to a fixed release.
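For an affected cluster, the restart loop described above can usually be confirmed from the standalone kube-proxy pods themselves. This is a minimal sketch using oc, assuming the CNO-managed kube-proxy runs in the openshift-kube-proxy namespace; adjust the namespace and pod name to match your cluster.

```
# List the standalone kube-proxy pods and their restart counts; repeated
# restarts and a CrashLoopBackOff status match the symptom described above.
# (Assumption: the CNO-managed kube-proxy runs in openshift-kube-proxy.)
oc get pods -n openshift-kube-proxy -o wide

# Inspect one pod's last termination state and recent events to see whether
# it was killed after failing its health checks (substitute a real pod name).
oc describe pod <kube-proxy-pod-name> -n openshift-kube-proxy
```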
> All customers on 4.3.10

Correction, the bug was introduced in 4.3.8... so that may suggest that not many people are affected.
> ...so that may suggest that not many people are affected
Do we have anything in Telemetry/Insights that we can look at to gauge the number of clusters with third-party network plugins?
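Telemetry aside, on an individual cluster the network plugin in use and whether CNO deploys the standalone kube-proxy can be read from the cluster network operator configuration. A minimal sketch, assuming the spec.defaultNetwork.type and spec.deployKubeProxy fields are populated on the networks.operator.openshift.io cluster resource:

```
# Show which network plugin the Cluster Network Operator manages; a value other
# than OpenShiftSDN, OVNKubernetes, or Kuryr indicates a third-party plugin.
oc get networks.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.type}{"\n"}'

# Show whether CNO deploys a standalone kube-proxy for that plugin.
# (Assumption: the field is set explicitly; it may be empty when the default applies.)
oc get networks.operator.openshift.io cluster -o jsonpath='{.spec.deployKubeProxy}{"\n"}'
```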
Bumping the priority, since we're considering tombstoning 4.3.10 on this (although discussion around the fix seems to be moving quickly already).

As per Insights data, very few clusters (probably only a few production clusters) will be impacted by this issue, so removing the upgrade blocker from this issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1393
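Since the only remediation is upgrading to a fixed release, an admin can confirm that a given cluster has moved past the affected releases by checking the ClusterVersion resource; a minimal sketch:

```
# Report the cluster's current version and overall update status.
oc get clusterversion version

# Show the current version plus any available updates in the cluster's channel.
oc adm upgrade
```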