Bug 1820778 - kube-proxy pods keep restarting on clusters with 4.3.10
Summary: kube-proxy pods keep restarting on clusters with 4.3.10
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.z
Hardware: All
OS: All
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.3.z
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1803149
Blocks: 1801742
 
Reported: 2020-04-03 20:42 UTC by Cesar Wong
Modified: 2020-05-05 15:11 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-14 16:18:55 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Github openshift sdn pull 126 None closed Bug 1820778: fix health check in standalone kube-proxy 2020-09-28 13:57:34 UTC
Red Hat Product Errata RHBA-2020:1393 None None None 2020-04-14 16:18:57 UTC

Description Cesar Wong 2020-04-03 20:42:54 UTC
Description of problem:
On an IBM ROKS cluster using the Calico SDN, the openshift-kube-proxy pods are crashing. The same cluster worked fine on OpenShift 4.3.5. The pod logs don't show anything obviously wrong; the readiness and liveness probes fail, so Kubernetes keeps restarting the pods.

Version-Release number of selected component (if applicable):
4.3.10

How reproducible:
Always

Steps to Reproduce:
1. Install 4.3.10 cluster (or upgrade)

Actual results:
openshift-kube-proxy pods enter CrashLoopBackOff

$ k get pods -n openshift-kube-proxy
NAME                         READY   STATUS             RESTARTS   AGE
openshift-kube-proxy-bcfnp   0/1     CrashLoopBackOff   26         120m
openshift-kube-proxy-hfk2s   0/1     CrashLoopBackOff   25         120m
openshift-kube-proxy-t8mf9   0/1     CrashLoopBackOff   26         121m
openshift-kube-proxy-wrjgj   0/1     CrashLoopBackOff   25         121m

events:
20m         Warning   Unhealthy          pod/openshift-kube-proxy-bcfnp   Readiness probe failed: HTTP probe failed with statuscode: 503
5m46s       Warning   Unhealthy          pod/openshift-kube-proxy-bcfnp   Liveness probe failed: HTTP probe failed with statuscode: 503
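The 503s in these events come from kube-proxy's own healthz endpoint; once the liveness probe fails `failureThreshold` consecutive times (3 by default), the kubelet kills and restarts the container, which is what produces the CrashLoopBackOff above. A minimal sketch of that restart decision (illustrative only, not the kubelet's actual implementation):

```python
def should_restart(status_codes, failure_threshold=3):
    """Decide whether a container gets restarted, given a sequence of HTTP
    liveness-probe status codes. Illustrative sketch only, not the
    kubelet's actual code."""
    failures = 0
    for code in status_codes:
        if 200 <= code < 400:
            failures = 0            # any success resets the counter
        else:
            failures += 1           # a 503 from healthz counts as a failure
            if failures >= failure_threshold:
                return True         # kubelet kills and restarts the container
    return False
```

With the probes above returning 503 repeatedly, `should_restart([200, 503, 503, 503])` is True, so the pods keep cycling back into CrashLoopBackOff.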

Expected results:
openshift-kube-proxy pods run as normal

Additional info:
$ k logs -n openshift-kube-proxy openshift-kube-proxy-bcfnp 
W0403 20:04:56.608864       1 proxier.go:584] Failed to read file /lib/modules/3.10.0-1062.18.1.el7.x86_64/modules.builtin with error open /lib/modules/3.10.0-1062.18.1.el7.x86_64/modules.builtin: no such file or directory. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0403 20:04:56.610551       1 proxier.go:597] Failed to load kernel module ip_vs with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0403 20:04:56.612410       1 proxier.go:597] Failed to load kernel module ip_vs_rr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0403 20:04:56.614149       1 proxier.go:597] Failed to load kernel module ip_vs_wrr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0403 20:04:56.615878       1 proxier.go:597] Failed to load kernel module ip_vs_sh with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0403 20:04:56.617422       1 proxier.go:597] Failed to load kernel module nf_conntrack_ipv4 with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
I0403 20:04:56.617910       1 server.go:494] Neither kubeconfig file nor master URL was specified. Falling back to in-cluster config.
I0403 20:04:56.661306       1 node.go:135] Successfully retrieved node IP: 10.184.98.6
I0403 20:04:56.661352       1 server_others.go:146] Using iptables Proxier.
W0403 20:04:56.661472       1 proxier.go:275] missing br-netfilter module or unset sysctl br-nf-call-iptables; proxy may not work as intended
I0403 20:04:56.661774       1 server.go:529] Version: v0.0.0-master+$Format:%h$
I0403 20:04:56.663018       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0403 20:04:56.663364       1 config.go:131] Starting endpoints config controller
I0403 20:04:56.663412       1 shared_informer.go:197] Waiting for caches to sync for endpoints config
I0403 20:04:56.663529       1 config.go:313] Starting service config controller
I0403 20:04:56.663559       1 shared_informer.go:197] Waiting for caches to sync for service config
I0403 20:04:56.763666       1 shared_informer.go:204] Caches are synced for endpoints config 
I0403 20:04:56.763800       1 shared_informer.go:204] Caches are synced for service config

Comment 1 Cesar Wong 2020-04-03 20:43:50 UTC
Must-gather at https://drive.google.com/open?id=1nTr44hOaodzVlf-wkbtRn843tgyVeOoy

Comment 2 W. Trevor King 2020-04-03 22:58:07 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities

Comment 3 Dan Winship 2020-04-03 23:18:37 UTC
(In reply to W. Trevor King from comment #2)
> Who is impacted?

All customers on 4.3.10 running a third-party network plugin that uses a standalone kube-proxy started by the CNO (e.g., Calico, as in the reporter's case).

(Customers running openshift-sdn, ovn-kubernetes, or kuryr are not affected.)

> What is the impact?  Is it serious enough to warrant blocking edges?

kube-proxy fails health checks shortly after starting up, every time it starts up, and is killed. Each time kube-proxy does start successfully it correctly syncs the iptables state, but eventually it will be spending more time in CrashLoopBackOff than in Running, so iptables will be out of date most of the time.
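The behavior Dan describes amounts to a staleness check: kube-proxy's healthz server reports healthy only while queued service/endpoint updates are being synced to iptables promptly, and starts returning 503 once a queued update has sat unapplied past a timeout. A simplified sketch of that check (names and structure are illustrative, not the actual kube-proxy or openshift/sdn code):

```python
from datetime import datetime, timedelta

def healthz_ok(last_queued, last_updated, timeout, now):
    """Return True (HTTP 200) if the proxy looks healthy, False (HTTP 503)
    otherwise. Illustrative sketch of a kube-proxy-style staleness check."""
    # No update pending, or the newest queued update was already applied.
    if last_queued is None or last_queued <= last_updated:
        return True
    # An update is pending: report unhealthy only once it has been waiting
    # longer than the allowed sync window.
    return (now - last_queued) < timeout
```

Per this comment, the broken standalone mode in 4.3.8-4.3.10 made the check report the proxy as stale shortly after every startup, so healthz flipped to 503 and the probes killed the pod each time.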

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?

There is no easy way to fix the problem other than upgrading to a fixed release.

Comment 4 Dan Winship 2020-04-03 23:20:08 UTC
> All customers on 4.3.10

Correction: the bug was introduced in 4.3.8... so that may suggest that not many people are affected.

Comment 5 W. Trevor King 2020-04-03 23:29:43 UTC
> ...so that may suggest that not many people are affected

Do we have anything in Telemetry/Insights that we can look at to gauge the number of clusters with third-party network plugins?

Comment 6 W. Trevor King 2020-04-04 03:24:05 UTC
Bumping the priority, since we're considering tombstoning 4.3.10 on this (although discussion around the fix seems to be moving quickly already).

Comment 8 Lalatendu Mohanty 2020-04-06 16:45:56 UTC
Per Insights data, very few clusters (probably only a handful of production clusters) will be impacted by this issue, so I am removing the upgrade blocker.

Comment 16 errata-xmlrpc 2020-04-14 16:18:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1393

