Bug 1820778
Summary: | kube-proxy pods keep restarting on clusters with 4.3.10 | | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Cesar Wong <cewong> |
Component: | Networking | Assignee: | Dan Winship <danw> |
Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | urgent | CC: | bbennett, choag, danw, ddelcian, ealcaniz, inecas, lmohanty, vlaad, wking |
Version: | 4.3.z | Keywords: | Regression, Upgrades |
Target Milestone: | --- | ||
Target Release: | 4.3.z | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2020-04-14 16:18:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1803149 | ||
Bug Blocks: | 1801742 |
Description
Cesar Wong 2020-04-03 20:42:54 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
- Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
- Up to 2 minute disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or other non-standard admin activities

(In reply to W. Trevor King from comment #2)

> Who is impacted?

All customers on 4.3.10 running a third-party network plugin that uses a standalone kube-proxy started by CNO (e.g., Calico in the case of the reporter). Customers running openshift-sdn, ovn-kubernetes, or kuryr are not affected.

> What is the impact? Is it serious enough to warrant blocking edges?

kube-proxy fails health checks shortly after starting up, every time it starts up, and is killed. Each time kube-proxy does start successfully it will correctly sync the iptables state, but eventually it will be spending more time in "CrashLoopBackoff" than in "Running", so iptables will be out of date most of the time.

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

There is no easy way to fix the problem other than upgrading to a fixed release.
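For an affected cluster, the restart loop described above can usually be confirmed from the standalone kube-proxy pods themselves. This is a minimal sketch using oc, assuming the CNO-managed kube-proxy runs in the openshift-kube-proxy namespace; adjust the namespace and pod name to match your cluster.

```
# List the standalone kube-proxy pods and their restart counts; repeated
# restarts and a CrashLoopBackOff status match the symptom described above.
# (Assumption: the CNO-managed kube-proxy runs in openshift-kube-proxy.)
oc get pods -n openshift-kube-proxy -o wide

# Inspect one pod's last termination state and recent events to see whether
# it was killed after failing its health checks (substitute a real pod name).
oc describe pod <kube-proxy-pod-name> -n openshift-kube-proxy
```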
> All customers on 4.3.10

Correction, the bug was introduced in 4.3.8... so that may suggest that not many people are affected.
> ...so that may suggest that not many people are affected
Do we have anything in Telemetry/Insights that we can look at to gauge the number of clusters with third-party network plugins?
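Telemetry aside, on an individual cluster the network plugin in use and whether CNO deploys the standalone kube-proxy can be read from the cluster network operator configuration. A minimal sketch, assuming the spec.defaultNetwork.type and spec.deployKubeProxy fields are populated on the networks.operator.openshift.io cluster resource:

```
# Show which network plugin the Cluster Network Operator manages; a value other
# than OpenShiftSDN, OVNKubernetes, or Kuryr indicates a third-party plugin.
oc get networks.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.type}{"\n"}'

# Show whether CNO deploys a standalone kube-proxy for that plugin.
# (Assumption: the field is set explicitly; it may be empty when the default applies.)
oc get networks.operator.openshift.io cluster -o jsonpath='{.spec.deployKubeProxy}{"\n"}'
```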
Bumping the priority, since we're considering tombstoning 4.3.10 on this (although discussion around the fix seems to be moving quickly already).

As per Insights data, very few clusters (probably only a few production clusters) will be impacted by this issue, so removing the upgrade blocker from this issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1393
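Since the only remediation is upgrading to a fixed release, an admin can confirm that a given cluster has moved past the affected releases by checking the ClusterVersion resource; a minimal sketch:

```
# Report the cluster's current version and overall update status.
oc get clusterversion version

# Show the current version plus any available updates in the cluster's channel.
oc adm upgrade
```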