Description of problem:

On an IBM ROKS cluster with a Calico SDN, openshift-kube-proxy pods are crashing. OpenShift 4.3.5 worked fine. The logs don't show much; the health probes are failing, so k8s restarts the pods.

Version-Release number of selected component (if applicable):
4.3.10

How reproducible:
Always

Steps to Reproduce:
1. Install a 4.3.10 cluster (or upgrade to 4.3.10)

Actual results:

openshift-kube-proxy pods enter CrashLoopBackOff:

$ k get pods -n openshift-kube-proxy
NAME                         READY   STATUS             RESTARTS   AGE
openshift-kube-proxy-bcfnp   0/1     CrashLoopBackOff   26         120m
openshift-kube-proxy-hfk2s   0/1     CrashLoopBackOff   25         120m
openshift-kube-proxy-t8mf9   0/1     CrashLoopBackOff   26         121m
openshift-kube-proxy-wrjgj   0/1     CrashLoopBackOff   25         121m

Events:
20m     Warning   Unhealthy   pod/openshift-kube-proxy-bcfnp   Readiness probe failed: HTTP probe failed with statuscode: 503
5m46s   Warning   Unhealthy   pod/openshift-kube-proxy-bcfnp   Liveness probe failed: HTTP probe failed with statuscode: 503

Expected results:

openshift-kube-proxy pods run as normal.

Additional info:

$ k logs -n openshift-kube-proxy openshift-kube-proxy-bcfnp
W0403 20:04:56.608864 1 proxier.go:584] Failed to read file /lib/modules/3.10.0-1062.18.1.el7.x86_64/modules.builtin with error open /lib/modules/3.10.0-1062.18.1.el7.x86_64/modules.builtin: no such file or directory. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0403 20:04:56.610551 1 proxier.go:597] Failed to load kernel module ip_vs with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0403 20:04:56.612410 1 proxier.go:597] Failed to load kernel module ip_vs_rr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0403 20:04:56.614149 1 proxier.go:597] Failed to load kernel module ip_vs_wrr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0403 20:04:56.615878 1 proxier.go:597] Failed to load kernel module ip_vs_sh with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0403 20:04:56.617422 1 proxier.go:597] Failed to load kernel module nf_conntrack_ipv4 with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
I0403 20:04:56.617910 1 server.go:494] Neither kubeconfig file nor master URL was specified. Falling back to in-cluster config.
I0403 20:04:56.661306 1 node.go:135] Successfully retrieved node IP: 10.184.98.6
I0403 20:04:56.661352 1 server_others.go:146] Using iptables Proxier.
W0403 20:04:56.661472 1 proxier.go:275] missing br-netfilter module or unset sysctl br-nf-call-iptables; proxy may not work as intended
I0403 20:04:56.661774 1 server.go:529] Version: v0.0.0-master+$Format:%h$
I0403 20:04:56.663018 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0403 20:04:56.663364 1 config.go:131] Starting endpoints config controller
I0403 20:04:56.663412 1 shared_informer.go:197] Waiting for caches to sync for endpoints config
I0403 20:04:56.663529 1 config.go:313] Starting service config controller
I0403 20:04:56.663559 1 shared_informer.go:197] Waiting for caches to sync for service config
I0403 20:04:56.763666 1 shared_informer.go:204] Caches are synced for endpoints config
I0403 20:04:56.763800 1 shared_informer.go:204] Caches are synced for service config
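The 503 can also be confirmed against kube-proxy's health endpoint directly from an affected node. A rough sketch, assuming the daemonset's probes point at kube-proxy's default healthz port (10256) and that curl is available on the host; check the pod spec for the actual port and path on your cluster:

# Inspect the probe configuration the daemonset actually uses
$ oc -n openshift-kube-proxy get ds -o yaml | grep -B2 -A6 -E 'livenessProbe|readinessProbe'

# From a debug shell on an affected node, hit the health endpoint directly;
# while the pod is failing its probes this should return 503
$ oc debug node/<node-name> -- chroot /host curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:10256/healthz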
Must-gather at https://drive.google.com/open?id=1nTr44hOaodzVlf-wkbtRn843tgyVeOoy
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
* Up to 2 minute disruption in edge routing
* Up to 90 seconds of API downtime
* etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* Issue resolves itself after five minutes
* Admin uses oc to fix things
* Admin must SSH to hosts, restore from backups, or other non-standard admin activities
(In reply to W. Trevor King from comment #2)

> Who is impacted?

All customers on 4.3.10 running a third-party network plugin that uses a standalone kube-proxy started by CNO (eg, Calico in the case of the reporter). (Customers running openshift-sdn, ovn-kubernetes, or kuryr are not affected.)

> What is the impact? Is it serious enough to warrant blocking edges?

kube-proxy fails health checks shortly after starting up, every time it starts up, and is killed. Each time kube-proxy does start successfully it will correctly sync the iptables state, but eventually it will be spending more time in "CrashLoopBackOff" than in "Running", so iptables will be out of date most of the time.

> How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

There is no easy way to fix the problem other than upgrading to a fixed release.
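One rough way to see the staleness on a live cluster is to compare what the node has programmed against what the cluster currently defines, and to watch the restart counts climb. A sketch, assuming the standard KUBE-* chain naming and node debug access; the two counts won't match exactly (headless/ExternalName and multi-port services skew them), but a widening gap while kube-proxy sits in CrashLoopBackOff points at stale rules:

# Service chains programmed in the node's nat table
$ oc debug node/<node-name> -- chroot /host sh -c 'iptables-save -t nat | grep -c ":KUBE-SVC-"'

# Services currently defined in the cluster
$ oc get svc --all-namespaces --no-headers | wc -l

# Restart counts show how often the probes are killing kube-proxy
$ oc -n openshift-kube-proxy get pods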
> All customers on 4.3.10

Correction: the bug was introduced in 4.3.8... so that may suggest that not many people are affected.
> ...so that may suggest that not many people are affected

Do we have anything in Telemetry/Insights that we can look at to gauge the number of clusters with third-party network plugins?
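Independent of fleet-wide Telemetry, an individual cluster can be checked against the affected set (4.3.8 and later, until the fixed release, with a CNO-deployed standalone kube-proxy). A sketch, assuming the deployKubeProxy field in the cluster Network operator config reflects whether CNO manages the standalone proxy; verify the exact field against your release:

# Current cluster version
$ oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'

# Network plugin type and whether CNO deploys a standalone kube-proxy
$ oc get network.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.type}{" "}{.spec.deployKubeProxy}{"\n"}'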
Bumping the priority, since we're considering tombstoning 4.3.10 on this (although discussion around the fix seems to be moving quickly already).
As per Insights data, we have very few clusters (probably only a few production clusters) that will be impacted by this issue, so removing the upgrade blocker from this issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1393