Bug 1764150 - OpenShift Container Platform nodes become 'Ready' even when iptables NAT rules have not been populated by the openshift-sdn pod(s) yet
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Aniket Bhat
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-22 11:26 UTC by Christian Koep
Modified: 2023-03-24 15:43 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The SDN code declared each node Ready before the SDN proxy was completely initialized. Consequence: The node was marked Ready while some iptables rules had not yet been populated, so some pods could not communicate with the API server while coming up. Fix: Wait for the SDN proxy to initialize and for all the desired rules to be populated before marking the node as ready for scheduling. Result: All iptables rules exist before pods start getting scheduled to the node, thereby preventing issues where pods cannot communicate with the API server.
Clone Of:
Environment:
Last Closed: 2020-01-14 05:31:27 UTC
Target Upstream Version:
Embargoed:




Links
- Github openshift origin pull 24049 (closed): Bug 1764150: Wait for sdn proxy to initialize. Last updated 2020-11-11 15:14:14 UTC
- Red Hat Product Errata RHBA-2020:0017. Last updated 2020-01-14 05:31:39 UTC

Description Christian Koep 2019-10-22 11:26:16 UTC
Description of problem:
- Under certain conditions (e.g. when hitting [1] while simultaneously running `oc adm diagnostics` and/or `sosreport` many times), many orphaned service and endpoint objects can linger in a cluster.

- As a result, it can take a considerable amount of time for the `openshift-sdn` pod(s) to regenerate the iptables NAT rules after being restarted for whatever reason. During that window, pods cannot communicate with the Kubernetes API, which causes downtime for certain applications.

- During the time in which the iptables rules are incomplete, the following errors can be observed in the `openshift-sdn` logs:

~~~
E1011 16:48:21.681666   20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-global-ns-5rphw, netnamespaces.network.openshift.io "network-diag-global-ns-5rphw" not found
E1011 16:48:26.976305   20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-global-ns-7r42h, netnamespaces.network.openshift.io "network-diag-global-ns-7r42h" not found
E1011 16:48:32.259745   20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-ns-mkjst, netnamespaces.network.openshift.io "network-diag-ns-mkjst" not found
~~~

- Furthermore, the output of `iptables -L -nv -t nat` does not yield expected results until after a few minutes.

Version-Release number of selected component (if applicable):
- Red Hat OpenShift Container Platform 3.11.88 on VMware vSphere 

How reproducible:
- Potentially always, but only under the conditions outlined above.

Steps to Reproduce:
1. Keep a large number (>5,000) of service and endpoint objects without a corresponding project.
2. Restart `docker` or the `openshift-sdn` pod on a given node.
3. Observe that `iptables -L -nv -t nat` does not provide reasonable output until after a few minutes.

Actual results:
- After restarting a node, it becomes `Ready` before pods are able to connect to the Kubernetes API.

Expected results:
- To have a check in place that prevents scheduling to nodes that can't (yet) connect to the Kubernetes API.

Additional info:
- See the (private) comment section
- [1] - https://bugzilla.redhat.com/show_bug.cgi?id=1625194

Comment 16 Aniket Bhat 2020-01-09 17:32:58 UTC
Provided the doc text for the bug. Who needs to verify the content of the doc text for accuracy?

Comment 17 errata-xmlrpc 2020-01-14 05:31:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0017

