Bug 1764150

Summary: OpenShift Container Platform nodes become 'Ready' even when the iptables NAT rules have not yet been populated by the openshift-sdn pod(s)
Product: OpenShift Container Platform
Reporter: Christian Koep <ckoep>
Component: Networking
Assignee: Aniket Bhat <anbhat>
Networking sub component: openshift-sdn
QA Contact: Ross Brattain <rbrattai>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: unspecified
CC: anbhat
Version: 3.11.0
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The SDN code declared each node Ready before the SDN proxy had completely initialized. Consequence: The node was marked as ready while some iptables rules were not yet populated, so pods coming up on it could not communicate with the API server. Fix: Wait for the SDN proxy to initialize and for all of the desired rules to be populated before marking the node as ready for scheduling. Result: All iptables rules exist before pods are scheduled to the node, preventing situations where pods cannot communicate with the API server.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-14 05:31:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Christian Koep 2019-10-22 11:26:16 UTC
Description of problem:
- Under certain conditions (e.g. when hitting [1] and simultaneously running `oc adm diagnostics` and/or `sosreport` many times), a large number of orphaned service and endpoints objects can linger in a cluster.

- As a result, it can take the `openshift-sdn` pod(s) a considerable amount of time to regenerate the iptables NAT rules after being restarted for whatever reason. During that time, pods cannot communicate with the Kubernetes API, which causes downtime for certain applications.

- While the iptables rules are incomplete, the following errors can be observed in the `openshift-sdn` logs:

~~~
E1011 16:48:21.681666   20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-global-ns-5rphw, netnamespaces.network.openshift.io "network-diag-global-ns-5rphw" not found
E1011 16:48:26.976305   20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-global-ns-7r42h, netnamespaces.network.openshift.io "network-diag-global-ns-7r42h" not found
E1011 16:48:32.259745   20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-ns-mkjst, netnamespaces.network.openshift.io "network-diag-ns-mkjst" not found
~~~

- Furthermore, the output of `iptables -L -nv -t nat` does not show the expected rules until several minutes have passed (see the example below).
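
One rough way to watch the rules being repopulated while the `openshift-sdn` pod restarts is sketched below; the `KUBE-SVC` chain prefix comes from the kube-proxy code embedded in openshift-sdn and is an assumption about this environment, so adjust the pattern if needed:

~~~
# Run on the affected node while the openshift-sdn pod is restarting.
# Counts the per-service NAT rules; the count should climb back to roughly
# one entry per service once the proxy has finished syncing.
watch -n 5 "iptables -t nat -S | grep -c 'KUBE-SVC'"
~~~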

Version-Release number of selected component (if applicable):
- Red Hat OpenShift Container Platform 3.11.88 on VMware vSphere 

How reproducible:
- Potentially always, but only under the conditions outlined above.

Steps to Reproduce:
1. Keep a large number (>5,000) of service and endpoints objects without a corresponding project in the cluster.
2. Restart `docker` or the `openshift-sdn` pod on a given node.
3. Observe that `iptables -L -nv -t nat` does not show the expected rules until several minutes have passed (see the sketch after these steps).
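
One possible way to check the step 1 precondition and perform the step 2 restart is sketched below; the `openshift-sdn` namespace assumes a default 3.11 installation, and `<node-name>` / `<sdn-pod-name>` are placeholders:

~~~
# Count the service and endpoints objects cluster-wide (step 1 precondition).
oc get services --all-namespaces --no-headers | wc -l
oc get endpoints --all-namespaces --no-headers | wc -l

# Find and delete the SDN pod running on the node so that the daemonset
# recreates it (step 2).
oc -n openshift-sdn get pods -o wide | grep <node-name>
oc -n openshift-sdn delete pod <sdn-pod-name>
~~~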

Actual results:
- After restarting a node, it becomes `Ready` before pods are able to connect to the Kubernetes API.

Expected results:
- To have a check in place that prevents scheduling to nodes that can't (yet) connect to the Kubernetes API.
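
For verification, a minimal sketch for correlating the node's Ready condition with the state of the NAT rules after a restart is shown below; it assumes cluster-admin access, `<node-name>` is a placeholder, and the `KUBE-SVC` pattern is the same assumption as above:

~~~
# Ready condition as reported by the API server.
oc get node <node-name> \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

# Compare with the number of per-service NAT rules on the node itself;
# with the fix, the node should only report Ready once these are in place.
iptables -t nat -S | grep -c 'KUBE-SVC'
~~~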

Additional info:
- See the (private) comment section
- [1] - https://bugzilla.redhat.com/show_bug.cgi?id=1625194

Comment 16 Aniket Bhat 2020-01-09 17:32:58 UTC
Provided the doc text for the bug. Who needs to verify the content of the doc text for accuracy?

Comment 17 errata-xmlrpc 2020-01-14 05:31:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0017