Description of problem:

- Under certain conditions (e.g. when hitting [1] and simultaneously running `oc adm diagnostics` and/or `sosreport` many times), a large number of orphaned service and endpoints objects can linger in the cluster.

- As a result, it can take a considerable amount of time for the `openshift-sdn` pod(s) to regenerate the iptables NAT rules after being restarted for whatever reason. During that window, pods cannot communicate with the Kubernetes API, which causes downtime for certain applications.

- While the iptables rules are incomplete, the following errors can be observed in the `openshift-sdn` logs:

~~~
E1011 16:48:21.681666 20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-global-ns-5rphw, netnamespaces.network.openshift.io "network-diag-global-ns-5rphw" not found
E1011 16:48:26.976305 20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-global-ns-7r42h, netnamespaces.network.openshift.io "network-diag-global-ns-7r42h" not found
E1011 16:48:32.259745 20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-ns-mkjst, netnamespaces.network.openshift.io "network-diag-ns-mkjst" not found
~~~

- Furthermore, the output of `iptables -L -nv -t nat` does not show the expected rules until after a few minutes.

Version-Release number of selected component (if applicable):

- Red Hat OpenShift Container Platform 3.11.88 on VMware vSphere

How reproducible:

- Potentially always, but only under certain conditions as outlined above.

Steps to Reproduce:
1. Keep a large number (>5,000) of service and endpoints objects without a corresponding project.
2. Restart `docker` or the `openshift-sdn` pod on a given node.
3. Observe that `iptables -L -nv -t nat` does not provide reasonable output until after a few minutes.

Actual results:

- After a restart, the node becomes `Ready` before its pods are able to connect to the Kubernetes API.

Expected results:

- A check should be in place that prevents scheduling to nodes that cannot (yet) connect to the Kubernetes API.

Additional info:

- See the (private) comment section.
- A rough sketch for spotting affected namespaces is included after this report.
- [1] https://bugzilla.redhat.com/show_bug.cgi?id=1625194
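As a rough way to gauge whether a cluster is affected, the sketch below lists namespaces that still have service objects but no corresponding NetNamespace object, which is the condition the `failed to find netid for namespace` errors point at. This is only an illustration and not part of the fix referenced in the erratum; it assumes cluster-admin access, the standard `oc` client, and an arbitrary temporary file path.

~~~
#!/bin/bash
# Sketch only: list namespaces that have service objects but no
# corresponding NetNamespace object (i.e. likely orphaned services).
# Assumes cluster-admin access; the temporary file path is arbitrary.

# Collect the names of all existing NetNamespace objects.
oc get netnamespaces -o custom-columns=NAME:.metadata.name --no-headers \
  | sort > /tmp/netnamespaces.txt

# Walk the namespaces of all services and flag those without a NetNamespace.
oc get services --all-namespaces --no-headers \
  | awk '{print $1}' \
  | sort -u \
  | while read -r ns; do
      if ! grep -qx "$ns" /tmp/netnamespaces.txt; then
        echo "namespace without NetNamespace (services may be orphaned): $ns"
      fi
    done
~~~

To observe the NAT rules being regenerated after restarting `docker` or the `openshift-sdn` pod, something like `watch -n 10 'iptables -t nat -S | wc -l'` on the affected node can show when the rule count stops growing.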
I have provided the doc text for this bug. Who needs to verify the content of the doc text for accuracy?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0017