Bug 1764150 - OpenShift Container Platform nodes become 'Ready' even when iptables NAT rules have not been populated by the openshift-sdn pod(s) yet
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Aniket Bhat
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-22 11:26 UTC by Christian Koep
Modified: 2023-03-24 15:43 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The SDN code declared each node Ready before the SDN proxy was completely initialized. Consequence: The node was marked Ready while some iptables rules had not yet been populated, so some pods could not communicate with the API server while coming up. Fix: Wait for the SDN proxy to initialize and for all the desired rules to be populated before marking the node as ready for scheduling. Result: All iptables rules exist before pods start getting scheduled to the node, thereby preventing issues where pods cannot communicate with the API server.
Clone Of:
Environment:
Last Closed: 2020-01-14 05:31:27 UTC
Target Upstream Version:
Embargoed:




Links
- Github openshift origin pull 24049 (closed): Bug 1764150: Wait for sdn proxy to initialize. Last updated 2020-11-11 15:14:14 UTC
- Red Hat Product Errata RHBA-2020:0017. Last updated 2020-01-14 05:31:39 UTC

Description Christian Koep 2019-10-22 11:26:16 UTC
Description of problem:
- Under certain conditions (e.g. when hitting [1] while simultaneously running `oc adm diagnostics` and/or `sosreport` many times), many orphaned service and endpoint objects can linger in a cluster.

- As a result, it can take a considerable amount of time for the `openshift-sdn` pod(s) to regenerate the iptables NAT rules after being restarted for whatever reason. During that window, pods cannot communicate with the Kubernetes API, which causes downtime for certain applications.

- During the time in which the iptables rules are incomplete, the following errors can be observed in the `openshift-sdn` logs:

~~~
E1011 16:48:21.681666   20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-global-ns-5rphw, netnamespaces.network.openshift.io "network-diag-global-ns-5rphw" not found
E1011 16:48:26.976305   20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-global-ns-7r42h, netnamespaces.network.openshift.io "network-diag-global-ns-7r42h" not found
E1011 16:48:32.259745   20855 node.go:489] Skipped adding service rules for serviceEvent: ADDED, Error: failed to find netid for namespace: network-diag-ns-mkjst, netnamespaces.network.openshift.io "network-diag-ns-mkjst" not found
~~~

- Furthermore, the output of `iptables -L -nv -t nat` does not yield expected results until after a few minutes.

Version-Release number of selected component (if applicable):
- Red Hat OpenShift Container Platform 3.11.88 on VMware vSphere 

How reproducible:
- Potentially always, but only under the conditions outlined above.

Steps to Reproduce:
1. Keep a large number (>5,000) of service and endpoint objects without a corresponding project.
2. Restart `docker` or the `openshift-sdn` pod on a given node.
3. Observe that `iptables -L -nv -t nat` does not provide reasonable output until after a few minutes.

Actual results:
- After restarting a node, it becomes `Ready` before pods are able to connect to the Kubernetes API.

Expected results:
- To have a check in place that prevents scheduling to nodes that can't (yet) connect to the Kubernetes API.

Additional info:
- See the (private) comment section
- [1] - https://bugzilla.redhat.com/show_bug.cgi?id=1625194

Comment 16 Aniket Bhat 2020-01-09 17:32:58 UTC
Provided the doc text for the bug. Who needs to verify the content of the doc text for accuracy?

Comment 17 errata-xmlrpc 2020-01-14 05:31:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0017

