1825355 – openshift-sdn node can permanently NetNamespaces when LIST times out

Bug 1825355 - openshift-sdn node can permanently NetNamespaces when LIST times out

Summary: openshift-sdn node can permanently NetNamespaces when LIST times out

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Casey Callendrello
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1839107
TreeView+	depends on / blocked

Reported:	2020-04-17 18:56 UTC by Jim Minter
Modified:	2020-07-13 17:28 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When etcd is very slow, openshift-sdn can miss namespace creation events due to a race condition. Consequence: Pods in that namespace have no connectivity. Fix: The race condition was removed. Result: Pods eventually have connectivity.
Clone Of:
Clones:	1839107 (view as bug list)
Environment:
Last Closed:	2020-07-13 17:28:35 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
log from sdn-zcwr6 pod on bad master (907.19 KB, text/plain) 2020-04-17 19:02 UTC, Jim Minter	no flags	Details
ovs-ofctl dump-flows br0 -O OpenFlow13 on bad master (29.68 KB, text/plain) 2020-04-17 19:03 UTC, Jim Minter	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift sdn pull 134	0	None	closed	Bug 1825355: node/vnids: Correctly handle case where NetNamespace watch is far behind	2020-11-05 02:07:33 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:28:58 UTC

Description Jim Minter 2020-04-17 18:56:55 UTC

Description of problem:

4.3.10.  Cluster failed to install.  Cluster version operator reported control plane down or not responding.  Sporadic 503s contacting OpenShift API server.  No errors on OpenShift API server.  Kube API server reporting timeouts contacting OpenShift API server.

On one master, the pod IP of the OpenShift API server was not reachable via SDN from other nodes on the cluster, only from the local VM.

That master was missing the following openflow rule, which was present on the other masters:

# ovs-ofctl dump-flows br0 -O OpenFlow13 | grep table=80.*0x62b485
 cookie=0x0, duration=42187.819s, table=80, n_packets=474, n_bytes=66396, priority=50,reg1=0x62b485 actions=output:NXM_NX_REG2[]

Here are relevant openshift-sdn logs:

$ oc logs -n openshift-sdn sdn-zcwr6 | grep openshift-apiserver | grep -v openshift-apiserver-operator
I0417 06:42:52.434574    2512 proxy.go:368] sdn proxy: add svc openshift-apiserver/api: &Service{ObjectMeta:{api  openshift-apiserver /api/v1/namespaces/openshift-apiserver/services/api 8b962796-43f6-46d1-85ee-8625249f3d7b 6797 0 2020-04-17 06:42:52 +0000 UTC <nil> <nil> map[] map[prometheus.io/scheme:https prometheus.io/scrape:true service.alpha.openshift.io/serving-cert-secret-name:serving-cert] [] []  []},Spec:ServiceSpec{Ports:[]ServicePort{ServicePort{Name:https,Protocol:TCP,Port:443,TargetPort:{0 8443 },NodePort:0,},},Selector:map[string]string{apiserver: true,},ClusterIP:172.30.255.15,Type:ClusterIP,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:,HealthCheckNodePort:0,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{},},},}
I0417 06:42:52.435044    2512 service.go:357] Adding new service port "openshift-apiserver/api:https" at 172.30.255.15:443/TCP
W0417 06:43:00.998950    2512 vnids.go:121] Netid for namespace: openshift-apiserver exists but not found in vnid map
I0417 06:43:01.075146    2512 pod.go:503] CNI_ADD openshift-apiserver/apiserver-f7bmj got IP 10.128.0.14, ofport 15
I0417 06:43:02.270238    2512 pod.go:539] CNI_DEL openshift-apiserver/apiserver-f7bmj
I0417 06:43:25.222591    2512 roundrobin.go:270] LoadBalancerRR: Setting endpoints for openshift-apiserver/api:https to [10.130.0.26:8443]
I0417 06:43:33.088178    2512 pod.go:503] CNI_ADD openshift-apiserver/apiserver-mtbx6 got IP 10.128.0.20, ofport 21
I0417 06:43:56.853377    2512 roundrobin.go:270] LoadBalancerRR: Setting endpoints for openshift-apiserver/api:https to [10.129.0.15:8443 10.130.0.26:8443]
I0417 06:43:56.853428    2512 roundrobin.go:218] Delete endpoint 10.129.0.15:8443 for service "openshift-apiserver/api:https"
I0417 06:45:04.183113    2512 pod.go:539] CNI_DEL openshift-apiserver/apiserver-mtbx6
I0417 06:45:46.137466    2512 pod.go:503] CNI_ADD openshift-apiserver/apiserver-rgjkp got IP 10.128.0.33, ofport 34

The following line only appears in the openshift-sdn logs of the "bad" master:

W0417 06:43:00.998950    2512 vnids.go:121] Netid for namespace: openshift-apiserver exists but not found in vnid map

It is notable that multiple pod creations/deletions in the openshift-apiserver namespace did not unstick the problem.

I think the issue is something like the following:

* the net namespace watch was slow
* func (vmap *nodeVNIDMap) handleAddOrUpdateNetNamespace() lost the race to call vmap.setVNID()

* func (m *podManager) setup() ran
* func (vmap *nodeVNIDMap) WaitAndGetVNID() was called
* func (vmap *nodeVNIDMap) WaitAndGetVNID() waited for 5 seconds for vmap.getVNID() to succeed and gave up
* func (vmap *nodeVNIDMap) WaitAndGetVNID() called vmap.setVNID() to recover (and logged the warning message)
* BUG: there is a missing call to vmap.policy.AddNetNamespace() at this point
* as a result, inside the network policy plugin, np.namespaces was not inserted into
* m.ovs.SetUpPod() was called and set up the table=70 flows
* m.policy.EnsureVNIDRules() was called
* func (np *networkPolicyPlugin) EnsureVNIDRules() did not find the namespace in np.namespaces
* func (np *networkPolicyPlugin) EnsureVNIDRules() returned early without setting up the table=80 flows

I think this bug can affect the SDN reachability of pods in any namespace if there is a watch delay of 5 second to any pod at namespace creation time.

I'm not sure how to fix this - potentially one route might be to add a call to AddNetNamespace() from WaitAndGetVNID(), but I'm concerned about the implications of this - it seems pretty racy wrt handleAddOrUpdateNetNamespace(), etc.  Fundamentally the whole recovery concept in WaitAndGetVNID() seems like a bit of a hack.


Version-Release number of selected component (if applicable):

4.3.10


How reproducible:

Unclear.  ETCD watches are noticably slower on Azure and we have seen quite a lot of cases of OpenShift API server 503s.  We wonder if this is the cause of many of these.

Comment 1 Jim Minter 2020-04-17 19:02:10 UTC

Created attachment 1679732 [details]
log from sdn-zcwr6 pod on bad master

Comment 2 Jim Minter 2020-04-17 19:03:40 UTC

Created attachment 1679733 [details]
ovs-ofctl dump-flows br0 -O OpenFlow13 on bad master

Comment 3 Ben Bennett 2020-04-20 12:53:23 UTC

Setting the target release to 4.5 so that this can get worked on.  We will backport fixes as needed once we have identified the problem.

Comment 5 Casey Callendrello 2020-04-27 11:18:32 UTC

There is a (prometheus) metric for VNID-not-found errors - you can graph this on a test cluster and see if there is a correlation.

Comment 6 Casey Callendrello 2020-04-27 11:57:57 UTC

Jim, I think your analysis is exactly correct. And it should be an easy fix.

It would be better if we didn't make direct GET calls to the apiserver, but that's a pattern that openshift-sdn often uses.

Is it only NetNamespace watches that fail? Clearly GETs succeed.

Comment 11 zhaozhanqi 2020-05-06 12:04:40 UTC

hi, Casey, any advice to verify this bug since it's not always to reproduce this kind of issue? thanks in advance.

Comment 12 Casey Callendrello 2020-05-06 14:28:10 UTC

This is quite difficult to verify. It's only caused by etcd resource starvation, and has never happened in any meaningful way outside of Azure, which we can't use.

Sorry, I don't immediately have any ideas. Maybe you can really throttle down disk iops? But that will break the cluster in other ways.

Perhaps the best way to verify this is just to create a lot of namespaces very quickly and ensure that pods work.

Comment 14 errata-xmlrpc 2020-07-13 17:28:35 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Note You need to log in before you can comment on or make changes to this bug.