Bug 1825355

Summary: openshift-sdn node can permanently miss NetNamespaces when LIST times out
Product: OpenShift Container Platform
Reporter: Jim Minter <jminter>
Component: Networking
Networking sub component: openshift-sdn
Assignee: Casey Callendrello <cdc>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Severity: high
Priority: urgent
CC: aconstan, bbennett, ffranz, gmarkley, jeder, mjudeiki, skrenger, sreber
Version: 4.3.z
Keywords: ServiceDeliveryBlocker
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: When etcd is very slow, openshift-sdn can miss namespace creation events due to a race condition.
Consequence: Pods in that namespace have no connectivity.
Fix: The race condition was removed.
Result: Pods eventually have connectivity.
Last Closed: 2020-07-13 17:28:35 UTC
Type: Bug
Bug Blocks: 1839107
Attachments:
* log from sdn-zcwr6 pod on bad master
* ovs-ofctl dump-flows br0 -O OpenFlow13 on bad master

Description Jim Minter 2020-04-17 18:56:55 UTC
Description of problem:

4.3.10.  Cluster failed to install.  Cluster version operator reported control plane down or not responding.  Sporadic 503s contacting OpenShift API server.  No errors on OpenShift API server.  Kube API server reporting timeouts contacting OpenShift API server.

On one master, the pod IP of the OpenShift API server was not reachable via SDN from other nodes on the cluster, only from the local VM.

That master was missing the following openflow rule, which was present on the other masters:

# ovs-ofctl dump-flows br0 -O OpenFlow13 | grep table=80.*0x62b485
 cookie=0x0, duration=42187.819s, table=80, n_packets=474, n_bytes=66396, priority=50,reg1=0x62b485 actions=output:NXM_NX_REG2[]

Here are relevant openshift-sdn logs:

$ oc logs -n openshift-sdn sdn-zcwr6 | grep openshift-apiserver | grep -v openshift-apiserver-operator
I0417 06:42:52.434574    2512 proxy.go:368] sdn proxy: add svc openshift-apiserver/api: &Service{ObjectMeta:{api  openshift-apiserver /api/v1/namespaces/openshift-apiserver/services/api 8b962796-43f6-46d1-85ee-8625249f3d7b 6797 0 2020-04-17 06:42:52 +0000 UTC <nil> <nil> map[] map[prometheus.io/scheme:https prometheus.io/scrape:true service.alpha.openshift.io/serving-cert-secret-name:serving-cert] [] []  []},Spec:ServiceSpec{Ports:[]ServicePort{ServicePort{Name:https,Protocol:TCP,Port:443,TargetPort:{0 8443 },NodePort:0,},},Selector:map[string]string{apiserver: true,},ClusterIP:172.30.255.15,Type:ClusterIP,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:,HealthCheckNodePort:0,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{},},},}
I0417 06:42:52.435044    2512 service.go:357] Adding new service port "openshift-apiserver/api:https" at 172.30.255.15:443/TCP
W0417 06:43:00.998950    2512 vnids.go:121] Netid for namespace: openshift-apiserver exists but not found in vnid map
I0417 06:43:01.075146    2512 pod.go:503] CNI_ADD openshift-apiserver/apiserver-f7bmj got IP 10.128.0.14, ofport 15
I0417 06:43:02.270238    2512 pod.go:539] CNI_DEL openshift-apiserver/apiserver-f7bmj
I0417 06:43:25.222591    2512 roundrobin.go:270] LoadBalancerRR: Setting endpoints for openshift-apiserver/api:https to [10.130.0.26:8443]
I0417 06:43:33.088178    2512 pod.go:503] CNI_ADD openshift-apiserver/apiserver-mtbx6 got IP 10.128.0.20, ofport 21
I0417 06:43:56.853377    2512 roundrobin.go:270] LoadBalancerRR: Setting endpoints for openshift-apiserver/api:https to [10.129.0.15:8443 10.130.0.26:8443]
I0417 06:43:56.853428    2512 roundrobin.go:218] Delete endpoint 10.129.0.15:8443 for service "openshift-apiserver/api:https"
I0417 06:45:04.183113    2512 pod.go:539] CNI_DEL openshift-apiserver/apiserver-mtbx6
I0417 06:45:46.137466    2512 pod.go:503] CNI_ADD openshift-apiserver/apiserver-rgjkp got IP 10.128.0.33, ofport 34

The following line only appears in the openshift-sdn logs of the "bad" master:

W0417 06:43:00.998950    2512 vnids.go:121] Netid for namespace: openshift-apiserver exists but not found in vnid map

It is notable that multiple pod creations/deletions in the openshift-apiserver namespace did not unstick the problem.

I think the issue is something like the following:

* the net namespace watch was slow
* func (vmap *nodeVNIDMap) handleAddOrUpdateNetNamespace() lost the race to call vmap.setVNID()

* func (m *podManager) setup() ran
* func (vmap *nodeVNIDMap) WaitAndGetVNID() was called
* func (vmap *nodeVNIDMap) WaitAndGetVNID() waited for 5 seconds for vmap.getVNID() to succeed and gave up
* func (vmap *nodeVNIDMap) WaitAndGetVNID() called vmap.setVNID() to recover (and logged the warning message)
* BUG: there is a missing call to vmap.policy.AddNetNamespace() at this point (see the sketch after this list)
* as a result, inside the network policy plugin, the namespace was never inserted into np.namespaces
* m.ovs.SetUpPod() was called and set up the table=70 flows
* m.policy.EnsureVNIDRules() was called
* func (np *networkPolicyPlugin) EnsureVNIDRules() did not find the namespace in np.namespaces
* func (np *networkPolicyPlugin) EnsureVNIDRules() returned early without setting up the table=80 flows
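
To make the race concrete, here is a minimal, self-contained Go model of the sequence above. It is not the real openshift-sdn code - the types are simplified stand-ins for nodeVNIDMap and networkPolicyPlugin, and the assumption that the watch handler skips events whose NetID it already recorded is mine (it would explain why pod churn never unsticks the node):

// Minimal model of the suspected race. These types are simplified
// stand-ins for nodeVNIDMap and networkPolicyPlugin, not the real code.
package main

import (
    "fmt"
    "sync"
)

// policy stands in for networkPolicyPlugin: EnsureVNIDRules only installs
// table=80 flows for namespaces it has been told about via AddNetNamespace.
type policy struct {
    mu         sync.Mutex
    namespaces map[string]bool
}

func (np *policy) AddNetNamespace(name string) {
    np.mu.Lock()
    defer np.mu.Unlock()
    np.namespaces[name] = true
}

func (np *policy) EnsureVNIDRules(name string) {
    np.mu.Lock()
    defer np.mu.Unlock()
    if !np.namespaces[name] {
        // Mirrors the early return described above: no table=80 flows.
        fmt.Printf("EnsureVNIDRules: %q not in np.namespaces, no table=80 flows\n", name)
        return
    }
    fmt.Printf("EnsureVNIDRules: table=80 flows installed for %q\n", name)
}

// vnidMap stands in for nodeVNIDMap.
type vnidMap struct {
    mu     sync.Mutex
    ids    map[string]uint32
    policy *policy
}

func (vm *vnidMap) getVNID(name string) (uint32, bool) {
    vm.mu.Lock()
    defer vm.mu.Unlock()
    id, ok := vm.ids[name]
    return id, ok
}

func (vm *vnidMap) setVNID(name string, id uint32) {
    vm.mu.Lock()
    defer vm.mu.Unlock()
    vm.ids[name] = id
}

// Watch path. Assumption: the handler skips events whose NetID it already
// has, which would explain why later pod churn never fixes the node.
func (vm *vnidMap) handleAddOrUpdateNetNamespace(name string, id uint32) {
    if old, ok := vm.getVNID(name); ok && old == id {
        return // VNID already recorded by the recovery path; AddNetNamespace never runs
    }
    vm.setVNID(name, id)
    vm.policy.AddNetNamespace(name)
}

// Recovery path taken when the watch is slow: a direct GET finds the
// NetNamespace and setVNID is called, but policy.AddNetNamespace is not.
func (vm *vnidMap) waitAndGetVNID(name string, id uint32) uint32 {
    fmt.Printf("Netid for namespace: %s exists but not found in vnid map\n", name)
    vm.setVNID(name, id)
    // BUG (as described above): missing vm.policy.AddNetNamespace(name)
    return id
}

func main() {
    np := &policy{namespaces: map[string]bool{}}
    vm := &vnidMap{ids: map[string]uint32{}, policy: np}

    // Pod setup wins the race against the slow NetNamespace watch.
    vm.waitAndGetVNID("openshift-apiserver", 0x62b485)
    np.EnsureVNIDRules("openshift-apiserver") // skips table=80 flows

    // The watch event arrives late and is skipped, so the namespace is
    // never added to np.namespaces and the pod stays unreachable.
    vm.handleAddOrUpdateNetNamespace("openshift-apiserver", 0x62b485)
    np.EnsureVNIDRules("openshift-apiserver") // still no table=80 flows
}

Running this prints the vnids.go warning line followed by two skipped EnsureVNIDRules calls, which matches the symptom of the pod having table=70 flows but no table=80 rule.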

I think this bug can affect the SDN reachability of pods in any namespace if the NetNamespace watch is delayed by more than 5 seconds when a pod is created at namespace creation time.

I'm not sure how to fix this - potentially one route might be to add a call to AddNetNamespace() from WaitAndGetVNID(), but I'm concerned about the implications of this - it seems pretty racy wrt handleAddOrUpdateNetNamespace(), etc.  Fundamentally the whole recovery concept in WaitAndGetVNID() seems like a bit of a hack.
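
Purely to illustrate the shape of that option (reusing the stand-in types from the sketch above; this is not the actual fix, and waitAndGetVNIDRecover is a made-up name):

// Illustrative only: the recovery path with the extra call described above.
func (vm *vnidMap) waitAndGetVNIDRecover(name string, id uint32) uint32 {
    vm.setVNID(name, id)
    // Also tell the policy plugin so EnsureVNIDRules can program table=80.
    // The concern above: if the real watch event fires around the same time,
    // both this path and handleAddOrUpdateNetNamespace may act on the same
    // namespace, so the two paths would need to be idempotent or serialized.
    vm.policy.AddNetNamespace(name)
    return id
}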


Version-Release number of selected component (if applicable):

4.3.10


How reproducible:

Unclear.  etcd watches are noticeably slower on Azure and we have seen quite a lot of cases of OpenShift API server 503s.  We wonder if this is the cause of many of them.

Comment 1 Jim Minter 2020-04-17 19:02:10 UTC
Created attachment 1679732 [details]
log from sdn-zcwr6 pod on bad master

Comment 2 Jim Minter 2020-04-17 19:03:40 UTC
Created attachment 1679733 [details]
ovs-ofctl dump-flows br0 -O OpenFlow13 on bad master

Comment 3 Ben Bennett 2020-04-20 12:53:23 UTC
Setting the target release to 4.5 so that this can get worked on.  We will backport fixes as needed once we have identified the problem.

Comment 5 Casey Callendrello 2020-04-27 11:18:32 UTC
There is a (Prometheus) metric for VNID-not-found errors - you can graph this on a test cluster and see if there is a correlation.

Comment 6 Casey Callendrello 2020-04-27 11:57:57 UTC
Jim, I think your analysis is exactly correct. And it should be an easy fix.

It would be better if we didn't make direct GET calls to the apiserver, but that's a pattern that openshift-sdn often uses.

Is it only NetNamespace watches that fail? Clearly GETs succeed.

Comment 11 zhaozhanqi 2020-05-06 12:04:40 UTC
Hi Casey, any advice on how to verify this bug, since it's not always possible to reproduce this kind of issue? Thanks in advance.

Comment 12 Casey Callendrello 2020-05-06 14:28:10 UTC
This is quite difficult to verify. It's only caused by etcd resource starvation, and has never happened in any meaningful way outside of Azure, which we can't use.

Sorry, I don't immediately have any ideas. Maybe you can really throttle down disk iops? But that will break the cluster in other ways.

Perhaps the best way to verify this is just to create a lot of namespaces very quickly and ensure that pods work.
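
A rough sketch of that idea, assuming a recent (context-aware) client-go and a kubeconfig in the default location; the namespace names, pod image, and count are invented for the example, and the actual reachability check (e.g. curling the pod IPs from a debug pod on another node) is left out:

// Burst-create namespaces and immediately start a pod in each, so pod
// setup races the NetNamespace watch on whichever node the pod lands on.
package main

import (
    "context"
    "fmt"
    "log"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }
    ctx := context.Background()

    for i := 0; i < 50; i++ {
        name := fmt.Sprintf("vnid-race-test-%d", i)
        if _, err := client.CoreV1().Namespaces().Create(ctx,
            &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: name}},
            metav1.CreateOptions{}); err != nil {
            log.Fatal(err)
        }
        if _, err := client.CoreV1().Pods(name).Create(ctx, &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "probe"},
            Spec: corev1.PodSpec{
                Containers: []corev1.Container{{
                    Name:    "probe",
                    Image:   "registry.access.redhat.com/ubi8/ubi-minimal",
                    Command: []string{"sleep", "infinity"},
                }},
            },
        }, metav1.CreateOptions{}); err != nil {
            log.Fatal(err)
        }
    }
    fmt.Println("created 50 namespaces with one pod each; now check the pod IPs are reachable from other nodes")
}

If a node loses the race, its sdn pod log should show the "Netid for namespace: ... exists but not found in vnid map" warning and pods in that namespace should be unreachable from other nodes.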

Comment 14 errata-xmlrpc 2020-07-13 17:28:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409