Description of problem:

A 4.3.10 cluster failed to install. The cluster version operator reported the control plane as down or not responding, with sporadic 503s when contacting the OpenShift API server. There were no errors on the OpenShift API server itself, but the kube-apiserver reported timeouts contacting the OpenShift API server.

On one master, the pod IP of the OpenShift API server was not reachable via the SDN from other nodes in the cluster, only from the local VM. That master was missing the following openflow rule, which was present on the other masters:

# ovs-ofctl dump-flows br0 -O OpenFlow13 | grep table=80.*0x62b485
cookie=0x0, duration=42187.819s, table=80, n_packets=474, n_bytes=66396, priority=50,reg1=0x62b485 actions=output:NXM_NX_REG2[]

Here are the relevant openshift-sdn logs:

$ oc logs -n openshift-sdn sdn-zcwr6 | grep openshift-apiserver | grep -v openshift-apiserver-operator
I0417 06:42:52.434574    2512 proxy.go:368] sdn proxy: add svc openshift-apiserver/api: &Service{ObjectMeta:{api openshift-apiserver /api/v1/namespaces/openshift-apiserver/services/api 8b962796-43f6-46d1-85ee-8625249f3d7b 6797 0 2020-04-17 06:42:52 +0000 UTC <nil> <nil> map[] map[prometheus.io/scheme:https prometheus.io/scrape:true service.alpha.openshift.io/serving-cert-secret-name:serving-cert] [] [] []},Spec:ServiceSpec{Ports:[]ServicePort{ServicePort{Name:https,Protocol:TCP,Port:443,TargetPort:{0 8443 },NodePort:0,},},Selector:map[string]string{apiserver: true,},ClusterIP:172.30.255.15,Type:ClusterIP,ExternalIPs:[],SessionAffinity:None,LoadBalancerIP:,LoadBalancerSourceRanges:[],ExternalName:,ExternalTrafficPolicy:,HealthCheckNodePort:0,PublishNotReadyAddresses:false,SessionAffinityConfig:nil,IPFamily:nil,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{},},},}
I0417 06:42:52.435044    2512 service.go:357] Adding new service port "openshift-apiserver/api:https" at 172.30.255.15:443/TCP
W0417 06:43:00.998950    2512 vnids.go:121] Netid for namespace: openshift-apiserver exists but not found in vnid map
I0417 06:43:01.075146    2512 pod.go:503] CNI_ADD openshift-apiserver/apiserver-f7bmj got IP 10.128.0.14, ofport 15
I0417 06:43:02.270238    2512 pod.go:539] CNI_DEL openshift-apiserver/apiserver-f7bmj
I0417 06:43:25.222591    2512 roundrobin.go:270] LoadBalancerRR: Setting endpoints for openshift-apiserver/api:https to [10.130.0.26:8443]
I0417 06:43:33.088178    2512 pod.go:503] CNI_ADD openshift-apiserver/apiserver-mtbx6 got IP 10.128.0.20, ofport 21
I0417 06:43:56.853377    2512 roundrobin.go:270] LoadBalancerRR: Setting endpoints for openshift-apiserver/api:https to [10.129.0.15:8443 10.130.0.26:8443]
I0417 06:43:56.853428    2512 roundrobin.go:218] Delete endpoint 10.129.0.15:8443 for service "openshift-apiserver/api:https"
I0417 06:45:04.183113    2512 pod.go:539] CNI_DEL openshift-apiserver/apiserver-mtbx6
I0417 06:45:46.137466    2512 pod.go:503] CNI_ADD openshift-apiserver/apiserver-rgjkp got IP 10.128.0.33, ofport 34

The following line only appears in the openshift-sdn logs of the "bad" master:

W0417 06:43:00.998950    2512 vnids.go:121] Netid for namespace: openshift-apiserver exists but not found in vnid map

It is notable that multiple pod creations/deletions in the openshift-apiserver namespace did not unstick the problem.
I think the issue is something like the following:

* the NetNamespace watch was slow
* func (vmap *nodeVNIDMap) handleAddOrUpdateNetNamespace() lost the race to call vmap.setVNID()
* func (m *podManager) setup() ran
* func (vmap *nodeVNIDMap) WaitAndGetVNID() was called
* WaitAndGetVNID() waited 5 seconds for vmap.getVNID() to succeed and gave up
* WaitAndGetVNID() called vmap.setVNID() to recover (and logged the warning message)
* BUG: there is a missing call to vmap.policy.AddNetNamespace() at this point
* as a result, inside the network policy plugin, np.namespaces was never inserted into
* m.ovs.SetUpPod() was called and set up the table=70 flows
* m.policy.EnsureVNIDRules() was called
* func (np *networkPolicyPlugin) EnsureVNIDRules() did not find the namespace in np.namespaces
* EnsureVNIDRules() returned early without setting up the table=80 flows

I think this bug can affect the SDN reachability of pods in any namespace whose watch event is delayed by more than 5 seconds at pod setup time. I'm not sure how to fix this. One route might be to add a call to AddNetNamespace() from WaitAndGetVNID(), but I'm concerned about the implications of this: it seems pretty racy with respect to handleAddOrUpdateNetNamespace(), etc. Fundamentally, the whole recovery concept in WaitAndGetVNID() seems like a bit of a hack.

Version-Release number of selected component (if applicable): 4.3.10

How reproducible: Unclear. Etcd watches are noticeably slower on Azure, and we have seen quite a lot of cases of OpenShift API server 503s. We wonder if this bug is the cause of many of them.
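The sequence above can be sketched as a toy simulation. The names mirror the real nodeVNIDMap/networkPolicyPlugin code, but all of the bodies, the 50 ms timeout (5 s in the real code), and the fetch callback standing in for the direct NetNamespace GET are hypothetical simplifications, not the actual openshift-sdn implementation:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// policyPlugin is a stand-in for networkPolicyPlugin: it only knows about
// namespaces it has been explicitly told about via AddNetNamespace.
type policyPlugin struct {
	mu         sync.Mutex
	namespaces map[uint32]string // stand-in for np.namespaces
}

func (np *policyPlugin) AddNetNamespace(name string, vnid uint32) {
	np.mu.Lock()
	defer np.mu.Unlock()
	np.namespaces[vnid] = name
}

// EnsureVNIDRules would install the table=80 flows; like the real code, it
// returns early if the namespace is not in np.namespaces.
func (np *policyPlugin) EnsureVNIDRules(vnid uint32) error {
	np.mu.Lock()
	defer np.mu.Unlock()
	if _, ok := np.namespaces[vnid]; !ok {
		return errors.New("namespace not in np.namespaces; no table=80 flows installed")
	}
	return nil
}

// nodeVNIDMap is a stand-in for the SDN's namespace->VNID cache.
type nodeVNIDMap struct {
	mu     sync.Mutex
	ids    map[string]uint32
	policy *policyPlugin
}

func (v *nodeVNIDMap) setVNID(name string, vnid uint32) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.ids[name] = vnid
}

func (v *nodeVNIDMap) getVNID(name string) (uint32, error) {
	v.mu.Lock()
	defer v.mu.Unlock()
	if id, ok := v.ids[name]; ok {
		return id, nil
	}
	return 0, errors.New("not found")
}

// WaitAndGetVNID polls until the watch-driven handler populates the map,
// gives up after the timeout, and "recovers" via a direct fetch. The bug:
// the recovery path calls setVNID but never v.policy.AddNetNamespace, so
// the policy plugin stays unaware of the namespace.
func (v *nodeVNIDMap) WaitAndGetVNID(name string, fetch func(string) uint32) uint32 {
	deadline := time.Now().Add(50 * time.Millisecond) // 5s in the real code
	for time.Now().Before(deadline) {
		if id, err := v.getVNID(name); err == nil {
			return id
		}
		time.Sleep(5 * time.Millisecond)
	}
	id := fetch(name) // direct GET of the NetNamespace
	v.setVNID(name, id)
	// BUG: missing v.policy.AddNetNamespace(name, id)
	return id
}

func main() {
	np := &policyPlugin{namespaces: map[uint32]string{}}
	vmap := &nodeVNIDMap{ids: map[string]uint32{}, policy: np}

	// The watch event is delayed past the timeout, so pod setup falls
	// into the recovery path.
	vnid := vmap.WaitAndGetVNID("openshift-apiserver", func(string) uint32 { return 0x62b485 })
	fmt.Println("EnsureVNIDRules:", np.EnsureVNIDRules(vnid))
}
```

Running this, EnsureVNIDRules fails for the recovered VNID because only setVNID() was called, which is exactly the early return described above; a fix would have the recovery path also notify the policy plugin, synchronized carefully with handleAddOrUpdateNetNamespace().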
Created attachment 1679732 [details] log from sdn-zcwr6 pod on bad master
Created attachment 1679733 [details] ovs-ofctl dump-flows br0 -O OpenFlow13 on bad master
Setting the target release to 4.5 so that this can get worked on. We will backport fixes as needed once we have identified the problem.
There is a Prometheus metric for VNID-not-found errors; you can graph it on a test cluster and see if there is a correlation.
Jim, I think your analysis is exactly correct. And it should be an easy fix. It would be better if we didn't make direct GET calls to the apiserver, but that's a pattern that openshift-sdn often uses. Is it only NetNamespace watches that fail? Clearly GETs succeed.
Hi Casey, any advice on how to verify this bug, since this kind of issue is not always reproducible? Thanks in advance.
This is quite difficult to verify. It's only triggered by etcd resource starvation, and has never happened in any meaningful way outside of Azure, which we can't use. Sorry, I don't immediately have any ideas. Maybe you could really throttle down disk IOPS, but that will break the cluster in other ways. Perhaps the best way to verify this is just to create a lot of namespaces very quickly and ensure that pods in them work.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409