Description of problem:
When running tests that create thousands of objects on a large-scale cluster (1000-2000 nodes), nodes start to go NotReady because the sdn pod is killed: the liveness probe fails before the liveness server has even started. Not sure whether this can be reproduced at much lower scale; we were able to run the tests fine on a 250-node cluster built using an older build.
Version-Release number of selected component (if applicable):
Reproduced it twice at both 1000 and 2000 node scale.
Steps to Reproduce:
1. Install a cluster using 4.2.0-0.nightly-2019-10-07-011045.
2. Create thousands of objects.
3. Check the node and sdn pod status.
Actual results:
Nodes are in NotReady state and sdn pods are crashing.
Expected results:
Cluster is stable.
The logs of various system components, events, journal, kubelet, sdn and runtime are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/2000-node-scale/bugs/sdn/.
SDN logs show that apiserver responses are very slow:
I1014 20:27:16.921461 15009 node.go:378] Starting openshift-sdn network plugin
I1014 20:27:28.169570 15009 vnids.go:148] Associate netid 5369580 to namespace "b0" with mcEnabled false
In another (non-1000-node) cluster, that gap is 0.3 seconds, not 11.2 seconds.
I1014 20:27:28.381738 15009 proxy.go:103] Using unidling+iptables Proxier.
I1014 20:27:38.584259 15009 proxier.go:214] Setting proxy IP to 10.0.139.69 and initializing iptables
I1014 20:27:51.291165 15009 proxy.go:89] Starting multitenant SDN proxy endpoint filter
(normally 0.22s and 0.15s, not 10.2s and 12.7s)
The "interrupt: Gracefully shutting down ..." message appears (untimestamped) between the 20:27:38 and 20:27:51 logs; the kube-proxy healthz server doesn't get started until shortly after the 20:27:51 message, so the problem seems to be that startup is so slow that we time out on liveness before even starting the liveness server.
We tweaked the server-side parameters, including max-requests-inflight and max-mutating-requests-inflight, to 1000/500, but we are using the default qps/burst rates on the client side (5/10). The client default qps/burst rates might be too low for large/dense clusters; maybe we should bump the defaults, though we don't have guidance on the values. It's tricky since there are a lot of variables involved, including the cluster size, number of objects, tolerable latencies, etc.
We installed Robert's custom watch-based kubelet and bumped up the qps/burst rates (50/10 instead of the default 5/10) to improve the apiserver response rate for a large/dense cluster, and ran the same test creating around 2k projects and 10k pods. We hit the same issue again; here are the logs: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/2000-node-scale/bugs/sdn-tuned-qps/. Looking at the sdn log, the time between starting the openshift-sdn network plugin and associating the netid is still around 13 seconds:
I1015 04:44:32.581887 7673 node.go:378] Starting openshift-sdn network plugin
I1015 04:44:45.455444 7673 vnids.go:148] Associate netid 3731156 to namespace "b0" with mcEnabled false
I1015 04:44:45.455470 7673 vnids.go:148] Associate netid 6597128 to namespace "b1" with mcEnabled false
So increasing the timeout on the liveness probe is one possible solution, I guess.
I suspect we need to just disable the liveness check. It isn't helping us at all.
Do you know how to stop the network operator? Then you can just edit the sdn daemonset, remove the liveness probe, and re-run your tests.
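For the record, one way to do that is sketched below. The resource and namespace names are from memory and should be treated as assumptions; verify them against your cluster before running:

```shell
# Stop the cluster-version and network operators so they don't
# revert the manual change (assumed deployment/namespace names).
oc scale deployment/cluster-version-operator -n openshift-cluster-version --replicas=0
oc scale deployment/network-operator -n openshift-network-operator --replicas=0

# Remove the liveness probe from the first container of the sdn
# daemonset, then let the pods roll and re-run the tests.
oc -n openshift-sdn patch daemonset sdn --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
```

Note that with the operators scaled down the cluster is unsupported and self-healing is off, so this is strictly a test-cluster experiment.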
We don't have the cluster around anymore, but we do plan to run the tests in the future on a large-scale OCP cluster installed on top of OpenStack. It would be great if the liveness check enhancements could be added to the product by then (a couple of weeks down the road), or we can try disabling the check to see if it resolves the problem, as Casey suggested.
Assigned this bug to you since it needs a lot of nodes to reproduce.
We will not have a cluster this large for 4.3 - moving verification to 4.4.
*** Bug 1747871 has been marked as a duplicate of this bug. ***
We are no longer hitting the issue; verified it on a 2000-node OCP 4.3 cluster built using the 4.3.0-rc.3 payload.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.