Bug 1761609

Summary: Nodes are in NotReady state because of failed sdn pod liveness probes (for very large clusters)
Product: OpenShift Container Platform Reporter: Naga Ravi Chaitanya Elluri <nelluri>
Component: Networking    Assignee: Casey Callendrello <cdc>
Networking sub component: openshift-sdn QA Contact: Mike Fiedler <mifiedle>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: akamra, danw, mifiedle, nelluri, rhowe, ricarril, sreber, xtian, yapei, zzhao
Version: 4.2.0   
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Linux   
Whiteboard: aos-scalability-42
Fixed In Version:    Doc Type: No Doc Update
Last Closed: 2020-05-13 21:52:12 UTC    Type: Bug

Description Naga Ravi Chaitanya Elluri 2019-10-14 21:15:57 UTC
Description of problem:
When running tests that create thousands of objects on a large-scale cluster (1000-2000 nodes), nodes start to go NotReady because the sdn pod is killed by a failing liveness probe before the liveness server has even started. We are not sure whether this can be reproduced at much smaller scale; the same tests ran fine on a 250-node cluster built using an older build.
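For reference, a minimal sketch of how the symptom can be checked from the CLI; the "sdn" daemonset name, the "openshift-sdn" namespace, and the container index are assumptions based on a default OCP 4.x install rather than values confirmed on this cluster:

oc get nodes --no-headers | grep -c NotReady
oc -n openshift-sdn get pods -o wide | grep -E 'CrashLoopBackOff|Error'
# Liveness probe settings on the sdn container (index 0 assumed):
oc -n openshift-sdn get daemonset sdn \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}'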
  

Version-Release number of selected component (if applicable):
Payload/build: 4.2.0-0.nightly-2019-10-07-011045 

How reproducible:
Reproduced it twice, at 1000-node and 2000-node scale.

Steps to Reproduce:
1. Install a cluster using 4.2.0-0.nightly-2019-10-07-011045.
2. Create thousands of objects (an illustrative loop is sketched after these steps).
3. Check the node and sdn pod status.
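An illustrative loop for step 2; the actual tests used a dedicated workload generator, so the image and object counts below are placeholders, not what was actually run:

for i in $(seq 0 1999); do
  oc new-project "b${i}"
  oc -n "b${i}" create deployment "pause-${i}" --image=k8s.gcr.io/pause:3.1
done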

Actual results:
Nodes are in NotReady state and sdn pods are crashing.

Expected results:
Cluster is stable.

Additional info:
The logs of various system components, events, journal, kubelet, sdn and runtime are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/2000-node-scale/bugs/sdn/.

Comment 1 Dan Winship 2019-10-14 21:41:43 UTC
SDN logs show that apiserver responses are very slow:

I1014 20:27:16.921461   15009 node.go:378] Starting openshift-sdn network plugin
I1014 20:27:28.169570   15009 vnids.go:148] Associate netid 5369580 to namespace "b0" with mcEnabled false

In another (non-1000-node) cluster, that gap is 0.3 seconds, not 11.2 seconds.

Later:

I1014 20:27:28.381738   15009 proxy.go:103] Using unidling+iptables Proxier.
...
I1014 20:27:38.584259   15009 proxier.go:214] Setting proxy IP to 10.0.139.69 and initializing iptables
...
I1014 20:27:51.291165   15009 proxy.go:89] Starting multitenant SDN proxy endpoint filter

(normally 0.22s and 0.15s, not 10.2s and 12.7s)


The "interrupt: Gracefully shutting down ..." message appears (untimestamped) between the 20:27:38 and 20:27:51 logs; the kube-proxy healthz server doesn't get started until shortly after the 20:27:51 message, so the problem seems to be that startup is so slow that we time out on liveness before even starting the liveness server.

Comment 2 Naga Ravi Chaitanya Elluri 2019-10-14 22:08:05 UTC
We tweaked the server-side parameters, raising max-inflight-requests and max-mutating-inflight-requests to 1000/500, but we are using the default client-side qps/burst rates (5/10). The default client qps/burst rates might be too low for large/dense clusters; maybe we should bump the defaults, though we don't have guidance on the values. Picking them is tricky since a lot of variables are involved, including cluster size, number of objects, tolerable latencies, etc.
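As far as I know, the client-side qps/burst values come from the SDN's client-go configuration (its QPS/Burst fields, which default to 5/10), so raising them needs a code or manifest change rather than a flag. The server-side limits, on the other hand, can be confirmed from the published kube-apiserver config; the namespace and configmap name below are assumptions based on a default OCP 4.x install:

oc -n openshift-kube-apiserver get configmap config -o yaml | grep -iE 'inflight'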

Comment 3 Naga Ravi Chaitanya Elluri 2019-10-15 14:12:02 UTC
We installed Robert's custom watch-based kubelet and bumped the qps/burst rates (50/10 instead of the default 5/10) to improve the apiserver response rate for a large/dense cluster, then ran the same test to create around 2k projects and 10k pods. We hit the same issue again; the logs are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/2000-node-scale/bugs/sdn-tuned-qps/. Looking at the sdn log, the gap between starting the openshift-sdn network plugin and associating the first netid is still around 13 seconds:

I1015 04:44:32.581887    7673 node.go:378] Starting openshift-sdn network plugin
I1015 04:44:45.455444    7673 vnids.go:148] Associate netid 3731156 to namespace "b0" with mcEnabled false
I1015 04:44:45.455470    7673 vnids.go:148] Associate netid 6597128 to namespace "b1" with mcEnabled false

So increasing the liveness probe timeout is one possible solution, I guess.
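If we go the timeout-bumping route, a rough sketch of what a one-off change could look like on a test cluster; the network operator would normally revert this, and the probe path, container index, and values below are assumptions rather than the shipped defaults:

oc -n openshift-sdn patch daemonset sdn --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 60},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 6}
]'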

Comment 4 Casey Callendrello 2019-10-16 12:06:33 UTC
I suspect we need to just disable the liveness check. It isn't helping us at all.

Do you know how to stop the network operator? If so, you can edit the sdn daemonset, remove the liveness probe, and re-run your tests.
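Roughly something like the following; the deployment/daemonset names and namespaces are what a default 4.x install uses, so double-check them on your cluster. Scale the CVO down first so it doesn't bring the network operator back, which would in turn restore the probe:

oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas=0
oc -n openshift-network-operator scale deployment network-operator --replicas=0
# Then drop the liveness probe from the sdn container (index 0 assumed):
oc -n openshift-sdn patch daemonset sdn --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'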

Comment 5 Naga Ravi Chaitanya Elluri 2019-10-18 16:56:11 UTC
We don't have the cluster around anymore, but we do plan to run the tests on a large-scale OCP cluster installed on top of OpenStack. It would be great if the liveness check enhancements could land in the product by then (a couple of weeks down the road); otherwise we can try disabling the probe, as Casey suggested, and see if that resolves the problem.

Comment 7 zhaozhanqi 2019-11-11 01:48:29 UTC
Hi Naga,
Assigning this bug to you since it needs a large number of nodes to be reproduced.

Comment 11 Mike Fiedler 2019-12-02 12:47:06 UTC
We will not have a cluster this large for 4.3 - moving verification to 4.4.

Comment 12 Casey Callendrello 2019-12-03 10:50:02 UTC
*** Bug 1747871 has been marked as a duplicate of this bug. ***

Comment 14 Naga Ravi Chaitanya Elluri 2020-01-24 20:37:44 UTC
We are no longer hitting the issue; verified on a 2000-node OCP 4.3 cluster built using the 4.3.0-rc.3 payload.

Comment 18 errata-xmlrpc 2020-05-13 21:52:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581