Description of problem:
When running tests that create thousands of objects on a large-scale cluster (1000-2000 nodes), nodes start to go NotReady because the sdn pod is killed: the liveness probe fails before the liveness server has even started. Not sure whether this can be reproduced at much lower scale; we were able to run the tests fine on a 250-node cluster built using an older build.
Version-Release number of selected component (if applicable):
Reproduced it twice at both 1000 and 2000 node scale.
Steps to Reproduce:
1. Install a cluster using 4.2.0-0.nightly-2019-10-07-011045.
2. Create thousands of objects.
3. Check the node and sdn pod status.
Actual results:
Nodes are in NotReady state and sdn pods are crashing.
Expected results:
Cluster is stable.
The logs of various system components, events, journal, kubelet, sdn and runtime are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/2000-node-scale/bugs/sdn/.
SDN logs show that apiserver responses are very slow:
I1014 20:27:16.921461 15009 node.go:378] Starting openshift-sdn network plugin
I1014 20:27:28.169570 15009 vnids.go:148] Associate netid 5369580 to namespace "b0" with mcEnabled false
In another (non-1000-node) cluster, that gap is 0.3 seconds, not 11.2 seconds.
I1014 20:27:28.381738 15009 proxy.go:103] Using unidling+iptables Proxier.
I1014 20:27:38.584259 15009 proxier.go:214] Setting proxy IP to 10.0.139.69 and initializing iptables
I1014 20:27:51.291165 15009 proxy.go:89] Starting multitenant SDN proxy endpoint filter
(normally 0.22s and 0.15s, not 10.2s and 12.7s)
The "interrupt: Gracefully shutting down ..." message appears (untimestamped) between the 20:27:38 and 20:27:51 logs; the kube-proxy healthz server doesn't get started until shortly after the 20:27:51 message, so the problem seems to be that startup is so slow that we time out on liveness before even starting the liveness server.
We tweaked the server-side parameters, including max-requests-inflight and max-mutating-requests-inflight, to 1000/500, but we are using the default qps/burst rates on the client side (5/10). The client default qps/burst rates might be too low for large/dense clusters; maybe we should bump the defaults, though we don't have guidance on the values. It's tricky since there are a lot of variables involved, including the cluster size, number of objects, tolerable latencies, etc.
We installed Robert's custom watch-based kubelet and bumped up the qps/burst rates (50/10 instead of the default 5/10) to improve the apiserver response rate for a large/dense cluster, and ran the same test creating around 2k projects and 10k pods. We hit the same issue again; here are the logs: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/2000-node-scale/bugs/sdn-tuned-qps/. Looking at the sdn log, the time between starting the openshift-sdn network plugin and associating the netid is still around 13 seconds:
I1015 04:44:32.581887 7673 node.go:378] Starting openshift-sdn network plugin
I1015 04:44:45.455444 7673 vnids.go:148] Associate netid 3731156 to namespace "b0" with mcEnabled false
I1015 04:44:45.455470 7673 vnids.go:148] Associate netid 6597128 to namespace "b1" with mcEnabled false
So increasing the timeout on the liveness probe is one possible solution, I guess.
I suspect we need to just disable the liveness check. It isn't helping us at all.
Do you know how to stop the network operator? Then you can just edit the sdn daemonset, remove the liveness probe, and re-run your tests.
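For the record, one way to do that is sketched below. The resource and namespace names are from memory and should be treated as assumptions; verify them against your cluster before running:

```shell
# Stop the cluster-version and network operators so they don't
# revert the manual change (assumed deployment/namespace names).
oc scale deployment/cluster-version-operator -n openshift-cluster-version --replicas=0
oc scale deployment/network-operator -n openshift-network-operator --replicas=0

# Remove the liveness probe from the first container of the sdn
# daemonset, then let the pods roll and re-run the tests.
oc -n openshift-sdn patch daemonset sdn --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
```

Note that with the operators scaled down the cluster is unsupported and self-healing is off, so this is strictly a test-cluster experiment.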
We don't have the cluster around anymore, but we do plan to run the tests in the future on a large-scale OCP cluster installed on top of OpenStack. It would be great if the liveness check enhancements could be added to the product by then (a couple of weeks down the road), or we can try disabling the check to see if it resolves the problem, as Casey suggested.
Assigned this bug to you since it needs a lot of nodes to reproduce.
We will not have a cluster this large for 4.3 - moving verification to 4.4.
*** Bug 1747871 has been marked as a duplicate of this bug. ***
We are no longer hitting the issue; verified it on a 2000-node OCP 4.3 cluster built using the 4.3.0-rc.3 payload.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.