Bug 1761609
Summary: | Nodes are in NotReady state because of failed sdn pod liveness probes (for very large clusters) | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri>
Component: | Networking | Assignee: | Casey Callendrello <cdc>
Networking sub component: | openshift-sdn | QA Contact: | Mike Fiedler <mifiedle>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | unspecified | CC: | akamra, danw, mifiedle, nelluri, rhowe, ricarril, sreber, xtian, yapei, zzhao
Version: | 4.2.0 | |
Target Milestone: | --- | |
Target Release: | 4.4.0 | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | aos-scalability-42 | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-05-13 21:52:12 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description: Naga Ravi Chaitanya Elluri, 2019-10-14 21:15:57 UTC
SDN logs show that apiserver responses are very slow:

```
I1014 20:27:16.921461 15009 node.go:378] Starting openshift-sdn network plugin
I1014 20:27:28.169570 15009 vnids.go:148] Associate netid 5369580 to namespace "b0" with mcEnabled false
```

In another (non-1000-node) cluster, that gap is 0.3 seconds, not 11.2 seconds. Later:

```
I1014 20:27:28.381738 15009 proxy.go:103] Using unidling+iptables Proxier.
...
I1014 20:27:38.584259 15009 proxier.go:214] Setting proxy IP to 10.0.139.69 and initializing iptables
...
I1014 20:27:51.291165 15009 proxy.go:89] Starting multitenant SDN proxy endpoint filter
```

(normally 0.22s and 0.15s, not 10.2s and 12.7s)

The "interrupt: Gracefully shutting down ..." message appears (untimestamped) between the 20:27:38 and 20:27:51 logs; the kube-proxy healthz server doesn't get started until shortly after the 20:27:51 message, so the problem seems to be that startup is so slow that we time out on liveness before even starting the liveness server.

We tweaked the server-side parameters, including max-inflight-requests and max-mutating-inflight-requests, to 1000/500, but we are still using the default qps/burst rates on the client side (5/10). The client default qps/burst rates might be too low for large/dense clusters. Maybe we should bump the defaults, though we don't have guidance on the values; it's tricky, since there are a lot of variables involved, including the cluster size, the number of objects, tolerable latencies, etc.

We installed Robert's custom watch-based kubelet and bumped up the qps/burst rates (50/10 instead of the default 5/10) to improve the apiserver response rate for a large/dense cluster, then ran the same test to create around 2k projects and 10k pods. We hit the same issue again; here are the logs: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/2000-node-scale/bugs/sdn-tuned-qps/.
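To see why the client-side qps/burst defaults matter at this scale, here is a minimal, hypothetical token-bucket sketch (not client-go's actual rate limiter) estimating how long a throttled client takes to issue N API requests, ignoring network latency; the request count of 2000 is taken from the ~2k-project test above:

```python
def time_to_issue(n_requests: int, qps: float, burst: int) -> float:
    """Seconds until a token-bucket limiter releases all n_requests.

    The bucket starts full: the first `burst` requests go out immediately,
    and each subsequent request waits for a token refilled at `qps` per second.
    """
    if n_requests <= burst:
        return 0.0  # the initial burst absorbs them all
    return (n_requests - burst) / qps

# Roughly one API call per namespace on the 2k-project test cluster:
print(time_to_issue(2000, qps=5, burst=10))   # default client settings -> 398.0 s
print(time_to_issue(2000, qps=50, burst=10))  # bumped qps -> 39.8 s
```

Under these (simplified) assumptions, the default 5/10 setting alone adds minutes of queuing during startup, which is the same order of magnitude as the liveness timeout being hit.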
Looking at the sdn log, the time between starting the openshift-sdn network plugin and associating the netid is still around 13 seconds:

```
I1015 04:44:32.581887 7673 node.go:378] Starting openshift-sdn network plugin
I1015 04:44:45.455444 7673 vnids.go:148] Associate netid 3731156 to namespace "b0" with mcEnabled false
I1015 04:44:45.455470 7673 vnids.go:148] Associate netid 6597128 to namespace "b1" with mcEnabled false
```

So increasing the timeout on liveness is one of the solutions, I guess.

I suspect we need to just disable the liveness check. It isn't helping us at all. Do you know how to stop the network operator? Then you can just edit the sdn daemonset and remove it, then re-run your tests.

We don't have the cluster around, but we do have plans to run the tests in the future on a large-scale OCP cluster installed on top of OpenStack. It would be great if the liveness check enhancements could be added to the product by then (a couple of weeks down the road), or we can try disabling it and see if that resolves the problem, as Casey suggested.

Hi Naga, I assigned this bug to you since it needs a lot of nodes to be reproduced.

We will not have a cluster this large for 4.3 - moving verification to 4.4.

*** Bug 1747871 has been marked as a duplicate of this bug. ***

We are no longer hitting the issue; verified it on a 2000-node OCP 4.3 cluster built using the 4.3.0-rc.3 payload.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
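For reference, the workaround discussed in the thread (stop the network operator from reconciling, then strip the probe from the sdn daemonset) could be sketched roughly as follows; the resource names assume a typical OCP 4.x layout and should be verified against the cluster before use:

```shell
# Sketch only: stop the operator so it doesn't revert the daemonset edit,
# then remove the livenessProbe from the sdn container (index 0 assumed).
oc scale deployment/network-operator -n openshift-network-operator --replicas=0
oc -n openshift-sdn patch daemonset/sdn --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
```

This is a test-only measure; the operator must be scaled back up afterwards so the cluster returns to its managed configuration.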