Bug 1873816
| Summary: | Nodes going to NotReady state frequently | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Avinash Bodhe <abodhe> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | abhinkum, abodhe, aos-bugs, dphillip, jhou, jokerman, juqiao, mfojtik, nagrawal, openshift-bugs-escalate, rcarrier, sagopina, sttts, tsweeney, xxia |
| Version: | 4.4 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-01-08 15:39:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 2
Seth Jennings
2020-08-31 16:13:25 UTC
Without must-gather output we cannot work on this.

> Kindly check the case 02726830 in supportshell for node logs, kubelet logs, sosreports from affected nodes and must-gather.

No access. Please upload elsewhere such that we can access it.
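As a side note, the artifacts being requested here are typically produced with standard `oc` commands; a minimal sketch (the node name is a placeholder, not taken from this bug):

```
# Cluster-wide diagnostic bundle; writes a local must-gather.local.* directory that can then be uploaded
oc adm must-gather

# Kubelet journal from one affected node
oc adm node-logs <node-name> -u kubelet > kubelet.log
```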
Created attachment 1713287 [details]
sdn pod logs
Created attachment 1713288 [details]
sdn pod logs
Created attachment 1713303 [details]
Alerts
Created attachment 1713304 [details]
Alerts-Resolved
The must-gather file seems to be incomplete; it cannot be unpacked.

Which kind of install is this? Comment 2 already asked, without an answer. Which kind of LB is used? This looks like the LB configuration is wrong.

> Which kind of install is this? Comment 2 already asked, without an answer.

- UPI-based setup, VMware restricted network installation.

> Which kind of LB is used? This looks like the LB configuration is wrong.

- A pair of haproxy nodes with keepalived were configured as the LB for api, api-int and apps. If the LB configuration is the issue, why do we see issues with the egress nodes only?
- Let me also know if you need the haproxy configuration.

*** This bug has been marked as a duplicate of bug 1836017 ***

@Avinash Please configure readyz correctly as documented since 4.6 for UPI. Unfortunately, UPI examples in the installer and docs were wrong in the past.

@Stefan, can you help me understand what exact configuration needs to be done? Can you point to the correct documentation or the exact steps? I can see similar details in the documentation for both 4.3 and 4.6.

Hello, any update on #15?

As already written in the mail, this document describes the background: https://github.com/openshift/installer/blob/master/docs/dev/kube-apiserver-health-check.md

I bet this is available in some UPI documentation. The UPI installer team will know more.

Sending back to Node for the node outage part of the story. The connection refused errors are due to LB misconfiguration, a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1836017. But I doubt this causes the node outage. In general, try to see where the failing requests end: do they reach haproxy? Do you see errors on the master nodes? In the apiserver logs?

haproxy is not in our hands; it is "bring your own" as far as I understand, following the UPI docs and example config. The installer team provides these, afaik, and is a better addressee for debugging, probably even better than the Node team, who only suffer from not being able to connect to the API.

What does this mean:

> - A pair of haproxy nodes with keepalived were configured as the LB for api, api-int and apps. If the LB configuration is the issue, why do we see issues with the egress nodes only?

Only two nodes out of many show this issue? This looks more like a networking issue. Why should this be kube-apiserver's fault if the other nodes are happy to speak to it through haproxy?
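For context on the readyz remark above: the linked kube-apiserver-health-check document describes fronting port 6443 with an HTTPS health check against /readyz rather than a bare TCP check, so that a gracefully terminating apiserver is taken out of rotation before it starts refusing connections. A minimal illustrative haproxy stanza along those lines (server names, addresses and check timings are placeholders, not this cluster's actual configuration):

```
# Sketch only: master hostnames/addresses below are placeholders
listen api-server-6443
    bind *:6443
    mode tcp
    option httpchk GET /readyz HTTP/1.0
    option log-health-checks
    balance roundrobin
    server master0 master0.example.com:6443 check check-ssl verify none inter 1s fall 2 rise 3
    server master1 master1.example.com:6443 check check-ssl verify none inter 1s fall 2 rise 3
    server master2 master2.example.com:6443 check check-ssl verify none inter 1s fall 2 rise 3
```

The same health-check approach applies to whatever load balancer is actually in use; the UPI documentation and installer examples referenced above carry the authoritative configuration.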
Disregard the last comment. Was on the wrong BZ.

This should be fixed in 4.6.9+ (4.5 pending) with the following BZ and kernel patch: bug 1857446.

*** This bug has been marked as a duplicate of bug 1857446 ***