From the CU ticket:
> But we still see the nodes go to NotReady state for about 15 minutes and then recover automatically. We noticed errors in the sdn pods on the egress nodes, which could not communicate with the API VIP. See also the kubelet-side change to mitigate this:
> https://github.com/kubernetes/kubernetes/pull/52176
> https://github.com/kubernetes/kubernetes/issues/48638

However, it seems the LB is not removing downed apiservers from the backend. Many controllers (not just the kubelet), including SDN, are having issues communicating with the apiserver over the internal LB. Sending to kube-apiserver for investigation. Is this an IPI vSphere install?
Without must-gather output we cannot work on this.
> Kindly check case 02726830 in supportshell for node logs, kubelet logs, sosreports from the affected nodes, and the must-gather.

No access. Please upload the data elsewhere so that we can access it.
Created attachment 1713287 [details] sdn pod logs
Created attachment 1713288 [details] sdn pod logs
Created attachment 1713303 [details] Alerts
Created attachment 1713304 [details] Alerts-Resolved
The must-gather file seems to be incomplete; we cannot unpack it. Which kind of install is this? Comment 2 already asked, without an answer. Which kind of LB is used? This looks like the LB configuration is wrong.
> Which kind of install is this? Comment 2 already asked, without an answer.
UPI-based restricted-network installation on VMware.

> Which kind of LB is used? This looks like the LB configuration is wrong.
A pair of haproxy nodes with keepalived is configured as the LB for api, api-int and apps. If the LB configuration is the issue, why do we see issues on the egress nodes only? Let me also know if you need the haproxy configuration.
*** This bug has been marked as a duplicate of bug 1836017 ***
@Avinash Please configure the readyz health check correctly, as documented for UPI since 4.6. Unfortunately, the UPI examples in the installer and docs were wrong in the past.
@Stefan, can you help me understand what exact configuration needs to be done? Can you point me to the correct documentation or the exact steps? I see similar details in the documentation for both 4.3 and 4.6.
Hello, any update on #15?
As already written in the mail, this document describes the background: https://github.com/openshift/installer/blob/master/docs/dev/kube-apiserver-health-check.md

I bet this is available in some UPI documentation; the UPI installer team will know more.
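For reference, a minimal sketch of what that document asks for, roughly following the published UPI haproxy example (hostnames, domain and timing values are placeholders for your environment and may need tuning):

  listen api-server-6443
    bind *:6443
    mode tcp
    # Probe the kube-apiserver /readyz endpoint instead of a plain TCP check,
    # so a terminating or unready apiserver gets removed from the rotation.
    option httpchk GET /readyz HTTP/1.0
    option log-health-checks
    balance roundrobin
    server master0 master0.<cluster>.<base-domain>:6443 weight 1 verify none check check-ssl inter 10s fall 2 rise 3
    server master1 master1.<cluster>.<base-domain>:6443 weight 1 verify none check check-ssl inter 10s fall 2 rise 3
    server master2 master2.<cluster>.<base-domain>:6443 weight 1 verify none check check-ssl inter 10s fall 2 rise 3

The same applies to the api-int backend. A plain TCP check on port 6443 is exactly the misconfiguration that leaves downed apiservers in the backend.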
Sending back to the Node team for the node-outage part of the story. The connection-refused errors are due to LB misconfiguration, a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1836017, but I doubt that causes the node outage.
In general, try to see where the failing requests end up: do they reach haproxy? Do you see errors on the master nodes, in the apiserver logs? haproxy is not in our hands; it is "bring your own" as far as I understand, following the UPI docs and example config. The installer team provides those, afaik, and is a better addressee for debugging, probably even better than the node team, who only suffer from not being able to connect to the API.
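A rough way to narrow this down, assuming haproxy runs as a systemd unit on the LB nodes (hostnames and IPs below are placeholders for your environment):

  # From an affected egress node, through the LB VIP: does /readyz answer?
  curl -k -o /dev/null -w '%{http_code}\n' https://api-int.<cluster>.<base-domain>:6443/readyz

  # Bypassing the LB: does each master answer directly?
  curl -k -o /dev/null -w '%{http_code}\n' https://<master-ip>:6443/readyz

  # On the haproxy nodes: are backends flapping or marked DOWN?
  journalctl -u haproxy | grep -iE 'down|check'

  # On the masters: kube-apiserver and kubelet logs
  oc logs -n openshift-kube-apiserver kube-apiserver-<master-node>
  oc adm node-logs <master-node> -u kubelet

Comparing the two curl results tells you whether the failures happen between the node and the VIP, inside haproxy, or on the apiservers themselves.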
What does this mean:

> A pair of haproxy nodes with keepalived is configured as the LB for api, api-int and apps. If the LB configuration is the issue, why do we see issues on the egress nodes only?

Only two nodes out of many show this issue? That looks more like a networking issue. Why should this be kube-apiserver's fault if the other nodes are happy to talk to it through haproxy?
Disregard last comment. Was on the wrong BZ.
This should be fixed in 4.6.9+ (4.5 pending) by the following BZ and kernel patch: bug 1857446.

*** This bug has been marked as a duplicate of bug 1857446 ***