Bug 1873816
| Summary: | Nodes going to NotReady state frequently | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Avinash Bodhe <abodhe> |
| Component: | Node | Assignee: | Ryan Phillips <rphillips> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | abhinkum, abodhe, aos-bugs, dphillip, jhou, jokerman, juqiao, mfojtik, nagrawal, openshift-bugs-escalate, rcarrier, sagopina, sttts, tsweeney, xxia |
| Version: | 4.4 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-01-08 15:39:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 2
Seth Jennings
2020-08-31 16:13:25 UTC
Without must-gather output we cannot work on this.

> Kindly check the case 02726830 in supportshell for node logs, kubelet logs, sosreports from affected nodes and must-gather.

No access. Please upload elsewhere such that we can access it.
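As a side note, the artifacts being requested here are typically produced with standard `oc` commands; a minimal sketch (the node name is a placeholder, not taken from this bug):

```
# Cluster-wide diagnostic bundle; writes a local must-gather.local.* directory that can then be uploaded
oc adm must-gather

# Kubelet journal from one affected node
oc adm node-logs <node-name> -u kubelet > kubelet.log
```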
Created attachment 1713287 [details]
sdn pod logs
Created attachment 1713288 [details]
sdn pod logs
Created attachment 1713303 [details]
Alerts
Created attachment 1713304 [details]
Alerts-Resolved
The must-gather file seems to be incomplete; it cannot be unpacked.

Which kind of install is this? Comment 2 already asked, without an answer. Which kind of LB is used? This looks like the LB configuration is wrong.

> Which kind of install is this? Comment 2 already asked, without an answer.

- UPI-based setup, VMware restricted network installation.

> Which kind of LB is used? This looks like the LB configuration is wrong.

- A pair of haproxy nodes with keepalived were configured as the LB for api, api-int and apps. If the LB configuration is the issue, why do we see issues with the egress nodes only?
- Let me also know if you need the haproxy configuration.

*** This bug has been marked as a duplicate of bug 1836017 ***

@Avinash Please configure readyz correctly as documented since 4.6 for UPI. Unfortunately, UPI examples in the installer and docs were wrong in the past.

@Stefan, can you help me understand what exact configuration needs to be done? Can you point to the correct documentation or the exact steps? I can see similar details in the documentation for both 4.3 and 4.6.

Hello, any update on #15?

As already written in the mail, this document describes the background: https://github.com/openshift/installer/blob/master/docs/dev/kube-apiserver-health-check.md

I bet this is available in some UPI documentation. The UPI installer team will know more.

Sending back to Node for the node outage part of the story. The connection refused errors are due to LB misconfiguration, a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1836017. But I doubt this causes the node outage. In general, try to see where the failing requests end: do they reach haproxy? Do you see errors on the master nodes? In the apiserver logs?

haproxy is not in our hands; it is "bring your own" as far as I understand, following the UPI docs and example config. The installer team provides these, afaik, and is a better addressee for debugging, probably even better than the Node team, who only suffer from not being able to connect to the API.

What does this mean:

> - A pair of haproxy nodes with keepalived were configured as the LB for api, api-int and apps. If the LB configuration is the issue, why do we see issues with the egress nodes only?

Only two nodes out of many show this issue? This looks more like a networking issue. Why should this be kube-apiserver's fault if the other nodes are happy to speak to it through haproxy?
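For context on the readyz remark above: the linked kube-apiserver-health-check document describes fronting port 6443 with an HTTPS health check against /readyz rather than a bare TCP check, so that a gracefully terminating apiserver is taken out of rotation before it starts refusing connections. A minimal illustrative haproxy stanza along those lines (server names, addresses and check timings are placeholders, not this cluster's actual configuration):

```
# Sketch only: master hostnames/addresses below are placeholders
listen api-server-6443
    bind *:6443
    mode tcp
    option httpchk GET /readyz HTTP/1.0
    option log-health-checks
    balance roundrobin
    server master0 master0.example.com:6443 check check-ssl verify none inter 1s fall 2 rise 3
    server master1 master1.example.com:6443 check check-ssl verify none inter 1s fall 2 rise 3
    server master2 master2.example.com:6443 check check-ssl verify none inter 1s fall 2 rise 3
```

The same health-check approach applies to whatever load balancer is actually in use; the UPI documentation and installer examples referenced above carry the authoritative configuration.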
Disregard the last comment. Was on the wrong BZ.

This should be fixed in 4.6.9+ (4.5 pending) with the following BZ and kernel patch: bug 1857446.

*** This bug has been marked as a duplicate of bug 1857446 ***