Bug 1873114

Field | Value
---|---
Summary | Nodes go into NotReady state (VMware)
Product | OpenShift Container Platform
Component | Node
Sub component | Kubelet
Version | 4.5
Hardware | x86_64
OS | Linux
Reporter | Juan Manuel Parrilla Madrid <jparrill>
Assignee | Ryan Phillips <rphillips>
QA Contact | Sunil Choudhary <schoudha>
Status | CLOSED ERRATA
Severity | urgent
Priority | urgent
Keywords | Bugfix
Target Release | 4.7.0
Flags | rcarrier: needinfo-
Clones | 1873949, 1901208 (view as bug list)
Bug Depends On | 1857446
Bug Blocks | 1901208
Type | Bug
Last Closed | 2021-02-24 15:16:22 UTC
CC | aarapov, abhinkum, abraj, achakrat, adeshpan, agawand, amsingh, andcosta, aos-bugs, aprajapa, asheth, asm, aygarg, bleanhar, ccoleman, ChetRHosey, cpassare, dbenoit, deparker, dphillip, emachado, fjaspe, igreen, jmalde, jokerman, kahara, kangell, kechung, lszaszki, mchebbi, mdhanve, mdunnett, mfuruta, mnunes, mrobson, nagrawal, namato, naygupta, ngirard, nijoshi, oarribas, openshift-bugs-escalate, pchavan, pkhaire, prdeshpa, rbryant, rcarrier, rdave, rdiscala, rheinzma, rpalathi, rphillips, rupatel, ryan.phillips, sagopina, sbhavsar, smaudet, sople, sparpate, srengan, ssonigra, tschelle, tstellar, tsweeney, vjaypurk, weinliu

Doc Type: Bug Fix

Doc Text:

Cause: HTTP/2 transports did not have the correct options attached to their connections to provide timeout logic. VMware network interfaces (and other scenarios) would blip for a few seconds, causing connections to fail silently.

Consequence: Connections lingered around, causing other related failures; for instance, nodes not being detected as down, and API calls using stale connections and failing.

Fix: Proper timeouts were added for <= 4.7. Upstream has a more permanent fix which will roll out in >= 4.8.

Result: HTTP/2 connections within the system are more reliable, and side effects are mitigated.
Description — Juan Manuel Parrilla Madrid, 2020-08-27 12:16:13 UTC
I forgot to mention: if you restart the kubelet, the node comes alive again (from the OCP perspective) and starts reporting its status, but eventually goes into the NotReady state again.

Seth Jennings (comment #2):

This looks like a golang issue:
https://github.com/kubernetes/kubernetes/issues/87615
https://github.com/golang/go/issues/40201

Russell Bryant (comment #3):

(In reply to Seth Jennings from comment #2)
> This looks like a golang issue
> https://github.com/kubernetes/kubernetes/issues/87615
> https://github.com/golang/go/issues/40201

Thanks. I checked the kernel log on a host that was seeing this issue and observed a lot of this in the log, which seems to support this being the same issue as the one you linked.

[84057.832697] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84057.861232] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84057.988240] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84058.010442] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84129.485448] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84129.526182] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84219.127096] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84219.156378] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84350.926777] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84350.963246] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106613.643101] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106613.677715] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106614.409979] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106614.430376] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106614.783140] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106614.824764] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106615.445239] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106615.473290] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[106793.931631] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[106793.968889] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

Follow-up:

(In reply to Russell Bryant from comment #3)
> [quoted kernel log trimmed; see comment #3 above]

... but it turns out that's not actually the network interface in use. The beginning of the "use of closed network connection" errors is no longer in the journal, so I don't think I can pinpoint what happened to trigger it. Hopefully we can catch more logs around the time this starts next time, since this seems easy to reproduce in this lab for some reason.

Created attachment 1712946 [details]
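The flapping pattern in the kernel log above (repeated NETDEV_UP / NETDEV_CHANGE pairs for the same interface) is the signature worth counting when triaging whether a node's NIC is blipping. A hypothetical helper (not part of any OpenShift tooling) that tallies link flaps per interface from dmesg-style lines:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// linkFlapPattern matches kernel lines of the form:
//   [84057.832697] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
var linkFlapPattern = regexp.MustCompile(`ADDRCONF\(NETDEV_UP\): (\S+): link is not ready`)

// countLinkFlaps tallies NETDEV_UP "link is not ready" events per interface,
// a rough proxy for how often each NIC blipped during the captured window.
func countLinkFlaps(log string) map[string]int {
	flaps := make(map[string]int)
	for _, line := range strings.Split(log, "\n") {
		if m := linkFlapPattern.FindStringSubmatch(line); m != nil {
			flaps[m[1]]++
		}
	}
	return flaps
}

func main() {
	sample := `[84057.832697] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[84057.861232] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[84057.988240] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready`
	fmt.Println(countLinkFlaps(sample)["eth0"]) // prints 2
}
```

As comment #3's follow-up notes, a flapping interface is not proof by itself: the flapping NIC turned out not to be the one in use, so the count only narrows the search.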
Hub Kubelet Master-0

Adding a log file from the affected node where we restart the kubelet and it fails again (look for "Unauthorized").
I'm setting this to a high priority/severity since, when this occurs, a node is in a very broken state until the kubelet gets restarted.

If necessary, the environment could be redeployed to reproduce the state and try to fix the bug. As you know, maintaining this environment in its current state (with the error reproduction and so on) will block us in some ways, so we can keep the environment as it is for ~8 more business days; after that date we will repurpose it for another focus. Getting it back into a state that replicates the issue would cost us ~3-5 business days, just to let you know ;)

*** Bug 1872734 has been marked as a duplicate of this bug. ***

*** Bug 1875236 has been marked as a duplicate of this bug. ***

Russell Bryant works on/with the bare-metal team, where NIC resets are more likely to occur than on other platforms. He should be able to recreate the issue.

For testing whether the upstream fix addresses this issue: https://github.com/openshift/kubernetes/pull/398

*** Bug 1873949 has been marked as a duplicate of this bug. ***

*** Bug 1893511 has been marked as a duplicate of this bug. ***

*** Bug 1899109 has been marked as a duplicate of this bug. ***

Hello, this bug is to get the fix backported to OCP 4.5. Could you please confirm the fix release date for OCP 4.5? Thanks in advance.

The BZ for OCP 4.5 is [1], and it's already fixed in OCP 4.5.27.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1907938

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

*** Bug 1934422 has been marked as a duplicate of this bug. ***

*** Bug 1953678 has been marked as a duplicate of this bug. ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.