Bug 1975789 - worker nodes rebooted when we simulate a case where the api-server is down
Summary: worker nodes rebooted when we simulate a case where the api-server is down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Poison Pill Operator
Version: 4.8
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Marc Sluiter
QA Contact: Shelly Miron
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-06-24 12:31 UTC by Shelly Miron
Modified: 2021-07-27 23:14 UTC (History)

Fixed In Version: poison-pill-container-v4.8.0-17
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:13:46 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github medik8s poison-pill pull 44 0 None open [bugzilla 1975789] add timeout for api-server requests when checking if other node is healthy 2021-06-27 07:19:21 UTC
Github medik8s poison-pill pull 61 0 None closed [Bug 1975789] fix nodes that rebooted when there was an api-server failure + improved logging 2021-07-05 13:26:36 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:14:18 UTC

Comment 1 Nir 2021-06-24 14:58:15 UTC
Thanks Shelly.
I suspect the problem stems from a missing timeout when accessing the api-server.
When the api-server is down and a node is asked whether another node is healthy, it queries the api-server. That request times out only after 30s, while poison-pill waits only 10 seconds.

We will need to add a timeout to that request and see how it goes.

Comment 2 Shelly Miron 2021-07-04 06:34:41 UTC
It seems the issue still exists:
worker nodes rebooted once the api-server became unreachable from all of them.
The logs of one of them are attached to the bug.

Comment 7 errata-xmlrpc 2021-07-27 23:13:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

