Thanks Shelly. I suspect the problem stems from a missing timeout when accessing the api-server. When the api-server is down and a node is asked whether another node is healthy, it queries the api-server; that request times out only after 30s, while poison-pill waits only 10 seconds. We will need to add a timeout to that request and see how it goes.
Seems like the issue still exists: the worker nodes rebooted once the api-server became unreachable from all of them. The logs from one of them are attached to the bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438