+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1720747 +++
======================================================================

Description of problem:

Shortly after a host was rebooted, it went into a "Not Responding" state, then "Connecting", but it never came out of this state. Later on it went "non responsive" and was successfully soft-fenced, but the state never became "Up". It finally transitioned to "Up" when the engine was later restarted.

Version-Release number of selected component (if applicable):

RHV 4.2.8.7
RHEL 7.6 host; vdsm-4.20.48-1

How reproducible:

Not.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

(Originally by Gordon Watson)
sync2jira
Pending a fix on the vdsm side.
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.
To better understand this issue, let's start from the global picture: the oVirt/RHV manager periodically polls all hosts for status information to update its view of each host's status and take the right decisions. On a hosted-engine deployment, the information fetched from the host also includes the status of the hosted-engine HA daemon, so that the engine can show the hosted-engine HA score for each host, its HE maintenance mode and so on. On the technical side, the engine talks over TLS with a daemon called vdsm running on each host; when polled by the engine, vdsm in turn tries to fetch the hosted-engine HA status from the hosted-engine HA daemon via a Unix domain socket. Vdsm correctly detects if the HA daemon is down and its Unix domain socket closed.

But now the issue: hosted-engine hosts exchange information about their status over a kind of whiteboard volume saved on the hosted-engine storage domain. For concurrency reasons, this volume is also protected via sanlock. Each host periodically tries to refresh its own status and read the other hosts' status from that volume. When it faces any kind of storage issue (writing its status or refreshing the other hosts'), ovirt-ha-broker restarts itself to get back into a functional state. This is because ovirt-ha-broker is, by design, an HA-related daemon: its aim is to be as reliable as possible and to react to temporary failures with a self-healing approach.

The issue that causes this bug is that during its restart phase, ovirt-ha-broker can lose requests sent by vdsm over its Unix domain socket, so a query sent from vdsm can eventually get ignored. vdsm is a multi-threaded daemon; if the communication between vdsm and ovirt-ha-broker gets interrupted, that thread gets stuck, so vdsm never sends its status back to the engine, the engine flags the host as not responsive, and we get the reported issue.
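To illustrate the stuck-thread scenario (this is a hypothetical sketch, not the actual vdsm code; the function name and wire format are invented for illustration):

```python
import socket

# Hypothetical sketch of how a vdsm monitoring thread might query
# ovirt-ha-broker over its Unix domain socket. This is NOT the real
# vdsm client code; "get-stats" is an invented request for illustration.
def query_broker(sock, request=b"get-stats\n"):
    sock.sendall(request)
    # No timeout is configured on the socket, so if ovirt-ha-broker
    # restarts after accepting the connection and the request is lost,
    # this recv() blocks forever: the vdsm thread is stuck, the engine
    # never gets a reply, and it flags the host as not responding.
    return sock.recv(4096)
```

In the happy path the broker answers and the call returns immediately; the failure mode described above is precisely that the answer never comes and nothing ever unblocks the `recv()`.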
The workaround applied by the user is simply to restart the engine, which quickly polls the host again with a new request, causing vdsm to retry querying ovirt-ha-broker.

The fix for this bug simply introduces a 5-second timeout in the communication between vdsm and ovirt-ha-broker: if ovirt-ha-broker doesn't reply over the Unix domain socket within 5 seconds, vdsm reports back to the engine that the HA mechanism is down, without any need to manually restart the engine to force a new query.

Please notice that this fix makes the communication between the engine and vdsm more reliable, but it does not solve or address the root cause of the issue. The root cause is a kind of temporary failure on the storage side that causes ovirt-ha-broker to restart multiple times in a row, triggering this issue. This can be caused by a real failure on the storage side or, more probably, since it doesn't affect running VMs, by a latency issue that prevents sanlock from refreshing its lock mechanism quickly enough (please notice that sanlock is a mechanism to ensure safe writes over shared storage, so we expect it to be really picky about latency issues).
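The shape of the fix can be sketched like this (an illustrative sketch, not the literal vdsm patch; names and the wire format are assumptions):

```python
import socket

# Illustrative sketch of the fix described above, not the actual vdsm
# patch: put a 5-second timeout on the broker socket so a lost request
# no longer hangs the monitoring thread.
BROKER_TIMEOUT = 5.0  # seconds, per the fix described in this bug

def query_broker_with_timeout(sock, request=b"get-stats\n",
                              timeout=BROKER_TIMEOUT):
    sock.settimeout(timeout)
    try:
        sock.sendall(request)
        return sock.recv(4096)
    except socket.timeout:
        # Instead of blocking forever, report failure upward: vdsm can
        # then tell the engine the HA mechanism is down, and no manual
        # engine restart is needed to unstick the monitoring thread.
        return None
```

Returning `None` here stands in for whatever error vdsm actually reports to the engine; the key design point is that the timeout converts a permanent hang into a recoverable, reportable failure.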
Verified in vdsm-4.30.38-1.el7ev.x86_64.

I can now see in the vdsm log:

ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA score (api:200)

with a timeout traceback.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:4230