Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1745715

Summary: [downstream clone - 4.3.7] Host in "Not Responding" and "Connecting" state until engine restarted
Product: Red Hat Enterprise Virtualization Manager
Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: vdsm
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED ERRATA
QA Contact: Petr Kubica <pkubica>
Severity: high
Docs Contact:
Priority: high
Version: 4.2.8
CC: cnagarka, dfediuck, fdelorey, joboyer, lleistne, lsurette, michal.skrivanek, mperina, mzamazal, nsoffer, pelauter, pkliczew, Rhev-m-bugs, rhodain, srevivo, stirabos, tnisan, ycui, yoliynyk
Target Milestone: ovirt-4.3.7
Keywords: ZStream
Target Release: ---
Flags: pmatyas: testing_plan_complete+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
On a self-hosted Red Hat Virtualization Manager, ovirt-ha-broker restarts if it encounters an issue reading from or writing to storage. Previously, in some cases, this caused VDSM to get stuck and report the status of the host as "Not Responding" and "Connecting" until the user restarted the Manager, which was the only workaround. The current release mitigates this issue by adding a 5-second timeout, after which VDSM reports to the Manager that the HA daemon is down.
Story Points: ---
Clone Of: 1720747
Environment:
Last Closed: 2019-12-12 10:36:52 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1771557

Description RHV bug bot 2019-08-26 16:59:50 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1720747 +++
======================================================================

Description of problem:

Shortly after a host was rebooted, it went into a "Not Responding" state, then "Connecting", but it never came out of this state. Later on it went "non responsive" and was successfully soft-fenced, but the state never became "Up". It finally transitioned to "Up" when the engine was later restarted.


Version-Release number of selected component (if applicable):

RHV 4.2.8.7
RHEL 7.6 host;
  vdsm-4.20.48-1


How reproducible:

Not.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

(Originally by Gordon Watson)

Comment 28 Daniel Gur 2019-08-28 13:13:56 UTC
sync2jira

Comment 29 Daniel Gur 2019-08-28 13:18:10 UTC
sync2jira

Comment 30 Sandro Bonazzola 2019-09-04 07:26:20 UTC
pending a fix on vdsm side

Comment 31 RHEL Program Management 2019-10-15 09:28:38 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 34 Simone Tiraboschi 2019-10-25 14:03:58 UTC
To better understand this issue, let's start from the global picture:
the oVirt/RHV Manager periodically polls all the hosts for status information to update its view of each host's state and take the right decisions.

On a hosted-engine deployment, the information fetched from the host also includes the status of the hosted-engine HA daemon, so that the engine can show the hosted-engine HA score for each host, its HE maintenance mode, and so on.

On the technical side, the engine talks over TLS with a daemon called vdsm running on each host; when polled by the engine, vdsm also tries to fetch the hosted-engine HA status from the hosted-engine HA daemon via a Unix domain socket.
Vdsm correctly detects the case where the HA daemon is down and its Unix domain socket is closed.
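
To make that query path more concrete, here is a minimal sketch of how a poller can ask the broker for its status over a Unix domain socket; the socket path and the "get-stats" request format are assumptions made purely for illustration, not vdsm's or ovirt-ha-broker's actual protocol:

import socket

# Hypothetical broker socket path and request format, for illustration only.
BROKER_SOCKET = "/var/run/ovirt-hosted-engine-ha/broker.socket"

def query_ha_status():
    # Ask the HA broker for its stats over a Unix domain socket.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(BROKER_SOCKET)
        sock.sendall(b"get-stats\n")  # assumed request format
        # No timeout is set, so recv() blocks until the broker answers;
        # if the broker restarts and drops the request, this blocks forever.
        return sock.recv(4096).decode()
    finally:
        sock.close()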

But now the issue: hosted-engine hosts exchange information about their status over a kind of whiteboard volume saved on the hosted-engine storage domain.
For concurrency reasons, this volume is also protected via sanlock.
Each host periodically tries to refresh its own status and read the other hosts' status from that volume.
When it faces any kind of storage issue (writing its own status or refreshing the other hosts'), ovirt-ha-broker tries to restart itself to get back into a functional state. This is because ovirt-ha-broker is, by design, an HA-related daemon, so its aim is to be as reliable as possible and to react to temporary failures with a self-healing approach.

The issue causing this bug is that during its restart phase, ovirt-ha-broker can lose a request sent by vdsm over its Unix domain socket, so a query sent by vdsm can end up being ignored.
vdsm is a multi-threaded daemon: if the communication between vdsm and ovirt-ha-broker gets interrupted, that thread gets stuck, so vdsm never sends its status back to the engine, the engine flags the host as not responsive, and we get the reported issue.
The workaround applied by the user is simply to restart the engine so that it quickly polls the host again with a new request, causing vdsm to retry querying ovirt-ha-broker.

The fix for this bug simply introduces a 5-second timeout in the communication between vdsm and ovirt-ha-broker, so that if ovirt-ha-broker doesn't reply over the Unix domain socket within 5 seconds, vdsm reports back to the engine that the HA mechanism is down, without any need to manually restart the engine to force a new query.
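
As a rough illustration of the behaviour the fix introduces (not the actual vdsm patch), here is the same kind of query bounded by a 5-second socket timeout, falling back to reporting the HA status as unavailable when the broker doesn't answer; again, the socket path and request format are only assumptions:

import logging
import socket

BROKER_SOCKET = "/var/run/ovirt-hosted-engine-ha/broker.socket"  # assumed path
BROKER_TIMEOUT = 5.0  # seconds

def query_ha_status_with_timeout():
    # Same query, but bounded: give up after 5 seconds instead of hanging.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(BROKER_TIMEOUT)  # applies to connect(), sendall() and recv()
    try:
        sock.connect(BROKER_SOCKET)
        sock.sendall(b"get-stats\n")  # assumed request format
        return sock.recv(4096).decode()
    except socket.timeout:
        # Instead of blocking the polling thread, report the HA daemon as down.
        logging.error("failed to retrieve Hosted Engine HA score: broker timed out")
        return None
    finally:
        sock.close()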

Please notice that this fix makes the communication between the engine and vdsm more reliable, but it does not solve or address the root cause of the issue.
The root cause is a temporary failure on the storage side that causes ovirt-ha-broker to restart multiple times in a row. This can be caused by a real failure on the storage side or, more probably since it doesn't affect running VMs, by a latency issue that prevents sanlock from refreshing its lock quickly enough (please notice that sanlock is a mechanism to ensure safe writes over shared storage, so we expect it to be really picky about latency issues).

Comment 36 Petr Kubica 2019-11-26 09:03:07 UTC
Verified in vdsm-4.30.38-1.el7ev.x86_64

I can now see in the vdsm log:
ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA score (api:200) with timeout traceback.

Comment 39 errata-xmlrpc 2019-12-12 10:36:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:4230