+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1720747 +++
======================================================================

Description of problem:

Shortly after a host was rebooted, it went into a "Not Responding" state, then "Connecting", but it never came out of this state. Later on it went "non responsive" and was successfully soft-fenced, but the state never became "Up". It finally transitioned to "Up" when the engine was later restarted.

Version-Release number of selected component (if applicable):

RHV 4.2.8.7
RHEL 7.6 host; vdsm-4.20.48-1

How reproducible:

Not.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

(Originally by Gordon Watson)
sync2jira
Pending a fix on the vdsm side.
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.
To better understand this issue, let's start from the global picture: the oVirt/RHV manager periodically polls all hosts for status information to update its view of each host's status and take the right decisions. On a hosted-engine deployment, the information fetched from the host also includes the status of the hosted-engine HA daemon, so that the engine can show the hosted-engine HA score for each host, its HE maintenance mode and so on. On the technical side, the engine talks over TLS with a daemon called vdsm running on each host; when polled by the engine, vdsm in turn tries to fetch the hosted-engine HA status from the hosted-engine HA daemon via a Unix domain socket. Vdsm correctly detects if the HA daemon is down and its Unix domain socket closed.

But now the issue: hosted-engine hosts exchange information about their status over a kind of whiteboard volume saved on the hosted-engine storage domain. For concurrency reasons, this volume is also protected via sanlock. Each host periodically tries to refresh its own status and read the other hosts' status from that volume. When it faces any kind of storage issue (writing its status or refreshing the other hosts'), ovirt-ha-broker restarts itself to get back into a functional state. This is because ovirt-ha-broker is, by design, an HA-related daemon: its aim is to be as reliable as possible and to react to temporary failures with a self-healing approach.

The issue that causes this bug is that during its restart phase, ovirt-ha-broker can lose requests sent by vdsm over its Unix domain socket, so a query sent from vdsm can eventually get ignored. vdsm is a multi-threaded daemon; if the communication between vdsm and ovirt-ha-broker gets interrupted, that thread gets stuck, so vdsm never sends its status back to the engine, the engine flags the host as not responsive, and we get the reported issue.
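To illustrate the stuck-thread scenario (this is a hypothetical sketch, not the actual vdsm code; the function name and wire format are invented for illustration):

```python
import socket

# Hypothetical sketch of how a vdsm monitoring thread might query
# ovirt-ha-broker over its Unix domain socket. This is NOT the real
# vdsm client code; "get-stats" is an invented request for illustration.
def query_broker(sock, request=b"get-stats\n"):
    sock.sendall(request)
    # No timeout is configured on the socket, so if ovirt-ha-broker
    # restarts after accepting the connection and the request is lost,
    # this recv() blocks forever: the vdsm thread is stuck, the engine
    # never gets a reply, and it flags the host as not responding.
    return sock.recv(4096)
```

In the happy path the broker answers and the call returns immediately; the failure mode described above is precisely that the answer never comes and nothing ever unblocks the `recv()`.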
The workaround applied by the user is simply to restart the engine, which quickly polls the host again with a new request, causing vdsm to retry querying ovirt-ha-broker.

The fix for this bug simply introduces a 5-second timeout in the communication between vdsm and ovirt-ha-broker: if ovirt-ha-broker doesn't reply over the Unix domain socket within 5 seconds, vdsm reports back to the engine that the HA mechanism is down, without any need to manually restart the engine to force a new query.

Please notice that this fix makes the communication between the engine and vdsm more reliable, but it does not solve or address the root cause of the issue. The root cause is a kind of temporary failure on the storage side that causes ovirt-ha-broker to restart multiple times in a row, triggering this issue. This can be caused by a real failure on the storage side or, more probably, since it doesn't affect running VMs, by a latency issue that prevents sanlock from refreshing its lock mechanism quickly enough (please notice that sanlock is a mechanism to ensure safe writes over shared storage, so we expect it to be really picky about latency issues).
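The shape of the fix can be sketched like this (an illustrative sketch, not the literal vdsm patch; names and the wire format are assumptions):

```python
import socket

# Illustrative sketch of the fix described above, not the actual vdsm
# patch: put a 5-second timeout on the broker socket so a lost request
# no longer hangs the monitoring thread.
BROKER_TIMEOUT = 5.0  # seconds, per the fix described in this bug

def query_broker_with_timeout(sock, request=b"get-stats\n",
                              timeout=BROKER_TIMEOUT):
    sock.settimeout(timeout)
    try:
        sock.sendall(request)
        return sock.recv(4096)
    except socket.timeout:
        # Instead of blocking forever, report failure upward: vdsm can
        # then tell the engine the HA mechanism is down, and no manual
        # engine restart is needed to unstick the monitoring thread.
        return None
```

Returning `None` here stands in for whatever error vdsm actually reports to the engine; the key design point is that the timeout converts a permanent hang into a recoverable, reportable failure.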
Verified in vdsm-4.30.38-1.el7ev.x86_64.

I can now see in the vdsm log:

ERROR (periodic/1) [root] failed to retrieve Hosted Engine HA score (api:200)

with a timeout traceback.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:4230