Created attachment 1513101
Description of problem:
Due to a power outage in the lab, the engine lost communication with some of the hosts in several data centers. These hosts eventually came back up, but their status remained Non Responsive for days. Only after the ovirt-engine service was restarted did the status update to Up.
Version-Release number of selected component (if applicable):
Happened after a power outage. This environment had been upgraded to 4.2.7.
Steps to Reproduce:
1. Power off a host
Connection lost with host tigris04 at 2018-12-07 18:48:24:
2018-12-07 18:48:24,636+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor)  Connection timeout for host 'tigris04.scl.lab.tlv.redhat.com', last response arrived 34840 ms ago.
vdsm starts on the host at 2018-12-07 18:53:44 after the power outage:
2018-12-07 18:53:44,488+0200 INFO (MainThread) [vds] (PID: 3501) I am the actual vdsm 4.20.39-1.el7ev tigris04.scl.lab.tlv.redhat.com (3.10.0-862.11.6.el7.x86_64) (vdsmd:149)
The engine keeps reporting the host as unreachable, and its state remains Non Responsive until the ovirt-engine restart, which took place ~2 days later:
2018-12-09 11:02:10,592+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-60)  EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM tigris04 command Get Host Capabilities failed: Message timeout which can be caused by communication issues
The host status is refreshed only after the ovirt-engine service is restarted:
2018-12-09 11:29:37,820+02 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (ServerService Thread Pool -- 46)  Initialize vdsBroker 'tigris04.scl.lab.tlv.redhat.com:54321'
The host status should be Up when the host is reachable from the engine and its vdsmd service is running.
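For anyone triaging a similar case, here is a minimal sketch of how basic reachability could be checked from the engine machine. It assumes vdsm listens on TCP port 54321, as in the "Initialize vdsBroker" log line above. The helper name tcp_port_open is hypothetical and not part of oVirt. A successful TCP connect only shows that the port accepts connections. It does not validate the SSL or JSON-RPC layer the engine actually uses.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only. It checks plain TCP
    reachability, not the SSL/JSON-RPC handshake performed by the engine.
    """
    try:
        # create_connection resolves the name and connects; the context
        # manager closes the socket again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


# Example usage (host name taken from the logs above, port assumed):
#   tcp_port_open("tigris04.scl.lab.tlv.redhat.com", 54321)
```

If this returns True while the engine still shows the host as Non Responsive, that would support the observation that the engine side is not refreshing the host status.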
*** This bug has been marked as a duplicate of bug 1582379 ***
On 18.104.22.168-1.el7, I'm facing the same issue:
For no apparent reason, one of the hosts was seen as unreachable, so it was fenced.
When it came back up, the engine declared it NonResponsive, even though the engine can ping it and SSH into it without any problem.
I have been applying every update for a long time, and I still see this issue very regularly.
That's why I'm surprised to see this bug and/or its duplicate closed.
Do you advise me to open a third bug for this?
Can you please attach the debug logs from when the issue happens?