Bug 1657852

Summary: Engine keeps reporting hosts as non responsive even though the communication issue with them was solved
Product: [oVirt] ovirt-engine Reporter: Elad <ebenahar>
Component: BLL.Infra Assignee: Ravi Nori <rnori>
Status: CLOSED DUPLICATE QA Contact: Lukas Svaty <lsvaty>
Severity: high Docs Contact:
Priority: unspecified    
Version: future CC: bugs, mperina, nicolas, rnori
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-12-11 10:41:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs none

Description Elad 2018-12-10 15:34:10 UTC
Created attachment 1513101 [details]
logs

Description of problem:
Due to a power outage in the lab, the engine lost communication with some of the hosts in several data centers. These hosts eventually came back up, but their status remained non responsive for days. Only after the ovirt-engine service was restarted was the status updated to up.


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.7.4-0.1.el7ev.noarch
postgresql-9.2.24-1.el7_5.x86_64
vdsm-4.20.39-1.el7ev.x86_64

How reproducible:
Happened after a power outage. This environment had been upgraded to 4.2.7.

Steps to Reproduce:
1. Power off a host


Actual results:

Connection lost with host tigris04 at 2018-12-07 18:48:24:

2018-12-07 18:48:24,636+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host 'tigris04.scl.lab.tlv.redhat.com', last response arrived 34840 ms ago.


vdsm starts on the host at 2018-12-07 18:53:44 after the power outage:

2018-12-07 18:53:44,488+0200 INFO  (MainThread) [vds] (PID: 3501) I am the actual vdsm 4.20.39-1.el7ev tigris04.scl.lab.tlv.redhat.com (3.10.0-862.11.6.el7.x86_64) (vdsmd:149)



The engine keeps reporting the host as unreachable, and its state remains non responsive until the ovirt-engine restart, which took place ~2 days later:

2018-12-09 11:02:10,592+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM tigris04 command Get Host Capabilities failed: Message timeout which can be caused by communication issues


Only after the ovirt-engine service restart is the host status refreshed:

2018-12-09 11:29:37,820+02 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (ServerService Thread Pool -- 46) [] Initialize vdsBroker 'tigris04.scl.lab.tlv.redhat.com:54321'
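
For triage, the gaps between the events quoted above can be computed directly from the log timestamps. The following is a minimal Python sketch, not part of the original report: the timestamps are copied from the log excerpts, the event labels are hypothetical, and the engine (+02) and vdsm (+0200) logs are assumed to share the same UTC offset, so the offset is dropped.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log excerpts above; the labels are illustrative only.
# Engine (+02) and vdsm (+0200) logs use the same UTC offset, so it is omitted here.
EVENTS = {
    "connection_lost": "2018-12-07 18:48:24,636",
    "vdsm_started": "2018-12-07 18:53:44,488",
    "last_failure_quoted": "2018-12-09 11:02:10,592",
    "engine_restarted": "2018-12-09 11:29:37,820",
}


def parse(ts: str) -> datetime:
    """Parse the 'YYYY-MM-DD HH:MM:SS,mmm' format used by both logs."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


def gap(start: str, end: str) -> timedelta:
    """Return the elapsed time between two labelled events."""
    return parse(EVENTS[end]) - parse(EVENTS[start])


if __name__ == "__main__":
    print("host down before vdsm came back:", gap("connection_lost", "vdsm_started"))
    print("stale status after vdsm was back:", gap("vdsm_started", "engine_restarted"))
```

With these values, the host was out of contact for a little over 5 minutes, while the engine kept reporting it as non responsive for about 1 day and 16.5 hours after vdsm had started again.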

Expected results:
Host status should be up when the host is reachable from the engine and its vdsmd service is running.
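
The reachability half of this expectation can be checked independently of the engine. The following is a minimal Python sketch, assuming that a plain TCP connect to the vdsm port (54321, as shown in the engine log line above) is an acceptable proxy for reachability. It does not perform a TLS handshake or any JSON-RPC call, so it cannot confirm that vdsmd is healthy, and the function name is hypothetical.

```python
import socket


def vdsm_port_reachable(host: str, port: int = 54321, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Host name taken from the log excerpts in this report.
    print(vdsm_port_reachable("tigris04.scl.lab.tlv.redhat.com"))
```

If this returns True from the engine machine while the engine still shows the host as non responsive, the symptom matches the one described in this report.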

Additional info:

Comment 3 Martin Perina 2018-12-11 10:41:29 UTC

*** This bug has been marked as a duplicate of bug 1582379 ***

Comment 4 Nicolas Ecarnot 2019-04-17 09:35:24 UTC
Hello,

On 4.3.2.1-1.el7, I'm facing the same issue:
For no apparent reason, one of the hosts was seen as unreachable, so it was fenced.
When it came back to life, the engine declared it NonResponsive, though it can ping it and SSH into it perfectly well.

I have been applying every update for a long time, and I'm still seeing this issue very regularly.
That's why I'm surprised to see this bug and/or its duplicate closed.

Do you advise me to open a third bug for that?

Comment 5 Ravi Nori 2019-04-17 14:38:04 UTC
Can you please attach the debug logs from when the issue happens?

Comment 6 Red Hat Bugzilla 2023-09-14 04:43:33 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days