Bug 1657852 - Engine keeps reporting hosts as non responsive even though the communication issue with them was solved [NEEDINFO]
Keywords:
Status: CLOSED DUPLICATE of bug 1582379
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: future
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ravi Nori
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-12-10 15:34 UTC by Elad
Modified: 2019-04-17 14:38 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-11 10:41:29 UTC
oVirt Team: Infra
nicolas: needinfo? (rnori)
rnori: needinfo? (nicolas)


Attachments
logs (5.46 MB, application/gzip)
2018-12-10 15:34 UTC, Elad

Description Elad 2018-12-10 15:34:10 UTC
Created attachment 1513101 [details]
logs

Description of problem:
Due to a power outage in the lab, the engine lost communication with some of the hosts in several data centers. These hosts eventually came back up, but their status remained non responsive for days. Only after the ovirt-engine service was restarted was the status updated to up.


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.7.4-0.1.el7ev.noarch
postgresql-9.2.24-1.el7_5.x86_64
vdsm-4.20.39-1.el7ev.x86_64

How reproducible:
Happened after a power outage. This environment had been upgraded to 4.2.7.

Steps to Reproduce:
1. Power off a host


Actual results:

Connection lost with host tigris04 at 2018-12-07 18:48:24:

2018-12-07 18:48:24,636+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host 'tigris04.scl.lab.tlv.redhat.com', last response arrived 34840 ms ago.


vdsm starts on the host at 2018-12-07 18:53:44 after the power outage:

2018-12-07 18:53:44,488+0200 INFO  (MainThread) [vds] (PID: 3501) I am the actual vdsm 4.20.39-1.el7ev tigris04.scl.lab.tlv.redhat.com (3.10.0-862.11.6.el7.x86_64) (vdsmd:149)



The engine keeps reporting the host as unreachable, and its state remains non responsive until the ovirt-engine restart, which took place ~2 days later:

2018-12-09 11:02:10,592+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM tigris04 command Get Host Capabilities failed: Message timeout which can be caused by communication issues


Only after the ovirt-engine service is restarted is the host status refreshed:

2018-12-09 11:29:37,820+02 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (ServerService Thread Pool -- 46) [] Initialize vdsBroker 'tigris04.scl.lab.tlv.redhat.com:54321'
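
To make the timeline above easier to follow, here is a minimal Python sketch that computes the intervals between the three quoted events. It is only an illustration: the helper name parse_ts and the variable names are hypothetical, and the timestamps are copied from the log lines in this description (engine.log writes the offset as "+02", vdsm.log as "+0200", so the short form is padded before parsing).

```python
import re
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse a timestamp in the form used by the log lines quoted above."""
    # engine.log lines end the offset with "+02"; pad it to "+0200" so that
    # strptime's %z accepts it on all Python 3 versions.
    if re.search(r"[+-]\d{2}$", ts):
        ts += "00"
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f%z")


# Timestamps copied from the log excerpts in this report.
connection_lost = parse_ts("2018-12-07 18:48:24,636+02")   # engine: connection timeout
vdsm_started = parse_ts("2018-12-07 18:53:44,488+0200")    # host: vdsm starts
engine_restarted = parse_ts("2018-12-09 11:29:37,820+02")  # engine: Initialize vdsBroker

print("host down before vdsm came back:", vdsm_started - connection_lost)
print("vdsm up but still non responsive:", engine_restarted - vdsm_started)
```

With these values, vdsm was back roughly five minutes after the connection timeout, while the engine kept the host non responsive for roughly 40 more hours.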

Expected results:
Host status should be up when the host is reachable from the engine and its vdsmd service is running.
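
As a rough way to confirm the "reachable" half of this expectation from the engine machine, the following Python sketch tries a plain TCP connection to the vdsm port. The function name is_port_open is hypothetical, and port 54321 is taken from the "Initialize vdsBroker" log line above. A successful connect only shows that something is listening on that port; it does not perform the SSL handshake or prove that vdsm answers engine requests.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, is_port_open("tigris04.scl.lab.tlv.redhat.com", 54321) could be run on the engine host while the host is shown as non responsive.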

Additional info:

Comment 3 Martin Perina 2018-12-11 10:41:29 UTC

*** This bug has been marked as a duplicate of bug 1582379 ***

Comment 4 Nicolas Ecarnot 2019-04-17 09:35:24 UTC
Hello,

On 4.3.2.1-1.el7, I'm facing the same issue:
For no reason, one of the hosts was seen as unreachable, so it was fenced.
When it came back to life, the engine declared it NonResponsive, though it can perfectly ping it and SSH into it.

I have been applying every update for a long time, and I'm still seeing this issue very regularly.
That's why I'm surprised to see this bug and/or its duplicate closed.

Do you advise me to open a third bug for that?

Comment 5 Ravi Nori 2019-04-17 14:38:04 UTC
Can you please attach the debug logs from when the issue happens?

