Bug 1657852
Summary: | Engine keeps reporting hosts as non responsive even though the communication issue with them was solved | | |
---|---|---|---|
Product: | [oVirt] ovirt-engine | Reporter: | Elad <ebenahar> |
Component: | BLL.Infra | Assignee: | Ravi Nori <rnori> |
Status: | CLOSED DUPLICATE | QA Contact: | Lukas Svaty <lsvaty> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | | |
Version: | future | CC: | bugs, mperina, nicolas, rnori |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2018-12-11 10:41:29 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Attachments: | | | |
*** This bug has been marked as a duplicate of bug 1582379 ***

Hello,

On 4.3.2.1-1.el7, I'm facing the same issue: for no apparent reason, one of the hosts was seen as unreachable and was therefore fenced. When it came back up, the engine declared it NonResponsive, though it can be pinged and reached over SSH without any problem.

I have been applying every update for a long time, and I still see this issue very regularly. That's why I'm surprised to see this bug and/or its duplicate closed. Do you advise me to open a third bug for this?

Can you please attach the debug logs when the issue happens?

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
Created attachment 1513101 [details]
logs

Description of problem:
Due to a power outage in the lab, the engine lost communication with some of the hosts in several data centers. These hosts eventually came back up, but their status remained non responsive for days. Only after the ovirt-engine service was restarted was the status updated to up.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.7.4-0.1.el7ev.noarch
postgresql-9.2.24-1.el7_5.x86_64
vdsm-4.20.39-1.el7ev.x86_64

How reproducible:
Happened after a power outage. This environment was upgraded to 4.2.7.

Steps to Reproduce:
1. Power off a host

Actual results:
Connection with host tigris04 is lost at 2018-12-07 18:48:24:

2018-12-07 18:48:24,636+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connection timeout for host 'tigris04.scl.lab.tlv.redhat.com', last response arrived 34840 ms ago.

vdsm starts on the host at 2018-12-07 18:53:44, after the power outage:

2018-12-07 18:53:44,488+0200 INFO (MainThread) [vds] (PID: 3501) I am the actual vdsm 4.20.39-1.el7ev tigris04.scl.lab.tlv.redhat.com (3.10.0-862.11.6.el7.x86_64) (vdsmd:149)

The engine keeps reporting the host as unreachable, and its state remains non responsive until the ovirt-engine restart, which took place about 2 days later:

2018-12-09 11:02:10,592+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM tigris04 command Get Host Capabilities failed: Message timeout which can be caused by communication issues

Only after the ovirt-engine service restart is the host status refreshed:

2018-12-09 11:29:37,820+02 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (ServerService Thread Pool -- 46) [] Initialize vdsBroker 'tigris04.scl.lab.tlv.redhat.com:54321'

Expected results:
Host status should be up when the host is reachable by the engine and its vdsmd service is running.

Additional info:
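For reference, the timeline in the excerpts above can be checked with a short Python sketch. It is only an illustration: the timestamps are copied from the quoted log lines, while the event labels and the helper names (`parse`, `gaps`) are hypothetical and not part of ovirt-engine or vdsm. It assumes the engine and vdsm logs use the same UTC offset (+02), as the quoted lines suggest, so naive timestamps can be compared directly.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log excerpts quoted in this report.
# Labels are illustrative only; both logs are assumed to share the +02 offset.
EVENTS = [
    ("engine reports connection timeout", "2018-12-07 18:48:24"),
    ("vdsm starts on the host", "2018-12-07 18:53:44"),
    ("engine still reports command failure", "2018-12-09 11:02:10"),
    ("engine restarted, vdsBroker initialized", "2018-12-09 11:29:37"),
]


def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp into a naive datetime."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


def gaps(events):
    """Return (from_label, to_label, timedelta) for each consecutive pair of events."""
    parsed = [(label, parse(ts)) for label, ts in events]
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(parsed, parsed[1:])]


if __name__ == "__main__":
    for start, end, delta in gaps(EVENTS):
        print(f"{start} -> {end}: {delta}")

    # Time during which vdsm was already running but the engine
    # still showed the host as non responsive.
    stale = parse(EVENTS[3][1]) - parse(EVENTS[1][1])
    print(f"host up but reported non responsive for: {stale}")
```

By this calculation, vdsm was back about five minutes after the connection timeout, while the engine kept the host in the non responsive state for well over a day until the service restart.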