Description of problem:

1) Active-Active DR setup (site A and B)
   - Hosts A1, A2, B1, B2
   - HostedEngine on A1
2) Site A is disconnected from storage/network
3) HostedEngine fails over from site A2 to B1
   - Backend initialization 2018-04-11 ~ 18:43:53
4) Hosts from site A are in NotResponding, with VMs "running"
5) Engine starts fencing those hosts in site A
   - Host A2 fence attempt at 2018-04-11 18:44:14, skipped, never tried again
   - Host A1 fence attempt at 2018-04-11 18:47:05, run and succeeded

Skipping fencing is here:

2018-04-11 18:44:14,392+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-18) [] EVENT_ID: VDS_ALERT_FENCE_OPERATION_SKIPPED(9,003), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Host A2 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

But the host has power management enabled. I found that the code may log misleading messages, and just filed this:
https://bugzilla.redhat.com/show_bug.cgi?id=1568265

Looking at the code, the skip could have been due to DisableFenceAtStartupInSec or PreviousHostedEngine. As this was not a HE host before failover, I assume it was skipped due to DisableFenceAtStartupInSec (just a few seconds after startup), but with incorrect logging.
But here comes the problem: after this skip, A2 goes into a loop like this forever, every few seconds:

2018-04-11 18:55:00,814+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to A2/IP
2018-04-11 18:55:03,819+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] execution failed: java.net.NoRouteToHostException: No route to host
2018-04-11 18:55:03,820+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] Failure to refresh host 'A2' runtime info: java.net.NoRouteToHostException: No route to host

And there is no new attempt to run VdsNotRespondingTreatmentCommand for it.

Version-Release number of selected component (if applicable):
rhevm-4.1.10.3-0.1.el7.noarch

How reproducible:
100% at customer site (3 out of 3 attempts)

Steps to Reproduce:
1. Do DR failover as above

Actual results:
- Host A2 not fenced
- HA VMs with lease disabled not restarted

Expected results:
- Host A2 eventually fenced, as the host was not responding and PM is configured and reachable
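
To make the timing in the description easier to check, here is a minimal Python sketch that computes how long after backend initialization each fence attempt happened. It only uses the timestamps quoted above. The backend initialization time is approximate, and the reference point the engine really uses for DisableFenceAtStartupInSec may differ. The 60-second value and the helper names are hypothetical, chosen only for illustration. They are not the configured setting, the product default, or engine code.

```python
from datetime import datetime

# Timestamps copied from the description; the backend init time is approximate.
BACKEND_INIT = datetime(2018, 4, 11, 18, 43, 53)
FENCE_ATTEMPTS = {
    "A2": datetime(2018, 4, 11, 18, 44, 14),  # skipped
    "A1": datetime(2018, 4, 11, 18, 47, 5),   # ran and succeeded
}

# Hypothetical quiet period, for illustration only (not the real setting).
HYPOTHETICAL_QUIET_PERIOD_SEC = 60


def seconds_since_startup(attempt, startup=BACKEND_INIT):
    """Seconds elapsed between engine startup and a fence attempt."""
    return (attempt - startup).total_seconds()


def in_quiet_period(attempt, quiet_period_sec, startup=BACKEND_INIT):
    """True if the attempt falls inside the assumed startup quiet period."""
    return seconds_since_startup(attempt, startup) < quiet_period_sec


if __name__ == "__main__":
    for host, ts in FENCE_ATTEMPTS.items():
        elapsed = seconds_since_startup(ts)
        inside = in_quiet_period(ts, HYPOTHETICAL_QUIET_PERIOD_SEC)
        print(f"{host}: {elapsed:.0f}s after startup, in quiet period: {inside}")
```

This models only the time window, not the PreviousHostedEngine check or any other condition the engine evaluates before fencing.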
Correction:

> 3) HostedEngine fails over from site A2 to B1

It's from A1 to B1. The host that was not fenced (A2) was not the previous HE one (A1).
This seems similar to the issue discussed in BZ1506217. If so, you have the following options:

1. You can wait for RHV 4.2, where we have enabled functionality which performs fencing of all non-responding hosts after DisableFenceAtStartupInSec passes. For details, please take a look at BZ1520424.

2. If you can't wait for 4.2, you can try to decrease DisableFenceAtStartupInSec using engine-config, as discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1506217#c13
(In reply to Martin Perina from comment #3)
> It seems to me similar to the issue discussed in BZ1506217. If so, then you
> have following options:
>
> 1. You can wait for RHV 4.2 where we have enabled functionality which
> performs fencing of all non-responding hosts after
> DisableFenceAtStartupInSec passes - for details please take a look at
> BZ1520424

If we are 100% sure it is exactly the same as BZ1506217, then I believe we can wait for 4.2.

> 2. If you can't wait for 4.2, you can try to decrease
> DisableFenceAtStartupInSec using engine-config as discussed in
> https://bugzilla.redhat.com/show_bug.cgi?id=1506217#c13

I thought about this, but looking at the logs it would need to be as low as 6s. I'm afraid very low values weren't tested before, and it still wouldn't guarantee anything.
(In reply to Germano Veit Michel from comment #4)
> (In reply to Martin Perina from comment #3)
> > It seems to me similar to the issue discussed in BZ1506217. If so, then you
> > have following options:
> >
> > 1. You can wait for RHV 4.2 where we have enabled functionality which
> > performs fencing of all non-responding hosts after
> > DisableFenceAtStartupInSec passes - for details please take a look at
> > BZ1520424
>
> If we are 100% sure it is exact the same as BZ1506217 then I believe we can
> wait for 4.2.

4.2 has just GA'ed.
Please see if the setup can be upgraded.
(In reply to Yaniv Kaul from comment #5)
> 4.2 has just GA'ed.
> Please see the setup can be upgraded.

Thanks. Andrea (TAM) relayed this request to the customer, so I am switching the needinfo to him.

Andrea, once the customer runs the test, could you please update here?
Andrea, Germano - it's been a month. If there are no updates, please close the BZ.
Closing as a duplicate of BZ1506217. Feel free to reopen if reproduced on RHV 4.2.

*** This bug has been marked as a duplicate of bug 1506217 ***