Bug 1568267 - VdsNotRespondingTreatment skipped forever
Summary: VdsNotRespondingTreatment skipped forever
Keywords:
Status: CLOSED DUPLICATE of bug 1506217
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Martin Perina
QA Contact: Pavel Stehlik
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-04-17 05:28 UTC by Germano Veit Michel
Modified: 2020-08-03 15:36 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-15 15:48:22 UTC
oVirt Team: Infra
Target Upstream Version:
lsvaty: testing_plan_complete-



Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1568265 | None | VERIFIED | Skipped power management operation has misleading logs | 2019-02-20 17:12:18 UTC

Internal Links: 1568265

Description Germano Veit Michel 2018-04-17 05:28:47 UTC
Description of problem:

1) Active-Active DR setup (site A and B)
   - Hosts A1, A2, B1, B2
   - HostedEngine on A1
2) Site A is disconnected from storage/network
3) HostedEngine fails over from site A2 to B1
   - Backend initialization 2018-04-11 ~ 18:43:53
4) Hosts from site A are in NotResponding, with VMs "running"
5) Engine starts fencing those hosts in site A
   - Host A2 fence attempt at 2018-04-11 18:44:14, skipped, never tried again
   - Host A1 fence attempt at 2018-04-11 18:47:05, run and succeeded

Skipping fencing is here:

2018-04-11 18:44:14,392+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-18) [] EVENT_ID: VDS_ALERT_FENCE_OPERATION_SKIPPED(9,003), Correlation ID: null, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Host A2 became non responsive. It has no power management configured. Please check the host status, manually reboot it, and click "Confirm Host Has Been Rebooted"

But the host has power management enabled. I found that the code may log misleading messages and just filed this: https://bugzilla.redhat.com/show_bug.cgi?id=1568265

Looking at the code, the skip could have been due to DisableFenceAtStartupInSec or PreviousHostedEngine. As this was not a HE host before failover, I assume it was skipped due to DisableFenceAtStartupInSec (just a few seconds after startup), but with incorrect logging.
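
For reference, a rough sketch of the timing involved, assuming the skip is a simple comparison of engine uptime against DisableFenceAtStartupInSec. The function names are hypothetical and this is not ovirt-engine code; the window value configured on this setup is not stated in the report, so none is assumed here. The timestamps are the ones listed above, and the backend initialization time is approximate.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def seconds_since_startup(startup: str, attempt: str) -> float:
    """Seconds between backend initialization and a fence attempt."""
    delta = datetime.strptime(attempt, FMT) - datetime.strptime(startup, FMT)
    return delta.total_seconds()


def in_startup_window(elapsed_sec: float, disable_fence_at_startup_sec: int) -> bool:
    """Assumed rule (hypothetical helper): fencing is skipped while the engine
    has been up for less than DisableFenceAtStartupInSec seconds."""
    return elapsed_sec < disable_fence_at_startup_sec


# Timestamps taken from the report; the startup time is approximate (~18:43:53).
STARTUP = "2018-04-11 18:43:53"
a2_elapsed = seconds_since_startup(STARTUP, "2018-04-11 18:44:14")  # A2, skipped
a1_elapsed = seconds_since_startup(STARTUP, "2018-04-11 18:47:05")  # A1, fenced
```

Under this assumption, the A2 attempt happened far closer to startup than the A1 attempt, which is consistent with only A2 being skipped.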

But here comes the problem: after this skip, A2 goes into a loop like this forever, every few seconds:

2018-04-11 18:55:00,814+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] 
  (SSL Stomp Reactor) [] Connecting to A2/IP                                
2018-04-11 18:55:03,819+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] 
  execution failed: java.net.NoRouteToHostException: No route to host  
2018-04-11 18:55:03,820+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring]
  Failure to refresh host 'A2' runtime info: java.net.NoRouteToHostException: No route to host

And there is no new attempt to run VdsNotRespondingTreatmentCommand for it.
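
To check how often a host hit the skip, one could scan engine.log for the audit event shown earlier. This is only a minimal sketch: the regular expression is modeled on the single VDS_ALERT_FENCE_OPERATION_SKIPPED line quoted in this report, the function name is hypothetical, and other engine versions may format the line differently.

```python
import re
from datetime import datetime

# Pattern modeled on the one skipped-fence line quoted in this report.
SKIP_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\S*\s+WARN\s.*"
    r"EVENT_ID: VDS_ALERT_FENCE_OPERATION_SKIPPED\(.*"
    r"Message: Host (?P<host>\S+) became non responsive"
)


def skipped_fence_events(lines):
    """Yield (timestamp, host) for each skipped-fence audit event.

    The timezone suffix (e.g. "+02") is ignored, so timestamps are naive.
    """
    for line in lines:
        match = SKIP_RE.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, match.group("host")
```

Usage would be along the lines of `list(skipped_fence_events(open("engine.log")))`, where the log path is whatever applies on the engine host. A single event for a host that stays in NotResponding afterwards matches the behavior described here.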

Version-Release number of selected component (if applicable):
rhevm-4.1.10.3-0.1.el7.noarch

How reproducible:
100% at customer site (3 out of 3 attempts)

Steps to Reproduce:
1. Do DR failover as above

Actual results:
- Host A2 not fenced
- HA VMs with lease disabled not restarted

Expected results:
- Host A2 eventually fenced, as the host was not responding and PM is configured and reachable

Comment 2 Germano Veit Michel 2018-04-17 05:54:06 UTC
Correction:

> 3) HostedEngine fails over from site A2 to B1
It's from A1 to B1.

The host that was not fenced (A2) was not the previous HE one (A1).

Comment 3 Martin Perina 2018-04-27 14:08:04 UTC
It seems to me similar to the issue discussed in BZ1506217. If so, then you have the following options:

1. You can wait for RHV 4.2 where we have enabled functionality which performs fencing of all non-responding hosts after DisableFenceAtStartupInSec passes - for details please take a look at BZ1520424

2. If you can't wait for 4.2, you can try to decrease DisableFenceAtStartupInSec using engine-config as discussed in https://bugzilla.redhat.com/show_bug.cgi?id=1506217#c13
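
The difference between the two behaviors can be sketched with a toy model. This is not ovirt-engine code: the class and method names are hypothetical, and it only assumes what is stated above, namely that a fence attempt inside the startup window is skipped and that 4.2 fences all non-responding hosts once DisableFenceAtStartupInSec has passed.

```python
from dataclasses import dataclass, field


@dataclass
class StartupFencePolicy:
    """Toy model of the startup fence window (hypothetical, for illustration)."""

    window_sec: int
    fenced: list = field(default_factory=list)

    def on_not_responding(self, host: str, uptime_sec: float) -> str:
        """Handle a host going NotResponding at the given engine uptime."""
        if uptime_sec < self.window_sec:
            # Without a later retry, this host stays unfenced.
            return "skipped"
        self.fenced.append(host)
        return "fenced"

    def on_window_elapsed(self, not_responding) -> list:
        """Assumed 4.2-style step: once the window has passed, fence every
        host that is still not responding and has not been fenced yet."""
        to_fence = sorted(set(not_responding) - set(self.fenced))
        self.fenced.extend(to_fence)
        return to_fence
```

In this model, a host skipped during the window is picked up by `on_window_elapsed`; without that step nothing revisits it.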

Comment 4 Germano Veit Michel 2018-04-29 22:52:22 UTC
(In reply to Martin Perina from comment #3)
> It seems to me similar to the issue discussed in BZ1506217. If so, then you
> have the following options:
> 
> 1. You can wait for RHV 4.2 where we have enabled functionality which
> performs fencing of all non-responding hosts after
> DisableFenceAtStartupInSec passes - for details please take a look at
> BZ1520424

If we are 100% sure it is exactly the same as BZ1506217, then I believe we can wait for 4.2.

> 2. If you can't wait for 4.2, you can try to decrease
> DisableFenceAtStartupInSec using engine-config as discussed in
> https://bugzilla.redhat.com/show_bug.cgi?id=1506217#c13

I thought about this, but looking at the logs it would need to be as low as 6s. I'm afraid very low values weren't tested before, and it still wouldn't guarantee anything.

Comment 5 Yaniv Kaul 2018-05-17 13:43:31 UTC
(In reply to Germano Veit Michel from comment #4)
> (In reply to Martin Perina from comment #3)
> > It seems to me similar to the issue discussed in BZ1506217. If so, then you
> > have the following options:
> > 
> > 1. You can wait for RHV 4.2 where we have enabled functionality which
> > performs fencing of all non-responding hosts after
> > DisableFenceAtStartupInSec passes - for details please take a look at
> > BZ1520424
> 
> If we are 100% sure it is exactly the same as BZ1506217, then I believe we can
> wait for 4.2.

4.2 has just GA'ed.
Please see if the setup can be upgraded.

Comment 6 Germano Veit Michel 2018-05-18 00:09:03 UTC
(In reply to Yaniv Kaul from comment #5)
> 4.2 has just GA'ed.
> Please see if the setup can be upgraded.

Thanks. Andrea (TAM) relayed this request to the customer, switching the needinfo to him.

Andrea, once the customer runs the test, could you please update here?

Comment 7 Doron Fediuck 2018-06-14 09:26:57 UTC
Andrea, Germano: it's been a month.
If there are no updates, please close the BZ.

Comment 10 Martin Perina 2018-06-15 15:48:22 UTC
Closing as a duplicate of BZ1506217; feel free to reopen if reproduced on RHV 4.2.

*** This bug has been marked as a duplicate of bug 1506217 ***

Comment 11 Franta Kust 2019-05-16 13:06:59 UTC
BZ<2>Jira Resync

