Bug 1506217

Summary:	Non hosted engine host (HA enabled) is not getting fenced
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	Nirav Dave <ndave>
Component:	ovirt-engine	Assignee:	Eli Mesika <emesika>
Status:	CLOSED ERRATA	QA Contact:	Artyom <alukiano>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.1.6	CC:	apinnick, gveitmic, lsurette, mavital, mgoldboi, mkalinin, mperina, ndave, rbalakri, Rhev-m-bugs, srevivo, ykaul
Target Milestone:	ovirt-4.2.2	Flags:	lsvaty: testing_plan_complete-
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Enhancement
Doc Text:	Previously, unresponsive hosts with power management enabled had to be fenced manually. In the current release, the Manager, upon start-up, will automatically attempt to fence the hosts after a configurable period (5 minutes, by default) of inactivity has elapsed.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-05-15 17:45:44 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Infra	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1520424
Bug Blocks:

Description Nirav Dave 2017-10-25 12:22:02 UTC

Description of problem:

The hosted engine host is rebooted from iDRAC, and its VMs are successfully restarted on another hosted engine host. Within 2-3 seconds of that reboot, the non-hosted engine host is rebooted as well, and the engine fails to fence the non-hosted engine host. HA is enabled on the hosted engine host.


It fails with the following error:

"Failed to run Fence script on vds <non hosted engine host name>" 

Code snippet where the error is generated:
----------------------------------------
    /**
     * Only fence the host if the VDS is down, otherwise it might have gone back up until this command was executed. If
     * the VDS is not fenced then don't send an audit log event.
     */
    @Override
    protected void executeCommand() {
        VDS host = getVds();
        if (!previousHostedEngineHost.isPreviousHostId(host.getId())
                && !new FenceValidator().isStartupTimeoutPassed()
                && !host.isInFenceFlow()) {
            log.error("Failed to run Fence script on vds '{}'.", getVdsName());
            alertIfPowerManagementOperationSkipped();
            // If fencing can't be done and the host is the SPM, set storage-pool to non-operational
            if (host.getSpmStatus() != VdsSpmStatus.None) {
                setStoragePoolNonOperational();
            }
            return;
        }

---------------------------------------------
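
Read in isolation, the guard above skips fencing only when all three conditions hold: the host is not the previous hosted engine host, the startup timeout has not yet passed, and the host is not already in a fence flow. A minimal standalone sketch of that predicate follows. The class, method and parameter names are hypothetical and merely mirror the snippet; this is not the engine's API.

```java
// Hypothetical standalone model of the guard in executeCommand(); not engine code.
final class FenceGuardSketch {

    /**
     * Returns true when fencing would be skipped, mirroring the three negated
     * conditions joined with && in the snippet above.
     */
    static boolean fencingSkipped(boolean isPreviousHostedEngineHost,
                                  boolean startupTimeoutPassed,
                                  boolean inFenceFlow) {
        return !isPreviousHostedEngineHost && !startupTimeoutPassed && !inFenceFlow;
    }

    public static void main(String[] args) {
        // Non-hosted engine host, shortly after engine startup.
        System.out.println(fencingSkipped(false, false, false));
        // Previous hosted engine host, shortly after engine startup.
        System.out.println(fencingSkipped(true, false, false));
        // Non-hosted engine host once the startup timeout has passed.
        System.out.println(fencingSkipped(false, true, false));
    }
}
```

Under this reading, only the first case (the one in this report) ends in the "Failed to run Fence script" branch.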


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Release version : rhevm-4.1.6.2-0.1.el7.noarch

Env: Hosted Engine

How reproducible:
-----------------

The environment should have at least one non-hosted engine host and two hosted engine hosts.

1) Reboot hosted engine host.
2) Immediately reboot non-hosted engine host.



Steps to Reproduce:
See "How reproducible" above.

Actual results:
"Failed to run Fence script on vds <non hosted engine host name>" 

Expected results:
Host should be fenced

I will be attaching the logs shortly to this.

Thanks & Regards,
Nirav Dave

Comment 3 Nirav Dave 2017-10-26 09:59:57 UTC

Hello,

Please let me know if more logs/details are needed.

Thanks & Regards,
Nirav Dave

Comment 4 Martin Perina 2017-10-27 14:16:14 UTC

Not sure how it's possible, but I'm missing several time intervals in the engine.log (for example, the log records mentioned in the customer case are not present in the provided logs).

But from the description I assume the following scenario:

1. Have at least 3 hosts in the cluster:
    host1 - HE VM and other HA VMs are running on it
    host2 - HA VMs running on it
    host3

2. Reboot host1 and after several seconds reboot host2

3. HE VM is restarted on host3

4. Engine detects that host1 is NonResponsive and that it's the host where the HE VM ran previously, so it will fence host1 and start HA VMs on host3

5. Engine detects that host2 is NonResponsive, but as it's not a host where the HE VM ran previously, it's not fenced, because fencing is disabled for 5 minutes after engine startup. Such hosts need to be fenced manually, or they will become available again when the engine can reconnect to them. Of course, during this time HA VMs cannot be restarted on a different host.

The interval disabling fencing during engine startup can be changed using 'engine-config -s DisableFenceAtStartupInSec=NNN', but I don't recommend changing it as it's one of our fail-safes against a fencing storm.

So if the above is the flow, I think everything is working as expected. If not, please describe the flow you think doesn't work.
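
The decision in steps 4 and 5 can be sketched as a small time-window check. This is only an illustrative model under stated assumptions: the class and method names are hypothetical, the 60-second detection time is made up, the in-fence-flow condition is left out, and treating the window as passed once the elapsed time reaches it is an assumption about the boundary.

```java
import java.time.Duration;

// Hypothetical model of the startup fencing window from the scenario above; not engine code.
final class StartupFenceWindowSketch {

    // 5 minutes, the default interval mentioned above (DisableFenceAtStartupInSec).
    static final Duration DEFAULT_WINDOW = Duration.ofSeconds(300);

    /**
     * Returns true when the engine would fence the host itself.
     * Assumption: the window counts as passed once the elapsed time reaches it.
     */
    static boolean fencedByEngine(boolean previouslyRanHostedEngineVm,
                                  Duration sinceEngineStartup,
                                  Duration window) {
        boolean windowPassed = sinceEngineStartup.compareTo(window) >= 0;
        return previouslyRanHostedEngineVm || windowPassed;
    }

    public static void main(String[] args) {
        // Hypothetical: both hosts are found NonResponsive 60 seconds after engine startup.
        Duration detectedAt = Duration.ofSeconds(60);
        System.out.println("host1 fenced: " + fencedByEngine(true, detectedAt, DEFAULT_WINDOW));
        System.out.println("host2 fenced: " + fencedByEngine(false, detectedAt, DEFAULT_WINDOW));
        // The same host2 with the window lowered to 30 seconds.
        System.out.println("host2 fenced (30s window): "
                + fencedByEngine(false, detectedAt, Duration.ofSeconds(30)));
    }
}
```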

Comment 13 Martin Perina 2017-12-04 13:34:43 UTC

Please take a look at the description in Comment 4; this is another case of the chicken-and-egg problem introduced by hosted-engine.


Workaround:
===========
As mentioned above, since 3.1 fencing has been disabled for a 5-minute interval after engine startup (the interval can be changed with the engine-config option DisableFenceAtStartupInSec), but decreasing that interval is a bit risky because too low a value may cause a fencing storm. Fortunately, since 3.6 we have another way to prevent fencing storms: each cluster has a Fencing Policy in which we can define that fencing is skipped if the number of Connecting/NonResponsive hosts in the cluster is higher than a specified percentage. The option is named 'Skip fencing on cluster connectivity issues' and it's set to 50% by default.
So here's a workaround that can be tested:

1. Ensure that, for every cluster with HA VMs, fencing is enabled in the cluster's Fencing Policy and 'Skip fencing on cluster connectivity issues' is also enabled and set to a good value (which depends on the number of hosts in the cluster).

2. Decrease the DisableFenceAtStartupInSec value using
     engine-config -s DisableFenceAtStartupInSec=NNN
   where NNN is the number of seconds from engine startup within which fencing is disabled. I'd start with a value of 30 seconds; please try different values if that doesn't work well for your setup.

3. Restart ovirt-engine and try the scenario mentioned in the description of the bug.
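
To judge what a "good value" for 'Skip fencing on cluster connectivity issues' might be, the threshold check can be sketched as below. This is a rough model, not engine code: the class and method names are hypothetical, and the strict "higher than" comparison is an assumption taken from the wording above.

```java
// Hypothetical model of the 'Skip fencing on cluster connectivity issues' check; not engine code.
final class FencingPolicySketch {

    /**
     * Returns true when fencing would be skipped because the share of
     * Connecting/NonResponsive hosts is higher than the threshold.
     * Assumption: the comparison is strict ("higher than").
     */
    static boolean skipFencing(int unreachableHosts, int totalHosts, int thresholdPercent) {
        if (totalHosts <= 0 || unreachableHosts < 0 || unreachableHosts > totalHosts) {
            throw new IllegalArgumentException("invalid host counts");
        }
        // Integer form of unreachableHosts / totalHosts > thresholdPercent / 100.
        return unreachableHosts * 100 > thresholdPercent * totalHosts;
    }

    public static void main(String[] args) {
        // Three-host cluster with two hosts unreachable, default 50% threshold.
        System.out.println(skipFencing(2, 3, 50));
        // Same cluster with the threshold raised to 75%.
        System.out.println(skipFencing(2, 3, 75));
    }
}
```

Under these assumptions, a three-host cluster with two unreachable hosts exceeds the default 50% (200 > 150), so fencing would be skipped; this is why the value has to fit the number of hosts in the cluster.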


Solution:
=========
The definitive solution for this problem is described in BZ1520424.

Comment 14 Martin Perina 2018-02-14 20:36:14 UTC

Moving to MODIFIED to align status with BZ1520424

Comment 15 RHV bug bot 2018-02-16 16:24:52 UTC

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ]

For more info please contact: rhv-devops

Comment 16 Marina Kalinin 2018-02-19 21:18:24 UTC

(In reply to Martin Perina from comment #14)
> Moving to MODIFIED to align status with BZ1520424

Martin, should this bug be considered downstream clone of this u/s bug?

Comment 17 Marina Kalinin 2018-02-19 21:19:43 UTC

(In reply to Martin Perina from comment #13)
> Please take a look at the description in Comment 4; this is another case of
> the chicken-and-egg problem introduced by hosted-engine.
> 
> 
> Workaround:
> ===========
> As mentioned above, since 3.1 fencing has been disabled for a 5-minute
> interval after engine startup (the interval can be changed with the
> engine-config option DisableFenceAtStartupInSec), but decreasing that
> interval is a bit risky because too low a value may cause a fencing storm.
> Fortunately, since 3.6 we have another way to prevent fencing storms: each
> cluster has a Fencing Policy in which we can define that fencing is skipped
> if the number of Connecting/NonResponsive hosts in the cluster is higher
> than a specified percentage. The option is named 'Skip fencing on cluster
> connectivity issues' and it's set to 50% by default.
> So here's a workaround that can be tested:
> 
> 1. Ensure that, for every cluster with HA VMs, fencing is enabled in the
> cluster's Fencing Policy and 'Skip fencing on cluster connectivity issues'
> is also enabled and set to a good value (which depends on the number of
> hosts in the cluster).
> 
> 2. Decrease the DisableFenceAtStartupInSec value using
>      engine-config -s DisableFenceAtStartupInSec=NNN
>    where NNN is the number of seconds from engine startup within which
> fencing is disabled. I'd start with a value of 30 seconds; please try
> different values if that doesn't work well for your setup.
> 
> 3. Restart ovirt-engine and try the scenario mentioned in the description of
> the bug.
> 
> 
> Solution:
> =========
> The definitive solution for this problem is described in BZ1520424.

Nirav,
Please make sure we KCS this workaround.

Comment 18 Martin Perina 2018-02-20 07:30:06 UTC

(In reply to Marina from comment #16)
> (In reply to Martin Perina from comment #14)
> > Moving to MODIFIED to align status with BZ1520424
> 
> Martin, should this bug be considered downstream clone of this u/s bug?

Well, BZ15061217 is an RFE; the issue described in this bug can also be fixed using the above workaround.

Comment 20 Artyom 2018-02-20 12:12:42 UTC

Verified on rhvm-4.2.2-0.1.el7.noarch

Comment 21 RHV bug bot 2018-03-16 15:03:22 UTC

INFO: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[No relevant external trackers attached]

For more info please contact: rhv-devops

Comment 25 errata-xmlrpc 2018-05-15 17:45:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 26 Martin Perina 2018-06-15 15:48:22 UTC

*** Bug 1568267 has been marked as a duplicate of this bug. ***

Comment 27 Franta Kust 2019-05-16 13:05:19 UTC

BZ<2>Jira Resync

Comment 28 Daniel Gur 2019-08-28 13:12:47 UTC

sync2jira

Comment 29 Daniel Gur 2019-08-28 13:17:00 UTC

sync2jira

Comment 32 Red Hat Bugzilla 2023-09-15 00:04:45 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days