Description of problem:
-----------------------
The hosted engine host is rebooted from iDRAC, and the VMs from the hosted engine host are successfully restarted on another hosted engine host. When a non-hosted engine host is rebooted 2-3 seconds after the hosted engine host, the engine fails to fence the non-hosted engine host (HA is enabled on the hosted engine host). It fails with the following error:

"Failed to run Fence script on vds <non hosted engine host name>"

Code snippet from where the error is generated:
----------------------------------------
    /**
     * Only fence the host if the VDS is down, otherwise it might have gone back up until this
     * command was executed. If the VDS is not fenced then don't send an audit log event.
     */
    @Override
    protected void executeCommand() {
        VDS host = getVds();
        if (!previousHostedEngineHost.isPreviousHostId(host.getId())
                && !new FenceValidator().isStartupTimeoutPassed()
                && !host.isInFenceFlow()) {
            log.error("Failed to run Fence script on vds '{}'.", getVdsName());
            alertIfPowerManagementOperationSkipped();
            // If fencing can't be done and the host is the SPM, set storage-pool to non-operational
            if (host.getSpmStatus() != VdsSpmStatus.None) {
                setStoragePoolNonOperational();
            }
            return;
        }
---------------------------------------------

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Release version: rhevm-4.1.6.2-0.1.el7.noarch
Env: Hosted Engine

How reproducible:
-----------------
The environment should have at least one non-hosted engine host and two hosted engine hosts.

Steps to Reproduce:
1) Reboot a hosted engine host.
2) Immediately reboot the non-hosted engine host.

Actual results:
"Failed to run Fence script on vds <non hosted engine host name>"

Expected results:
The host should be fenced.

I will be attaching the logs shortly to this.

Thanks & Regards,
Nirav Dave
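For context, the guard in the snippet above can be sketched as follows. This is a hypothetical simplification for illustration only; the function and parameter names are mine, not the real engine API:

```python
# Hypothetical simplification of the guard in executeCommand() above.
# Parameter names are illustrative, not the real engine API.
def fencing_skipped(is_previous_he_host: bool,
                    startup_timeout_passed: bool,
                    in_fence_flow: bool) -> bool:
    """Fencing is skipped only when none of the three exceptions apply."""
    return (not is_previous_he_host
            and not startup_timeout_passed
            and not in_fence_flow)

# The reported case: a non-HE host goes NonResponsive right after the
# engine restarts, so the startup timeout has not passed yet.
print(fencing_skipped(False, False, False))  # True -> "Failed to run Fence script"

# The previous HE host in the same window is still fenced.
print(fencing_skipped(True, False, False))   # False -> host is fenced
```

This illustrates why only the non-hosted engine host hits the error: being the previous HE host is one of the exceptions that allows fencing even inside the startup window.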
Hello,

Please let me know if more logs/details are needed.

Thanks & Regards,
Nirav Dave
Not sure how it's possible, but I'm missing several time intervals in the engine.log (for example, the log records mentioned in the customer case are not present in the provided logs). But from the description I assume the following scenario:

1. Have at least 3 hosts in the cluster:
     host1 - HE VM and other HA VMs are running on it
     host2 - HA VMs running on it
     host3
2. Reboot host1 and after several seconds reboot host2.
3. HE VM is restarted on host3.
4. Engine detects that host1 is NonResponsive and it's the host where the HE VM ran previously, so it fences host1 and starts its HA VMs on host3.
5. Engine detects that host2 is NonResponsive, but as it's not a host where the HE VM ran previously, it's not fenced, because fencing is disabled during the first 5 minutes after engine startup. Such hosts need to be fenced manually, or they become available again once the engine can reconnect to them. Of course, during this time their HA VMs cannot be restarted on a different host.

The interval disabling fencing during engine startup can be changed using 'engine-config -s DisableFenceAtStartupInSec=NNN', but I don't recommend changing it, as it's one of our fail-safes against fencing storms.

So if the above is the flow, I think everything is working as expected. If not, please describe the flow you think doesn't work.
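The startup grace period described in step 5 can be sketched roughly like this. This is an assumed simplification with illustrative names; the real check lives in FenceValidator.isStartupTimeoutPassed():

```python
# Default grace period: 5 minutes (engine-config option DisableFenceAtStartupInSec).
DISABLE_FENCE_AT_STARTUP_SEC = 300

def startup_timeout_passed(now: float, engine_start: float,
                           grace: int = DISABLE_FENCE_AT_STARTUP_SEC) -> bool:
    """Fencing of ordinary hosts is allowed only after the grace period elapses."""
    return (now - engine_start) >= grace

# host2 went NonResponsive ~30 s after the engine came back up on host3:
print(startup_timeout_passed(now=1030, engine_start=1000))  # False -> fencing skipped
# Well past the 5-minute window:
print(startup_timeout_passed(now=1400, engine_start=1000))  # True  -> fencing allowed
```

So with the default value, any non-HE host that goes down within 5 minutes of engine startup stays unfenced until the window closes or an admin fences it manually.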
Please take a look at the description in Comment 4; this is another use case of the chicken-and-egg problem introduced by hosted-engine.

Workaround:
===========
As mentioned above, since 3.1 fencing is disabled within a 5-minute interval after engine startup (the interval can be changed via the engine-config option DisableFenceAtStartupInSec), but it's a bit risky to decrease that interval because too low a value may cause a fencing storm. Fortunately, since 3.6 we have another way to prevent fencing storms: each cluster has a Fencing Policy, in which we can define to skip fencing if the percentage of Connecting/NonResponsive hosts in the cluster is higher than a specified value. The option is named 'Skip fencing on cluster connectivity issues' and it's set to 50% by default.

So here's a workaround that can be tested:

1. For all clusters with HA VMs, ensure that fencing is enabled in the cluster's Fencing Policy and that 'Skip fencing on cluster connectivity issues' is also enabled and set to a suitable value (depending on the number of hosts in the cluster).

2. Decrease the DisableFenceAtStartupInSec value using
       engine-config -s DisableFenceAtStartupInSec=NNN
   where NNN is the number of seconds from engine startup within which fencing is disabled. I'd start with a value of 30 seconds; please try different values if that doesn't work well for your setup.

3. Restart ovirt-engine and try the scenario mentioned in the description of the bug.

Solution:
=========
The definitive solution for this problem is described in BZ1520424.
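The 'Skip fencing on cluster connectivity issues' check can be sketched roughly like this; this is my own simplification for illustration, and the real policy evaluation in the engine is more involved:

```python
def skip_fencing_on_connectivity_issues(problematic_hosts: int,
                                        total_hosts: int,
                                        threshold_pct: int = 50) -> bool:
    """Skip fencing when the share of Connecting/NonResponsive hosts in the
    cluster exceeds the configured threshold (default 50%), on the assumption
    that a cluster-wide connectivity problem, not host failure, is the cause."""
    return problematic_hosts * 100 / total_hosts > threshold_pct

# 3 of 4 hosts unreachable (75% > 50%): likely a network issue, skip fencing.
print(skip_fencing_on_connectivity_issues(3, 4))  # True
# 1 of 4 hosts unreachable (25% <= 50%): fence it.
print(skip_fencing_on_connectivity_issues(1, 4))  # False
```

This is the fail-safe that makes it reasonably safe to lower DisableFenceAtStartupInSec in step 2: even with a short startup window, a mass outage still won't trigger a fencing storm.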
Moving to MODIFIED to align status with BZ1520424
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ] For more info please contact: rhv-devops
(In reply to Martin Perina from comment #14)
> Moving to MODIFIED to align status with BZ1520424

Martin, should this bug be considered a downstream clone of this u/s bug?
(In reply to Martin Perina from comment #13)
> Workaround:
> ===========
> [...]
> Solution:
> =========
> Definitive solution for this problem is described in BZ1520424.

Nirav,

Please make sure we create a KCS article for this workaround.
(In reply to Marina from comment #16)
> (In reply to Martin Perina from comment #14)
> > Moving to MODIFIED to align status with BZ1520424
>
> Martin, should this bug be considered downstream clone of this u/s bug?

Well, BZ15061217 is an RFE; the issue described in this bug can also be fixed using the above workaround.
Verified on rhvm-4.2.2-0.1.el7.noarch
INFO: Bug status (VERIFIED) wasn't changed but the following should be fixed: [No relevant external trackers attached] For more info please contact: rhv-devops
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488
*** Bug 1568267 has been marked as a duplicate of this bug. ***
BZ<2>Jira Resync
sync2jira
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days