Bug 1506217
| Summary: | Non hosted engine host (HA enabled) is not getting fenced | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Nirav Dave <ndave> |
| Component: | ovirt-engine | Assignee: | Eli Mesika <emesika> |
| Status: | CLOSED ERRATA | QA Contact: | Artyom <alukiano> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.6 | CC: | apinnick, gveitmic, lsurette, mavital, mgoldboi, mkalinin, mperina, ndave, rbalakri, Rhev-m-bugs, srevivo, ykaul |
| Target Milestone: | ovirt-4.2.2 | Flags: | lsvaty: testing_plan_complete- |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Previously, unresponsive hosts with power management enabled had to be fenced manually. In the current release, the Manager, upon start-up, will automatically attempt to fence the hosts after a configurable period (5 minutes, by default) of inactivity has elapsed. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-15 17:45:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1520424 | | |
| Bug Blocks: | | | |
Description
Nirav Dave
2017-10-25 12:22:02 UTC
Hello,

Please let me know if more logs/details are needed.

Thanks & Regards,
Nirav Dave

Not sure how it's possible, but I'm missing several time intervals in the engine.log (for example, the log records mentioned in the customer case are not present in the provided logs). But from the description I assume the following scenario:

1. Have at least 3 hosts in the cluster:
   host1 - HE VM and other HA VMs are running on it
   host2 - HA VMs running on it
   host3
2. Reboot host1 and after several seconds reboot host2
3. HE VM is restarted on host3
4. Engine detects that host1 is NonResponsive and that it's the host where the HE VM ran previously, so it fences host1 and starts HA VMs on host3
5. Engine detects that host2 is NonResponsive, but as it's not the host where the HE VM ran previously, it's not fenced, because fencing is disabled for 5 minutes after engine startup. Such hosts need to be fenced manually, or they will become available again when the engine can reconnect to them. Of course, during this time HA VMs cannot be restarted on a different host.

The interval disabling fencing during engine startup can be changed using 'engine-config -s DisableFenceAtStartupInSec=NNN', but I don't recommend changing it, as it's one of our fail-safes against a fencing storm.

So if the above is the flow, I think everything is working as expected. If not, please describe the flow you think doesn't work.

Please take a look at the description in Comment 4; this is another use case of the chicken-egg problem introduced by hosted-engine.

Workaround:
===========
As mentioned above, since 3.1 fencing is disabled for a 5-minute interval during engine startup (the interval can be changed by the engine-config option DisableFenceAtStartupInSec), but it's a bit risky to decrease that interval because too low a value may cause a fencing storm.
Fortunately, since 3.6 we have another way to prevent fencing storms: for each cluster we have a Fencing Policy, where we can define that fencing is skipped if the number of Connecting/NonResponsive hosts in the cluster is higher than a specified percentage. The option is named 'Skip fencing on cluster connectivity issues' and it's set to 50% by default.

So here's a workaround that can be tested:

1. Ensure that, for all clusters with HA VMs, fencing is enabled in the Fencing Policy of the cluster and 'Skip fencing on cluster connectivity issues' is also enabled and set to a good value (depends on the number of hosts in the cluster)

2. Decrease the DisableFenceAtStartupInSec value using
     engine-config -s DisableFenceAtStartupInSec=NNN
   where NNN is the number of seconds from engine startup within which fencing is disabled. I'd start with a value of 30 seconds; please try different values if that doesn't work well for your setup.

3. Restart ovirt-engine and try the scenario mentioned in the description of the bug.

Solution:
=========
The definitive solution for this problem is described in BZ1520424.

Moving to MODIFIED to align status with BZ1520424

WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:
[Found non-acked flags: '{'rhevm-4.2-ga': '?'}', ]
For more info please contact: rhv-devops

(In reply to Martin Perina from comment #14)
> Moving to MODIFIED to align status with BZ1520424

Martin, should this bug be considered a downstream clone of this u/s bug?

(In reply to Martin Perina from comment #13)
> Please take a look at description at Comment 4, this is another use case of
> chicken-egg problem introduced by hosted-engine.
>
> Workaround:
> ===========
> As mentioned above since 3.1 fencing is disabled within 5 minutes interval
> during engine startup (interval can be changed by engine-config option
> DisableFenceAtStartupInSec), but it's a bit risky to decrease that interval
> because too low value may cause fencing storm. Fortunately since 3.6 we have
> another way how prevent fencing storms: For each cluster we have a Fencing
> Policy and here we can define to skip fencing if number of
> Connecting/NonResponsive hosts in the cluster is higher than specified %.
> Option is named 'Skip fencing on cluster connectivity issues' and it's set
> to 50% by default.
> So here's a workaround that can be tested:
>
> 1. Ensure that for all cluster with HA VMs that fencing is enabled in
> Fencing Policy of the cluster and 'Skip fencing on cluster connectivity
> issues' is also enabled and set to good value (depends on number of hosts in
> the cluster)
>
> 2. Decrease DisableFenceAtStartupInSec value using
>      engine-config -s DisableFenceAtStartupInSec=NNN
>    where NNN is number of seconds from engine startup within which fencing
> is disabled. I'd start with 30 seconds value and please try different values
> if not working well for your setup.
>
> 3. Restart ovirt-engine and try the scenario mentioned in the description of
> the bug.
>
>
> Solution:
> =========
> Definitive solution for this problem is described in BZ1520424.

Nirav,
Please make sure we kcs this workaround.

(In reply to Marina from comment #16)
> (In reply to Martin Perina from comment #14)
> > Moving to MODIFIED to align status with BZ1520424
>
> Martin, should this bug be considered downstream clone of this u/s bug?

Well, BZ15061217 is an RFE; the issue described in this bug can also be fixed using the above workaround.
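To make the interplay of the two safeguards in the workaround easier to reason about, here is a minimal Python sketch. It is only a toy model built from the wording of the comments above, not the engine's actual code: the names `FencingPolicy` and `fencing_allowed` are hypothetical, the "higher than the specified %" comparison is taken literally from the comment, and the special handling of the host that previously ran the HE VM is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class FencingPolicy:
    """Hypothetical stand-in for a cluster's Fencing Policy settings."""
    fencing_enabled: bool = True
    skip_on_connectivity_issues: bool = True
    threshold_pct: int = 50  # default value mentioned in the comment


def fencing_allowed(uptime_sec, startup_delay_sec, hosts_total,
                    hosts_unreachable, policy):
    """Return True if this toy model would let a NonResponsive host be fenced.

    uptime_sec        -- seconds since the engine started
    startup_delay_sec -- models DisableFenceAtStartupInSec
    hosts_unreachable -- Connecting/NonResponsive hosts in the cluster
    """
    if not policy.fencing_enabled:
        return False
    # Safeguard 1: no fencing during the startup window.
    if uptime_sec < startup_delay_sec:
        return False
    # Safeguard 2: skip fencing when too much of the cluster is unreachable
    # (assumption: strictly "higher than" the configured percentage).
    if policy.skip_on_connectivity_issues and hosts_total > 0:
        unreachable_pct = 100 * hosts_unreachable / hosts_total
        if unreachable_pct > policy.threshold_pct:
            return False
    return True


if __name__ == "__main__":
    default = FencingPolicy()
    # 60 s after engine start, 3-host cluster, 2 hosts down (the bug scenario).
    print(fencing_allowed(60, 300, 3, 2, default))  # default 5-minute window
    print(fencing_allowed(60, 30, 3, 2, default))   # window lowered to 30 s
    print(fencing_allowed(60, 30, 3, 2, FencingPolicy(threshold_pct=70)))
```

Under these assumptions, lowering the startup window alone is not enough for a 3-host cluster with 2 hosts down, because 2 of 3 unreachable is above the default 50%. That is one way to read the advice that the threshold should be set to a good value depending on the number of hosts in the cluster.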
Verified on rhvm-4.2.2-0.1.el7.noarch

INFO: Bug status (VERIFIED) wasn't changed but the following should be fixed:
[No relevant external trackers attached]
For more info please contact: rhv-devops

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

*** Bug 1568267 has been marked as a duplicate of this bug. ***

BZ<2>Jira Resync

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days