Bug 1520424

Summary: [RFE] Fence hosts which became NonResponsive right after engine startup
Product: [oVirt] ovirt-engine
Reporter: Martin Perina <mperina>
Component: BLL.Infra
Assignee: Eli Mesika <emesika>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: ---
CC: bgraveno, bugs, kshukla, lveyde, mgoldboi, mkalinin, mperina
Target Milestone: ovirt-4.2.2
Keywords: FutureFeature
Target Release: 4.2.2
Flags: mperina: ovirt-4.2?
       pmatyas: testing_plan_complete-
       mperina: planning_ack?
       mperina: devel_ack+
       pstehlik: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ovirt-engine-4.2.2
Doc Type: Enhancement
Doc Text:
After starting up, the Manager will automatically attempt to fence unresponsive hosts that have power management enabled after the configurable quiet time (5 minutes by default) has elapsed. Previously the user needed to fence them manually.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-29 11:16:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1506217, 1549899    

Description Martin Perina 2017-12-04 13:13:50 UTC
Fencing is disabled for 5 minutes after engine startup (the interval can be changed using the engine-config option DisableFenceAtStartupInSec). If a host becomes NonResponsive during that interval, it will not be fenced automatically: administrators must fence it manually (an audit log error message is displayed for that), or the host needs to become responsive again by itself.
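
For illustration only, the quiet-interval check described above can be sketched as follows. This is not the engine's code: the function name `fencing_allowed` and the constant are hypothetical, and the 300-second value simply mirrors the 5-minute default of DisableFenceAtStartupInSec.

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for the DisableFenceAtStartupInSec value (300 s = 5 minutes).
DISABLE_FENCE_AT_STARTUP_SEC = 300


def fencing_allowed(engine_start: datetime, now: datetime,
                    quiet_sec: int = DISABLE_FENCE_AT_STARTUP_SEC) -> bool:
    """Return True once the startup quiet interval has fully elapsed."""
    return now - engine_start >= timedelta(seconds=quiet_sec)


start = datetime(2017, 12, 4, 13, 0, 0)
print(fencing_allowed(start, start + timedelta(minutes=2)))  # False: still in the quiet interval
print(fencing_allowed(start, start + timedelta(minutes=6)))  # True: interval is over
```

A host that goes NonResponsive while this check returns False is the case this RFE is about: nothing fences it automatically.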

The DisableFenceAtStartupInSec option has existed since 3.1 to prevent fencing storms after a whole data center outage: hosts usually take much longer to boot than the engine, so we need to give them time to recover and not fence them while they are booting up.

Unfortunately, this option doesn't work well with hosted engine, especially in the scenario described in [1].

To solve this issue, we will schedule a job that starts once the DisableFenceAtStartupInSec interval is over and executes fencing on all NonResponsive hosts.
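
A minimal sketch of the proposed job, assuming a simple in-memory host list. The `Host` class, `fence_non_responsive`, `schedule_startup_fencing`, and the `fence` callback are hypothetical names for illustration, not the engine's API. The power-management check follows the Doc Text above.

```python
import threading
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    status: str        # e.g. "Up" or "NonResponsive"
    pm_enabled: bool   # whether power management is configured


def fence_non_responsive(hosts, fence):
    """Call fence(host) for every NonResponsive host with power management enabled."""
    fenced = []
    for host in hosts:
        if host.status == "NonResponsive" and host.pm_enabled:
            fence(host)
            fenced.append(host.name)
    return fenced


def schedule_startup_fencing(hosts, fence, quiet_sec=300.0):
    """Run fence_non_responsive once, quiet_sec seconds after being called."""
    timer = threading.Timer(quiet_sec, fence_non_responsive, args=(hosts, fence))
    timer.daemon = True
    timer.start()
    return timer
```

Host status is read when the timer fires, not when the job is scheduled, so a host that recovers during the quiet interval is skipped.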

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1506217#c4

Comment 1 Yaniv Kaul 2017-12-05 05:39:12 UTC
We could easily change the default on hosted engine?

Comment 2 Martin Perina 2017-12-05 08:53:13 UTC
(In reply to Yaniv Kaul from comment #1)
> We could easily change the default on hosted engine?

What do you mean by that? Enable that feature only on hosted engine? If so, then yes, we could introduce an option to enable/disable that feature, so HE setup can change the default if needed.

Comment 3 Petr Matyáš 2018-02-19 16:04:23 UTC
Verified on ovirt-engine-4.2.2-0.1.el7.noarch

Non-responsive hosts are fenced once the grace period following the engine startup sequence has elapsed.

Comment 4 Sandro Bonazzola 2018-03-29 11:16:36 UTC
This bugzilla is included in oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be resolved in oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.