Description of problem:
The engine brings hosts out of maintenance mode by choosing them as proxies for power fencing.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
In a two node cluster:
- Put host 2 into maintenance mode:
2017-11-23 10:56:25,726 INFO [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-42)  Updated vds status from 'Preparing for Maintenance' to 'Maintenance' in database, vds 'rhevh2-375'(ac28d47f-c403-45a0-9976-5025e5a9dca3)
- Stop 'vdsmd' on host 2.
- Stop 'vdsmd' on host 1 (the only host up and therefore the current SPM) and block IP traffic from the engine to it.
After _many_ failed connection attempts from RHEV-M to host 1, RHEV-M gives up and proceeds to try to fence host 1:
2017-11-23 11:07:37,557 INFO [org.ovirt.engine.core.bll.VdsEventListener] (org.ovirt.thread.pool-6-thread-43)  ResourceManager::vdsNotResponding entered for Host '70155b92-324f-4921-8365-377dbc6c297a', 'rhevh1-375.usersys.redhat.com'
It tries to use host 2 as a proxy to fence host 1 ...
2017-11-23 11:07:37,747 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Executing power management status on Host rhevh1-375 using Proxy Host rhevh2-375 and Fence Agent ipmilan:10.33.9.254.
... and fails since 'vdsmd' is stopped on host 2:
2017-11-23 11:07:37,947 WARN [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Fence action failed using proxy host 'rhevh2-375.usersys.redhat.com', trying another proxy
2017-11-23 11:07:38,074 ERROR [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Can not run fence action on host 'rhevh1-375', no suitable proxy host was found.
2017-11-23 11:07:38,074 WARN [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Failed to find another proxy to re-run failed fence action, retrying with the same proxy 'rhevh2-375.usersys.redhat.com'
- After a second and final failed attempt ...
2017-11-23 11:07:38,098 ERROR [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] FenceVdsVDSCommand finished with null return value: succeeded=false, exceptionString='org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed'
... RHEV-M runs "GetCapabilitiesVDSCommand" on host 2, bringing it out of maintenance mode:
2017-11-23 11:07:39,046 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor)  Connecting to rhevh2-375.usersys.redhat.com/10.33.9.146
2017-11-23 11:07:39,047 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-12) [357ea9d9] Command 'GetCapabilitiesVDSCommand(HostName = rhevh2-375, [...]
- Since host 2 wasn't actually powered down, RHEV-M performed SSH soft fencing and restarted 'vdsmd' on host 2:
2017-11-23 11:07:39,423 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-29) [26d1f53d] Opening SSH Soft Fencing session on host 'rhevh2-375.usersys.redhat.com'
2017-11-23 11:07:39,570 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-29) [26d1f53d] Executing SSH Soft Fencing command on host 'rhevh2-375.usersys.redhat.com'
Actual results:
The engine chooses hosts in maintenance mode as proxies to fence other hosts, bringing the chosen hosts out of maintenance mode in the process.
Expected results:
Hosts in maintenance mode are not brought out of maintenance mode unless requested by the user.
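The expected proxy selection can be sketched as follows. This is a minimal illustration with hypothetical names, not the engine's actual FenceProxyLocator code: the point is that a host in Maintenance should never be picked as a fence proxy by default, so proxy selection can never pull it out of maintenance mode.

```python
# Hypothetical sketch of fence-proxy selection that honours maintenance mode.
# Host and the status strings are stand-ins for the engine's internal types.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    status: str   # e.g. "Up", "Maintenance", "NonResponsive"
    cluster: str

def find_fence_proxy(hosts, target, allow_maintenance=False):
    """Return a host usable as a fencing proxy for `target`, or None.

    Hosts in Maintenance are skipped unless explicitly allowed, so that
    choosing a proxy never drags a host out of maintenance mode.
    """
    for host in hosts:
        if host.name == target.name:
            continue                      # never proxy through the target itself
        if host.cluster != target.cluster:
            continue                      # stay within the same cluster
        if host.status == "Up":
            return host
        if allow_maintenance and host.status == "Maintenance":
            return host
    return None

hosts = [
    Host("rhevh1-375", "NonResponsive", "Default"),
    Host("rhevh2-375", "Maintenance", "Default"),
]
# With the default policy no proxy is found, instead of pulling
# rhevh2-375 out of maintenance:
print(find_fence_proxy(hosts, hosts[0]))   # None
```

In this two-node scenario the correct outcome is "no suitable proxy" with host 2 left untouched in Maintenance.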
Additional info:
This impacts the procedure to recover from damaged LVM metadata outlined in https://access.redhat.com/solutions/120903
After further testing it turns out that disabling Power Management for the host is not a sufficient workaround unless the fence agent details are also deleted.
The "Enable Power Management" setting seems to be disregarded: if fence agent details are still configured, fencing still occurs. I had to re-enable power management, delete the fence agent details, and then disable power management again.
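The behaviour described above suggests the decision to fence is keyed on the mere presence of fence agent details rather than on the "Enable Power Management" flag. A hedged sketch of the check as it should behave (hypothetical function and parameter names, not the engine's actual code):

```python
# Hypothetical sketch: fencing should require BOTH the power-management
# flag and configured fence agent details, not agent details alone.
def should_attempt_fencing(pm_enabled, fence_agents):
    # The buggy behaviour observed in this report amounts to:
    #   return bool(fence_agents)
    # The correct check gates on the flag first:
    return pm_enabled and bool(fence_agents)

# Agent details still configured but PM disabled -> no fencing attempt.
print(should_attempt_fencing(False, [{"type": "ipmilan", "ip": "10.33.9.254"}]))  # False
```

Under the correct check, disabling Power Management alone would have been a sufficient workaround, with no need to delete the agent details.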
(In reply to Julio Entrena Perez from comment #2)
> After further testing it turns out that disabling Power Management for the
> host is not a sufficient workaround to prevent this, unless the details of
> the fencing agent are deleted.
> The "Enable Power Management" status seems to be disregarded and if info
> about fencing agent is still configured then fencing still occurs, I had to
> re-enable power management, delete the details of the fencing agent, then
> disable power management back.
The only workaround is to disable fencing for the cluster in the Fencing Policy tab of the Edit Cluster dialog. In that case fencing is disabled for the whole cluster, so the engine will not try to use hosts in Maintenance as fence proxies and those hosts will stay in Maintenance.
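For scripted environments, the same cluster-level change can presumably be made through the REST API; a sketch of the request body follows (element names per the oVirt 4 API — verify against your engine version's API documentation before use):

```xml
<!-- PUT /ovirt-engine/api/clusters/{cluster_id} -->
<!-- Sketch: disables fencing for the whole cluster. -->
<cluster>
  <fencing_policy>
    <enabled>false</enabled>
  </fencing_policy>
</cluster>
```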
Verified on ovirt-engine-4.2.0-0.6.el7.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.