Bug 1516817 - Don't allow hosts in Maintenance to be selected as fence proxies
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.10
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assigned To: Martin Perina
QA Contact: Petr Matyáš
Keywords: ZStream
Depends On:
Blocks: 1517707
Reported: 2017-11-23 06:58 EST by Julio Entrena Perez
Modified: 2018-05-15 13:47 EDT
CC: 9 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned To: 1517707
Environment:
Last Closed: 2018-05-15 13:46:12 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments:


External Trackers:
- oVirt gerrit 84575 (master), MERGED: core: Only hosts in Up and NonOperational can be fence proxies (last updated 2017-11-27 09:36 EST)
- Red Hat Product Errata RHEA-2018:1488 (last updated 2018-05-15 13:47 EDT)

Description Julio Entrena Perez 2017-11-23 06:58:48 EST
Description of problem:
The engine brings hosts out of Maintenance mode by choosing them as proxies for power fencing.

Version-Release number of selected component (if applicable):
rhevm-3.6.12-0.1.el6

How reproducible:
Always

Steps to Reproduce:

In a two node cluster:

- put host 2 into maintenance

  2017-11-23 10:56:25,726 INFO  [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-42) [] Updated vds status from 'Preparing for Maintenance' to 'Maintenance' in database,  vds 'rhevh2-375'(ac28d47f-c403-45a0-9976-5025e5a9dca3)

- stop 'vdsmd' on host 2
  
- stop 'vdsmd' on host 1, which is the only host up and therefore the current SPM, and block IP traffic from the engine to it.

  After _many_ failed connection attempts from RHEV-M to host 1, RHEV-M gives up and proceeds to try to fence host 1:
  
  2017-11-23 11:07:37,557 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (org.ovirt.thread.pool-6-thread-43) [] ResourceManager::vdsNotResponding entered for Host '70155b92-324f-4921-8365-377dbc6c297a', 'rhevh1-375.usersys.redhat.com'
  
  It tries to use host 2 as a proxy to fence host 1 ...
  
  2017-11-23 11:07:37,747 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Executing power management status on Host rhevh1-375 using Proxy Host rhevh2-375 and Fence Agent ipmilan:10.33.9.254.
  
  ... and fails since 'vdsmd' is stopped on host 2:
  
  2017-11-23 11:07:37,947 WARN  [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Fence action failed using proxy host 'rhevh2-375.usersys.redhat.com', trying another proxy

  2017-11-23 11:07:38,074 ERROR [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Can not run fence action on host 'rhevh1-375', no suitable proxy host was found.

  2017-11-23 11:07:38,074 WARN  [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Failed to find another proxy to re-run failed fence action, retrying with the same proxy 'rhevh2-375.usersys.redhat.com'
  

- After a second and final failed attempt ...

  2017-11-23 11:07:38,098 ERROR [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] FenceVdsVDSCommand finished with null return value: succeeded=false, exceptionString='org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed'

  ... RHEV-M tries to run "GetCapabilitiesVDSCommand" on host 2, bringing it out of maintenance mode (!):

  2017-11-23 11:07:39,046 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to rhevh2-375.usersys.redhat.com/10.33.9.146

  2017-11-23 11:07:39,047 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-12) [357ea9d9] Command 'GetCapabilitiesVDSCommand(HostName = rhevh2-375, [...]
  

- Since host 2 wasn't actually powered down, RHEV-M performed soft fencing and restarted 'vdsmd' on host 2:

  2017-11-23 11:07:39,423 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-29) [26d1f53d] Opening SSH Soft Fencing session on host 'rhevh2-375.usersys.redhat.com'

  2017-11-23 11:07:39,570 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-29) [26d1f53d] Executing SSH Soft Fencing command on host 'rhevh2-375.usersys.redhat.com'


Actual results:
The engine chooses hosts in Maintenance mode as proxies to fence other hosts, bringing the chosen hosts out of Maintenance mode in the process.

Expected results:
Hosts in Maintenance mode are not brought out of Maintenance mode unless requested by the user.

Additional info:
This impacts the process to recover from damaged LVM metadata outlined in article https://access.redhat.com/solutions/120903
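
The merged fix (gerrit 84575, "core: Only hosts in Up and NonOperational can be fence proxies") restricts proxy selection by host status. As a minimal illustrative sketch of that eligibility rule (the class, enum, and method names below are hypothetical, not the actual ovirt-engine code):

  import java.util.EnumSet;
  import java.util.List;
  import java.util.Optional;
  import java.util.Set;

  public class FenceProxySelector {

      enum HostStatus { UP, NON_OPERATIONAL, MAINTENANCE, NON_RESPONSIVE }

      record Host(String name, HostStatus status) {}

      // Per the fix, only hosts in Up or NonOperational status are
      // eligible fence proxies; Maintenance is deliberately excluded.
      private static final Set<HostStatus> ELIGIBLE =
              EnumSet.of(HostStatus.UP, HostStatus.NON_OPERATIONAL);

      static Optional<Host> selectProxy(List<Host> hosts, Host target) {
          return hosts.stream()
                  .filter(h -> !h.name().equals(target.name()))
                  .filter(h -> ELIGIBLE.contains(h.status()))
                  .findFirst();
      }

      public static void main(String[] args) {
          Host h1 = new Host("rhevh1-375", HostStatus.NON_RESPONSIVE);
          Host h2 = new Host("rhevh2-375", HostStatus.MAINTENANCE);
          // With host 2 in Maintenance no proxy is found, so the engine
          // reports "no suitable proxy host" instead of waking host 2.
          System.out.println(selectProxy(List.of(h1, h2), h1)); // Optional.empty
      }
  }

With this rule in place, the scenario above ends at the "no suitable proxy host was found" error rather than host 2 being contacted and pulled out of Maintenance.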
Comment 2 Julio Entrena Perez 2017-11-24 11:43:07 EST
After further testing, it turns out that disabling Power Management for the host is not a sufficient workaround to prevent this, unless the details of the fencing agent are deleted.

The "Enable Power Management" setting seems to be disregarded: if a fencing agent is still configured then fencing still occurs. I had to re-enable power management, delete the details of the fencing agent, and then disable power management again.
Comment 3 Martin Perina 2017-11-24 14:01:46 EST
(In reply to Julio Entrena Perez from comment #2)
> After further testing, it turns out that disabling Power Management for the
> host is not a sufficient workaround to prevent this, unless the details of
> the fencing agent are deleted.
>
> The "Enable Power Management" setting seems to be disregarded: if a fencing
> agent is still configured then fencing still occurs. I had to re-enable
> power management, delete the details of the fencing agent, and then disable
> power management again.

The only workaround is to disable fencing for the cluster in the Fencing Policy tab of the Cluster Detail dialog. In that case fencing is disabled for the whole cluster, so hosts in Maintenance will not be tried as fence proxies and will stay in Maintenance.
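
For reference, the same cluster-level switch can also be flipped outside the UI. A minimal sketch, assuming the oVirt 4 REST API's fencing_policy element on clusters; the engine URL, cluster UUID, and credentials below are placeholders, and the engine's CA certificate is assumed to be trusted by the JVM:

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.util.Base64;

  public class DisableClusterFencing {
      public static void main(String[] args) throws Exception {
          // Placeholder values -- replace with your engine and cluster.
          String engine = "https://engine.example.com/ovirt-engine/api";
          String clusterId = "00000000-0000-0000-0000-000000000000";
          String auth = Base64.getEncoder()
                  .encodeToString("admin@internal:password".getBytes());

          // The body updates only the fencing policy; other cluster
          // properties are left untouched by the PUT.
          String body = "<cluster><fencing_policy><enabled>false</enabled>"
                  + "</fencing_policy></cluster>";

          HttpRequest request = HttpRequest.newBuilder()
                  .uri(URI.create(engine + "/clusters/" + clusterId))
                  .header("Authorization", "Basic " + auth)
                  .header("Content-Type", "application/xml")
                  .header("Accept", "application/xml")
                  .method("PUT", HttpRequest.BodyPublishers.ofString(body))
                  .build();

          HttpResponse<String> response = HttpClient.newHttpClient()
                  .send(request, HttpResponse.BodyHandlers.ofString());
          System.out.println(response.statusCode());
      }
  }

Setting enabled back to true restores normal fencing once maintenance is finished.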
Comment 6 Petr Matyáš 2017-12-07 12:23:29 EST
Verified on ovirt-engine-4.2.0-0.6.el7.noarch
Comment 10 errata-xmlrpc 2018-05-15 13:46:12 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488
