Bug 1516817 - Don't allow hosts in Maintenance to be selected as fence proxies
Summary: Don't allow hosts in Maintenance to be selected as fence proxies
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.10
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assignee: Martin Perina
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks: 1517707
 
Reported: 2017-11-23 11:58 UTC by Julio Entrena Perez
Modified: 2021-03-11 18:38 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1517707 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:46:12 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2018:1488 | 0 | None | None | None | 2018-05-15 17:47:48 UTC
oVirt gerrit | 84575 | 0 | master | MERGED | core: Only hosts in Up and NonOperational can be fence proxies | 2020-09-10 18:36:01 UTC

Description Julio Entrena Perez 2017-11-23 11:58:48 UTC
Description of problem:
The engine selects hosts in maintenance mode as proxies for power fencing, which brings them out of maintenance mode.

Version-Release number of selected component (if applicable):
rhevm-3.6.12-0.1.el6

How reproducible:
Always

Steps to Reproduce:

In a two node cluster:

- put host 2 into maintenance

  2017-11-23 10:56:25,726 INFO  [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-42) [] Updated vds status from 'Preparing for Maintenance' to 'Maintenance' in database,  vds 'rhevh2-375'(ac28d47f-c403-45a0-9976-5025e5a9dca3)

- stop 'vdsmd' in host 2
  
- stop 'vdsmd' on host 1, which is the only host up and therefore the current SPM, and block IP traffic from engine to it.

  After _many_ failed connection attempts from RHEV-M to host 1, RHEV-M gives up and proceeds to try to fence host 1:
  
  2017-11-23 11:07:37,557 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (org.ovirt.thread.pool-6-thread-43) [] ResourceManager::vdsNotResponding entered for Host '70155b92-324f-4921-8365-377dbc6c297a', 'rhevh1-375.usersys.redhat.com'
  
  It tries to use host 2 as a proxy to fence host 1 ...
  
  2017-11-23 11:07:37,747 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Executing power management status on Host rhevh1-375 using Proxy Host rhevh2-375 and Fence Agent ipmilan:10.33.9.254.
  
  ... and fails since 'vdsmd' is stopped in host 2:
  
  2017-11-23 11:07:37,947 WARN  [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Fence action failed using proxy host 'rhevh2-375.usersys.redhat.com', trying another proxy

  2017-11-23 11:07:38,074 ERROR [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Can not run fence action on host 'rhevh1-375', no suitable proxy host was found.

  2017-11-23 11:07:38,074 WARN  [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Failed to find another proxy to re-run failed fence action, retrying with the same proxy 'rhevh2-375.usersys.redhat.com'
  

- After a second and final failed attempt ...

  2017-11-23 11:07:38,098 ERROR [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] FenceVdsVDSCommand finished with null return value: succeeded=false, exceptionString='org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed'

  ... RHEV-M tries to run "GetCapabilitiesVDSCommand" on host 2, bringing it out of maintenance mode:

  2017-11-23 11:07:39,046 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to rhevh2-375.usersys.redhat.com/10.33.9.146

  2017-11-23 11:07:39,047 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-12) [357ea9d9] Command 'GetCapabilitiesVDSCommand(HostName = rhevh2-375, [...]
  

- Since host 2 wasn't actually powered down, RHEV-M did a soft fencing and restarted 'vdsmd' in host 2:

  2017-11-23 11:07:39,423 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-29) [26d1f53d] Opening SSH Soft Fencing session on host 'rhevh2-375.usersys.redhat.com'

  2017-11-23 11:07:39,570 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-29) [26d1f53d] Executing SSH Soft Fencing command on host 'rhevh2-375.usersys.redhat.com'
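
To check an existing engine.log for the same sequence, the two message formats quoted in the steps above can be matched mechanically. The following is only a rough sketch: the function name and the regular expressions are my own, derived solely from the log lines shown in this report, and other engine versions may word these messages differently.

```python
import re

# Patterns derived from the engine.log lines quoted in this report (assumption:
# other engine versions may format these messages differently).
STATUS_RE = re.compile(
    r"Updated vds status from '[^']*' to '(?P<status>[^']*)' in database,"
    r"\s+vds '(?P<host>[^']+)'"
)
PROXY_RE = re.compile(r"using Proxy Host (?P<proxy>\S+) and Fence Agent")


def proxies_in_maintenance(lines):
    """Return (line_number, host) for each proxy chosen while in Maintenance.

    Hypothetical helper: it tracks the last status logged for each host and
    reports every "using Proxy Host" message whose proxy was last seen in
    'Maintenance'.
    """
    status = {}  # last status seen per host name
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STATUS_RE.search(line)
        if match:
            status[match.group("host")] = match.group("status")
            continue
        match = PROXY_RE.search(line)
        if match and status.get(match.group("proxy")) == "Maintenance":
            hits.append((number, match.group("proxy")))
    return hits
```

Passing the lines of an engine.log (for example via `open(path)`, where the path depends on the installation) returns the line numbers at which a host last seen in Maintenance was used as a fence proxy.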


Actual results:
The engine chooses hosts in maintenance mode as proxies to fence other hosts, bringing the chosen hosts out of maintenance mode in the process.

Expected results:
Hosts in maintenance mode are not brought out of maintenance mode unless requested by user.
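
The expected behaviour amounts to filtering proxy candidates by status before one is picked. Below is a minimal sketch of that rule, assuming the set of allowed statuses named in the title of the linked gerrit patch ("Only hosts in Up and NonOperational can be fence proxies"). It is not the ovirt-engine code, which is written in Java; `HostStatus`, `Host` and `find_fence_proxy` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class HostStatus(Enum):
    # Illustrative subset of host statuses, not the engine's full list.
    UP = "Up"
    NON_OPERATIONAL = "NonOperational"
    MAINTENANCE = "Maintenance"
    NON_RESPONSIVE = "NonResponsive"


# Statuses allowed for fence proxies, per the title of the merged patch.
PROXY_ELIGIBLE = frozenset({HostStatus.UP, HostStatus.NON_OPERATIONAL})


@dataclass(frozen=True)
class Host:
    name: str
    status: HostStatus


def find_fence_proxy(fenced, candidates):
    """Return the first eligible proxy for `fenced`, or None if there is none.

    A host never acts as its own proxy, and hosts outside PROXY_ELIGIBLE
    (for example those in Maintenance) are skipped rather than contacted.
    """
    for host in candidates:
        if host.name != fenced.name and host.status in PROXY_ELIGIBLE:
            return host
    return None
```

In the two node scenario from this report, such a filter returns no proxy at all, so the fence action fails without contacting host 2 and host 2 stays in Maintenance.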

Additional info:
This impacts the process to recover from damaged LVM metadata outlined in article https://access.redhat.com/solutions/120903

Comment 2 Julio Entrena Perez 2017-11-24 16:43:07 UTC
After further testing it turns out that disabling Power Management for the host is not a sufficient workaround to prevent this, unless the details of the fencing agent are deleted.

The "Enable Power Management" status seems to be disregarded: if the fencing agent details are still configured, fencing still occurs. I had to re-enable power management, delete the details of the fencing agent, and then disable power management again.

Comment 3 Martin Perina 2017-11-24 19:01:46 UTC
(In reply to Julio Entrena Perez from comment #2)
> After further testing it turns out that disabling Power Management for the
> host is not a sufficient workaround to prevent this, unless the details of
> the fencing agent are deleted.
> 
> The "Enable Power Management" status seems to be disregarded and if info
> about fencing agent is still configured then fencing still occurs, I had to
> re-enable power management, delete the details of the fencing agent, then
> disable power management back.

The only workaround is to disable fencing for the cluster in the Fencing Policy tab of the Cluster Detail dialog. In that case fencing will be disabled, so the engine will not try to use hosts in Maintenance as fence proxies and those hosts will stay in Maintenance.

Comment 6 Petr Matyáš 2017-12-07 17:23:29 UTC
Verified on ovirt-engine-4.2.0-0.6.el7.noarch

Comment 10 errata-xmlrpc 2018-05-15 17:46:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 11 Franta Kust 2019-05-16 13:09:06 UTC
BZ<2>Jira Resync

