Bug 1517707

Summary: [downstream clone - 4.1.9] Don't allow hosts in Maintenance to be selected as fence proxies
Product: Red Hat Enterprise Virtualization Manager
Reporter: rhev-integ
Component: ovirt-engine
Assignee: Martin Perina <mperina>
Status: CLOSED ERRATA
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.6.10
CC: lsurette, lveyde, mgoldboi, mperina, pstehlik, rbalakri, Rhev-m-bugs, srevivo, vanhoof, ykaul
Target Milestone: ovirt-4.1.9
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: ovirt-engine-4.1.9
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1516817
Environment:
Last Closed: 2018-01-24 14:43:33 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1516817
Bug Blocks: 1523346

Description rhev-integ 2017-11-27 09:40:05 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1516817 +++
======================================================================

Description of problem:
The engine brings hosts out of maintenance mode by choosing them as proxies for power fencing.

Version-Release number of selected component (if applicable):
rhevm-3.6.12-0.1.el6

How reproducible:
Always

Steps to Reproduce:

In a two-node cluster:

- put host 2 into maintenance

  2017-11-23 10:56:25,726 INFO  [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-42) [] Updated vds status from 'Preparing for Maintenance' to 'Maintenance' in database,  vds 'rhevh2-375'(ac28d47f-c403-45a0-9976-5025e5a9dca3)

- stop 'vdsmd' in host 2
  
- stop 'vdsmd' on host 1, which is the only host up and therefore the current SPM, and block IP traffic from the engine to it.

  After _many_ failed connection attempts from RHEV-M to host 1, RHEV-M gives up and proceeds to try to fence host 1:
  
  2017-11-23 11:07:37,557 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (org.ovirt.thread.pool-6-thread-43) [] ResourceManager::vdsNotResponding entered for Host '70155b92-324f-4921-8365-377dbc6c297a', 'rhevh1-375.usersys.redhat.com'
  
  It tries to use host 2 as a proxy to fence host 1 ...
  
  2017-11-23 11:07:37,747 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Executing power management status on Host rhevh1-375 using Proxy Host rhevh2-375 and Fence Agent ipmilan:10.33.9.254.
  
  ... and fails since 'vdsmd' is stopped on host 2:
  
  2017-11-23 11:07:37,947 WARN  [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Fence action failed using proxy host 'rhevh2-375.usersys.redhat.com', trying another proxy

  2017-11-23 11:07:38,074 ERROR [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Can not run fence action on host 'rhevh1-375', no suitable proxy host was found.

  2017-11-23 11:07:38,074 WARN  [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Failed to find another proxy to re-run failed fence action, retrying with the same proxy 'rhevh2-375.usersys.redhat.com'
  

- After a second and final failed attempt ...

  2017-11-23 11:07:38,098 ERROR [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] FenceVdsVDSCommand finished with null return value: succeeded=false, exceptionString='org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed'

  ... RHEV-M tries to run "GetCapabilitiesVDSCommand" on host 2, bringing it out of maintenance mode:

  2017-11-23 11:07:39,046 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to rhevh2-375.usersys.redhat.com/10.33.9.146

  2017-11-23 11:07:39,047 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-12) [357ea9d9] Command 'GetCapabilitiesVDSCommand(HostName = rhevh2-375, [...]
  

- Since host 2 wasn't actually powered down, RHEV-M performed SSH soft fencing and restarted 'vdsmd' on host 2:

  2017-11-23 11:07:39,423 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-29) [26d1f53d] Opening SSH Soft Fencing session on host 'rhevh2-375.usersys.redhat.com'

  2017-11-23 11:07:39,570 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-29) [26d1f53d] Executing SSH Soft Fencing command on host 'rhevh2-375.usersys.redhat.com'


Actual results:
The engine chooses hosts in maintenance mode as proxies to fence other hosts, bringing the chosen hosts out of maintenance mode in the process.

Expected results:
Hosts in maintenance mode are not brought out of maintenance mode unless explicitly requested by the user.

Additional info:
This impacts the procedure for recovering from damaged LVM metadata outlined in https://access.redhat.com/solutions/120903

(Originally by Julio Entrena Perez)
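
A minimal Java sketch of the selection rule the fix implies (skip hosts in Maintenance when picking a fence proxy). Class and method names here are illustrative, not the actual FenceProxyLocator code:

  import java.util.List;
  import java.util.stream.Collectors;

  enum VdsStatus { UP, PREPARING_FOR_MAINTENANCE, MAINTENANCE, NON_RESPONSIVE }

  class Host {
      final String name;
      final VdsStatus status;
      Host(String name, VdsStatus status) { this.name = name; this.status = status; }
  }

  class FenceProxySelector {
      // Candidate proxies: every other host in the cluster that the user
      // has not put into Maintenance (or Preparing for Maintenance).
      List<Host> eligibleProxies(List<Host> clusterHosts, Host fencedHost) {
          return clusterHosts.stream()
                  .filter(h -> !h.name.equals(fencedHost.name))
                  .filter(h -> h.status != VdsStatus.MAINTENANCE
                            && h.status != VdsStatus.PREPARING_FOR_MAINTENANCE)
                  .collect(Collectors.toList());
      }
  }

With a filter like this in place, the two-node scenario above finds no suitable proxy and fails the fence action instead of silently activating host 2.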

Comment 3 rhev-integ 2017-11-27 09:40:16 UTC
After further testing it turns out that disabling Power Management for the host is not a sufficient workaround to prevent this, unless the details of the fencing agent are also deleted.

The "Enable Power Management" setting seems to be disregarded: if a fencing agent is still configured, fencing still occurs. I had to re-enable power management, delete the details of the fencing agent, and then disable power management again.

(Originally by Julio Entrena Perez)

Comment 4 rhev-integ 2017-11-27 09:40:20 UTC
(In reply to Julio Entrena Perez from comment #2)
> After further testing it turns out that disabling Power Management for the
> host is not a sufficient workaround to prevent this, unless the details of
> the fencing agent are deleted.
> 
> The "Enable Power Management" status seems to be disregarded and if info
> about fencing agent is still configured then fencing still occurs, I had to
> re-enable power management, delete the details of the fencing agent, then
> disable power management back.

The only workaround is to disable fencing for the whole cluster in the Fencing Policy tab of the Edit Cluster dialog. In that case fencing is disabled entirely, so the engine will not try to use hosts in Maintenance as fence proxies and those hosts will stay in Maintenance.

(Originally by Martin Perina)
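
The same policy can presumably also be flipped over the REST API by updating the cluster's fencing policy; a minimal Java 11 sketch, where the engine URL, cluster UUID, and credentials are placeholders and the fencing_policy element follows the oVirt v4 API schema:

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.util.Base64;

  public class DisableClusterFencing {
      public static void main(String[] args) throws Exception {
          // All three values below are placeholders; this also assumes the
          // engine's TLS certificate is trusted by the JVM.
          String api = "https://engine.example.com/ovirt-engine/api";
          String clusterId = "00000000-0000-0000-0000-000000000000";
          String auth = Base64.getEncoder()
                  .encodeToString("admin@internal:password".getBytes());

          String body = "<cluster><fencing_policy><enabled>false</enabled>"
                      + "</fencing_policy></cluster>";

          HttpRequest request = HttpRequest.newBuilder()
                  .uri(URI.create(api + "/clusters/" + clusterId))
                  .header("Authorization", "Basic " + auth)
                  .header("Content-Type", "application/xml")
                  .PUT(HttpRequest.BodyPublishers.ofString(body))
                  .build();

          HttpResponse<String> response = HttpClient.newHttpClient()
                  .send(request, HttpResponse.BodyHandlers.ofString());
          System.out.println(response.statusCode() + " " + response.body());
      }
  }

Fencing can be re-enabled the same way with <enabled>true</enabled> once the maintenance work is done.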

Comment 5 Petr Matyáš 2018-01-16 11:48:23 UTC
Verified on ovirt-engine-4.1.9-0.2.el7.noarch

Comment 8 errata-xmlrpc 2018-01-24 14:43:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0135
