1516817 – Don't allow hosts in Maintenance to be selected as fence proxies

Summary: Don't allow hosts in Maintenance to be selected as fence proxies

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.6.10
Hardware:	All
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	ovirt-4.2.0
Target Release:	---
Assignee:	Martin Perina
QA Contact:	Petr Matyáš
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1517707

Reported:	2017-11-23 11:58 UTC by Julio Entrena Perez
Modified:	2021-03-11 18:38 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1517707
Environment:
Last Closed:	2018-05-15 17:46:12 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2018:1488	0	None	None	None	2018-05-15 17:47:48 UTC
oVirt gerrit	84575	0	master	MERGED	core: Only hosts in Up and NonOperational can be fence proxies	2020-09-10 18:36:01 UTC

Description Julio Entrena Perez 2017-11-23 11:58:48 UTC

Description of problem:
The engine brings hosts out of maintenance mode by selecting them as proxies for power fencing.

Version-Release number of selected component (if applicable):
rhevm-3.6.12-0.1.el6

How reproducible:
Always

Steps to Reproduce:

In a two node cluster:

- put host 2 into maintenance

  2017-11-23 10:56:25,726 INFO  [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-42) [] Updated vds status from 'Preparing for Maintenance' to 'Maintenance' in database,  vds 'rhevh2-375'(ac28d47f-c403-45a0-9976-5025e5a9dca3)

- stop 'vdsmd' on host 2
  
- stop 'vdsmd' on host 1, which is the only host up and therefore the current SPM, and block IP traffic from engine to it.

  After _many_ failed connection attempts from RHEV-M to host 1, RHEV-M gives up and proceeds to try to fence host 1:
  
  2017-11-23 11:07:37,557 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (org.ovirt.thread.pool-6-thread-43) [] ResourceManager::vdsNotResponding entered for Host '70155b92-324f-4921-8365-377dbc6c297a', 'rhevh1-375.usersys.redhat.com'
  
  It tries to use host 2 as a proxy to fence host 1 ...
  
  2017-11-23 11:07:37,747 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Executing power management status on Host rhevh1-375 using Proxy Host rhevh2-375 and Fence Agent ipmilan:10.33.9.254.
  
  ... and fails since 'vdsmd' is stopped on host 2:
  
  2017-11-23 11:07:37,947 WARN  [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Fence action failed using proxy host 'rhevh2-375.usersys.redhat.com', trying another proxy

  2017-11-23 11:07:38,074 ERROR [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Can not run fence action on host 'rhevh1-375', no suitable proxy host was found.

  2017-11-23 11:07:38,074 WARN  [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] Failed to find another proxy to re-run failed fence action, retrying with the same proxy 'rhevh2-375.usersys.redhat.com'
  

- After a second and final failed attempt ...

  2017-11-23 11:07:38,098 ERROR [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (org.ovirt.thread.pool-6-thread-43) [33227fa7] FenceVdsVDSCommand finished with null return value: succeeded=false, exceptionString='org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed'

  ... RHEV-M tries to run "GetCapabilitiesVDSCommand" on host 2, bringing it out of maintenance mode !!! :

  2017-11-23 11:07:39,046 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to rhevh2-375.usersys.redhat.com/10.33.9.146

  2017-11-23 11:07:39,047 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-12) [357ea9d9] Command 'GetCapabilitiesVDSCommand(HostName = rhevh2-375, [...]
  

- Since host 2 wasn't actually powered down, RHEV-M performed a soft fencing and restarted 'vdsmd' on host 2:

  2017-11-23 11:07:39,423 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-29) [26d1f53d] Opening SSH Soft Fencing session on host 'rhevh2-375.usersys.redhat.com'

  2017-11-23 11:07:39,570 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-29) [26d1f53d] Executing SSH Soft Fencing command on host 'rhevh2-375.usersys.redhat.com'
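
The sequence above can be checked mechanically in an engine.log. The following is a minimal sketch, not part of the engine: the class name and the two regular expressions are assumptions derived only from the message formats quoted in this report, and the scan deliberately ignores hosts leaving Maintenance again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: flags log lines where a host that was previously
// moved to Maintenance is later used as a fence proxy.
public class MaintenanceProxyLogScan {

    // Matches the HostMonitoring status update quoted in this report.
    private static final Pattern TO_MAINTENANCE =
            Pattern.compile("to 'Maintenance' in database,\\s+vds '([^']+)'");

    // Matches the audit log message naming the fenced host and the proxy host.
    private static final Pattern PROXY_USED =
            Pattern.compile("on Host (\\S+) using Proxy Host (\\S+) and Fence Agent");

    static List<String> proxiesInMaintenance(List<String> logLines) {
        Set<String> inMaintenance = new HashSet<>();
        List<String> findings = new ArrayList<>();
        for (String line : logLines) {
            Matcher maintenance = TO_MAINTENANCE.matcher(line);
            if (maintenance.find()) {
                // Remember the host; this sketch never removes it again.
                inMaintenance.add(maintenance.group(1));
                continue;
            }
            Matcher proxy = PROXY_USED.matcher(line);
            if (proxy.find() && inMaintenance.contains(proxy.group(2))) {
                findings.add(proxy.group(2) + " used as proxy to fence " + proxy.group(1));
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        // Shortened copies of two log lines from this report.
        List<String> lines = Arrays.asList(
                "Updated vds status from 'Preparing for Maintenance' to 'Maintenance' in database,  vds 'rhevh2-375'(ac28d47f-c403-45a0-9976-5025e5a9dca3)",
                "Message: Executing power management status on Host rhevh1-375 using Proxy Host rhevh2-375 and Fence Agent ipmilan:10.33.9.254.");
        // Prints: [rhevh2-375 used as proxy to fence rhevh1-375]
        System.out.println(proxiesInMaintenance(lines));
    }
}
```

A non-empty result means a host in Maintenance was selected as a proxy, which is the behaviour this bug describes.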


Actual results:
The engine chooses hosts in maintenance mode as proxies to fence other hosts, bringing the chosen hosts out of maintenance mode in the process.

Expected results:
Hosts in maintenance mode are not brought out of maintenance mode unless requested by user.
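
The expected behaviour amounts to filtering proxy candidates by host status before one is selected. The sketch below illustrates that idea only. It is not the engine code or the merged patch: the class, enum, and method names are hypothetical, and the third host is invented for the example. The choice of Up and NonOperational as the eligible statuses is taken from the title of the linked gerrit change 84575.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration of status-based fence proxy filtering.
public class FenceProxyFilterSketch {

    // Assumed subset of host statuses, named after those in this report.
    enum HostStatus { UP, NON_OPERATIONAL, MAINTENANCE, PREPARING_FOR_MAINTENANCE, NON_RESPONSIVE }

    // Assumed minimal host model: a name and a status.
    static final class Host {
        final String name;
        final HostStatus status;

        Host(String name, HostStatus status) {
            this.name = name;
            this.status = status;
        }

        @Override
        public String toString() {
            return name + "(" + status + ")";
        }
    }

    // Statuses allowed for proxies, per the title of gerrit change 84575.
    static final Set<HostStatus> PROXY_ELIGIBLE =
            EnumSet.of(HostStatus.UP, HostStatus.NON_OPERATIONAL);

    // Returns the candidates that may act as a proxy for fencing fencedHost.
    static List<Host> eligibleProxies(List<Host> candidates, Host fencedHost) {
        return candidates.stream()
                .filter(h -> !h.name.equals(fencedHost.name))      // a host cannot proxy its own fencing
                .filter(h -> PROXY_ELIGIBLE.contains(h.status))    // skips Maintenance and other states
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Host host1 = new Host("rhevh1-375", HostStatus.NON_RESPONSIVE);
        Host host2 = new Host("rhevh2-375", HostStatus.MAINTENANCE);
        Host host3 = new Host("example-host3", HostStatus.UP); // invented for this example
        // Prints: [example-host3(UP)]
        System.out.println(eligibleProxies(Arrays.asList(host1, host2, host3), host1));
    }
}
```

In the two-node scenario from this report the filtered list would be empty, so no proxy would be contacted and host 2 would stay in Maintenance.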

Additional info:
This affects the procedure for recovering from damaged LVM metadata outlined in the article https://access.redhat.com/solutions/120903

Comment 2 Julio Entrena Perez 2017-11-24 16:43:07 UTC

After further testing it turns out that disabling Power Management for the host is not a sufficient workaround to prevent this, unless the details of the fencing agent are deleted.

The "Enable Power Management" setting seems to be disregarded: if the fencing agent details are still configured, fencing still occurs. I had to re-enable power management, delete the details of the fencing agent, and then disable power management again.

Comment 3 Martin Perina 2017-11-24 19:01:46 UTC

(In reply to Julio Entrena Perez from comment #2)
> After further testing it turns out that disabling Power Management for the
> host is not a sufficient workaround to prevent this, unless the details of
> the fencing agent are deleted.
> 
> The "Enable Power Management" status seems to be disregarded and if info
> about fencing agent is still configured then fencing still occurs, I had to
> re-enable power management, delete the details of the fencing agent, then
> disable power management back.

The only workaround is to disable fencing for the cluster in the Fencing Policy tab of the Cluster Detail dialog. In that case fencing will be disabled, so we will not try to use hosts in Maintenance as fence proxies and those hosts will stay in Maintenance.

Comment 6 Petr Matyáš 2017-12-07 17:23:29 UTC

Verified on ovirt-engine-4.2.0-0.6.el7.noarch

Comment 10 errata-xmlrpc 2018-05-15 17:46:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 11 Franta Kust 2019-05-16 13:09:06 UTC

BZ<2>Jira Resync
