1240330 – fencing adjacent node occurs even if the stonith resource is Stopped

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	7.1
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	7.3
Assignee:	Klaus Wenninger
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:	1304771
Blocks:	1364088

Reported:	2015-07-06 15:16 UTC by Renaud Marigny
Modified:	2023-09-14 03:01 UTC
CC List:	10 users
Fixed In Version:	pacemaker-1.1.15-1.2c148ac.git.el7
Doc Type:	No Doc Update
Doc Text:	This was fixed in 7.1.z and 7.2, so there is no change in behavior in 7.3.
Clone Of:
Clones:	1301204
Environment:
Last Closed:	2016-11-03 18:56:02 UTC
Target Upstream Version:
Embargoed:
Flags:	kwenning: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	2065493	0	None	None	None	2017-12-06 17:28:59 UTC
Red Hat Product Errata	RHSA-2016:2578	0	normal	SHIPPED_LIVE	Moderate: pacemaker security, bug fix, and enhancement update	2016-11-03 12:07:24 UTC

Comment 2 David Vossel 2015-07-06 16:15:04 UTC

I'm not surprised by this. It's possible that "stopping" a stonith device only really means "stop performing status monitoring". Stonith likely still has the device registered even after the "stop" which is why stonith can still successfully fence the node. 

I wouldn't give this a high priority. There's a very simple workaround. Instead of disabling the stonith agent using 'pcs resource disable', delete it from the configuration using 'pcs stonith delete'.

Also, I'm surprised it's possible to disable a stonith device using pcs. There are no commands under 'pcs stonith' capable of disabling a fencing device. Someone would have to use 'pcs resource disable', which isn't stonith-specific. This may work, but I'm not sure it is supported.

Comment 3 Andrew Beekhof 2015-07-15 23:06:51 UTC

Disable is supposed to mark it as unusable for fencing.
Likewise a location constraint is supposed to be able to make it unusable on a specific node.

It is possible this broke somewhere along the line.
Realistically we're not going to get to it until 7.3 though :-(

Comment 8 Ken Gaillot 2016-01-19 21:01:22 UTC

To clarify the goal here, it is an intentional feature of pacemaker that fencing devices can be used even if in 'Stopped' state. However, if *target-role* is Stopped, then the fencing device should not be used. Separately, if a node has a negative location constraint for a fencing device, that node should not use the device.
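
The rule stated above can be sketched as a small shell check. This is only an illustration of the described behaviour, not pacemaker's actual code: fence_device_usable and its two arguments are hypothetical, and it assumes the target-role value is spelled exactly "Stopped" and that a negative score is written with a leading minus sign (e.g. -INFINITY). Note that the current run state of the resource is deliberately not an input.

```shell
# Hypothetical helper, for illustration only.
#   $1 = value of the target-role meta attribute (empty if unset)
#   $2 = location constraint score of the device on this node
#        (empty if there is no constraint; e.g. 100, INFINITY, -INFINITY)
# Returns 0 if the node may use the device for fencing, 1 otherwise.
fence_device_usable() {
    role=$1
    score=$2

    # target-role=Stopped means the device must not be used at all.
    if [ "$role" = "Stopped" ]; then
        return 1
    fi

    # A negative location score means this node must not use the device.
    case "$score" in
        -*) return 1 ;;
    esac

    # Otherwise the device stays usable, even if it is not currently running.
    return 0
}
```

For example, fence_device_usable "" "-INFINITY" fails, while fence_device_usable "" "" succeeds.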

Comment 9 Moullé Alain 2016-01-20 08:33:01 UTC

Hi
1/ About David Vossel's first remark:
I was also a little lost when acting on stonith resources: sometimes it is pcs stonith <action> (e.g. show), sometimes it is pcs resource <action> (e.g. enable/disable), but OK, that is a detail.
2/ About pacemaker's ability to fence with a stonith resource even though it is disabled: this was not the case in older releases of pacemaker. In the crm days, when we set a stonith resource to Stopped with crm resource stop, pacemaker did not try to fence the node targeted by that stonith resource. This was an easy way to avoid fencing an adjacent node at pacemaker start when the user is aware of what they are doing and definitely does not want the adjacent node to be fenced (for example, when the adjacent node is temporarily used for something else, such as maintenance, and the administrator knows for sure that no shared resources are started on it).
So I do agree with Andrew's statement: "Disable is supposed to mark it as unusable for fencing".
Regards
Alain Moullé

Comment 10 Klaus Wenninger 2016-02-01 19:29:31 UTC

I had a look, and the code that filters the list of resources (resource
is a stonith device, target-role != Stopped, non-negative score on
the node, ...) is there.
There is also update_cib_cache_cb, which takes care of relevant
dynamic changes in the CIB.

I tried the straightforward case with the CIB created by CTS.

  pcs stonith update Fencing meta target-role="Stopped"

leads to the expected behaviour, e.g. Fencing is no longer
shown by
  
  crm_mon 

and the device is no longer listed by

  stonith_admin -L 

and we get the expected log from stonithd
telling that it got the dynamic change.

Feb 01 17:53:44 [1960] node2 stonith-ng: (      main.c:638   )    info: cib_device_update:      Device Fencing has been disabled

... same status after a pacemaker restart ...
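
The stonith_admin -L check above could be scripted roughly as follows. This is a sketch: device_registered is a hypothetical helper, and it assumes that the listing prints one device name per line (possibly indented) followed by a summary line, which may differ between pacemaker versions.

```shell
# Hypothetical helper, for illustration only.
# Reads a device listing (e.g. from `stonith_admin -L`) on stdin and
# succeeds if device $1 appears on a line of its own.
# Assumption: one device name per line, optionally indented.
device_registered() {
    awk -v dev="$1" '
        NF == 1 && $1 == dev { found = 1 }
        END { exit !found }
    '
}

# Intended use on a cluster node (not executed here):
#   stonith_admin -L | device_registered Fencing && echo "still registered"
```

After setting target-role="Stopped", the check is expected to fail for the disabled device, both before and after a pacemaker restart.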

I haven't played with location constraints yet to see how well that works
when forcing a stonith device off a node that way.
My test was done with upstream master as of a couple of days ago.
So one possibility is that this is a regression which has meanwhile been
fixed again upstream (intentionally or not, who knows ...); another is that
the scenario described above uses location constraints which make the
filtering done by stonithd fail.

Comment 11 Klaus Wenninger 2016-02-02 10:48:27 UTC

Forcing it off via location constraint seems to lead to the desired result
as well:

pcs resource ban Fencing node2

Feb 02 10:56:30 [22227] node2 stonith-ng: (  commands.c:2556  )   debug: stonith_command:       Processing st_device_remove 17 from lrmd.22228 (            1000)
Feb 02 10:56:30 [22227] node2 stonith-ng: (  commands.c:1078  )    info: stonith_device_remove: Removed 'Fencing' from the device list (2 active devices)

Verification with stonith_admin is OK as well (also after a pacemaker restart).

Comment 12 Klaus Wenninger 2016-02-02 14:22:53 UTC

On pacemaker-1.1.13-10.el7 (the one delivered with RHEL 7.2) I
can reproduce a fencing device that was forced off via ban staying in
stonithd's device list, although crm_mon, for instance, no longer shows it as running.

I see:

Feb 02 15:07:14 [25409] node2 stonith-ng: (  commands.c:2450  )   debug: stonith_command:       Processing st_device_remove 14 from lrmd.25410 (            1000)
Feb 02 15:07:14 [25409] node2 stonith-ng: (  commands.c:2464  )   debug: stonith_command:       Processed st_device_remove from lrmd.25410: OK (0)

The info line from stonith_device_remove seems to be missing.
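
A quick way to tell the two cases apart in a log is to look for the info line quoted in comment 11. The sketch below is only an illustration: device_removal_logged is a hypothetical helper, and it assumes the message wording is exactly as quoted there.

```shell
# Hypothetical helper, for illustration only.
# Reads log text on stdin and succeeds if it contains the stonithd info
# line reporting that device $1 was dropped from the device list.
# Assumption: the message text matches the line quoted in comment 11.
device_removal_logged() {
    grep -F -q "Removed '$1' from the device list"
}

# Intended use (the log file path is an assumption and depends on the setup):
#   device_removal_logged Fencing < /var/log/pacemaker.log || echo "device not removed"
```

With the two debug lines shown above as input, the check fails, because st_device_remove is processed but the removal itself is never reported.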

Turning off the stonith device using target-role="Stopped", on the other
hand, seems to work reliably. It even cleans up the situation after it has
been messed up using ban.
A pacemaker restart of course cures the situation as well, with the ban rule
still in place and the stonith device still configured.

Comment 13 Klaus Wenninger 2016-02-02 17:31:57 UTC

Do you still have the pacemaker package version with which you experienced
the described behaviour?
I gave pacemaker-1.1.12-22.el7 a try (should be the version 7.1 was released with),
and with that I could reproduce that disabling a fencing resource doesn't
remove it from the device list (stonith_admin -L).
With this version, however, shutting down the device via a ban rule seems
to work properly, including the device being removed from the device list.

Comment 15 Klaus Wenninger 2016-02-03 15:15:16 UTC

Found from the sosreport that it was a RHEL 7.1 without z-stream.

That is fine as it matches my research ;-)
Checking the most current pacemaker from the z-stream (pacemaker-1.1.12-22.el7_1.4)
showed that the behaviour there is the same as with RHEL 7.2:
forcing the stonith device stopped via disable seems to work, and using
a ban rule seems not to.

Comment 16 Klaus Wenninger 2016-02-03 18:39:33 UTC

Test with upstream pacemaker-1.1.4 showed that both testcases (disable & ban)
work as expected.

Regarding the initial symptom, that disabling a fencing resource (on the fly)
still left the device in stonithd's list, we can state that
it is definitely reproducible with RHEL 7.1 as initially released, but
that it is already solved in both RHEL 7.2 and RHEL 7.1.z.

As the target release is 7.3 and we expect a rebase to at least the version
6.8 has been rebased to (pacemaker-1.1.14), both issues (stopping a fencing resource
using disable & ban) should be solved there automatically.

There have been multiple recent changes in the code that decides whether stonithd
feels responsible for a certain CIB change when it receives the callback.
So I guess these are responsible for the behaviour being back to what
is desired.
Unless we were planning a z-stream fix for a release, it is probably academic
to really pin down which of the changes cured which of the 2 cases.

Comment 18 Mike McCune 2016-03-28 22:53:35 UTC

This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 20 Jaroslav Kortus 2016-09-09 17:14:11 UTC

Verified with pacemaker-1.1.15-10.el7.x86_64.

Both conditions (ban and disable) work as desired.
These messages can be observed when the fencing device is disabled (pcs resource disable) or banned (pcs resource ban <fencing dev>  <node>):
error: Operation reboot of virt-046 by <no-one> for stonith-api.26768: No such device
stonith_api_kick: Could not kick (reboot) node 2/(null) : No such device (-19)
kick_helper error -19 nodeid 2

pcs status snip showing disabled fencing device:
Full list of resources:

 fence-virt-045	(stonith:fence_xvm):	Started virt-045
 fence-virt-046	(stonith:fence_xvm):	Stopped (disabled)
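
For scripted verification, the disabled fencing devices can be pulled out of a status listing like the one above. This is a sketch under the assumption that the resource lines look exactly as in this snippet; disabled_stonith_devices is a hypothetical helper name.

```shell
# Hypothetical helper, for illustration only.
# Reads `pcs status` text on stdin and prints the name of every stonith
# resource reported as "Stopped (disabled)".
# Assumption: resource lines look like
#   <name> (stonith:<agent>): Stopped (disabled)
disabled_stonith_devices() {
    awk '/\(stonith:/ && /Stopped \(disabled\)/ { print $1 }'
}

# Intended use on a cluster node (not executed here):
#   pcs status | disabled_stonith_devices
```

Fed the two resource lines from this snippet, it prints only fence-virt-046.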


Nodes are not rebooted upon rejoin, regular fencing works as expected.

Comment 22 errata-xmlrpc 2016-11-03 18:56:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html

Comment 23 Red Hat Bugzilla 2023-09-14 03:01:39 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
