Bug 1240330
Summary: | fencing adjacent node occurs even if the stonith resource is Stopped | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Renaud Marigny <rmarigny> | |
Component: | pacemaker | Assignee: | Klaus Wenninger <kwenning> | |
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
Severity: | medium | Docs Contact: | ||
Priority: | medium | |||
Version: | 7.1 | CC: | abeekhof, alain.moulle, cluster-maint, fdinitto, ivlnka, kgaillot, mnavrati, rmarigny, sbhat, sbradley | |
Target Milestone: | rc | Flags: | kwenning: needinfo- | |
Target Release: | 7.3 | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | pacemaker-1.1.15-1.2c148ac.git.el7 | Doc Type: | No Doc Update | |
Doc Text: | This was fixed in 7.1.z and 7.2, so there is no change in behavior in 7.3. |
Story Points: | --- | |
Clone Of: | ||||
: | 1301204 | Environment: | |
Last Closed: | 2016-11-03 18:56:02 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1304771 | |||
Bug Blocks: | 1364088 |
Comment 2
David Vossel
2015-07-06 16:15:04 UTC
Disable is supposed to mark it as unusable for fencing. Likewise, a location constraint is supposed to be able to make it unusable on a specific node. It is possible this broke somewhere along the line. Realistically we're not going to get to it until 7.3 though :-(

To clarify the goal here: it is an intentional feature of pacemaker that fencing devices can be used even if in 'Stopped' state. However, if *target-role* is Stopped, then the fencing device should not be used. Separately, if a node has a negative location constraint for a fencing device, that node should not use the device.

Hi

1/ About the 1st remark of David Vossel: I was also a little lost when acting on stonith resources: sometimes it is pcs stonith <action> (e.g. show), sometimes it is pcs resource <action> (e.g. enable/disable), but OK, it is a detail.

2/ About the capacity of pacemaker to fence with a stonith resource even though it is disabled: this was not the case in older releases of pacemaker. In the time of crm, when we put a stonith resource at Stopped with crm resource stop, pacemaker did not try to fence the targeted node of the stonith resource. This was an easy way to avoid fencing an adjacent node at start of pacemaker when the user is aware of their actions and definitely does not want the adjacent node to be fenced (for example when the adjacent node is temporarily used for something else such as maintenance, and the administrator knows for sure that no shared resources are started on the adjacent node).

So I do agree with Andrew writing "Disable is supposed to mark it as unusable for fencing"

Regards
Alain Moullé

Had a look, and the code filtering the list of resources (being a stonith device, target-role!=Stopped, non-negative score on the node, ...) is there. There is also update_cib_cache_cb taking care of relevant dynamic changes in the cib. Tried the straightforward case with the cib created by CTS.
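For readers following along, the filtering criteria mentioned above (resource is a stonith device, target-role is not Stopped, score on the node is not negative) can be sketched roughly as follows. This is only an illustrative model, not pacemaker's actual code: the `Resource` class, its fields, and the `usable_for_fencing` and `device_list` helpers are hypothetical names invented for this sketch, and a ban is assumed to be represented as a negative-infinity score on the banned node.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    """Hypothetical, simplified view of a cluster resource (not pacemaker's data model)."""
    name: str
    is_stonith: bool
    target_role: str = "Started"  # meta attribute; "Stopped" means disabled
    # Assumed representation of location constraints: node name -> score.
    location_scores: Dict[str, float] = field(default_factory=dict)


def usable_for_fencing(res: Resource, node: str) -> bool:
    """Apply the three criteria named in the comment above."""
    if not res.is_stonith:
        return False  # only stonith devices are candidates
    if res.target_role == "Stopped":
        return False  # disabled devices must not be used
    if res.location_scores.get(node, 0) < 0:
        return False  # negative score (e.g. a ban) on this node
    return True


def device_list(resources: List[Resource], node: str) -> List[str]:
    """Names of the devices that the given node may use for fencing."""
    return [r.name for r in resources if usable_for_fencing(r, node)]


if __name__ == "__main__":
    fencing = Resource("Fencing", is_stonith=True)
    print(device_list([fencing], "node2"))

    fencing.target_role = "Stopped"  # models a disabled device
    print(device_list([fencing], "node2"))

    banned = Resource("Fencing", True, location_scores={"node2": float("-inf")})
    print(device_list([banned], "node2"), device_list([banned], "node1"))
```

In this model a disabled device disappears from every node's list, while a banned device disappears only from the list of the node it is banned on, which matches the two goals stated earlier in the thread.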
pcs stonith update Fencing meta target-role="Stopped" leads to the expected behaviour, e.g. Fencing is no longer shown by crm_mon and the device is no longer listed by stonith_admin -L, and we get the expected log from stonithd telling that it got the dynamic change.

    Feb 01 17:53:44 [1960] node2 stonith-ng: ( main.c:638 ) info: cib_device_update: Device Fencing has been disabled

... same status after a pacemaker restart ...

Haven't played with location constraints yet to see how well that works when forcing a stonith device off a node like this.

My test was done with upstream master as of a couple of days ago. So one possibility would be that it is a regression which has meanwhile been fixed again upstream (willingly or not, who knows ...), or that the scenario tested above uses location constraints which make the filtering done by stonithd fail.

Forcing it off via location constraint seems to lead to the desired result as well:

    pcs resource ban Fencing node2

    Feb 02 10:56:30 [22227] node2 stonith-ng: ( commands.c:2556 ) debug: stonith_command: Processing st_device_remove 17 from lrmd.22228 ( 1000)
    Feb 02 10:56:30 [22227] node2 stonith-ng: ( commands.c:1078 ) info: stonith_device_remove: Removed 'Fencing' from the device list (2 active devices)

Verification with stonith_admin is OK as well (also after a pacemaker restart).

On pacemaker-1.1.13-10.el7 (the one delivered with release RHEL7.2) I can reproduce a fencing device forced off via ban staying in the stonithd device list, although crm_mon for instance doesn't show it as running anymore. I see:

    Feb 02 15:07:14 [25409] node2 stonith-ng: ( commands.c:2450 ) debug: stonith_command: Processing st_device_remove 14 from lrmd.25410 ( 1000)
    Feb 02 15:07:14 [25409] node2 stonith-ng: ( commands.c:2464 ) debug: stonith_command: Processed st_device_remove from lrmd.25410: OK (0)

The info line from stonith_device_remove seems to be missing.
Turning off the stonith device using target-role="Stopped", on the other hand, seems to work reliably. It even cleans up the situation after having messed up using ban. Of course, a pacemaker restart also cures the situation of having the ban rule in place while the stonith device is still present.

Do you still have the pacemaker package version with which you experienced the behaviour described?

I gave pacemaker-1.1.12-22.el7 a try (should be the version 7.1 was released with), and with that I could reproduce that disabling a fencing resource doesn't remove it from the device list (stonith_admin -L). With this version, however, shutting down the device via a ban rule seems to work properly, including the device being removed from the device list.

Found from the sosreport that it was a rhel-7.1 without z-stream. That is fine as it matches my research ;-)

Checking the most current pacemaker from z-stream (pacemaker-1.1.12-22.el7_1.4) showed that the behaviour there is rather as with rhel-7.2: forcing the stonith device stopped via disable seems to work, and using a ban rule seems not to.

Test with upstream pacemaker-1.1.4 showed that both test cases (disable & ban) work as expected.

Regarding the initial symptom that disabling a fencing resource (on the fly) still kept the device in the list of stonithd, we can state that it is definitely reproducible with RHEL-7.1 as initially released, but that it is already solved in both RHEL-7.2 and RHEL-7.1.z.

As the target release is 7.3 and we expect a rebase to at least what 6.8 has been rebased to (pacemaker-1.1.14), both issues (stopping a fencing resource using disable & ban) should be solved there automatically.

There have been multiple recent changes in the code determining when stonithd feels responsible for a given cib change on receiving the callback. So I guess these are responsible for the behaviour being back to what is desired.
Unless we were planning a z-stream on a release, it is probably academic to really pin down which of the changes cured which of the 2 cases.

This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Verified with pacemaker-1.1.15-10.el7.x86_64. Both conditions (ban and disable) work as desired.

These messages can be observed when the fencing device is disabled (pcs resource disable) or banned (pcs resource ban <fencing dev> <node>):

    error: Operation reboot of virt-046 by <no-one> for stonith-api.26768: No such device
    stonith_api_kick: Could not kick (reboot) node 2/(null) : No such device (-19)
    kick_helper error -19 nodeid 2

pcs status snip showing disabled fencing device:

    Full list of resources:
    fence-virt-045 (stonith:fence_xvm): Started virt-045
    fence-virt-046 (stonith:fence_xvm): Stopped (disabled)

Nodes are not rebooted upon rejoin, regular fencing works as expected.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days