|Summary:||Updating a fencing device will sometimes result in it no longer being registered|
|Product:||Red Hat Enterprise Linux 7||Reporter:||Andrew Beekhof <abeekhof>|
|Component:||pacemaker||Assignee:||Klaus Wenninger <kwenning>|
|Status:||CLOSED ERRATA||QA Contact:||cluster-qe <cluster-qe>|
|Severity:||urgent||Docs Contact:||Milan Navratil <mnavrati>|
|Version:||7.2||CC:||abeekhof, cfeist, cluster-maint, kgaillot, mjuricek, mnavrati, phagara, royoung, tlavigne|
|Fixed In Version:||pacemaker-1.1.15-1.2c148ac.git.el7||Doc Type:||Bug Fix|
*stonithd* now properly distinguishes attribute removals from device removals. Prior to this update, if a user deleted an attribute from a fence device, Pacemaker's *stonithd* service sometimes mistakenly removed the entire device. Consequently, the cluster would no longer use the fence device. The underlying source code has been modified to fix this bug, and *stonithd* now properly distinguishes attribute removals from device removals. As a result, deleting a fence device attribute no longer removes the device itself.
|:||1299341||Environment:|
|Last Closed:||2016-11-03 18:57:04 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
|Bug Depends On:||1304771|
Description Andrew Beekhof 2015-12-01 23:20:47 UTC
Description of problem:
Running
pcs stonith update fence-nova foo=bar
results in:

Nov 30 22:18:35  overcloud-controller-2.localdomain stonith-ng: info: stonith_device_remove: Removed 'fence-nova' from the device list (7 active devices)

[root@overcloud-controller-2 heat-admin]# stonith_admin -L
ipmilan-overcloud-novacompute-0
ipmilan-overcloud-novacompute-1
ipmilan-overcloud-novacompute-2
stonith-overcloud-controller-0
stonith-overcloud-controller-1
stonith-overcloud-controller-2
ipmilan-overcloud-novacompute-3
7 devices found

Version-Release number of selected component (if applicable):
pacemaker-1.1.13-10.el7.x86_64

How reproducible:
still unclear

Steps to Reproduce:
1. Run: pcs stonith update [...]

Actual results:
Device missing

Expected results:
Device still present

Additional info:
TBA
Comment 1 Andrew Beekhof 2015-12-01 23:33:14 UTC
Aha!

[root@overcloud-controller-0 heat-admin]# pcs stonith update fence-nova verbose=false
Dec 01 18:30:15  overcloud-controller-0.localdomain stonith-ng: info: update_cib_stonith_devices_v2: Updating device list from the cib: modify nvpair[@id='fence-nova-instance_attributes-verbose']
Dec 01 18:30:15  overcloud-controller-0.localdomain stonith-ng: info: cib_devices_update: Updating devices to version 0.328.0
[root@overcloud-controller-0 heat-admin]#
Dec 01 18:30:15  overcloud-controller-0.localdomain stonith-ng: info: build_device_from_xml: The fencing device 'fence-nova' requires unfencing
Dec 01 18:30:15  overcloud-controller-0.localdomain stonith-ng: info: build_device_from_xml: The fencing device 'fence-nova' requires actions (on) to be executed on the target node
Dec 01 18:30:15  overcloud-controller-0.localdomain stonith-ng: notice: stonith_device_register: Added 'fence-nova' to the device list (8 active devices)

However:

[root@overcloud-controller-0 heat-admin]# pcs stonith update fence-nova verbose=
Dec 01 18:30:51  overcloud-controller-0.localdomain stonith-ng: info: stonith_device_remove: Removed 'fence-nova' from the device list (7 active devices)

So the removal of an attribute is (incorrectly) being handled as a removal of the whole device.

Perhaps a good one for Klaus or Jan
Comment 2 Ken Gaillot 2015-12-02 17:46:10 UTC
Confirmed as a bug: update_cib_stonith_devices_v2() treats device attribute additions/removals as device additions/removals. It (accidentally) works for a=b because that's treated as a removal followed by a re-addition. And a= does not consistently fail, because if there are other changes in the CIB diff, the device will likely get re-added correctly.
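The misclassification described in this comment can be illustrated with a small model. This is only a sketch in Python, not the stonithd C code: the two functions and the simplified CIB paths are hypothetical, and only the nvpair id format is taken from the log in comment 1.

```python
import re

# Hypothetical, simplified CIB paths; the real diff format carries more detail.
DEVICE_PATH = "/cib/configuration/resources/primitive[@id='fence-nova']"
ATTR_PATH = (
    DEVICE_PATH
    + "/instance_attributes[@id='fence-nova-instance_attributes']"
    + "/nvpair[@id='fence-nova-instance_attributes-verbose']"
)

PRIMITIVE_ID = re.compile(r"primitive\[@id='([^']+)'\]")


def buggy_removed_device(deleted_path):
    """Model of the bug: any deletion below a primitive counts as a device removal."""
    match = PRIMITIVE_ID.search(deleted_path)
    return match.group(1) if match else None


def intended_removed_device(deleted_path):
    """Intended behaviour: only deleting the primitive itself removes the device."""
    match = PRIMITIVE_ID.search(deleted_path)
    if match and deleted_path.endswith(match.group(0)):
        return match.group(1)
    return None
```

In this model, deleting the `verbose` nvpair makes `buggy_removed_device(ATTR_PATH)` report `fence-nova` as removed, while `intended_removed_device(ATTR_PATH)` reports no device removal.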
Comment 3 Ken Gaillot 2015-12-02 17:56:52 UTC
The workaround until this is fixed is to delete the device and re-add it with the desired attributes, instead of unsetting a device attribute. If someone has already lost a device due to this bug, they can work around it by re-adding the device with the desired attributes.
Comment 4 Klaus Wenninger 2015-12-08 13:29:34 UTC
The cheapest option (implementation-wise) would probably be to do a cib_devices_update on any removal, but that makes deleting devices costly at runtime. The most efficient option at runtime would probably be to implement deletion of just an attribute without removing the device. The best tradeoff between implementation effort, retest effort and runtime cost is probably to keep deleting the device when just an attribute is deleted, as is done now, but to trigger cib_devices_update whenever the removal was caused by the deletion of an attribute.
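The tradeoff option from this comment can be sketched with a toy device registry. This is an illustrative Python model under stated assumptions, not the actual patch: `DeviceRegistry`, `on_removal` and `full_update` are hypothetical names, with `full_update` standing in for cib_devices_update.

```python
class DeviceRegistry:
    """Toy stand-in for stonithd's device list (hypothetical model)."""

    def __init__(self, cib_devices):
        # cib_devices: dict mapping device id -> dict of attributes (the "CIB").
        self.cib = cib_devices
        self.active = {dev: dict(attrs) for dev, attrs in cib_devices.items()}
        self.full_updates = 0

    def full_update(self):
        """Stand-in for cib_devices_update: rebuild the device list from the CIB."""
        self.active = {dev: dict(attrs) for dev, attrs in self.cib.items()}
        self.full_updates += 1

    def on_removal(self, device_id, attribute=None):
        """Handle a removal reported in a CIB diff.

        As before, the device is dropped on any removal. If the removal
        concerned only an attribute, a full update re-registers the device
        with its remaining attributes.
        """
        self.active.pop(device_id, None)
        if attribute is not None:
            self.full_update()
```

A real device deletion stays cheap in this model because it never triggers the full rebuild; only attribute deletions pay that cost.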
Comment 5 Klaus Wenninger 2015-12-08 14:01:39 UTC
The fix outlined in the previous comment seems to solve the problem. https://github.com/wenningerk/pacemaker/commit/5db518f9e35bf60e83af899e281c527e05336ad7
Comment 6 Klaus Wenninger 2015-12-09 14:17:22 UTC
If we are doing a cib_devices_update anyway, we can just as well skip stonith_device_remove, along with all the parsing for the resource name and the subsequent looping through the XML. https://github.com/wenningerk/pacemaker/commit/98e69e033835b3d4dfdc8c9cabacae28770725f1
Comment 7 Klaus Wenninger 2015-12-09 16:16:37 UTC
Comment 9 Klaus Wenninger 2016-01-20 18:10:08 UTC
I had it nearly 100% reproducible with any stonith-resource, like this:

pcs stonith update foo-stonith-resource foo=bla
pcs stonith update foo-stonith-resource foo= && stonith_admin -L

To reproduce it with a nearly 100% chance, you have to keep the gap between deleting the attribute and querying the list as short as possible, because anything that triggers an update of the devices in between hides the misbehaviour. And you have to be on the node where the stonith-resource is actually running, of course.
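When scripting this reproducer, the two `stonith_admin -L` listings (before and after the attribute deletion) can be compared mechanically. The following is a minimal Python sketch that assumes the output format shown in this report (one device id per line, followed by an "N devices found" line); `parse_device_list` and `device_lost` are hypothetical helper names.

```python
import re

# Matches the trailing summary line, e.g. "7 devices found".
SUMMARY_LINE = re.compile(r"^\d+ devices? found$")


def parse_device_list(output):
    """Return the device ids from `stonith_admin -L` text, skipping the summary line."""
    devices = []
    for line in output.splitlines():
        line = line.strip()
        if not line or SUMMARY_LINE.match(line):
            continue
        devices.append(line)
    return devices


def device_lost(before, after, device_id):
    """True if device_id was listed before the update but is missing afterwards."""
    return (device_id in parse_device_list(before)
            and device_id not in parse_device_list(after))
```

Because any intervening device update hides the problem, the "after" listing should be captured immediately after the attribute deletion, as in the chained command above.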
Comment 10 Mike McCune 2016-03-28 22:54:17 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see email@example.com with any questions
Comment 12 Patrik Hagara 2016-09-06 14:50:06 UTC
Reproducer used: https://bugzilla.redhat.com/show_bug.cgi?id=1287315#c1

Before the fix:
> [root@virt-166 ~]# rpm -q pacemaker
> pacemaker-1.1.13-10.el7.x86_64
> [root@virt-166 ~]# stonith_admin -L
> fence-virt-166
> fence-virt-167
> fence-virt-168
> 3 devices found
> [root@virt-166 ~]# pcs stonith update fence-virt-167 delay=
> [root@virt-166 ~]# stonith_admin -L
> fence-virt-166
> fence-virt-168
> 2 devices found

After the fix:
> [root@virt-166 ~]# rpm -q pacemaker
> pacemaker-1.1.15-1.2c148ac.git.el7.x86_64
> [root@virt-166 ~]# stonith_admin -L
> fence-virt-166
> fence-virt-167
> fence-virt-168
> 3 devices found
> [root@virt-166 ~]# pcs stonith update fence-virt-167 delay=
> [root@virt-166 ~]# stonith_admin -L
> fence-virt-166
> fence-virt-167
> fence-virt-168
> 3 devices found

Marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7 -- removal of stonith device attribute no longer removes the whole fence device.
Comment 14 errata-xmlrpc 2016-11-03 18:57:04 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2578.html