Bug 1287315 - Updating a fencing device will sometimes result in it no longer being registered
Summary: Updating a fencing device will sometimes result in it no longer being registered
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.3
Assignee: Klaus Wenninger
QA Contact: cluster-qe@redhat.com
Docs Contact: Milan Navratil
URL:
Whiteboard:
Keywords: ZStream
Depends On: 1304771
Blocks: 1299341
Reported: 2015-12-01 23:20 UTC by Andrew Beekhof
Modified: 2016-11-03 18:57 UTC (History)

*stonithd* now properly distinguishes attribute removals from device removals.

Prior to this update, if a user deleted an attribute from a fence device, Pacemaker's *stonithd* service sometimes mistakenly removed the entire device. Consequently, the cluster would no longer use the fence device. The underlying source code has been modified to fix this bug, and *stonithd* now properly distinguishes attribute removals from device removals. As a result, deleting a fence device attribute no longer removes the device itself.
Clone Of:
Clones: 1299341
Last Closed: 2016-11-03 18:57:04 UTC


External Trackers
Tracker ID: Red Hat Product Errata RHSA-2016:2578
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: pacemaker security, bug fix, and enhancement update
Last Updated: 2016-11-03 12:07:24 UTC

Description Andrew Beekhof 2015-12-01 23:20:47 UTC
Description of problem:

Running pcs stonith update fence-nova foo=bar results in:

Nov 30 22:18:35 [3102] overcloud-controller-2.localdomain stonith-ng:     info: stonith_device_remove:    Removed 'fence-nova' from the device list (7 active devices)

[root@overcloud-controller-2 heat-admin]# stonith_admin -L
 ipmilan-overcloud-novacompute-0
 ipmilan-overcloud-novacompute-1
 ipmilan-overcloud-novacompute-2
 stonith-overcloud-controller-0
 stonith-overcloud-controller-1
 stonith-overcloud-controller-2
 ipmilan-overcloud-novacompute-3
7 devices found


Version-Release number of selected component (if applicable):

pacemaker-1.1.13-10.el7.x86_64

How reproducible:

still unclear
Steps to Reproduce:
1.  Run: pcs stonith update [...]

Actual results:

Device missing

Expected results:

Device still present

Additional info:

TBA

Comment 1 Andrew Beekhof 2015-12-01 23:33:14 UTC
Aha!

[root@overcloud-controller-0 heat-admin]# pcs stonith update fence-nova verbose=false

Dec 01 18:30:15 [3049] overcloud-controller-0.localdomain stonith-ng:     info: update_cib_stonith_devices_v2:	Updating device list from the cib: modify nvpair[@id='fence-nova-instance_attributes-verbose']
Dec 01 18:30:15 [3049] overcloud-controller-0.localdomain stonith-ng:     info: cib_devices_update:	Updating devices to version 0.328.0
[root@overcloud-controller-0 heat-admin]# Dec 01 18:30:15 [3049] overcloud-controller-0.localdomain stonith-ng:     info: build_device_from_xml:	The fencing device 'fence-nova' requires unfencing
Dec 01 18:30:15 [3049] overcloud-controller-0.localdomain stonith-ng:     info: build_device_from_xml:	The fencing device 'fence-nova' requires actions (on) to be executed on the target node
Dec 01 18:30:15 [3049] overcloud-controller-0.localdomain stonith-ng:   notice: stonith_device_register:	Added 'fence-nova' to the device list (8 active devices)


However:

[root@overcloud-controller-0 heat-admin]# pcs stonith update fence-nova verbose=
Dec 01 18:30:51 [3049] overcloud-controller-0.localdomain stonith-ng:     info: stonith_device_remove:	Removed 'fence-nova' from the device list (7 active devices)


So the removal of an attribute is (incorrectly) being handled as a removal of the whole device.


Perhaps a good one for Klaus or Jan

Comment 2 Ken Gaillot 2015-12-02 17:46:10 UTC
Confirmed as a bug: update_cib_stonith_devices_v2() treats device attribute additions/removals as device additions/removals. It (accidentally) works for a=b because that's treated as a removal followed by a re-addition. And a= does not consistently fail, because if there are other changes in the CIB diff, the device will likely get re-added correctly.
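
As a rough illustration of the behaviour described above, here is a toy Python model. It is not pacemaker's C code: the function name, the set-based device list, and the exact xpath layout are assumptions made for the sketch. It only shows why a lone attribute deletion loses the device while the same deletion accompanied by another change in the diff does not.

```python
import re

# Matches the primitive id inside a CIB xpath (the xpath layout is assumed).
PRIMITIVE_RE = re.compile(r"primitive\[@id='([^']+)'\]")


def process_diff_buggy(active, cib_devices, changes):
    """Toy model of the misbehaviour (hypothetical helper, not a pacemaker API).

    active      -- ids of the devices currently registered
    cib_devices -- ids of the fence primitives present in the CIB after the diff
    changes     -- list of (op, xpath) pairs from the CIB diff
    """
    active = set(active)
    for op, xpath in changes:
        match = PRIMITIVE_RE.search(xpath)
        if match is None:
            continue
        if op == "delete":
            # Reported misbehaviour: deleting anything below the primitive,
            # even a single nvpair, unregisters the whole device.
            active.discard(match.group(1))
        else:
            # Any other change rebuilds the list from the CIB
            # (the role cib_devices_update plays in the logs above).
            active = set(cib_devices)
    return active


# Illustrative xpaths; the ids follow the log line in comment 1.
NVPAIR = (
    "/cib/configuration/resources/primitive[@id='fence-nova']"
    "/instance_attributes[@id='fence-nova-instance_attributes']"
    "/nvpair[@id='fence-nova-instance_attributes-verbose']"
)
OTHER = (
    "/cib/configuration/resources/primitive[@id='fence-nova']"
    "/meta_attributes[@id='fence-nova-meta_attributes']"
)

# verbose= alone: the device vanishes although the primitive is still in the CIB.
lost = process_diff_buggy({"fence-nova"}, {"fence-nova"}, [("delete", NVPAIR)])

# Same deletion plus another change in the same diff: the rebuild hides the bug.
kept = process_diff_buggy(
    {"fence-nova"}, {"fence-nova"}, [("delete", NVPAIR), ("create", OTHER)]
)
```

In this model `lost` ends up empty and `kept` still contains `fence-nova`, which matches the intermittent failure pattern described in this comment.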

Comment 3 Ken Gaillot 2015-12-02 17:56:52 UTC
The workaround until this is fixed is to delete the device and re-add it with the desired attributes, instead of unsetting a device attribute. If someone has already lost a device due to this bug, they can work around it by re-adding the device with the desired attributes.

Comment 4 Klaus Wenninger 2015-12-08 13:29:34 UTC
The cheapest option in terms of implementation would probably be to run cib_devices_update on any removal, but that makes deleting devices costly at runtime.
The most efficient option at runtime would probably be to implement deletion of just the attribute, leaving the device in place.
The best tradeoff between implementation effort, retest effort, and runtime cost is probably to keep deleting the device when only an attribute is deleted, as is done now, but to trigger cib_devices_update whenever the deletion was caused by the deletion of an attribute.
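
The third option can be sketched as a toy Python model. This is only an illustration under stated assumptions, not the actual patch: the function name, the set-based device list, and the xpath layout are hypothetical. A delete still drops the device, but if the deleted node sits below the primitive, a rebuild from the CIB is scheduled.

```python
import re

# Captures the primitive id and whatever follows it in the xpath (layout assumed).
PRIMITIVE_RE = re.compile(r"primitive\[@id='([^']+)'\](.*)")


def process_diff_tradeoff(active, cib_devices, changes):
    """Toy model of the proposed tradeoff (hypothetical helper).

    active      -- ids of the devices currently registered
    cib_devices -- ids of the fence primitives present in the CIB after the diff
    changes     -- list of (op, xpath) pairs from the CIB diff
    """
    active = set(active)
    needs_update = False
    for op, xpath in changes:
        match = PRIMITIVE_RE.search(xpath)
        if match is None:
            continue
        rsc_id, below_primitive = match.groups()
        if op == "delete":
            active.discard(rsc_id)       # unchanged: drop the device first
            if below_primitive:          # only an attribute (or similar) was deleted
                needs_update = True      # so re-read the devices afterwards
        else:
            needs_update = True
    if needs_update:
        active = set(cib_devices)        # stands in for cib_devices_update
    return active


# Illustrative xpaths for an attribute deletion and a whole-device deletion.
PRIMITIVE = "/cib/configuration/resources/primitive[@id='fence-nova']"
NVPAIR = (
    PRIMITIVE
    + "/instance_attributes[@id='fence-nova-instance_attributes']"
    + "/nvpair[@id='fence-nova-instance_attributes-verbose']"
)

after_attr_delete = process_diff_tradeoff(
    {"fence-nova"}, {"fence-nova"}, [("delete", NVPAIR)]
)
after_device_delete = process_diff_tradeoff(
    {"fence-nova"}, set(), [("delete", PRIMITIVE)]
)
```

Deleting a whole device stays cheap in this model because no rebuild is triggered; only attribute deletions pay for the extra refresh.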

Comment 5 Klaus Wenninger 2015-12-08 14:01:39 UTC
The fix outlined in the previous comment seems to resolve the problem.
https://github.com/wenningerk/pacemaker/commit/5db518f9e35bf60e83af899e281c527e05336ad7

Comment 6 Klaus Wenninger 2015-12-09 14:17:22 UTC
If we are doing cib_devices_update anyway, we might as well skip stonith_device_remove, along with all the parsing for the resource name and the subsequent looping through the XML.

https://github.com/wenningerk/pacemaker/commit/98e69e033835b3d4dfdc8c9cabacae28770725f1
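
A toy Python model of this simplification might look as follows. It is a sketch under assumptions, not a rendering of the linked commit: the function name, the set-based device list, and the xpath substring check are all hypothetical.

```python
def process_diff_simplified(active, cib_devices, changes):
    """Toy model: any change touching a primitive triggers a full rebuild.

    No resource id is parsed and no device is removed individually.
    active      -- ids of the devices currently registered
    cib_devices -- ids of the fence primitives present in the CIB after the diff
    changes     -- list of (op, xpath) pairs from the CIB diff
    """
    touches_primitive = any("/primitive[" in xpath for _op, xpath in changes)
    if touches_primitive:
        return set(cib_devices)   # stands in for cib_devices_update
    return set(active)


# Illustrative xpaths (layout assumed).
PRIMITIVE = "/cib/configuration/resources/primitive[@id='fence-nova']"
NVPAIR = PRIMITIVE + "/instance_attributes[@id='ia']/nvpair[@id='ia-verbose']"

# Attribute deleted, primitive still in the CIB: the device stays registered.
after_attr_delete = process_diff_simplified(
    {"fence-nova"}, {"fence-nova"}, [("delete", NVPAIR)]
)
# Primitive deleted from the CIB: the rebuild drops it.
after_device_delete = process_diff_simplified(
    {"fence-nova"}, set(), [("delete", PRIMITIVE)]
)
```

The cost is a rebuild on every relevant change, in exchange for less code to maintain and retest.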

Comment 9 Klaus Wenninger 2016-01-20 18:10:08 UTC
I could reproduce it nearly 100% of the time with any stonith resource, for example:

pcs stonith update foo-stonith-resource foo=bla

pcs stonith update foo-stonith-resource foo= && stonith_admin -L

To reproduce it with a nearly 100% chance, you have to keep the gap
between deleting the attribute and querying the list as short as
possible, because anything that triggers an update of the devices in
between hides the misbehaviour.
You also have to be on the node where the stonith resource is actually
running, of course.
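
The back-to-back check can be scripted. The following Python sketch is a hypothetical helper, not part of pcs or pacemaker: it only wraps the two commands shown above, and it assumes the `stonith_admin -L` output format seen in this report (one device per line, followed by an "N devices found" footer).

```python
import subprocess


def parse_device_list(output):
    """Return the device ids from `stonith_admin -L` output.

    Assumes one device per line and an 'N devices found' footer,
    as in the listings in this report.
    """
    devices = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.endswith(" found"):
            continue
        devices.append(line)
    return devices


def device_survives_unset(device, attribute):
    """Unset `attribute` on `device`, list the devices right away,
    and report whether the device is still registered."""
    subprocess.run(["pcs", "stonith", "update", device, f"{attribute}="], check=True)
    # Exit-status conventions of stonith_admin are not assumed here,
    # so the listing is parsed regardless of the return code.
    listing = subprocess.run(
        ["stonith_admin", "-L"], capture_output=True, text=True
    )
    return device in parse_device_list(listing.stdout)
```

A call such as `device_survives_unset("foo-stonith-resource", "foo")` would be run as root on the node where the resource is running, after first setting the attribute with `pcs stonith update foo-stonith-resource foo=bla`.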

Comment 10 Mike McCune 2016-03-28 22:54:17 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please contact mmccune@redhat.com with any questions.

Comment 12 Patrik Hagara 2016-09-06 14:50:06 UTC
Reproducer used: https://bugzilla.redhat.com/show_bug.cgi?id=1287315#c1

Before the fix:

> [root@virt-166 ~]# rpm -q pacemaker
> pacemaker-1.1.13-10.el7.x86_64
> [root@virt-166 ~]# stonith_admin -L
>  fence-virt-166
>  fence-virt-167
>  fence-virt-168
> 3 devices found
> [root@virt-166 ~]# pcs stonith update fence-virt-167 delay=
> [root@virt-166 ~]# stonith_admin -L
>  fence-virt-166
>  fence-virt-168
> 2 devices found


After the fix:

> [root@virt-166 ~]# rpm -q pacemaker
> pacemaker-1.1.15-1.2c148ac.git.el7.x86_64
> [root@virt-166 ~]# stonith_admin -L
>  fence-virt-166
>  fence-virt-167
>  fence-virt-168
> 3 devices found
> [root@virt-166 ~]# pcs stonith update fence-virt-167 delay=
> [root@virt-166 ~]# stonith_admin -L
>  fence-virt-166
>  fence-virt-167
>  fence-virt-168
> 3 devices found


Marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7 -- removal of stonith device attribute no longer removes the whole fence device.

Comment 14 errata-xmlrpc 2016-11-03 18:57:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html

