Bug 1287315 - Updating a fencing device will sometimes result in it no longer being registered
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.3
Assigned To: Klaus Wenninger
QA Contact: cluster-qe@redhat.com
Docs Contact: Milan Navratil
Keywords: ZStream
Depends On: 1304771
Blocks: 1299341
 
Reported: 2015-12-01 18:20 EST by Andrew Beekhof
Modified: 2016-11-03 14:57 EDT

See Also:
Fixed In Version: pacemaker-1.1.15-1.2c148ac.git.el7
Doc Type: Bug Fix
Doc Text:
*stonithd* now properly distinguishes attribute removals from device removals. Prior to this update, if a user deleted an attribute from a fence device, Pacemaker's *stonithd* service sometimes mistakenly removed the entire device. Consequently, the cluster would no longer use the fence device. The underlying source code has been modified to fix this bug, and *stonithd* now properly distinguishes attribute removals from device removals. As a result, deleting a fence device attribute no longer removes the device itself.
Clone Of:
: 1299341
Last Closed: 2016-11-03 14:57:04 EDT
Type: Bug

Description Andrew Beekhof 2015-12-01 18:20:47 EST
Description of problem:

Running pcs stonith update fence-nova foo=bar results in:

Nov 30 22:18:35 [3102] overcloud-controller-2.localdomain stonith-ng:     info: stonith_device_remove:    Removed 'fence-nova' from the device list (7 active devices)

[root@overcloud-controller-2 heat-admin]# stonith_admin -L
 ipmilan-overcloud-novacompute-0
 ipmilan-overcloud-novacompute-1
 ipmilan-overcloud-novacompute-2
 stonith-overcloud-controller-0
 stonith-overcloud-controller-1
 stonith-overcloud-controller-2
 ipmilan-overcloud-novacompute-3
7 devices found


Version-Release number of selected component (if applicable):

pacemaker-1.1.13-10.el7.x86_64

How reproducible:

still unclear
Steps to Reproduce:
1.  Run: pcs stonith update [...]

Actual results:

Device missing

Expected results:

Device still present

Additional info:

TBA
Comment 1 Andrew Beekhof 2015-12-01 18:33:14 EST
Aha!

[root@overcloud-controller-0 heat-admin]# pcs stonith update fence-nova verbose=false

Dec 01 18:30:15 [3049] overcloud-controller-0.localdomain stonith-ng:     info: update_cib_stonith_devices_v2:	Updating device list from the cib: modify nvpair[@id='fence-nova-instance_attributes-verbose']
Dec 01 18:30:15 [3049] overcloud-controller-0.localdomain stonith-ng:     info: cib_devices_update:	Updating devices to version 0.328.0
Dec 01 18:30:15 [3049] overcloud-controller-0.localdomain stonith-ng:     info: build_device_from_xml:	The fencing device 'fence-nova' requires unfencing
Dec 01 18:30:15 [3049] overcloud-controller-0.localdomain stonith-ng:     info: build_device_from_xml:	The fencing device 'fence-nova' requires actions (on) to be executed on the target node
Dec 01 18:30:15 [3049] overcloud-controller-0.localdomain stonith-ng:   notice: stonith_device_register:	Added 'fence-nova' to the device list (8 active devices)


However:

[root@overcloud-controller-0 heat-admin]# pcs stonith update fence-nova verbose=
Dec 01 18:30:51 [3049] overcloud-controller-0.localdomain stonith-ng:     info: stonith_device_remove:	Removed 'fence-nova' from the device list (7 active devices)


So the removal of an attribute is (incorrectly) being handled as a removal of the whole device.


Perhaps a good one for Klaus or Jan
Comment 2 Ken Gaillot 2015-12-02 12:46:10 EST
Confirmed as a bug: update_cib_stonith_devices_v2() treats device attribute additions/removals as device additions/removals. It (accidentally) works for a=b because that's treated as a removal followed by a re-addition. And a= does not consistently fail, because if there are other changes in the CIB diff, the device will likely get re-added correctly.
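
To make the three cases above concrete (a=b works, a= fails, a= plus other changes works), here is a minimal toy model in Python. It is not Pacemaker's code: the function names, the XPath format, and the sets of device ids are all assumptions made for illustration. It only mirrors the described behavior, where any change below a device is handled as a change to the device itself.

```python
import re

# Hypothetical path format for this sketch; real CIB diffs look different.
_DEVICE_ID = re.compile(r"primitive\[@id='([^']+)'\]")


def device_of(xpath):
    """Return the id of the fence device that an XPath points into, or None."""
    match = _DEVICE_ID.search(xpath)
    return match.group(1) if match else None


def apply_diff_buggy(registered, configured, changes):
    """Apply (op, xpath) changes the way the bug report describes.

    registered: set of device ids currently in the fencer's device list
    configured: set of device ids still present in the configuration
    """
    for op, xpath in changes:
        dev = device_of(xpath)
        if dev is None:
            continue
        # Any change below a device is treated as a change to the device itself.
        registered.discard(dev)
        if op != "delete" and dev in configured:
            # Create/modify: the removal is followed by a re-addition.
            registered.add(dev)
        # Delete: nothing re-adds the device, even if only an attribute went away.
    return registered
```

In this model a diff containing only the attribute deletion drops the device, while a diff that also carries a later create or modify for the same device puts it back, which matches the "does not consistently fail" observation.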
Comment 3 Ken Gaillot 2015-12-02 12:56:52 EST
The workaround until this is fixed is to delete the device and re-add it with the desired attributes, instead of unsetting a device attribute. If someone has already lost a device due to this bug, they can work around it by re-adding the device with the desired attributes.
Comment 4 Klaus Wenninger 2015-12-08 08:29:34 EST
The cheapest option, implementation-wise, would probably be to run cib_devices_update on any removal, but that makes deleting devices costly at runtime.
The most efficient option at runtime would probably be to implement deletion of just the attribute, leaving the device in place.
The best tradeoff between implementation effort, retest effort, and runtime cost is probably to keep deleting the device when only an attribute is deleted, as we do now, but to trigger cib_devices_update when the deletion was caused by the removal of an attribute.
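
The third option can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not the actual patch: `rebuild_from_config` stands in for the role cib_devices_update plays here, and the XPath format, the `nvpair[` check, and all function names are hypothetical.

```python
import re

# Hypothetical path format for this sketch; real CIB diffs look different.
_DEVICE_ID = re.compile(r"primitive\[@id='([^']+)'\]")


def device_of(xpath):
    """Return the id of the fence device that an XPath points into, or None."""
    match = _DEVICE_ID.search(xpath)
    return match.group(1) if match else None


def is_attribute_path(xpath):
    """Assumed heuristic: the path targets an nvpair rather than the device."""
    return "nvpair[" in xpath


def rebuild_from_config(registered, configured):
    """Stand-in for a full device list rebuild from the configuration."""
    registered.clear()
    registered.update(configured)


def apply_diff_with_refresh(registered, configured, changes):
    """Keep the existing removal, but refresh after attribute deletions."""
    needs_refresh = False
    for op, xpath in changes:
        dev = device_of(xpath)
        if dev is None:
            continue
        registered.discard(dev)  # same removal as before
        if op == "delete":
            if is_attribute_path(xpath):
                needs_refresh = True  # the device itself is still configured
        elif dev in configured:
            registered.add(dev)
    if needs_refresh:
        rebuild_from_config(registered, configured)
    return registered
```

The full rebuild only runs when an attribute was deleted, so plain device deletions stay as cheap as they were.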
Comment 5 Klaus Wenninger 2015-12-08 09:01:39 EST
The fix outlined in the previous comment seems to resolve the problem.
https://github.com/wenningerk/pacemaker/commit/5db518f9e35bf60e83af899e281c527e05336ad7
Comment 6 Klaus Wenninger 2015-12-09 09:17:22 EST
If we are running cib_devices_update anyway, we can also skip stonith_device_remove, along with all the parsing for the resource name and the subsequent loop through the XML.

https://github.com/wenningerk/pacemaker/commit/98e69e033835b3d4dfdc8c9cabacae28770725f1
Comment 9 Klaus Wenninger 2016-01-20 13:10:08 EST
I could reproduce it nearly 100% of the time with any stonith resource, for example:

pcs stonith update foo-stonith-resource foo=bla

pcs stonith update foo-stonith-resource foo= && stonith_admin -L

To reproduce it that reliably, keep the gap between deleting the attribute and querying the list as short as possible, because anything that triggers an update of the devices in between hides the misbehaviour.
You also have to be on the node where the stonith resource is actually running, of course.
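
For repeated runs, the two commands above could be wrapped in a small Python helper. This is a sketch under assumptions: it needs Python 3.7 or later and `pcs` and `stonith_admin` on the PATH of the node running the resource, and the helper names are made up here. The parser assumes the listing looks like the output shown in this report, with one device name per line followed by an "N devices found" summary.

```python
import re
import subprocess


def parse_device_list(output):
    """Extract device names from `stonith_admin -L` output."""
    names = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the trailing "N devices found" summary.
        if not line or re.fullmatch(r"\d+ devices? found", line):
            continue
        names.append(line)
    return names


def unset_and_list(device, attribute):
    """Unset one attribute, then list registered devices right away."""
    subprocess.run(["pcs", "stonith", "update", device, attribute + "="], check=True)
    listing = subprocess.run(
        ["stonith_admin", "-L"], capture_output=True, text=True
    )
    return parse_device_list(listing.stdout)


# Example (only meaningful on a cluster node):
#   "foo-stonith-resource" in unset_and_list("foo-stonith-resource", "foo")
# A False result right after the update would indicate the bug.
```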
Comment 10 Mike McCune 2016-03-28 18:54:17 EDT
This bug was accidentally moved from POST to MODIFIED by an error in automation. Please contact mmccune@redhat.com with any questions.
Comment 12 Patrik Hagara 2016-09-06 10:50:06 EDT
Reproducer used: https://bugzilla.redhat.com/show_bug.cgi?id=1287315#c1

Before the fix:

> [root@virt-166 ~]# rpm -q pacemaker
> pacemaker-1.1.13-10.el7.x86_64
> [root@virt-166 ~]# stonith_admin -L
>  fence-virt-166
>  fence-virt-167
>  fence-virt-168
> 3 devices found
> [root@virt-166 ~]# pcs stonith update fence-virt-167 delay=
> [root@virt-166 ~]# stonith_admin -L
>  fence-virt-166
>  fence-virt-168
> 2 devices found


After the fix:

> [root@virt-166 ~]# rpm -q pacemaker
> pacemaker-1.1.15-1.2c148ac.git.el7.x86_64
> [root@virt-166 ~]# stonith_admin -L
>  fence-virt-166
>  fence-virt-167
>  fence-virt-168
> 3 devices found
> [root@virt-166 ~]# pcs stonith update fence-virt-167 delay=
> [root@virt-166 ~]# stonith_admin -L
>  fence-virt-166
>  fence-virt-167
>  fence-virt-168
> 3 devices found


Marking as verified in pacemaker-1.1.15-1.2c148ac.git.el7 -- removal of stonith device attribute no longer removes the whole fence device.
Comment 14 errata-xmlrpc 2016-11-03 14:57:04 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html
