Bug 1329472
Summary: Cannot recreate remote node resource
Product: Red Hat Enterprise Linux 7
Component: pcs
Version: 7.2
Status: CLOSED ERRATA
Severity: medium
Priority: low
Reporter: Andrew Beekhof <abeekhof>
Assignee: Tomas Jelinek <tojeline>
QA Contact: cluster-qe <cluster-qe>
CC: abeekhof, cfeist, cluster-maint, idevat, omular, rmarigny, rsteiger, tojeline
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pcs-0.9.152-5.el7
Doc Type: Bug Fix
Doc Text:
Cause: User removes a remote node from a cluster.
Consequence: Pcs does not tell pacemaker the node is permanently gone and should be removed from pacemaker's internal structures. Pcs then refuses to create a resource or a remote node with the same name, saying it already exists.
Fix: Tell pacemaker the node was removed from the cluster.
Result: It is possible to recreate the resource / remote node.
Last Closed: 2016-11-03 20:58:40 UTC
Type: Bug
Bug Depends On: 1303136
Description
Andrew Beekhof
2016-04-22 03:07:34 UTC
Weirder... it seems to be happening at the cib level:
[root@overcloud-controller-0 heat-admin]# cibadmin --create -o resources --xml-text '<primitive class="ocf" id="overcloud-novacompute-2.localdomain" provider="pacemaker" type="remote">
> <instance_attributes id="overcloud-novacompute-2-instance_attributes">
> <nvpair id="overcloud-novacompute-2-instance_attributes-reconnect_interval" name="reconnect_interval" value="240"/>
> </instance_attributes>
> <operations>
> <op id="overcloud-novacompute-2-start-interval-0s" interval="0s" name="start" timeout="60"/>
> <op id="overcloud-novacompute-2-stop-interval-0s" interval="0s" name="stop" timeout="60"/>
> <op id="overcloud-novacompute-2-monitor-interval-20" interval="20" name="monitor"/>
> </operations>
> </primitive>
> '
Call cib_create failed (-76): Name not unique on network
<failed>
<failed_update object_type="primitive" operation="cib_create" reason="Name not unique on network">
<primitive/>
</failed_update>
</failed>
[root@overcloud-controller-0 heat-admin]# cibadmin -Ql | grep -C 0 overcloud-novacompute-2
<node type="remote" id="overcloud-novacompute-2" uname="overcloud-novacompute-2">
<instance_attributes id="nodes-overcloud-novacompute-2">
<nvpair id="nodes-overcloud-novacompute-2-osprole" name="osprole" value="compute"/>
--
<node_state remote_node="true" id="overcloud-novacompute-2" uname="overcloud-novacompute-2" crm-debug-origin="do_update_resource" node_fenced="0">
<transient_attributes id="overcloud-novacompute-2">
<instance_attributes id="status-overcloud-novacompute-2"/>
--
<lrm id="overcloud-novacompute-2">
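The grep output above shows the root cause: the node's configuration entries are gone, but stale node_state/lrm entries survive in the status section, so an id-uniqueness check that scans the whole CIB reports a collision. A minimal sketch of that situation, using a hypothetical trimmed-down CIB and only the Python standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down CIB reproducing the state above: the remote
# node's resource is gone from the configuration, but a stale node_state
# entry lingers in the status section.
CIB = """
<cib>
  <configuration>
    <resources/>
    <nodes/>
  </configuration>
  <status>
    <node_state remote_node="true" id="overcloud-novacompute-2"
                uname="overcloud-novacompute-2"/>
  </status>
</cib>
"""

root = ET.fromstring(CIB)
name = "overcloud-novacompute-2"

# Collect every id under each top-level section.
config_ids = {e.get("id") for e in root.find("configuration").iter()}
status_ids = {e.get("id") for e in root.find("status").iter()}

print(name in config_ids)  # False: nothing configured under this id
print(name in status_ids)  # True: stale status entry still present
```

This is why `cibadmin --create` is rejected with "Name not unique on network" even though nothing with that id exists in the configuration section.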
Re-assigning

This may be related to the fact that pcs does not run crm_node -R when removing remote nodes: https://github.com/feist/pcs/issues/78

We also need to fix the id uniqueness checks in pcs. Currently we search for an id in the whole CIB, including the status section, which is wrong: bz1303136. Maybe pcs should not search for an existing id in the nodes section either? Let me know, thanks.

Agreed, pcs should run "crm_node -R" when removing a node, and that should fix this issue. As for what pcs should check for name collisions, there are three places Pacemaker Remote nodes can show up:

1. The nodes section: not reliable, because nodes have an entry here only if they have ever had a permanent node attribute set.
2. The status section: mostly reliable. Nodes have an entry here as long as they have ever been started.
3. The resources section: mostly reliable. You can check against the ID of any ocf:pacemaker:remote primitive configured, and against the value of the remote-node meta attribute of any resource configured (i.e. guest nodes, usually VirtualDomain resources, though in theory any resource). The only time this is not reliable is the situation described in this bz, i.e. the node has been removed from the configuration but an old status entry is still present.

Bottom line: you could get away with just #2 or #3, but to be completely safe, check all three.

Pacemaker is correct to reject the addition in this case, because the old state info would cause problems if the same ID were reused. You could argue that pacemaker should automatically clear the state info when the node is removed from the configuration, so we should evaluate that possibility at some point.

Created attachment 1183870 [details]
proposed fix
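The three-place collision check described in the comment above can be sketched roughly as follows. This is an illustrative helper using only the Python standard library, not the actual pcs implementation:

```python
import xml.etree.ElementTree as ET

def remote_name_in_use(cib_root, name):
    """Illustrative check of the three places a remote node name can appear."""
    # 1. nodes section (populated only once a permanent attribute is set)
    for node in cib_root.findall(".//nodes/node"):
        if name in (node.get("uname"), node.get("id")):
            return True
    # 2. status section (populated once the node has ever been started)
    for state in cib_root.findall(".//status/node_state"):
        if name in (state.get("uname"), state.get("id")):
            return True
    # 3. resources section: ocf:pacemaker:remote primitives, plus the
    #    remote-node meta attribute of any resource (guest nodes)
    resources = cib_root.find(".//resources")
    if resources is not None:
        for prim in resources.iter("primitive"):
            if (prim.get("provider") == "pacemaker"
                    and prim.get("type") == "remote"
                    and prim.get("id") == name):
                return True
        for nv in resources.iter("nvpair"):
            if nv.get("name") == "remote-node" and nv.get("value") == name:
                return True
    return False

# Tiny demo CIB: the node survives only as a stale status entry (case #2).
demo = ET.fromstring(
    "<cib><configuration><resources/><nodes/></configuration>"
    "<status><node_state remote_node='true' id='rh72-node3'"
    " uname='rh72-node3'/></status></cib>")
print(remote_name_in_use(demo, "rh72-node3"))  # True
```

Under this scheme, the stale status entry in case #2 is exactly why the fix also needs to run crm_node -R on removal: otherwise the name stays "in use" forever.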
Test 1: pacemaker remote resource
[root@rh72-node1:~]# pcs resource create rh72-node3 remote
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource delete rh72-node3
Attempting to stop: rh72-node3...Stopped
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0
Test 2: remote-node attribute, remote-node remove
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs cluster remote-node remove rh72-node3
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0
Test 3: remote-node attribute, resource update
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource update anode meta remote-node=
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0
Test 4: remote-node attribute, resource meta
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource meta anode remote-node=
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0
Test 5: remote-node attribute, resource delete
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource delete anode
Attempting to stop: anode...Stopped
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
# is there to force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command

1) Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 remote
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete vm-rhel72-2
Deleting Resource - vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
a)
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 remote
[vm-rhel72-1 ~] $ pcs resource delete vm-rhel72-2
Deleting Resource - vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
b)
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 remote
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete vm-rhel72-2
Deleting Resource - vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

2) Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs cluster remote-node remove vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs cluster remote-node remove vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs cluster remote-node remove vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

3) Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource update anode meta remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs resource update anode meta remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource update anode meta remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

4) Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource meta anode remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs resource meta anode remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource meta anode remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

5) Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete anode
Deleting Resource - anode
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs resource delete anode
Deleting Resource - anode
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete anode
Deleting Resource - anode
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html