Red Hat Bugzilla – Bug 1329472
Cannot recreate remote node resource
Last modified: 2018-03-23 07:27:32 EDT
Description of problem:
Cannot recreate remote node resource when there is an existing node entry (left over from a previous incarnation).

[root@overcloud-controller-0 heat-admin]# pcs resource disable overcloud-novacompute-2
[root@overcloud-controller-0 heat-admin]# pcs resource delete overcloud-novacompute-2
Attempting to stop: overcloud-novacompute-2...Stopped
[root@overcloud-controller-0 heat-admin]# pcs resource create overcloud-novacompute-2 remote reconnect_interval=240
Error: unable to create resource/fence device 'overcloud-novacompute-2', 'overcloud-novacompute-2' already exists on this system

[root@overcloud-controller-0 heat-admin]# cibadmin -Ql | grep -C 10 overcloud-novacompute-2
      <node id="overcloud-novacompute-0" type="remote" uname="overcloud-novacompute-0">
        <instance_attributes id="nodes-overcloud-novacompute-0">
          <nvpair id="nodes-overcloud-novacompute-0-osprole" name="osprole" value="compute"/>
        </instance_attributes>
      </node>
      <node id="overcloud-novacompute-1" type="remote" uname="overcloud-novacompute-1">
        <instance_attributes id="nodes-overcloud-novacompute-1">
          <nvpair id="nodes-overcloud-novacompute-1-osprole" name="osprole" value="compute"/>
        </instance_attributes>
      </node>
      <node type="remote" id="overcloud-novacompute-2" uname="overcloud-novacompute-2">
        <instance_attributes id="nodes-overcloud-novacompute-2">
          <nvpair id="nodes-overcloud-novacompute-2-osprole" name="osprole" value="compute"/>
        </instance_attributes>
      </node>
    </nodes>
    <resources>
      <primitive class="ocf" id="ip-192.0.2.6" provider="heartbeat" type="IPaddr2">
        <instance_attributes id="ip-192.0.2.6-instance_attributes">
          <nvpair id="ip-192.0.2.6-instance_attributes-ip" name="ip" value="192.0.2.6"/>
          <nvpair id="ip-192.0.2.6-instance_attributes-cidr_netmask" name="cidr_netmask" value="32"/>
        </instance_attributes>
        <operations>

Version-Release number of selected component (if applicable):
pcs-0.9.143-15.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. see above

Actual results:
Error: unable to create resource/fence device 'overcloud-novacompute-2', 'overcloud-novacompute-2' already exists on this system

Expected results:
resource is created

Additional info:
I think pcs' uniqueness checks are being slightly overzealous here. The operation should be allowed to proceed.
Weirder... it seems to be happening at the cib level:

[root@overcloud-controller-0 heat-admin]# cibadmin --create -o resources --xml-text " <primitive class="ocf" id="overcloud-novacompute-2.localdomain" provider="pacemaker" type="remote">
>   <instance_attributes id="overcloud-novacompute-2-instance_attributes">
>     <nvpair id="overcloud-novacompute-2-instance_attributes-reconnect_interval" name="reconnect_interval" value="240"/>
>   </instance_attributes>
>   <operations>
>     <op id="overcloud-novacompute-2-start-interval-0s" interval="0s" name="start" timeout="60"/>
>     <op id="overcloud-novacompute-2-stop-interval-0s" interval="0s" name="stop" timeout="60"/>
>     <op id="overcloud-novacompute-2-monitor-interval-20" interval="20" name="monitor"/>
>   </operations>
> </primitive>
> "
Call cib_create failed (-76): Name not unique on network
<failed>
  <failed_update object_type="primitive" operation="cib_create" reason="Name not unique on network">
    <primitive/>
  </failed_update>
</failed>

[root@overcloud-controller-0 heat-admin]# cibadmin -Ql | grep -C 0 overcloud-novacompute-2
      <node type="remote" id="overcloud-novacompute-2" uname="overcloud-novacompute-2">
        <instance_attributes id="nodes-overcloud-novacompute-2">
          <nvpair id="nodes-overcloud-novacompute-2-osprole" name="osprole" value="compute"/>
--
    <node_state remote_node="true" id="overcloud-novacompute-2" uname="overcloud-novacompute-2" crm-debug-origin="do_update_resource" node_fenced="0">
      <transient_attributes id="overcloud-novacompute-2">
        <instance_attributes id="status-overcloud-novacompute-2"/>
--
      <lrm id="overcloud-novacompute-2">

Re-assigning.
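For illustration only: the collision appears to come from the leftover status-section entries shown above rather than from anything left in the resources section. A sketch of how one might inspect just that leftover state (these cibadmin invocations are assumptions for illustration, not taken from the report):

[root@overcloud-controller-0 heat-admin]# cibadmin --query -o status | grep overcloud-novacompute-2
# or target the stale element directly via XPath
[root@overcloud-controller-0 heat-admin]# cibadmin --query --xpath "//node_state[@uname='overcloud-novacompute-2']"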
This may be related to the fact that pcs does not run "crm_node -R" when removing remote nodes: https://github.com/feist/pcs/issues/78

We also need to fix id uniqueness checks in pcs: currently we search for an id in the whole cib, including the status section, which is wrong: bz1303136

Maybe pcs should not search for an existing id in the nodes section either? Let me know, thanks.
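A minimal sketch of what that extra cleanup step could look like when a remote node is removed (node name taken from the report above; whether --force is needed may vary by crm_node version):

# after the remote node resource has been stopped and deleted,
# drop the node from Pacemaker's caches and status section
crm_node --force --remove overcloud-novacompute-2
# short form: crm_node -R overcloud-novacompute-2 --force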
Agreed, pcs should run "crm_node -R" when removing a node, and that should fix this issue.

As for what pcs should look at for name collisions, there are three places Pacemaker Remote nodes can show up (a rough check covering all three is sketched after this list):

1. The nodes section: not reliable, because nodes will have an entry here only if they have ever had a permanent node attribute set.

2. The status section: mostly reliable. Nodes will have an entry here as long as they have ever been started.

3. The resources section: mostly reliable. You can check against the ID of any ocf:pacemaker:remote primitives configured, and the value of the remote-node attribute of any resource configured (i.e. guest nodes, usually VirtualDomain resources, but in theory any resource). The only time this is not reliable is the situation described in this bz, i.e. the node has been removed from the configuration but an old status entry is still present.

Bottom line: you could get away with just #2 or #3, but to be completely safe, check all three.

Pacemaker is correct in rejecting the addition in this case, because the old state info would cause problems if the same ID were reused. You could argue that pacemaker should automatically clear the state info when the node is removed from the configuration, so we should evaluate that possibility at some point.
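To make the list above concrete, a rough sketch of checking all three sections for a candidate name; the XPath expressions are illustrative assumptions, not the actual pcs implementation:

NAME=overcloud-novacompute-2
# 1. nodes section (only populated once a permanent node attribute has been set)
cibadmin --query --xpath "//nodes/node[@uname='$NAME']"
# 2. status section (populated once the node has ever been started)
cibadmin --query --xpath "//status/node_state[@uname='$NAME']"
# 3. resources section: ocf:pacemaker:remote primitives, plus guest nodes
#    declared via the remote-node meta attribute of any resource
cibadmin --query --xpath "//resources//primitive[@id='$NAME'][@provider='pacemaker'][@type='remote']"
cibadmin --query --xpath "//resources//nvpair[@name='remote-node'][@value='$NAME']"

The name should be considered free to reuse only if none of the queries finds a match.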
Created attachment 1183870 [details]
proposed fix

Test 1: pacemaker remote resource
[root@rh72-node1:~]# pcs resource create rh72-node3 remote
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource delete rh72-node3
Attempting to stop: rh72-node3...Stopped
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0

Test 2: remote-node attribute, remote-node remove
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs cluster remote-node remove rh72-node3
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0

Test 3: remote-node attribute, resource update
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource update anode meta remote-node=
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0

Test 4: remote-node attribute, resource meta
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource meta anode remote-node=
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0

Test 5: remote-node attribute, resource delete
[root@rh72-node1:~]# pcs resource create anode dummy
[root@rh72-node1:~]# pcs cluster remote-node add rh72-node3 anode
# force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command
[root@rh72-node1:~]# pcs node attribute rh72-node3 foo=bar
[root@rh72-node1:~]# pcs resource delete anode
Attempting to stop: anode...Stopped
# before fix this failed
[root@rh72-node1:~]# pcs resource create rh72-node3 dummy
[root@rh72-node1:~]# echo $?
0
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
# is there to force pacemaker to create a record in nodes section
# the fix needs to work both with and without this command

1)
Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 remote
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete vm-rhel72-2
Deleting Resource - vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
a)
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 remote
[vm-rhel72-1 ~] $ pcs resource delete vm-rhel72-2
Deleting Resource - vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
b)
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 remote
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete vm-rhel72-2
Deleting Resource - vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

2)
Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs cluster remote-node remove vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs cluster remote-node remove vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs cluster remote-node remove vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

3)
Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource update anode meta remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs resource update anode meta remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource update anode meta remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

4)
Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource meta anode remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs resource meta anode remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource meta anode remote-node=
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0

5)
Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete anode
Deleting Resource - anode
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
Error: unable to create resource/fence device 'vm-rhel72-2', 'vm-rhel72-2' already exists on this system

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
a)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs resource delete anode
Deleting Resource - anode
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
b)
[vm-rhel72-1 ~] $ pcs resource create anode dummy
[vm-rhel72-1 ~] $ pcs cluster remote-node add vm-rhel72-2 anode
[vm-rhel72-1 ~] $ pcs node standby vm-rhel72-2 && pcs node unstandby vm-rhel72-2
[vm-rhel72-1 ~] $ pcs resource delete anode
Deleting Resource - anode
[vm-rhel72-1 ~] $ pcs resource create vm-rhel72-2 dummy
[vm-rhel72-1 ~] $ echo $?
0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html